Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'for-6.14/block-20250118' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

- NVMe pull requests via Keith:
    - Target support for PCI-Endpoint transport (Damien)
    - TCP IO queue spreading fixes (Sagi, Chaitanya)
    - Target handling for "limited retry" flags (Guixen)
    - Poll type fix (Yongsoo)
    - Xarray storage error handling (Keisuke)
    - Host memory buffer free size fix on error (Francis)

- MD pull requests via Song:
    - Reintroduce md-linear (Yu Kuai)
    - md-bitmap refactor and fix (Yu Kuai)
    - Replace kmap_atomic with kmap_local_page (David Reaver)

- Quite a few queue freeze and debugfs deadlock fixes

  Ming introduced lockdep support for this in the 6.13 kernel, and it
  has (unsurprisingly) uncovered quite a few issues

- Use const attributes for IO schedulers

- Remove bio ioprio wrappers

- Fixes for stacked device atomic write support

- Refactor queue affinity helpers, in preparation for better supporting
isolated CPUs

- Cleanups of loop O_DIRECT handling

- Cleanup of BLK_MQ_F_* flags

- Add rotational support for null_blk

- Various fixes and cleanups

* tag 'for-6.14/block-20250118' of git://git.kernel.dk/linux: (106 commits)
block: Don't trim an atomic write
block: Add common atomic writes enable flag
md/md-linear: Fix a NULL vs IS_ERR() bug in linear_add()
block: limit disk max sectors to (LLONG_MAX >> 9)
block: Change blk_stack_atomic_writes_limits() unit_min check
block: Ensure start sector is aligned for stacking atomic writes
blk-mq: Move more error handling into blk_mq_submit_bio()
block: Reorder the request allocation code in blk_mq_submit_bio()
nvme: fix bogus kzalloc() return check in nvme_init_effects_log()
md/md-bitmap: move bitmap_{start, end}write to md upper layer
md/raid5: implement pers->bitmap_sector()
md: add a new callback pers->bitmap_sector()
md/md-bitmap: remove the last parameter for bimtap_ops->endwrite()
md/md-bitmap: factor behind write counters out from bitmap_{start/end}write()
md: Replace deprecated kmap_atomic() with kmap_local_page()
md: reintroduce md-linear
partitions: ldm: remove the initial kernel-doc notation
blk-cgroup: rwstat: fix kernel-doc warnings in header file
blk-cgroup: fix kernel-doc warnings in header file
nbd: fix partial sending
...

+5357 -1418
+1
Documentation/PCI/endpoint/index.rst
···
   pci-ntb-howto
   pci-vntb-function
   pci-vntb-howto
 + pci-nvme-function
 
   function/binding/pci-test
   function/binding/pci-ntb
+13
Documentation/PCI/endpoint/pci-nvme-function.rst
··· (new file)
.. SPDX-License-Identifier: GPL-2.0

=================
PCI NVMe Function
=================

:Author: Damien Le Moal <dlemoal@kernel.org>

The PCI NVMe endpoint function implements a PCI NVMe controller using the NVMe
subsystem target core code. The driver for this function resides with the NVMe
subsystem as drivers/nvme/target/nvmet-pciep.c.

See Documentation/nvme/nvme-pci-endpoint-target.rst for more details.
+12
Documentation/nvme/index.rst
··· (new file)
.. SPDX-License-Identifier: GPL-2.0

==============
NVMe Subsystem
==============

.. toctree::
   :maxdepth: 2
   :numbered:

   feature-and-quirk-policy
   nvme-pci-endpoint-target
+368
Documentation/nvme/nvme-pci-endpoint-target.rst
··· (new file)
.. SPDX-License-Identifier: GPL-2.0

=================================
NVMe PCI Endpoint Function Target
=================================

:Author: Damien Le Moal <dlemoal@kernel.org>

The NVMe PCI endpoint function target driver implements an NVMe PCIe controller
using an NVMe fabrics target controller configured with the PCI transport type.

Overview
========

The NVMe PCI endpoint function target driver allows exposing an NVMe target
controller over a PCIe link, thus implementing an NVMe PCIe device similar to a
regular M.2 SSD. The target controller is created in the same manner as when
using NVMe over fabrics: the controller represents the interface to an NVMe
subsystem using a port. The port transfer type must be configured to be
"pci". The subsystem can be configured to have namespaces backed by regular
files or block devices, or can use NVMe passthrough to expose to the PCI host an
existing physical NVMe device or an NVMe fabrics host controller (e.g. an NVMe
TCP host controller).

The NVMe PCI endpoint function target driver relies as much as possible on the
NVMe target core code to parse and execute NVMe commands submitted by the PCIe
host. However, using the PCI endpoint framework API and DMA API, the driver is
also responsible for managing all data transfers over the PCIe link. This
implies that the NVMe PCI endpoint function target driver implements several
NVMe data structure management functions as well as some NVMe command parsing.

1) The driver manages retrieval of NVMe commands in submission queues using DMA
   if supported, or MMIO otherwise. Each command retrieved is then executed
   using a work item to maximize performance with the parallel execution of
   multiple commands on different CPUs.
   The driver uses a work item to
   constantly poll the doorbell of all submission queues to detect command
   submissions from the PCIe host.

2) The driver transfers completion queue entries of completed commands to the
   PCIe host using MMIO copy of the entries into the host completion queue.
   After posting completion entries in a completion queue, the driver uses the
   PCI endpoint framework API to raise an interrupt to the host to signal
   command completion.

3) For any command that has a data buffer, the NVMe PCI endpoint target driver
   parses the command PRP or SGL lists to create a list of PCI address
   segments representing the mapping of the command data buffer on the host.
   The command data buffer is transferred over the PCIe link using this list of
   PCI address segments using DMA, if supported. If DMA is not supported, MMIO
   is used, which results in poor performance. For write commands, the command
   data buffer is transferred from the host into a local memory buffer before
   executing the command using the target core code. For read commands, a local
   memory buffer is allocated to execute the command and the content of that
   buffer is transferred to the host once the command completes.

Controller Capabilities
-----------------------

The NVMe capabilities exposed to the PCIe host through the BAR 0 registers
are almost identical to the capabilities of the NVMe target controller
implemented by the target core code, with some exceptions:

1) The NVMe PCI endpoint target driver always sets the controller capability
   CQR bit to request "Contiguous Queues Required". This is to facilitate the
   mapping of a queue PCI address range to the local CPU address space.
2) The doorbell stride (DSTRD) is always set to 4B.

3) Since the PCI endpoint framework does not provide a way to handle PCI level
   resets, the controller capability NSSR bit (NVM Subsystem Reset Supported)
   is always cleared.

4) The boot partition support (BPS), Persistent Memory Region Supported (PMRS)
   and Controller Memory Buffer Supported (CMBS) capabilities are never
   reported.

Supported Features
------------------

The NVMe PCI endpoint target driver implements support for both PRPs and SGLs.
The driver also implements IRQ vector coalescing and submission queue
arbitration burst.

The maximum number of queues and the maximum data transfer size (MDTS) are
configurable through configfs before starting the controller. To avoid issues
with excessive local memory usage for executing commands, MDTS defaults to 512
KB and is limited to a maximum of 2 MB (arbitrary limit).

Minimum Number of PCI Address Mapping Windows Required
------------------------------------------------------

Most PCI endpoint controllers provide a limited number of mapping windows for
mapping a PCI address range to local CPU memory addresses. The NVMe PCI
endpoint target controller uses mapping windows for the following:

1) One memory window for raising MSI or MSI-X interrupts
2) One memory window for MMIO transfers
3) One memory window for each completion queue

Given the highly asynchronous nature of the NVMe PCI endpoint target driver
operation, the memory windows described above will generally not be used
simultaneously, but that may happen. So a safe maximum number of completion
queues that can be supported is equal to the total number of memory mapping
windows of the PCI endpoint controller minus two. E.g.
for an endpoint PCI
controller with 32 outbound memory windows available, up to 30 completion
queues can be safely operated without any risk of getting PCI address mapping
errors due to the lack of memory windows.

Maximum Number of Queue Pairs
-----------------------------

Upon binding of the NVMe PCI endpoint target driver to the PCI endpoint
controller, BAR 0 is allocated with enough space to accommodate the admin queue
and multiple I/O queues. The maximum number of I/O queue pairs that can be
supported is limited by several factors.

1) The NVMe target core code limits the maximum number of I/O queues to the
   number of online CPUs.
2) The total number of queue pairs, including the admin queue, cannot exceed
   the number of MSI-X or MSI vectors available.
3) The total number of completion queues must not exceed the total number of
   PCI mapping windows minus 2 (see above).

The NVMe endpoint function driver allows configuring the maximum number of
queue pairs through configfs.

Limitations and NVMe Specification Non-Compliance
-------------------------------------------------

Similar to the NVMe target core code, the NVMe PCI endpoint target driver does
not support multiple submission queues using the same completion queue. All
submission queues must specify a unique completion queue.


User Guide
==========

This section describes the hardware requirements and how to set up an NVMe PCI
endpoint target device.

Kernel Requirements
-------------------

The kernel must be compiled with the configuration options CONFIG_PCI_ENDPOINT,
CONFIG_PCI_ENDPOINT_CONFIGFS, and CONFIG_NVME_TARGET_PCI_EPF enabled.
CONFIG_PCI, CONFIG_BLK_DEV_NVME and CONFIG_NVME_TARGET must also be enabled
(obviously).
In addition to this, at least one PCI endpoint controller driver should be
available for the endpoint hardware used.

To facilitate testing, enabling the null-blk driver (CONFIG_BLK_DEV_NULL_BLK)
is also recommended. With this, a simple setup using a null_blk block device
as a subsystem namespace can be used.

Hardware Requirements
---------------------

To use the NVMe PCI endpoint target driver, at least one endpoint controller
device is required.

To find the list of endpoint controller devices in the system::

        # ls /sys/class/pci_epc/
        a40000000.pcie-ep

If PCI_ENDPOINT_CONFIGFS is enabled::

        # ls /sys/kernel/config/pci_ep/controllers
        a40000000.pcie-ep

The endpoint board must of course also be connected to a host with a PCI cable
with RX-TX signal swapped. If the host PCI slot used does not have
plug-and-play capabilities, the host should be powered off when the NVMe PCI
endpoint device is configured.

NVMe Endpoint Device
--------------------

Creating an NVMe endpoint device is a two-step process. First, an NVMe target
subsystem and port must be defined. Second, the NVMe PCI endpoint device must
be set up and bound to the subsystem and port created.

Creating an NVMe Subsystem and Port
-----------------------------------

Details about how to configure an NVMe target subsystem and port are outside
the scope of this document. The following only provides a simple example of a
port and subsystem with a single namespace backed by a null_blk device.

First, make sure that configfs is enabled::

        # mount -t configfs none /sys/kernel/config

Next, create a null_blk device (default settings give a 250 GB device without
memory backing).
The block device created will be /dev/nullb0 by default::

        # modprobe null_blk
        # ls /dev/nullb0
        /dev/nullb0

The NVMe PCI endpoint function target driver must be loaded::

        # modprobe nvmet_pci_epf
        # lsmod | grep nvmet
        nvmet_pci_epf          32768  0
        nvmet                 118784  1 nvmet_pci_epf
        nvme_core              131072  2 nvmet_pci_epf,nvmet

Now, create a subsystem and a port that we will use to create a PCI target
controller when setting up the NVMe PCI endpoint target device. In this
example, the port is created with a maximum of 4 I/O queue pairs::

        # cd /sys/kernel/config/nvmet/subsystems
        # mkdir nvmepf.0.nqn
        # echo -n "Linux-pci-epf" > nvmepf.0.nqn/attr_model
        # echo "0x1b96" > nvmepf.0.nqn/attr_vendor_id
        # echo "0x1b96" > nvmepf.0.nqn/attr_subsys_vendor_id
        # echo 1 > nvmepf.0.nqn/attr_allow_any_host
        # echo 4 > nvmepf.0.nqn/attr_qid_max

Next, create and enable the subsystem namespace using the null_blk block
device::

        # mkdir nvmepf.0.nqn/namespaces/1
        # echo -n "/dev/nullb0" > nvmepf.0.nqn/namespaces/1/device_path
        # echo 1 > "nvmepf.0.nqn/namespaces/1/enable"

Finally, create the target port and link it to the subsystem::

        # cd /sys/kernel/config/nvmet/ports
        # mkdir 1
        # echo -n "pci" > 1/addr_trtype
        # ln -s /sys/kernel/config/nvmet/subsystems/nvmepf.0.nqn \
                /sys/kernel/config/nvmet/ports/1/subsystems/nvmepf.0.nqn

Creating an NVMe PCI Endpoint Device
------------------------------------

With the NVMe target subsystem and port ready for use, the NVMe PCI endpoint
device can now be created and enabled.
The NVMe PCI endpoint target driver
should already be loaded (that is done automatically when the port is
created)::

        # ls /sys/kernel/config/pci_ep/functions
        nvmet_pci_epf

Next, create function 0::

        # cd /sys/kernel/config/pci_ep/functions/nvmet_pci_epf
        # mkdir nvmepf.0
        # ls nvmepf.0/
        baseclass_code    msix_interrupts   secondary
        cache_line_size   nvme              subclass_code
        deviceid          primary           subsys_id
        interrupt_pin     progif_code       subsys_vendor_id
        msi_interrupts    revid             vendorid

Configure the function using any device ID (the vendor ID for the device will
be automatically set to the same value as the NVMe target subsystem vendor
ID)::

        # cd /sys/kernel/config/pci_ep/functions/nvmet_pci_epf
        # echo 0xBEEF > nvmepf.0/deviceid
        # echo 32 > nvmepf.0/msix_interrupts

If the PCI endpoint controller used does not support MSI-X, MSI can be
configured instead::

        # echo 32 > nvmepf.0/msi_interrupts

Next, let's bind our endpoint device with the target subsystem and port that we
created::

        # echo 1 > nvmepf.0/nvme/portid
        # echo "nvmepf.0.nqn" > nvmepf.0/nvme/subsysnqn

The endpoint function can then be bound to the endpoint controller and the
controller started::

        # cd /sys/kernel/config/pci_ep
        # ln -s functions/nvmet_pci_epf/nvmepf.0 controllers/a40000000.pcie-ep/
        # echo 1 > controllers/a40000000.pcie-ep/start

On the endpoint machine, kernel messages will show information as the NVMe
target device and endpoint device are created and connected.

.. code-block:: text

        null_blk: disk nullb0 created
        null_blk: module loaded
        nvmet: adding nsid 1 to subsystem nvmepf.0.nqn
        nvmet_pci_epf nvmet_pci_epf.0: PCI endpoint controller supports MSI-X, 32 vectors
        nvmet: Created nvm controller 1 for subsystem nvmepf.0.nqn for NQN nqn.2014-08.org.nvmexpress:uuid:2ab90791-2246-4fbb-961d-4c3d5a5a0176.
        nvmet_pci_epf nvmet_pci_epf.0: New PCI ctrl "nvmepf.0.nqn", 4 I/O queues, mdts 524288 B

PCI Root-Complex Host
---------------------

Booting the PCI host will result in the initialization of the PCIe link (this
may be signaled by the PCI endpoint driver with a kernel message). A kernel
message on the endpoint will also signal when the host NVMe driver enables the
device controller::

        nvmet_pci_epf nvmet_pci_epf.0: Enabling controller

On the host side, the NVMe PCI endpoint function target device will be
discoverable as a PCI device, with the vendor ID and device ID as configured::

        # lspci -n
        0000:01:00.0 0108: 1b96:beef

And this device will be recognized as an NVMe device with a single namespace::

        # lsblk
        NAME     MAJ:MIN  RM  SIZE  RO  TYPE  MOUNTPOINTS
        nvme0n1  259:0    0   250G  0   disk

The NVMe endpoint block device can then be used as any other regular NVMe
namespace block device. The *nvme* command line utility can be used to get more
detailed information about the endpoint device::

        # nvme id-ctrl /dev/nvme0
        NVME Identify Controller:
        vid       : 0x1b96
        ssvid     : 0x1b96
        sn        : 94993c85650ef7bcd625
        mn        : Linux-pci-epf
        fr        : 6.13.0-r
        rab       : 6
        ieee      : 000000
        cmic      : 0xb
        mdts      : 7
        cntlid    : 0x1
        ver       : 0x20100
        ...


Endpoint Bindings
=================

The NVMe PCI endpoint target driver uses the PCI endpoint configfs device
attributes as follows.
================ ===========================================================
vendorid         Ignored (the vendor ID of the NVMe target subsystem is used)
deviceid         Anything is OK (e.g. PCI_ANY_ID)
revid            Do not care
progif_code      Must be 0x02 (NVM Express)
baseclass_code   Must be 0x01 (PCI_BASE_CLASS_STORAGE)
subclass_code    Must be 0x08 (Non-Volatile Memory controller)
cache_line_size  Do not care
subsys_vendor_id Ignored (the subsystem vendor ID of the NVMe target subsystem
                 is used)
subsys_id        Anything is OK (e.g. PCI_ANY_ID)
msi_interrupts   At least equal to the number of queue pairs desired
msix_interrupts  At least equal to the number of queue pairs desired
interrupt_pin    Interrupt PIN to use if MSI and MSI-X are not supported
================ ===========================================================

The NVMe PCI endpoint target function also has some specific configurable
fields defined in the *nvme* subdirectory of the function directory. These
fields are as follows.

================ ===========================================================
mdts_kb          Maximum data transfer size in KiB (default: 512)
portid           The ID of the target port to use
subsysnqn        The NQN of the target subsystem to use
================ ===========================================================
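Taken together, the three constraints listed under "Maximum Number of Queue Pairs" reduce to a simple minimum. A small illustrative sketch of that calculation (the function, its name, and its exact accounting of the admin queue are assumptions for illustration, not driver code):

```python
def max_io_queue_pairs(online_cpus: int, irq_vectors: int,
                       mapping_windows: int) -> int:
    """Illustrative model of the I/O queue pair limits described above."""
    # 1) the target core caps I/O queues at the number of online CPUs
    by_cpus = online_cpus
    # 2) queue pairs including the admin queue cannot exceed the MSI(-X) vectors
    by_vectors = irq_vectors - 1
    # 3) total completion queues (the admin CQ is assumed to count here) cannot
    #    exceed the PCI mapping windows minus two (IRQ window + MMIO window)
    by_windows = mapping_windows - 2 - 1
    return max(0, min(by_cpus, by_vectors, by_windows))
```

For the example setup above (4 queue pairs requested, 32 MSI-X vectors, an endpoint controller with 32 outbound windows), none of these limits is the bottleneck on a typical multi-core machine.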
+1
Documentation/subsystem-apis.rst
···
   cdrom/index
   scsi/index
   target/index
 + nvme/index
 
 Other subsystems
 ----------------
-1
arch/um/drivers/ubd_kern.c
···
 	ubd_dev->tag_set.ops = &ubd_mq_ops;
 	ubd_dev->tag_set.queue_depth = 64;
 	ubd_dev->tag_set.numa_node = NUMA_NO_NODE;
-	ubd_dev->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
 	ubd_dev->tag_set.driver_data = ubd_dev;
 	ubd_dev->tag_set.nr_hw_queues = 1;
-2
block/Makefile
···
 obj-$(CONFIG_IOSCHED_BFQ)	+= bfq.o
 
 obj-$(CONFIG_BLK_DEV_INTEGRITY)	+= bio-integrity.o blk-integrity.o t10-pi.o
-obj-$(CONFIG_BLK_MQ_PCI)	+= blk-mq-pci.o
-obj-$(CONFIG_BLK_MQ_VIRTIO)	+= blk-mq-virtio.o
 obj-$(CONFIG_BLK_DEV_ZONED)	+= blk-zoned.o
 obj-$(CONFIG_BLK_WBT)		+= blk-wbt.o
 obj-$(CONFIG_BLK_DEBUG_FS)	+= blk-mq-debugfs.o
+1 -1
block/bfq-iosched.c
···
 #define BFQ_ATTR(name) \
 	__ATTR(name, 0644, bfq_##name##_show, bfq_##name##_store)
 
-static struct elv_fs_entry bfq_attrs[] = {
+static const struct elv_fs_entry bfq_attrs[] = {
 	BFQ_ATTR(fifo_expire_sync),
 	BFQ_ATTR(fifo_expire_async),
 	BFQ_ATTR(back_seek_max),
+9 -102
block/bio.c
···
 
 /*
  * Try to merge a page into a segment, while obeying the hardware segment
- * size limit. This is not for normal read/write bios, but for passthrough
- * or Zone Append operations that we can't split.
+ * size limit.
+ *
+ * This is kept around for the integrity metadata, which is still tries
+ * to build the initial bio to the hardware limit and doesn't have proper
+ * helpers to split. Hopefully this will go away soon.
  */
 bool bvec_try_merge_hw_page(struct request_queue *q, struct bio_vec *bv,
 		struct page *page, unsigned len, unsigned offset,
···
 		return false;
 	return bvec_try_merge_page(bv, page, len, offset, same_page);
 }
-
-/**
- * bio_add_hw_page - attempt to add a page to a bio with hw constraints
- * @q: the target queue
- * @bio: destination bio
- * @page: page to add
- * @len: vec entry length
- * @offset: vec entry offset
- * @max_sectors: maximum number of sectors that can be added
- * @same_page: return if the segment has been merged inside the same page
- *
- * Add a page to a bio while respecting the hardware max_sectors, max_segment
- * and gap limitations.
- */
-int bio_add_hw_page(struct request_queue *q, struct bio *bio,
-		struct page *page, unsigned int len, unsigned int offset,
-		unsigned int max_sectors, bool *same_page)
-{
-	unsigned int max_size = max_sectors << SECTOR_SHIFT;
-
-	if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)))
-		return 0;
-
-	len = min3(len, max_size, queue_max_segment_size(q));
-	if (len > max_size - bio->bi_iter.bi_size)
-		return 0;
-
-	if (bio->bi_vcnt > 0) {
-		struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
-
-		if (bvec_try_merge_hw_page(q, bv, page, len, offset,
-				same_page)) {
-			bio->bi_iter.bi_size += len;
-			return len;
-		}
-
-		if (bio->bi_vcnt >=
-		    min(bio->bi_max_vecs, queue_max_segments(q)))
-			return 0;
-
-		/*
-		 * If the queue doesn't support SG gaps and adding this segment
-		 * would create a gap, disallow it.
-		 */
-		if (bvec_gap_to_prev(&q->limits, bv, offset))
-			return 0;
-	}
-
-	bvec_set_page(&bio->bi_io_vec[bio->bi_vcnt], page, len, offset);
-	bio->bi_vcnt++;
-	bio->bi_iter.bi_size += len;
-	return len;
-}
-
-/**
- * bio_add_hw_folio - attempt to add a folio to a bio with hw constraints
- * @q: the target queue
- * @bio: destination bio
- * @folio: folio to add
- * @len: vec entry length
- * @offset: vec entry offset in the folio
- * @max_sectors: maximum number of sectors that can be added
- * @same_page: return if the segment has been merged inside the same folio
- *
- * Add a folio to a bio while respecting the hardware max_sectors, max_segment
- * and gap limitations.
- */
-int bio_add_hw_folio(struct request_queue *q, struct bio *bio,
-		struct folio *folio, size_t len, size_t offset,
-		unsigned int max_sectors, bool *same_page)
-{
-	if (len > UINT_MAX || offset > UINT_MAX)
-		return 0;
-	return bio_add_hw_page(q, bio, folio_page(folio, 0), len, offset,
-			       max_sectors, same_page);
-}
-
-/**
- * bio_add_pc_page - attempt to add page to passthrough bio
- * @q: the target queue
- * @bio: destination bio
- * @page: page to add
- * @len: vec entry length
- * @offset: vec entry offset
- *
- * Attempt to add a page to the bio_vec maplist. This can fail for a
- * number of reasons, such as the bio being full or target block device
- * limitations. The target block device must allow bio's up to PAGE_SIZE,
- * so it is always possible to add a single page to an empty bio.
- *
- * This should only be used by passthrough bios.
- */
-int bio_add_pc_page(struct request_queue *q, struct bio *bio,
-		struct page *page, unsigned int len, unsigned int offset)
-{
-	bool same_page = false;
-	return bio_add_hw_page(q, bio, page, len, offset,
-			       queue_max_hw_sectors(q), &same_page);
-}
-EXPORT_SYMBOL(bio_add_pc_page);
 
 /**
  * __bio_add_page - add page(s) to a bio in a new segment
···
  */
 void bio_trim(struct bio *bio, sector_t offset, sector_t size)
 {
+	/* We should never trim an atomic write */
+	if (WARN_ON_ONCE(bio->bi_opf & REQ_ATOMIC && size))
+		return;
+
 	if (WARN_ON_ONCE(offset > BIO_MAX_SECTORS || size > BIO_MAX_SECTORS ||
 			 offset + size > bio_sectors(bio)))
 		return;
+3 -2
block/blk-cgroup-rwstat.h
···
 /**
  * blkg_rwstat_add - add a value to a blkg_rwstat
  * @rwstat: target blkg_rwstat
- * @op: REQ_OP and flags
+ * @opf: REQ_OP and flags
  * @val: value to add
  *
  * Add @val to @rwstat. The counters are chosen according to @rw. The
···
 /**
  * blkg_rwstat_read - read the current values of a blkg_rwstat
  * @rwstat: blkg_rwstat to read
+ * @result: where to put the current values
  *
- * Read the current snapshot of @rwstat and return it in the aux counts.
+ * Read the current snapshot of @rwstat and return it in the @result counts.
  */
 static inline void blkg_rwstat_read(struct blkg_rwstat *rwstat,
 		struct blkg_rwstat_sample *result)
+6 -4
block/blk-cgroup.h
···
 
 /**
  * bio_issue_as_root_blkg - see if this bio needs to be issued as root blkg
- * @return: true if this bio needs to be submitted with the root blkg context.
+ * @bio: the target &bio
+ *
+ * Return: true if this bio needs to be submitted with the root blkg context.
  *
  * In order to avoid priority inversions we sometimes need to issue a bio as if
  * it were attached to the root blkg, and then backcharge to the actual owning
···
  * @q: request_queue of interest
  *
  * Lookup blkg for the @blkcg - @q pair.
- 
+ *
  * Must be called in a RCU critical section.
  */
 static inline struct blkcg_gq *blkg_lookup(struct blkcg *blkcg,
···
 }
 
 /**
- * blkg_to_pdata - get policy private data
+ * blkg_to_pd - get policy private data
  * @blkg: blkg of interest
  * @pol: policy of interest
  *
···
 }
 
 /**
- * pdata_to_blkg - get blkg associated with policy private data
+ * pd_to_blkg - get blkg associated with policy private data
  * @pd: policy private data of interest
  *
  * @pd is policy private data. Determine the blkg it's associated with.
+11 -10
block/blk-core.c
···
 		blk_mq_submit_bio(bio);
 	} else if (likely(bio_queue_enter(bio) == 0)) {
 		struct gendisk *disk = bio->bi_bdev->bd_disk;
-
-		disk->fops->submit_bio(bio);
+
+		if ((bio->bi_opf & REQ_POLLED) &&
+		    !(disk->queue->limits.features & BLK_FEAT_POLL)) {
+			bio->bi_status = BLK_STS_NOTSUPP;
+			bio_endio(bio);
+		} else {
+			disk->fops->submit_bio(bio);
+		}
 		blk_queue_exit(disk->queue);
 	}
 
···
 		}
 	}
 
-	if (!(q->limits.features & BLK_FEAT_POLL) &&
-	    (bio->bi_opf & REQ_POLLED)) {
-		bio_clear_polled(bio);
-		goto not_supported;
-	}
-
 	switch (bio_op(bio)) {
 	case REQ_OP_READ:
 		break;
···
 		return 0;
 
 	q = bdev_get_queue(bdev);
-	if (cookie == BLK_QC_T_NONE || !(q->limits.features & BLK_FEAT_POLL))
+	if (cookie == BLK_QC_T_NONE)
 		return 0;
 
 	blk_flush_plug(current->plug, false);
···
 	} else {
 		struct gendisk *disk = q->disk;
 
-		if (disk && disk->fops->poll_bio)
+		if ((q->limits.features & BLK_FEAT_POLL) && disk &&
+		    disk->fops->poll_bio)
 			ret = disk->fops->poll_bio(bio, iob, flags);
 	}
 	blk_queue_exit(q);
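The blk-core.c hunks above move the REQ_POLLED capability check out of the generic submission checks: a polled bio dispatched to a bio-based driver whose queue lacks BLK_FEAT_POLL is now completed with BLK_STS_NOTSUPP at dispatch time, instead of being caught earlier in the common path. A toy Python model of the new decision (the flag values and the helper function are illustrative, not kernel API):

```python
# Illustrative flag values; only the control flow mirrors the diff above.
REQ_POLLED = 1 << 0      # bio requests polled completion
BLK_FEAT_POLL = 1 << 0   # queue advertises poll support

def dispatch_bio_based(bio_opf: int, queue_features: int) -> str:
    """Model of the bio-based dispatch branch after this change: a polled
    bio on a queue without BLK_FEAT_POLL is failed with BLK_STS_NOTSUPP
    rather than being handled by the generic submission checks."""
    if (bio_opf & REQ_POLLED) and not (queue_features & BLK_FEAT_POLL):
        return "BLK_STS_NOTSUPP"   # bio_endio() without calling ->submit_bio()
    return "submitted"             # disk->fops->submit_bio(bio)
```

For blk-mq queues this check is unnecessary, which is why the diff also drops the feature test from the cookie-based poll path and keeps it only in front of ->poll_bio().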
+1 -3
block/blk-integrity.c
···
 	else
 		lim.integrity.flags |= flag;
 
-	blk_mq_freeze_queue(q);
-	err = queue_limits_commit_update(q, &lim);
-	blk_mq_unfreeze_queue(q);
+	err = queue_limits_commit_update_frozen(q, &lim);
 	if (err)
 		return err;
 	return count;
+35 -93
block/blk-map.c
···
 	}
 }
 
-	if (bio_add_pc_page(rq->q, bio, page, bytes, offset) < bytes) {
+	if (bio_add_page(bio, page, bytes, offset) < bytes) {
 		if (!map_data)
 			__free_page(page);
 		break;
···
 static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 		gfp_t gfp_mask)
 {
-	iov_iter_extraction_t extraction_flags = 0;
-	unsigned int max_sectors = queue_max_hw_sectors(rq->q);
 	unsigned int nr_vecs = iov_iter_npages(iter, BIO_MAX_VECS);
 	struct bio *bio;
 	int ret;
-	int j;
 
 	if (!iov_iter_count(iter))
 		return -EINVAL;
 
 	bio = blk_rq_map_bio_alloc(rq, nr_vecs, gfp_mask);
-	if (bio == NULL)
+	if (!bio)
 		return -ENOMEM;
-
-	if (blk_queue_pci_p2pdma(rq->q))
-		extraction_flags |= ITER_ALLOW_P2PDMA;
-	if (iov_iter_extract_will_pin(iter))
-		bio_set_flag(bio, BIO_PAGE_PINNED);
-
-	while (iov_iter_count(iter)) {
-		struct page *stack_pages[UIO_FASTIOV];
-		struct page **pages = stack_pages;
-		ssize_t bytes;
-		size_t offs;
-		int npages;
-
-		if (nr_vecs > ARRAY_SIZE(stack_pages))
-			pages = NULL;
-
-		bytes = iov_iter_extract_pages(iter, &pages, LONG_MAX,
-				nr_vecs, extraction_flags, &offs);
-		if (unlikely(bytes <= 0)) {
-			ret = bytes ? bytes : -EFAULT;
-			goto out_unmap;
-		}
-
-		npages = DIV_ROUND_UP(offs + bytes, PAGE_SIZE);
-
-		if (unlikely(offs & queue_dma_alignment(rq->q)))
-			j = 0;
-		else {
-			for (j = 0; j < npages; j++) {
-				struct page *page = pages[j];
-				unsigned int n = PAGE_SIZE - offs;
-				bool same_page = false;
-
-				if (n > bytes)
-					n = bytes;
-
-				if (!bio_add_hw_page(rq->q, bio, page, n, offs,
-						     max_sectors, &same_page))
-					break;
-
-				if (same_page)
-					bio_release_page(bio, page);
-				bytes -= n;
-				offs = 0;
-			}
-		}
-		/*
-		 * release the pages we didn't map into the bio, if any
-		 */
-		while (j < npages)
-			bio_release_page(bio, pages[j++]);
-		if (pages != stack_pages)
-			kvfree(pages);
-		/* couldn't stuff something into bio? */
-		if (bytes) {
-			iov_iter_revert(iter, bytes);
-			break;
-		}
-	}
-
+	ret = bio_iov_iter_get_pages(bio, iter);
+	if (ret)
+		goto out_put;
 	ret = blk_rq_append_bio(rq, bio);
 	if (ret)
-		goto out_unmap;
+		goto out_release;
 	return 0;
 
-out_unmap:
+out_release:
 	bio_release_pages(bio, false);
+out_put:
 	blk_mq_map_bio_put(bio);
 	return ret;
 }
···
 		page = virt_to_page(data);
 	else
 		page = vmalloc_to_page(data);
-	if (bio_add_pc_page(q, bio, page, bytes,
-			offset) < bytes) {
+	if (bio_add_page(bio, page, bytes, offset) < bytes) {
 		/* we don't support partial mappings */
 		bio_uninit(bio);
 		kfree(bio);
···
 	if (!reading)
 		memcpy(page_address(page), p, bytes);
 
-	if (bio_add_pc_page(q, bio, page, bytes, 0) < bytes)
+	if (bio_add_page(bio, page, bytes, 0) < bytes)
 		break;
 
 	len -= bytes;
···
  */
 int blk_rq_append_bio(struct request *rq, struct bio *bio)
 {
-	struct bvec_iter iter;
-	struct bio_vec bv;
+	const struct queue_limits *lim = &rq->q->limits;
+	unsigned int max_bytes = lim->max_hw_sectors << SECTOR_SHIFT;
 	unsigned int nr_segs = 0;
+	int ret;
 
-	bio_for_each_bvec(bv, bio, iter)
-		nr_segs++;
+	/* check that the data layout matches the hardware restrictions */
+	ret = bio_split_rw_at(bio, lim, &nr_segs, max_bytes);
+	if (ret) {
+		/* if we would have to split the bio, copy instead */
+		if (ret > 0)
+			ret = -EREMOTEIO;
+		return ret;
+	}
 
-	if (!rq->bio) {
-		blk_rq_bio_prep(rq, bio, nr_segs);
-	} else {
+	if (rq->bio) {
 		if (!ll_back_merge_fn(rq, bio, nr_segs))
 			return -EINVAL;
 		rq->biotail->bi_next = bio;
 		rq->biotail = bio;
-		rq->__data_len += (bio)->bi_iter.bi_size;
+		rq->__data_len += bio->bi_iter.bi_size;
 		bio_crypt_free_ctx(bio);
+		return 0;
 	}
 
+	rq->nr_phys_segments = nr_segs;
+	rq->bio = rq->biotail = bio;
+	rq->__data_len = bio->bi_iter.bi_size;
 	return 0;
 }
 EXPORT_SYMBOL(blk_rq_append_bio);
···
 /* Prepare bio for passthrough IO given ITER_BVEC iter */
 static int blk_rq_map_user_bvec(struct request *rq, const struct iov_iter *iter)
 {
-	const struct queue_limits *lim = &rq->q->limits;
-	unsigned int max_bytes = lim->max_hw_sectors << SECTOR_SHIFT;
-	unsigned int nsegs;
+	unsigned int max_bytes = rq->q->limits.max_hw_sectors << SECTOR_SHIFT;
 	struct bio *bio;
 	int ret;
···
 		return -ENOMEM;
 	bio_iov_bvec_set(bio, iter);
 
-	/* check that the data layout matches the hardware restrictions */
-	ret = bio_split_rw_at(bio, lim, &nsegs, max_bytes);
-	if (ret) {
-		/* if we would have to split the bio, copy instead */
-		if (ret > 0)
-			ret = -EREMOTEIO;
+	ret = blk_rq_append_bio(rq, bio);
+	if (ret)
 		blk_mq_map_bio_put(bio);
-		return ret;
-	}
-
- blk_rq_bio_prep(rq, bio, nsegs); 537 - return 0; 582 + return ret; 538 583 } 539 584 540 585 /** ··· 583 644 ret = bio_copy_user_iov(rq, map_data, &i, gfp_mask); 584 645 else 585 646 ret = bio_map_user_iov(rq, &i, gfp_mask); 586 - if (ret) 647 + if (ret) { 648 + if (ret == -EREMOTEIO) 649 + ret = -EINVAL; 587 650 goto unmap_rq; 651 + } 588 652 if (!bio) 589 653 bio = rq->bio; 590 654 } while (iov_iter_count(&i));
+70 -107
block/blk-merge.c
··· 473 473 return nr_phys_segs; 474 474 } 475 475 476 + struct phys_vec { 477 + phys_addr_t paddr; 478 + u32 len; 479 + }; 480 + 481 + static bool blk_map_iter_next(struct request *req, 482 + struct req_iterator *iter, struct phys_vec *vec) 483 + { 484 + unsigned int max_size; 485 + struct bio_vec bv; 486 + 487 + if (req->rq_flags & RQF_SPECIAL_PAYLOAD) { 488 + if (!iter->bio) 489 + return false; 490 + vec->paddr = bvec_phys(&req->special_vec); 491 + vec->len = req->special_vec.bv_len; 492 + iter->bio = NULL; 493 + return true; 494 + } 495 + 496 + if (!iter->iter.bi_size) 497 + return false; 498 + 499 + bv = mp_bvec_iter_bvec(iter->bio->bi_io_vec, iter->iter); 500 + vec->paddr = bvec_phys(&bv); 501 + max_size = get_max_segment_size(&req->q->limits, vec->paddr, UINT_MAX); 502 + bv.bv_len = min(bv.bv_len, max_size); 503 + bio_advance_iter_single(iter->bio, &iter->iter, bv.bv_len); 504 + 505 + /* 506 + * If we are entirely done with this bi_io_vec entry, check if the next 507 + * one could be merged into it. This typically happens when moving to 508 + * the next bio, but some callers also don't pack bvecs tight. 
509 + */ 510 + while (!iter->iter.bi_size || !iter->iter.bi_bvec_done) { 511 + struct bio_vec next; 512 + 513 + if (!iter->iter.bi_size) { 514 + if (!iter->bio->bi_next) 515 + break; 516 + iter->bio = iter->bio->bi_next; 517 + iter->iter = iter->bio->bi_iter; 518 + } 519 + 520 + next = mp_bvec_iter_bvec(iter->bio->bi_io_vec, iter->iter); 521 + if (bv.bv_len + next.bv_len > max_size || 522 + !biovec_phys_mergeable(req->q, &bv, &next)) 523 + break; 524 + 525 + bv.bv_len += next.bv_len; 526 + bio_advance_iter_single(iter->bio, &iter->iter, next.bv_len); 527 + } 528 + 529 + vec->len = bv.bv_len; 530 + return true; 531 + } 532 + 476 533 static inline struct scatterlist *blk_next_sg(struct scatterlist **sg, 477 534 struct scatterlist *sglist) 478 535 { ··· 547 490 return sg_next(*sg); 548 491 } 549 492 550 - static unsigned blk_bvec_map_sg(struct request_queue *q, 551 - struct bio_vec *bvec, struct scatterlist *sglist, 552 - struct scatterlist **sg) 553 - { 554 - unsigned nbytes = bvec->bv_len; 555 - unsigned nsegs = 0, total = 0; 556 - 557 - while (nbytes > 0) { 558 - unsigned offset = bvec->bv_offset + total; 559 - unsigned len = get_max_segment_size(&q->limits, 560 - bvec_phys(bvec) + total, nbytes); 561 - struct page *page = bvec->bv_page; 562 - 563 - /* 564 - * Unfortunately a fair number of drivers barf on scatterlists 565 - * that have an offset larger than PAGE_SIZE, despite other 566 - * subsystems dealing with that invariant just fine. For now 567 - * stick to the legacy format where we never present those from 568 - * the block layer, but the code below should be removed once 569 - * these offenders (mostly MMC/SD drivers) are fixed. 
570 - */ 571 - page += (offset >> PAGE_SHIFT); 572 - offset &= ~PAGE_MASK; 573 - 574 - *sg = blk_next_sg(sg, sglist); 575 - sg_set_page(*sg, page, len, offset); 576 - 577 - total += len; 578 - nbytes -= len; 579 - nsegs++; 580 - } 581 - 582 - return nsegs; 583 - } 584 - 585 - static inline int __blk_bvec_map_sg(struct bio_vec bv, 586 - struct scatterlist *sglist, struct scatterlist **sg) 587 - { 588 - *sg = blk_next_sg(sg, sglist); 589 - sg_set_page(*sg, bv.bv_page, bv.bv_len, bv.bv_offset); 590 - return 1; 591 - } 592 - 593 - /* only try to merge bvecs into one sg if they are from two bios */ 594 - static inline bool 595 - __blk_segment_map_sg_merge(struct request_queue *q, struct bio_vec *bvec, 596 - struct bio_vec *bvprv, struct scatterlist **sg) 597 - { 598 - 599 - int nbytes = bvec->bv_len; 600 - 601 - if (!*sg) 602 - return false; 603 - 604 - if ((*sg)->length + nbytes > queue_max_segment_size(q)) 605 - return false; 606 - 607 - if (!biovec_phys_mergeable(q, bvprv, bvec)) 608 - return false; 609 - 610 - (*sg)->length += nbytes; 611 - 612 - return true; 613 - } 614 - 615 - static int __blk_bios_map_sg(struct request_queue *q, struct bio *bio, 616 - struct scatterlist *sglist, 617 - struct scatterlist **sg) 618 - { 619 - struct bio_vec bvec, bvprv = { NULL }; 620 - struct bvec_iter iter; 621 - int nsegs = 0; 622 - bool new_bio = false; 623 - 624 - for_each_bio(bio) { 625 - bio_for_each_bvec(bvec, bio, iter) { 626 - /* 627 - * Only try to merge bvecs from two bios given we 628 - * have done bio internal merge when adding pages 629 - * to bio 630 - */ 631 - if (new_bio && 632 - __blk_segment_map_sg_merge(q, &bvec, &bvprv, sg)) 633 - goto next_bvec; 634 - 635 - if (bvec.bv_offset + bvec.bv_len <= PAGE_SIZE) 636 - nsegs += __blk_bvec_map_sg(bvec, sglist, sg); 637 - else 638 - nsegs += blk_bvec_map_sg(q, &bvec, sglist, sg); 639 - next_bvec: 640 - new_bio = false; 641 - } 642 - if (likely(bio->bi_iter.bi_size)) { 643 - bvprv = bvec; 644 - new_bio = true; 645 - } 646 
- } 647 - 648 - return nsegs; 649 - } 650 - 651 493 /* 652 - * map a request to scatterlist, return number of sg entries setup. Caller 653 - * must make sure sg can hold rq->nr_phys_segments entries 494 + * Map a request to scatterlist, return number of sg entries setup. Caller 495 + * must make sure sg can hold rq->nr_phys_segments entries. 654 496 */ 655 497 int __blk_rq_map_sg(struct request_queue *q, struct request *rq, 656 498 struct scatterlist *sglist, struct scatterlist **last_sg) 657 499 { 500 + struct req_iterator iter = { 501 + .bio = rq->bio, 502 + .iter = rq->bio->bi_iter, 503 + }; 504 + struct phys_vec vec; 658 505 int nsegs = 0; 659 506 660 - if (rq->rq_flags & RQF_SPECIAL_PAYLOAD) 661 - nsegs = __blk_bvec_map_sg(rq->special_vec, sglist, last_sg); 662 - else if (rq->bio) 663 - nsegs = __blk_bios_map_sg(q, rq->bio, sglist, last_sg); 507 + while (blk_map_iter_next(rq, &iter, &vec)) { 508 + *last_sg = blk_next_sg(last_sg, sglist); 509 + sg_set_page(*last_sg, phys_to_page(vec.paddr), vec.len, 510 + offset_in_page(vec.paddr)); 511 + nsegs++; 512 + } 664 513 665 514 if (*last_sg) 666 515 sg_mark_end(*last_sg);
+37
block/blk-mq-cpumap.c
··· 11 11 #include <linux/smp.h> 12 12 #include <linux/cpu.h> 13 13 #include <linux/group_cpus.h> 14 + #include <linux/device/bus.h> 14 15 15 16 #include "blk.h" 16 17 #include "blk-mq.h" ··· 55 54 56 55 return NUMA_NO_NODE; 57 56 } 57 + 58 + /** 59 + * blk_mq_map_hw_queues - Create CPU to hardware queue mapping 60 + * @qmap: CPU to hardware queue map 61 + * @dev: The device to map queues 62 + * @offset: Queue offset to use for the device 63 + * 64 + * Create a CPU to hardware queue mapping in @qmap. The struct bus_type 65 + * irq_get_affinity callback will be used to retrieve the affinity. 66 + */ 67 + void blk_mq_map_hw_queues(struct blk_mq_queue_map *qmap, 68 + struct device *dev, unsigned int offset) 69 + 70 + { 71 + const struct cpumask *mask; 72 + unsigned int queue, cpu; 73 + 74 + if (!dev->bus->irq_get_affinity) 75 + goto fallback; 76 + 77 + for (queue = 0; queue < qmap->nr_queues; queue++) { 78 + mask = dev->bus->irq_get_affinity(dev, queue + offset); 79 + if (!mask) 80 + goto fallback; 81 + 82 + for_each_cpu(cpu, mask) 83 + qmap->mq_map[cpu] = qmap->queue_offset + queue; 84 + } 85 + 86 + return; 87 + 88 + fallback: 89 + WARN_ON_ONCE(qmap->nr_queues > 1); 90 + blk_mq_clear_mq_map(qmap); 91 + } 92 + EXPORT_SYMBOL_GPL(blk_mq_map_hw_queues);
+4 -23
block/blk-mq-debugfs.c
··· 172 172 return 0; 173 173 } 174 174 175 - #define BLK_TAG_ALLOC_NAME(name) [BLK_TAG_ALLOC_##name] = #name 176 - static const char *const alloc_policy_name[] = { 177 - BLK_TAG_ALLOC_NAME(FIFO), 178 - BLK_TAG_ALLOC_NAME(RR), 179 - }; 180 - #undef BLK_TAG_ALLOC_NAME 181 - 182 175 #define HCTX_FLAG_NAME(name) [ilog2(BLK_MQ_F_##name)] = #name 183 176 static const char *const hctx_flag_name[] = { 184 - HCTX_FLAG_NAME(SHOULD_MERGE), 185 177 HCTX_FLAG_NAME(TAG_QUEUE_SHARED), 186 178 HCTX_FLAG_NAME(STACKING), 187 179 HCTX_FLAG_NAME(TAG_HCTX_SHARED), 188 180 HCTX_FLAG_NAME(BLOCKING), 189 - HCTX_FLAG_NAME(NO_SCHED), 181 + HCTX_FLAG_NAME(TAG_RR), 190 182 HCTX_FLAG_NAME(NO_SCHED_BY_DEFAULT), 191 183 }; 192 184 #undef HCTX_FLAG_NAME ··· 186 194 static int hctx_flags_show(void *data, struct seq_file *m) 187 195 { 188 196 struct blk_mq_hw_ctx *hctx = data; 189 - const int alloc_policy = BLK_MQ_FLAG_TO_ALLOC_POLICY(hctx->flags); 190 197 191 - BUILD_BUG_ON(ARRAY_SIZE(hctx_flag_name) != 192 - BLK_MQ_F_ALLOC_POLICY_START_BIT); 193 - BUILD_BUG_ON(ARRAY_SIZE(alloc_policy_name) != BLK_TAG_ALLOC_MAX); 198 + BUILD_BUG_ON(ARRAY_SIZE(hctx_flag_name) != ilog2(BLK_MQ_F_MAX)); 194 199 195 - seq_puts(m, "alloc_policy="); 196 - if (alloc_policy < ARRAY_SIZE(alloc_policy_name) && 197 - alloc_policy_name[alloc_policy]) 198 - seq_puts(m, alloc_policy_name[alloc_policy]); 199 - else 200 - seq_printf(m, "%d", alloc_policy); 201 - seq_puts(m, " "); 202 - blk_flags_show(m, 203 - hctx->flags ^ BLK_ALLOC_POLICY_TO_MQ_FLAG(alloc_policy), 204 - hctx_flag_name, ARRAY_SIZE(hctx_flag_name)); 200 + blk_flags_show(m, hctx->flags, hctx_flag_name, 201 + ARRAY_SIZE(hctx_flag_name)); 205 202 seq_puts(m, "\n"); 206 203 return 0; 207 204 }
-46
block/blk-mq-pci.c
··· 1 - // SPDX-License-Identifier: GPL-2.0 2 - /* 3 - * Copyright (c) 2016 Christoph Hellwig. 4 - */ 5 - #include <linux/kobject.h> 6 - #include <linux/blkdev.h> 7 - #include <linux/blk-mq-pci.h> 8 - #include <linux/pci.h> 9 - #include <linux/module.h> 10 - 11 - #include "blk-mq.h" 12 - 13 - /** 14 - * blk_mq_pci_map_queues - provide a default queue mapping for PCI device 15 - * @qmap: CPU to hardware queue map. 16 - * @pdev: PCI device associated with @set. 17 - * @offset: Offset to use for the pci irq vector 18 - * 19 - * This function assumes the PCI device @pdev has at least as many available 20 - * interrupt vectors as @set has queues. It will then query the vector 21 - * corresponding to each queue for it's affinity mask and built queue mapping 22 - * that maps a queue to the CPUs that have irq affinity for the corresponding 23 - * vector. 24 - */ 25 - void blk_mq_pci_map_queues(struct blk_mq_queue_map *qmap, struct pci_dev *pdev, 26 - int offset) 27 - { 28 - const struct cpumask *mask; 29 - unsigned int queue, cpu; 30 - 31 - for (queue = 0; queue < qmap->nr_queues; queue++) { 32 - mask = pci_irq_get_affinity(pdev, queue + offset); 33 - if (!mask) 34 - goto fallback; 35 - 36 - for_each_cpu(cpu, mask) 37 - qmap->mq_map[cpu] = qmap->queue_offset + queue; 38 - } 39 - 40 - return; 41 - 42 - fallback: 43 - WARN_ON_ONCE(qmap->nr_queues > 1); 44 - blk_mq_clear_mq_map(qmap); 45 - } 46 - EXPORT_SYMBOL_GPL(blk_mq_pci_map_queues);
+1 -2
block/blk-mq-sched.c
··· 351 351 ctx = blk_mq_get_ctx(q); 352 352 hctx = blk_mq_map_queue(q, bio->bi_opf, ctx); 353 353 type = hctx->type; 354 - if (!(hctx->flags & BLK_MQ_F_SHOULD_MERGE) || 355 - list_empty_careful(&ctx->rq_lists[type])) 354 + if (list_empty_careful(&ctx->rq_lists[type])) 356 355 goto out_put; 357 356 358 357 /* default per sw-queue merge */
+13 -28
block/blk-mq-tag.c
··· 544 544 node); 545 545 } 546 546 547 - int blk_mq_init_bitmaps(struct sbitmap_queue *bitmap_tags, 548 - struct sbitmap_queue *breserved_tags, 549 - unsigned int queue_depth, unsigned int reserved, 550 - int node, int alloc_policy) 551 - { 552 - unsigned int depth = queue_depth - reserved; 553 - bool round_robin = alloc_policy == BLK_TAG_ALLOC_RR; 554 - 555 - if (bt_alloc(bitmap_tags, depth, round_robin, node)) 556 - return -ENOMEM; 557 - if (bt_alloc(breserved_tags, reserved, round_robin, node)) 558 - goto free_bitmap_tags; 559 - 560 - return 0; 561 - 562 - free_bitmap_tags: 563 - sbitmap_queue_free(bitmap_tags); 564 - return -ENOMEM; 565 - } 566 - 567 547 struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags, 568 - unsigned int reserved_tags, 569 - int node, int alloc_policy) 548 + unsigned int reserved_tags, unsigned int flags, int node) 570 549 { 550 + unsigned int depth = total_tags - reserved_tags; 551 + bool round_robin = flags & BLK_MQ_F_TAG_RR; 571 552 struct blk_mq_tags *tags; 572 553 573 554 if (total_tags > BLK_MQ_TAG_MAX) { ··· 563 582 tags->nr_tags = total_tags; 564 583 tags->nr_reserved_tags = reserved_tags; 565 584 spin_lock_init(&tags->lock); 585 + if (bt_alloc(&tags->bitmap_tags, depth, round_robin, node)) 586 + goto out_free_tags; 587 + if (bt_alloc(&tags->breserved_tags, reserved_tags, round_robin, node)) 588 + goto out_free_bitmap_tags; 566 589 567 - if (blk_mq_init_bitmaps(&tags->bitmap_tags, &tags->breserved_tags, 568 - total_tags, reserved_tags, node, 569 - alloc_policy) < 0) { 570 - kfree(tags); 571 - return NULL; 572 - } 573 590 return tags; 591 + 592 + out_free_bitmap_tags: 593 + sbitmap_queue_free(&tags->bitmap_tags); 594 + out_free_tags: 595 + kfree(tags); 596 + return NULL; 574 597 } 575 598 576 599 void blk_mq_free_tags(struct blk_mq_tags *tags)
-46
block/blk-mq-virtio.c
··· 1 - // SPDX-License-Identifier: GPL-2.0 2 - /* 3 - * Copyright (c) 2016 Christoph Hellwig. 4 - */ 5 - #include <linux/device.h> 6 - #include <linux/blk-mq-virtio.h> 7 - #include <linux/virtio_config.h> 8 - #include <linux/module.h> 9 - #include "blk-mq.h" 10 - 11 - /** 12 - * blk_mq_virtio_map_queues - provide a default queue mapping for virtio device 13 - * @qmap: CPU to hardware queue map. 14 - * @vdev: virtio device to provide a mapping for. 15 - * @first_vec: first interrupt vectors to use for queues (usually 0) 16 - * 17 - * This function assumes the virtio device @vdev has at least as many available 18 - * interrupt vectors as @set has queues. It will then query the vector 19 - * corresponding to each queue for it's affinity mask and built queue mapping 20 - * that maps a queue to the CPUs that have irq affinity for the corresponding 21 - * vector. 22 - */ 23 - void blk_mq_virtio_map_queues(struct blk_mq_queue_map *qmap, 24 - struct virtio_device *vdev, int first_vec) 25 - { 26 - const struct cpumask *mask; 27 - unsigned int queue, cpu; 28 - 29 - if (!vdev->config->get_vq_affinity) 30 - goto fallback; 31 - 32 - for (queue = 0; queue < qmap->nr_queues; queue++) { 33 - mask = vdev->config->get_vq_affinity(vdev, first_vec + queue); 34 - if (!mask) 35 - goto fallback; 36 - 37 - for_each_cpu(cpu, mask) 38 - qmap->mq_map[cpu] = qmap->queue_offset + queue; 39 - } 40 - 41 - return; 42 - 43 - fallback: 44 - blk_mq_map_queues(qmap); 45 - } 46 - EXPORT_SYMBOL_GPL(blk_mq_virtio_map_queues);
+34 -39
block/blk-mq.c
··· 131 131 if (!q->mq_freeze_depth) { 132 132 q->mq_freeze_owner = owner; 133 133 q->mq_freeze_owner_depth = 1; 134 + q->mq_freeze_disk_dead = !q->disk || 135 + test_bit(GD_DEAD, &q->disk->state) || 136 + !blk_queue_registered(q); 137 + q->mq_freeze_queue_dying = blk_queue_dying(q); 134 138 return true; 135 139 } 136 140 ··· 146 142 /* verify the last unfreeze in owner context */ 147 143 static bool blk_unfreeze_check_owner(struct request_queue *q) 148 144 { 149 - if (!q->mq_freeze_owner) 150 - return false; 151 145 if (q->mq_freeze_owner != current) 152 146 return false; 153 147 if (--q->mq_freeze_owner_depth == 0) { ··· 191 189 void blk_freeze_queue_start(struct request_queue *q) 192 190 { 193 191 if (__blk_freeze_queue_start(q, current)) 194 - blk_freeze_acquire_lock(q, false, false); 192 + blk_freeze_acquire_lock(q); 195 193 } 196 194 EXPORT_SYMBOL_GPL(blk_freeze_queue_start); 197 195 ··· 239 237 void blk_mq_unfreeze_queue(struct request_queue *q) 240 238 { 241 239 if (__blk_mq_unfreeze_queue(q, false)) 242 - blk_unfreeze_release_lock(q, false, false); 240 + blk_unfreeze_release_lock(q); 243 241 } 244 242 EXPORT_SYMBOL_GPL(blk_mq_unfreeze_queue); 245 243 ··· 2658 2656 if (bio->bi_opf & REQ_RAHEAD) 2659 2657 rq->cmd_flags |= REQ_FAILFAST_MASK; 2660 2658 2659 + rq->bio = rq->biotail = bio; 2661 2660 rq->__sector = bio->bi_iter.bi_sector; 2662 - blk_rq_bio_prep(rq, bio, nr_segs); 2661 + rq->__data_len = bio->bi_iter.bi_size; 2662 + rq->nr_phys_segments = nr_segs; 2663 2663 if (bio_integrity(bio)) 2664 2664 rq->nr_integrity_segments = blk_rq_count_integrity_sg(rq->q, 2665 2665 bio); ··· 2984 2980 } 2985 2981 2986 2982 rq = __blk_mq_alloc_requests(&data); 2987 - if (rq) 2988 - return rq; 2989 - rq_qos_cleanup(q, bio); 2990 - if (bio->bi_opf & REQ_NOWAIT) 2991 - bio_wouldblock_error(bio); 2992 - return NULL; 2983 + if (unlikely(!rq)) 2984 + rq_qos_cleanup(q, bio); 2985 + return rq; 2993 2986 } 2994 2987 2995 2988 /* ··· 3093 3092 } 3094 3093 3095 3094 /* 3096 - * 
Device reconfiguration may change logical block size, so alignment 3097 - * check has to be done with queue usage counter held 3095 + * Device reconfiguration may change logical block size or reduce the 3096 + * number of poll queues, so the checks for alignment and poll support 3097 + * have to be done with queue usage counter held. 3098 3098 */ 3099 3099 if (unlikely(bio_unaligned(bio, q))) { 3100 3100 bio_io_error(bio); 3101 + goto queue_exit; 3102 + } 3103 + 3104 + if ((bio->bi_opf & REQ_POLLED) && !blk_mq_can_poll(q)) { 3105 + bio->bi_status = BLK_STS_NOTSUPP; 3106 + bio_endio(bio); 3101 3107 goto queue_exit; 3102 3108 } 3103 3109 ··· 3122 3114 goto queue_exit; 3123 3115 3124 3116 new_request: 3125 - if (!rq) { 3126 - rq = blk_mq_get_new_requests(q, plug, bio, nr_segs); 3127 - if (unlikely(!rq)) 3128 - goto queue_exit; 3129 - } else { 3117 + if (rq) { 3130 3118 blk_mq_use_cached_rq(rq, plug, bio); 3119 + } else { 3120 + rq = blk_mq_get_new_requests(q, plug, bio, nr_segs); 3121 + if (unlikely(!rq)) { 3122 + if (bio->bi_opf & REQ_NOWAIT) 3123 + bio_wouldblock_error(bio); 3124 + goto queue_exit; 3125 + } 3131 3126 } 3132 3127 3133 3128 trace_block_getrq(bio); ··· 3483 3472 if (node == NUMA_NO_NODE) 3484 3473 node = set->numa_node; 3485 3474 3486 - tags = blk_mq_init_tags(nr_tags, reserved_tags, node, 3487 - BLK_MQ_FLAG_TO_ALLOC_POLICY(set->flags)); 3475 + tags = blk_mq_init_tags(nr_tags, reserved_tags, set->flags, node); 3488 3476 if (!tags) 3489 3477 return NULL; 3490 3478 ··· 4327 4317 blk_mq_sysfs_deinit(q); 4328 4318 } 4329 4319 4330 - static bool blk_mq_can_poll(struct blk_mq_tag_set *set) 4331 - { 4332 - return set->nr_maps > HCTX_TYPE_POLL && 4333 - set->map[HCTX_TYPE_POLL].nr_queues; 4334 - } 4335 - 4336 4320 struct request_queue *blk_mq_alloc_queue(struct blk_mq_tag_set *set, 4337 4321 struct queue_limits *lim, void *queuedata) 4338 4322 { ··· 4337 4333 if (!lim) 4338 4334 lim = &default_lim; 4339 4335 lim->features |= BLK_FEAT_IO_STAT | BLK_FEAT_NOWAIT; 
4340 - if (blk_mq_can_poll(set)) 4336 + if (set->nr_maps > HCTX_TYPE_POLL) 4341 4337 lim->features |= BLK_FEAT_POLL; 4342 4338 4343 4339 q = blk_alloc_queue(lim, set->numa_node); ··· 5025 5021 fallback: 5026 5022 blk_mq_update_queue_map(set); 5027 5023 list_for_each_entry(q, &set->tag_list, tag_set_list) { 5028 - struct queue_limits lim; 5029 - 5030 5024 blk_mq_realloc_hw_ctxs(set, q); 5031 5025 5032 5026 if (q->nr_hw_queues != set->nr_hw_queues) { ··· 5038 5036 set->nr_hw_queues = prev_nr_hw_queues; 5039 5037 goto fallback; 5040 5038 } 5041 - lim = queue_limits_start_update(q); 5042 - if (blk_mq_can_poll(set)) 5043 - lim.features |= BLK_FEAT_POLL; 5044 - else 5045 - lim.features &= ~BLK_FEAT_POLL; 5046 - if (queue_limits_commit_update(q, &lim) < 0) 5047 - pr_warn("updating the poll flag failed\n"); 5048 5039 blk_mq_map_swqueue(q); 5049 5040 } 5050 5041 ··· 5097 5102 int blk_mq_poll(struct request_queue *q, blk_qc_t cookie, 5098 5103 struct io_comp_batch *iob, unsigned int flags) 5099 5104 { 5100 - struct blk_mq_hw_ctx *hctx = xa_load(&q->hctx_table, cookie); 5101 - 5102 - return blk_hctx_poll(q, hctx, iob, flags); 5105 + if (!blk_mq_can_poll(q)) 5106 + return 0; 5107 + return blk_hctx_poll(q, xa_load(&q->hctx_table, cookie), iob, flags); 5103 5108 } 5104 5109 5105 5110 int blk_rq_poll(struct request *rq, struct io_comp_batch *iob,
+7 -4
block/blk-mq.h
··· 163 163 }; 164 164 165 165 struct blk_mq_tags *blk_mq_init_tags(unsigned int nr_tags, 166 - unsigned int reserved_tags, int node, int alloc_policy); 166 + unsigned int reserved_tags, unsigned int flags, int node); 167 167 void blk_mq_free_tags(struct blk_mq_tags *tags); 168 - int blk_mq_init_bitmaps(struct sbitmap_queue *bitmap_tags, 169 - struct sbitmap_queue *breserved_tags, unsigned int queue_depth, 170 - unsigned int reserved, int node, int alloc_policy); 171 168 172 169 unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data); 173 170 unsigned long blk_mq_get_tags(struct blk_mq_alloc_data *data, int nr_tags, ··· 447 450 448 451 #define blk_mq_run_dispatch_ops(q, dispatch_ops) \ 449 452 __blk_mq_run_dispatch_ops(q, true, dispatch_ops) \ 453 + 454 + static inline bool blk_mq_can_poll(struct request_queue *q) 455 + { 456 + return (q->limits.features & BLK_FEAT_POLL) && 457 + q->tag_set->map[HCTX_TYPE_POLL].nr_queues; 458 + } 450 459 451 460 #endif
+36 -6
block/blk-settings.c
··· 175 175 { 176 176 unsigned int boundary_sectors; 177 177 178 + if (!(lim->features & BLK_FEAT_ATOMIC_WRITES)) 179 + goto unsupported; 180 + 178 181 if (!lim->atomic_write_hw_max) 179 182 goto unsupported; 180 183 ··· 416 413 * @lim: limits to apply 417 414 * 418 415 * Apply the limits in @lim that were obtained from queue_limits_start_update() 419 - * and updated by the caller to @q. 416 + * and updated by the caller to @q. The caller must have frozen the queue or 417 + * ensure that there are no outstanding I/Os by other means. 420 418 * 421 419 * Returns 0 if successful, else a negative error code. 422 420 */ ··· 446 442 return error; 447 443 } 448 444 EXPORT_SYMBOL_GPL(queue_limits_commit_update); 445 + 446 + /** 447 + * queue_limits_commit_update_frozen - commit an atomic update of queue limits 448 + * @q: queue to update 449 + * @lim: limits to apply 450 + * 451 + * Apply the limits in @lim that were obtained from queue_limits_start_update() 452 + * and updated with the new values by the caller to @q. Freezes the queue 453 + * before the update and unfreezes it after. 454 + * 455 + * Returns 0 if successful, else a negative error code. 
456 + */ 457 + int queue_limits_commit_update_frozen(struct request_queue *q, 458 + struct queue_limits *lim) 459 + { 460 + int ret; 461 + 462 + blk_mq_freeze_queue(q); 463 + ret = queue_limits_commit_update(q, lim); 464 + blk_mq_unfreeze_queue(q); 465 + 466 + return ret; 467 + } 468 + EXPORT_SYMBOL_GPL(queue_limits_commit_update_frozen); 449 469 450 470 /** 451 471 * queue_limits_set - apply queue limits to queue ··· 612 584 } 613 585 614 586 static void blk_stack_atomic_writes_limits(struct queue_limits *t, 615 - struct queue_limits *b) 587 + struct queue_limits *b, sector_t start) 616 588 { 617 - if (!(t->features & BLK_FEAT_ATOMIC_WRITES_STACKED)) 589 + if (!(b->features & BLK_FEAT_ATOMIC_WRITES)) 618 590 goto unsupported; 619 591 620 - if (!b->atomic_write_unit_min) 592 + if (!b->atomic_write_hw_unit_min) 593 + goto unsupported; 594 + 595 + if (!blk_atomic_write_start_sect_aligned(start, b)) 621 596 goto unsupported; 622 597 623 598 /* ··· 642 611 t->atomic_write_hw_unit_max = 0; 643 612 t->atomic_write_hw_unit_min = 0; 644 613 t->atomic_write_hw_boundary = 0; 645 - t->features &= ~BLK_FEAT_ATOMIC_WRITES_STACKED; 646 614 } 647 615 648 616 /** ··· 804 774 t->zone_write_granularity = 0; 805 775 t->max_zone_append_sectors = 0; 806 776 } 807 - blk_stack_atomic_writes_limits(t, b); 777 + blk_stack_atomic_writes_limits(t, b, start); 808 778 809 779 return ret; 810 780 }
+73 -67
block/blk-sysfs.c
··· 24 24 struct attribute attr; 25 25 ssize_t (*show)(struct gendisk *disk, char *page); 26 26 ssize_t (*store)(struct gendisk *disk, const char *page, size_t count); 27 + int (*store_limit)(struct gendisk *disk, const char *page, 28 + size_t count, struct queue_limits *lim); 27 29 void (*load_module)(struct gendisk *disk, const char *page, size_t count); 28 30 }; 29 31 ··· 155 153 QUEUE_SYSFS_SHOW_CONST(write_same_max, 0) 156 154 QUEUE_SYSFS_SHOW_CONST(poll_delay, -1) 157 155 158 - static ssize_t queue_max_discard_sectors_store(struct gendisk *disk, 159 - const char *page, size_t count) 156 + static int queue_max_discard_sectors_store(struct gendisk *disk, 157 + const char *page, size_t count, struct queue_limits *lim) 160 158 { 161 159 unsigned long max_discard_bytes; 162 - struct queue_limits lim; 163 160 ssize_t ret; 164 - int err; 165 161 166 162 ret = queue_var_store(&max_discard_bytes, page, count); 167 163 if (ret < 0) ··· 171 171 if ((max_discard_bytes >> SECTOR_SHIFT) > UINT_MAX) 172 172 return -EINVAL; 173 173 174 - lim = queue_limits_start_update(disk->queue); 175 - lim.max_user_discard_sectors = max_discard_bytes >> SECTOR_SHIFT; 176 - err = queue_limits_commit_update(disk->queue, &lim); 177 - if (err) 178 - return err; 179 - return ret; 174 + lim->max_user_discard_sectors = max_discard_bytes >> SECTOR_SHIFT; 175 + return 0; 180 176 } 181 177 182 - static ssize_t 183 - queue_max_sectors_store(struct gendisk *disk, const char *page, size_t count) 178 + static int 179 + queue_max_sectors_store(struct gendisk *disk, const char *page, size_t count, 180 + struct queue_limits *lim) 184 181 { 185 182 unsigned long max_sectors_kb; 186 - struct queue_limits lim; 187 183 ssize_t ret; 188 - int err; 189 184 190 185 ret = queue_var_store(&max_sectors_kb, page, count); 191 186 if (ret < 0) 192 187 return ret; 193 188 194 - lim = queue_limits_start_update(disk->queue); 195 - lim.max_user_sectors = max_sectors_kb << 1; 196 - err = 
queue_limits_commit_update(disk->queue, &lim); 197 - if (err) 198 - return err; 199 - return ret; 189 + lim->max_user_sectors = max_sectors_kb << 1; 190 + return 0; 200 191 } 201 192 202 193 static ssize_t queue_feature_store(struct gendisk *disk, const char *page, 203 - size_t count, blk_features_t feature) 194 + size_t count, struct queue_limits *lim, blk_features_t feature) 204 195 { 205 - struct queue_limits lim; 206 196 unsigned long val; 207 197 ssize_t ret; 208 198 ··· 200 210 if (ret < 0) 201 211 return ret; 202 212 203 - lim = queue_limits_start_update(disk->queue); 204 213 if (val) 205 - lim.features |= feature; 214 + lim->features |= feature; 206 215 else 207 - lim.features &= ~feature; 208 - ret = queue_limits_commit_update(disk->queue, &lim); 209 - if (ret) 210 - return ret; 211 - return count; 216 + lim->features &= ~feature; 217 + return 0; 212 218 } 213 219 214 220 #define QUEUE_SYSFS_FEATURE(_name, _feature) \ ··· 213 227 return sysfs_emit(page, "%u\n", \ 214 228 !!(disk->queue->limits.features & _feature)); \ 215 229 } \ 216 - static ssize_t queue_##_name##_store(struct gendisk *disk, \ 217 - const char *page, size_t count) \ 230 + static int queue_##_name##_store(struct gendisk *disk, \ 231 + const char *page, size_t count, struct queue_limits *lim) \ 218 232 { \ 219 - return queue_feature_store(disk, page, count, _feature); \ 233 + return queue_feature_store(disk, page, count, lim, _feature); \ 220 234 } 221 235 222 236 QUEUE_SYSFS_FEATURE(rotational, BLK_FEAT_ROTATIONAL) ··· 231 245 !!(disk->queue->limits.features & _feature)); \ 232 246 } 233 247 234 - QUEUE_SYSFS_FEATURE_SHOW(poll, BLK_FEAT_POLL); 235 248 QUEUE_SYSFS_FEATURE_SHOW(fua, BLK_FEAT_FUA); 236 249 QUEUE_SYSFS_FEATURE_SHOW(dax, BLK_FEAT_DAX); 250 + 251 + static ssize_t queue_poll_show(struct gendisk *disk, char *page) 252 + { 253 + if (queue_is_mq(disk->queue)) 254 + return sysfs_emit(page, "%u\n", blk_mq_can_poll(disk->queue)); 255 + return sysfs_emit(page, "%u\n", 256 + 
!!(disk->queue->limits.features & BLK_FEAT_POLL)); 257 + } 237 258 238 259 static ssize_t queue_zoned_show(struct gendisk *disk, char *page) 239 260 { ··· 259 266 return queue_var_show(!!blk_queue_passthrough_stat(disk->queue), page); 260 267 } 261 268 262 - static ssize_t queue_iostats_passthrough_store(struct gendisk *disk, 263 - const char *page, size_t count) 269 + static int queue_iostats_passthrough_store(struct gendisk *disk, 270 + const char *page, size_t count, struct queue_limits *lim) 264 271 { 265 - struct queue_limits lim; 266 272 unsigned long ios; 267 273 ssize_t ret; 268 274 ··· 269 277 if (ret < 0) 270 278 return ret; 271 279 272 - lim = queue_limits_start_update(disk->queue); 273 280 if (ios) 274 - lim.flags |= BLK_FLAG_IOSTATS_PASSTHROUGH; 281 + lim->flags |= BLK_FLAG_IOSTATS_PASSTHROUGH; 275 282 else 276 - lim.flags &= ~BLK_FLAG_IOSTATS_PASSTHROUGH; 277 - 278 - ret = queue_limits_commit_update(disk->queue, &lim); 279 - if (ret) 280 - return ret; 281 - 282 - return count; 283 + lim->flags &= ~BLK_FLAG_IOSTATS_PASSTHROUGH; 284 + return 0; 283 285 } 286 + 284 287 static ssize_t queue_nomerges_show(struct gendisk *disk, char *page) 285 288 { 286 289 return queue_var_show((blk_queue_nomerges(disk->queue) << 1) | ··· 378 391 return sysfs_emit(page, "write through\n"); 379 392 } 380 393 381 - static ssize_t queue_wc_store(struct gendisk *disk, const char *page, 382 - size_t count) 394 + static int queue_wc_store(struct gendisk *disk, const char *page, 395 + size_t count, struct queue_limits *lim) 383 396 { 384 - struct queue_limits lim; 385 397 bool disable; 386 - int err; 387 398 388 399 if (!strncmp(page, "write back", 10)) { 389 400 disable = false; ··· 392 407 return -EINVAL; 393 408 } 394 409 395 - lim = queue_limits_start_update(disk->queue); 396 410 if (disable) 397 - lim.flags |= BLK_FLAG_WRITE_CACHE_DISABLED; 411 + lim->flags |= BLK_FLAG_WRITE_CACHE_DISABLED; 398 412 else 399 - lim.flags &= ~BLK_FLAG_WRITE_CACHE_DISABLED; 400 - err = 
queue_limits_commit_update(disk->queue, &lim); 401 - if (err) 402 - return err; 403 - return count; 413 + lim->flags &= ~BLK_FLAG_WRITE_CACHE_DISABLED; 414 + return 0; 404 415 } 405 416 406 417 #define QUEUE_RO_ENTRY(_prefix, _name) \ ··· 412 431 .store = _prefix##_store, \ 413 432 }; 414 433 434 + #define QUEUE_LIM_RW_ENTRY(_prefix, _name) \ 435 + static struct queue_sysfs_entry _prefix##_entry = { \ 436 + .attr = { .name = _name, .mode = 0644 }, \ 437 + .show = _prefix##_show, \ 438 + .store_limit = _prefix##_store, \ 439 + } 440 + 415 441 #define QUEUE_RW_LOAD_MODULE_ENTRY(_prefix, _name) \ 416 442 static struct queue_sysfs_entry _prefix##_entry = { \ 417 443 .attr = { .name = _name, .mode = 0644 }, \ ··· 429 441 430 442 QUEUE_RW_ENTRY(queue_requests, "nr_requests"); 431 443 QUEUE_RW_ENTRY(queue_ra, "read_ahead_kb"); 432 - QUEUE_RW_ENTRY(queue_max_sectors, "max_sectors_kb"); 444 + QUEUE_LIM_RW_ENTRY(queue_max_sectors, "max_sectors_kb"); 433 445 QUEUE_RO_ENTRY(queue_max_hw_sectors, "max_hw_sectors_kb"); 434 446 QUEUE_RO_ENTRY(queue_max_segments, "max_segments"); 435 447 QUEUE_RO_ENTRY(queue_max_integrity_segments, "max_integrity_segments"); ··· 445 457 QUEUE_RO_ENTRY(queue_max_discard_segments, "max_discard_segments"); 446 458 QUEUE_RO_ENTRY(queue_discard_granularity, "discard_granularity"); 447 459 QUEUE_RO_ENTRY(queue_max_hw_discard_sectors, "discard_max_hw_bytes"); 448 - QUEUE_RW_ENTRY(queue_max_discard_sectors, "discard_max_bytes"); 460 + QUEUE_LIM_RW_ENTRY(queue_max_discard_sectors, "discard_max_bytes"); 449 461 QUEUE_RO_ENTRY(queue_discard_zeroes_data, "discard_zeroes_data"); 450 462 451 463 QUEUE_RO_ENTRY(queue_atomic_write_max_sectors, "atomic_write_max_bytes"); ··· 465 477 QUEUE_RO_ENTRY(queue_max_active_zones, "max_active_zones"); 466 478 467 479 QUEUE_RW_ENTRY(queue_nomerges, "nomerges"); 468 - QUEUE_RW_ENTRY(queue_iostats_passthrough, "iostats_passthrough"); 480 + QUEUE_LIM_RW_ENTRY(queue_iostats_passthrough, "iostats_passthrough"); 469 481 
QUEUE_RW_ENTRY(queue_rq_affinity, "rq_affinity"); 470 482 QUEUE_RW_ENTRY(queue_poll, "io_poll"); 471 483 QUEUE_RW_ENTRY(queue_poll_delay, "io_poll_delay"); 472 - QUEUE_RW_ENTRY(queue_wc, "write_cache"); 484 + QUEUE_LIM_RW_ENTRY(queue_wc, "write_cache"); 473 485 QUEUE_RO_ENTRY(queue_fua, "fua"); 474 486 QUEUE_RO_ENTRY(queue_dax, "dax"); 475 487 QUEUE_RW_ENTRY(queue_io_timeout, "io_timeout"); ··· 482 494 .show = queue_logical_block_size_show, 483 495 }; 484 496 485 - QUEUE_RW_ENTRY(queue_rotational, "rotational"); 486 - QUEUE_RW_ENTRY(queue_iostats, "iostats"); 487 - QUEUE_RW_ENTRY(queue_add_random, "add_random"); 488 - QUEUE_RW_ENTRY(queue_stable_writes, "stable_writes"); 497 + QUEUE_LIM_RW_ENTRY(queue_rotational, "rotational"); 498 + QUEUE_LIM_RW_ENTRY(queue_iostats, "iostats"); 499 + QUEUE_LIM_RW_ENTRY(queue_add_random, "add_random"); 500 + QUEUE_LIM_RW_ENTRY(queue_stable_writes, "stable_writes"); 489 501 490 502 #ifdef CONFIG_BLK_WBT 491 503 static ssize_t queue_var_store64(s64 *var, const char *page) ··· 681 693 struct queue_sysfs_entry *entry = to_queue(attr); 682 694 struct gendisk *disk = container_of(kobj, struct gendisk, queue_kobj); 683 695 struct request_queue *q = disk->queue; 696 + unsigned int noio_flag; 684 697 ssize_t res; 685 698 686 - if (!entry->store) 699 + if (!entry->store_limit && !entry->store) 687 700 return -EIO; 688 701 689 702 /* ··· 695 706 if (entry->load_module) 696 707 entry->load_module(disk, page, length); 697 708 698 - blk_mq_freeze_queue(q); 709 + if (entry->store_limit) { 710 + struct queue_limits lim = queue_limits_start_update(q); 711 + 712 + res = entry->store_limit(disk, page, length, &lim); 713 + if (res < 0) { 714 + queue_limits_cancel_update(q); 715 + return res; 716 + } 717 + 718 + res = queue_limits_commit_update_frozen(q, &lim); 719 + if (res) 720 + return res; 721 + return length; 722 + } 723 + 699 724 mutex_lock(&q->sysfs_lock); 725 + blk_mq_freeze_queue(q); 726 + noio_flag = memalloc_noio_save(); 700 727 res = 
entry->store(disk, page, length); 701 - mutex_unlock(&q->sysfs_lock); 728 + memalloc_noio_restore(noio_flag); 702 729 blk_mq_unfreeze_queue(q); 730 + mutex_unlock(&q->sysfs_lock); 703 731 return res; 704 732 } 705 733
+31 -34
block/blk-zoned.c
··· 11 11 */ 12 12 13 13 #include <linux/kernel.h> 14 - #include <linux/module.h> 15 14 #include <linux/blkdev.h> 16 15 #include <linux/blk-mq.h> 17 - #include <linux/mm.h> 18 - #include <linux/vmalloc.h> 19 - #include <linux/sched/mm.h> 20 16 #include <linux/spinlock.h> 21 17 #include <linux/refcount.h> 22 18 #include <linux/mempool.h> ··· 459 463 static inline bool disk_should_remove_zone_wplug(struct gendisk *disk, 460 464 struct blk_zone_wplug *zwplug) 461 465 { 466 + lockdep_assert_held(&zwplug->lock); 467 + 462 468 /* If the zone write plug was already removed, we are done. */ 463 469 if (zwplug->flags & BLK_ZONE_WPLUG_UNHASHED) 464 470 return false; ··· 582 584 bio_clear_flag(bio, BIO_ZONE_WRITE_PLUGGING); 583 585 bio_io_error(bio); 584 586 disk_put_zone_wplug(zwplug); 587 + /* Drop the reference taken by disk_zone_wplug_add_bio(). */ 585 588 blk_queue_exit(q); 586 589 } 587 590 ··· 894 895 break; 895 896 } 896 897 897 - /* 898 - * Drop the extra reference on the queue usage we got when 899 - * plugging the BIO and advance the write pointer offset. 900 - */ 898 + /* Drop the reference taken by disk_zone_wplug_add_bio(). 
*/ 901 899 blk_queue_exit(q); 902 900 zwplug->wp_offset += bio_sectors(bio); 903 901 ··· 912 916 struct bio *bio) 913 917 { 914 918 struct gendisk *disk = bio->bi_bdev->bd_disk; 919 + 920 + lockdep_assert_held(&zwplug->lock); 915 921 916 922 /* 917 923 * If we lost track of the zone write pointer due to a write error, ··· 1444 1446 unsigned int nr_seq_zones, nr_conv_zones; 1445 1447 unsigned int pool_size; 1446 1448 struct queue_limits lim; 1447 - int ret; 1448 1449 1449 1450 disk->nr_zones = args->nr_zones; 1450 1451 disk->zone_capacity = args->zone_capacity; ··· 1494 1497 } 1495 1498 1496 1499 commit: 1497 - blk_mq_freeze_queue(q); 1498 - ret = queue_limits_commit_update(q, &lim); 1499 - blk_mq_unfreeze_queue(q); 1500 - 1501 - return ret; 1500 + return queue_limits_commit_update_frozen(q, &lim); 1502 1501 } 1503 1502 1504 1503 static int blk_revalidate_conv_zone(struct blk_zone *zone, unsigned int idx, ··· 1769 1776 EXPORT_SYMBOL_GPL(blk_zone_issue_zeroout); 1770 1777 1771 1778 #ifdef CONFIG_BLK_DEBUG_FS 1779 + static void queue_zone_wplug_show(struct blk_zone_wplug *zwplug, 1780 + struct seq_file *m) 1781 + { 1782 + unsigned int zwp_wp_offset, zwp_flags; 1783 + unsigned int zwp_zone_no, zwp_ref; 1784 + unsigned int zwp_bio_list_size; 1785 + unsigned long flags; 1786 + 1787 + spin_lock_irqsave(&zwplug->lock, flags); 1788 + zwp_zone_no = zwplug->zone_no; 1789 + zwp_flags = zwplug->flags; 1790 + zwp_ref = refcount_read(&zwplug->ref); 1791 + zwp_wp_offset = zwplug->wp_offset; 1792 + zwp_bio_list_size = bio_list_size(&zwplug->bio_list); 1793 + spin_unlock_irqrestore(&zwplug->lock, flags); 1794 + 1795 + seq_printf(m, "%u 0x%x %u %u %u\n", zwp_zone_no, zwp_flags, zwp_ref, 1796 + zwp_wp_offset, zwp_bio_list_size); 1797 + } 1772 1798 1773 1799 int queue_zone_wplugs_show(void *data, struct seq_file *m) 1774 1800 { 1775 1801 struct request_queue *q = data; 1776 1802 struct gendisk *disk = q->disk; 1777 1803 struct blk_zone_wplug *zwplug; 1778 - unsigned int zwp_wp_offset, 
zwp_flags; 1779 - unsigned int zwp_zone_no, zwp_ref; 1780 - unsigned int zwp_bio_list_size, i; 1781 - unsigned long flags; 1804 + unsigned int i; 1782 1805 1783 1806 if (!disk->zone_wplugs_hash) 1784 1807 return 0; 1785 1808 1786 1809 rcu_read_lock(); 1787 - for (i = 0; i < disk_zone_wplugs_hash_size(disk); i++) { 1788 - hlist_for_each_entry_rcu(zwplug, 1789 - &disk->zone_wplugs_hash[i], node) { 1790 - spin_lock_irqsave(&zwplug->lock, flags); 1791 - zwp_zone_no = zwplug->zone_no; 1792 - zwp_flags = zwplug->flags; 1793 - zwp_ref = refcount_read(&zwplug->ref); 1794 - zwp_wp_offset = zwplug->wp_offset; 1795 - zwp_bio_list_size = bio_list_size(&zwplug->bio_list); 1796 - spin_unlock_irqrestore(&zwplug->lock, flags); 1797 - 1798 - seq_printf(m, "%u 0x%x %u %u %u\n", 1799 - zwp_zone_no, zwp_flags, zwp_ref, 1800 - zwp_wp_offset, zwp_bio_list_size); 1801 - } 1802 - } 1810 + for (i = 0; i < disk_zone_wplugs_hash_size(disk); i++) 1811 + hlist_for_each_entry_rcu(zwplug, &disk->zone_wplugs_hash[i], 1812 + node) 1813 + queue_zone_wplug_show(zwplug, m); 1803 1814 rcu_read_unlock(); 1804 1815 1805 1816 return 0;
+17 -16
block/blk.h
··· 13 13 14 14 struct elevator_type; 15 15 16 + #define BLK_DEV_MAX_SECTORS (LLONG_MAX >> 9) 17 + 16 18 /* Max future timer expiry for timeouts */ 17 19 #define BLK_MAX_TIMEOUT (5 * HZ) 18 20 ··· 558 556 struct gendisk *__alloc_disk_node(struct request_queue *q, int node_id, 559 557 struct lock_class_key *lkclass); 560 558 561 - int bio_add_hw_page(struct request_queue *q, struct bio *bio, 562 - struct page *page, unsigned int len, unsigned int offset, 563 - unsigned int max_sectors, bool *same_page); 564 - 565 - int bio_add_hw_folio(struct request_queue *q, struct bio *bio, 566 - struct folio *folio, size_t len, size_t offset, 567 - unsigned int max_sectors, bool *same_page); 568 - 569 559 /* 570 560 * Clean up a page appropriately, where the page may be pinned, may have a 571 561 * ref taken on it or neither. ··· 714 720 void blk_integrity_prepare(struct request *rq); 715 721 void blk_integrity_complete(struct request *rq, unsigned int nr_bytes); 716 722 717 - static inline void blk_freeze_acquire_lock(struct request_queue *q, bool 718 - disk_dead, bool queue_dying) 723 + #ifdef CONFIG_LOCKDEP 724 + static inline void blk_freeze_acquire_lock(struct request_queue *q) 719 725 { 720 - if (!disk_dead) 726 + if (!q->mq_freeze_disk_dead) 721 727 rwsem_acquire(&q->io_lockdep_map, 0, 1, _RET_IP_); 722 - if (!queue_dying) 728 + if (!q->mq_freeze_queue_dying) 723 729 rwsem_acquire(&q->q_lockdep_map, 0, 1, _RET_IP_); 724 730 } 725 731 726 - static inline void blk_unfreeze_release_lock(struct request_queue *q, bool 727 - disk_dead, bool queue_dying) 732 + static inline void blk_unfreeze_release_lock(struct request_queue *q) 728 733 { 729 - if (!queue_dying) 734 + if (!q->mq_freeze_queue_dying) 730 735 rwsem_release(&q->q_lockdep_map, _RET_IP_); 731 - if (!disk_dead) 736 + if (!q->mq_freeze_disk_dead) 732 737 rwsem_release(&q->io_lockdep_map, _RET_IP_); 733 738 } 739 + #else 740 + static inline void blk_freeze_acquire_lock(struct request_queue *q) 741 + { 742 + } 743 + 
static inline void blk_unfreeze_release_lock(struct request_queue *q) 744 + { 745 + } 746 + #endif 734 747 735 748 #endif /* BLK_INTERNAL_H */
+1 -1
block/bsg-lib.c
··· 381 381 set->queue_depth = 128; 382 382 set->numa_node = NUMA_NO_NODE; 383 383 set->cmd_size = sizeof(struct bsg_job) + dd_job_size; 384 - set->flags = BLK_MQ_F_NO_SCHED | BLK_MQ_F_BLOCKING; 384 + set->flags = BLK_MQ_F_BLOCKING; 385 385 if (blk_mq_alloc_tag_set(set)) 386 386 goto out_tag_set; 387 387
+6 -29
block/elevator.c
··· 405 405 return NULL; 406 406 } 407 407 408 - #define to_elv(atr) container_of((atr), struct elv_fs_entry, attr) 408 + #define to_elv(atr) container_of_const((atr), struct elv_fs_entry, attr) 409 409 410 410 static ssize_t 411 411 elv_attr_show(struct kobject *kobj, struct attribute *attr, char *page) 412 412 { 413 - struct elv_fs_entry *entry = to_elv(attr); 413 + const struct elv_fs_entry *entry = to_elv(attr); 414 414 struct elevator_queue *e; 415 415 ssize_t error; 416 416 ··· 428 428 elv_attr_store(struct kobject *kobj, struct attribute *attr, 429 429 const char *page, size_t length) 430 430 { 431 - struct elv_fs_entry *entry = to_elv(attr); 431 + const struct elv_fs_entry *entry = to_elv(attr); 432 432 struct elevator_queue *e; 433 433 ssize_t error; 434 434 ··· 461 461 462 462 error = kobject_add(&e->kobj, &q->disk->queue_kobj, "iosched"); 463 463 if (!error) { 464 - struct elv_fs_entry *attr = e->type->elevator_attrs; 464 + const struct elv_fs_entry *attr = e->type->elevator_attrs; 465 465 if (attr) { 466 466 while (attr->attr.name) { 467 467 if (sysfs_create_file(&e->kobj, &attr->attr)) ··· 547 547 } 548 548 EXPORT_SYMBOL_GPL(elv_unregister); 549 549 550 - static inline bool elv_support_iosched(struct request_queue *q) 551 - { 552 - if (!queue_is_mq(q) || 553 - (q->tag_set->flags & BLK_MQ_F_NO_SCHED)) 554 - return false; 555 - return true; 556 - } 557 - 558 550 /* 559 551 * For single queue devices, default to using mq-deadline. If we have multiple 560 552 * queues or mq-deadline is not available, default to "none". ··· 572 580 struct elevator_type *e; 573 581 int err; 574 582 575 - if (!elv_support_iosched(q)) 576 - return; 577 - 578 583 WARN_ON_ONCE(blk_queue_registered(q)); 579 584 580 585 if (unlikely(q->elevator)) ··· 590 601 * 591 602 * Disk isn't added yet, so verifying queue lock only manually. 
592 603 */ 593 - blk_freeze_queue_start_non_owner(q); 594 - blk_freeze_acquire_lock(q, true, false); 595 - blk_mq_freeze_queue_wait(q); 604 + blk_mq_freeze_queue(q); 596 605 597 606 blk_mq_cancel_work_sync(q); 598 607 599 608 err = blk_mq_init_sched(q, e); 600 609 601 - blk_unfreeze_release_lock(q, true, false); 602 - blk_mq_unfreeze_queue_non_owner(q); 610 + blk_mq_unfreeze_queue(q); 603 611 604 612 if (err) { 605 613 pr_warn("\"%s\" elevator initialization failed, " ··· 703 717 struct elevator_type *found; 704 718 const char *name; 705 719 706 - if (!elv_support_iosched(disk->queue)) 707 - return; 708 - 709 720 strscpy(elevator_name, buf, sizeof(elevator_name)); 710 721 name = strstrip(elevator_name); 711 722 ··· 720 737 char elevator_name[ELV_NAME_MAX]; 721 738 int ret; 722 739 723 - if (!elv_support_iosched(disk->queue)) 724 - return count; 725 - 726 740 strscpy(elevator_name, buf, sizeof(elevator_name)); 727 741 ret = elevator_change(disk->queue, strstrip(elevator_name)); 728 742 if (!ret) ··· 733 753 struct elevator_queue *eq = q->elevator; 734 754 struct elevator_type *cur = NULL, *e; 735 755 int len = 0; 736 - 737 - if (!elv_support_iosched(q)) 738 - return sprintf(name, "none\n"); 739 756 740 757 if (!q->elevator) { 741 758 len += sprintf(name+len, "[none] ");
+1 -1
block/elevator.h
··· 71 71 72 72 size_t icq_size; /* see iocontext.h */ 73 73 size_t icq_align; /* ditto */ 74 - struct elv_fs_entry *elevator_attrs; 74 + const struct elv_fs_entry *elevator_attrs; 75 75 const char *elevator_name; 76 76 const char *elevator_alias; 77 77 struct module *elevator_owner;
+43 -20
block/genhd.c
··· 58 58 59 59 void set_capacity(struct gendisk *disk, sector_t sectors) 60 60 { 61 + if (sectors > BLK_DEV_MAX_SECTORS) { 62 + pr_warn_once("%s: truncate capacity from %lld to %lld\n", 63 + disk->disk_name, sectors, 64 + BLK_DEV_MAX_SECTORS); 65 + sectors = BLK_DEV_MAX_SECTORS; 66 + } 67 + 61 68 bdev_set_nr_sectors(disk->part0, sectors); 62 69 } 63 70 EXPORT_SYMBOL(set_capacity); ··· 407 400 struct device *ddev = disk_to_dev(disk); 408 401 int ret; 409 402 410 - /* Only makes sense for bio-based to set ->poll_bio */ 411 - if (queue_is_mq(disk->queue) && disk->fops->poll_bio) 403 + if (WARN_ON_ONCE(bdev_nr_sectors(disk->part0) > BLK_DEV_MAX_SECTORS)) 412 404 return -EINVAL; 413 405 414 - /* 415 - * The disk queue should now be all set with enough information about 416 - * the device for the elevator code to pick an adequate default 417 - * elevator if one is needed, that is, for devices requesting queue 418 - * registration. 419 - */ 420 - elevator_init_mq(disk->queue); 406 + if (queue_is_mq(disk->queue)) { 407 + /* 408 + * ->submit_bio and ->poll_bio are bypassed for blk-mq drivers. 409 + */ 410 + if (disk->fops->submit_bio || disk->fops->poll_bio) 411 + return -EINVAL; 421 412 422 - /* Mark bdev as having a submit_bio, if needed */ 423 - if (disk->fops->submit_bio) 413 + /* 414 + * Initialize the I/O scheduler code and pick a default one if 415 + * needed. 
416 + */ 417 + elevator_init_mq(disk->queue); 418 + } else { 419 + if (!disk->fops->submit_bio) 420 + return -EINVAL; 424 421 bdev_set_flag(disk->part0, BD_HAS_SUBMIT_BIO); 422 + } 425 423 426 424 /* 427 425 * If the driver provides an explicit major number it also must provide ··· 673 661 struct request_queue *q = disk->queue; 674 662 struct block_device *part; 675 663 unsigned long idx; 676 - bool start_drain, queue_dying; 664 + bool start_drain; 677 665 678 666 might_sleep(); 679 667 ··· 702 690 */ 703 691 mutex_lock(&disk->open_mutex); 704 692 start_drain = __blk_mark_disk_dead(disk); 705 - queue_dying = blk_queue_dying(q); 706 693 if (start_drain) 707 - blk_freeze_acquire_lock(q, true, queue_dying); 694 + blk_freeze_acquire_lock(q); 708 695 xa_for_each_start(&disk->part_tbl, idx, part, 1) 709 696 drop_partition(part); 710 697 mutex_unlock(&disk->open_mutex); ··· 759 748 blk_mq_exit_queue(q); 760 749 761 750 if (start_drain) 762 - blk_unfreeze_release_lock(q, true, queue_dying); 751 + blk_unfreeze_release_lock(q); 763 752 } 764 753 EXPORT_SYMBOL(del_gendisk); 765 754 ··· 809 798 } 810 799 811 800 #ifdef CONFIG_BLOCK_LEGACY_AUTOLOAD 812 - void blk_request_module(dev_t devt) 801 + static bool blk_probe_dev(dev_t devt) 813 802 { 814 803 unsigned int major = MAJOR(devt); 815 804 struct blk_major_name **n; ··· 819 808 if ((*n)->major == major && (*n)->probe) { 820 809 (*n)->probe(devt); 821 810 mutex_unlock(&major_names_lock); 822 - return; 811 + return true; 823 812 } 824 813 } 825 814 mutex_unlock(&major_names_lock); 815 + return false; 816 + } 826 817 827 - if (request_module("block-major-%d-%d", MAJOR(devt), MINOR(devt)) > 0) 828 - /* Make old-style 2.4 aliases work */ 829 - request_module("block-major-%d", MAJOR(devt)); 818 + void blk_request_module(dev_t devt) 819 + { 820 + int error; 821 + 822 + if (blk_probe_dev(devt)) 823 + return; 824 + 825 + error = request_module("block-major-%d-%d", MAJOR(devt), MINOR(devt)); 826 + /* Make old-style 2.4 aliases work */ 
827 + if (error > 0) 828 + error = request_module("block-major-%d", MAJOR(devt)); 829 + if (!error) 830 + blk_probe_dev(devt); 830 831 } 831 832 #endif /* CONFIG_BLOCK_LEGACY_AUTOLOAD */ 832 833
+1 -1
block/kyber-iosched.c
··· 889 889 #undef KYBER_LAT_SHOW_STORE 890 890 891 891 #define KYBER_LAT_ATTR(op) __ATTR(op##_lat_nsec, 0644, kyber_##op##_lat_show, kyber_##op##_lat_store) 892 - static struct elv_fs_entry kyber_sched_attrs[] = { 892 + static const struct elv_fs_entry kyber_sched_attrs[] = { 893 893 KYBER_LAT_ATTR(read), 894 894 KYBER_LAT_ATTR(write), 895 895 __ATTR_NULL
+1 -1
block/mq-deadline.c
··· 834 834 #define DD_ATTR(name) \ 835 835 __ATTR(name, 0644, deadline_##name##_show, deadline_##name##_store) 836 836 837 - static struct elv_fs_entry deadline_attrs[] = { 837 + static const struct elv_fs_entry deadline_attrs[] = { 838 838 DD_ATTR(read_expire), 839 839 DD_ATTR(write_expire), 840 840 DD_ATTR(writes_starved),
+1 -1
block/partitions/ldm.h
··· 1 1 // SPDX-License-Identifier: GPL-2.0-or-later 2 - /** 2 + /* 3 3 * ldm - Part of the Linux-NTFS project. 4 4 * 5 5 * Copyright (C) 2001,2002 Richard Russon <ldm@flatcap.org>
+1 -1
drivers/ata/ahci.h
··· 396 396 .shost_groups = ahci_shost_groups, \ 397 397 .sdev_groups = ahci_sdev_groups, \ 398 398 .change_queue_depth = ata_scsi_change_queue_depth, \ 399 - .tag_alloc_policy = BLK_TAG_ALLOC_RR, \ 399 + .tag_alloc_policy_rr = true, \ 400 400 .device_configure = ata_scsi_device_configure 401 401 402 402 extern struct ata_port_operations ahci_ops;
+1 -1
drivers/ata/pata_macio.c
··· 935 935 .device_configure = pata_macio_device_configure, 936 936 .sdev_groups = ata_common_sdev_groups, 937 937 .can_queue = ATA_DEF_QUEUE, 938 - .tag_alloc_policy = BLK_TAG_ALLOC_RR, 938 + .tag_alloc_policy_rr = true, 939 939 }; 940 940 941 941 static struct ata_port_operations pata_macio_ops = {
+1 -1
drivers/ata/sata_mv.c
··· 672 672 .dma_boundary = MV_DMA_BOUNDARY, 673 673 .sdev_groups = ata_ncq_sdev_groups, 674 674 .change_queue_depth = ata_scsi_change_queue_depth, 675 - .tag_alloc_policy = BLK_TAG_ALLOC_RR, 675 + .tag_alloc_policy_rr = true, 676 676 .device_configure = ata_scsi_device_configure 677 677 }; 678 678
+2 -2
drivers/ata/sata_nv.c
··· 385 385 .device_configure = nv_adma_device_configure, 386 386 .sdev_groups = ata_ncq_sdev_groups, 387 387 .change_queue_depth = ata_scsi_change_queue_depth, 388 - .tag_alloc_policy = BLK_TAG_ALLOC_RR, 388 + .tag_alloc_policy_rr = true, 389 389 }; 390 390 391 391 static const struct scsi_host_template nv_swncq_sht = { ··· 396 396 .device_configure = nv_swncq_device_configure, 397 397 .sdev_groups = ata_ncq_sdev_groups, 398 398 .change_queue_depth = ata_scsi_change_queue_depth, 399 - .tag_alloc_policy = BLK_TAG_ALLOC_RR, 399 + .tag_alloc_policy_rr = true, 400 400 }; 401 401 402 402 /*
-1
drivers/ata/sata_sil24.c
··· 378 378 .can_queue = SIL24_MAX_CMDS, 379 379 .sg_tablesize = SIL24_MAX_SGE, 380 380 .dma_boundary = ATA_DMA_BOUNDARY, 381 - .tag_alloc_policy = BLK_TAG_ALLOC_FIFO, 382 381 .sdev_groups = ata_ncq_sdev_groups, 383 382 .change_queue_depth = ata_scsi_change_queue_depth, 384 383 .device_configure = ata_scsi_device_configure
-1
drivers/block/amiflop.c
··· 1819 1819 unit[drive].tag_set.nr_maps = 1; 1820 1820 unit[drive].tag_set.queue_depth = 2; 1821 1821 unit[drive].tag_set.numa_node = NUMA_NO_NODE; 1822 - unit[drive].tag_set.flags = BLK_MQ_F_SHOULD_MERGE; 1823 1822 if (blk_mq_alloc_tag_set(&unit[drive].tag_set)) 1824 1823 goto out_cleanup_trackbuf; 1825 1824
-1
drivers/block/aoe/aoeblk.c
··· 368 368 set->nr_hw_queues = 1; 369 369 set->queue_depth = 128; 370 370 set->numa_node = NUMA_NO_NODE; 371 - set->flags = BLK_MQ_F_SHOULD_MERGE; 372 371 err = blk_mq_alloc_tag_set(set); 373 372 if (err) { 374 373 pr_err("aoe: cannot allocate tag set for %ld.%d\n",
-1
drivers/block/ataflop.c
··· 2088 2088 unit[i].tag_set.nr_maps = 1; 2089 2089 unit[i].tag_set.queue_depth = 2; 2090 2090 unit[i].tag_set.numa_node = NUMA_NO_NODE; 2091 - unit[i].tag_set.flags = BLK_MQ_F_SHOULD_MERGE; 2092 2091 ret = blk_mq_alloc_tag_set(&unit[i].tag_set); 2093 2092 if (ret) 2094 2093 goto err;
-1
drivers/block/floppy.c
··· 4596 4596 tag_sets[drive].nr_maps = 1; 4597 4597 tag_sets[drive].queue_depth = 2; 4598 4598 tag_sets[drive].numa_node = NUMA_NO_NODE; 4599 - tag_sets[drive].flags = BLK_MQ_F_SHOULD_MERGE; 4600 4599 err = blk_mq_alloc_tag_set(&tag_sets[drive]); 4601 4600 if (err) 4602 4601 goto out_put_disk;
+97 -81
drivers/block/loop.c
··· 68 68 struct list_head idle_worker_list; 69 69 struct rb_root worker_tree; 70 70 struct timer_list timer; 71 - bool use_dio; 72 71 bool sysfs_inited; 73 72 74 73 struct request_queue *lo_queue; ··· 181 182 return true; 182 183 } 183 184 184 - static void __loop_update_dio(struct loop_device *lo, bool dio) 185 + static bool lo_can_use_dio(struct loop_device *lo) 185 186 { 186 - struct file *file = lo->lo_backing_file; 187 - struct inode *inode = file->f_mapping->host; 188 - struct block_device *backing_bdev = NULL; 189 - bool use_dio; 187 + struct inode *inode = lo->lo_backing_file->f_mapping->host; 188 + 189 + if (!(lo->lo_backing_file->f_mode & FMODE_CAN_ODIRECT)) 190 + return false; 190 191 191 192 if (S_ISBLK(inode->i_mode)) 192 - backing_bdev = I_BDEV(inode); 193 - else if (inode->i_sb->s_bdev) 194 - backing_bdev = inode->i_sb->s_bdev; 193 + return lo_bdev_can_use_dio(lo, I_BDEV(inode)); 194 + if (inode->i_sb->s_bdev) 195 + return lo_bdev_can_use_dio(lo, inode->i_sb->s_bdev); 196 + return true; 197 + } 195 198 196 - use_dio = dio && (file->f_mode & FMODE_CAN_ODIRECT) && 197 - (!backing_bdev || lo_bdev_can_use_dio(lo, backing_bdev)); 199 + /* 200 + * Direct I/O can be enabled either by using an O_DIRECT file descriptor, or by 201 + * passing in the LO_FLAGS_DIRECT_IO flag from userspace. It will be silently 202 + * disabled when the device block size is too small or the offset is unaligned. 203 + * 204 + * loop_get_status will always report the effective LO_FLAGS_DIRECT_IO flag and 205 + * not the originally passed in one. 
206 + */ 207 + static inline void loop_update_dio(struct loop_device *lo) 208 + { 209 + bool dio_in_use = lo->lo_flags & LO_FLAGS_DIRECT_IO; 198 210 199 - if (lo->use_dio == use_dio) 200 - return; 211 + lockdep_assert_held(&lo->lo_mutex); 212 + WARN_ON_ONCE(lo->lo_state == Lo_bound && 213 + lo->lo_queue->mq_freeze_depth == 0); 201 214 202 - /* flush dirty pages before changing direct IO */ 203 - vfs_fsync(file, 0); 204 - 205 - /* 206 - * The flag of LO_FLAGS_DIRECT_IO is handled similarly with 207 - * LO_FLAGS_READ_ONLY, both are set from kernel, and losetup 208 - * will get updated by ioctl(LOOP_GET_STATUS) 209 - */ 210 - if (lo->lo_state == Lo_bound) 211 - blk_mq_freeze_queue(lo->lo_queue); 212 - lo->use_dio = use_dio; 213 - if (use_dio) 215 + if (lo->lo_backing_file->f_flags & O_DIRECT) 214 216 lo->lo_flags |= LO_FLAGS_DIRECT_IO; 215 - else 217 + if ((lo->lo_flags & LO_FLAGS_DIRECT_IO) && !lo_can_use_dio(lo)) 216 218 lo->lo_flags &= ~LO_FLAGS_DIRECT_IO; 217 - if (lo->lo_state == Lo_bound) 218 - blk_mq_unfreeze_queue(lo->lo_queue); 219 + 220 + /* flush dirty pages before starting to issue direct I/O */ 221 + if ((lo->lo_flags & LO_FLAGS_DIRECT_IO) && !dio_in_use) 222 + vfs_fsync(lo->lo_backing_file, 0); 219 223 } 220 224 221 225 /** ··· 313 311 lim.discard_granularity = 0; 314 312 } 315 313 314 + /* 315 + * XXX: this updates the queue limits without freezing the queue, which 316 + * is against the locking protocol and dangerous. But we can't just 317 + * freeze the queue as we're inside the ->queue_rq method here. So this 318 + * should move out into a workqueue unless we get the file operations to 319 + * advertise if they support specific fallocate operations. 
320 + */ 316 321 queue_limits_commit_update(lo->lo_queue, &lim); 317 322 } 318 323 ··· 527 518 WARN_ON_ONCE(1); 528 519 return -EIO; 529 520 } 530 - } 531 - 532 - static inline void loop_update_dio(struct loop_device *lo) 533 - { 534 - __loop_update_dio(lo, (lo->lo_backing_file->f_flags & O_DIRECT) | 535 - lo->use_dio); 536 521 } 537 522 538 523 static void loop_reread_partitions(struct loop_device *lo) ··· 967 964 968 965 memcpy(lo->lo_file_name, info->lo_file_name, LO_NAME_SIZE); 969 966 lo->lo_file_name[LO_NAME_SIZE-1] = 0; 970 - lo->lo_flags = info->lo_flags; 971 967 return 0; 972 968 } 973 969 ··· 979 977 return SECTOR_SIZE; 980 978 } 981 979 982 - static int loop_reconfigure_limits(struct loop_device *lo, unsigned int bsize) 980 + static void loop_update_limits(struct loop_device *lo, struct queue_limits *lim, 981 + unsigned int bsize) 983 982 { 984 983 struct file *file = lo->lo_backing_file; 985 984 struct inode *inode = file->f_mapping->host; 986 985 struct block_device *backing_bdev = NULL; 987 - struct queue_limits lim; 988 986 u32 granularity = 0, max_discard_sectors = 0; 989 987 990 988 if (S_ISBLK(inode->i_mode)) ··· 997 995 998 996 loop_get_discard_config(lo, &granularity, &max_discard_sectors); 999 997 1000 - lim = queue_limits_start_update(lo->lo_queue); 1001 - lim.logical_block_size = bsize; 1002 - lim.physical_block_size = bsize; 1003 - lim.io_min = bsize; 1004 - lim.features &= ~(BLK_FEAT_WRITE_CACHE | BLK_FEAT_ROTATIONAL); 998 + lim->logical_block_size = bsize; 999 + lim->physical_block_size = bsize; 1000 + lim->io_min = bsize; 1001 + lim->features &= ~(BLK_FEAT_WRITE_CACHE | BLK_FEAT_ROTATIONAL); 1005 1002 if (file->f_op->fsync && !(lo->lo_flags & LO_FLAGS_READ_ONLY)) 1006 - lim.features |= BLK_FEAT_WRITE_CACHE; 1003 + lim->features |= BLK_FEAT_WRITE_CACHE; 1007 1004 if (backing_bdev && !bdev_nonrot(backing_bdev)) 1008 - lim.features |= BLK_FEAT_ROTATIONAL; 1009 - lim.max_hw_discard_sectors = max_discard_sectors; 1010 - 
lim.max_write_zeroes_sectors = max_discard_sectors; 1005 + lim->features |= BLK_FEAT_ROTATIONAL; 1006 + lim->max_hw_discard_sectors = max_discard_sectors; 1007 + lim->max_write_zeroes_sectors = max_discard_sectors; 1011 1008 if (max_discard_sectors) 1012 - lim.discard_granularity = granularity; 1009 + lim->discard_granularity = granularity; 1013 1010 else 1014 - lim.discard_granularity = 0; 1015 - return queue_limits_commit_update(lo->lo_queue, &lim); 1011 + lim->discard_granularity = 0; 1016 1012 } 1017 1013 1018 1014 static int loop_configure(struct loop_device *lo, blk_mode_t mode, ··· 1019 1019 { 1020 1020 struct file *file = fget(config->fd); 1021 1021 struct address_space *mapping; 1022 + struct queue_limits lim; 1022 1023 int error; 1023 1024 loff_t size; 1024 1025 bool partscan; ··· 1064 1063 error = loop_set_status_from_info(lo, &config->info); 1065 1064 if (error) 1066 1065 goto out_unlock; 1066 + lo->lo_flags = config->info.lo_flags; 1067 1067 1068 1068 if (!(file->f_mode & FMODE_WRITE) || !(mode & BLK_OPEN_WRITE) || 1069 1069 !file->f_op->write_iter) ··· 1086 1084 disk_force_media_change(lo->lo_disk); 1087 1085 set_disk_ro(lo->lo_disk, (lo->lo_flags & LO_FLAGS_READ_ONLY) != 0); 1088 1086 1089 - lo->use_dio = lo->lo_flags & LO_FLAGS_DIRECT_IO; 1090 1087 lo->lo_device = bdev; 1091 1088 lo->lo_backing_file = file; 1092 1089 lo->old_gfp_mask = mapping_gfp_mask(mapping); 1093 1090 mapping_set_gfp_mask(mapping, lo->old_gfp_mask & ~(__GFP_IO|__GFP_FS)); 1094 1091 1095 - error = loop_reconfigure_limits(lo, config->block_size); 1092 + lim = queue_limits_start_update(lo->lo_queue); 1093 + loop_update_limits(lo, &lim, config->block_size); 1094 + /* No need to freeze the queue as the device isn't bound yet. 
*/ 1095 + error = queue_limits_commit_update(lo->lo_queue, &lim); 1096 1096 if (error) 1097 1097 goto out_unlock; 1098 1098 ··· 1154 1150 lo->lo_sizelimit = 0; 1155 1151 memset(lo->lo_file_name, 0, LO_NAME_SIZE); 1156 1152 1157 - /* reset the block size to the default */ 1153 + /* 1154 + * Reset the block size to the default. 1155 + * 1156 + * No queue freezing needed because this is called from the final 1157 + * ->release call only, so there can't be any outstanding I/O. 1158 + */ 1158 1159 lim = queue_limits_start_update(lo->lo_queue); 1159 1160 lim.logical_block_size = SECTOR_SIZE; 1160 1161 lim.physical_block_size = SECTOR_SIZE; ··· 1253 1244 loop_set_status(struct loop_device *lo, const struct loop_info64 *info) 1254 1245 { 1255 1246 int err; 1256 - int prev_lo_flags; 1257 1247 bool partscan = false; 1258 1248 bool size_changed = false; 1259 1249 ··· 1271 1263 invalidate_bdev(lo->lo_device); 1272 1264 } 1273 1265 1274 - /* I/O need to be drained during transfer transition */ 1266 + /* I/O needs to be drained before changing lo_offset or lo_sizelimit */ 1275 1267 blk_mq_freeze_queue(lo->lo_queue); 1276 - 1277 - prev_lo_flags = lo->lo_flags; 1278 1268 1279 1269 err = loop_set_status_from_info(lo, info); 1280 1270 if (err) 1281 1271 goto out_unfreeze; 1282 1272 1283 - /* Mask out flags that can't be set using LOOP_SET_STATUS. 
*/ 1284 - lo->lo_flags &= LOOP_SET_STATUS_SETTABLE_FLAGS; 1285 - /* For those flags, use the previous values instead */ 1286 - lo->lo_flags |= prev_lo_flags & ~LOOP_SET_STATUS_SETTABLE_FLAGS; 1287 - /* For flags that can't be cleared, use previous values too */ 1288 - lo->lo_flags |= prev_lo_flags & ~LOOP_SET_STATUS_CLEARABLE_FLAGS; 1273 + partscan = !(lo->lo_flags & LO_FLAGS_PARTSCAN) && 1274 + (info->lo_flags & LO_FLAGS_PARTSCAN); 1275 + 1276 + lo->lo_flags &= ~(LOOP_SET_STATUS_SETTABLE_FLAGS | 1277 + LOOP_SET_STATUS_CLEARABLE_FLAGS); 1278 + lo->lo_flags |= (info->lo_flags & LOOP_SET_STATUS_SETTABLE_FLAGS); 1289 1279 1290 1280 if (size_changed) { 1291 1281 loff_t new_size = get_size(lo->lo_offset, lo->lo_sizelimit, ··· 1291 1285 loop_set_size(lo, new_size); 1292 1286 } 1293 1287 1294 - /* update dio if lo_offset or transfer is changed */ 1295 - __loop_update_dio(lo, lo->use_dio); 1288 + /* update the direct I/O flag if lo_offset changed */ 1289 + loop_update_dio(lo); 1296 1290 1297 1291 out_unfreeze: 1298 1292 blk_mq_unfreeze_queue(lo->lo_queue); 1299 - 1300 - if (!err && (lo->lo_flags & LO_FLAGS_PARTSCAN) && 1301 - !(prev_lo_flags & LO_FLAGS_PARTSCAN)) { 1293 + if (partscan) 1302 1294 clear_bit(GD_SUPPRESS_PART_SCAN, &lo->lo_disk->state); 1303 - partscan = true; 1304 - } 1305 1295 out_unlock: 1306 1296 mutex_unlock(&lo->lo_mutex); 1307 1297 if (partscan) ··· 1446 1444 1447 1445 static int loop_set_dio(struct loop_device *lo, unsigned long arg) 1448 1446 { 1449 - int error = -ENXIO; 1450 - if (lo->lo_state != Lo_bound) 1451 - goto out; 1447 + bool use_dio = !!arg; 1452 1448 1453 - __loop_update_dio(lo, !!arg); 1454 - if (lo->use_dio == !!arg) 1449 + if (lo->lo_state != Lo_bound) 1450 + return -ENXIO; 1451 + if (use_dio == !!(lo->lo_flags & LO_FLAGS_DIRECT_IO)) 1455 1452 return 0; 1456 - error = -EINVAL; 1457 - out: 1458 - return error; 1453 + 1454 + if (use_dio) { 1455 + if (!lo_can_use_dio(lo)) 1456 + return -EINVAL; 1457 + /* flush dirty pages before starting 
to use direct I/O */ 1458 + vfs_fsync(lo->lo_backing_file, 0); 1459 + } 1460 + 1461 + blk_mq_freeze_queue(lo->lo_queue); 1462 + if (use_dio) 1463 + lo->lo_flags |= LO_FLAGS_DIRECT_IO; 1464 + else 1465 + lo->lo_flags &= ~LO_FLAGS_DIRECT_IO; 1466 + blk_mq_unfreeze_queue(lo->lo_queue); 1467 + return 0; 1459 1468 } 1460 1469 1461 1470 static int loop_set_block_size(struct loop_device *lo, unsigned long arg) 1462 1471 { 1472 + struct queue_limits lim; 1463 1473 int err = 0; 1464 1474 1465 1475 if (lo->lo_state != Lo_bound) ··· 1483 1469 sync_blockdev(lo->lo_device); 1484 1470 invalidate_bdev(lo->lo_device); 1485 1471 1472 + lim = queue_limits_start_update(lo->lo_queue); 1473 + loop_update_limits(lo, &lim, arg); 1474 + 1486 1475 blk_mq_freeze_queue(lo->lo_queue); 1487 - err = loop_reconfigure_limits(lo, arg); 1476 + err = queue_limits_commit_update(lo->lo_queue, &lim); 1488 1477 loop_update_dio(lo); 1489 1478 blk_mq_unfreeze_queue(lo->lo_queue); 1490 1479 ··· 1871 1854 cmd->use_aio = false; 1872 1855 break; 1873 1856 default: 1874 - cmd->use_aio = lo->use_dio; 1857 + cmd->use_aio = lo->lo_flags & LO_FLAGS_DIRECT_IO; 1875 1858 break; 1876 1859 } 1877 1860 ··· 2040 2023 lo->tag_set.queue_depth = hw_queue_depth; 2041 2024 lo->tag_set.numa_node = NUMA_NO_NODE; 2042 2025 lo->tag_set.cmd_size = sizeof(struct loop_cmd); 2043 - lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_STACKING | 2044 - BLK_MQ_F_NO_SCHED_BY_DEFAULT; 2026 + lo->tag_set.flags = BLK_MQ_F_STACKING | BLK_MQ_F_NO_SCHED_BY_DEFAULT; 2045 2027 lo->tag_set.driver_data = lo; 2046 2028 2047 2029 err = blk_mq_alloc_tag_set(&lo->tag_set);
-1
drivers/block/mtip32xx/mtip32xx.c
··· 3416 3416 dd->tags.reserved_tags = 1; 3417 3417 dd->tags.cmd_size = sizeof(struct mtip_cmd); 3418 3418 dd->tags.numa_node = dd->numa_node; 3419 - dd->tags.flags = BLK_MQ_F_SHOULD_MERGE; 3420 3419 dd->tags.driver_data = dd; 3421 3420 dd->tags.timeout = MTIP_NCQ_CMD_TIMEOUT_MS; 3422 3421
+89 -27
drivers/block/nbd.c
··· 62 62 bool dead; 63 63 int fallback_index; 64 64 int cookie; 65 + struct work_struct work; 65 66 }; 66 67 67 68 struct recv_thread_args { ··· 141 140 * by cmd->lock. 142 141 */ 143 142 #define NBD_CMD_INFLIGHT 2 143 + 144 + /* Just part of request header or data payload is sent successfully */ 145 + #define NBD_CMD_PARTIAL_SEND 3 144 146 145 147 struct nbd_cmd { 146 148 struct nbd_device *nbd; ··· 331 327 nsock->sent = 0; 332 328 } 333 329 334 - static int __nbd_set_size(struct nbd_device *nbd, loff_t bytesize, 335 - loff_t blksize) 330 + static int nbd_set_size(struct nbd_device *nbd, loff_t bytesize, loff_t blksize) 336 331 { 337 332 struct queue_limits lim; 338 333 int error; ··· 371 368 372 369 lim.logical_block_size = blksize; 373 370 lim.physical_block_size = blksize; 374 - error = queue_limits_commit_update(nbd->disk->queue, &lim); 371 + error = queue_limits_commit_update_frozen(nbd->disk->queue, &lim); 375 372 if (error) 376 373 return error; 377 374 ··· 380 377 if (!set_capacity_and_notify(nbd->disk, bytesize >> 9)) 381 378 kobject_uevent(&nbd_to_dev(nbd)->kobj, KOBJ_CHANGE); 382 379 return 0; 383 - } 384 - 385 - static int nbd_set_size(struct nbd_device *nbd, loff_t bytesize, 386 - loff_t blksize) 387 - { 388 - int error; 389 - 390 - blk_mq_freeze_queue(nbd->disk->queue); 391 - error = __nbd_set_size(nbd, bytesize, blksize); 392 - blk_mq_unfreeze_queue(nbd->disk->queue); 393 - 394 - return error; 395 380 } 396 381 397 382 static void nbd_complete_rq(struct request *req) ··· 456 465 457 466 if (!mutex_trylock(&cmd->lock)) 458 467 return BLK_EH_RESET_TIMER; 468 + 469 + /* partial send is handled in nbd_sock's work function */ 470 + if (test_bit(NBD_CMD_PARTIAL_SEND, &cmd->flags)) { 471 + mutex_unlock(&cmd->lock); 472 + return BLK_EH_RESET_TIMER; 473 + } 459 474 460 475 if (!test_bit(NBD_CMD_INFLIGHT, &cmd->flags)) { 461 476 mutex_unlock(&cmd->lock); ··· 612 615 } 613 616 614 617 /* 618 + * We've already sent header or part of data payload, have no 
choice but 619 + * to set pending and schedule it in work. 620 + * 621 + * And we have to return BLK_STS_OK to block core, otherwise this same 622 + * request may be re-dispatched with different tag, but our header has 623 + * been sent out with old tag, and this way does confuse reply handling. 624 + */ 625 + static void nbd_sched_pending_work(struct nbd_device *nbd, 626 + struct nbd_sock *nsock, 627 + struct nbd_cmd *cmd, int sent) 628 + { 629 + struct request *req = blk_mq_rq_from_pdu(cmd); 630 + 631 + /* pending work should be scheduled only once */ 632 + WARN_ON_ONCE(test_bit(NBD_CMD_PARTIAL_SEND, &cmd->flags)); 633 + 634 + nsock->pending = req; 635 + nsock->sent = sent; 636 + set_bit(NBD_CMD_PARTIAL_SEND, &cmd->flags); 637 + refcount_inc(&nbd->config_refs); 638 + schedule_work(&nsock->work); 639 + } 640 + 641 + /* 615 642 * Returns BLK_STS_RESOURCE if the caller should retry after a delay. 616 643 * Returns BLK_STS_IOERR if sending failed. 617 644 */ ··· 720 699 * completely done. 721 700 */ 722 701 if (sent) { 723 - nsock->pending = req; 724 - nsock->sent = sent; 702 + nbd_sched_pending_work(nbd, nsock, cmd, sent); 703 + return BLK_STS_OK; 725 704 } 726 705 set_bit(NBD_CMD_REQUEUED, &cmd->flags); 727 706 return BLK_STS_RESOURCE; ··· 758 737 result = sock_xmit(nbd, index, 1, &from, flags, &sent); 759 738 if (result < 0) { 760 739 if (was_interrupted(result)) { 761 - /* We've already sent the header, we 762 - * have no choice but to set pending and 763 - * return BUSY. 764 - */ 765 - nsock->pending = req; 766 - nsock->sent = sent; 767 - set_bit(NBD_CMD_REQUEUED, &cmd->flags); 768 - return BLK_STS_RESOURCE; 740 + nbd_sched_pending_work(nbd, nsock, cmd, sent); 741 + return BLK_STS_OK; 769 742 } 770 743 dev_err(disk_to_dev(nbd->disk), 771 744 "Send data failed (result %d)\n", ··· 785 770 return BLK_STS_OK; 786 771 787 772 requeue: 773 + /* 774 + * Can't requeue in case we are dealing with partial send 775 + * 776 + * We must run from pending work function. 
777 + * */ 778 + if (test_bit(NBD_CMD_PARTIAL_SEND, &cmd->flags)) 779 + return BLK_STS_OK; 780 + 788 781 /* retry on a different socket */ 789 782 dev_err_ratelimited(disk_to_dev(nbd->disk), 790 783 "Request send failed, requeueing\n"); 791 784 nbd_mark_nsock_dead(nbd, nsock, 1); 792 785 nbd_requeue_cmd(cmd); 793 786 return BLK_STS_OK; 787 + } 788 + 789 + /* handle partial sending */ 790 + static void nbd_pending_cmd_work(struct work_struct *work) 791 + { 792 + struct nbd_sock *nsock = container_of(work, struct nbd_sock, work); 793 + struct request *req = nsock->pending; 794 + struct nbd_cmd *cmd = blk_mq_rq_to_pdu(req); 795 + struct nbd_device *nbd = cmd->nbd; 796 + unsigned long deadline = READ_ONCE(req->deadline); 797 + unsigned int wait_ms = 2; 798 + 799 + mutex_lock(&cmd->lock); 800 + 801 + WARN_ON_ONCE(test_bit(NBD_CMD_REQUEUED, &cmd->flags)); 802 + if (WARN_ON_ONCE(!test_bit(NBD_CMD_PARTIAL_SEND, &cmd->flags))) 803 + goto out; 804 + 805 + mutex_lock(&nsock->tx_lock); 806 + while (true) { 807 + nbd_send_cmd(nbd, cmd, cmd->index); 808 + if (!nsock->pending) 809 + break; 810 + 811 + /* don't bother timeout handler for partial sending */ 812 + if (READ_ONCE(jiffies) + msecs_to_jiffies(wait_ms) >= deadline) { 813 + cmd->status = BLK_STS_IOERR; 814 + blk_mq_complete_request(req); 815 + break; 816 + } 817 + msleep(wait_ms); 818 + wait_ms *= 2; 819 + } 820 + mutex_unlock(&nsock->tx_lock); 821 + clear_bit(NBD_CMD_PARTIAL_SEND, &cmd->flags); 822 + out: 823 + mutex_unlock(&cmd->lock); 824 + nbd_config_put(nbd); 794 825 } 795 826 796 827 static int nbd_read_reply(struct nbd_device *nbd, struct socket *sock, ··· 1285 1224 nsock->pending = NULL; 1286 1225 nsock->sent = 0; 1287 1226 nsock->cookie = 0; 1227 + INIT_WORK(&nsock->work, nbd_pending_cmd_work); 1288 1228 socks[config->num_connections++] = nsock; 1289 1229 atomic_inc(&config->live_connections); 1290 1230 blk_mq_unfreeze_queue(nbd->disk->queue); ··· 1903 1841 nbd->tag_set.queue_depth = 128; 1904 1842 
nbd->tag_set.numa_node = NUMA_NO_NODE; 1905 1843 nbd->tag_set.cmd_size = sizeof(struct nbd_cmd); 1906 - nbd->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | 1907 - BLK_MQ_F_BLOCKING; 1844 + nbd->tag_set.flags = BLK_MQ_F_BLOCKING; 1908 1845 nbd->tag_set.driver_data = nbd; 1909 1846 INIT_WORK(&nbd->remove_work, nbd_dev_remove_work); 1910 1847 nbd->backend = NULL; ··· 2241 2180 flush_workqueue(nbd->recv_workq); 2242 2181 nbd_clear_que(nbd); 2243 2182 nbd->task_setup = NULL; 2183 + clear_bit(NBD_RT_BOUND, &nbd->config->runtime_flags); 2244 2184 mutex_unlock(&nbd->config_lock); 2245 2185 2246 2186 if (test_and_clear_bit(NBD_RT_HAS_CONFIG_REF,
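The new `nbd_pending_cmd_work()` in the nbd diff retries a partial send with a doubling sleep (`wait_ms` starts at 2 and doubles each round), completing the request with an I/O error once the next wait would cross the request deadline. A standalone sketch of that backoff schedule (the helper name and the simulated clock are illustrative, not kernel API):

```c
#include <assert.h>

/* Count how many retries a doubling backoff permits before
 * `now + wait` would reach the deadline -- the same schedule
 * nbd's pending-work function uses between send attempts. */
static int backoff_attempts(unsigned long now_ms, unsigned long deadline_ms,
                            unsigned int start_ms)
{
    unsigned int wait = start_ms;
    int attempts = 0;

    while (now_ms + wait < deadline_ms) {
        now_ms += wait;   /* stand-in for msleep(wait) */
        wait *= 2;
        attempts++;
    }
    return attempts;
}
```

With a 100 ms budget and a 2 ms starting wait, the sleeps run 2, 4, 8, 16, 32 ms before the 64 ms step would overshoot, so only a handful of retries ever race the timeout handler.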
+20 -11
drivers/block/null_blk/main.c
··· 266 266 module_param_named(zone_full, g_zone_full, bool, S_IRUGO); 267 267 MODULE_PARM_DESC(zone_full, "Initialize the sequential write required zones of a zoned device to be full. Default: false"); 268 268 269 + static bool g_rotational; 270 + module_param_named(rotational, g_rotational, bool, S_IRUGO); 271 + MODULE_PARM_DESC(rotational, "Set the rotational feature for the device. Default: false"); 272 + 269 273 static struct nullb_device *null_alloc_dev(void); 270 274 static void null_free_dev(struct nullb_device *dev); 271 275 static void null_del_dev(struct nullb *nullb); ··· 472 468 NULLB_DEVICE_ATTR(shared_tags, bool, NULL); 473 469 NULLB_DEVICE_ATTR(shared_tag_bitmap, bool, NULL); 474 470 NULLB_DEVICE_ATTR(fua, bool, NULL); 471 + NULLB_DEVICE_ATTR(rotational, bool, NULL); 475 472 476 473 static ssize_t nullb_device_power_show(struct config_item *item, char *page) 477 474 { ··· 626 621 &nullb_device_attr_shared_tags, 627 622 &nullb_device_attr_shared_tag_bitmap, 628 623 &nullb_device_attr_fua, 624 + &nullb_device_attr_rotational, 629 625 NULL, 630 626 }; 631 627 ··· 712 706 "shared_tags,size,submit_queues,use_per_node_hctx," 713 707 "virt_boundary,zoned,zone_capacity,zone_max_active," 714 708 "zone_max_open,zone_nr_conv,zone_offline,zone_readonly," 715 - "zone_size,zone_append_max_sectors,zone_full\n"); 709 + "zone_size,zone_append_max_sectors,zone_full," 710 + "rotational\n"); 716 711 } 717 712 718 713 CONFIGFS_ATTR_RO(memb_group_, features); ··· 800 793 dev->shared_tags = g_shared_tags; 801 794 dev->shared_tag_bitmap = g_shared_tag_bitmap; 802 795 dev->fua = g_fua; 796 + dev->rotational = g_rotational; 803 797 804 798 return dev; 805 799 } ··· 907 899 if (radix_tree_insert(root, idx, t_page)) { 908 900 null_free_page(t_page); 909 901 t_page = radix_tree_lookup(root, idx); 910 - WARN_ON(!t_page || t_page->page->index != idx); 902 + WARN_ON(!t_page || t_page->page->private != idx); 911 903 } else if (is_cache) 912 904 nullb->dev->curr_cache += PAGE_SIZE; 
913 905 ··· 930 922 (void **)t_pages, pos, FREE_BATCH); 931 923 932 924 for (i = 0; i < nr_pages; i++) { 933 - pos = t_pages[i]->page->index; 925 + pos = t_pages[i]->page->private; 934 926 ret = radix_tree_delete_item(root, pos, t_pages[i]); 935 927 WARN_ON(ret != t_pages[i]); 936 928 null_free_page(ret); ··· 956 948 957 949 root = is_cache ? &nullb->dev->cache : &nullb->dev->data; 958 950 t_page = radix_tree_lookup(root, idx); 959 - WARN_ON(t_page && t_page->page->index != idx); 951 + WARN_ON(t_page && t_page->page->private != idx); 960 952 961 953 if (t_page && (for_write || test_bit(sector_bit, t_page->bitmap))) 962 954 return t_page; ··· 999 991 1000 992 spin_lock_irq(&nullb->lock); 1001 993 idx = sector >> PAGE_SECTORS_SHIFT; 1002 - t_page->page->index = idx; 994 + t_page->page->private = idx; 1003 995 t_page = null_radix_tree_insert(nullb, idx, t_page, !ignore_cache); 1004 996 radix_tree_preload_end(); 1005 997 ··· 1019 1011 struct nullb_page *t_page, *ret; 1020 1012 void *dst, *src; 1021 1013 1022 - idx = c_page->page->index; 1014 + idx = c_page->page->private; 1023 1015 1024 1016 t_page = null_insert_page(nullb, idx << PAGE_SECTORS_SHIFT, true); 1025 1017 ··· 1078 1070 * avoid race, we don't allow page free 1079 1071 */ 1080 1072 for (i = 0; i < nr_pages; i++) { 1081 - nullb->cache_flush_pos = c_pages[i]->page->index; 1073 + nullb->cache_flush_pos = c_pages[i]->page->private; 1082 1074 /* 1083 1075 * We found the page which is being flushed to disk by other 1084 1076 * threads ··· 1791 1783 tag_set.nr_hw_queues = g_submit_queues; 1792 1784 tag_set.queue_depth = g_hw_queue_depth; 1793 1785 tag_set.numa_node = g_home_node; 1794 - tag_set.flags = BLK_MQ_F_SHOULD_MERGE; 1795 1786 if (g_no_sched) 1796 - tag_set.flags |= BLK_MQ_F_NO_SCHED; 1787 + tag_set.flags |= BLK_MQ_F_NO_SCHED_BY_DEFAULT; 1797 1788 if (g_shared_tag_bitmap) 1798 1789 tag_set.flags |= BLK_MQ_F_TAG_HCTX_SHARED; 1799 1790 if (g_blocking) ··· 1816 1809 nullb->tag_set->nr_hw_queues = 
nullb->dev->submit_queues; 1817 1810 nullb->tag_set->queue_depth = nullb->dev->hw_queue_depth; 1818 1811 nullb->tag_set->numa_node = nullb->dev->home_node; 1819 - nullb->tag_set->flags = BLK_MQ_F_SHOULD_MERGE; 1820 1812 if (nullb->dev->no_sched) 1821 - nullb->tag_set->flags |= BLK_MQ_F_NO_SCHED; 1813 + nullb->tag_set->flags |= BLK_MQ_F_NO_SCHED_BY_DEFAULT; 1822 1814 if (nullb->dev->shared_tag_bitmap) 1823 1815 nullb->tag_set->flags |= BLK_MQ_F_TAG_HCTX_SHARED; 1824 1816 if (nullb->dev->blocking) ··· 1943 1937 if (dev->fua) 1944 1938 lim.features |= BLK_FEAT_FUA; 1945 1939 } 1940 + 1941 + if (dev->rotational) 1942 + lim.features |= BLK_FEAT_ROTATIONAL; 1946 1943 1947 1944 nullb->disk = blk_mq_alloc_disk(nullb->tag_set, &lim, nullb); 1948 1945 if (IS_ERR(nullb->disk)) {
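The null_blk hunks above re-key the backing radix tree bookkeeping through `page->private` instead of the filemap-owned `page->index`; the key itself is unchanged, just the sector shifted down by sectors-per-page. A sketch of that mapping (assuming 4 KiB pages and 512-byte sectors; `PAGE_SHIFT` is arch-dependent in the kernel):

```c
#include <assert.h>

#define SECTOR_SHIFT        9    /* 512-byte sectors */
#define PAGE_SHIFT_ASSUMED  12   /* 4 KiB pages, an assumption for this sketch */
#define PAGE_SECTORS_SHIFT  (PAGE_SHIFT_ASSUMED - SECTOR_SHIFT)

/* radix-tree key for the page backing a given sector,
 * i.e. the value null_blk now stashes in page->private */
static unsigned long null_page_index(unsigned long long sector)
{
    return (unsigned long)(sector >> PAGE_SECTORS_SHIFT);
}
```

Eight 512-byte sectors share one 4 KiB page, so sectors 0..7 map to index 0, sector 8 starts index 1, and so on; the `WARN_ON(... != idx)` checks in the diff verify exactly this invariant after a lookup.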
+1
drivers/block/null_blk/null_blk.h
··· 107 107 bool shared_tags; /* share tag set between devices for blk-mq */ 108 108 bool shared_tag_bitmap; /* use hostwide shared tags */ 109 109 bool fua; /* Support FUA */ 110 + bool rotational; /* Fake rotational device */ 110 111 }; 111 112 112 113 struct nullb {
+3 -4
drivers/block/ps3disk.c
··· 384 384 unsigned int devidx; 385 385 struct queue_limits lim = { 386 386 .logical_block_size = dev->blk_size, 387 - .max_hw_sectors = dev->bounce_size >> 9, 387 + .max_hw_sectors = BOUNCE_SIZE >> 9, 388 388 .max_segments = -1, 389 - .max_segment_size = dev->bounce_size, 389 + .max_segment_size = BOUNCE_SIZE, 390 390 .dma_alignment = dev->blk_size - 1, 391 391 .features = BLK_FEAT_WRITE_CACHE | 392 392 BLK_FEAT_ROTATIONAL, ··· 434 434 435 435 ps3disk_identify(dev); 436 436 437 - error = blk_mq_alloc_sq_tag_set(&priv->tag_set, &ps3disk_mq_ops, 1, 438 - BLK_MQ_F_SHOULD_MERGE); 437 + error = blk_mq_alloc_sq_tag_set(&priv->tag_set, &ps3disk_mq_ops, 1, 0); 439 438 if (error) 440 439 goto fail_teardown; 441 440
-1
drivers/block/rbd.c
··· 4964 4964 rbd_dev->tag_set.ops = &rbd_mq_ops; 4965 4965 rbd_dev->tag_set.queue_depth = rbd_dev->opts->queue_depth; 4966 4966 rbd_dev->tag_set.numa_node = NUMA_NO_NODE; 4967 - rbd_dev->tag_set.flags = BLK_MQ_F_SHOULD_MERGE; 4968 4967 rbd_dev->tag_set.nr_hw_queues = num_present_cpus(); 4969 4968 rbd_dev->tag_set.cmd_size = sizeof(struct rbd_img_request); 4970 4969
+1 -2
drivers/block/rnbd/rnbd-clt.c
··· 1209 1209 tag_set->ops = &rnbd_mq_ops; 1210 1210 tag_set->queue_depth = sess->queue_depth; 1211 1211 tag_set->numa_node = NUMA_NO_NODE; 1212 - tag_set->flags = BLK_MQ_F_SHOULD_MERGE | 1213 - BLK_MQ_F_TAG_QUEUE_SHARED; 1212 + tag_set->flags = BLK_MQ_F_TAG_QUEUE_SHARED; 1214 1213 tag_set->cmd_size = sizeof(struct rnbd_iu) + RNBD_RDMA_SGL_SIZE; 1215 1214 1216 1215 /* for HCTX_TYPE_DEFAULT, HCTX_TYPE_READ, HCTX_TYPE_POLL */
+1 -1
drivers/block/rnbd/rnbd-srv.c
··· 167 167 bio->bi_iter.bi_sector = le64_to_cpu(msg->sector); 168 168 prio = srv_sess->ver < RNBD_PROTO_VER_MAJOR || 169 169 usrlen < sizeof(*msg) ? 0 : le16_to_cpu(msg->prio); 170 - bio_set_prio(bio, prio); 170 + bio->bi_ioprio = prio; 171 171 172 172 submit_bio(bio); 173 173
+18 -12
drivers/block/rnull.rs
··· 32 32 license: "GPL v2", 33 33 } 34 34 35 + #[pin_data] 35 36 struct NullBlkModule { 36 - _disk: Pin<KBox<Mutex<GenDisk<NullBlkDevice>>>>, 37 + #[pin] 38 + _disk: Mutex<GenDisk<NullBlkDevice>>, 37 39 } 38 40 39 - impl kernel::Module for NullBlkModule { 40 - fn init(_module: &'static ThisModule) -> Result<Self> { 41 + impl kernel::InPlaceModule for NullBlkModule { 42 + fn init(_module: &'static ThisModule) -> impl PinInit<Self, Error> { 41 43 pr_info!("Rust null_blk loaded\n"); 42 - let tagset = Arc::pin_init(TagSet::new(1, 256, 1), flags::GFP_KERNEL)?; 43 44 44 - let disk = gen_disk::GenDiskBuilder::new() 45 - .capacity_sectors(4096 << 11) 46 - .logical_block_size(4096)? 47 - .physical_block_size(4096)? 48 - .rotational(false) 49 - .build(format_args!("rnullb{}", 0), tagset)?; 45 + // Use a immediately-called closure as a stable `try` block 46 + let disk = /* try */ (|| { 47 + let tagset = Arc::pin_init(TagSet::new(1, 256, 1), flags::GFP_KERNEL)?; 50 48 51 - let disk = KBox::pin_init(new_mutex!(disk, "nullb:disk"), flags::GFP_KERNEL)?; 49 + gen_disk::GenDiskBuilder::new() 50 + .capacity_sectors(4096 << 11) 51 + .logical_block_size(4096)? 52 + .physical_block_size(4096)? 53 + .rotational(false) 54 + .build(format_args!("rnullb{}", 0), tagset) 55 + })(); 52 56 53 - Ok(Self { _disk: disk }) 57 + try_pin_init!(Self { 58 + _disk <- new_mutex!(disk?, "nullb:disk"), 59 + }) 54 60 } 55 61 } 56 62
+1 -1
drivers/block/sunvdc.c
··· 829 829 } 830 830 831 831 err = blk_mq_alloc_sq_tag_set(&port->tag_set, &vdc_mq_ops, 832 - VDC_TX_RING_SIZE, BLK_MQ_F_SHOULD_MERGE); 832 + VDC_TX_RING_SIZE, 0); 833 833 if (err) 834 834 return err; 835 835
+1 -1
drivers/block/swim.c
··· 818 818 819 819 for (drive = 0; drive < swd->floppy_count; drive++) { 820 820 err = blk_mq_alloc_sq_tag_set(&swd->unit[drive].tag_set, 821 - &swim_mq_ops, 2, BLK_MQ_F_SHOULD_MERGE); 821 + &swim_mq_ops, 2, 0); 822 822 if (err) 823 823 goto exit_put_disks; 824 824
+1 -2
drivers/block/swim3.c
··· 1208 1208 fs = &floppy_states[floppy_count]; 1209 1209 memset(fs, 0, sizeof(*fs)); 1210 1210 1211 - rc = blk_mq_alloc_sq_tag_set(&fs->tag_set, &swim3_mq_ops, 2, 1212 - BLK_MQ_F_SHOULD_MERGE); 1211 + rc = blk_mq_alloc_sq_tag_set(&fs->tag_set, &swim3_mq_ops, 2, 0); 1213 1212 if (rc) 1214 1213 goto out_unregister; 1215 1214
-1
drivers/block/ublk_drv.c
··· 2213 2213 ub->tag_set.queue_depth = ub->dev_info.queue_depth; 2214 2214 ub->tag_set.numa_node = NUMA_NO_NODE; 2215 2215 ub->tag_set.cmd_size = sizeof(struct ublk_rq_data); 2216 - ub->tag_set.flags = BLK_MQ_F_SHOULD_MERGE; 2217 2216 ub->tag_set.driver_data = ub; 2218 2217 return blk_mq_alloc_tag_set(&ub->tag_set); 2219 2218 }
+3 -6
drivers/block/virtio_blk.c
··· 13 13 #include <linux/string_helpers.h> 14 14 #include <linux/idr.h> 15 15 #include <linux/blk-mq.h> 16 - #include <linux/blk-mq-virtio.h> 17 16 #include <linux/numa.h> 18 17 #include <linux/vmalloc.h> 19 18 #include <uapi/linux/virtio_ring.h> ··· 1105 1106 lim.features |= BLK_FEAT_WRITE_CACHE; 1106 1107 else 1107 1108 lim.features &= ~BLK_FEAT_WRITE_CACHE; 1108 - blk_mq_freeze_queue(disk->queue); 1109 - i = queue_limits_commit_update(disk->queue, &lim); 1110 - blk_mq_unfreeze_queue(disk->queue); 1109 + i = queue_limits_commit_update_frozen(disk->queue, &lim); 1111 1110 if (i) 1112 1111 return i; 1113 1112 return count; ··· 1178 1181 if (i == HCTX_TYPE_POLL) 1179 1182 blk_mq_map_queues(&set->map[i]); 1180 1183 else 1181 - blk_mq_virtio_map_queues(&set->map[i], vblk->vdev, 0); 1184 + blk_mq_map_hw_queues(&set->map[i], 1185 + &vblk->vdev->dev, 0); 1182 1186 } 1183 1187 } 1184 1188 ··· 1479 1481 vblk->tag_set.ops = &virtio_mq_ops; 1480 1482 vblk->tag_set.queue_depth = queue_depth; 1481 1483 vblk->tag_set.numa_node = NUMA_NO_NODE; 1482 - vblk->tag_set.flags = BLK_MQ_F_SHOULD_MERGE; 1483 1484 vblk->tag_set.cmd_size = 1484 1485 sizeof(struct virtblk_req) + 1485 1486 sizeof(struct scatterlist) * VIRTIO_BLK_INLINE_SG_CNT;
-1
drivers/block/xen-blkfront.c
··· 1131 1131 } else 1132 1132 info->tag_set.queue_depth = BLK_RING_SIZE(info); 1133 1133 info->tag_set.numa_node = NUMA_NO_NODE; 1134 - info->tag_set.flags = BLK_MQ_F_SHOULD_MERGE; 1135 1134 info->tag_set.cmd_size = sizeof(struct blkif_req); 1136 1135 info->tag_set.driver_data = info; 1137 1136
-1
drivers/block/z2ram.c
··· 354 354 tag_set.nr_maps = 1; 355 355 tag_set.queue_depth = 16; 356 356 tag_set.numa_node = NUMA_NO_NODE; 357 - tag_set.flags = BLK_MQ_F_SHOULD_MERGE; 358 357 ret = blk_mq_alloc_tag_set(&tag_set); 359 358 if (ret) 360 359 goto out_unregister_blkdev;
+1 -1
drivers/cdrom/gdrom.c
··· 777 777 probe_gdrom_setupcd(); 778 778 779 779 err = blk_mq_alloc_sq_tag_set(&gd.tag_set, &gdrom_mq_ops, 1, 780 - BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_BLOCKING); 780 + BLK_MQ_F_BLOCKING); 781 781 if (err) 782 782 goto probe_fail_free_cd_info; 783 783
+13
drivers/md/Kconfig
··· 61 61 various kernel APIs and can only work with files on a file system not 62 62 actually sitting on the MD device. 63 63 64 + config MD_LINEAR 65 + tristate "Linear (append) mode" 66 + depends on BLK_DEV_MD 67 + help 68 + If you say Y here, then your multiple devices driver will be able to 69 + use the so-called linear mode, i.e. it will combine the hard disk 70 + partitions by simply appending one to the other. 71 + 72 + To compile this as a module, choose M here: the module 73 + will be called linear. 74 + 75 + If unsure, say Y. 76 + 64 77 config MD_RAID0 65 78 tristate "RAID-0 (striping) mode" 66 79 depends on BLK_DEV_MD
+2
drivers/md/Makefile
··· 29 29 30 30 md-mod-y += md.o md-bitmap.o 31 31 raid456-y += raid5.o raid5-cache.o raid5-ppl.o 32 + linear-y += md-linear.o 32 33 33 34 # Note: link order is important. All raid personalities 34 35 # and must come before md.o, as they each initialise 35 36 # themselves, and md.o may use the personalities when it 36 37 # auto-initialised. 37 38 39 + obj-$(CONFIG_MD_LINEAR) += linear.o 38 40 obj-$(CONFIG_MD_RAID0) += raid0.o 39 41 obj-$(CONFIG_MD_RAID1) += raid1.o 40 42 obj-$(CONFIG_MD_RAID10) += raid10.o
+1 -1
drivers/md/bcache/movinggc.c
··· 82 82 bio_init(bio, NULL, bio->bi_inline_vecs, 83 83 DIV_ROUND_UP(KEY_SIZE(&io->w->key), PAGE_SECTORS), 0); 84 84 bio_get(bio); 85 - bio_set_prio(bio, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0)); 85 + bio->bi_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0); 86 86 87 87 bio->bi_iter.bi_size = KEY_SIZE(&io->w->key) << 9; 88 88 bio->bi_private = &io->cl;
+1 -1
drivers/md/bcache/writeback.c
··· 334 334 bio_init(bio, NULL, bio->bi_inline_vecs, 335 335 DIV_ROUND_UP(KEY_SIZE(&w->key), PAGE_SECTORS), 0); 336 336 if (!io->dc->writeback_percent) 337 - bio_set_prio(bio, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0)); 337 + bio->bi_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0); 338 338 339 339 bio->bi_iter.bi_size = KEY_SIZE(&w->key) << 9; 340 340 bio->bi_private = w;
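Both bcache hunks now store into `bio->bi_ioprio` directly, since the removed `bio_set_prio()` wrapper did nothing more than that assignment. The value itself packs the scheduling class into the top bits of a 16-bit priority; a userspace re-derivation of that encoding (constants copied from the ioprio UAPI layout, helper names invented for this sketch):

```c
#include <assert.h>

/* layout as in include/uapi/linux/ioprio.h: class lives in bits 15..13 */
#define IOPRIO_CLASS_SHIFT 13
#define IOPRIO_CLASS_BE    2
#define IOPRIO_CLASS_IDLE  3

/* equivalent of IOPRIO_PRIO_VALUE(cls, data) */
static unsigned short ioprio_value(unsigned int cls, unsigned int data)
{
    return (unsigned short)((cls << IOPRIO_CLASS_SHIFT) | data);
}

/* equivalent of IOPRIO_PRIO_CLASS(prio) */
static unsigned int ioprio_class(unsigned short prio)
{
    return prio >> IOPRIO_CLASS_SHIFT;
}
```

So the `IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0)` the moving-GC and writeback paths assign is simply `3 << 13`, marking those internal bios as idle-class work.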
+1 -1
drivers/md/dm-rq.c
··· 547 547 md->tag_set->ops = &dm_mq_ops; 548 548 md->tag_set->queue_depth = dm_get_blk_mq_queue_depth(); 549 549 md->tag_set->numa_node = md->numa_node_id; 550 - md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_STACKING; 550 + md->tag_set->flags = BLK_MQ_F_STACKING; 551 551 md->tag_set->nr_hw_queues = dm_get_blk_mq_nr_hw_queues(); 552 552 md->tag_set->driver_data = md; 553 553
+3 -3
drivers/md/dm-verity-fec.c
··· 122 122 struct bio *bio = dm_bio_from_per_bio_data(io, v->ti->per_io_data_size); 123 123 124 124 par = fec_read_parity(v, rsb, block_offset, &offset, 125 - par_buf_offset, &buf, bio_prio(bio)); 125 + par_buf_offset, &buf, bio->bi_ioprio); 126 126 if (IS_ERR(par)) 127 127 return PTR_ERR(par); 128 128 ··· 164 164 dm_bufio_release(buf); 165 165 166 166 par = fec_read_parity(v, rsb, block_offset, &offset, 167 - par_buf_offset, &buf, bio_prio(bio)); 167 + par_buf_offset, &buf, bio->bi_ioprio); 168 168 if (IS_ERR(par)) 169 169 return PTR_ERR(par); 170 170 } ··· 254 254 bufio = v->bufio; 255 255 } 256 256 257 - bbuf = dm_bufio_read_with_ioprio(bufio, block, &buf, bio_prio(bio)); 257 + bbuf = dm_bufio_read_with_ioprio(bufio, block, &buf, bio->bi_ioprio); 258 258 if (IS_ERR(bbuf)) { 259 259 DMWARN_LIMIT("%s: FEC %llu: read failed (%llu): %ld", 260 260 v->data_dev->name,
+2 -2
drivers/md/dm-verity-target.c
··· 321 321 } 322 322 } else { 323 323 data = dm_bufio_read_with_ioprio(v->bufio, hash_block, 324 - &buf, bio_prio(bio)); 324 + &buf, bio->bi_ioprio); 325 325 } 326 326 327 327 if (IS_ERR(data)) ··· 789 789 790 790 verity_fec_init_io(io); 791 791 792 - verity_submit_prefetch(v, io, bio_prio(bio)); 792 + verity_submit_prefetch(v, io, bio->bi_ioprio); 793 793 794 794 submit_bio_noacct(bio); 795 795
+6 -2
drivers/md/md-autodetect.c
··· 49 49 * instead of just one. -- KTK 50 50 * 18May2000: Added support for persistent-superblock arrays: 51 51 * md=n,0,factor,fault,device-list uses RAID0 for device n 52 + * md=n,-1,factor,fault,device-list uses LINEAR for device n 52 53 * md=n,device-list reads a RAID superblock from the devices 53 54 * elements in device-list are read by name_to_kdev_t so can be 54 55 * a hex number or something like /dev/hda1 /dev/sdb ··· 88 87 md_setup_ents++; 89 88 switch (get_option(&str, &level)) { /* RAID level */ 90 89 case 2: /* could be 0 or -1.. */ 91 - if (level == 0) { 90 + if (level == 0 || level == LEVEL_LINEAR) { 92 91 if (get_option(&str, &factor) != 2 || /* Chunk Size */ 93 92 get_option(&str, &fault) != 2) { 94 93 printk(KERN_WARNING "md: Too few arguments supplied to md=.\n"); ··· 96 95 } 97 96 md_setup_args[ent].level = level; 98 97 md_setup_args[ent].chunk = 1 << (factor+12); 99 - pername = "raid0"; 98 + if (level == LEVEL_LINEAR) 99 + pername = "linear"; 100 + else 101 + pername = "raid0"; 100 102 break; 101 103 } 102 104 fallthrough;
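The md-autodetect change above teaches the `md=` boot-argument parser that a level of `-1` (`LEVEL_LINEAR`) selects the reintroduced "linear" personality alongside RAID0, while the chunk size keeps its existing decode of `1 << (factor + 12)`, i.e. `factor` picks a power-of-two multiple of 4 KiB. A sketch of that decode (`LEVEL_LINEAR` matches md's value; the helper names are invented):

```c
#include <assert.h>
#include <string.h>

#define LEVEL_LINEAR (-1)   /* md's level value for the linear personality */

/* md=n,level,factor,...: chunk size in bytes is 4 KiB << factor */
static unsigned long md_chunk_bytes(int factor)
{
    return 1UL << (factor + 12);
}

/* personality name the setup code would request for a given level */
static const char *md_pername(int level)
{
    return level == LEVEL_LINEAR ? "linear" : "raid0";
}
```

A `factor` of 0 therefore means 4 KiB chunks and each increment doubles that, matching the `1 << (factor+12)` expression visible in the hunk.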
+66 -50
drivers/md/md-bitmap.c
··· 682 682 return; 683 683 if (!bitmap->storage.sb_page) /* no superblock */ 684 684 return; 685 - sb = kmap_atomic(bitmap->storage.sb_page); 685 + sb = kmap_local_page(bitmap->storage.sb_page); 686 686 sb->events = cpu_to_le64(bitmap->mddev->events); 687 687 if (bitmap->mddev->events < bitmap->events_cleared) 688 688 /* rocking back to read-only */ ··· 702 702 sb->nodes = cpu_to_le32(bitmap->mddev->bitmap_info.nodes); 703 703 sb->sectors_reserved = cpu_to_le32(bitmap->mddev-> 704 704 bitmap_info.space); 705 - kunmap_atomic(sb); 705 + kunmap_local(sb); 706 706 707 707 if (bitmap->storage.file) 708 708 write_file_page(bitmap, bitmap->storage.sb_page, 1); ··· 717 717 718 718 if (!bitmap || !bitmap->storage.sb_page) 719 719 return; 720 - sb = kmap_atomic(bitmap->storage.sb_page); 720 + sb = kmap_local_page(bitmap->storage.sb_page); 721 721 pr_debug("%s: bitmap file superblock:\n", bmname(bitmap)); 722 722 pr_debug(" magic: %08x\n", le32_to_cpu(sb->magic)); 723 723 pr_debug(" version: %u\n", le32_to_cpu(sb->version)); ··· 736 736 pr_debug(" sync size: %llu KB\n", 737 737 (unsigned long long)le64_to_cpu(sb->sync_size)/2); 738 738 pr_debug("max write behind: %u\n", le32_to_cpu(sb->write_behind)); 739 - kunmap_atomic(sb); 739 + kunmap_local(sb); 740 740 } 741 741 742 742 /* ··· 760 760 return -ENOMEM; 761 761 bitmap->storage.sb_index = 0; 762 762 763 - sb = kmap_atomic(bitmap->storage.sb_page); 763 + sb = kmap_local_page(bitmap->storage.sb_page); 764 764 765 765 sb->magic = cpu_to_le32(BITMAP_MAGIC); 766 766 sb->version = cpu_to_le32(BITMAP_MAJOR_HI); ··· 768 768 chunksize = bitmap->mddev->bitmap_info.chunksize; 769 769 BUG_ON(!chunksize); 770 770 if (!is_power_of_2(chunksize)) { 771 - kunmap_atomic(sb); 771 + kunmap_local(sb); 772 772 pr_warn("bitmap chunksize not a power of 2\n"); 773 773 return -EINVAL; 774 774 } ··· 803 803 sb->events_cleared = cpu_to_le64(bitmap->mddev->events); 804 804 bitmap->mddev->bitmap_info.nodes = 0; 805 805 806 - kunmap_atomic(sb); 806 + 
kunmap_local(sb); 807 807 808 808 return 0; 809 809 } ··· 865 865 return err; 866 866 867 867 err = -EINVAL; 868 - sb = kmap_atomic(sb_page); 868 + sb = kmap_local_page(sb_page); 869 869 870 870 chunksize = le32_to_cpu(sb->chunksize); 871 871 daemon_sleep = le32_to_cpu(sb->daemon_sleep) * HZ; ··· 932 932 err = 0; 933 933 934 934 out: 935 - kunmap_atomic(sb); 935 + kunmap_local(sb); 936 936 if (err == 0 && nodes && (bitmap->cluster_slot < 0)) { 937 937 /* Assigning chunksize is required for "re_read" */ 938 938 bitmap->mddev->bitmap_info.chunksize = chunksize; ··· 1161 1161 bit = file_page_offset(&bitmap->storage, chunk); 1162 1162 1163 1163 /* set the bit */ 1164 - kaddr = kmap_atomic(page); 1164 + kaddr = kmap_local_page(page); 1165 1165 if (test_bit(BITMAP_HOSTENDIAN, &bitmap->flags)) 1166 1166 set_bit(bit, kaddr); 1167 1167 else 1168 1168 set_bit_le(bit, kaddr); 1169 - kunmap_atomic(kaddr); 1169 + kunmap_local(kaddr); 1170 1170 pr_debug("set file bit %lu page %lu\n", bit, index); 1171 1171 /* record page number so it gets flushed to disk when unplug occurs */ 1172 1172 set_page_attr(bitmap, index - node_offset, BITMAP_PAGE_DIRTY); ··· 1190 1190 if (!page) 1191 1191 return; 1192 1192 bit = file_page_offset(&bitmap->storage, chunk); 1193 - paddr = kmap_atomic(page); 1193 + paddr = kmap_local_page(page); 1194 1194 if (test_bit(BITMAP_HOSTENDIAN, &bitmap->flags)) 1195 1195 clear_bit(bit, paddr); 1196 1196 else 1197 1197 clear_bit_le(bit, paddr); 1198 - kunmap_atomic(paddr); 1198 + kunmap_local(paddr); 1199 1199 if (!test_page_attr(bitmap, index - node_offset, BITMAP_PAGE_NEEDWRITE)) { 1200 1200 set_page_attr(bitmap, index - node_offset, BITMAP_PAGE_PENDING); 1201 1201 bitmap->allclean = 0; ··· 1214 1214 if (!page) 1215 1215 return -EINVAL; 1216 1216 bit = file_page_offset(&bitmap->storage, chunk); 1217 - paddr = kmap_atomic(page); 1217 + paddr = kmap_local_page(page); 1218 1218 if (test_bit(BITMAP_HOSTENDIAN, &bitmap->flags)) 1219 1219 set = test_bit(bit, paddr); 
1220 1220 else 1221 1221 set = test_bit_le(bit, paddr); 1222 - kunmap_atomic(paddr); 1222 + kunmap_local(paddr); 1223 1223 return set; 1224 1224 } 1225 1225 ··· 1388 1388 * If the bitmap is out of date, dirty the whole page 1389 1389 * and write it out 1390 1390 */ 1391 - paddr = kmap_atomic(page); 1391 + paddr = kmap_local_page(page); 1392 1392 memset(paddr + offset, 0xff, PAGE_SIZE - offset); 1393 - kunmap_atomic(paddr); 1393 + kunmap_local(paddr); 1394 1394 1395 1395 filemap_write_page(bitmap, i, true); 1396 1396 if (test_bit(BITMAP_WRITE_ERROR, &bitmap->flags)) { ··· 1406 1406 void *paddr; 1407 1407 bool was_set; 1408 1408 1409 - paddr = kmap_atomic(page); 1409 + paddr = kmap_local_page(page); 1410 1410 if (test_bit(BITMAP_HOSTENDIAN, &bitmap->flags)) 1411 1411 was_set = test_bit(bit, paddr); 1412 1412 else 1413 1413 was_set = test_bit_le(bit, paddr); 1414 - kunmap_atomic(paddr); 1414 + kunmap_local(paddr); 1415 1415 1416 1416 if (was_set) { 1417 1417 /* if the disk bit is set, set the memory bit */ ··· 1546 1546 bitmap_super_t *sb; 1547 1547 bitmap->need_sync = 0; 1548 1548 if (bitmap->storage.filemap) { 1549 - sb = kmap_atomic(bitmap->storage.sb_page); 1549 + sb = kmap_local_page(bitmap->storage.sb_page); 1550 1550 sb->events_cleared = 1551 1551 cpu_to_le64(bitmap->events_cleared); 1552 - kunmap_atomic(sb); 1552 + kunmap_local(sb); 1553 1553 set_page_attr(bitmap, 0, 1554 1554 BITMAP_PAGE_NEEDWRITE); 1555 1555 } ··· 1671 1671 } 1672 1672 1673 1673 static int bitmap_startwrite(struct mddev *mddev, sector_t offset, 1674 - unsigned long sectors, bool behind) 1674 + unsigned long sectors) 1675 1675 { 1676 1676 struct bitmap *bitmap = mddev->bitmap; 1677 1677 1678 1678 if (!bitmap) 1679 1679 return 0; 1680 - 1681 - if (behind) { 1682 - int bw; 1683 - atomic_inc(&bitmap->behind_writes); 1684 - bw = atomic_read(&bitmap->behind_writes); 1685 - if (bw > bitmap->behind_writes_used) 1686 - bitmap->behind_writes_used = bw; 1687 - 1688 - pr_debug("inc write-behind count 
%d/%lu\n", 1689 - bw, bitmap->mddev->bitmap_info.max_write_behind); 1690 - } 1691 1680 1692 1681 while (sectors) { 1693 1682 sector_t blocks; ··· 1726 1737 } 1727 1738 1728 1739 static void bitmap_endwrite(struct mddev *mddev, sector_t offset, 1729 - unsigned long sectors, bool success, bool behind) 1740 + unsigned long sectors) 1730 1741 { 1731 1742 struct bitmap *bitmap = mddev->bitmap; 1732 1743 1733 1744 if (!bitmap) 1734 1745 return; 1735 - 1736 - if (behind) { 1737 - if (atomic_dec_and_test(&bitmap->behind_writes)) 1738 - wake_up(&bitmap->behind_wait); 1739 - pr_debug("dec write-behind count %d/%lu\n", 1740 - atomic_read(&bitmap->behind_writes), 1741 - bitmap->mddev->bitmap_info.max_write_behind); 1742 - } 1743 1746 1744 1747 while (sectors) { 1745 1748 sector_t blocks; ··· 1745 1764 return; 1746 1765 } 1747 1766 1748 - if (success && !bitmap->mddev->degraded && 1749 - bitmap->events_cleared < bitmap->mddev->events) { 1750 - bitmap->events_cleared = bitmap->mddev->events; 1751 - bitmap->need_sync = 1; 1752 - sysfs_notify_dirent_safe(bitmap->sysfs_can_clear); 1753 - } 1754 - 1755 - if (!success && !NEEDED(*bmc)) 1767 + if (!bitmap->mddev->degraded) { 1768 + if (bitmap->events_cleared < bitmap->mddev->events) { 1769 + bitmap->events_cleared = bitmap->mddev->events; 1770 + bitmap->need_sync = 1; 1771 + sysfs_notify_dirent_safe( 1772 + bitmap->sysfs_can_clear); 1773 + } 1774 + } else if (!NEEDED(*bmc)) { 1756 1775 *bmc |= NEEDED_MASK; 1776 + } 1757 1777 1758 1778 if (COUNTER(*bmc) == COUNTER_MAX) 1759 1779 wake_up(&bitmap->overflow_wait); ··· 2042 2060 kfree(bp[k].map); 2043 2061 kfree(bp); 2044 2062 kfree(bitmap); 2063 + } 2064 + 2065 + static void bitmap_start_behind_write(struct mddev *mddev) 2066 + { 2067 + struct bitmap *bitmap = mddev->bitmap; 2068 + int bw; 2069 + 2070 + if (!bitmap) 2071 + return; 2072 + 2073 + atomic_inc(&bitmap->behind_writes); 2074 + bw = atomic_read(&bitmap->behind_writes); 2075 + if (bw > bitmap->behind_writes_used) 2076 + 
bitmap->behind_writes_used = bw; 2077 + 2078 + pr_debug("inc write-behind count %d/%lu\n", 2079 + bw, bitmap->mddev->bitmap_info.max_write_behind); 2080 + } 2081 + 2082 + static void bitmap_end_behind_write(struct mddev *mddev) 2083 + { 2084 + struct bitmap *bitmap = mddev->bitmap; 2085 + 2086 + if (!bitmap) 2087 + return; 2088 + 2089 + if (atomic_dec_and_test(&bitmap->behind_writes)) 2090 + wake_up(&bitmap->behind_wait); 2091 + pr_debug("dec write-behind count %d/%lu\n", 2092 + atomic_read(&bitmap->behind_writes), 2093 + bitmap->mddev->bitmap_info.max_write_behind); 2045 2094 } 2046 2095 2047 2096 static void bitmap_wait_behind_writes(struct mddev *mddev) ··· 2994 2981 .dirty_bits = bitmap_dirty_bits, 2995 2982 .unplug = bitmap_unplug, 2996 2983 .daemon_work = bitmap_daemon_work, 2984 + 2985 + .start_behind_write = bitmap_start_behind_write, 2986 + .end_behind_write = bitmap_end_behind_write, 2997 2987 .wait_behind_writes = bitmap_wait_behind_writes, 2998 2988 2999 2989 .startwrite = bitmap_startwrite,
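The md-bitmap refactor above moves the write-behind accounting out of `bitmap_startwrite()`/`bitmap_endwrite()` into dedicated `start_behind_write`/`end_behind_write` callbacks. The accounting itself is an in-flight counter with a high-water mark, plus a wakeup when the last write drains; a single-threaded model of that bookkeeping (no atomics or wait queues, purely illustrative):

```c
#include <assert.h>

/* minimal model of md's write-behind accounting */
struct behind_counter {
    int inflight;   /* like bitmap->behind_writes */
    int used;       /* high-water mark, like behind_writes_used */
};

static void start_behind_write(struct behind_counter *c)
{
    c->inflight++;
    if (c->inflight > c->used)
        c->used = c->inflight;
}

/* returns 1 when the last in-flight write completed -- the point
 * where the kernel would wake waiters on behind_wait */
static int end_behind_write(struct behind_counter *c)
{
    return --c->inflight == 0;
}

/* run n starts then n ends, report the recorded high-water mark */
static int run_bursts(int n)
{
    struct behind_counter c = {0, 0};
    int i;

    for (i = 0; i < n; i++)
        start_behind_write(&c);
    for (i = 0; i < n; i++)
        end_behind_write(&c);
    return c.used;
}

/* two starts, two ends: only the final end signals a wakeup */
static int wake_only_after_last(void)
{
    struct behind_counter c = {0, 0};
    int first, second;

    start_behind_write(&c);
    start_behind_write(&c);
    first = end_behind_write(&c);
    second = end_behind_write(&c);
    return !first && second;
}
```

Splitting this out lets raid1 call it per-bio without threading `behind` flags through every `startwrite`/`endwrite` call, which is exactly what the removed boolean parameters in the diff were for.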
+5 -2
drivers/md/md-bitmap.h
···
 84  84 	unsigned long e);
 85  85 	void (*unplug)(struct mddev *mddev, bool sync);
 86  86 	void (*daemon_work)(struct mddev *mddev);
 87 +
 88 +	void (*start_behind_write)(struct mddev *mddev);
 89 +	void (*end_behind_write)(struct mddev *mddev);
 87  90 	void (*wait_behind_writes)(struct mddev *mddev);
 88  91
 89  92 	int (*startwrite)(struct mddev *mddev, sector_t offset,
 90 -		unsigned long sectors, bool behind);
 93 +		unsigned long sectors);
 91  94 	void (*endwrite)(struct mddev *mddev, sector_t offset,
 92 -		unsigned long sectors, bool success, bool behind);
 95 +		unsigned long sectors);
 93  96 	bool (*start_sync)(struct mddev *mddev, sector_t offset,
 94  97 		sector_t *blocks, bool degraded);
 95  98 	void (*end_sync)(struct mddev *mddev, sector_t offset, sector_t *blocks);

+354
drivers/md/md-linear.c
··· 1 + // SPDX-License-Identifier: GPL-2.0-or-later 2 + /* 3 + * linear.c : Multiple Devices driver for Linux Copyright (C) 1994-96 Marc 4 + * ZYNGIER <zyngier@ufr-info-p7.ibp.fr> or <maz@gloups.fdn.fr> 5 + */ 6 + 7 + #include <linux/blkdev.h> 8 + #include <linux/raid/md_u.h> 9 + #include <linux/seq_file.h> 10 + #include <linux/module.h> 11 + #include <linux/slab.h> 12 + #include <trace/events/block.h> 13 + #include "md.h" 14 + 15 + struct dev_info { 16 + struct md_rdev *rdev; 17 + sector_t end_sector; 18 + }; 19 + 20 + struct linear_conf { 21 + struct rcu_head rcu; 22 + sector_t array_sectors; 23 + /* a copy of mddev->raid_disks */ 24 + int raid_disks; 25 + struct dev_info disks[] __counted_by(raid_disks); 26 + }; 27 + 28 + /* 29 + * find which device holds a particular offset 30 + */ 31 + static inline struct dev_info *which_dev(struct mddev *mddev, sector_t sector) 32 + { 33 + int lo, mid, hi; 34 + struct linear_conf *conf; 35 + 36 + lo = 0; 37 + hi = mddev->raid_disks - 1; 38 + conf = mddev->private; 39 + 40 + /* 41 + * Binary Search 42 + */ 43 + 44 + while (hi > lo) { 45 + 46 + mid = (hi + lo) / 2; 47 + if (sector < conf->disks[mid].end_sector) 48 + hi = mid; 49 + else 50 + lo = mid + 1; 51 + } 52 + 53 + return conf->disks + lo; 54 + } 55 + 56 + static sector_t linear_size(struct mddev *mddev, sector_t sectors, int raid_disks) 57 + { 58 + struct linear_conf *conf; 59 + sector_t array_sectors; 60 + 61 + conf = mddev->private; 62 + WARN_ONCE(sectors || raid_disks, 63 + "%s does not support generic reshape\n", __func__); 64 + array_sectors = conf->array_sectors; 65 + 66 + return array_sectors; 67 + } 68 + 69 + static int linear_set_limits(struct mddev *mddev) 70 + { 71 + struct queue_limits lim; 72 + int err; 73 + 74 + md_init_stacking_limits(&lim); 75 + lim.max_hw_sectors = mddev->chunk_sectors; 76 + lim.max_write_zeroes_sectors = mddev->chunk_sectors; 77 + lim.io_min = mddev->chunk_sectors << 9; 78 + err = mddev_stack_rdev_limits(mddev, &lim, 
MDDEV_STACK_INTEGRITY); 79 + if (err) { 80 + queue_limits_cancel_update(mddev->gendisk->queue); 81 + return err; 82 + } 83 + 84 + return queue_limits_set(mddev->gendisk->queue, &lim); 85 + } 86 + 87 + static struct linear_conf *linear_conf(struct mddev *mddev, int raid_disks) 88 + { 89 + struct linear_conf *conf; 90 + struct md_rdev *rdev; 91 + int ret = -EINVAL; 92 + int cnt; 93 + int i; 94 + 95 + conf = kzalloc(struct_size(conf, disks, raid_disks), GFP_KERNEL); 96 + if (!conf) 97 + return ERR_PTR(-ENOMEM); 98 + 99 + /* 100 + * conf->raid_disks is copy of mddev->raid_disks. The reason to 101 + * keep a copy of mddev->raid_disks in struct linear_conf is, 102 + * mddev->raid_disks may not be consistent with pointers number of 103 + * conf->disks[] when it is updated in linear_add() and used to 104 + * iterate old conf->disks[] earray in linear_congested(). 105 + * Here conf->raid_disks is always consitent with number of 106 + * pointers in conf->disks[] array, and mddev->private is updated 107 + * with rcu_assign_pointer() in linear_addr(), such race can be 108 + * avoided. 109 + */ 110 + conf->raid_disks = raid_disks; 111 + 112 + cnt = 0; 113 + conf->array_sectors = 0; 114 + 115 + rdev_for_each(rdev, mddev) { 116 + int j = rdev->raid_disk; 117 + struct dev_info *disk = conf->disks + j; 118 + sector_t sectors; 119 + 120 + if (j < 0 || j >= raid_disks || disk->rdev) { 121 + pr_warn("md/linear:%s: disk numbering problem. Aborting!\n", 122 + mdname(mddev)); 123 + goto out; 124 + } 125 + 126 + disk->rdev = rdev; 127 + if (mddev->chunk_sectors) { 128 + sectors = rdev->sectors; 129 + sector_div(sectors, mddev->chunk_sectors); 130 + rdev->sectors = sectors * mddev->chunk_sectors; 131 + } 132 + 133 + conf->array_sectors += rdev->sectors; 134 + cnt++; 135 + } 136 + if (cnt != raid_disks) { 137 + pr_warn("md/linear:%s: not enough drives present. Aborting!\n", 138 + mdname(mddev)); 139 + goto out; 140 + } 141 + 142 + /* 143 + * Here we calculate the device offsets. 
144 + */ 145 + conf->disks[0].end_sector = conf->disks[0].rdev->sectors; 146 + 147 + for (i = 1; i < raid_disks; i++) 148 + conf->disks[i].end_sector = 149 + conf->disks[i-1].end_sector + 150 + conf->disks[i].rdev->sectors; 151 + 152 + if (!mddev_is_dm(mddev)) { 153 + ret = linear_set_limits(mddev); 154 + if (ret) 155 + goto out; 156 + } 157 + 158 + return conf; 159 + 160 + out: 161 + kfree(conf); 162 + return ERR_PTR(ret); 163 + } 164 + 165 + static int linear_run(struct mddev *mddev) 166 + { 167 + struct linear_conf *conf; 168 + int ret; 169 + 170 + if (md_check_no_bitmap(mddev)) 171 + return -EINVAL; 172 + 173 + conf = linear_conf(mddev, mddev->raid_disks); 174 + if (IS_ERR(conf)) 175 + return PTR_ERR(conf); 176 + 177 + mddev->private = conf; 178 + md_set_array_sectors(mddev, linear_size(mddev, 0, 0)); 179 + 180 + ret = md_integrity_register(mddev); 181 + if (ret) { 182 + kfree(conf); 183 + mddev->private = NULL; 184 + } 185 + return ret; 186 + } 187 + 188 + static int linear_add(struct mddev *mddev, struct md_rdev *rdev) 189 + { 190 + /* Adding a drive to a linear array allows the array to grow. 191 + * It is permitted if the new drive has a matching superblock 192 + * already on it, with raid_disk equal to raid_disks. 193 + * It is achieved by creating a new linear_private_data structure 194 + * and swapping it in in-place of the current one. 195 + * The current one is never freed until the array is stopped. 196 + * This avoids races. 197 + */ 198 + struct linear_conf *newconf, *oldconf; 199 + 200 + if (rdev->saved_raid_disk != mddev->raid_disks) 201 + return -EINVAL; 202 + 203 + rdev->raid_disk = rdev->saved_raid_disk; 204 + rdev->saved_raid_disk = -1; 205 + 206 + newconf = linear_conf(mddev, mddev->raid_disks + 1); 207 + if (IS_ERR(newconf)) 208 + return PTR_ERR(newconf); 209 + 210 + /* newconf->raid_disks already keeps a copy of * the increased 211 + * value of mddev->raid_disks, WARN_ONCE() is just used to make 212 + * sure of this. 
It is possible that oldconf is still referenced 213 + * in linear_congested(), therefore kfree_rcu() is used to free 214 + * oldconf until no one uses it anymore. 215 + */ 216 + oldconf = rcu_dereference_protected(mddev->private, 217 + lockdep_is_held(&mddev->reconfig_mutex)); 218 + mddev->raid_disks++; 219 + WARN_ONCE(mddev->raid_disks != newconf->raid_disks, 220 + "copied raid_disks doesn't match mddev->raid_disks"); 221 + rcu_assign_pointer(mddev->private, newconf); 222 + md_set_array_sectors(mddev, linear_size(mddev, 0, 0)); 223 + set_capacity_and_notify(mddev->gendisk, mddev->array_sectors); 224 + kfree_rcu(oldconf, rcu); 225 + return 0; 226 + } 227 + 228 + static void linear_free(struct mddev *mddev, void *priv) 229 + { 230 + struct linear_conf *conf = priv; 231 + 232 + kfree(conf); 233 + } 234 + 235 + static bool linear_make_request(struct mddev *mddev, struct bio *bio) 236 + { 237 + struct dev_info *tmp_dev; 238 + sector_t start_sector, end_sector, data_offset; 239 + sector_t bio_sector = bio->bi_iter.bi_sector; 240 + 241 + if (unlikely(bio->bi_opf & REQ_PREFLUSH) 242 + && md_flush_request(mddev, bio)) 243 + return true; 244 + 245 + tmp_dev = which_dev(mddev, bio_sector); 246 + start_sector = tmp_dev->end_sector - tmp_dev->rdev->sectors; 247 + end_sector = tmp_dev->end_sector; 248 + data_offset = tmp_dev->rdev->data_offset; 249 + 250 + if (unlikely(bio_sector >= end_sector || 251 + bio_sector < start_sector)) 252 + goto out_of_bounds; 253 + 254 + if (unlikely(is_rdev_broken(tmp_dev->rdev))) { 255 + md_error(mddev, tmp_dev->rdev); 256 + bio_io_error(bio); 257 + return true; 258 + } 259 + 260 + if (unlikely(bio_end_sector(bio) > end_sector)) { 261 + /* This bio crosses a device boundary, so we have to split it */ 262 + struct bio *split = bio_split(bio, end_sector - bio_sector, 263 + GFP_NOIO, &mddev->bio_set); 264 + 265 + if (IS_ERR(split)) { 266 + bio->bi_status = errno_to_blk_status(PTR_ERR(split)); 267 + bio_endio(bio); 268 + return true; 269 + } 270 + 
271 + bio_chain(split, bio); 272 + submit_bio_noacct(bio); 273 + bio = split; 274 + } 275 + 276 + md_account_bio(mddev, &bio); 277 + bio_set_dev(bio, tmp_dev->rdev->bdev); 278 + bio->bi_iter.bi_sector = bio->bi_iter.bi_sector - 279 + start_sector + data_offset; 280 + 281 + if (unlikely((bio_op(bio) == REQ_OP_DISCARD) && 282 + !bdev_max_discard_sectors(bio->bi_bdev))) { 283 + /* Just ignore it */ 284 + bio_endio(bio); 285 + } else { 286 + if (mddev->gendisk) 287 + trace_block_bio_remap(bio, disk_devt(mddev->gendisk), 288 + bio_sector); 289 + mddev_check_write_zeroes(mddev, bio); 290 + submit_bio_noacct(bio); 291 + } 292 + return true; 293 + 294 + out_of_bounds: 295 + pr_err("md/linear:%s: make_request: Sector %llu out of bounds on dev %pg: %llu sectors, offset %llu\n", 296 + mdname(mddev), 297 + (unsigned long long)bio->bi_iter.bi_sector, 298 + tmp_dev->rdev->bdev, 299 + (unsigned long long)tmp_dev->rdev->sectors, 300 + (unsigned long long)start_sector); 301 + bio_io_error(bio); 302 + return true; 303 + } 304 + 305 + static void linear_status(struct seq_file *seq, struct mddev *mddev) 306 + { 307 + seq_printf(seq, " %dk rounding", mddev->chunk_sectors / 2); 308 + } 309 + 310 + static void linear_error(struct mddev *mddev, struct md_rdev *rdev) 311 + { 312 + if (!test_and_set_bit(MD_BROKEN, &mddev->flags)) { 313 + char *md_name = mdname(mddev); 314 + 315 + pr_crit("md/linear%s: Disk failure on %pg detected, failing array.\n", 316 + md_name, rdev->bdev); 317 + } 318 + } 319 + 320 + static void linear_quiesce(struct mddev *mddev, int state) 321 + { 322 + } 323 + 324 + static struct md_personality linear_personality = { 325 + .name = "linear", 326 + .level = LEVEL_LINEAR, 327 + .owner = THIS_MODULE, 328 + .make_request = linear_make_request, 329 + .run = linear_run, 330 + .free = linear_free, 331 + .status = linear_status, 332 + .hot_add_disk = linear_add, 333 + .size = linear_size, 334 + .quiesce = linear_quiesce, 335 + .error_handler = linear_error, 336 + }; 337 + 338 
+ static int __init linear_init(void) 339 + { 340 + return register_md_personality(&linear_personality); 341 + } 342 + 343 + static void linear_exit(void) 344 + { 345 + unregister_md_personality(&linear_personality); 346 + } 347 + 348 + module_init(linear_init); 349 + module_exit(linear_exit); 350 + MODULE_LICENSE("GPL"); 351 + MODULE_DESCRIPTION("Linear device concatenation personality for MD (deprecated)"); 352 + MODULE_ALIAS("md-personality-1"); /* LINEAR - deprecated*/ 353 + MODULE_ALIAS("md-linear"); 354 + MODULE_ALIAS("md-level--1");
+30 -1
drivers/md/md.c
··· 8124 8124 return; 8125 8125 mddev->pers->error_handler(mddev, rdev); 8126 8126 8127 - if (mddev->pers->level == 0) 8127 + if (mddev->pers->level == 0 || mddev->pers->level == LEVEL_LINEAR) 8128 8128 return; 8129 8129 8130 8130 if (mddev->degraded && !test_bit(MD_BROKEN, &mddev->flags)) ··· 8745 8745 } 8746 8746 EXPORT_SYMBOL_GPL(md_submit_discard_bio); 8747 8747 8748 + static void md_bitmap_start(struct mddev *mddev, 8749 + struct md_io_clone *md_io_clone) 8750 + { 8751 + if (mddev->pers->bitmap_sector) 8752 + mddev->pers->bitmap_sector(mddev, &md_io_clone->offset, 8753 + &md_io_clone->sectors); 8754 + 8755 + mddev->bitmap_ops->startwrite(mddev, md_io_clone->offset, 8756 + md_io_clone->sectors); 8757 + } 8758 + 8759 + static void md_bitmap_end(struct mddev *mddev, struct md_io_clone *md_io_clone) 8760 + { 8761 + mddev->bitmap_ops->endwrite(mddev, md_io_clone->offset, 8762 + md_io_clone->sectors); 8763 + } 8764 + 8748 8765 static void md_end_clone_io(struct bio *bio) 8749 8766 { 8750 8767 struct md_io_clone *md_io_clone = bio->bi_private; 8751 8768 struct bio *orig_bio = md_io_clone->orig_bio; 8752 8769 struct mddev *mddev = md_io_clone->mddev; 8770 + 8771 + if (bio_data_dir(orig_bio) == WRITE && mddev->bitmap) 8772 + md_bitmap_end(mddev, md_io_clone); 8753 8773 8754 8774 if (bio->bi_status && !orig_bio->bi_status) 8755 8775 orig_bio->bi_status = bio->bi_status; ··· 8795 8775 if (blk_queue_io_stat(bdev->bd_disk->queue)) 8796 8776 md_io_clone->start_time = bio_start_io_acct(*bio); 8797 8777 8778 + if (bio_data_dir(*bio) == WRITE && mddev->bitmap) { 8779 + md_io_clone->offset = (*bio)->bi_iter.bi_sector; 8780 + md_io_clone->sectors = bio_sectors(*bio); 8781 + md_bitmap_start(mddev, md_io_clone); 8782 + } 8783 + 8798 8784 clone->bi_end_io = md_end_clone_io; 8799 8785 clone->bi_private = md_io_clone; 8800 8786 *bio = clone; ··· 8818 8792 struct md_io_clone *md_io_clone = bio->bi_private; 8819 8793 struct bio *orig_bio = md_io_clone->orig_bio; 8820 8794 struct mddev 
*mddev = md_io_clone->mddev; 8795 + 8796 + if (bio_data_dir(orig_bio) == WRITE && mddev->bitmap) 8797 + md_bitmap_end(mddev, md_io_clone); 8821 8798 8822 8799 if (bio->bi_status && !orig_bio->bi_status) 8823 8800 orig_bio->bi_status = bio->bi_status;
+5
drivers/md/md.h
···
746 746 	void *(*takeover) (struct mddev *mddev);
747 747 	/* Changes the consistency policy of an active array. */
748 748 	int (*change_consistency_policy)(struct mddev *mddev, const char *buf);
749 +	/* convert io ranges from array to bitmap */
750 +	void (*bitmap_sector)(struct mddev *mddev, sector_t *offset,
751 +			unsigned long *sectors);
749 752 	};
750 753
751 754 	struct md_sysfs_entry {
···
831 828 	struct mddev *mddev;
832 829 	struct bio *orig_bio;
833 830 	unsigned long start_time;
831 +	sector_t offset;
832 +	unsigned long sectors;
834 833 	struct bio bio_clone;
835 834 	};
836 835
+1 -1
drivers/md/raid0.c
···
384 384 	lim.max_write_zeroes_sectors = mddev->chunk_sectors;
385 385 	lim.io_min = mddev->chunk_sectors << 9;
386 386 	lim.io_opt = lim.io_min * mddev->raid_disks;
387 -	lim.features |= BLK_FEAT_ATOMIC_WRITES_STACKED;
387 +	lim.features |= BLK_FEAT_ATOMIC_WRITES;
388 388 	err = mddev_stack_rdev_limits(mddev, &lim, MDDEV_STACK_INTEGRITY);
389 389 	if (err) {
390 390 		queue_limits_cancel_update(mddev->gendisk->queue);
+7 -29
drivers/md/raid1.c
··· 420 420 r1_bio->behind_master_bio = NULL; 421 421 } 422 422 423 - /* clear the bitmap if all writes complete successfully */ 424 - mddev->bitmap_ops->endwrite(mddev, r1_bio->sector, r1_bio->sectors, 425 - !test_bit(R1BIO_Degraded, &r1_bio->state), 426 - test_bit(R1BIO_BehindIO, &r1_bio->state)); 423 + if (test_bit(R1BIO_BehindIO, &r1_bio->state)) 424 + mddev->bitmap_ops->end_behind_write(mddev); 427 425 md_write_end(mddev); 428 426 } 429 427 ··· 478 480 if (!test_bit(Faulty, &rdev->flags)) 479 481 set_bit(R1BIO_WriteError, &r1_bio->state); 480 482 else { 481 - /* Fail the request */ 482 - set_bit(R1BIO_Degraded, &r1_bio->state); 483 483 /* Finished with this branch */ 484 484 r1_bio->bios[mirror] = NULL; 485 485 to_put = bio; ··· 1531 1535 write_behind = true; 1532 1536 1533 1537 r1_bio->bios[i] = NULL; 1534 - if (!rdev || test_bit(Faulty, &rdev->flags)) { 1535 - if (i < conf->raid_disks) 1536 - set_bit(R1BIO_Degraded, &r1_bio->state); 1538 + if (!rdev || test_bit(Faulty, &rdev->flags)) 1537 1539 continue; 1538 - } 1539 1540 1540 1541 atomic_inc(&rdev->nr_pending); 1541 1542 if (test_bit(WriteErrorSeen, &rdev->flags)) { ··· 1551 1558 */ 1552 1559 max_sectors = bad_sectors; 1553 1560 rdev_dec_pending(rdev, mddev); 1554 - /* We don't set R1BIO_Degraded as that 1555 - * only applies if the disk is 1556 - * missing, so it might be re-added, 1557 - * and we want to know to recover this 1558 - * chunk. 
1559 - * In this case the device is here, 1560 - * and the fact that this chunk is not 1561 - * in-sync is recorded in the bad 1562 - * block log 1563 - */ 1564 1561 continue; 1565 1562 } 1566 1563 if (is_bad) { ··· 1628 1645 stats.behind_writes < max_write_behind) 1629 1646 alloc_behind_master_bio(r1_bio, bio); 1630 1647 1631 - mddev->bitmap_ops->startwrite( 1632 - mddev, r1_bio->sector, r1_bio->sectors, 1633 - test_bit(R1BIO_BehindIO, &r1_bio->state)); 1648 + if (test_bit(R1BIO_BehindIO, &r1_bio->state)) 1649 + mddev->bitmap_ops->start_behind_write(mddev); 1634 1650 first_clone = 0; 1635 1651 } 1636 1652 ··· 2596 2614 * errors. 2597 2615 */ 2598 2616 fail = true; 2599 - if (!narrow_write_error(r1_bio, m)) { 2617 + if (!narrow_write_error(r1_bio, m)) 2600 2618 md_error(conf->mddev, 2601 2619 conf->mirrors[m].rdev); 2602 2620 /* an I/O failed, we can't clear the bitmap */ 2603 - set_bit(R1BIO_Degraded, &r1_bio->state); 2604 - } 2605 2621 rdev_dec_pending(conf->mirrors[m].rdev, 2606 2622 conf->mddev); 2607 2623 } ··· 2690 2710 list_del(&r1_bio->retry_list); 2691 2711 idx = sector_to_idx(r1_bio->sector); 2692 2712 atomic_dec(&conf->nr_queued[idx]); 2693 - if (mddev->degraded) 2694 - set_bit(R1BIO_Degraded, &r1_bio->state); 2695 2713 if (test_bit(R1BIO_WriteError, &r1_bio->state)) 2696 2714 close_write(r1_bio); 2697 2715 raid_end_bio_io(r1_bio); ··· 3217 3239 3218 3240 md_init_stacking_limits(&lim); 3219 3241 lim.max_write_zeroes_sectors = 0; 3220 - lim.features |= BLK_FEAT_ATOMIC_WRITES_STACKED; 3242 + lim.features |= BLK_FEAT_ATOMIC_WRITES; 3221 3243 err = mddev_stack_rdev_limits(mddev, &lim, MDDEV_STACK_INTEGRITY); 3222 3244 if (err) { 3223 3245 queue_limits_cancel_update(mddev->gendisk->queue);
-1
drivers/md/raid1.h
···
188 188 	enum r1bio_state {
189 189 		R1BIO_Uptodate,
190 190 		R1BIO_IsSync,
191 -		R1BIO_Degraded,
192 191 		R1BIO_BehindIO,
193 192 	/* Set ReadError on bios that experience a readerror so that
194 193 	 * raid1d knows what to do with them.
+3 -25
drivers/md/raid10.c
··· 428 428 { 429 429 struct mddev *mddev = r10_bio->mddev; 430 430 431 - /* clear the bitmap if all writes complete successfully */ 432 - mddev->bitmap_ops->endwrite(mddev, r10_bio->sector, r10_bio->sectors, 433 - !test_bit(R10BIO_Degraded, &r10_bio->state), 434 - false); 435 431 md_write_end(mddev); 436 432 } 437 433 ··· 497 501 set_bit(R10BIO_WriteError, &r10_bio->state); 498 502 else { 499 503 /* Fail the request */ 500 - set_bit(R10BIO_Degraded, &r10_bio->state); 501 504 r10_bio->devs[slot].bio = NULL; 502 505 to_put = bio; 503 506 dec_rdev = 1; ··· 1433 1438 r10_bio->devs[i].bio = NULL; 1434 1439 r10_bio->devs[i].repl_bio = NULL; 1435 1440 1436 - if (!rdev && !rrdev) { 1437 - set_bit(R10BIO_Degraded, &r10_bio->state); 1441 + if (!rdev && !rrdev) 1438 1442 continue; 1439 - } 1440 1443 if (rdev && test_bit(WriteErrorSeen, &rdev->flags)) { 1441 1444 sector_t first_bad; 1442 1445 sector_t dev_sector = r10_bio->devs[i].addr; ··· 1451 1458 * to other devices yet 1452 1459 */ 1453 1460 max_sectors = bad_sectors; 1454 - /* We don't set R10BIO_Degraded as that 1455 - * only applies if the disk is missing, 1456 - * so it might be re-added, and we want to 1457 - * know to recover this chunk. 1458 - * In this case the device is here, and the 1459 - * fact that this chunk is not in-sync is 1460 - * recorded in the bad block log. 
1461 - */ 1462 1461 continue; 1463 1462 } 1464 1463 if (is_bad) { ··· 1504 1519 md_account_bio(mddev, &bio); 1505 1520 r10_bio->master_bio = bio; 1506 1521 atomic_set(&r10_bio->remaining, 1); 1507 - mddev->bitmap_ops->startwrite(mddev, r10_bio->sector, r10_bio->sectors, 1508 - false); 1509 1522 1510 1523 for (i = 0; i < conf->copies; i++) { 1511 1524 if (r10_bio->devs[i].bio) ··· 2949 2966 rdev_dec_pending(rdev, conf->mddev); 2950 2967 } else if (bio != NULL && bio->bi_status) { 2951 2968 fail = true; 2952 - if (!narrow_write_error(r10_bio, m)) { 2969 + if (!narrow_write_error(r10_bio, m)) 2953 2970 md_error(conf->mddev, rdev); 2954 - set_bit(R10BIO_Degraded, 2955 - &r10_bio->state); 2956 - } 2957 2971 rdev_dec_pending(rdev, conf->mddev); 2958 2972 } 2959 2973 bio = r10_bio->devs[m].repl_bio; ··· 3009 3029 r10_bio = list_first_entry(&tmp, struct r10bio, 3010 3030 retry_list); 3011 3031 list_del(&r10_bio->retry_list); 3012 - if (mddev->degraded) 3013 - set_bit(R10BIO_Degraded, &r10_bio->state); 3014 3032 3015 3033 if (test_bit(R10BIO_WriteError, 3016 3034 &r10_bio->state)) ··· 4018 4040 lim.max_write_zeroes_sectors = 0; 4019 4041 lim.io_min = mddev->chunk_sectors << 9; 4020 4042 lim.io_opt = lim.io_min * raid10_nr_stripes(conf); 4021 - lim.features |= BLK_FEAT_ATOMIC_WRITES_STACKED; 4043 + lim.features |= BLK_FEAT_ATOMIC_WRITES; 4022 4044 err = mddev_stack_rdev_limits(mddev, &lim, MDDEV_STACK_INTEGRITY); 4023 4045 if (err) { 4024 4046 queue_limits_cancel_update(mddev->gendisk->queue);
-1
drivers/md/raid10.h
···
161 161 		R10BIO_IsSync,
162 162 		R10BIO_IsRecover,
163 163 		R10BIO_IsReshape,
164 -		R10BIO_Degraded,
165 164 	/* Set ReadError on bios that experience a read error
166 165 	 * so that raid10d knows what to do with them.
167 166 	 */
+8 -12
drivers/md/raid5-cache.c
··· 313 313 if (sh->dev[i].written) { 314 314 set_bit(R5_UPTODATE, &sh->dev[i].flags); 315 315 r5c_return_dev_pending_writes(conf, &sh->dev[i]); 316 - conf->mddev->bitmap_ops->endwrite(conf->mddev, 317 - sh->sector, RAID5_STRIPE_SECTORS(conf), 318 - !test_bit(STRIPE_DEGRADED, &sh->state), 319 - false); 320 316 } 321 317 } 322 318 } ··· 1019 1023 /* checksum is already calculated in last run */ 1020 1024 if (test_bit(STRIPE_LOG_TRAPPED, &sh->state)) 1021 1025 continue; 1022 - addr = kmap_atomic(sh->dev[i].page); 1026 + addr = kmap_local_page(sh->dev[i].page); 1023 1027 sh->dev[i].log_checksum = crc32c_le(log->uuid_checksum, 1024 1028 addr, PAGE_SIZE); 1025 - kunmap_atomic(addr); 1029 + kunmap_local(addr); 1026 1030 } 1027 1031 parity_pages = 1 + !!(sh->qd_idx >= 0); 1028 1032 data_pages = write_disks - parity_pages; ··· 1975 1979 u32 checksum; 1976 1980 1977 1981 r5l_recovery_read_page(log, ctx, page, log_offset); 1978 - addr = kmap_atomic(page); 1982 + addr = kmap_local_page(page); 1979 1983 checksum = crc32c_le(log->uuid_checksum, addr, PAGE_SIZE); 1980 - kunmap_atomic(addr); 1984 + kunmap_local(addr); 1981 1985 return (le32_to_cpu(log_checksum) == checksum) ? 
0 : -EINVAL; 1982 1986 } 1983 1987 ··· 2377 2381 payload->size = cpu_to_le32(BLOCK_SECTORS); 2378 2382 payload->location = cpu_to_le64( 2379 2383 raid5_compute_blocknr(sh, i, 0)); 2380 - addr = kmap_atomic(dev->page); 2384 + addr = kmap_local_page(dev->page); 2381 2385 payload->checksum[0] = cpu_to_le32( 2382 2386 crc32c_le(log->uuid_checksum, addr, 2383 2387 PAGE_SIZE)); 2384 - kunmap_atomic(addr); 2388 + kunmap_local(addr); 2385 2389 sync_page_io(log->rdev, write_pos, PAGE_SIZE, 2386 2390 dev->page, REQ_OP_WRITE, false); 2387 2391 write_pos = r5l_ring_add(log, write_pos, ··· 2884 2888 2885 2889 if (!test_bit(R5_Wantwrite, &sh->dev[i].flags)) 2886 2890 continue; 2887 - addr = kmap_atomic(sh->dev[i].page); 2891 + addr = kmap_local_page(sh->dev[i].page); 2888 2892 sh->dev[i].log_checksum = crc32c_le(log->uuid_checksum, 2889 2893 addr, PAGE_SIZE); 2890 - kunmap_atomic(addr); 2894 + kunmap_local(addr); 2891 2895 pages++; 2892 2896 } 2893 2897 WARN_ON(pages == 0);
+57 -54
drivers/md/raid5.c
··· 906 906 if (raid5_has_log(conf) || raid5_has_ppl(conf)) 907 907 return false; 908 908 return test_bit(STRIPE_BATCH_READY, &sh->state) && 909 - !test_bit(STRIPE_BITMAP_PENDING, &sh->state) && 910 - is_full_stripe_write(sh); 909 + is_full_stripe_write(sh); 911 910 } 912 911 913 912 /* we only do back search */ ··· 1344 1345 submit_bio_noacct(rbi); 1345 1346 } 1346 1347 if (!rdev && !rrdev) { 1347 - if (op_is_write(op)) 1348 - set_bit(STRIPE_DEGRADED, &sh->state); 1349 1348 pr_debug("skip op %d on disc %d for sector %llu\n", 1350 1349 bi->bi_opf, i, (unsigned long long)sh->sector); 1351 1350 clear_bit(R5_LOCKED, &sh->dev[i].flags); ··· 2881 2884 set_bit(R5_MadeGoodRepl, &sh->dev[i].flags); 2882 2885 } else { 2883 2886 if (bi->bi_status) { 2884 - set_bit(STRIPE_DEGRADED, &sh->state); 2885 2887 set_bit(WriteErrorSeen, &rdev->flags); 2886 2888 set_bit(R5_WriteError, &sh->dev[i].flags); 2887 2889 if (!test_and_set_bit(WantReplacement, &rdev->flags)) ··· 3544 3548 (*bip)->bi_iter.bi_sector, sh->sector, dd_idx, 3545 3549 sh->dev[dd_idx].sector); 3546 3550 3547 - if (conf->mddev->bitmap && firstwrite) { 3548 - /* Cannot hold spinlock over bitmap_startwrite, 3549 - * but must ensure this isn't added to a batch until 3550 - * we have added to the bitmap and set bm_seq. 3551 - * So set STRIPE_BITMAP_PENDING to prevent 3552 - * batching. 3553 - * If multiple __add_stripe_bio() calls race here they 3554 - * much all set STRIPE_BITMAP_PENDING. So only the first one 3555 - * to complete "bitmap_startwrite" gets to set 3556 - * STRIPE_BIT_DELAY. This is important as once a stripe 3557 - * is added to a batch, STRIPE_BIT_DELAY cannot be changed 3558 - * any more. 
3559 - */ 3560 - set_bit(STRIPE_BITMAP_PENDING, &sh->state); 3561 - spin_unlock_irq(&sh->stripe_lock); 3562 - conf->mddev->bitmap_ops->startwrite(conf->mddev, sh->sector, 3563 - RAID5_STRIPE_SECTORS(conf), false); 3564 - spin_lock_irq(&sh->stripe_lock); 3565 - clear_bit(STRIPE_BITMAP_PENDING, &sh->state); 3566 - if (!sh->batch_head) { 3567 - sh->bm_seq = conf->seq_flush+1; 3568 - set_bit(STRIPE_BIT_DELAY, &sh->state); 3569 - } 3551 + if (conf->mddev->bitmap && firstwrite && !sh->batch_head) { 3552 + sh->bm_seq = conf->seq_flush+1; 3553 + set_bit(STRIPE_BIT_DELAY, &sh->state); 3570 3554 } 3571 3555 } 3572 3556 ··· 3597 3621 BUG_ON(sh->batch_head); 3598 3622 for (i = disks; i--; ) { 3599 3623 struct bio *bi; 3600 - int bitmap_end = 0; 3601 3624 3602 3625 if (test_bit(R5_ReadError, &sh->dev[i].flags)) { 3603 3626 struct md_rdev *rdev = conf->disks[i].rdev; ··· 3621 3646 sh->dev[i].towrite = NULL; 3622 3647 sh->overwrite_disks = 0; 3623 3648 spin_unlock_irq(&sh->stripe_lock); 3624 - if (bi) 3625 - bitmap_end = 1; 3626 3649 3627 3650 log_stripe_write_finished(sh); 3628 3651 ··· 3635 3662 bio_io_error(bi); 3636 3663 bi = nextbi; 3637 3664 } 3638 - if (bitmap_end) 3639 - conf->mddev->bitmap_ops->endwrite(conf->mddev, 3640 - sh->sector, RAID5_STRIPE_SECTORS(conf), 3641 - false, false); 3642 - bitmap_end = 0; 3643 3665 /* and fail all 'written' */ 3644 3666 bi = sh->dev[i].written; 3645 3667 sh->dev[i].written = NULL; ··· 3643 3675 sh->dev[i].page = sh->dev[i].orig_page; 3644 3676 } 3645 3677 3646 - if (bi) bitmap_end = 1; 3647 3678 while (bi && bi->bi_iter.bi_sector < 3648 3679 sh->dev[i].sector + RAID5_STRIPE_SECTORS(conf)) { 3649 3680 struct bio *bi2 = r5_next_bio(conf, bi, sh->dev[i].sector); ··· 3676 3709 bi = nextbi; 3677 3710 } 3678 3711 } 3679 - if (bitmap_end) 3680 - conf->mddev->bitmap_ops->endwrite(conf->mddev, 3681 - sh->sector, RAID5_STRIPE_SECTORS(conf), 3682 - false, false); 3683 3712 /* If we were in the middle of a write the parity block might 3684 3713 * 
still be locked - so just clear all R5_LOCKED flags 3685 3714 */ ··· 4024 4061 bio_endio(wbi); 4025 4062 wbi = wbi2; 4026 4063 } 4027 - conf->mddev->bitmap_ops->endwrite(conf->mddev, 4028 - sh->sector, RAID5_STRIPE_SECTORS(conf), 4029 - !test_bit(STRIPE_DEGRADED, &sh->state), 4030 - false); 4064 + 4031 4065 if (head_sh->batch_head) { 4032 4066 sh = list_first_entry(&sh->batch_list, 4033 4067 struct stripe_head, ··· 4301 4341 s->locked++; 4302 4342 set_bit(R5_Wantwrite, &dev->flags); 4303 4343 4304 - clear_bit(STRIPE_DEGRADED, &sh->state); 4305 4344 set_bit(STRIPE_INSYNC, &sh->state); 4306 4345 break; 4307 4346 case check_state_run: ··· 4457 4498 clear_bit(R5_Wantwrite, &dev->flags); 4458 4499 s->locked--; 4459 4500 } 4460 - clear_bit(STRIPE_DEGRADED, &sh->state); 4461 4501 4462 4502 set_bit(STRIPE_INSYNC, &sh->state); 4463 4503 break; ··· 4849 4891 (1 << STRIPE_COMPUTE_RUN) | 4850 4892 (1 << STRIPE_DISCARD) | 4851 4893 (1 << STRIPE_BATCH_READY) | 4852 - (1 << STRIPE_BATCH_ERR) | 4853 - (1 << STRIPE_BITMAP_PENDING)), 4894 + (1 << STRIPE_BATCH_ERR)), 4854 4895 "stripe state: %lx\n", sh->state); 4855 4896 WARN_ONCE(head_sh->state & ((1 << STRIPE_DISCARD) | 4856 4897 (1 << STRIPE_REPLACED)), ··· 4857 4900 4858 4901 set_mask_bits(&sh->state, ~(STRIPE_EXPAND_SYNC_FLAGS | 4859 4902 (1 << STRIPE_PREREAD_ACTIVE) | 4860 - (1 << STRIPE_DEGRADED) | 4861 4903 (1 << STRIPE_ON_UNPLUG_LIST)), 4862 4904 head_sh->state & (1 << STRIPE_INSYNC)); 4863 4905 ··· 5740 5784 } 5741 5785 spin_unlock_irq(&sh->stripe_lock); 5742 5786 if (conf->mddev->bitmap) { 5743 - for (d = 0; d < conf->raid_disks - conf->max_degraded; 5744 - d++) 5745 - mddev->bitmap_ops->startwrite(mddev, sh->sector, 5746 - RAID5_STRIPE_SECTORS(conf), false); 5747 5787 sh->bm_seq = conf->seq_flush + 1; 5748 5788 set_bit(STRIPE_BIT_DELAY, &sh->state); 5749 5789 } ··· 5878 5926 if (ahead_of_reshape(mddev, logical_sector, reshape_safe)) 5879 5927 return LOC_INSIDE_RESHAPE; 5880 5928 return LOC_BEHIND_RESHAPE; 5929 + } 5930 + 
5931 + static void raid5_bitmap_sector(struct mddev *mddev, sector_t *offset, 5932 + unsigned long *sectors) 5933 + { 5934 + struct r5conf *conf = mddev->private; 5935 + sector_t start = *offset; 5936 + sector_t end = start + *sectors; 5937 + sector_t prev_start = start; 5938 + sector_t prev_end = end; 5939 + int sectors_per_chunk; 5940 + enum reshape_loc loc; 5941 + int dd_idx; 5942 + 5943 + sectors_per_chunk = conf->chunk_sectors * 5944 + (conf->raid_disks - conf->max_degraded); 5945 + start = round_down(start, sectors_per_chunk); 5946 + end = round_up(end, sectors_per_chunk); 5947 + 5948 + start = raid5_compute_sector(conf, start, 0, &dd_idx, NULL); 5949 + end = raid5_compute_sector(conf, end, 0, &dd_idx, NULL); 5950 + 5951 + /* 5952 + * For LOC_INSIDE_RESHAPE, this IO will wait for reshape to make 5953 + * progress, hence it's the same as LOC_BEHIND_RESHAPE. 5954 + */ 5955 + loc = get_reshape_loc(mddev, conf, prev_start); 5956 + if (likely(loc != LOC_AHEAD_OF_RESHAPE)) { 5957 + *offset = start; 5958 + *sectors = end - start; 5959 + return; 5960 + } 5961 + 5962 + sectors_per_chunk = conf->prev_chunk_sectors * 5963 + (conf->previous_raid_disks - conf->max_degraded); 5964 + prev_start = round_down(prev_start, sectors_per_chunk); 5965 + prev_end = round_down(prev_end, sectors_per_chunk); 5966 + 5967 + prev_start = raid5_compute_sector(conf, prev_start, 1, &dd_idx, NULL); 5968 + prev_end = raid5_compute_sector(conf, prev_end, 1, &dd_idx, NULL); 5969 + 5970 + /* 5971 + * for LOC_AHEAD_OF_RESHAPE, reshape can make progress before this IO 5972 + * is handled in make_stripe_request(), we can't know this here hence 5973 + * we set bits for both. 
5974 + */ 5975 + *offset = min(start, prev_start); 5976 + *sectors = max(end, prev_end) - *offset; 5881 5977 } 5882 5978 5883 5979 static enum stripe_result make_stripe_request(struct mddev *mddev, ··· 8976 8976 .takeover = raid6_takeover, 8977 8977 .change_consistency_policy = raid5_change_consistency_policy, 8978 8978 .prepare_suspend = raid5_prepare_suspend, 8979 + .bitmap_sector = raid5_bitmap_sector, 8979 8980 }; 8980 8981 static struct md_personality raid5_personality = 8981 8982 { ··· 9002 9001 .takeover = raid5_takeover, 9003 9002 .change_consistency_policy = raid5_change_consistency_policy, 9004 9003 .prepare_suspend = raid5_prepare_suspend, 9004 + .bitmap_sector = raid5_bitmap_sector, 9005 9005 }; 9006 9006 9007 9007 static struct md_personality raid4_personality = ··· 9029 9027 .takeover = raid4_takeover, 9030 9028 .change_consistency_policy = raid5_change_consistency_policy, 9031 9029 .prepare_suspend = raid5_prepare_suspend, 9030 + .bitmap_sector = raid5_bitmap_sector, 9032 9031 }; 9033 9032 9034 9033 static int __init raid5_init(void)
-4
drivers/md/raid5.h
···
358 358 	STRIPE_REPLACED,
359 359 	STRIPE_PREREAD_ACTIVE,
360 360 	STRIPE_DELAYED,
361 -	STRIPE_DEGRADED,
362 361 	STRIPE_BIT_DELAY,
363 362 	STRIPE_EXPANDING,
364 363 	STRIPE_EXPAND_SOURCE,
···
371 372 	STRIPE_ON_RELEASE_LIST,
372 373 	STRIPE_BATCH_READY,
373 374 	STRIPE_BATCH_ERR,
374 -	STRIPE_BITMAP_PENDING,	/* Being added to bitmap, don't add
375 -				 * to batch yet.
376 -				 */
377 375 	STRIPE_LOG_TRAPPED,	/* trapped into log (see raid5-cache.c)
378 376 				 * this bit is used in two scenarios:
379 377 				 *
+1 -2
drivers/memstick/core/ms_block.c
··· 2094 2094 if (msb->disk_id < 0) 2095 2095 return msb->disk_id; 2096 2096 2097 - rc = blk_mq_alloc_sq_tag_set(&msb->tag_set, &msb_mq_ops, 2, 2098 - BLK_MQ_F_SHOULD_MERGE); 2097 + rc = blk_mq_alloc_sq_tag_set(&msb->tag_set, &msb_mq_ops, 2, 0); 2099 2098 if (rc) 2100 2099 goto out_release_id; 2101 2100
+1 -2
drivers/memstick/core/mspro_block.c
··· 1139 1139 if (disk_id < 0) 1140 1140 return disk_id; 1141 1141 1142 - rc = blk_mq_alloc_sq_tag_set(&msb->tag_set, &mspro_mq_ops, 2, 1143 - BLK_MQ_F_SHOULD_MERGE); 1142 + rc = blk_mq_alloc_sq_tag_set(&msb->tag_set, &mspro_mq_ops, 2, 0); 1144 1143 if (rc) 1145 1144 goto out_release_id; 1146 1145
+1 -1
drivers/mmc/core/queue.c
··· 441 441 else 442 442 mq->tag_set.queue_depth = MMC_QUEUE_DEPTH; 443 443 mq->tag_set.numa_node = NUMA_NO_NODE; 444 - mq->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_BLOCKING; 444 + mq->tag_set.flags = BLK_MQ_F_BLOCKING; 445 445 mq->tag_set.nr_hw_queues = 1; 446 446 mq->tag_set.cmd_size = sizeof(struct mmc_queue_req); 447 447 mq->tag_set.driver_data = mq;
+1 -1
drivers/mtd/mtd_blkdevs.c
··· 329 329 goto out_list_del; 330 330 331 331 ret = blk_mq_alloc_sq_tag_set(new->tag_set, &mtd_mq_ops, 2, 332 - BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_BLOCKING); 332 + BLK_MQ_F_BLOCKING); 333 333 if (ret) 334 334 goto out_kfree_tag_set; 335 335
+1 -1
drivers/mtd/ubi/block.c
··· 383 383 dev->tag_set.ops = &ubiblock_mq_ops; 384 384 dev->tag_set.queue_depth = 64; 385 385 dev->tag_set.numa_node = NUMA_NO_NODE; 386 - dev->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_BLOCKING; 386 + dev->tag_set.flags = BLK_MQ_F_BLOCKING; 387 387 dev->tag_set.cmd_size = sizeof(struct ubiblock_pdu); 388 388 dev->tag_set.driver_data = dev; 389 389 dev->tag_set.nr_hw_queues = 1;
-2
drivers/nvme/host/apple.c
··· 1251 1251 anv->admin_tagset.timeout = NVME_ADMIN_TIMEOUT; 1252 1252 anv->admin_tagset.numa_node = NUMA_NO_NODE; 1253 1253 anv->admin_tagset.cmd_size = sizeof(struct apple_nvme_iod); 1254 - anv->admin_tagset.flags = BLK_MQ_F_NO_SCHED; 1255 1254 anv->admin_tagset.driver_data = &anv->adminq; 1256 1255 1257 1256 ret = blk_mq_alloc_tag_set(&anv->admin_tagset); ··· 1274 1275 anv->tagset.timeout = NVME_IO_TIMEOUT; 1275 1276 anv->tagset.numa_node = NUMA_NO_NODE; 1276 1277 anv->tagset.cmd_size = sizeof(struct apple_nvme_iod); 1277 - anv->tagset.flags = BLK_MQ_F_SHOULD_MERGE; 1278 1278 anv->tagset.driver_data = &anv->ioq; 1279 1279 1280 1280 ret = blk_mq_alloc_tag_set(&anv->tagset);
+34 -12
drivers/nvme/host/core.c
··· 2002 2002 lim->atomic_write_hw_boundary = boundary; 2003 2003 lim->atomic_write_hw_unit_min = bs; 2004 2004 lim->atomic_write_hw_unit_max = rounddown_pow_of_two(atomic_bs); 2005 + lim->features |= BLK_FEAT_ATOMIC_WRITES; 2005 2006 } 2006 2007 2007 2008 static u32 nvme_max_drv_segments(struct nvme_ctrl *ctrl) ··· 2129 2128 struct queue_limits lim; 2130 2129 int ret; 2131 2130 2132 - blk_mq_freeze_queue(ns->disk->queue); 2133 2131 lim = queue_limits_start_update(ns->disk->queue); 2134 2132 nvme_set_ctrl_limits(ns->ctrl, &lim); 2133 + 2134 + blk_mq_freeze_queue(ns->disk->queue); 2135 2135 ret = queue_limits_commit_update(ns->disk->queue, &lim); 2136 2136 set_disk_ro(ns->disk, nvme_ns_is_readonly(ns, info)); 2137 2137 blk_mq_unfreeze_queue(ns->disk->queue); ··· 2179 2177 goto out; 2180 2178 } 2181 2179 2180 + lim = queue_limits_start_update(ns->disk->queue); 2181 + 2182 2182 blk_mq_freeze_queue(ns->disk->queue); 2183 2183 ns->head->lba_shift = id->lbaf[lbaf].ds; 2184 2184 ns->head->nuse = le64_to_cpu(id->nuse); 2185 2185 capacity = nvme_lba_to_sect(ns->head, le64_to_cpu(id->nsze)); 2186 - 2187 - lim = queue_limits_start_update(ns->disk->queue); 2188 2186 nvme_set_ctrl_limits(ns->ctrl, &lim); 2189 2187 nvme_configure_metadata(ns->ctrl, ns->head, id, nvm, info); 2190 2188 nvme_set_chunk_sectors(ns, id, &lim); ··· 2287 2285 struct queue_limits *ns_lim = &ns->disk->queue->limits; 2288 2286 struct queue_limits lim; 2289 2287 2288 + lim = queue_limits_start_update(ns->head->disk->queue); 2290 2289 blk_mq_freeze_queue(ns->head->disk->queue); 2291 2290 /* 2292 2291 * queue_limits mixes values that are the hardware limitations ··· 2304 2301 * the splitting limits in to make sure we still obey possibly 2305 2302 * lower limitations of other controllers. 
2306 2303 */ 2307 - lim = queue_limits_start_update(ns->head->disk->queue); 2308 2304 lim.logical_block_size = ns_lim->logical_block_size; 2309 2305 lim.physical_block_size = ns_lim->physical_block_size; 2310 2306 lim.io_min = ns_lim->io_min; ··· 3094 3092 static int nvme_get_effects_log(struct nvme_ctrl *ctrl, u8 csi, 3095 3093 struct nvme_effects_log **log) 3096 3094 { 3097 - struct nvme_effects_log *cel = xa_load(&ctrl->cels, csi); 3095 + struct nvme_effects_log *old, *cel = xa_load(&ctrl->cels, csi); 3098 3096 int ret; 3099 3097 3100 3098 if (cel) ··· 3111 3109 return ret; 3112 3110 } 3113 3111 3114 - xa_store(&ctrl->cels, csi, cel, GFP_KERNEL); 3112 + old = xa_store(&ctrl->cels, csi, cel, GFP_KERNEL); 3113 + if (xa_is_err(old)) { 3114 + kfree(cel); 3115 + return xa_err(old); 3116 + } 3115 3117 out: 3116 3118 *log = cel; 3117 3119 return 0; ··· 3177 3171 return ret; 3178 3172 } 3179 3173 3174 + static int nvme_init_effects_log(struct nvme_ctrl *ctrl, 3175 + u8 csi, struct nvme_effects_log **log) 3176 + { 3177 + struct nvme_effects_log *effects, *old; 3178 + 3179 + effects = kzalloc(sizeof(*effects), GFP_KERNEL); 3180 + if (!effects) 3181 + return -ENOMEM; 3182 + 3183 + old = xa_store(&ctrl->cels, csi, effects, GFP_KERNEL); 3184 + if (xa_is_err(old)) { 3185 + kfree(effects); 3186 + return xa_err(old); 3187 + } 3188 + 3189 + *log = effects; 3190 + return 0; 3191 + } 3192 + 3180 3193 static void nvme_init_known_nvm_effects(struct nvme_ctrl *ctrl) 3181 3194 { 3182 3195 struct nvme_effects_log *log = ctrl->effects; ··· 3242 3217 } 3243 3218 3244 3219 if (!ctrl->effects) { 3245 - ctrl->effects = kzalloc(sizeof(*ctrl->effects), GFP_KERNEL); 3246 - if (!ctrl->effects) 3247 - return -ENOMEM; 3248 - xa_store(&ctrl->cels, NVME_CSI_NVM, ctrl->effects, GFP_KERNEL); 3220 + ret = nvme_init_effects_log(ctrl, NVME_CSI_NVM, &ctrl->effects); 3221 + if (ret < 0) 3222 + return ret; 3249 3223 } 3250 3224 3251 3225 nvme_init_known_nvm_effects(ctrl); ··· 4588 4564 /* Reserved for 
fabric connect and keep alive */ 4589 4565 set->reserved_tags = 2; 4590 4566 set->numa_node = ctrl->numa_node; 4591 - set->flags = BLK_MQ_F_NO_SCHED; 4592 4567 if (ctrl->ops->flags & NVME_F_BLOCKING) 4593 4568 set->flags |= BLK_MQ_F_BLOCKING; 4594 4569 set->cmd_size = cmd_size; ··· 4662 4639 /* Reserved for fabric connect */ 4663 4640 set->reserved_tags = 1; 4664 4641 set->numa_node = ctrl->numa_node; 4665 - set->flags = BLK_MQ_F_SHOULD_MERGE; 4666 4642 if (ctrl->ops->flags & NVME_F_BLOCKING) 4667 4643 set->flags |= BLK_MQ_F_BLOCKING; 4668 4644 set->cmd_size = cmd_size;
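Both fixes in this core.c hunk, nvme_get_effects_log() and the new nvme_init_effects_log(), follow the same pattern: treat xa_store()'s return value as fallible, and free the just-allocated effects log when it reports an error. The tagged-pointer convention behind xa_is_err()/xa_err() can be sketched in userspace; the bit layout below is an assumption modeled on the kernel's internal-entry encoding, not a copy of <linux/xarray.h>:

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Sketch of the error convention the fixes rely on: a store returns
 * either the previous entry or a negative errno encoded as a tagged
 * pointer. Real entries are at least 4-byte aligned, so tag value 2 in
 * the low bits never collides with a valid pointer. (Assumed layout,
 * modeled on the kernel's internal-entry encoding.)
 */
static inline void *xa_mk_err(int err)		/* err is a negative errno */
{
	return (void *)(((uintptr_t)-err << 2) | 2);
}

static inline int xa_is_err(const void *entry)
{
	return ((uintptr_t)entry & 3) == 2;
}

static inline int xa_err(const void *entry)
{
	return xa_is_err(entry) ? -(int)((uintptr_t)entry >> 2) : 0;
}
```

With that convention in mind, the fixed call sites read naturally: `old = xa_store(...); if (xa_is_err(old)) { kfree(cel); return xa_err(old); }` instead of silently discarding the store's result.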
-1
drivers/nvme/host/fc.c
··· 16 16 #include <linux/nvme-fc.h> 17 17 #include "fc.h" 18 18 #include <scsi/scsi_transport_fc.h> 19 - #include <linux/blk-mq-pci.h> 20 19 21 20 /* *************************** Data Structures/Defines ****************** */ 22 21
-39
drivers/nvme/host/nvme.h
··· 1187 1187 return (ctrl->ctrl_config & NVME_CC_CSS_MASK) == NVME_CC_CSS_CSI; 1188 1188 } 1189 1189 1190 - #ifdef CONFIG_NVME_VERBOSE_ERRORS 1191 - const char *nvme_get_error_status_str(u16 status); 1192 - const char *nvme_get_opcode_str(u8 opcode); 1193 - const char *nvme_get_admin_opcode_str(u8 opcode); 1194 - const char *nvme_get_fabrics_opcode_str(u8 opcode); 1195 - #else /* CONFIG_NVME_VERBOSE_ERRORS */ 1196 - static inline const char *nvme_get_error_status_str(u16 status) 1197 - { 1198 - return "I/O Error"; 1199 - } 1200 - static inline const char *nvme_get_opcode_str(u8 opcode) 1201 - { 1202 - return "I/O Cmd"; 1203 - } 1204 - static inline const char *nvme_get_admin_opcode_str(u8 opcode) 1205 - { 1206 - return "Admin Cmd"; 1207 - } 1208 - 1209 - static inline const char *nvme_get_fabrics_opcode_str(u8 opcode) 1210 - { 1211 - return "Fabrics Cmd"; 1212 - } 1213 - #endif /* CONFIG_NVME_VERBOSE_ERRORS */ 1214 - 1215 - static inline const char *nvme_opcode_str(int qid, u8 opcode) 1216 - { 1217 - return qid ? nvme_get_opcode_str(opcode) : 1218 - nvme_get_admin_opcode_str(opcode); 1219 - } 1220 - 1221 - static inline const char *nvme_fabrics_opcode_str( 1222 - int qid, const struct nvme_command *cmd) 1223 - { 1224 - if (nvme_is_fabrics(cmd)) 1225 - return nvme_get_fabrics_opcode_str(cmd->fabrics.fctype); 1226 - 1227 - return nvme_opcode_str(qid, cmd->common.opcode); 1228 - } 1229 1190 #endif /* _NVME_H */
+8 -9
drivers/nvme/host/pci.c
··· 8 8 #include <linux/async.h> 9 9 #include <linux/blkdev.h> 10 10 #include <linux/blk-mq.h> 11 - #include <linux/blk-mq-pci.h> 12 11 #include <linux/blk-integrity.h> 13 12 #include <linux/dmi.h> 14 13 #include <linux/init.h> ··· 372 373 /* 373 374 * Ensure that the doorbell is updated before reading the event 374 375 * index from memory. The controller needs to provide similar 375 - * ordering to ensure the envent index is updated before reading 376 + * ordering to ensure the event index is updated before reading 376 377 * the doorbell. 377 378 */ 378 379 mb(); ··· 462 463 */ 463 464 map->queue_offset = qoff; 464 465 if (i != HCTX_TYPE_POLL && offset) 465 - blk_mq_pci_map_queues(map, to_pci_dev(dev->dev), offset); 466 + blk_mq_map_hw_queues(map, dev->dev, offset); 466 467 else 467 468 blk_mq_map_queues(map); 468 469 qoff += map->nr_queues; ··· 1147 1148 } 1148 1149 } 1149 1150 1150 - static inline int nvme_poll_cq(struct nvme_queue *nvmeq, 1151 - struct io_comp_batch *iob) 1151 + static inline bool nvme_poll_cq(struct nvme_queue *nvmeq, 1152 + struct io_comp_batch *iob) 1152 1153 { 1153 - int found = 0; 1154 + bool found = false; 1154 1155 1155 1156 while (nvme_cqe_pending(nvmeq)) { 1156 - found++; 1157 + found = true; 1157 1158 /* 1158 1159 * load-load control dependency between phase and the rest of 1159 1160 * the cqe requires a full read memory barrier ··· 2085 2086 sizeof(*dev->host_mem_descs), &dev->host_mem_descs_dma, 2086 2087 GFP_KERNEL); 2087 2088 if (!dev->host_mem_descs) { 2088 - dma_free_noncontiguous(dev->dev, dev->host_mem_size, 2089 - dev->hmb_sgt, DMA_BIDIRECTIONAL); 2089 + dma_free_noncontiguous(dev->dev, size, dev->hmb_sgt, 2090 + DMA_BIDIRECTIONAL); 2090 2091 dev->hmb_sgt = NULL; 2091 2092 return -ENOMEM; 2092 2093 }
+57 -13
drivers/nvme/host/tcp.c
··· 54 54 "nvme TLS handshake timeout in seconds (default 10)"); 55 55 #endif 56 56 57 + static atomic_t nvme_tcp_cpu_queues[NR_CPUS]; 58 + 57 59 #ifdef CONFIG_DEBUG_LOCK_ALLOC 58 60 /* lockdep can detect a circular dependency of the form 59 61 * sk_lock -> mmap_lock (page fault) -> fs locks -> sk_lock ··· 129 127 NVME_TCP_Q_ALLOCATED = 0, 130 128 NVME_TCP_Q_LIVE = 1, 131 129 NVME_TCP_Q_POLLING = 2, 130 + NVME_TCP_Q_IO_CPU_SET = 3, 132 131 }; 133 132 134 133 enum nvme_tcp_recv_state { ··· 1565 1562 ctrl->io_queues[HCTX_TYPE_POLL]; 1566 1563 } 1567 1564 1565 + /** 1566 + * Track the number of queues assigned to each cpu using a global per-cpu 1567 + * counter and select the least used cpu from the mq_map. Our goal is to spread 1568 + * different controllers I/O threads across different cpu cores. 1569 + * 1570 + * Note that the accounting is not 100% perfect, but we don't need to be, we're 1571 + * simply putting our best effort to select the best candidate cpu core that we 1572 + * find at any given point. 
1573 + */ 1568 1574 static void nvme_tcp_set_queue_io_cpu(struct nvme_tcp_queue *queue) 1569 1575 { 1570 1576 struct nvme_tcp_ctrl *ctrl = queue->ctrl; 1571 - int qid = nvme_tcp_queue_id(queue); 1572 - int n = 0; 1577 + struct blk_mq_tag_set *set = &ctrl->tag_set; 1578 + int qid = nvme_tcp_queue_id(queue) - 1; 1579 + unsigned int *mq_map = NULL; 1580 + int cpu, min_queues = INT_MAX, io_cpu; 1581 + 1582 + if (wq_unbound) 1583 + goto out; 1573 1584 1574 1585 if (nvme_tcp_default_queue(queue)) 1575 - n = qid - 1; 1586 + mq_map = set->map[HCTX_TYPE_DEFAULT].mq_map; 1576 1587 else if (nvme_tcp_read_queue(queue)) 1577 - n = qid - ctrl->io_queues[HCTX_TYPE_DEFAULT] - 1; 1588 + mq_map = set->map[HCTX_TYPE_READ].mq_map; 1578 1589 else if (nvme_tcp_poll_queue(queue)) 1579 - n = qid - ctrl->io_queues[HCTX_TYPE_DEFAULT] - 1580 - ctrl->io_queues[HCTX_TYPE_READ] - 1; 1581 - if (wq_unbound) 1582 - queue->io_cpu = WORK_CPU_UNBOUND; 1583 - else 1584 - queue->io_cpu = cpumask_next_wrap(n - 1, cpu_online_mask, -1, false); 1590 + mq_map = set->map[HCTX_TYPE_POLL].mq_map; 1591 + 1592 + if (WARN_ON(!mq_map)) 1593 + goto out; 1594 + 1595 + /* Search for the least used cpu from the mq_map */ 1596 + io_cpu = WORK_CPU_UNBOUND; 1597 + for_each_online_cpu(cpu) { 1598 + int num_queues = atomic_read(&nvme_tcp_cpu_queues[cpu]); 1599 + 1600 + if (mq_map[cpu] != qid) 1601 + continue; 1602 + if (num_queues < min_queues) { 1603 + io_cpu = cpu; 1604 + min_queues = num_queues; 1605 + } 1606 + } 1607 + if (io_cpu != WORK_CPU_UNBOUND) { 1608 + queue->io_cpu = io_cpu; 1609 + atomic_inc(&nvme_tcp_cpu_queues[io_cpu]); 1610 + set_bit(NVME_TCP_Q_IO_CPU_SET, &queue->flags); 1611 + } 1612 + out: 1613 + dev_dbg(ctrl->ctrl.device, "queue %d: using cpu %d\n", 1614 + qid, queue->io_cpu); 1585 1615 } 1586 1616 1587 1617 static void nvme_tcp_tls_done(void *data, int status, key_serial_t pskid) ··· 1758 1722 1759 1723 queue->sock->sk->sk_allocation = GFP_ATOMIC; 1760 1724 queue->sock->sk->sk_use_task_frag = false; 
1761 - nvme_tcp_set_queue_io_cpu(queue); 1725 + queue->io_cpu = WORK_CPU_UNBOUND; 1762 1726 queue->request = NULL; 1763 1727 queue->data_remaining = 0; 1764 1728 queue->ddgst_remaining = 0; ··· 1880 1844 if (!test_bit(NVME_TCP_Q_ALLOCATED, &queue->flags)) 1881 1845 return; 1882 1846 1847 + if (test_and_clear_bit(NVME_TCP_Q_IO_CPU_SET, &queue->flags)) 1848 + atomic_dec(&nvme_tcp_cpu_queues[queue->io_cpu]); 1849 + 1883 1850 mutex_lock(&queue->queue_lock); 1884 1851 if (test_and_clear_bit(NVME_TCP_Q_LIVE, &queue->flags)) 1885 1852 __nvme_tcp_stop_queue(queue); ··· 1917 1878 nvme_tcp_init_recv_ctx(queue); 1918 1879 nvme_tcp_setup_sock_ops(queue); 1919 1880 1920 - if (idx) 1881 + if (idx) { 1882 + nvme_tcp_set_queue_io_cpu(queue); 1921 1883 ret = nvmf_connect_io_queue(nctrl, idx); 1922 - else 1884 + } else 1923 1885 ret = nvmf_connect_admin_queue(nctrl); 1924 1886 1925 1887 if (!ret) { ··· 2885 2845 static int __init nvme_tcp_init_module(void) 2886 2846 { 2887 2847 unsigned int wq_flags = WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_SYSFS; 2848 + int cpu; 2888 2849 2889 2850 BUILD_BUG_ON(sizeof(struct nvme_tcp_hdr) != 8); 2890 2851 BUILD_BUG_ON(sizeof(struct nvme_tcp_cmd_pdu) != 72); ··· 2902 2861 nvme_tcp_wq = alloc_workqueue("nvme_tcp_wq", wq_flags, 0); 2903 2862 if (!nvme_tcp_wq) 2904 2863 return -ENOMEM; 2864 + 2865 + for_each_possible_cpu(cpu) 2866 + atomic_set(&nvme_tcp_cpu_queues[cpu], 0); 2905 2867 2906 2868 nvmf_register_transport(&nvme_tcp_transport); 2907 2869 return 0;
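The core of the new selection logic in nvme_tcp_set_queue_io_cpu() — scan the CPUs that the mq_map assigns to this hardware queue and pick the least loaded one — can be sketched in isolation. Fixed-size arrays stand in for the per-cpu atomic counters and the online-CPU mask, and the function name is hypothetical:

```c
#include <limits.h>

#define SKETCH_NR_CPUS 8

/*
 * Standalone sketch of the least-used-CPU scan: among the CPUs mapped
 * to hardware queue 'qid', return the one currently hosting the fewest
 * TCP queues, or -1 (analogous to WORK_CPU_UNBOUND) if no CPU matches.
 * Plain ints stand in for the atomic per-cpu counters.
 */
static int pick_least_used_cpu(const unsigned int mq_map[SKETCH_NR_CPUS],
			       const int queues_per_cpu[SKETCH_NR_CPUS],
			       unsigned int qid)
{
	int cpu, io_cpu = -1, min_queues = INT_MAX;

	for (cpu = 0; cpu < SKETCH_NR_CPUS; cpu++) {
		if (mq_map[cpu] != qid)
			continue;	/* CPU serves a different hw queue */
		if (queues_per_cpu[cpu] < min_queues) {
			io_cpu = cpu;
			min_queues = queues_per_cpu[cpu];
		}
	}
	return io_cpu;
}
```

As the patch's own comment notes, the accounting is best-effort rather than exact: the winner is simply the best candidate among the mapped CPUs at the moment the queue is started, which is enough to spread different controllers' io_work across cores.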
+11
drivers/nvme/target/Kconfig
··· 115 115 target side. 116 116 117 117 If unsure, say N. 118 + 119 + config NVME_TARGET_PCI_EPF 120 + tristate "NVMe PCI Endpoint Function target support" 121 + depends on NVME_TARGET && PCI_ENDPOINT 122 + depends on NVME_CORE=y || NVME_CORE=NVME_TARGET 123 + help 124 + This enables the NVMe PCI Endpoint Function target driver support, 125 + which allows creating a NVMe PCI controller using an endpoint mode 126 + capable PCI controller. 127 + 128 + If unsure, say N.
+2
drivers/nvme/target/Makefile
··· 8 8 obj-$(CONFIG_NVME_TARGET_FC) += nvmet-fc.o 9 9 obj-$(CONFIG_NVME_TARGET_FCLOOP) += nvme-fcloop.o 10 10 obj-$(CONFIG_NVME_TARGET_TCP) += nvmet-tcp.o 11 + obj-$(CONFIG_NVME_TARGET_PCI_EPF) += nvmet-pci-epf.o 11 12 12 13 nvmet-y += core.o configfs.o admin-cmd.o fabrics-cmd.o \ 13 14 discovery.o io-cmd-file.o io-cmd-bdev.o pr.o ··· 21 20 nvmet-fc-y += fc.o 22 21 nvme-fcloop-y += fcloop.o 23 22 nvmet-tcp-y += tcp.o 23 + nvmet-pci-epf-y += pci-epf.o 24 24 nvmet-$(CONFIG_TRACING) += trace.o
+376 -14
drivers/nvme/target/admin-cmd.c
··· 12 12 #include <linux/unaligned.h> 13 13 #include "nvmet.h" 14 14 15 + static void nvmet_execute_delete_sq(struct nvmet_req *req) 16 + { 17 + struct nvmet_ctrl *ctrl = req->sq->ctrl; 18 + u16 sqid = le16_to_cpu(req->cmd->delete_queue.qid); 19 + u16 status; 20 + 21 + if (!nvmet_is_pci_ctrl(ctrl)) { 22 + status = nvmet_report_invalid_opcode(req); 23 + goto complete; 24 + } 25 + 26 + if (!sqid) { 27 + status = NVME_SC_QID_INVALID | NVME_STATUS_DNR; 28 + goto complete; 29 + } 30 + 31 + status = nvmet_check_sqid(ctrl, sqid, false); 32 + if (status != NVME_SC_SUCCESS) 33 + goto complete; 34 + 35 + status = ctrl->ops->delete_sq(ctrl, sqid); 36 + 37 + complete: 38 + nvmet_req_complete(req, status); 39 + } 40 + 41 + static void nvmet_execute_create_sq(struct nvmet_req *req) 42 + { 43 + struct nvmet_ctrl *ctrl = req->sq->ctrl; 44 + struct nvme_command *cmd = req->cmd; 45 + u16 sqid = le16_to_cpu(cmd->create_sq.sqid); 46 + u16 cqid = le16_to_cpu(cmd->create_sq.cqid); 47 + u16 sq_flags = le16_to_cpu(cmd->create_sq.sq_flags); 48 + u16 qsize = le16_to_cpu(cmd->create_sq.qsize); 49 + u64 prp1 = le64_to_cpu(cmd->create_sq.prp1); 50 + u16 status; 51 + 52 + if (!nvmet_is_pci_ctrl(ctrl)) { 53 + status = nvmet_report_invalid_opcode(req); 54 + goto complete; 55 + } 56 + 57 + if (!sqid) { 58 + status = NVME_SC_QID_INVALID | NVME_STATUS_DNR; 59 + goto complete; 60 + } 61 + 62 + status = nvmet_check_sqid(ctrl, sqid, true); 63 + if (status != NVME_SC_SUCCESS) 64 + goto complete; 65 + 66 + /* 67 + * Note: The NVMe specification allows multiple SQs to use the same CQ. 68 + * However, the target code does not really support that. So for now, 69 + * prevent this and fail the command if sqid and cqid are different. 
70 + */ 71 + if (!cqid || cqid != sqid) { 72 + pr_err("SQ %u: Unsupported CQID %u\n", sqid, cqid); 73 + status = NVME_SC_CQ_INVALID | NVME_STATUS_DNR; 74 + goto complete; 75 + } 76 + 77 + if (!qsize || qsize > NVME_CAP_MQES(ctrl->cap)) { 78 + status = NVME_SC_QUEUE_SIZE | NVME_STATUS_DNR; 79 + goto complete; 80 + } 81 + 82 + status = ctrl->ops->create_sq(ctrl, sqid, sq_flags, qsize, prp1); 83 + 84 + complete: 85 + nvmet_req_complete(req, status); 86 + } 87 + 88 + static void nvmet_execute_delete_cq(struct nvmet_req *req) 89 + { 90 + struct nvmet_ctrl *ctrl = req->sq->ctrl; 91 + u16 cqid = le16_to_cpu(req->cmd->delete_queue.qid); 92 + u16 status; 93 + 94 + if (!nvmet_is_pci_ctrl(ctrl)) { 95 + status = nvmet_report_invalid_opcode(req); 96 + goto complete; 97 + } 98 + 99 + if (!cqid) { 100 + status = NVME_SC_QID_INVALID | NVME_STATUS_DNR; 101 + goto complete; 102 + } 103 + 104 + status = nvmet_check_cqid(ctrl, cqid); 105 + if (status != NVME_SC_SUCCESS) 106 + goto complete; 107 + 108 + status = ctrl->ops->delete_cq(ctrl, cqid); 109 + 110 + complete: 111 + nvmet_req_complete(req, status); 112 + } 113 + 114 + static void nvmet_execute_create_cq(struct nvmet_req *req) 115 + { 116 + struct nvmet_ctrl *ctrl = req->sq->ctrl; 117 + struct nvme_command *cmd = req->cmd; 118 + u16 cqid = le16_to_cpu(cmd->create_cq.cqid); 119 + u16 cq_flags = le16_to_cpu(cmd->create_cq.cq_flags); 120 + u16 qsize = le16_to_cpu(cmd->create_cq.qsize); 121 + u16 irq_vector = le16_to_cpu(cmd->create_cq.irq_vector); 122 + u64 prp1 = le64_to_cpu(cmd->create_cq.prp1); 123 + u16 status; 124 + 125 + if (!nvmet_is_pci_ctrl(ctrl)) { 126 + status = nvmet_report_invalid_opcode(req); 127 + goto complete; 128 + } 129 + 130 + if (!cqid) { 131 + status = NVME_SC_QID_INVALID | NVME_STATUS_DNR; 132 + goto complete; 133 + } 134 + 135 + status = nvmet_check_cqid(ctrl, cqid); 136 + if (status != NVME_SC_SUCCESS) 137 + goto complete; 138 + 139 + if (!qsize || qsize > NVME_CAP_MQES(ctrl->cap)) { 140 + status = 
NVME_SC_QUEUE_SIZE | NVME_STATUS_DNR; 141 + goto complete; 142 + } 143 + 144 + status = ctrl->ops->create_cq(ctrl, cqid, cq_flags, qsize, 145 + prp1, irq_vector); 146 + 147 + complete: 148 + nvmet_req_complete(req, status); 149 + } 150 + 15 151 u32 nvmet_get_log_page_len(struct nvme_command *cmd) 16 152 { 17 153 u32 len = le16_to_cpu(cmd->get_log_page.numdu); ··· 366 230 nvmet_req_complete(req, status); 367 231 } 368 232 369 - static void nvmet_get_cmd_effects_nvm(struct nvme_effects_log *log) 233 + static void nvmet_get_cmd_effects_admin(struct nvmet_ctrl *ctrl, 234 + struct nvme_effects_log *log) 370 235 { 236 + /* For a PCI target controller, advertize support for the . */ 237 + if (nvmet_is_pci_ctrl(ctrl)) { 238 + log->acs[nvme_admin_delete_sq] = 239 + log->acs[nvme_admin_create_sq] = 240 + log->acs[nvme_admin_delete_cq] = 241 + log->acs[nvme_admin_create_cq] = 242 + cpu_to_le32(NVME_CMD_EFFECTS_CSUPP); 243 + } 244 + 371 245 log->acs[nvme_admin_get_log_page] = 372 246 log->acs[nvme_admin_identify] = 373 247 log->acs[nvme_admin_abort_cmd] = ··· 386 240 log->acs[nvme_admin_async_event] = 387 241 log->acs[nvme_admin_keep_alive] = 388 242 cpu_to_le32(NVME_CMD_EFFECTS_CSUPP); 243 + } 389 244 245 + static void nvmet_get_cmd_effects_nvm(struct nvme_effects_log *log) 246 + { 390 247 log->iocs[nvme_cmd_read] = 391 248 log->iocs[nvme_cmd_flush] = 392 249 log->iocs[nvme_cmd_dsm] = ··· 414 265 415 266 static void nvmet_execute_get_log_cmd_effects_ns(struct nvmet_req *req) 416 267 { 268 + struct nvmet_ctrl *ctrl = req->sq->ctrl; 417 269 struct nvme_effects_log *log; 418 270 u16 status = NVME_SC_SUCCESS; 419 271 ··· 426 276 427 277 switch (req->cmd->get_log_page.csi) { 428 278 case NVME_CSI_NVM: 279 + nvmet_get_cmd_effects_admin(ctrl, log); 429 280 nvmet_get_cmd_effects_nvm(log); 430 281 break; 431 282 case NVME_CSI_ZNS: ··· 434 283 status = NVME_SC_INVALID_IO_CMD_SET; 435 284 goto free; 436 285 } 286 + nvmet_get_cmd_effects_admin(ctrl, log); 437 287 
nvmet_get_cmd_effects_nvm(log); 438 288 nvmet_get_cmd_effects_zns(log); 439 289 break; ··· 660 508 struct nvmet_ctrl *ctrl = req->sq->ctrl; 661 509 struct nvmet_subsys *subsys = ctrl->subsys; 662 510 struct nvme_id_ctrl *id; 663 - u32 cmd_capsule_size; 511 + u32 cmd_capsule_size, ctratt; 664 512 u16 status = 0; 665 513 666 514 if (!subsys->subsys_discovered) { ··· 675 523 goto out; 676 524 } 677 525 678 - /* XXX: figure out how to assign real vendors IDs. */ 679 - id->vid = 0; 680 - id->ssvid = 0; 526 + id->vid = cpu_to_le16(subsys->vendor_id); 527 + id->ssvid = cpu_to_le16(subsys->subsys_vendor_id); 681 528 682 529 memcpy(id->sn, ctrl->subsys->serial, NVMET_SN_MAX_SIZE); 683 530 memcpy_and_pad(id->mn, sizeof(id->mn), subsys->model_number, ··· 708 557 709 558 /* XXX: figure out what to do about RTD3R/RTD3 */ 710 559 id->oaes = cpu_to_le32(NVMET_AEN_CFG_OPTIONAL); 711 - id->ctratt = cpu_to_le32(NVME_CTRL_ATTR_HID_128_BIT | 712 - NVME_CTRL_ATTR_TBKAS); 560 + ctratt = NVME_CTRL_ATTR_HID_128_BIT | NVME_CTRL_ATTR_TBKAS; 561 + if (nvmet_is_pci_ctrl(ctrl)) 562 + ctratt |= NVME_CTRL_ATTR_RHII; 563 + id->ctratt = cpu_to_le32(ctratt); 713 564 714 565 id->oacs = 0; 715 566 ··· 1258 1105 return 0; 1259 1106 } 1260 1107 1108 + static u16 nvmet_set_feat_host_id(struct nvmet_req *req) 1109 + { 1110 + struct nvmet_ctrl *ctrl = req->sq->ctrl; 1111 + 1112 + if (!nvmet_is_pci_ctrl(ctrl)) 1113 + return NVME_SC_CMD_SEQ_ERROR | NVME_STATUS_DNR; 1114 + 1115 + /* 1116 + * The NVMe base specifications v2.1 recommends supporting 128-bits host 1117 + * IDs (section 5.1.25.1.28.1). However, that same section also says 1118 + * that "The controller may support a 64-bit Host Identifier and/or an 1119 + * extended 128-bit Host Identifier". So simplify this support and do 1120 + * not support 64-bits host IDs to avoid needing to check that all 1121 + * controllers associated with the same subsystem all use the same host 1122 + * ID size. 
1123 + */ 1124 + if (!(req->cmd->common.cdw11 & cpu_to_le32(1 << 0))) { 1125 + req->error_loc = offsetof(struct nvme_common_command, cdw11); 1126 + return NVME_SC_INVALID_FIELD | NVME_STATUS_DNR; 1127 + } 1128 + 1129 + return nvmet_copy_from_sgl(req, 0, &req->sq->ctrl->hostid, 1130 + sizeof(req->sq->ctrl->hostid)); 1131 + } 1132 + 1133 + static u16 nvmet_set_feat_irq_coalesce(struct nvmet_req *req) 1134 + { 1135 + struct nvmet_ctrl *ctrl = req->sq->ctrl; 1136 + u32 cdw11 = le32_to_cpu(req->cmd->common.cdw11); 1137 + struct nvmet_feat_irq_coalesce irqc = { 1138 + .time = (cdw11 >> 8) & 0xff, 1139 + .thr = cdw11 & 0xff, 1140 + }; 1141 + 1142 + /* 1143 + * This feature is not supported for fabrics controllers and mandatory 1144 + * for PCI controllers. 1145 + */ 1146 + if (!nvmet_is_pci_ctrl(ctrl)) { 1147 + req->error_loc = offsetof(struct nvme_common_command, cdw10); 1148 + return NVME_SC_INVALID_FIELD | NVME_STATUS_DNR; 1149 + } 1150 + 1151 + return ctrl->ops->set_feature(ctrl, NVME_FEAT_IRQ_COALESCE, &irqc); 1152 + } 1153 + 1154 + static u16 nvmet_set_feat_irq_config(struct nvmet_req *req) 1155 + { 1156 + struct nvmet_ctrl *ctrl = req->sq->ctrl; 1157 + u32 cdw11 = le32_to_cpu(req->cmd->common.cdw11); 1158 + struct nvmet_feat_irq_config irqcfg = { 1159 + .iv = cdw11 & 0xffff, 1160 + .cd = (cdw11 >> 16) & 0x1, 1161 + }; 1162 + 1163 + /* 1164 + * This feature is not supported for fabrics controllers and mandatory 1165 + * for PCI controllers. 
1166 + */ 1167 + if (!nvmet_is_pci_ctrl(ctrl)) { 1168 + req->error_loc = offsetof(struct nvme_common_command, cdw10); 1169 + return NVME_SC_INVALID_FIELD | NVME_STATUS_DNR; 1170 + } 1171 + 1172 + return ctrl->ops->set_feature(ctrl, NVME_FEAT_IRQ_CONFIG, &irqcfg); 1173 + } 1174 + 1175 + static u16 nvmet_set_feat_arbitration(struct nvmet_req *req) 1176 + { 1177 + struct nvmet_ctrl *ctrl = req->sq->ctrl; 1178 + u32 cdw11 = le32_to_cpu(req->cmd->common.cdw11); 1179 + struct nvmet_feat_arbitration arb = { 1180 + .hpw = (cdw11 >> 24) & 0xff, 1181 + .mpw = (cdw11 >> 16) & 0xff, 1182 + .lpw = (cdw11 >> 8) & 0xff, 1183 + .ab = cdw11 & 0x3, 1184 + }; 1185 + 1186 + if (!ctrl->ops->set_feature) { 1187 + req->error_loc = offsetof(struct nvme_common_command, cdw10); 1188 + return NVME_SC_INVALID_FIELD | NVME_STATUS_DNR; 1189 + } 1190 + 1191 + return ctrl->ops->set_feature(ctrl, NVME_FEAT_ARBITRATION, &arb); 1192 + } 1193 + 1261 1194 void nvmet_execute_set_features(struct nvmet_req *req) 1262 1195 { 1263 1196 struct nvmet_subsys *subsys = nvmet_req_subsys(req); ··· 1357 1118 return; 1358 1119 1359 1120 switch (cdw10 & 0xff) { 1121 + case NVME_FEAT_ARBITRATION: 1122 + status = nvmet_set_feat_arbitration(req); 1123 + break; 1360 1124 case NVME_FEAT_NUM_QUEUES: 1361 1125 ncqr = (cdw11 >> 16) & 0xffff; 1362 1126 nsqr = cdw11 & 0xffff; ··· 1370 1128 nvmet_set_result(req, 1371 1129 (subsys->max_qid - 1) | ((subsys->max_qid - 1) << 16)); 1372 1130 break; 1131 + case NVME_FEAT_IRQ_COALESCE: 1132 + status = nvmet_set_feat_irq_coalesce(req); 1133 + break; 1134 + case NVME_FEAT_IRQ_CONFIG: 1135 + status = nvmet_set_feat_irq_config(req); 1136 + break; 1373 1137 case NVME_FEAT_KATO: 1374 1138 status = nvmet_set_feat_kato(req); 1375 1139 break; ··· 1383 1135 status = nvmet_set_feat_async_event(req, NVMET_AEN_CFG_ALL); 1384 1136 break; 1385 1137 case NVME_FEAT_HOST_ID: 1386 - status = NVME_SC_CMD_SEQ_ERROR | NVME_STATUS_DNR; 1138 + status = nvmet_set_feat_host_id(req); 1387 1139 break; 1388 
1140 case NVME_FEAT_WRITE_PROTECT: 1389 1141 status = nvmet_set_feat_write_protect(req); ··· 1420 1172 return 0; 1421 1173 } 1422 1174 1175 + static u16 nvmet_get_feat_irq_coalesce(struct nvmet_req *req) 1176 + { 1177 + struct nvmet_ctrl *ctrl = req->sq->ctrl; 1178 + struct nvmet_feat_irq_coalesce irqc = { }; 1179 + u16 status; 1180 + 1181 + /* 1182 + * This feature is not supported for fabrics controllers and mandatory 1183 + * for PCI controllers. 1184 + */ 1185 + if (!nvmet_is_pci_ctrl(ctrl)) { 1186 + req->error_loc = offsetof(struct nvme_common_command, cdw10); 1187 + return NVME_SC_INVALID_FIELD | NVME_STATUS_DNR; 1188 + } 1189 + 1190 + status = ctrl->ops->get_feature(ctrl, NVME_FEAT_IRQ_COALESCE, &irqc); 1191 + if (status != NVME_SC_SUCCESS) 1192 + return status; 1193 + 1194 + nvmet_set_result(req, ((u32)irqc.time << 8) | (u32)irqc.thr); 1195 + 1196 + return NVME_SC_SUCCESS; 1197 + } 1198 + 1199 + static u16 nvmet_get_feat_irq_config(struct nvmet_req *req) 1200 + { 1201 + struct nvmet_ctrl *ctrl = req->sq->ctrl; 1202 + u32 iv = le32_to_cpu(req->cmd->common.cdw11) & 0xffff; 1203 + struct nvmet_feat_irq_config irqcfg = { .iv = iv }; 1204 + u16 status; 1205 + 1206 + /* 1207 + * This feature is not supported for fabrics controllers and mandatory 1208 + * for PCI controllers. 
1209 + */ 1210 + if (!nvmet_is_pci_ctrl(ctrl)) { 1211 + req->error_loc = offsetof(struct nvme_common_command, cdw10); 1212 + return NVME_SC_INVALID_FIELD | NVME_STATUS_DNR; 1213 + } 1214 + 1215 + status = ctrl->ops->get_feature(ctrl, NVME_FEAT_IRQ_CONFIG, &irqcfg); 1216 + if (status != NVME_SC_SUCCESS) 1217 + return status; 1218 + 1219 + nvmet_set_result(req, ((u32)irqcfg.cd << 16) | iv); 1220 + 1221 + return NVME_SC_SUCCESS; 1222 + } 1223 + 1224 + static u16 nvmet_get_feat_arbitration(struct nvmet_req *req) 1225 + { 1226 + struct nvmet_ctrl *ctrl = req->sq->ctrl; 1227 + struct nvmet_feat_arbitration arb = { }; 1228 + u16 status; 1229 + 1230 + if (!ctrl->ops->get_feature) { 1231 + req->error_loc = offsetof(struct nvme_common_command, cdw10); 1232 + return NVME_SC_INVALID_FIELD | NVME_STATUS_DNR; 1233 + } 1234 + 1235 + status = ctrl->ops->get_feature(ctrl, NVME_FEAT_ARBITRATION, &arb); 1236 + if (status != NVME_SC_SUCCESS) 1237 + return status; 1238 + 1239 + nvmet_set_result(req, 1240 + ((u32)arb.hpw << 24) | 1241 + ((u32)arb.mpw << 16) | 1242 + ((u32)arb.lpw << 8) | 1243 + (arb.ab & 0x3)); 1244 + 1245 + return NVME_SC_SUCCESS; 1246 + } 1247 + 1423 1248 void nvmet_get_feat_kato(struct nvmet_req *req) 1424 1249 { 1425 1250 nvmet_set_result(req, req->sq->ctrl->kato * 1000); ··· 1519 1198 * need to come up with some fake values for these. 
1520 1199 */ 1521 1200 #if 0 1522 - case NVME_FEAT_ARBITRATION: 1523 - break; 1524 1201 case NVME_FEAT_POWER_MGMT: 1525 1202 break; 1526 1203 case NVME_FEAT_TEMP_THRESH: 1527 1204 break; 1528 1205 case NVME_FEAT_ERR_RECOVERY: 1529 1206 break; 1530 - case NVME_FEAT_IRQ_COALESCE: 1531 - break; 1532 - case NVME_FEAT_IRQ_CONFIG: 1533 - break; 1534 1207 case NVME_FEAT_WRITE_ATOMIC: 1535 1208 break; 1536 1209 #endif 1210 + case NVME_FEAT_ARBITRATION: 1211 + status = nvmet_get_feat_arbitration(req); 1212 + break; 1213 + case NVME_FEAT_IRQ_COALESCE: 1214 + status = nvmet_get_feat_irq_coalesce(req); 1215 + break; 1216 + case NVME_FEAT_IRQ_CONFIG: 1217 + status = nvmet_get_feat_irq_config(req); 1218 + break; 1537 1219 case NVME_FEAT_ASYNC_EVENT: 1538 1220 nvmet_get_feat_async_event(req); 1539 1221 break; ··· 1617 1293 nvmet_req_complete(req, status); 1618 1294 } 1619 1295 1296 + u32 nvmet_admin_cmd_data_len(struct nvmet_req *req) 1297 + { 1298 + struct nvme_command *cmd = req->cmd; 1299 + 1300 + if (nvme_is_fabrics(cmd)) 1301 + return nvmet_fabrics_admin_cmd_data_len(req); 1302 + if (nvmet_is_disc_subsys(nvmet_req_subsys(req))) 1303 + return nvmet_discovery_cmd_data_len(req); 1304 + 1305 + switch (cmd->common.opcode) { 1306 + case nvme_admin_get_log_page: 1307 + return nvmet_get_log_page_len(cmd); 1308 + case nvme_admin_identify: 1309 + return NVME_IDENTIFY_DATA_SIZE; 1310 + case nvme_admin_get_features: 1311 + return nvmet_feat_data_len(req, le32_to_cpu(cmd->common.cdw10)); 1312 + default: 1313 + return 0; 1314 + } 1315 + } 1316 + 1620 1317 u16 nvmet_parse_admin_cmd(struct nvmet_req *req) 1621 1318 { 1622 1319 struct nvme_command *cmd = req->cmd; ··· 1652 1307 if (unlikely(ret)) 1653 1308 return ret; 1654 1309 1310 + /* For PCI controllers, admin commands shall not use SGL. 
*/ 1311 + if (nvmet_is_pci_ctrl(req->sq->ctrl) && !req->sq->qid && 1312 + cmd->common.flags & NVME_CMD_SGL_ALL) 1313 + return NVME_SC_INVALID_FIELD | NVME_STATUS_DNR; 1314 + 1655 1315 if (nvmet_is_passthru_req(req)) 1656 1316 return nvmet_parse_passthru_admin_cmd(req); 1657 1317 1658 1318 switch (cmd->common.opcode) { 1319 + case nvme_admin_delete_sq: 1320 + req->execute = nvmet_execute_delete_sq; 1321 + return 0; 1322 + case nvme_admin_create_sq: 1323 + req->execute = nvmet_execute_create_sq; 1324 + return 0; 1659 1325 case nvme_admin_get_log_page: 1660 1326 req->execute = nvmet_execute_get_log_page; 1327 + return 0; 1328 + case nvme_admin_delete_cq: 1329 + req->execute = nvmet_execute_delete_cq; 1330 + return 0; 1331 + case nvme_admin_create_cq: 1332 + req->execute = nvmet_execute_create_cq; 1661 1333 return 0; 1662 1334 case nvme_admin_identify: 1663 1335 req->execute = nvmet_execute_identify;
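The cdw11 bit layout that nvmet_set_feat_irq_coalesce() and nvmet_get_feat_irq_coalesce() agree on — aggregation threshold in bits 7:0, aggregation time in bits 15:8 — round-trips cleanly. A minimal sketch (hypothetical helper names, mirroring the nvmet_feat_irq_coalesce fields above):

```c
#include <stdint.h>

/*
 * Sketch of the IRQ coalescing feature's cdw11 encoding used in the
 * hunk above: bits 7:0 carry the aggregation threshold (thr) and
 * bits 15:8 the aggregation time.
 */
struct irq_coalesce {
	uint8_t time;
	uint8_t thr;
};

static struct irq_coalesce irqc_unpack(uint32_t cdw11)
{
	struct irq_coalesce c = {
		.time = (cdw11 >> 8) & 0xff,
		.thr  = cdw11 & 0xff,
	};
	return c;
}

static uint32_t irqc_pack(struct irq_coalesce c)
{
	return ((uint32_t)c.time << 8) | c.thr;
}
```

The set path unpacks cdw11 into the feature struct before handing it to ctrl->ops->set_feature(), and the get path packs the struct back into the command result, so the two directions must use the same field positions.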
+49
drivers/nvme/target/configfs.c
··· 37 37 { NVMF_TRTYPE_RDMA, "rdma" }, 38 38 { NVMF_TRTYPE_FC, "fc" }, 39 39 { NVMF_TRTYPE_TCP, "tcp" }, 40 + { NVMF_TRTYPE_PCI, "pci" }, 40 41 { NVMF_TRTYPE_LOOP, "loop" }, 41 42 }; 42 43 ··· 47 46 { NVMF_ADDR_FAMILY_IP6, "ipv6" }, 48 47 { NVMF_ADDR_FAMILY_IB, "ib" }, 49 48 { NVMF_ADDR_FAMILY_FC, "fc" }, 49 + { NVMF_ADDR_FAMILY_PCI, "pci" }, 50 50 { NVMF_ADDR_FAMILY_LOOP, "loop" }, 51 51 }; 52 52 ··· 1402 1400 } 1403 1401 CONFIGFS_ATTR(nvmet_subsys_, attr_cntlid_max); 1404 1402 1403 + static ssize_t nvmet_subsys_attr_vendor_id_show(struct config_item *item, 1404 + char *page) 1405 + { 1406 + return snprintf(page, PAGE_SIZE, "0x%x\n", to_subsys(item)->vendor_id); 1407 + } 1408 + 1409 + static ssize_t nvmet_subsys_attr_vendor_id_store(struct config_item *item, 1410 + const char *page, size_t count) 1411 + { 1412 + u16 vid; 1413 + 1414 + if (kstrtou16(page, 0, &vid)) 1415 + return -EINVAL; 1416 + 1417 + down_write(&nvmet_config_sem); 1418 + to_subsys(item)->vendor_id = vid; 1419 + up_write(&nvmet_config_sem); 1420 + return count; 1421 + } 1422 + CONFIGFS_ATTR(nvmet_subsys_, attr_vendor_id); 1423 + 1424 + static ssize_t nvmet_subsys_attr_subsys_vendor_id_show(struct config_item *item, 1425 + char *page) 1426 + { 1427 + return snprintf(page, PAGE_SIZE, "0x%x\n", 1428 + to_subsys(item)->subsys_vendor_id); 1429 + } 1430 + 1431 + static ssize_t nvmet_subsys_attr_subsys_vendor_id_store(struct config_item *item, 1432 + const char *page, size_t count) 1433 + { 1434 + u16 ssvid; 1435 + 1436 + if (kstrtou16(page, 0, &ssvid)) 1437 + return -EINVAL; 1438 + 1439 + down_write(&nvmet_config_sem); 1440 + to_subsys(item)->subsys_vendor_id = ssvid; 1441 + up_write(&nvmet_config_sem); 1442 + return count; 1443 + } 1444 + CONFIGFS_ATTR(nvmet_subsys_, attr_subsys_vendor_id); 1445 + 1405 1446 static ssize_t nvmet_subsys_attr_model_show(struct config_item *item, 1406 1447 char *page) 1407 1448 { ··· 1673 1628 &nvmet_subsys_attr_attr_serial, 1674 1629 &nvmet_subsys_attr_attr_cntlid_min, 
1675 1630 &nvmet_subsys_attr_attr_cntlid_max, 1631 + &nvmet_subsys_attr_attr_vendor_id, 1632 + &nvmet_subsys_attr_attr_subsys_vendor_id, 1676 1633 &nvmet_subsys_attr_attr_model, 1677 1634 &nvmet_subsys_attr_attr_qid_max, 1678 1635 &nvmet_subsys_attr_attr_ieee_oui, ··· 1829 1782 return ERR_PTR(-ENOMEM); 1830 1783 1831 1784 INIT_LIST_HEAD(&port->entry); 1785 + port->disc_addr.trtype = NVMF_TRTYPE_MAX; 1832 1786 config_group_init_type_name(&port->group, name, &nvmet_referral_type); 1833 1787 1834 1788 return &port->group; ··· 2055 2007 port->inline_data_size = -1; /* < 0 == let the transport choose */ 2056 2008 port->max_queue_size = -1; /* < 0 == let the transport choose */ 2057 2009 2010 + port->disc_addr.trtype = NVMF_TRTYPE_MAX; 2058 2011 port->disc_addr.portid = cpu_to_le16(portid); 2059 2012 port->disc_addr.adrfam = NVMF_ADDR_FAMILY_MAX; 2060 2013 port->disc_addr.treq = NVMF_TREQ_DISABLE_SQFLOW;
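The configfs changes above add writable `attr_vendor_id` and `attr_subsys_vendor_id` attributes on the subsystem (parsed with `kstrtou16` under `nvmet_config_sem`). A hypothetical usage sketch, assuming the standard nvmet configfs mount point and a subsystem named `testnqn` (both names are examples, not from the patch):

```
# Configure the vendor IDs reported by the target's Identify Controller data.
cd /sys/kernel/config/nvmet/subsystems/testnqn
echo 0x1b96 > attr_vendor_id          # VID field
echo 0x1b96 > attr_subsys_vendor_id   # SSVID field
cat attr_vendor_id                    # shown back in 0x%x form
```

Because the store handlers accept `kstrtou16(page, 0, ...)`, both decimal and `0x`-prefixed hexadecimal input work.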
+195 -71
drivers/nvme/target/core.c
··· 836 836 complete(&sq->confirm_done); 837 837 } 838 838 839 + u16 nvmet_check_cqid(struct nvmet_ctrl *ctrl, u16 cqid) 840 + { 841 + if (!ctrl->sqs) 842 + return NVME_SC_INTERNAL | NVME_STATUS_DNR; 843 + 844 + if (cqid > ctrl->subsys->max_qid) 845 + return NVME_SC_QID_INVALID | NVME_STATUS_DNR; 846 + 847 + /* 848 + * Note: For PCI controllers, the NVMe specification allows multiple 849 + * SQs to share a single CQ. However, we do not support this yet, so 850 + * check that there is no SQ defined for a CQ. If one exists, then the 851 + * CQ ID is invalid for creation as well as when the CQ is being 852 + * deleted (as that would mean that the SQ was not deleted before the 853 + * CQ). 854 + */ 855 + if (ctrl->sqs[cqid]) 856 + return NVME_SC_QID_INVALID | NVME_STATUS_DNR; 857 + 858 + return NVME_SC_SUCCESS; 859 + } 860 + 861 + u16 nvmet_cq_create(struct nvmet_ctrl *ctrl, struct nvmet_cq *cq, 862 + u16 qid, u16 size) 863 + { 864 + u16 status; 865 + 866 + status = nvmet_check_cqid(ctrl, qid); 867 + if (status != NVME_SC_SUCCESS) 868 + return status; 869 + 870 + nvmet_cq_setup(ctrl, cq, qid, size); 871 + 872 + return NVME_SC_SUCCESS; 873 + } 874 + EXPORT_SYMBOL_GPL(nvmet_cq_create); 875 + 876 + u16 nvmet_check_sqid(struct nvmet_ctrl *ctrl, u16 sqid, 877 + bool create) 878 + { 879 + if (!ctrl->sqs) 880 + return NVME_SC_INTERNAL | NVME_STATUS_DNR; 881 + 882 + if (sqid > ctrl->subsys->max_qid) 883 + return NVME_SC_QID_INVALID | NVME_STATUS_DNR; 884 + 885 + if ((create && ctrl->sqs[sqid]) || 886 + (!create && !ctrl->sqs[sqid])) 887 + return NVME_SC_QID_INVALID | NVME_STATUS_DNR; 888 + 889 + return NVME_SC_SUCCESS; 890 + } 891 + 892 + u16 nvmet_sq_create(struct nvmet_ctrl *ctrl, struct nvmet_sq *sq, 893 + u16 sqid, u16 size) 894 + { 895 + u16 status; 896 + int ret; 897 + 898 + if (!kref_get_unless_zero(&ctrl->ref)) 899 + return NVME_SC_INTERNAL | NVME_STATUS_DNR; 900 + 901 + status = nvmet_check_sqid(ctrl, sqid, true); 902 + if (status != NVME_SC_SUCCESS) 903 + return 
status; 904 + 905 + ret = nvmet_sq_init(sq); 906 + if (ret) { 907 + status = NVME_SC_INTERNAL | NVME_STATUS_DNR; 908 + goto ctrl_put; 909 + } 910 + 911 + nvmet_sq_setup(ctrl, sq, sqid, size); 912 + sq->ctrl = ctrl; 913 + 914 + return NVME_SC_SUCCESS; 915 + 916 + ctrl_put: 917 + nvmet_ctrl_put(ctrl); 918 + return status; 919 + } 920 + EXPORT_SYMBOL_GPL(nvmet_sq_create); 921 + 839 922 void nvmet_sq_destroy(struct nvmet_sq *sq) 840 923 { 841 924 struct nvmet_ctrl *ctrl = sq->ctrl; ··· 1010 927 } 1011 928 1012 929 return 0; 930 + } 931 + 932 + static u32 nvmet_io_cmd_transfer_len(struct nvmet_req *req) 933 + { 934 + struct nvme_command *cmd = req->cmd; 935 + u32 metadata_len = 0; 936 + 937 + if (nvme_is_fabrics(cmd)) 938 + return nvmet_fabrics_io_cmd_data_len(req); 939 + 940 + if (!req->ns) 941 + return 0; 942 + 943 + switch (req->cmd->common.opcode) { 944 + case nvme_cmd_read: 945 + case nvme_cmd_write: 946 + case nvme_cmd_zone_append: 947 + if (req->sq->ctrl->pi_support && nvmet_ns_has_pi(req->ns)) 948 + metadata_len = nvmet_rw_metadata_len(req); 949 + return nvmet_rw_data_len(req) + metadata_len; 950 + case nvme_cmd_dsm: 951 + return nvmet_dsm_len(req); 952 + case nvme_cmd_zone_mgmt_recv: 953 + return (le32_to_cpu(req->cmd->zmr.numd) + 1) << 2; 954 + default: 955 + return 0; 956 + } 1013 957 } 1014 958 1015 959 static u16 nvmet_parse_io_cmd(struct nvmet_req *req) ··· 1140 1030 /* 1141 1031 * For fabrics, PSDT field shall describe metadata pointer (MPTR) that 1142 1032 * contains an address of a single contiguous physical buffer that is 1143 - * byte aligned. 1033 + * byte aligned. For PCI controllers, this is optional so not enforced. 
1144 1034 */ 1145 1035 if (unlikely((flags & NVME_CMD_SGL_ALL) != NVME_CMD_SGL_METABUF)) { 1146 - req->error_loc = offsetof(struct nvme_common_command, flags); 1147 - status = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR; 1148 - goto fail; 1036 + if (!req->sq->ctrl || !nvmet_is_pci_ctrl(req->sq->ctrl)) { 1037 + req->error_loc = 1038 + offsetof(struct nvme_common_command, flags); 1039 + status = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR; 1040 + goto fail; 1041 + } 1149 1042 } 1150 1043 1151 1044 if (unlikely(!req->sq->ctrl)) ··· 1190 1077 } 1191 1078 EXPORT_SYMBOL_GPL(nvmet_req_uninit); 1192 1079 1080 + size_t nvmet_req_transfer_len(struct nvmet_req *req) 1081 + { 1082 + if (likely(req->sq->qid != 0)) 1083 + return nvmet_io_cmd_transfer_len(req); 1084 + if (unlikely(!req->sq->ctrl)) 1085 + return nvmet_connect_cmd_data_len(req); 1086 + return nvmet_admin_cmd_data_len(req); 1087 + } 1088 + EXPORT_SYMBOL_GPL(nvmet_req_transfer_len); 1089 + 1193 1090 bool nvmet_check_transfer_len(struct nvmet_req *req, size_t len) 1194 1091 { 1195 1092 if (unlikely(len != req->transfer_len)) { 1093 + u16 status; 1094 + 1196 1095 req->error_loc = offsetof(struct nvme_common_command, dptr); 1197 - nvmet_req_complete(req, NVME_SC_SGL_INVALID_DATA | NVME_STATUS_DNR); 1096 + if (req->cmd->common.flags & NVME_CMD_SGL_ALL) 1097 + status = NVME_SC_SGL_INVALID_DATA; 1098 + else 1099 + status = NVME_SC_INVALID_FIELD; 1100 + nvmet_req_complete(req, status | NVME_STATUS_DNR); 1198 1101 return false; 1199 1102 } 1200 1103 ··· 1221 1092 bool nvmet_check_data_len_lte(struct nvmet_req *req, size_t data_len) 1222 1093 { 1223 1094 if (unlikely(data_len > req->transfer_len)) { 1095 + u16 status; 1096 + 1224 1097 req->error_loc = offsetof(struct nvme_common_command, dptr); 1225 - nvmet_req_complete(req, NVME_SC_SGL_INVALID_DATA | NVME_STATUS_DNR); 1098 + if (req->cmd->common.flags & NVME_CMD_SGL_ALL) 1099 + status = NVME_SC_SGL_INVALID_DATA; 1100 + else 1101 + status = NVME_SC_INVALID_FIELD; 1102 + 
nvmet_req_complete(req, status | NVME_STATUS_DNR); 1226 1103 return false; 1227 1104 } 1228 1105 ··· 1319 1184 } 1320 1185 EXPORT_SYMBOL_GPL(nvmet_req_free_sgls); 1321 1186 1322 - static inline bool nvmet_cc_en(u32 cc) 1323 - { 1324 - return (cc >> NVME_CC_EN_SHIFT) & 0x1; 1325 - } 1326 - 1327 - static inline u8 nvmet_cc_css(u32 cc) 1328 - { 1329 - return (cc >> NVME_CC_CSS_SHIFT) & 0x7; 1330 - } 1331 - 1332 - static inline u8 nvmet_cc_mps(u32 cc) 1333 - { 1334 - return (cc >> NVME_CC_MPS_SHIFT) & 0xf; 1335 - } 1336 - 1337 - static inline u8 nvmet_cc_ams(u32 cc) 1338 - { 1339 - return (cc >> NVME_CC_AMS_SHIFT) & 0x7; 1340 - } 1341 - 1342 - static inline u8 nvmet_cc_shn(u32 cc) 1343 - { 1344 - return (cc >> NVME_CC_SHN_SHIFT) & 0x3; 1345 - } 1346 - 1347 - static inline u8 nvmet_cc_iosqes(u32 cc) 1348 - { 1349 - return (cc >> NVME_CC_IOSQES_SHIFT) & 0xf; 1350 - } 1351 - 1352 - static inline u8 nvmet_cc_iocqes(u32 cc) 1353 - { 1354 - return (cc >> NVME_CC_IOCQES_SHIFT) & 0xf; 1355 - } 1356 - 1357 1187 static inline bool nvmet_css_supported(u8 cc_css) 1358 1188 { 1359 1189 switch (cc_css << NVME_CC_CSS_SHIFT) { ··· 1395 1295 ctrl->csts &= ~NVME_CSTS_SHST_CMPLT; 1396 1296 mutex_unlock(&ctrl->lock); 1397 1297 } 1298 + EXPORT_SYMBOL_GPL(nvmet_update_cc); 1398 1299 1399 1300 static void nvmet_init_cap(struct nvmet_ctrl *ctrl) 1400 1301 { ··· 1503 1402 * Note: ctrl->subsys->lock should be held when calling this function 1504 1403 */ 1505 1404 static void nvmet_setup_p2p_ns_map(struct nvmet_ctrl *ctrl, 1506 - struct nvmet_req *req) 1405 + struct device *p2p_client) 1507 1406 { 1508 1407 struct nvmet_ns *ns; 1509 1408 unsigned long idx; 1510 1409 1511 - if (!req->p2p_client) 1410 + if (!p2p_client) 1512 1411 return; 1513 1412 1514 - ctrl->p2p_client = get_device(req->p2p_client); 1413 + ctrl->p2p_client = get_device(p2p_client); 1515 1414 1516 1415 nvmet_for_each_enabled_ns(&ctrl->subsys->namespaces, idx, ns) 1517 1416 nvmet_p2pmem_ns_add_p2p(ctrl, ns); ··· 1540 1439 
ctrl->ops->delete_ctrl(ctrl); 1541 1440 } 1542 1441 1543 - u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn, 1544 - struct nvmet_req *req, u32 kato, struct nvmet_ctrl **ctrlp, 1545 - uuid_t *hostid) 1442 + struct nvmet_ctrl *nvmet_alloc_ctrl(struct nvmet_alloc_ctrl_args *args) 1546 1443 { 1547 1444 struct nvmet_subsys *subsys; 1548 1445 struct nvmet_ctrl *ctrl; 1446 + u32 kato = args->kato; 1447 + u8 dhchap_status; 1549 1448 int ret; 1550 - u16 status; 1551 1449 1552 - status = NVME_SC_CONNECT_INVALID_PARAM | NVME_STATUS_DNR; 1553 - subsys = nvmet_find_get_subsys(req->port, subsysnqn); 1450 + args->status = NVME_SC_CONNECT_INVALID_PARAM | NVME_STATUS_DNR; 1451 + subsys = nvmet_find_get_subsys(args->port, args->subsysnqn); 1554 1452 if (!subsys) { 1555 1453 pr_warn("connect request for invalid subsystem %s!\n", 1556 - subsysnqn); 1557 - req->cqe->result.u32 = IPO_IATTR_CONNECT_DATA(subsysnqn); 1558 - req->error_loc = offsetof(struct nvme_common_command, dptr); 1559 - goto out; 1454 + args->subsysnqn); 1455 + args->result = IPO_IATTR_CONNECT_DATA(subsysnqn); 1456 + args->error_loc = offsetof(struct nvme_common_command, dptr); 1457 + return NULL; 1560 1458 } 1561 1459 1562 1460 down_read(&nvmet_config_sem); 1563 - if (!nvmet_host_allowed(subsys, hostnqn)) { 1461 + if (!nvmet_host_allowed(subsys, args->hostnqn)) { 1564 1462 pr_info("connect by host %s for subsystem %s not allowed\n", 1565 - hostnqn, subsysnqn); 1566 - req->cqe->result.u32 = IPO_IATTR_CONNECT_DATA(hostnqn); 1463 + args->hostnqn, args->subsysnqn); 1464 + args->result = IPO_IATTR_CONNECT_DATA(hostnqn); 1567 1465 up_read(&nvmet_config_sem); 1568 - status = NVME_SC_CONNECT_INVALID_HOST | NVME_STATUS_DNR; 1569 - req->error_loc = offsetof(struct nvme_common_command, dptr); 1466 + args->status = NVME_SC_CONNECT_INVALID_HOST | NVME_STATUS_DNR; 1467 + args->error_loc = offsetof(struct nvme_common_command, dptr); 1570 1468 goto out_put_subsystem; 1571 1469 } 1572 1470 up_read(&nvmet_config_sem); 
1573 1471 1574 - status = NVME_SC_INTERNAL; 1472 + args->status = NVME_SC_INTERNAL; 1575 1473 ctrl = kzalloc(sizeof(*ctrl), GFP_KERNEL); 1576 1474 if (!ctrl) 1577 1475 goto out_put_subsystem; 1578 1476 mutex_init(&ctrl->lock); 1579 1477 1580 - ctrl->port = req->port; 1581 - ctrl->ops = req->ops; 1478 + ctrl->port = args->port; 1479 + ctrl->ops = args->ops; 1582 1480 1583 1481 #ifdef CONFIG_NVME_TARGET_PASSTHRU 1584 1482 /* By default, set loop targets to clear IDS by default */ ··· 1591 1491 INIT_WORK(&ctrl->fatal_err_work, nvmet_fatal_error_handler); 1592 1492 INIT_DELAYED_WORK(&ctrl->ka_work, nvmet_keep_alive_timer); 1593 1493 1594 - memcpy(ctrl->subsysnqn, subsysnqn, NVMF_NQN_SIZE); 1595 - memcpy(ctrl->hostnqn, hostnqn, NVMF_NQN_SIZE); 1494 + memcpy(ctrl->subsysnqn, args->subsysnqn, NVMF_NQN_SIZE); 1495 + memcpy(ctrl->hostnqn, args->hostnqn, NVMF_NQN_SIZE); 1596 1496 1597 1497 kref_init(&ctrl->ref); 1598 1498 ctrl->subsys = subsys; ··· 1615 1515 subsys->cntlid_min, subsys->cntlid_max, 1616 1516 GFP_KERNEL); 1617 1517 if (ret < 0) { 1618 - status = NVME_SC_CONNECT_CTRL_BUSY | NVME_STATUS_DNR; 1518 + args->status = NVME_SC_CONNECT_CTRL_BUSY | NVME_STATUS_DNR; 1619 1519 goto out_free_sqs; 1620 1520 } 1621 1521 ctrl->cntlid = ret; 1622 1522 1623 - uuid_copy(&ctrl->hostid, hostid); 1523 + uuid_copy(&ctrl->hostid, args->hostid); 1624 1524 1625 1525 /* 1626 1526 * Discovery controllers may use some arbitrary high value ··· 1642 1542 if (ret) 1643 1543 goto init_pr_fail; 1644 1544 list_add_tail(&ctrl->subsys_entry, &subsys->ctrls); 1645 - nvmet_setup_p2p_ns_map(ctrl, req); 1545 + nvmet_setup_p2p_ns_map(ctrl, args->p2p_client); 1646 1546 nvmet_debugfs_ctrl_setup(ctrl); 1647 1547 mutex_unlock(&subsys->lock); 1648 1548 1649 - *ctrlp = ctrl; 1650 - return 0; 1549 + if (args->hostid) 1550 + uuid_copy(&ctrl->hostid, args->hostid); 1551 + 1552 + dhchap_status = nvmet_setup_auth(ctrl); 1553 + if (dhchap_status) { 1554 + pr_err("Failed to setup authentication, dhchap status 
%u\n", 1555 + dhchap_status); 1556 + nvmet_ctrl_put(ctrl); 1557 + if (dhchap_status == NVME_AUTH_DHCHAP_FAILURE_FAILED) 1558 + args->status = 1559 + NVME_SC_CONNECT_INVALID_HOST | NVME_STATUS_DNR; 1560 + else 1561 + args->status = NVME_SC_INTERNAL; 1562 + return NULL; 1563 + } 1564 + 1565 + args->status = NVME_SC_SUCCESS; 1566 + 1567 + pr_info("Created %s controller %d for subsystem %s for NQN %s%s%s.\n", 1568 + nvmet_is_disc_subsys(ctrl->subsys) ? "discovery" : "nvm", 1569 + ctrl->cntlid, ctrl->subsys->subsysnqn, ctrl->hostnqn, 1570 + ctrl->pi_support ? " T10-PI is enabled" : "", 1571 + nvmet_has_auth(ctrl) ? " with DH-HMAC-CHAP" : ""); 1572 + 1573 + return ctrl; 1651 1574 1652 1575 init_pr_fail: 1653 1576 mutex_unlock(&subsys->lock); ··· 1684 1561 kfree(ctrl); 1685 1562 out_put_subsystem: 1686 1563 nvmet_subsys_put(subsys); 1687 - out: 1688 - return status; 1564 + return NULL; 1689 1565 } 1566 + EXPORT_SYMBOL_GPL(nvmet_alloc_ctrl); 1690 1567 1691 1568 static void nvmet_ctrl_free(struct kref *ref) 1692 1569 { ··· 1722 1599 { 1723 1600 kref_put(&ctrl->ref, nvmet_ctrl_free); 1724 1601 } 1602 + EXPORT_SYMBOL_GPL(nvmet_ctrl_put); 1725 1603 1726 1604 void nvmet_ctrl_fatal_error(struct nvmet_ctrl *ctrl) 1727 1605 {
+17
drivers/nvme/target/discovery.c
··· 224 224 } 225 225 226 226 list_for_each_entry(r, &req->port->referrals, entry) { 227 + if (r->disc_addr.trtype == NVMF_TRTYPE_PCI) 228 + continue; 229 + 227 230 nvmet_format_discovery_entry(hdr, r, 228 231 NVME_DISC_SUBSYS_NAME, 229 232 r->disc_addr.traddr, ··· 353 350 } 354 351 355 352 nvmet_req_complete(req, stat); 353 + } 354 + 355 + u32 nvmet_discovery_cmd_data_len(struct nvmet_req *req) 356 + { 357 + struct nvme_command *cmd = req->cmd; 358 + 359 + switch (cmd->common.opcode) { 360 + case nvme_admin_get_log_page: 361 + return nvmet_get_log_page_len(req->cmd); 362 + case nvme_admin_identify: 363 + return NVME_IDENTIFY_DATA_SIZE; 364 + default: 365 + return 0; 366 + } 356 367 } 357 368 358 369 u16 nvmet_parse_discovery_cmd(struct nvmet_req *req)
+12 -2
drivers/nvme/target/fabrics-cmd-auth.c
··· 179 179 return data->rescode_exp; 180 180 } 181 181 182 + u32 nvmet_auth_send_data_len(struct nvmet_req *req) 183 + { 184 + return le32_to_cpu(req->cmd->auth_send.tl); 185 + } 186 + 182 187 void nvmet_execute_auth_send(struct nvmet_req *req) 183 188 { 184 189 struct nvmet_ctrl *ctrl = req->sq->ctrl; ··· 211 206 offsetof(struct nvmf_auth_send_command, spsp1); 212 207 goto done; 213 208 } 214 - tl = le32_to_cpu(req->cmd->auth_send.tl); 209 + tl = nvmet_auth_send_data_len(req); 215 210 if (!tl) { 216 211 status = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR; 217 212 req->error_loc = ··· 434 429 data->rescode_exp = req->sq->dhchap_status; 435 430 } 436 431 432 + u32 nvmet_auth_receive_data_len(struct nvmet_req *req) 433 + { 434 + return le32_to_cpu(req->cmd->auth_receive.al); 435 + } 436 + 437 437 void nvmet_execute_auth_receive(struct nvmet_req *req) 438 438 { 439 439 struct nvmet_ctrl *ctrl = req->sq->ctrl; ··· 464 454 offsetof(struct nvmf_auth_receive_command, spsp1); 465 455 goto done; 466 456 } 467 - al = le32_to_cpu(req->cmd->auth_receive.al); 457 + al = nvmet_auth_receive_data_len(req); 468 458 if (!al) { 469 459 status = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR; 470 460 req->error_loc =
+70 -33
drivers/nvme/target/fabrics-cmd.c
··· 85 85 nvmet_req_complete(req, status); 86 86 } 87 87 88 + u32 nvmet_fabrics_admin_cmd_data_len(struct nvmet_req *req) 89 + { 90 + struct nvme_command *cmd = req->cmd; 91 + 92 + switch (cmd->fabrics.fctype) { 93 + #ifdef CONFIG_NVME_TARGET_AUTH 94 + case nvme_fabrics_type_auth_send: 95 + return nvmet_auth_send_data_len(req); 96 + case nvme_fabrics_type_auth_receive: 97 + return nvmet_auth_receive_data_len(req); 98 + #endif 99 + default: 100 + return 0; 101 + } 102 + } 103 + 88 104 u16 nvmet_parse_fabrics_admin_cmd(struct nvmet_req *req) 89 105 { 90 106 struct nvme_command *cmd = req->cmd; ··· 128 112 } 129 113 130 114 return 0; 115 + } 116 + 117 + u32 nvmet_fabrics_io_cmd_data_len(struct nvmet_req *req) 118 + { 119 + struct nvme_command *cmd = req->cmd; 120 + 121 + switch (cmd->fabrics.fctype) { 122 + #ifdef CONFIG_NVME_TARGET_AUTH 123 + case nvme_fabrics_type_auth_send: 124 + return nvmet_auth_send_data_len(req); 125 + case nvme_fabrics_type_auth_receive: 126 + return nvmet_auth_receive_data_len(req); 127 + #endif 128 + default: 129 + return 0; 130 + } 131 131 } 132 132 133 133 u16 nvmet_parse_fabrics_io_cmd(struct nvmet_req *req) ··· 245 213 struct nvmf_connect_command *c = &req->cmd->connect; 246 214 struct nvmf_connect_data *d; 247 215 struct nvmet_ctrl *ctrl = NULL; 248 - u16 status; 249 - u8 dhchap_status; 216 + struct nvmet_alloc_ctrl_args args = { 217 + .port = req->port, 218 + .ops = req->ops, 219 + .p2p_client = req->p2p_client, 220 + .kato = le32_to_cpu(c->kato), 221 + }; 250 222 251 223 if (!nvmet_check_transfer_len(req, sizeof(struct nvmf_connect_data))) 252 224 return; 253 225 254 226 d = kmalloc(sizeof(*d), GFP_KERNEL); 255 227 if (!d) { 256 - status = NVME_SC_INTERNAL; 228 + args.status = NVME_SC_INTERNAL; 257 229 goto complete; 258 230 } 259 231 260 - status = nvmet_copy_from_sgl(req, 0, d, sizeof(*d)); 261 - if (status) 232 + args.status = nvmet_copy_from_sgl(req, 0, d, sizeof(*d)); 233 + if (args.status) 262 234 goto out; 263 235 264 236 if 
(c->recfmt != 0) { 265 237 pr_warn("invalid connect version (%d).\n", 266 238 le16_to_cpu(c->recfmt)); 267 - req->error_loc = offsetof(struct nvmf_connect_command, recfmt); 268 - status = NVME_SC_CONNECT_FORMAT | NVME_STATUS_DNR; 239 + args.error_loc = offsetof(struct nvmf_connect_command, recfmt); 240 + args.status = NVME_SC_CONNECT_FORMAT | NVME_STATUS_DNR; 269 241 goto out; 270 242 } 271 243 272 244 if (unlikely(d->cntlid != cpu_to_le16(0xffff))) { 273 245 pr_warn("connect attempt for invalid controller ID %#x\n", 274 246 d->cntlid); 275 - status = NVME_SC_CONNECT_INVALID_PARAM | NVME_STATUS_DNR; 276 - req->cqe->result.u32 = IPO_IATTR_CONNECT_DATA(cntlid); 247 + args.status = NVME_SC_CONNECT_INVALID_PARAM | NVME_STATUS_DNR; 248 + args.result = IPO_IATTR_CONNECT_DATA(cntlid); 277 249 goto out; 278 250 } 279 251 280 252 d->subsysnqn[NVMF_NQN_FIELD_LEN - 1] = '\0'; 281 253 d->hostnqn[NVMF_NQN_FIELD_LEN - 1] = '\0'; 282 - status = nvmet_alloc_ctrl(d->subsysnqn, d->hostnqn, req, 283 - le32_to_cpu(c->kato), &ctrl, &d->hostid); 284 - if (status) 254 + 255 + args.subsysnqn = d->subsysnqn; 256 + args.hostnqn = d->hostnqn; 257 + args.hostid = &d->hostid; 258 + args.kato = c->kato; 259 + 260 + ctrl = nvmet_alloc_ctrl(&args); 261 + if (!ctrl) 285 262 goto out; 286 263 287 - dhchap_status = nvmet_setup_auth(ctrl); 288 - if (dhchap_status) { 289 - pr_err("Failed to setup authentication, dhchap status %u\n", 290 - dhchap_status); 291 - nvmet_ctrl_put(ctrl); 292 - if (dhchap_status == NVME_AUTH_DHCHAP_FAILURE_FAILED) 293 - status = (NVME_SC_CONNECT_INVALID_HOST | NVME_STATUS_DNR); 294 - else 295 - status = NVME_SC_INTERNAL; 296 - goto out; 297 - } 298 - 299 - status = nvmet_install_queue(ctrl, req); 300 - if (status) { 264 + args.status = nvmet_install_queue(ctrl, req); 265 + if (args.status) { 301 266 nvmet_ctrl_put(ctrl); 302 267 goto out; 303 268 } 304 269 305 - pr_info("creating %s controller %d for subsystem %s for NQN %s%s%s.\n", 306 - nvmet_is_disc_subsys(ctrl->subsys) ? 
"discovery" : "nvm", 307 - ctrl->cntlid, ctrl->subsys->subsysnqn, ctrl->hostnqn, 308 - ctrl->pi_support ? " T10-PI is enabled" : "", 309 - nvmet_has_auth(ctrl) ? " with DH-HMAC-CHAP" : ""); 310 - req->cqe->result.u32 = cpu_to_le32(nvmet_connect_result(ctrl)); 270 + args.result = cpu_to_le32(nvmet_connect_result(ctrl)); 311 271 out: 312 272 kfree(d); 313 273 complete: 314 - nvmet_req_complete(req, status); 274 + req->error_loc = args.error_loc; 275 + req->cqe->result.u32 = args.result; 276 + nvmet_req_complete(req, args.status); 315 277 } 316 278 317 279 static void nvmet_execute_io_connect(struct nvmet_req *req) ··· 367 341 out_ctrl_put: 368 342 nvmet_ctrl_put(ctrl); 369 343 goto out; 344 + } 345 + 346 + u32 nvmet_connect_cmd_data_len(struct nvmet_req *req) 347 + { 348 + struct nvme_command *cmd = req->cmd; 349 + 350 + if (!nvme_is_fabrics(cmd) || 351 + cmd->fabrics.fctype != nvme_fabrics_type_connect) 352 + return 0; 353 + 354 + return sizeof(struct nvmf_connect_data); 370 355 } 371 356 372 357 u16 nvmet_parse_connect_cmd(struct nvmet_req *req)
+3
drivers/nvme/target/io-cmd-bdev.c
··· 272 272 iter_flags = SG_MITER_FROM_SG; 273 273 } 274 274 275 + if (req->cmd->rw.control & NVME_RW_LR) 276 + opf |= REQ_FAILFAST_DEV; 277 + 275 278 if (is_pci_p2pdma_page(sg_page(req->sg))) 276 279 opf |= REQ_NOMERGE; 277 280
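The two lines added to io-cmd-bdev.c translate the NVMe per-command Limited Retry (LR) bit into the block layer's device-failfast flag. A standalone sketch of that translation; the LR bit position (bit 15 of the read/write command's control field) follows the NVMe spec, while the `REQ_FAILFAST_DEV` value below is an illustrative placeholder, since the real value is kernel-internal:

```c
#include <assert.h>
#include <stdint.h>

/* LR is bit 15 of the NVMe read/write "control" field (command DWORD 12). */
#define NVME_RW_LR       (1u << 15)
/* Placeholder for the kernel's block-layer flag; value for illustration only. */
#define REQ_FAILFAST_DEV (1u << 8)

/* Mirror of the added hunk: OR in failfast when the host requested LR. */
static uint32_t nvmet_rw_opf(uint32_t opf, uint16_t control)
{
	if (control & NVME_RW_LR)
		opf |= REQ_FAILFAST_DEV;
	return opf;
}
```

With failfast set, the block layer avoids its usual retry behavior on device errors, matching the host's request for limited retries.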
+107 -3
drivers/nvme/target/nvmet.h
··· 245 245 struct nvmet_subsys *subsys; 246 246 struct nvmet_sq **sqs; 247 247 248 + void *drvdata; 249 + 248 250 bool reset_tbkas; 249 251 250 252 struct mutex lock; ··· 333 331 struct config_group namespaces_group; 334 332 struct config_group allowed_hosts_group; 335 333 334 + u16 vendor_id; 335 + u16 subsys_vendor_id; 336 336 char *model_number; 337 337 u32 ieee_oui; 338 338 char *firmware_rev; ··· 415 411 void (*discovery_chg)(struct nvmet_port *port); 416 412 u8 (*get_mdts)(const struct nvmet_ctrl *ctrl); 417 413 u16 (*get_max_queue_size)(const struct nvmet_ctrl *ctrl); 414 + 415 + /* Operations mandatory for PCI target controllers */ 416 + u16 (*create_sq)(struct nvmet_ctrl *ctrl, u16 sqid, u16 flags, 417 + u16 qsize, u64 prp1); 418 + u16 (*delete_sq)(struct nvmet_ctrl *ctrl, u16 sqid); 419 + u16 (*create_cq)(struct nvmet_ctrl *ctrl, u16 cqid, u16 flags, 420 + u16 qsize, u64 prp1, u16 irq_vector); 421 + u16 (*delete_cq)(struct nvmet_ctrl *ctrl, u16 cqid); 422 + u16 (*set_feature)(const struct nvmet_ctrl *ctrl, u8 feat, 423 + void *feat_data); 424 + u16 (*get_feature)(const struct nvmet_ctrl *ctrl, u8 feat, 425 + void *feat_data); 418 426 }; 419 427 420 428 #define NVMET_MAX_INLINE_BIOVEC 8 ··· 536 520 void nvmet_stop_keep_alive_timer(struct nvmet_ctrl *ctrl); 537 521 538 522 u16 nvmet_parse_connect_cmd(struct nvmet_req *req); 523 + u32 nvmet_connect_cmd_data_len(struct nvmet_req *req); 539 524 void nvmet_bdev_set_limits(struct block_device *bdev, struct nvme_id_ns *id); 540 525 u16 nvmet_bdev_parse_io_cmd(struct nvmet_req *req); 541 526 u16 nvmet_file_parse_io_cmd(struct nvmet_req *req); 542 527 u16 nvmet_bdev_zns_parse_io_cmd(struct nvmet_req *req); 528 + u32 nvmet_admin_cmd_data_len(struct nvmet_req *req); 543 529 u16 nvmet_parse_admin_cmd(struct nvmet_req *req); 530 + u32 nvmet_discovery_cmd_data_len(struct nvmet_req *req); 544 531 u16 nvmet_parse_discovery_cmd(struct nvmet_req *req); 545 532 u16 nvmet_parse_fabrics_admin_cmd(struct nvmet_req *req); 533 + 
u32 nvmet_fabrics_admin_cmd_data_len(struct nvmet_req *req); 546 534 u16 nvmet_parse_fabrics_io_cmd(struct nvmet_req *req); 535 + u32 nvmet_fabrics_io_cmd_data_len(struct nvmet_req *req); 547 536 548 537 bool nvmet_req_init(struct nvmet_req *req, struct nvmet_cq *cq, 549 538 struct nvmet_sq *sq, const struct nvmet_fabrics_ops *ops); 550 539 void nvmet_req_uninit(struct nvmet_req *req); 540 + size_t nvmet_req_transfer_len(struct nvmet_req *req); 551 541 bool nvmet_check_transfer_len(struct nvmet_req *req, size_t len); 552 542 bool nvmet_check_data_len_lte(struct nvmet_req *req, size_t data_len); 553 543 void nvmet_req_complete(struct nvmet_req *req, u16 status); ··· 564 542 void nvmet_execute_get_features(struct nvmet_req *req); 565 543 void nvmet_execute_keep_alive(struct nvmet_req *req); 566 544 545 + u16 nvmet_check_cqid(struct nvmet_ctrl *ctrl, u16 cqid); 567 546 void nvmet_cq_setup(struct nvmet_ctrl *ctrl, struct nvmet_cq *cq, u16 qid, 568 547 u16 size); 548 + u16 nvmet_cq_create(struct nvmet_ctrl *ctrl, struct nvmet_cq *cq, u16 qid, 549 + u16 size); 550 + u16 nvmet_check_sqid(struct nvmet_ctrl *ctrl, u16 sqid, bool create); 569 551 void nvmet_sq_setup(struct nvmet_ctrl *ctrl, struct nvmet_sq *sq, u16 qid, 552 + u16 size); 553 + u16 nvmet_sq_create(struct nvmet_ctrl *ctrl, struct nvmet_sq *sq, u16 qid, 570 554 u16 size); 571 555 void nvmet_sq_destroy(struct nvmet_sq *sq); 572 556 int nvmet_sq_init(struct nvmet_sq *sq); ··· 580 552 void nvmet_ctrl_fatal_error(struct nvmet_ctrl *ctrl); 581 553 582 554 void nvmet_update_cc(struct nvmet_ctrl *ctrl, u32 new); 583 - u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn, 584 - struct nvmet_req *req, u32 kato, struct nvmet_ctrl **ctrlp, 585 - uuid_t *hostid); 555 + 556 + struct nvmet_alloc_ctrl_args { 557 + struct nvmet_port *port; 558 + char *subsysnqn; 559 + char *hostnqn; 560 + uuid_t *hostid; 561 + const struct nvmet_fabrics_ops *ops; 562 + struct device *p2p_client; 563 + u32 kato; 564 + u32 result; 565 
+ u16 error_loc; 566 + u16 status; 567 + }; 568 + 569 + struct nvmet_ctrl *nvmet_alloc_ctrl(struct nvmet_alloc_ctrl_args *args); 586 570 struct nvmet_ctrl *nvmet_ctrl_find_get(const char *subsysnqn, 587 571 const char *hostnqn, u16 cntlid, 588 572 struct nvmet_req *req); ··· 736 696 return subsys->type != NVME_NQN_NVME; 737 697 } 738 698 699 + static inline bool nvmet_is_pci_ctrl(struct nvmet_ctrl *ctrl) 700 + { 701 + return ctrl->port->disc_addr.trtype == NVMF_TRTYPE_PCI; 702 + } 703 + 739 704 #ifdef CONFIG_NVME_TARGET_PASSTHRU 740 705 void nvmet_passthru_subsys_free(struct nvmet_subsys *subsys); 741 706 int nvmet_passthru_ctrl_enable(struct nvmet_subsys *subsys); ··· 782 737 u16 errno_to_nvme_status(struct nvmet_req *req, int errno); 783 738 u16 nvmet_report_invalid_opcode(struct nvmet_req *req); 784 739 740 + static inline bool nvmet_cc_en(u32 cc) 741 + { 742 + return (cc >> NVME_CC_EN_SHIFT) & 0x1; 743 + } 744 + 745 + static inline u8 nvmet_cc_css(u32 cc) 746 + { 747 + return (cc >> NVME_CC_CSS_SHIFT) & 0x7; 748 + } 749 + 750 + static inline u8 nvmet_cc_mps(u32 cc) 751 + { 752 + return (cc >> NVME_CC_MPS_SHIFT) & 0xf; 753 + } 754 + 755 + static inline u8 nvmet_cc_ams(u32 cc) 756 + { 757 + return (cc >> NVME_CC_AMS_SHIFT) & 0x7; 758 + } 759 + 760 + static inline u8 nvmet_cc_shn(u32 cc) 761 + { 762 + return (cc >> NVME_CC_SHN_SHIFT) & 0x3; 763 + } 764 + 765 + static inline u8 nvmet_cc_iosqes(u32 cc) 766 + { 767 + return (cc >> NVME_CC_IOSQES_SHIFT) & 0xf; 768 + } 769 + 770 + static inline u8 nvmet_cc_iocqes(u32 cc) 771 + { 772 + return (cc >> NVME_CC_IOCQES_SHIFT) & 0xf; 773 + } 774 + 785 775 /* Convert a 32-bit number to a 16-bit 0's based number */ 786 776 static inline __le16 to0based(u32 a) 787 777 { ··· 853 773 } 854 774 855 775 #ifdef CONFIG_NVME_TARGET_AUTH 776 + u32 nvmet_auth_send_data_len(struct nvmet_req *req); 856 777 void nvmet_execute_auth_send(struct nvmet_req *req); 778 + u32 nvmet_auth_receive_data_len(struct nvmet_req *req); 857 779 void 
nvmet_execute_auth_receive(struct nvmet_req *req); 858 780 int nvmet_auth_set_key(struct nvmet_host *host, const char *secret, 859 781 bool set_ctrl); ··· 913 831 { 914 832 percpu_ref_put(&pc_ref->ref); 915 833 } 834 + 835 + /* 836 + * Data for the get_feature() and set_feature() operations of PCI target 837 + * controllers. 838 + */ 839 + struct nvmet_feat_irq_coalesce { 840 + u8 thr; 841 + u8 time; 842 + }; 843 + 844 + struct nvmet_feat_irq_config { 845 + u16 iv; 846 + bool cd; 847 + }; 848 + 849 + struct nvmet_feat_arbitration { 850 + u8 hpw; 851 + u8 mpw; 852 + u8 lpw; 853 + u8 ab; 854 + }; 855 + 916 856 #endif /* _NVMET_H */
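The nvmet.h hunk moves the `nvmet_cc_*` bit-field helpers out of core.c so the new PCI endpoint driver can decode the controller configuration (CC) register too. A self-contained sketch of a few of them, with the shift constants written out with their values from the NVMe specification (CC.EN bit 0, CC.IOSQES bits 19:16, CC.IOCQES bits 23:20); the kernel takes these from `<linux/nvme.h>`:

```c
#include <assert.h>
#include <stdint.h>

/* Field positions per the NVMe Base Specification's CC register layout. */
enum {
	NVME_CC_EN_SHIFT     = 0,
	NVME_CC_IOSQES_SHIFT = 16,
	NVME_CC_IOCQES_SHIFT = 20,
};

/* Same extraction pattern as the helpers moved into nvmet.h. */
static inline int nvmet_cc_en(uint32_t cc)
{
	return (cc >> NVME_CC_EN_SHIFT) & 0x1;
}

static inline uint8_t nvmet_cc_iosqes(uint32_t cc)
{
	return (cc >> NVME_CC_IOSQES_SHIFT) & 0xf;
}

static inline uint8_t nvmet_cc_iocqes(uint32_t cc)
{
	return (cc >> NVME_CC_IOCQES_SHIFT) & 0xf;
}
```

A typical host enable writes CC with EN=1, IOSQES=6 (64-byte SQ entries) and IOCQES=4 (16-byte CQ entries), i.e. `0x00460001`, which these helpers decode back into the individual fields.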
+11 -7
drivers/nvme/target/passthru.c
··· 261 261 { 262 262 struct scatterlist *sg; 263 263 struct bio *bio; 264 + int ret = -EINVAL; 264 265 int i; 265 266 266 267 if (req->sg_cnt > BIO_MAX_VECS) ··· 278 277 } 279 278 280 279 for_each_sg(req->sg, sg, req->sg_cnt, i) { 281 - if (bio_add_pc_page(rq->q, bio, sg_page(sg), sg->length, 282 - sg->offset) < sg->length) { 283 - nvmet_req_bio_put(req, bio); 284 - return -EINVAL; 285 - } 280 + if (bio_add_page(bio, sg_page(sg), sg->length, sg->offset) < 281 + sg->length) 282 + goto out_bio_put; 286 283 } 287 284 288 - blk_rq_bio_prep(rq, bio, req->sg_cnt); 289 - 285 + ret = blk_rq_append_bio(rq, bio); 286 + if (ret) 287 + goto out_bio_put; 290 288 return 0; 289 + 290 + out_bio_put: 291 + nvmet_req_bio_put(req, bio); 292 + return ret; 291 293 } 292 294 293 295 static void nvmet_passthru_execute_cmd(struct nvmet_req *req)
+2591
drivers/nvme/target/pci-epf.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* 3 + * NVMe PCI Endpoint Function target driver. 4 + * 5 + * Copyright (c) 2024, Western Digital Corporation or its affiliates. 6 + * Copyright (c) 2024, Rick Wertenbroek <rick.wertenbroek@gmail.com> 7 + * REDS Institute, HEIG-VD, HES-SO, Switzerland 8 + */ 9 + #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt 10 + 11 + #include <linux/delay.h> 12 + #include <linux/dmaengine.h> 13 + #include <linux/io.h> 14 + #include <linux/mempool.h> 15 + #include <linux/module.h> 16 + #include <linux/mutex.h> 17 + #include <linux/nvme.h> 18 + #include <linux/pci_ids.h> 19 + #include <linux/pci-epc.h> 20 + #include <linux/pci-epf.h> 21 + #include <linux/pci_regs.h> 22 + #include <linux/slab.h> 23 + 24 + #include "nvmet.h" 25 + 26 + static LIST_HEAD(nvmet_pci_epf_ports); 27 + static DEFINE_MUTEX(nvmet_pci_epf_ports_mutex); 28 + 29 + /* 30 + * Default and maximum allowed data transfer size. For the default, 31 + * allow up to 128 page-sized segments. For the maximum allowed, 32 + * use 4 times the default (which is completely arbitrary). 33 + */ 34 + #define NVMET_PCI_EPF_MAX_SEGS 128 35 + #define NVMET_PCI_EPF_MDTS_KB \ 36 + (NVMET_PCI_EPF_MAX_SEGS << (PAGE_SHIFT - 10)) 37 + #define NVMET_PCI_EPF_MAX_MDTS_KB (NVMET_PCI_EPF_MDTS_KB * 4) 38 + 39 + /* 40 + * IRQ vector coalescing threshold: by default, post 8 CQEs before raising an 41 + * interrupt vector to the host. This default 8 is completely arbitrary and can 42 + * be changed by the host with a nvme_set_features command. 43 + */ 44 + #define NVMET_PCI_EPF_IV_THRESHOLD 8 45 + 46 + /* 47 + * BAR CC register and SQ polling intervals. 48 + */ 49 + #define NVMET_PCI_EPF_CC_POLL_INTERVAL msecs_to_jiffies(5) 50 + #define NVMET_PCI_EPF_SQ_POLL_INTERVAL msecs_to_jiffies(5) 51 + #define NVMET_PCI_EPF_SQ_POLL_IDLE msecs_to_jiffies(5000) 52 + 53 + /* 54 + * SQ arbitration burst default: fetch at most 8 commands at a time from an SQ. 
55 + */ 56 + #define NVMET_PCI_EPF_SQ_AB 8 57 + 58 + /* 59 + * Handling of CQs is normally immediate, unless we fail to map a CQ or the CQ 60 + * is full, in which case we retry the CQ processing after this interval. 61 + */ 62 + #define NVMET_PCI_EPF_CQ_RETRY_INTERVAL msecs_to_jiffies(1) 63 + 64 + enum nvmet_pci_epf_queue_flags { 65 + NVMET_PCI_EPF_Q_IS_SQ = 0, /* The queue is a submission queue */ 66 + NVMET_PCI_EPF_Q_LIVE, /* The queue is live */ 67 + NVMET_PCI_EPF_Q_IRQ_ENABLED, /* IRQ is enabled for this queue */ 68 + }; 69 + 70 + /* 71 + * IRQ vector descriptor. 72 + */ 73 + struct nvmet_pci_epf_irq_vector { 74 + unsigned int vector; 75 + unsigned int ref; 76 + bool cd; 77 + int nr_irqs; 78 + }; 79 + 80 + struct nvmet_pci_epf_queue { 81 + union { 82 + struct nvmet_sq nvme_sq; 83 + struct nvmet_cq nvme_cq; 84 + }; 85 + struct nvmet_pci_epf_ctrl *ctrl; 86 + unsigned long flags; 87 + 88 + u64 pci_addr; 89 + size_t pci_size; 90 + struct pci_epc_map pci_map; 91 + 92 + u16 qid; 93 + u16 depth; 94 + u16 vector; 95 + u16 head; 96 + u16 tail; 97 + u16 phase; 98 + u32 db; 99 + 100 + size_t qes; 101 + 102 + struct nvmet_pci_epf_irq_vector *iv; 103 + struct workqueue_struct *iod_wq; 104 + struct delayed_work work; 105 + spinlock_t lock; 106 + struct list_head list; 107 + }; 108 + 109 + /* 110 + * PCI Root Complex (RC) address data segment for mapping an admin or 111 + * I/O command buffer @buf of @length bytes to the PCI address @pci_addr. 112 + */ 113 + struct nvmet_pci_epf_segment { 114 + void *buf; 115 + u64 pci_addr; 116 + u32 length; 117 + }; 118 + 119 + /* 120 + * Command descriptors. 
121 + */ 122 + struct nvmet_pci_epf_iod { 123 + struct list_head link; 124 + 125 + struct nvmet_req req; 126 + struct nvme_command cmd; 127 + struct nvme_completion cqe; 128 + unsigned int status; 129 + 130 + struct nvmet_pci_epf_ctrl *ctrl; 131 + 132 + struct nvmet_pci_epf_queue *sq; 133 + struct nvmet_pci_epf_queue *cq; 134 + 135 + /* Data transfer size and direction for the command. */ 136 + size_t data_len; 137 + enum dma_data_direction dma_dir; 138 + 139 + /* 140 + * PCI Root Complex (RC) address data segments: if nr_data_segs is 1, we 141 + * use only @data_seg. Otherwise, the array of segments @data_segs is 142 + * allocated to manage multiple PCI address data segments. @data_sgl and 143 + * @data_sgt are used to setup the command request for execution by the 144 + * target core. 145 + */ 146 + unsigned int nr_data_segs; 147 + struct nvmet_pci_epf_segment data_seg; 148 + struct nvmet_pci_epf_segment *data_segs; 149 + struct scatterlist data_sgl; 150 + struct sg_table data_sgt; 151 + 152 + struct work_struct work; 153 + struct completion done; 154 + }; 155 + 156 + /* 157 + * PCI target controller private data. 158 + */ 159 + struct nvmet_pci_epf_ctrl { 160 + struct nvmet_pci_epf *nvme_epf; 161 + struct nvmet_port *port; 162 + struct nvmet_ctrl *tctrl; 163 + struct device *dev; 164 + 165 + unsigned int nr_queues; 166 + struct nvmet_pci_epf_queue *sq; 167 + struct nvmet_pci_epf_queue *cq; 168 + unsigned int sq_ab; 169 + 170 + mempool_t iod_pool; 171 + void *bar; 172 + u64 cap; 173 + u32 cc; 174 + u32 csts; 175 + 176 + size_t io_sqes; 177 + size_t io_cqes; 178 + 179 + size_t mps_shift; 180 + size_t mps; 181 + size_t mps_mask; 182 + 183 + unsigned int mdts; 184 + 185 + struct delayed_work poll_cc; 186 + struct delayed_work poll_sqs; 187 + 188 + struct mutex irq_lock; 189 + struct nvmet_pci_epf_irq_vector *irq_vectors; 190 + unsigned int irq_vector_threshold; 191 + 192 + bool link_up; 193 + bool enabled; 194 + }; 195 + 196 + /* 197 + * PCI EPF driver private data. 
198 + */ 199 + struct nvmet_pci_epf { 200 + struct pci_epf *epf; 201 + 202 + const struct pci_epc_features *epc_features; 203 + 204 + void *reg_bar; 205 + size_t msix_table_offset; 206 + 207 + unsigned int irq_type; 208 + unsigned int nr_vectors; 209 + 210 + struct nvmet_pci_epf_ctrl ctrl; 211 + 212 + bool dma_enabled; 213 + struct dma_chan *dma_tx_chan; 214 + struct mutex dma_tx_lock; 215 + struct dma_chan *dma_rx_chan; 216 + struct mutex dma_rx_lock; 217 + 218 + struct mutex mmio_lock; 219 + 220 + /* PCI endpoint function configfs attributes. */ 221 + struct config_group group; 222 + __le16 portid; 223 + char subsysnqn[NVMF_NQN_SIZE]; 224 + unsigned int mdts_kb; 225 + }; 226 + 227 + static inline u32 nvmet_pci_epf_bar_read32(struct nvmet_pci_epf_ctrl *ctrl, 228 + u32 off) 229 + { 230 + __le32 *bar_reg = ctrl->bar + off; 231 + 232 + return le32_to_cpu(READ_ONCE(*bar_reg)); 233 + } 234 + 235 + static inline void nvmet_pci_epf_bar_write32(struct nvmet_pci_epf_ctrl *ctrl, 236 + u32 off, u32 val) 237 + { 238 + __le32 *bar_reg = ctrl->bar + off; 239 + 240 + WRITE_ONCE(*bar_reg, cpu_to_le32(val)); 241 + } 242 + 243 + static inline u64 nvmet_pci_epf_bar_read64(struct nvmet_pci_epf_ctrl *ctrl, 244 + u32 off) 245 + { 246 + return (u64)nvmet_pci_epf_bar_read32(ctrl, off) | 247 + ((u64)nvmet_pci_epf_bar_read32(ctrl, off + 4) << 32); 248 + } 249 + 250 + static inline void nvmet_pci_epf_bar_write64(struct nvmet_pci_epf_ctrl *ctrl, 251 + u32 off, u64 val) 252 + { 253 + nvmet_pci_epf_bar_write32(ctrl, off, val & 0xFFFFFFFF); 254 + nvmet_pci_epf_bar_write32(ctrl, off + 4, (val >> 32) & 0xFFFFFFFF); 255 + } 256 + 257 + static inline int nvmet_pci_epf_mem_map(struct nvmet_pci_epf *nvme_epf, 258 + u64 pci_addr, size_t size, struct pci_epc_map *map) 259 + { 260 + struct pci_epf *epf = nvme_epf->epf; 261 + 262 + return pci_epc_mem_map(epf->epc, epf->func_no, epf->vfunc_no, 263 + pci_addr, size, map); 264 + } 265 + 266 + static inline void nvmet_pci_epf_mem_unmap(struct nvmet_pci_epf 
*nvme_epf, 267 + struct pci_epc_map *map) 268 + { 269 + struct pci_epf *epf = nvme_epf->epf; 270 + 271 + pci_epc_mem_unmap(epf->epc, epf->func_no, epf->vfunc_no, map); 272 + } 273 + 274 + struct nvmet_pci_epf_dma_filter { 275 + struct device *dev; 276 + u32 dma_mask; 277 + }; 278 + 279 + static bool nvmet_pci_epf_dma_filter(struct dma_chan *chan, void *arg) 280 + { 281 + struct nvmet_pci_epf_dma_filter *filter = arg; 282 + struct dma_slave_caps caps; 283 + 284 + memset(&caps, 0, sizeof(caps)); 285 + dma_get_slave_caps(chan, &caps); 286 + 287 + return chan->device->dev == filter->dev && 288 + (filter->dma_mask & caps.directions); 289 + } 290 + 291 + static void nvmet_pci_epf_init_dma(struct nvmet_pci_epf *nvme_epf) 292 + { 293 + struct pci_epf *epf = nvme_epf->epf; 294 + struct device *dev = &epf->dev; 295 + struct nvmet_pci_epf_dma_filter filter; 296 + struct dma_chan *chan; 297 + dma_cap_mask_t mask; 298 + 299 + mutex_init(&nvme_epf->dma_rx_lock); 300 + mutex_init(&nvme_epf->dma_tx_lock); 301 + 302 + dma_cap_zero(mask); 303 + dma_cap_set(DMA_SLAVE, mask); 304 + 305 + filter.dev = epf->epc->dev.parent; 306 + filter.dma_mask = BIT(DMA_DEV_TO_MEM); 307 + 308 + chan = dma_request_channel(mask, nvmet_pci_epf_dma_filter, &filter); 309 + if (!chan) 310 + goto out_dma_no_rx; 311 + 312 + nvme_epf->dma_rx_chan = chan; 313 + 314 + filter.dma_mask = BIT(DMA_MEM_TO_DEV); 315 + chan = dma_request_channel(mask, nvmet_pci_epf_dma_filter, &filter); 316 + if (!chan) 317 + goto out_dma_no_tx; 318 + 319 + nvme_epf->dma_tx_chan = chan; 320 + 321 + nvme_epf->dma_enabled = true; 322 + 323 + dev_dbg(dev, "Using DMA RX channel %s, maximum segment size %u B\n", 324 + dma_chan_name(chan), 325 + dma_get_max_seg_size(dmaengine_get_dma_device(chan))); 326 + 327 + dev_dbg(dev, "Using DMA TX channel %s, maximum segment size %u B\n", 328 + dma_chan_name(chan), 329 + dma_get_max_seg_size(dmaengine_get_dma_device(chan))); 330 + 331 + return; 332 + 333 + out_dma_no_tx: 334 + 
dma_release_channel(nvme_epf->dma_rx_chan); 335 + nvme_epf->dma_rx_chan = NULL; 336 + 337 + out_dma_no_rx: 338 + mutex_destroy(&nvme_epf->dma_rx_lock); 339 + mutex_destroy(&nvme_epf->dma_tx_lock); 340 + nvme_epf->dma_enabled = false; 341 + 342 + dev_info(&epf->dev, "DMA not supported, falling back to MMIO\n"); 343 + } 344 + 345 + static void nvmet_pci_epf_deinit_dma(struct nvmet_pci_epf *nvme_epf) 346 + { 347 + if (!nvme_epf->dma_enabled) 348 + return; 349 + 350 + dma_release_channel(nvme_epf->dma_tx_chan); 351 + nvme_epf->dma_tx_chan = NULL; 352 + dma_release_channel(nvme_epf->dma_rx_chan); 353 + nvme_epf->dma_rx_chan = NULL; 354 + mutex_destroy(&nvme_epf->dma_rx_lock); 355 + mutex_destroy(&nvme_epf->dma_tx_lock); 356 + nvme_epf->dma_enabled = false; 357 + } 358 + 359 + static int nvmet_pci_epf_dma_transfer(struct nvmet_pci_epf *nvme_epf, 360 + struct nvmet_pci_epf_segment *seg, enum dma_data_direction dir) 361 + { 362 + struct pci_epf *epf = nvme_epf->epf; 363 + struct dma_async_tx_descriptor *desc; 364 + struct dma_slave_config sconf = {}; 365 + struct device *dev = &epf->dev; 366 + struct device *dma_dev; 367 + struct dma_chan *chan; 368 + dma_cookie_t cookie; 369 + dma_addr_t dma_addr; 370 + struct mutex *lock; 371 + int ret; 372 + 373 + switch (dir) { 374 + case DMA_FROM_DEVICE: 375 + lock = &nvme_epf->dma_rx_lock; 376 + chan = nvme_epf->dma_rx_chan; 377 + sconf.direction = DMA_DEV_TO_MEM; 378 + sconf.src_addr = seg->pci_addr; 379 + break; 380 + case DMA_TO_DEVICE: 381 + lock = &nvme_epf->dma_tx_lock; 382 + chan = nvme_epf->dma_tx_chan; 383 + sconf.direction = DMA_MEM_TO_DEV; 384 + sconf.dst_addr = seg->pci_addr; 385 + break; 386 + default: 387 + return -EINVAL; 388 + } 389 + 390 + mutex_lock(lock); 391 + 392 + dma_dev = dmaengine_get_dma_device(chan); 393 + dma_addr = dma_map_single(dma_dev, seg->buf, seg->length, dir); 394 + ret = dma_mapping_error(dma_dev, dma_addr); 395 + if (ret) 396 + goto unlock; 397 + 398 + ret = dmaengine_slave_config(chan, &sconf); 
399 + if (ret) { 400 + dev_err(dev, "Failed to configure DMA channel\n"); 401 + goto unmap; 402 + } 403 + 404 + desc = dmaengine_prep_slave_single(chan, dma_addr, seg->length, 405 + sconf.direction, DMA_CTRL_ACK); 406 + if (!desc) { 407 + dev_err(dev, "Failed to prepare DMA\n"); 408 + ret = -EIO; 409 + goto unmap; 410 + } 411 + 412 + cookie = dmaengine_submit(desc); 413 + ret = dma_submit_error(cookie); 414 + if (ret) { 415 + dev_err(dev, "Failed to do DMA submit (err=%d)\n", ret); 416 + goto unmap; 417 + } 418 + 419 + if (dma_sync_wait(chan, cookie) != DMA_COMPLETE) { 420 + dev_err(dev, "DMA transfer failed\n"); 421 + ret = -EIO; 422 + } 423 + 424 + dmaengine_terminate_sync(chan); 425 + 426 + unmap: 427 + dma_unmap_single(dma_dev, dma_addr, seg->length, dir); 428 + 429 + unlock: 430 + mutex_unlock(lock); 431 + 432 + return ret; 433 + } 434 + 435 + static int nvmet_pci_epf_mmio_transfer(struct nvmet_pci_epf *nvme_epf, 436 + struct nvmet_pci_epf_segment *seg, enum dma_data_direction dir) 437 + { 438 + u64 pci_addr = seg->pci_addr; 439 + u32 length = seg->length; 440 + void *buf = seg->buf; 441 + struct pci_epc_map map; 442 + int ret = -EINVAL; 443 + 444 + /* 445 + * Note: MMIO transfers do not need serialization but this is a 446 + * simple way to avoid using too many mapping windows. 
447 + */ 448 + mutex_lock(&nvme_epf->mmio_lock); 449 + 450 + while (length) { 451 + ret = nvmet_pci_epf_mem_map(nvme_epf, pci_addr, length, &map); 452 + if (ret) 453 + break; 454 + 455 + switch (dir) { 456 + case DMA_FROM_DEVICE: 457 + memcpy_fromio(buf, map.virt_addr, map.pci_size); 458 + break; 459 + case DMA_TO_DEVICE: 460 + memcpy_toio(map.virt_addr, buf, map.pci_size); 461 + break; 462 + default: 463 + ret = -EINVAL; 464 + goto unlock; 465 + } 466 + 467 + pci_addr += map.pci_size; 468 + buf += map.pci_size; 469 + length -= map.pci_size; 470 + 471 + nvmet_pci_epf_mem_unmap(nvme_epf, &map); 472 + } 473 + 474 + unlock: 475 + mutex_unlock(&nvme_epf->mmio_lock); 476 + 477 + return ret; 478 + } 479 + 480 + static inline int nvmet_pci_epf_transfer_seg(struct nvmet_pci_epf *nvme_epf, 481 + struct nvmet_pci_epf_segment *seg, enum dma_data_direction dir) 482 + { 483 + if (nvme_epf->dma_enabled) 484 + return nvmet_pci_epf_dma_transfer(nvme_epf, seg, dir); 485 + 486 + return nvmet_pci_epf_mmio_transfer(nvme_epf, seg, dir); 487 + } 488 + 489 + static inline int nvmet_pci_epf_transfer(struct nvmet_pci_epf_ctrl *ctrl, 490 + void *buf, u64 pci_addr, u32 length, 491 + enum dma_data_direction dir) 492 + { 493 + struct nvmet_pci_epf_segment seg = { 494 + .buf = buf, 495 + .pci_addr = pci_addr, 496 + .length = length, 497 + }; 498 + 499 + return nvmet_pci_epf_transfer_seg(ctrl->nvme_epf, &seg, dir); 500 + } 501 + 502 + static int nvmet_pci_epf_alloc_irq_vectors(struct nvmet_pci_epf_ctrl *ctrl) 503 + { 504 + ctrl->irq_vectors = kcalloc(ctrl->nr_queues, 505 + sizeof(struct nvmet_pci_epf_irq_vector), 506 + GFP_KERNEL); 507 + if (!ctrl->irq_vectors) 508 + return -ENOMEM; 509 + 510 + mutex_init(&ctrl->irq_lock); 511 + 512 + return 0; 513 + } 514 + 515 + static void nvmet_pci_epf_free_irq_vectors(struct nvmet_pci_epf_ctrl *ctrl) 516 + { 517 + if (ctrl->irq_vectors) { 518 + mutex_destroy(&ctrl->irq_lock); 519 + kfree(ctrl->irq_vectors); 520 + ctrl->irq_vectors = NULL; 521 + } 522 + } 
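The MMIO path in `nvmet_pci_epf_mmio_transfer()` above copies the buffer through a bounded mapping window: map, copy up to `map.pci_size` bytes, advance both addresses, unmap, repeat until `length` is exhausted. A self-contained sketch of that chunking loop, where the fixed `WINDOW` constant is a hypothetical stand-in for whatever size `pci_epc_mem_map()` actually managed to map on a given iteration:

```c
#include <string.h>

#define WINDOW 8	/* hypothetical mapping-window size */

/*
 * Copy length bytes in window-sized pieces, mirroring the loop in
 * nvmet_pci_epf_mmio_transfer(): each iteration handles at most one
 * mapping window worth of data, then advances source, destination,
 * and the remaining byte count.
 */
static void copy_windowed(char *dst, const char *src, size_t length)
{
	while (length) {
		size_t chunk = length < WINDOW ? length : WINDOW;

		memcpy(dst, src, chunk);	/* stands in for memcpy_toio() */
		dst += chunk;
		src += chunk;
		length -= chunk;
	}
}
```

The real driver additionally re-maps on every iteration because the endpoint controller may satisfy a map request with fewer bytes than asked for, which is why the loop advances by `map.pci_size` rather than a fixed stride.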
523 + 524 + static struct nvmet_pci_epf_irq_vector * 525 + nvmet_pci_epf_find_irq_vector(struct nvmet_pci_epf_ctrl *ctrl, u16 vector) 526 + { 527 + struct nvmet_pci_epf_irq_vector *iv; 528 + int i; 529 + 530 + lockdep_assert_held(&ctrl->irq_lock); 531 + 532 + for (i = 0; i < ctrl->nr_queues; i++) { 533 + iv = &ctrl->irq_vectors[i]; 534 + if (iv->ref && iv->vector == vector) 535 + return iv; 536 + } 537 + 538 + return NULL; 539 + } 540 + 541 + static struct nvmet_pci_epf_irq_vector * 542 + nvmet_pci_epf_add_irq_vector(struct nvmet_pci_epf_ctrl *ctrl, u16 vector) 543 + { 544 + struct nvmet_pci_epf_irq_vector *iv; 545 + int i; 546 + 547 + mutex_lock(&ctrl->irq_lock); 548 + 549 + iv = nvmet_pci_epf_find_irq_vector(ctrl, vector); 550 + if (iv) { 551 + iv->ref++; 552 + goto unlock; 553 + } 554 + 555 + for (i = 0; i < ctrl->nr_queues; i++) { 556 + iv = &ctrl->irq_vectors[i]; 557 + if (!iv->ref) 558 + break; 559 + } 560 + 561 + if (WARN_ON_ONCE(!iv)) 562 + goto unlock; 563 + 564 + iv->ref = 1; 565 + iv->vector = vector; 566 + iv->nr_irqs = 0; 567 + 568 + unlock: 569 + mutex_unlock(&ctrl->irq_lock); 570 + 571 + return iv; 572 + } 573 + 574 + static void nvmet_pci_epf_remove_irq_vector(struct nvmet_pci_epf_ctrl *ctrl, 575 + u16 vector) 576 + { 577 + struct nvmet_pci_epf_irq_vector *iv; 578 + 579 + mutex_lock(&ctrl->irq_lock); 580 + 581 + iv = nvmet_pci_epf_find_irq_vector(ctrl, vector); 582 + if (iv) { 583 + iv->ref--; 584 + if (!iv->ref) { 585 + iv->vector = 0; 586 + iv->nr_irqs = 0; 587 + } 588 + } 589 + 590 + mutex_unlock(&ctrl->irq_lock); 591 + } 592 + 593 + static bool nvmet_pci_epf_should_raise_irq(struct nvmet_pci_epf_ctrl *ctrl, 594 + struct nvmet_pci_epf_queue *cq, bool force) 595 + { 596 + struct nvmet_pci_epf_irq_vector *iv = cq->iv; 597 + bool ret; 598 + 599 + if (!test_bit(NVMET_PCI_EPF_Q_IRQ_ENABLED, &cq->flags)) 600 + return false; 601 + 602 + /* IRQ coalescing for the admin queue is not allowed. 
*/ 603 + if (!cq->qid) 604 + return true; 605 + 606 + if (iv->cd) 607 + return true; 608 + 609 + if (force) { 610 + ret = iv->nr_irqs > 0; 611 + } else { 612 + iv->nr_irqs++; 613 + ret = iv->nr_irqs >= ctrl->irq_vector_threshold; 614 + } 615 + if (ret) 616 + iv->nr_irqs = 0; 617 + 618 + return ret; 619 + } 620 + 621 + static void nvmet_pci_epf_raise_irq(struct nvmet_pci_epf_ctrl *ctrl, 622 + struct nvmet_pci_epf_queue *cq, bool force) 623 + { 624 + struct nvmet_pci_epf *nvme_epf = ctrl->nvme_epf; 625 + struct pci_epf *epf = nvme_epf->epf; 626 + int ret = 0; 627 + 628 + if (!test_bit(NVMET_PCI_EPF_Q_LIVE, &cq->flags)) 629 + return; 630 + 631 + mutex_lock(&ctrl->irq_lock); 632 + 633 + if (!nvmet_pci_epf_should_raise_irq(ctrl, cq, force)) 634 + goto unlock; 635 + 636 + switch (nvme_epf->irq_type) { 637 + case PCI_IRQ_MSIX: 638 + case PCI_IRQ_MSI: 639 + ret = pci_epc_raise_irq(epf->epc, epf->func_no, epf->vfunc_no, 640 + nvme_epf->irq_type, cq->vector + 1); 641 + if (!ret) 642 + break; 643 + /* 644 + * If we got an error, it is likely because the host is using 645 + * legacy IRQs (e.g. BIOS, grub). 
646 + */ 647 + fallthrough; 648 + case PCI_IRQ_INTX: 649 + ret = pci_epc_raise_irq(epf->epc, epf->func_no, epf->vfunc_no, 650 + PCI_IRQ_INTX, 0); 651 + break; 652 + default: 653 + WARN_ON_ONCE(1); 654 + ret = -EINVAL; 655 + break; 656 + } 657 + 658 + if (ret) 659 + dev_err(ctrl->dev, "Failed to raise IRQ (err=%d)\n", ret); 660 + 661 + unlock: 662 + mutex_unlock(&ctrl->irq_lock); 663 + } 664 + 665 + static inline const char *nvmet_pci_epf_iod_name(struct nvmet_pci_epf_iod *iod) 666 + { 667 + return nvme_opcode_str(iod->sq->qid, iod->cmd.common.opcode); 668 + } 669 + 670 + static void nvmet_pci_epf_exec_iod_work(struct work_struct *work); 671 + 672 + static struct nvmet_pci_epf_iod * 673 + nvmet_pci_epf_alloc_iod(struct nvmet_pci_epf_queue *sq) 674 + { 675 + struct nvmet_pci_epf_ctrl *ctrl = sq->ctrl; 676 + struct nvmet_pci_epf_iod *iod; 677 + 678 + iod = mempool_alloc(&ctrl->iod_pool, GFP_KERNEL); 679 + if (unlikely(!iod)) 680 + return NULL; 681 + 682 + memset(iod, 0, sizeof(*iod)); 683 + iod->req.cmd = &iod->cmd; 684 + iod->req.cqe = &iod->cqe; 685 + iod->req.port = ctrl->port; 686 + iod->ctrl = ctrl; 687 + iod->sq = sq; 688 + iod->cq = &ctrl->cq[sq->qid]; 689 + INIT_LIST_HEAD(&iod->link); 690 + iod->dma_dir = DMA_NONE; 691 + INIT_WORK(&iod->work, nvmet_pci_epf_exec_iod_work); 692 + init_completion(&iod->done); 693 + 694 + return iod; 695 + } 696 + 697 + /* 698 + * Allocate or grow a command table of PCI segments. 
699 + */ 700 + static int nvmet_pci_epf_alloc_iod_data_segs(struct nvmet_pci_epf_iod *iod, 701 + int nsegs) 702 + { 703 + struct nvmet_pci_epf_segment *segs; 704 + int nr_segs = iod->nr_data_segs + nsegs; 705 + 706 + segs = krealloc(iod->data_segs, 707 + nr_segs * sizeof(struct nvmet_pci_epf_segment), 708 + GFP_KERNEL | __GFP_ZERO); 709 + if (!segs) 710 + return -ENOMEM; 711 + 712 + iod->nr_data_segs = nr_segs; 713 + iod->data_segs = segs; 714 + 715 + return 0; 716 + } 717 + 718 + static void nvmet_pci_epf_free_iod(struct nvmet_pci_epf_iod *iod) 719 + { 720 + int i; 721 + 722 + if (iod->data_segs) { 723 + for (i = 0; i < iod->nr_data_segs; i++) 724 + kfree(iod->data_segs[i].buf); 725 + if (iod->data_segs != &iod->data_seg) 726 + kfree(iod->data_segs); 727 + } 728 + if (iod->data_sgt.nents > 1) 729 + sg_free_table(&iod->data_sgt); 730 + mempool_free(iod, &iod->ctrl->iod_pool); 731 + } 732 + 733 + static int nvmet_pci_epf_transfer_iod_data(struct nvmet_pci_epf_iod *iod) 734 + { 735 + struct nvmet_pci_epf *nvme_epf = iod->ctrl->nvme_epf; 736 + struct nvmet_pci_epf_segment *seg = &iod->data_segs[0]; 737 + int i, ret; 738 + 739 + /* Split the data transfer according to the PCI segments. */ 740 + for (i = 0; i < iod->nr_data_segs; i++, seg++) { 741 + ret = nvmet_pci_epf_transfer_seg(nvme_epf, seg, iod->dma_dir); 742 + if (ret) { 743 + iod->status = NVME_SC_DATA_XFER_ERROR | NVME_STATUS_DNR; 744 + return ret; 745 + } 746 + } 747 + 748 + return 0; 749 + } 750 + 751 + static inline u32 nvmet_pci_epf_prp_ofst(struct nvmet_pci_epf_ctrl *ctrl, 752 + u64 prp) 753 + { 754 + return prp & ctrl->mps_mask; 755 + } 756 + 757 + static inline size_t nvmet_pci_epf_prp_size(struct nvmet_pci_epf_ctrl *ctrl, 758 + u64 prp) 759 + { 760 + return ctrl->mps - nvmet_pci_epf_prp_ofst(ctrl, prp); 761 + } 762 + 763 + /* 764 + * Transfer a PRP list from the host and return the number of prps. 
765 + */ 766 + static int nvmet_pci_epf_get_prp_list(struct nvmet_pci_epf_ctrl *ctrl, u64 prp, 767 + size_t xfer_len, __le64 *prps) 768 + { 769 + size_t nr_prps = (xfer_len + ctrl->mps_mask) >> ctrl->mps_shift; 770 + u32 length; 771 + int ret; 772 + 773 + /* 774 + * Compute the number of PRPs required for the number of bytes to 775 + * transfer (xfer_len). If this number overflows the memory page size 776 + * with the PRP list pointer specified, only return the space available 777 + * in the memory page, the last PRP in there will be a PRP list pointer 778 + * to the remaining PRPs. 779 + */ 780 + length = min(nvmet_pci_epf_prp_size(ctrl, prp), nr_prps << 3); 781 + ret = nvmet_pci_epf_transfer(ctrl, prps, prp, length, DMA_FROM_DEVICE); 782 + if (ret) 783 + return ret; 784 + 785 + return length >> 3; 786 + } 787 + 788 + static int nvmet_pci_epf_iod_parse_prp_list(struct nvmet_pci_epf_ctrl *ctrl, 789 + struct nvmet_pci_epf_iod *iod) 790 + { 791 + struct nvme_command *cmd = &iod->cmd; 792 + struct nvmet_pci_epf_segment *seg; 793 + size_t size = 0, ofst, prp_size, xfer_len; 794 + size_t transfer_len = iod->data_len; 795 + int nr_segs, nr_prps = 0; 796 + u64 pci_addr, prp; 797 + int i = 0, ret; 798 + __le64 *prps; 799 + 800 + prps = kzalloc(ctrl->mps, GFP_KERNEL); 801 + if (!prps) 802 + goto err_internal; 803 + 804 + /* 805 + * Allocate PCI segments for the command: this considers the worst case 806 + * scenario where all prps are discontiguous, so get as many segments 807 + * as we can have prps. In practice, most of the time, we will have 808 + * far less PCI segments than prps. 809 + */ 810 + prp = le64_to_cpu(cmd->common.dptr.prp1); 811 + if (!prp) 812 + goto err_invalid_field; 813 + 814 + ofst = nvmet_pci_epf_prp_ofst(ctrl, prp); 815 + nr_segs = (transfer_len + ofst + ctrl->mps - 1) >> ctrl->mps_shift; 816 + 817 + ret = nvmet_pci_epf_alloc_iod_data_segs(iod, nr_segs); 818 + if (ret) 819 + goto err_internal; 820 + 821 + /* Set the first segment using prp1. 
*/ 822 + seg = &iod->data_segs[0]; 823 + seg->pci_addr = prp; 824 + seg->length = nvmet_pci_epf_prp_size(ctrl, prp); 825 + 826 + size = seg->length; 827 + pci_addr = prp + size; 828 + nr_segs = 1; 829 + 830 + /* 831 + * Now build the PCI address segments using the PRP lists, starting 832 + * from prp2. 833 + */ 834 + prp = le64_to_cpu(cmd->common.dptr.prp2); 835 + if (!prp) 836 + goto err_invalid_field; 837 + 838 + while (size < transfer_len) { 839 + xfer_len = transfer_len - size; 840 + 841 + if (!nr_prps) { 842 + nr_prps = nvmet_pci_epf_get_prp_list(ctrl, prp, 843 + xfer_len, prps); 844 + if (nr_prps < 0) 845 + goto err_internal; 846 + 847 + i = 0; 848 + ofst = 0; 849 + } 850 + 851 + /* Current entry */ 852 + prp = le64_to_cpu(prps[i]); 853 + if (!prp) 854 + goto err_invalid_field; 855 + 856 + /* Did we reach the last PRP entry of the list? */ 857 + if (xfer_len > ctrl->mps && i == nr_prps - 1) { 858 + /* We need more PRPs: PRP is a list pointer. */ 859 + nr_prps = 0; 860 + continue; 861 + } 862 + 863 + /* Only the first PRP is allowed to have an offset. */ 864 + if (nvmet_pci_epf_prp_ofst(ctrl, prp)) 865 + goto err_invalid_offset; 866 + 867 + if (prp != pci_addr) { 868 + /* Discontiguous prp: new segment. 
*/ 869 + nr_segs++; 870 + if (WARN_ON_ONCE(nr_segs > iod->nr_data_segs)) 871 + goto err_internal; 872 + 873 + seg++; 874 + seg->pci_addr = prp; 875 + seg->length = 0; 876 + pci_addr = prp; 877 + } 878 + 879 + prp_size = min_t(size_t, ctrl->mps, xfer_len); 880 + seg->length += prp_size; 881 + pci_addr += prp_size; 882 + size += prp_size; 883 + 884 + i++; 885 + } 886 + 887 + iod->nr_data_segs = nr_segs; 888 + ret = 0; 889 + 890 + if (size != transfer_len) { 891 + dev_err(ctrl->dev, 892 + "PRPs transfer length mismatch: got %zu B, need %zu B\n", 893 + size, transfer_len); 894 + goto err_internal; 895 + } 896 + 897 + kfree(prps); 898 + 899 + return 0; 900 + 901 + err_invalid_offset: 902 + dev_err(ctrl->dev, "PRPs list invalid offset\n"); 903 + iod->status = NVME_SC_PRP_INVALID_OFFSET | NVME_STATUS_DNR; 904 + goto err; 905 + 906 + err_invalid_field: 907 + dev_err(ctrl->dev, "PRPs list invalid field\n"); 908 + iod->status = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR; 909 + goto err; 910 + 911 + err_internal: 912 + dev_err(ctrl->dev, "PRPs list internal error\n"); 913 + iod->status = NVME_SC_INTERNAL | NVME_STATUS_DNR; 914 + 915 + err: 916 + kfree(prps); 917 + return -EINVAL; 918 + } 919 + 920 + static int nvmet_pci_epf_iod_parse_prp_simple(struct nvmet_pci_epf_ctrl *ctrl, 921 + struct nvmet_pci_epf_iod *iod) 922 + { 923 + struct nvme_command *cmd = &iod->cmd; 924 + size_t transfer_len = iod->data_len; 925 + int ret, nr_segs = 1; 926 + u64 prp1, prp2 = 0; 927 + size_t prp1_size; 928 + 929 + prp1 = le64_to_cpu(cmd->common.dptr.prp1); 930 + prp1_size = nvmet_pci_epf_prp_size(ctrl, prp1); 931 + 932 + /* For commands crossing a page boundary, we should have prp2. 
*/ 933 + if (transfer_len > prp1_size) { 934 + prp2 = le64_to_cpu(cmd->common.dptr.prp2); 935 + if (!prp2) { 936 + iod->status = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR; 937 + return -EINVAL; 938 + } 939 + if (nvmet_pci_epf_prp_ofst(ctrl, prp2)) { 940 + iod->status = 941 + NVME_SC_PRP_INVALID_OFFSET | NVME_STATUS_DNR; 942 + return -EINVAL; 943 + } 944 + if (prp2 != prp1 + prp1_size) 945 + nr_segs = 2; 946 + } 947 + 948 + if (nr_segs == 1) { 949 + iod->nr_data_segs = 1; 950 + iod->data_segs = &iod->data_seg; 951 + iod->data_segs[0].pci_addr = prp1; 952 + iod->data_segs[0].length = transfer_len; 953 + return 0; 954 + } 955 + 956 + ret = nvmet_pci_epf_alloc_iod_data_segs(iod, nr_segs); 957 + if (ret) { 958 + iod->status = NVME_SC_INTERNAL | NVME_STATUS_DNR; 959 + return ret; 960 + } 961 + 962 + iod->data_segs[0].pci_addr = prp1; 963 + iod->data_segs[0].length = prp1_size; 964 + iod->data_segs[1].pci_addr = prp2; 965 + iod->data_segs[1].length = transfer_len - prp1_size; 966 + 967 + return 0; 968 + } 969 + 970 + static int nvmet_pci_epf_iod_parse_prps(struct nvmet_pci_epf_iod *iod) 971 + { 972 + struct nvmet_pci_epf_ctrl *ctrl = iod->ctrl; 973 + u64 prp1 = le64_to_cpu(iod->cmd.common.dptr.prp1); 974 + size_t ofst; 975 + 976 + /* Get the PCI address segments for the command using its PRPs. */ 977 + ofst = nvmet_pci_epf_prp_ofst(ctrl, prp1); 978 + if (ofst & 0x3) { 979 + iod->status = NVME_SC_PRP_INVALID_OFFSET | NVME_STATUS_DNR; 980 + return -EINVAL; 981 + } 982 + 983 + if (iod->data_len + ofst <= ctrl->mps * 2) 984 + return nvmet_pci_epf_iod_parse_prp_simple(ctrl, iod); 985 + 986 + return nvmet_pci_epf_iod_parse_prp_list(ctrl, iod); 987 + } 988 + 989 + /* 990 + * Transfer an SGL segment from the host and return the number of data 991 + * descriptors and the next segment descriptor, if any. 
992 + */ 993 + static struct nvme_sgl_desc * 994 + nvmet_pci_epf_get_sgl_segment(struct nvmet_pci_epf_ctrl *ctrl, 995 + struct nvme_sgl_desc *desc, unsigned int *nr_sgls) 996 + { 997 + struct nvme_sgl_desc *sgls; 998 + u32 length = le32_to_cpu(desc->length); 999 + int nr_descs, ret; 1000 + void *buf; 1001 + 1002 + buf = kmalloc(length, GFP_KERNEL); 1003 + if (!buf) 1004 + return NULL; 1005 + 1006 + ret = nvmet_pci_epf_transfer(ctrl, buf, le64_to_cpu(desc->addr), length, 1007 + DMA_FROM_DEVICE); 1008 + if (ret) { 1009 + kfree(buf); 1010 + return NULL; 1011 + } 1012 + 1013 + sgls = buf; 1014 + nr_descs = length / sizeof(struct nvme_sgl_desc); 1015 + if (sgls[nr_descs - 1].type == (NVME_SGL_FMT_SEG_DESC << 4) || 1016 + sgls[nr_descs - 1].type == (NVME_SGL_FMT_LAST_SEG_DESC << 4)) { 1017 + /* 1018 + * We have another SGL segment following this one: do not count 1019 + * it as a regular data SGL descriptor and return it to the 1020 + * caller. 1021 + */ 1022 + *desc = sgls[nr_descs - 1]; 1023 + nr_descs--; 1024 + } else { 1025 + /* We do not have another SGL segment after this one. */ 1026 + desc->length = 0; 1027 + } 1028 + 1029 + *nr_sgls = nr_descs; 1030 + 1031 + return sgls; 1032 + } 1033 + 1034 + static int nvmet_pci_epf_iod_parse_sgl_segments(struct nvmet_pci_epf_ctrl *ctrl, 1035 + struct nvmet_pci_epf_iod *iod) 1036 + { 1037 + struct nvme_command *cmd = &iod->cmd; 1038 + struct nvme_sgl_desc seg = cmd->common.dptr.sgl; 1039 + struct nvme_sgl_desc *sgls = NULL; 1040 + int n = 0, i, nr_sgls; 1041 + int ret; 1042 + 1043 + /* 1044 + * We do not support inline data nor keyed SGLs, so we should be seeing 1045 + * only segment descriptors. 
1046 + */ 1047 + if (seg.type != (NVME_SGL_FMT_SEG_DESC << 4) && 1048 + seg.type != (NVME_SGL_FMT_LAST_SEG_DESC << 4)) { 1049 + iod->status = NVME_SC_SGL_INVALID_TYPE | NVME_STATUS_DNR; 1050 + return -EIO; 1051 + } 1052 + 1053 + while (seg.length) { 1054 + sgls = nvmet_pci_epf_get_sgl_segment(ctrl, &seg, &nr_sgls); 1055 + if (!sgls) { 1056 + iod->status = NVME_SC_INTERNAL | NVME_STATUS_DNR; 1057 + return -EIO; 1058 + } 1059 + 1060 + /* Grow the PCI segment table as needed. */ 1061 + ret = nvmet_pci_epf_alloc_iod_data_segs(iod, nr_sgls); 1062 + if (ret) { 1063 + iod->status = NVME_SC_INTERNAL | NVME_STATUS_DNR; 1064 + goto out; 1065 + } 1066 + 1067 + /* 1068 + * Parse the SGL descriptors to build the PCI segment table, 1069 + * checking the descriptor type as we go. 1070 + */ 1071 + for (i = 0; i < nr_sgls; i++) { 1072 + if (sgls[i].type != (NVME_SGL_FMT_DATA_DESC << 4)) { 1073 + iod->status = NVME_SC_SGL_INVALID_TYPE | 1074 + NVME_STATUS_DNR; 1075 + goto out; 1076 + } 1077 + iod->data_segs[n].pci_addr = le64_to_cpu(sgls[i].addr); 1078 + iod->data_segs[n].length = le32_to_cpu(sgls[i].length); 1079 + n++; 1080 + } 1081 + 1082 + kfree(sgls); 1083 + } 1084 + 1085 + out: 1086 + if (iod->status != NVME_SC_SUCCESS) { 1087 + kfree(sgls); 1088 + return -EIO; 1089 + } 1090 + 1091 + return 0; 1092 + } 1093 + 1094 + static int nvmet_pci_epf_iod_parse_sgls(struct nvmet_pci_epf_iod *iod) 1095 + { 1096 + struct nvmet_pci_epf_ctrl *ctrl = iod->ctrl; 1097 + struct nvme_sgl_desc *sgl = &iod->cmd.common.dptr.sgl; 1098 + 1099 + if (sgl->type == (NVME_SGL_FMT_DATA_DESC << 4)) { 1100 + /* Single data descriptor case. 
*/ 1101 + iod->nr_data_segs = 1; 1102 + iod->data_segs = &iod->data_seg; 1103 + iod->data_seg.pci_addr = le64_to_cpu(sgl->addr); 1104 + iod->data_seg.length = le32_to_cpu(sgl->length); 1105 + return 0; 1106 + } 1107 + 1108 + return nvmet_pci_epf_iod_parse_sgl_segments(ctrl, iod); 1109 + } 1110 + 1111 + static int nvmet_pci_epf_alloc_iod_data_buf(struct nvmet_pci_epf_iod *iod) 1112 + { 1113 + struct nvmet_pci_epf_ctrl *ctrl = iod->ctrl; 1114 + struct nvmet_req *req = &iod->req; 1115 + struct nvmet_pci_epf_segment *seg; 1116 + struct scatterlist *sg; 1117 + int ret, i; 1118 + 1119 + if (iod->data_len > ctrl->mdts) { 1120 + iod->status = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR; 1121 + return -EINVAL; 1122 + } 1123 + 1124 + /* 1125 + * Get the PCI address segments for the command data buffer using either 1126 + * its SGLs or PRPs. 1127 + */ 1128 + if (iod->cmd.common.flags & NVME_CMD_SGL_ALL) 1129 + ret = nvmet_pci_epf_iod_parse_sgls(iod); 1130 + else 1131 + ret = nvmet_pci_epf_iod_parse_prps(iod); 1132 + if (ret) 1133 + return ret; 1134 + 1135 + /* Get a command buffer using SGLs matching the PCI segments. 
+	 */
+	if (iod->nr_data_segs == 1) {
+		sg_init_table(&iod->data_sgl, 1);
+		iod->data_sgt.sgl = &iod->data_sgl;
+		iod->data_sgt.nents = 1;
+		iod->data_sgt.orig_nents = 1;
+	} else {
+		ret = sg_alloc_table(&iod->data_sgt, iod->nr_data_segs,
+				     GFP_KERNEL);
+		if (ret)
+			goto err_nomem;
+	}
+
+	for_each_sgtable_sg(&iod->data_sgt, sg, i) {
+		seg = &iod->data_segs[i];
+		seg->buf = kmalloc(seg->length, GFP_KERNEL);
+		if (!seg->buf)
+			goto err_nomem;
+		sg_set_buf(sg, seg->buf, seg->length);
+	}
+
+	req->transfer_len = iod->data_len;
+	req->sg = iod->data_sgt.sgl;
+	req->sg_cnt = iod->data_sgt.nents;
+
+	return 0;
+
+err_nomem:
+	iod->status = NVME_SC_INTERNAL | NVME_STATUS_DNR;
+	return -ENOMEM;
+}
+
+static void nvmet_pci_epf_complete_iod(struct nvmet_pci_epf_iod *iod)
+{
+	struct nvmet_pci_epf_queue *cq = iod->cq;
+	unsigned long flags;
+
+	/* Print an error message for failed commands, except AENs. */
+	iod->status = le16_to_cpu(iod->cqe.status) >> 1;
+	if (iod->status && iod->cmd.common.opcode != nvme_admin_async_event)
+		dev_err(iod->ctrl->dev,
+			"CQ[%d]: Command %s (0x%x) status 0x%0x\n",
+			iod->sq->qid, nvmet_pci_epf_iod_name(iod),
+			iod->cmd.common.opcode, iod->status);
+
+	/*
+	 * Add the command to the list of completed commands and schedule the
+	 * CQ work.
+	 */
+	spin_lock_irqsave(&cq->lock, flags);
+	list_add_tail(&iod->link, &cq->list);
+	queue_delayed_work(system_highpri_wq, &cq->work, 0);
+	spin_unlock_irqrestore(&cq->lock, flags);
+}
+
+static void nvmet_pci_epf_drain_queue(struct nvmet_pci_epf_queue *queue)
+{
+	struct nvmet_pci_epf_iod *iod;
+	unsigned long flags;
+
+	spin_lock_irqsave(&queue->lock, flags);
+	while (!list_empty(&queue->list)) {
+		iod = list_first_entry(&queue->list, struct nvmet_pci_epf_iod,
+				       link);
+		list_del_init(&iod->link);
+		nvmet_pci_epf_free_iod(iod);
+	}
+	spin_unlock_irqrestore(&queue->lock, flags);
+}
+
+static int nvmet_pci_epf_add_port(struct nvmet_port *port)
+{
+	mutex_lock(&nvmet_pci_epf_ports_mutex);
+	list_add_tail(&port->entry, &nvmet_pci_epf_ports);
+	mutex_unlock(&nvmet_pci_epf_ports_mutex);
+	return 0;
+}
+
+static void nvmet_pci_epf_remove_port(struct nvmet_port *port)
+{
+	mutex_lock(&nvmet_pci_epf_ports_mutex);
+	list_del_init(&port->entry);
+	mutex_unlock(&nvmet_pci_epf_ports_mutex);
+}
+
+static struct nvmet_port *
+nvmet_pci_epf_find_port(struct nvmet_pci_epf_ctrl *ctrl, __le16 portid)
+{
+	struct nvmet_port *p, *port = NULL;
+
+	mutex_lock(&nvmet_pci_epf_ports_mutex);
+	list_for_each_entry(p, &nvmet_pci_epf_ports, entry) {
+		if (p->disc_addr.portid == portid) {
+			port = p;
+			break;
+		}
+	}
+	mutex_unlock(&nvmet_pci_epf_ports_mutex);
+
+	return port;
+}
+
+static void nvmet_pci_epf_queue_response(struct nvmet_req *req)
+{
+	struct nvmet_pci_epf_iod *iod =
+		container_of(req, struct nvmet_pci_epf_iod, req);
+
+	iod->status = le16_to_cpu(req->cqe->status) >> 1;
+
+	/* If we have no data to transfer, directly complete the command. */
+	if (!iod->data_len || iod->dma_dir != DMA_TO_DEVICE) {
+		nvmet_pci_epf_complete_iod(iod);
+		return;
+	}
+
+	complete(&iod->done);
+}
+
+static u8 nvmet_pci_epf_get_mdts(const struct nvmet_ctrl *tctrl)
+{
+	struct nvmet_pci_epf_ctrl *ctrl = tctrl->drvdata;
+	int page_shift = NVME_CAP_MPSMIN(tctrl->cap) + 12;
+
+	return ilog2(ctrl->mdts) - page_shift;
+}
+
+static u16 nvmet_pci_epf_create_cq(struct nvmet_ctrl *tctrl,
+		u16 cqid, u16 flags, u16 qsize, u64 pci_addr, u16 vector)
+{
+	struct nvmet_pci_epf_ctrl *ctrl = tctrl->drvdata;
+	struct nvmet_pci_epf_queue *cq = &ctrl->cq[cqid];
+	u16 status;
+
+	if (test_and_set_bit(NVMET_PCI_EPF_Q_LIVE, &cq->flags))
+		return NVME_SC_QID_INVALID | NVME_STATUS_DNR;
+
+	if (!(flags & NVME_QUEUE_PHYS_CONTIG))
+		return NVME_SC_INVALID_QUEUE | NVME_STATUS_DNR;
+
+	if (flags & NVME_CQ_IRQ_ENABLED)
+		set_bit(NVMET_PCI_EPF_Q_IRQ_ENABLED, &cq->flags);
+
+	cq->pci_addr = pci_addr;
+	cq->qid = cqid;
+	cq->depth = qsize + 1;
+	cq->vector = vector;
+	cq->head = 0;
+	cq->tail = 0;
+	cq->phase = 1;
+	cq->db = NVME_REG_DBS + (((cqid * 2) + 1) * sizeof(u32));
+	nvmet_pci_epf_bar_write32(ctrl, cq->db, 0);
+
+	if (!cqid)
+		cq->qes = sizeof(struct nvme_completion);
+	else
+		cq->qes = ctrl->io_cqes;
+	cq->pci_size = cq->qes * cq->depth;
+
+	cq->iv = nvmet_pci_epf_add_irq_vector(ctrl, vector);
+	if (!cq->iv) {
+		status = NVME_SC_INTERNAL | NVME_STATUS_DNR;
+		goto err;
+	}
+
+	status = nvmet_cq_create(tctrl, &cq->nvme_cq, cqid, cq->depth);
+	if (status != NVME_SC_SUCCESS)
+		goto err;
+
+	dev_dbg(ctrl->dev, "CQ[%u]: %u entries of %zu B, IRQ vector %u\n",
+		cqid, qsize, cq->qes, cq->vector);
+
+	return NVME_SC_SUCCESS;
+
+err:
+	clear_bit(NVMET_PCI_EPF_Q_IRQ_ENABLED, &cq->flags);
+	clear_bit(NVMET_PCI_EPF_Q_LIVE, &cq->flags);
+	return status;
+}
+
+static u16 nvmet_pci_epf_delete_cq(struct nvmet_ctrl *tctrl, u16 cqid)
+{
+	struct nvmet_pci_epf_ctrl *ctrl = tctrl->drvdata;
+	struct nvmet_pci_epf_queue *cq = &ctrl->cq[cqid];
+
+	if (!test_and_clear_bit(NVMET_PCI_EPF_Q_LIVE, &cq->flags))
+		return NVME_SC_QID_INVALID | NVME_STATUS_DNR;
+
+	cancel_delayed_work_sync(&cq->work);
+	nvmet_pci_epf_drain_queue(cq);
+	nvmet_pci_epf_remove_irq_vector(ctrl, cq->vector);
+
+	return NVME_SC_SUCCESS;
+}
+
+static u16 nvmet_pci_epf_create_sq(struct nvmet_ctrl *tctrl,
+		u16 sqid, u16 flags, u16 qsize, u64 pci_addr)
+{
+	struct nvmet_pci_epf_ctrl *ctrl = tctrl->drvdata;
+	struct nvmet_pci_epf_queue *sq = &ctrl->sq[sqid];
+	u16 status;
+
+	if (test_and_set_bit(NVMET_PCI_EPF_Q_LIVE, &sq->flags))
+		return NVME_SC_QID_INVALID | NVME_STATUS_DNR;
+
+	if (!(flags & NVME_QUEUE_PHYS_CONTIG))
+		return NVME_SC_INVALID_QUEUE | NVME_STATUS_DNR;
+
+	sq->pci_addr = pci_addr;
+	sq->qid = sqid;
+	sq->depth = qsize + 1;
+	sq->head = 0;
+	sq->tail = 0;
+	sq->phase = 0;
+	sq->db = NVME_REG_DBS + (sqid * 2 * sizeof(u32));
+	nvmet_pci_epf_bar_write32(ctrl, sq->db, 0);
+	if (!sqid)
+		sq->qes = 1UL << NVME_ADM_SQES;
+	else
+		sq->qes = ctrl->io_sqes;
+	sq->pci_size = sq->qes * sq->depth;
+
+	status = nvmet_sq_create(tctrl, &sq->nvme_sq, sqid, sq->depth);
+	if (status != NVME_SC_SUCCESS)
+		goto out_clear_bit;
+
+	sq->iod_wq = alloc_workqueue("sq%d_wq", WQ_UNBOUND,
+				min_t(int, sq->depth, WQ_MAX_ACTIVE), sqid);
+	if (!sq->iod_wq) {
+		dev_err(ctrl->dev, "Failed to create SQ %d work queue\n", sqid);
+		status = NVME_SC_INTERNAL | NVME_STATUS_DNR;
+		goto out_destroy_sq;
+	}
+
+	dev_dbg(ctrl->dev, "SQ[%u]: %u entries of %zu B\n",
+		sqid, qsize, sq->qes);
+
+	return NVME_SC_SUCCESS;
+
+out_destroy_sq:
+	nvmet_sq_destroy(&sq->nvme_sq);
+out_clear_bit:
+	clear_bit(NVMET_PCI_EPF_Q_LIVE, &sq->flags);
+	return status;
+}
+
+static u16 nvmet_pci_epf_delete_sq(struct nvmet_ctrl *tctrl, u16 sqid)
+{
+	struct nvmet_pci_epf_ctrl *ctrl = tctrl->drvdata;
+	struct nvmet_pci_epf_queue *sq = &ctrl->sq[sqid];
+
+	if (!test_and_clear_bit(NVMET_PCI_EPF_Q_LIVE, &sq->flags))
+		return NVME_SC_QID_INVALID | NVME_STATUS_DNR;
+
+	flush_workqueue(sq->iod_wq);
+	destroy_workqueue(sq->iod_wq);
+	sq->iod_wq = NULL;
+
+	nvmet_pci_epf_drain_queue(sq);
+
+	if (sq->nvme_sq.ctrl)
+		nvmet_sq_destroy(&sq->nvme_sq);
+
+	return NVME_SC_SUCCESS;
+}
+
+static u16 nvmet_pci_epf_get_feat(const struct nvmet_ctrl *tctrl,
+				  u8 feat, void *data)
+{
+	struct nvmet_pci_epf_ctrl *ctrl = tctrl->drvdata;
+	struct nvmet_feat_arbitration *arb;
+	struct nvmet_feat_irq_coalesce *irqc;
+	struct nvmet_feat_irq_config *irqcfg;
+	struct nvmet_pci_epf_irq_vector *iv;
+	u16 status;
+
+	switch (feat) {
+	case NVME_FEAT_ARBITRATION:
+		arb = data;
+		if (!ctrl->sq_ab)
+			arb->ab = 0x7;
+		else
+			arb->ab = ilog2(ctrl->sq_ab);
+		return NVME_SC_SUCCESS;
+
+	case NVME_FEAT_IRQ_COALESCE:
+		irqc = data;
+		irqc->thr = ctrl->irq_vector_threshold;
+		irqc->time = 0;
+		return NVME_SC_SUCCESS;
+
+	case NVME_FEAT_IRQ_CONFIG:
+		irqcfg = data;
+		mutex_lock(&ctrl->irq_lock);
+		iv = nvmet_pci_epf_find_irq_vector(ctrl, irqcfg->iv);
+		if (iv) {
+			irqcfg->cd = iv->cd;
+			status = NVME_SC_SUCCESS;
+		} else {
+			status = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR;
+		}
+		mutex_unlock(&ctrl->irq_lock);
+		return status;
+
+	default:
+		return NVME_SC_INVALID_FIELD | NVME_STATUS_DNR;
+	}
+}
+
+static u16 nvmet_pci_epf_set_feat(const struct nvmet_ctrl *tctrl,
+				  u8 feat, void *data)
+{
+	struct nvmet_pci_epf_ctrl *ctrl = tctrl->drvdata;
+	struct nvmet_feat_arbitration *arb;
+	struct nvmet_feat_irq_coalesce *irqc;
+	struct nvmet_feat_irq_config *irqcfg;
+	struct nvmet_pci_epf_irq_vector *iv;
+	u16 status;
+
+	switch (feat) {
+	case NVME_FEAT_ARBITRATION:
+		arb = data;
+		if (arb->ab == 0x7)
+			ctrl->sq_ab = 0;
+		else
+			ctrl->sq_ab = 1 << arb->ab;
+		return NVME_SC_SUCCESS;
+
+	case NVME_FEAT_IRQ_COALESCE:
+		/*
+		 * Since we do not implement precise IRQ coalescing timing,
+		 * ignore the time field.
+		 */
+		irqc = data;
+		ctrl->irq_vector_threshold = irqc->thr + 1;
+		return NVME_SC_SUCCESS;
+
+	case NVME_FEAT_IRQ_CONFIG:
+		irqcfg = data;
+		mutex_lock(&ctrl->irq_lock);
+		iv = nvmet_pci_epf_find_irq_vector(ctrl, irqcfg->iv);
+		if (iv) {
+			iv->cd = irqcfg->cd;
+			status = NVME_SC_SUCCESS;
+		} else {
+			status = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR;
+		}
+		mutex_unlock(&ctrl->irq_lock);
+		return status;
+
+	default:
+		return NVME_SC_INVALID_FIELD | NVME_STATUS_DNR;
+	}
+}
+
+static const struct nvmet_fabrics_ops nvmet_pci_epf_fabrics_ops = {
+	.owner		= THIS_MODULE,
+	.type		= NVMF_TRTYPE_PCI,
+	.add_port	= nvmet_pci_epf_add_port,
+	.remove_port	= nvmet_pci_epf_remove_port,
+	.queue_response	= nvmet_pci_epf_queue_response,
+	.get_mdts	= nvmet_pci_epf_get_mdts,
+	.create_cq	= nvmet_pci_epf_create_cq,
+	.delete_cq	= nvmet_pci_epf_delete_cq,
+	.create_sq	= nvmet_pci_epf_create_sq,
+	.delete_sq	= nvmet_pci_epf_delete_sq,
+	.get_feature	= nvmet_pci_epf_get_feat,
+	.set_feature	= nvmet_pci_epf_set_feat,
+};
+
+static void nvmet_pci_epf_cq_work(struct work_struct *work);
+
+static void nvmet_pci_epf_init_queue(struct nvmet_pci_epf_ctrl *ctrl,
+				     unsigned int qid, bool sq)
+{
+	struct nvmet_pci_epf_queue *queue;
+
+	if (sq) {
+		queue = &ctrl->sq[qid];
+		set_bit(NVMET_PCI_EPF_Q_IS_SQ, &queue->flags);
+	} else {
+		queue = &ctrl->cq[qid];
+		INIT_DELAYED_WORK(&queue->work, nvmet_pci_epf_cq_work);
+	}
+	queue->ctrl = ctrl;
+	queue->qid = qid;
+	spin_lock_init(&queue->lock);
+	INIT_LIST_HEAD(&queue->list);
+}
+
+static int nvmet_pci_epf_alloc_queues(struct nvmet_pci_epf_ctrl *ctrl)
+{
+	unsigned int qid;
+
+	ctrl->sq = kcalloc(ctrl->nr_queues,
+			   sizeof(struct nvmet_pci_epf_queue), GFP_KERNEL);
+	if (!ctrl->sq)
+		return -ENOMEM;
+
+	ctrl->cq = kcalloc(ctrl->nr_queues,
+			   sizeof(struct nvmet_pci_epf_queue), GFP_KERNEL);
+	if (!ctrl->cq) {
+		kfree(ctrl->sq);
+		ctrl->sq = NULL;
+		return -ENOMEM;
+	}
+
+	for (qid = 0; qid < ctrl->nr_queues; qid++) {
+		nvmet_pci_epf_init_queue(ctrl, qid, true);
+		nvmet_pci_epf_init_queue(ctrl, qid, false);
+	}
+
+	return 0;
+}
+
+static void nvmet_pci_epf_free_queues(struct nvmet_pci_epf_ctrl *ctrl)
+{
+	kfree(ctrl->sq);
+	ctrl->sq = NULL;
+	kfree(ctrl->cq);
+	ctrl->cq = NULL;
+}
+
+static int nvmet_pci_epf_map_queue(struct nvmet_pci_epf_ctrl *ctrl,
+				   struct nvmet_pci_epf_queue *queue)
+{
+	struct nvmet_pci_epf *nvme_epf = ctrl->nvme_epf;
+	int ret;
+
+	ret = nvmet_pci_epf_mem_map(nvme_epf, queue->pci_addr,
+				    queue->pci_size, &queue->pci_map);
+	if (ret) {
+		dev_err(ctrl->dev, "Failed to map queue %u (err=%d)\n",
+			queue->qid, ret);
+		return ret;
+	}
+
+	if (queue->pci_map.pci_size < queue->pci_size) {
+		dev_err(ctrl->dev, "Invalid partial mapping of queue %u\n",
+			queue->qid);
+		nvmet_pci_epf_mem_unmap(nvme_epf, &queue->pci_map);
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static inline void nvmet_pci_epf_unmap_queue(struct nvmet_pci_epf_ctrl *ctrl,
+					     struct nvmet_pci_epf_queue *queue)
+{
+	nvmet_pci_epf_mem_unmap(ctrl->nvme_epf, &queue->pci_map);
+}
+
+static void nvmet_pci_epf_exec_iod_work(struct work_struct *work)
+{
+	struct nvmet_pci_epf_iod *iod =
+		container_of(work, struct nvmet_pci_epf_iod, work);
+	struct nvmet_req *req = &iod->req;
+	int ret;
+
+	if (!iod->ctrl->link_up) {
+		nvmet_pci_epf_free_iod(iod);
+		return;
+	}
+
+	if (!test_bit(NVMET_PCI_EPF_Q_LIVE, &iod->sq->flags)) {
+		iod->status = NVME_SC_QID_INVALID | NVME_STATUS_DNR;
+		goto complete;
+	}
+
+	if (!nvmet_req_init(req, &iod->cq->nvme_cq, &iod->sq->nvme_sq,
+			    &nvmet_pci_epf_fabrics_ops))
+		goto complete;
+
+	iod->data_len = nvmet_req_transfer_len(req);
+	if (iod->data_len) {
+		/*
+		 * Get the data DMA transfer direction. Here "device" means the
+		 * PCI root-complex host.
+		 */
+		if (nvme_is_write(&iod->cmd))
+			iod->dma_dir = DMA_FROM_DEVICE;
+		else
+			iod->dma_dir = DMA_TO_DEVICE;
+
+		/*
+		 * Setup the command data buffer and get the command data from
+		 * the host if needed.
+		 */
+		ret = nvmet_pci_epf_alloc_iod_data_buf(iod);
+		if (!ret && iod->dma_dir == DMA_FROM_DEVICE)
+			ret = nvmet_pci_epf_transfer_iod_data(iod);
+		if (ret) {
+			nvmet_req_uninit(req);
+			goto complete;
+		}
+	}
+
+	req->execute(req);
+
+	/*
+	 * If we do not have data to transfer after the command execution
+	 * finishes, nvmet_pci_epf_queue_response() will complete the command
+	 * directly. No need to wait for the completion in this case.
+	 */
+	if (!iod->data_len || iod->dma_dir != DMA_TO_DEVICE)
+		return;
+
+	wait_for_completion(&iod->done);
+
+	if (iod->status == NVME_SC_SUCCESS) {
+		WARN_ON_ONCE(!iod->data_len || iod->dma_dir != DMA_TO_DEVICE);
+		nvmet_pci_epf_transfer_iod_data(iod);
+	}
+
+complete:
+	nvmet_pci_epf_complete_iod(iod);
+}
+
+static int nvmet_pci_epf_process_sq(struct nvmet_pci_epf_ctrl *ctrl,
+				    struct nvmet_pci_epf_queue *sq)
+{
+	struct nvmet_pci_epf_iod *iod;
+	int ret, n = 0;
+
+	sq->tail = nvmet_pci_epf_bar_read32(ctrl, sq->db);
+	while (sq->head != sq->tail && (!ctrl->sq_ab || n < ctrl->sq_ab)) {
+		iod = nvmet_pci_epf_alloc_iod(sq);
+		if (!iod)
+			break;
+
+		/* Get the NVMe command submitted by the host. */
+		ret = nvmet_pci_epf_transfer(ctrl, &iod->cmd,
+					     sq->pci_addr + sq->head * sq->qes,
+					     sq->qes, DMA_FROM_DEVICE);
+		if (ret) {
+			/* Not much we can do... */
+			nvmet_pci_epf_free_iod(iod);
+			break;
+		}
+
+		dev_dbg(ctrl->dev, "SQ[%u]: head %u, tail %u, command %s\n",
+			sq->qid, sq->head, sq->tail,
+			nvmet_pci_epf_iod_name(iod));
+
+		sq->head++;
+		if (sq->head == sq->depth)
+			sq->head = 0;
+		n++;
+
+		queue_work_on(WORK_CPU_UNBOUND, sq->iod_wq, &iod->work);
+
+		sq->tail = nvmet_pci_epf_bar_read32(ctrl, sq->db);
+	}
+
+	return n;
+}
+
+static void nvmet_pci_epf_poll_sqs_work(struct work_struct *work)
+{
+	struct nvmet_pci_epf_ctrl *ctrl =
+		container_of(work, struct nvmet_pci_epf_ctrl, poll_sqs.work);
+	struct nvmet_pci_epf_queue *sq;
+	unsigned long last = 0;
+	int i, nr_sqs;
+
+	while (ctrl->link_up && ctrl->enabled) {
+		nr_sqs = 0;
+		/* Do round-robin arbitration. */
+		for (i = 0; i < ctrl->nr_queues; i++) {
+			sq = &ctrl->sq[i];
+			if (!test_bit(NVMET_PCI_EPF_Q_LIVE, &sq->flags))
+				continue;
+			if (nvmet_pci_epf_process_sq(ctrl, sq))
+				nr_sqs++;
+		}
+
+		if (nr_sqs) {
+			last = jiffies;
+			continue;
+		}
+
+		/*
+		 * If we have not received any command on any queue for more
+		 * than NVMET_PCI_EPF_SQ_POLL_IDLE, assume we are idle and
+		 * reschedule. This avoids "burning" a CPU when the controller
+		 * is idle for a long time.
+		 */
+		if (time_is_before_jiffies(last + NVMET_PCI_EPF_SQ_POLL_IDLE))
+			break;
+
+		cpu_relax();
+	}
+
+	schedule_delayed_work(&ctrl->poll_sqs, NVMET_PCI_EPF_SQ_POLL_INTERVAL);
+}
+
+static void nvmet_pci_epf_cq_work(struct work_struct *work)
+{
+	struct nvmet_pci_epf_queue *cq =
+		container_of(work, struct nvmet_pci_epf_queue, work.work);
+	struct nvmet_pci_epf_ctrl *ctrl = cq->ctrl;
+	struct nvme_completion *cqe;
+	struct nvmet_pci_epf_iod *iod;
+	unsigned long flags;
+	int ret, n = 0;
+
+	ret = nvmet_pci_epf_map_queue(ctrl, cq);
+	if (ret)
+		goto again;
+
+	while (test_bit(NVMET_PCI_EPF_Q_LIVE, &cq->flags) && ctrl->link_up) {
+
+		/* Check that the CQ is not full. */
+		cq->head = nvmet_pci_epf_bar_read32(ctrl, cq->db);
+		if (cq->head == cq->tail + 1) {
+			ret = -EAGAIN;
+			break;
+		}
+
+		spin_lock_irqsave(&cq->lock, flags);
+		iod = list_first_entry_or_null(&cq->list,
+					       struct nvmet_pci_epf_iod, link);
+		if (iod)
+			list_del_init(&iod->link);
+		spin_unlock_irqrestore(&cq->lock, flags);
+
+		if (!iod)
+			break;
+
+		/* Post the IOD completion entry. */
+		cqe = &iod->cqe;
+		cqe->status = cpu_to_le16((iod->status << 1) | cq->phase);
+
+		dev_dbg(ctrl->dev,
+			"CQ[%u]: %s status 0x%x, result 0x%llx, head %u, tail %u, phase %u\n",
+			cq->qid, nvmet_pci_epf_iod_name(iod), iod->status,
+			le64_to_cpu(cqe->result.u64), cq->head, cq->tail,
+			cq->phase);
+
+		memcpy_toio(cq->pci_map.virt_addr + cq->tail * cq->qes,
+			    cqe, cq->qes);
+
+		cq->tail++;
+		if (cq->tail >= cq->depth) {
+			cq->tail = 0;
+			cq->phase ^= 1;
+		}
+
+		nvmet_pci_epf_free_iod(iod);
+
+		/* Signal the host. */
+		nvmet_pci_epf_raise_irq(ctrl, cq, false);
+		n++;
+	}
+
+	nvmet_pci_epf_unmap_queue(ctrl, cq);
+
+	/*
+	 * We do not support precise IRQ coalescing time (100ns units as per
+	 * NVMe specifications). So if we have posted completion entries without
+	 * reaching the interrupt coalescing threshold, raise an interrupt.
+	 */
+	if (n)
+		nvmet_pci_epf_raise_irq(ctrl, cq, true);
+
+again:
+	if (ret < 0)
+		queue_delayed_work(system_highpri_wq, &cq->work,
+				   NVMET_PCI_EPF_CQ_RETRY_INTERVAL);
+}
+
+static int nvmet_pci_epf_enable_ctrl(struct nvmet_pci_epf_ctrl *ctrl)
+{
+	u64 pci_addr, asq, acq;
+	u32 aqa;
+	u16 status, qsize;
+
+	if (ctrl->enabled)
+		return 0;
+
+	dev_info(ctrl->dev, "Enabling controller\n");
+
+	ctrl->mps_shift = nvmet_cc_mps(ctrl->cc) + 12;
+	ctrl->mps = 1UL << ctrl->mps_shift;
+	ctrl->mps_mask = ctrl->mps - 1;
+
+	ctrl->io_sqes = 1UL << nvmet_cc_iosqes(ctrl->cc);
+	if (ctrl->io_sqes < sizeof(struct nvme_command)) {
+		dev_err(ctrl->dev, "Unsupported I/O SQES %zu (need %zu)\n",
+			ctrl->io_sqes, sizeof(struct nvme_command));
+		return -EINVAL;
+	}
+
+	ctrl->io_cqes = 1UL << nvmet_cc_iocqes(ctrl->cc);
+	if (ctrl->io_cqes < sizeof(struct nvme_completion)) {
+		dev_err(ctrl->dev, "Unsupported I/O CQES %zu (need %zu)\n",
+			ctrl->io_cqes, sizeof(struct nvme_completion));
+		return -EINVAL;
+	}
+
+	/* Create the admin queue. */
+	aqa = nvmet_pci_epf_bar_read32(ctrl, NVME_REG_AQA);
+	asq = nvmet_pci_epf_bar_read64(ctrl, NVME_REG_ASQ);
+	acq = nvmet_pci_epf_bar_read64(ctrl, NVME_REG_ACQ);
+
+	qsize = (aqa & 0x0fff0000) >> 16;
+	pci_addr = acq & GENMASK_ULL(63, 12);
+	status = nvmet_pci_epf_create_cq(ctrl->tctrl, 0,
+				NVME_CQ_IRQ_ENABLED | NVME_QUEUE_PHYS_CONTIG,
+				qsize, pci_addr, 0);
+	if (status != NVME_SC_SUCCESS) {
+		dev_err(ctrl->dev, "Failed to create admin completion queue\n");
+		return -EINVAL;
+	}
+
+	qsize = aqa & 0x00000fff;
+	pci_addr = asq & GENMASK_ULL(63, 12);
+	status = nvmet_pci_epf_create_sq(ctrl->tctrl, 0, NVME_QUEUE_PHYS_CONTIG,
+					 qsize, pci_addr);
+	if (status != NVME_SC_SUCCESS) {
+		dev_err(ctrl->dev, "Failed to create admin submission queue\n");
+		nvmet_pci_epf_delete_cq(ctrl->tctrl, 0);
+		return -EINVAL;
+	}
+
+	ctrl->sq_ab = NVMET_PCI_EPF_SQ_AB;
+	ctrl->irq_vector_threshold = NVMET_PCI_EPF_IV_THRESHOLD;
+	ctrl->enabled = true;
+
+	/* Start polling the controller SQs. */
+	schedule_delayed_work(&ctrl->poll_sqs, 0);
+
+	return 0;
+}
+
+static void nvmet_pci_epf_disable_ctrl(struct nvmet_pci_epf_ctrl *ctrl)
+{
+	int qid;
+
+	if (!ctrl->enabled)
+		return;
+
+	dev_info(ctrl->dev, "Disabling controller\n");
+
+	ctrl->enabled = false;
+	cancel_delayed_work_sync(&ctrl->poll_sqs);
+
+	/* Delete all I/O queues first. */
+	for (qid = 1; qid < ctrl->nr_queues; qid++)
+		nvmet_pci_epf_delete_sq(ctrl->tctrl, qid);
+
+	for (qid = 1; qid < ctrl->nr_queues; qid++)
+		nvmet_pci_epf_delete_cq(ctrl->tctrl, qid);
+
+	/* Delete the admin queue last. */
+	nvmet_pci_epf_delete_sq(ctrl->tctrl, 0);
+	nvmet_pci_epf_delete_cq(ctrl->tctrl, 0);
+}
+
+static void nvmet_pci_epf_poll_cc_work(struct work_struct *work)
+{
+	struct nvmet_pci_epf_ctrl *ctrl =
+		container_of(work, struct nvmet_pci_epf_ctrl, poll_cc.work);
+	u32 old_cc, new_cc;
+	int ret;
+
+	if (!ctrl->tctrl)
+		return;
+
+	old_cc = ctrl->cc;
+	new_cc = nvmet_pci_epf_bar_read32(ctrl, NVME_REG_CC);
+	ctrl->cc = new_cc;
+
+	if (nvmet_cc_en(new_cc) && !nvmet_cc_en(old_cc)) {
+		ret = nvmet_pci_epf_enable_ctrl(ctrl);
+		if (ret)
+			return;
+		ctrl->csts |= NVME_CSTS_RDY;
+	}
+
+	if (!nvmet_cc_en(new_cc) && nvmet_cc_en(old_cc)) {
+		nvmet_pci_epf_disable_ctrl(ctrl);
+		ctrl->csts &= ~NVME_CSTS_RDY;
+	}
+
+	if (nvmet_cc_shn(new_cc) && !nvmet_cc_shn(old_cc)) {
+		nvmet_pci_epf_disable_ctrl(ctrl);
+		ctrl->csts |= NVME_CSTS_SHST_CMPLT;
+	}
+
+	if (!nvmet_cc_shn(new_cc) && nvmet_cc_shn(old_cc))
+		ctrl->csts &= ~NVME_CSTS_SHST_CMPLT;
+
+	nvmet_update_cc(ctrl->tctrl, ctrl->cc);
+	nvmet_pci_epf_bar_write32(ctrl, NVME_REG_CSTS, ctrl->csts);
+
+	schedule_delayed_work(&ctrl->poll_cc, NVMET_PCI_EPF_CC_POLL_INTERVAL);
+}
+
+static void nvmet_pci_epf_init_bar(struct nvmet_pci_epf_ctrl *ctrl)
+{
+	struct nvmet_ctrl *tctrl = ctrl->tctrl;
+
+	ctrl->bar = ctrl->nvme_epf->reg_bar;
+
+	/* Copy the target controller capabilities as a base. */
+	ctrl->cap = tctrl->cap;
+
+	/* Contiguous Queues Required (CQR). */
+	ctrl->cap |= 0x1ULL << 16;
+
+	/* Set Doorbell stride to 4B (DSTRB). */
+	ctrl->cap &= ~GENMASK_ULL(35, 32);
+
+	/* Clear NVM Subsystem Reset Supported (NSSRS). */
+	ctrl->cap &= ~(0x1ULL << 36);
+
+	/* Clear Boot Partition Support (BPS). */
+	ctrl->cap &= ~(0x1ULL << 45);
+
+	/* Clear Persistent Memory Region Supported (PMRS). */
+	ctrl->cap &= ~(0x1ULL << 56);
+
+	/* Clear Controller Memory Buffer Supported (CMBS). */
+	ctrl->cap &= ~(0x1ULL << 57);
+
+	/* Controller configuration. */
+	ctrl->cc = tctrl->cc & (~NVME_CC_ENABLE);
+
+	/* Controller status. */
+	ctrl->csts = ctrl->tctrl->csts;
+
+	nvmet_pci_epf_bar_write64(ctrl, NVME_REG_CAP, ctrl->cap);
+	nvmet_pci_epf_bar_write32(ctrl, NVME_REG_VS, tctrl->subsys->ver);
+	nvmet_pci_epf_bar_write32(ctrl, NVME_REG_CSTS, ctrl->csts);
+	nvmet_pci_epf_bar_write32(ctrl, NVME_REG_CC, ctrl->cc);
+}
+
+static int nvmet_pci_epf_create_ctrl(struct nvmet_pci_epf *nvme_epf,
+				     unsigned int max_nr_queues)
+{
+	struct nvmet_pci_epf_ctrl *ctrl = &nvme_epf->ctrl;
+	struct nvmet_alloc_ctrl_args args = {};
+	char hostnqn[NVMF_NQN_SIZE];
+	uuid_t id;
+	int ret;
+
+	memset(ctrl, 0, sizeof(*ctrl));
+	ctrl->dev = &nvme_epf->epf->dev;
+	mutex_init(&ctrl->irq_lock);
+	ctrl->nvme_epf = nvme_epf;
+	ctrl->mdts = nvme_epf->mdts_kb * SZ_1K;
+	INIT_DELAYED_WORK(&ctrl->poll_cc, nvmet_pci_epf_poll_cc_work);
+	INIT_DELAYED_WORK(&ctrl->poll_sqs, nvmet_pci_epf_poll_sqs_work);
+
+	ret = mempool_init_kmalloc_pool(&ctrl->iod_pool,
+					max_nr_queues * NVMET_MAX_QUEUE_SIZE,
+					sizeof(struct nvmet_pci_epf_iod));
+	if (ret) {
+		dev_err(ctrl->dev, "Failed to initialize IOD mempool\n");
+		return ret;
+	}
+
+	ctrl->port = nvmet_pci_epf_find_port(ctrl, nvme_epf->portid);
+	if (!ctrl->port) {
+		dev_err(ctrl->dev, "Port not found\n");
+		ret = -EINVAL;
+		goto out_mempool_exit;
+	}
+
+	/* Create the target controller. */
+	uuid_gen(&id);
+	snprintf(hostnqn, NVMF_NQN_SIZE,
+		 "nqn.2014-08.org.nvmexpress:uuid:%pUb", &id);
+	args.port = ctrl->port;
+	args.subsysnqn = nvme_epf->subsysnqn;
+	memset(&id, 0, sizeof(uuid_t));
+	args.hostid = &id;
+	args.hostnqn = hostnqn;
+	args.ops = &nvmet_pci_epf_fabrics_ops;
+
+	ctrl->tctrl = nvmet_alloc_ctrl(&args);
+	if (!ctrl->tctrl) {
+		dev_err(ctrl->dev, "Failed to create target controller\n");
+		ret = -ENOMEM;
+		goto out_mempool_exit;
+	}
+	ctrl->tctrl->drvdata = ctrl;
+
+	/* We do not support protection information for now. */
+	if (ctrl->tctrl->pi_support) {
+		dev_err(ctrl->dev,
+			"Protection information (PI) is not supported\n");
+		ret = -ENOTSUPP;
+		goto out_put_ctrl;
+	}
+
+	/* Allocate our queues, up to the maximum number. */
+	ctrl->nr_queues = min(ctrl->tctrl->subsys->max_qid + 1, max_nr_queues);
+	ret = nvmet_pci_epf_alloc_queues(ctrl);
+	if (ret)
+		goto out_put_ctrl;
+
+	/*
+	 * Allocate the IRQ vectors descriptors. We cannot have more than the
+	 * maximum number of queues.
+	 */
+	ret = nvmet_pci_epf_alloc_irq_vectors(ctrl);
+	if (ret)
+		goto out_free_queues;
+
+	dev_info(ctrl->dev,
+		 "New PCI ctrl \"%s\", %u I/O queues, mdts %u B\n",
+		 ctrl->tctrl->subsys->subsysnqn, ctrl->nr_queues - 1,
+		 ctrl->mdts);
+
+	/* Initialize BAR 0 using the target controller CAP. */
+	nvmet_pci_epf_init_bar(ctrl);
+
+	return 0;
+
+out_free_queues:
+	nvmet_pci_epf_free_queues(ctrl);
+out_put_ctrl:
+	nvmet_ctrl_put(ctrl->tctrl);
+	ctrl->tctrl = NULL;
+out_mempool_exit:
+	mempool_exit(&ctrl->iod_pool);
+	return ret;
+}
+
+static void nvmet_pci_epf_start_ctrl(struct nvmet_pci_epf_ctrl *ctrl)
+{
+	schedule_delayed_work(&ctrl->poll_cc, NVMET_PCI_EPF_CC_POLL_INTERVAL);
+}
+
+static void nvmet_pci_epf_stop_ctrl(struct nvmet_pci_epf_ctrl *ctrl)
+{
+	cancel_delayed_work_sync(&ctrl->poll_cc);
+
+	nvmet_pci_epf_disable_ctrl(ctrl);
+}
+
+static void nvmet_pci_epf_destroy_ctrl(struct nvmet_pci_epf_ctrl *ctrl)
+{
+	if (!ctrl->tctrl)
+		return;
+
+	dev_info(ctrl->dev, "Destroying PCI ctrl \"%s\"\n",
+		 ctrl->tctrl->subsys->subsysnqn);
+
+	nvmet_pci_epf_stop_ctrl(ctrl);
+
+	nvmet_pci_epf_free_queues(ctrl);
+	nvmet_pci_epf_free_irq_vectors(ctrl);
+
+	nvmet_ctrl_put(ctrl->tctrl);
+	ctrl->tctrl = NULL;
+
+	mempool_exit(&ctrl->iod_pool);
+}
+
+static int nvmet_pci_epf_configure_bar(struct nvmet_pci_epf *nvme_epf)
+{
+	struct pci_epf *epf = nvme_epf->epf;
+	const struct pci_epc_features *epc_features = nvme_epf->epc_features;
+	size_t reg_size, reg_bar_size;
+	size_t msix_table_size = 0;
+
+	/*
+	 * The first free BAR will be our register BAR and per NVMe
+	 * specifications, it must be BAR 0.
+	 */
+	if (pci_epc_get_first_free_bar(epc_features) != BAR_0) {
+		dev_err(&epf->dev, "BAR 0 is not free\n");
+		return -ENODEV;
+	}
+
+	if (epc_features->bar[BAR_0].only_64bit)
+		epf->bar[BAR_0].flags |= PCI_BASE_ADDRESS_MEM_TYPE_64;
+
+	/*
+	 * Calculate the size of the register bar: NVMe registers first with
+	 * enough space for the doorbells, followed by the MSI-X table
+	 * if supported.
+	 */
+	reg_size = NVME_REG_DBS + (NVMET_NR_QUEUES * 2 * sizeof(u32));
+	reg_size = ALIGN(reg_size, 8);
+
+	if (epc_features->msix_capable) {
+		size_t pba_size;
+
+		msix_table_size = PCI_MSIX_ENTRY_SIZE * epf->msix_interrupts;
+		nvme_epf->msix_table_offset = reg_size;
+		pba_size = ALIGN(DIV_ROUND_UP(epf->msix_interrupts, 8), 8);
+
+		reg_size += msix_table_size + pba_size;
+	}
+
+	if (epc_features->bar[BAR_0].type == BAR_FIXED) {
+		if (reg_size > epc_features->bar[BAR_0].fixed_size) {
+			dev_err(&epf->dev,
+				"BAR 0 size %llu B too small, need %zu B\n",
+				epc_features->bar[BAR_0].fixed_size,
+				reg_size);
+			return -ENOMEM;
+		}
+		reg_bar_size = epc_features->bar[BAR_0].fixed_size;
+	} else {
+		reg_bar_size = ALIGN(reg_size, max(epc_features->align, 4096));
+	}
+
+	nvme_epf->reg_bar = pci_epf_alloc_space(epf, reg_bar_size, BAR_0,
+						epc_features, PRIMARY_INTERFACE);
+	if (!nvme_epf->reg_bar) {
+		dev_err(&epf->dev, "Failed to allocate BAR 0\n");
+		return -ENOMEM;
+	}
+	memset(nvme_epf->reg_bar, 0, reg_bar_size);
+
+	return 0;
+}
+
+static void nvmet_pci_epf_free_bar(struct nvmet_pci_epf *nvme_epf)
+{
+	struct pci_epf *epf = nvme_epf->epf;
+
+	if (!nvme_epf->reg_bar)
+		return;
+
+	pci_epf_free_space(epf, nvme_epf->reg_bar, BAR_0, PRIMARY_INTERFACE);
+	nvme_epf->reg_bar = NULL;
+}
+
+static void nvmet_pci_epf_clear_bar(struct nvmet_pci_epf *nvme_epf)
+{
+	struct pci_epf *epf = nvme_epf->epf;
+
+	pci_epc_clear_bar(epf->epc, epf->func_no, epf->vfunc_no,
+			  &epf->bar[BAR_0]);
+}
+
+static int nvmet_pci_epf_init_irq(struct nvmet_pci_epf *nvme_epf)
+{
+	const struct pci_epc_features *epc_features = nvme_epf->epc_features;
+	struct pci_epf *epf = nvme_epf->epf;
+	int ret;
+
+	/* Enable MSI-X if supported, otherwise, use MSI. */
+	if (epc_features->msix_capable && epf->msix_interrupts) {
+		ret = pci_epc_set_msix(epf->epc, epf->func_no, epf->vfunc_no,
+				       epf->msix_interrupts, BAR_0,
+				       nvme_epf->msix_table_offset);
+		if (ret) {
+			dev_err(&epf->dev, "Failed to configure MSI-X\n");
+			return ret;
+		}
+
+		nvme_epf->nr_vectors = epf->msix_interrupts;
+		nvme_epf->irq_type = PCI_IRQ_MSIX;
+
+		return 0;
+	}
+
+	if (epc_features->msi_capable && epf->msi_interrupts) {
+		ret = pci_epc_set_msi(epf->epc, epf->func_no, epf->vfunc_no,
+				      epf->msi_interrupts);
+		if (ret) {
+			dev_err(&epf->dev, "Failed to configure MSI\n");
+			return ret;
+		}
+
+		nvme_epf->nr_vectors = epf->msi_interrupts;
+		nvme_epf->irq_type = PCI_IRQ_MSI;
+
+		return 0;
+	}
+
+	/* MSI and MSI-X are not supported: fall back to INTx. */
+	nvme_epf->nr_vectors = 1;
+	nvme_epf->irq_type = PCI_IRQ_INTX;
+
+	return 0;
+}
+
+static int nvmet_pci_epf_epc_init(struct pci_epf *epf)
+{
+	struct nvmet_pci_epf *nvme_epf = epf_get_drvdata(epf);
+	const struct pci_epc_features *epc_features = nvme_epf->epc_features;
+	struct nvmet_pci_epf_ctrl *ctrl = &nvme_epf->ctrl;
+	unsigned int max_nr_queues = NVMET_NR_QUEUES;
+	int ret;
+
+	/* For now, do not support virtual functions. */
+	if (epf->vfunc_no > 0) {
+		dev_err(&epf->dev, "Virtual functions are not supported\n");
+		return -EINVAL;
+	}
+
+	/*
+	 * Cap the maximum number of queues we can support on the controller
+	 * with the number of IRQs we can use.
+	 */
+	if (epc_features->msix_capable && epf->msix_interrupts) {
+		dev_info(&epf->dev,
+			 "PCI endpoint controller supports MSI-X, %u vectors\n",
+			 epf->msix_interrupts);
+		max_nr_queues = min(max_nr_queues, epf->msix_interrupts);
+	} else if (epc_features->msi_capable && epf->msi_interrupts) {
+		dev_info(&epf->dev,
+			 "PCI endpoint controller supports MSI, %u vectors\n",
+			 epf->msi_interrupts);
+		max_nr_queues = min(max_nr_queues, epf->msi_interrupts);
+	}
+
+	if (max_nr_queues < 2) {
+		dev_err(&epf->dev, "Invalid maximum number of queues %u\n",
+			max_nr_queues);
+		return -EINVAL;
+	}
+
+	/* Create the target controller. */
+	ret = nvmet_pci_epf_create_ctrl(nvme_epf, max_nr_queues);
+	if (ret) {
+		dev_err(&epf->dev,
+			"Failed to create NVMe PCI target controller (err=%d)\n",
+			ret);
+		return ret;
+	}
+
+	/* Set device ID, class, etc. */
+	epf->header->vendorid = ctrl->tctrl->subsys->vendor_id;
+	epf->header->subsys_vendor_id = ctrl->tctrl->subsys->subsys_vendor_id;
+	ret = pci_epc_write_header(epf->epc, epf->func_no, epf->vfunc_no,
+				   epf->header);
+	if (ret) {
+		dev_err(&epf->dev,
+			"Failed to write configuration header (err=%d)\n", ret);
+		goto out_destroy_ctrl;
+	}
+
+	ret = pci_epc_set_bar(epf->epc, epf->func_no, epf->vfunc_no,
+			      &epf->bar[BAR_0]);
+	if (ret) {
+		dev_err(&epf->dev, "Failed to set BAR 0 (err=%d)\n", ret);
+		goto out_destroy_ctrl;
+	}
+
+	/*
+	 * Enable interrupts and start polling the controller BAR if we do not
+	 * have a link up notifier.
+	 */
+	ret = nvmet_pci_epf_init_irq(nvme_epf);
+	if (ret)
+		goto out_clear_bar;
+
+	if (!epc_features->linkup_notifier) {
+		ctrl->link_up = true;
+		nvmet_pci_epf_start_ctrl(&nvme_epf->ctrl);
+	}
+
+	return 0;
+
+out_clear_bar:
+	nvmet_pci_epf_clear_bar(nvme_epf);
+out_destroy_ctrl:
+	nvmet_pci_epf_destroy_ctrl(&nvme_epf->ctrl);
+	return ret;
+}
+
+static void nvmet_pci_epf_epc_deinit(struct pci_epf *epf)
+{
+	struct nvmet_pci_epf *nvme_epf = epf_get_drvdata(epf);
+	struct nvmet_pci_epf_ctrl *ctrl = &nvme_epf->ctrl;
+
+	ctrl->link_up = false;
+	nvmet_pci_epf_destroy_ctrl(ctrl);
+
+	nvmet_pci_epf_deinit_dma(nvme_epf);
+	nvmet_pci_epf_clear_bar(nvme_epf);
+}
+
+static int nvmet_pci_epf_link_up(struct pci_epf *epf)
+{
+	struct nvmet_pci_epf *nvme_epf = epf_get_drvdata(epf);
+	struct nvmet_pci_epf_ctrl *ctrl = &nvme_epf->ctrl;
+
+	ctrl->link_up = true;
+	nvmet_pci_epf_start_ctrl(ctrl);
+
+	return 0;
+}
+
+static int nvmet_pci_epf_link_down(struct pci_epf *epf)
+{
+	struct nvmet_pci_epf *nvme_epf = epf_get_drvdata(epf);
+	struct nvmet_pci_epf_ctrl *ctrl = &nvme_epf->ctrl;
+
+	ctrl->link_up = false;
+	nvmet_pci_epf_stop_ctrl(ctrl);
+
+	return 0;
+}
+
+static const struct pci_epc_event_ops nvmet_pci_epf_event_ops = {
+	.epc_init = nvmet_pci_epf_epc_init,
+	.epc_deinit = nvmet_pci_epf_epc_deinit,
+	.link_up = nvmet_pci_epf_link_up,
+	.link_down = nvmet_pci_epf_link_down,
+};
+
+static int nvmet_pci_epf_bind(struct pci_epf *epf)
+{
+	struct nvmet_pci_epf *nvme_epf = epf_get_drvdata(epf);
+	const struct pci_epc_features *epc_features;
+	struct pci_epc *epc = epf->epc;
+	int ret;
+
+	if (WARN_ON_ONCE(!epc))
+		return -EINVAL;
+
+	epc_features = pci_epc_get_features(epc, epf->func_no, epf->vfunc_no);
+	if (!epc_features) {
+		dev_err(&epf->dev, "epc_features not implemented\n");
+		return -EOPNOTSUPP;
+	}
+	nvme_epf->epc_features = epc_features;
+
+	ret = nvmet_pci_epf_configure_bar(nvme_epf);
+	if (ret)
+		return ret;
+
+	nvmet_pci_epf_init_dma(nvme_epf);
+
+	return 0;
+}
+
+static void nvmet_pci_epf_unbind(struct pci_epf *epf)
+{
+	struct nvmet_pci_epf *nvme_epf = epf_get_drvdata(epf);
+	struct pci_epc *epc = epf->epc;
+
+	nvmet_pci_epf_destroy_ctrl(&nvme_epf->ctrl);
+
+	if (epc->init_complete) {
+		nvmet_pci_epf_deinit_dma(nvme_epf);
+		nvmet_pci_epf_clear_bar(nvme_epf);
+	}
+
+	nvmet_pci_epf_free_bar(nvme_epf);
+}
+
+static struct pci_epf_header nvme_epf_pci_header = {
+	.vendorid	= PCI_ANY_ID,
+	.deviceid	= PCI_ANY_ID,
+	.progif_code	= 0x02, /* NVM Express */
+	.baseclass_code	= PCI_BASE_CLASS_STORAGE,
+	.subclass_code	= 0x08, /* Non-Volatile Memory controller */
+	.interrupt_pin	= PCI_INTERRUPT_INTA,
+};
+
+static int nvmet_pci_epf_probe(struct pci_epf *epf,
+			       const struct pci_epf_device_id *id)
+{
+	struct nvmet_pci_epf *nvme_epf;
+	int ret;
+
+	nvme_epf = devm_kzalloc(&epf->dev, sizeof(*nvme_epf), GFP_KERNEL);
+	if (!nvme_epf)
+		return -ENOMEM;
+
+	ret = devm_mutex_init(&epf->dev, &nvme_epf->mmio_lock);
+	if (ret)
+		return ret;
+
+	nvme_epf->epf = epf;
+	nvme_epf->mdts_kb = NVMET_PCI_EPF_MDTS_KB;
+
+	epf->event_ops = &nvmet_pci_epf_event_ops;
+	epf->header = &nvme_epf_pci_header;
+	epf_set_drvdata(epf, nvme_epf);
+
+	return 0;
+}
+
+#define to_nvme_epf(epf_group)	\
+	container_of(epf_group, struct nvmet_pci_epf, group)
static ssize_t nvmet_pci_epf_portid_show(struct config_item *item, char *page) 2424 + { 2425 + struct config_group *group = to_config_group(item); 2426 + struct nvmet_pci_epf *nvme_epf = to_nvme_epf(group); 2427 + 2428 + return sysfs_emit(page, "%u\n", le16_to_cpu(nvme_epf->portid)); 2429 + } 2430 + 2431 + static ssize_t nvmet_pci_epf_portid_store(struct config_item *item, 2432 + const char *page, size_t len) 2433 + { 2434 + struct config_group *group = to_config_group(item); 2435 + struct nvmet_pci_epf *nvme_epf = to_nvme_epf(group); 2436 + u16 portid; 2437 + 2438 + /* Do not allow setting this when the function is already started. */ 2439 + if (nvme_epf->ctrl.tctrl) 2440 + return -EBUSY; 2441 + 2442 + if (!len) 2443 + return -EINVAL; 2444 + 2445 + if (kstrtou16(page, 0, &portid)) 2446 + return -EINVAL; 2447 + 2448 + nvme_epf->portid = cpu_to_le16(portid); 2449 + 2450 + return len; 2451 + } 2452 + 2453 + CONFIGFS_ATTR(nvmet_pci_epf_, portid); 2454 + 2455 + static ssize_t nvmet_pci_epf_subsysnqn_show(struct config_item *item, 2456 + char *page) 2457 + { 2458 + struct config_group *group = to_config_group(item); 2459 + struct nvmet_pci_epf *nvme_epf = to_nvme_epf(group); 2460 + 2461 + return sysfs_emit(page, "%s\n", nvme_epf->subsysnqn); 2462 + } 2463 + 2464 + static ssize_t nvmet_pci_epf_subsysnqn_store(struct config_item *item, 2465 + const char *page, size_t len) 2466 + { 2467 + struct config_group *group = to_config_group(item); 2468 + struct nvmet_pci_epf *nvme_epf = to_nvme_epf(group); 2469 + 2470 + /* Do not allow setting this when the function is already started. 
*/ 2471 + if (nvme_epf->ctrl.tctrl) 2472 + return -EBUSY; 2473 + 2474 + if (!len) 2475 + return -EINVAL; 2476 + 2477 + strscpy(nvme_epf->subsysnqn, page, len); 2478 + 2479 + return len; 2480 + } 2481 + 2482 + CONFIGFS_ATTR(nvmet_pci_epf_, subsysnqn); 2483 + 2484 + static ssize_t nvmet_pci_epf_mdts_kb_show(struct config_item *item, char *page) 2485 + { 2486 + struct config_group *group = to_config_group(item); 2487 + struct nvmet_pci_epf *nvme_epf = to_nvme_epf(group); 2488 + 2489 + return sysfs_emit(page, "%u\n", nvme_epf->mdts_kb); 2490 + } 2491 + 2492 + static ssize_t nvmet_pci_epf_mdts_kb_store(struct config_item *item, 2493 + const char *page, size_t len) 2494 + { 2495 + struct config_group *group = to_config_group(item); 2496 + struct nvmet_pci_epf *nvme_epf = to_nvme_epf(group); 2497 + unsigned long mdts_kb; 2498 + int ret; 2499 + 2500 + if (nvme_epf->ctrl.tctrl) 2501 + return -EBUSY; 2502 + 2503 + ret = kstrtoul(page, 0, &mdts_kb); 2504 + if (ret) 2505 + return ret; 2506 + if (!mdts_kb) 2507 + mdts_kb = NVMET_PCI_EPF_MDTS_KB; 2508 + else if (mdts_kb > NVMET_PCI_EPF_MAX_MDTS_KB) 2509 + mdts_kb = NVMET_PCI_EPF_MAX_MDTS_KB; 2510 + 2511 + if (!is_power_of_2(mdts_kb)) 2512 + return -EINVAL; 2513 + 2514 + nvme_epf->mdts_kb = mdts_kb; 2515 + 2516 + return len; 2517 + } 2518 + 2519 + CONFIGFS_ATTR(nvmet_pci_epf_, mdts_kb); 2520 + 2521 + static struct configfs_attribute *nvmet_pci_epf_attrs[] = { 2522 + &nvmet_pci_epf_attr_portid, 2523 + &nvmet_pci_epf_attr_subsysnqn, 2524 + &nvmet_pci_epf_attr_mdts_kb, 2525 + NULL, 2526 + }; 2527 + 2528 + static const struct config_item_type nvmet_pci_epf_group_type = { 2529 + .ct_attrs = nvmet_pci_epf_attrs, 2530 + .ct_owner = THIS_MODULE, 2531 + }; 2532 + 2533 + static struct config_group *nvmet_pci_epf_add_cfs(struct pci_epf *epf, 2534 + struct config_group *group) 2535 + { 2536 + struct nvmet_pci_epf *nvme_epf = epf_get_drvdata(epf); 2537 + 2538 + config_group_init_type_name(&nvme_epf->group, "nvme", 2539 + 
&nvmet_pci_epf_group_type); 2540 + 2541 + return &nvme_epf->group; 2542 + } 2543 + 2544 + static const struct pci_epf_device_id nvmet_pci_epf_ids[] = { 2545 + { .name = "nvmet_pci_epf" }, 2546 + {}, 2547 + }; 2548 + 2549 + static struct pci_epf_ops nvmet_pci_epf_ops = { 2550 + .bind = nvmet_pci_epf_bind, 2551 + .unbind = nvmet_pci_epf_unbind, 2552 + .add_cfs = nvmet_pci_epf_add_cfs, 2553 + }; 2554 + 2555 + static struct pci_epf_driver nvmet_pci_epf_driver = { 2556 + .driver.name = "nvmet_pci_epf", 2557 + .probe = nvmet_pci_epf_probe, 2558 + .id_table = nvmet_pci_epf_ids, 2559 + .ops = &nvmet_pci_epf_ops, 2560 + .owner = THIS_MODULE, 2561 + }; 2562 + 2563 + static int __init nvmet_pci_epf_init_module(void) 2564 + { 2565 + int ret; 2566 + 2567 + ret = pci_epf_register_driver(&nvmet_pci_epf_driver); 2568 + if (ret) 2569 + return ret; 2570 + 2571 + ret = nvmet_register_transport(&nvmet_pci_epf_fabrics_ops); 2572 + if (ret) { 2573 + pci_epf_unregister_driver(&nvmet_pci_epf_driver); 2574 + return ret; 2575 + } 2576 + 2577 + return 0; 2578 + } 2579 + 2580 + static void __exit nvmet_pci_epf_cleanup_module(void) 2581 + { 2582 + nvmet_unregister_transport(&nvmet_pci_epf_fabrics_ops); 2583 + pci_epf_unregister_driver(&nvmet_pci_epf_driver); 2584 + } 2585 + 2586 + module_init(nvmet_pci_epf_init_module); 2587 + module_exit(nvmet_pci_epf_cleanup_module); 2588 + 2589 + MODULE_DESCRIPTION("NVMe PCI Endpoint Function target driver"); 2590 + MODULE_AUTHOR("Damien Le Moal <dlemoal@kernel.org>"); 2591 + MODULE_LICENSE("GPL");
+1 -2
drivers/nvme/target/zns.c
···
 	for_each_sg(req->sg, sg, req->sg_cnt, sg_cnt) {
 		unsigned int len = sg->length;
 
-		if (bio_add_pc_page(bdev_get_queue(bio->bi_bdev), bio,
-				sg_page(sg), len, sg->offset) != len) {
+		if (bio_add_page(bio, sg_page(sg), len, sg->offset) != len) {
 			status = NVME_SC_INTERNAL;
 			goto out_put_bio;
 		}
+14
drivers/pci/pci-driver.c
···
 		iommu_device_unuse_default_domain(dev);
 }
 
+/*
+ * pci_device_irq_get_affinity - get IRQ affinity mask for device
+ * @dev: ptr to dev structure
+ * @irq_vec: interrupt vector number
+ *
+ * Return the CPU affinity mask for @dev and @irq_vec.
+ */
+static const struct cpumask *pci_device_irq_get_affinity(struct device *dev,
+							 unsigned int irq_vec)
+{
+	return pci_irq_get_affinity(to_pci_dev(dev), irq_vec);
+}
+
 const struct bus_type pci_bus_type = {
 	.name = "pci",
 	.match = pci_bus_match,
···
 	.probe = pci_device_probe,
 	.remove = pci_device_remove,
 	.shutdown = pci_device_shutdown,
+	.irq_get_affinity = pci_device_irq_get_affinity,
 	.dev_groups = pci_dev_groups,
 	.bus_groups = pci_bus_groups,
 	.drv_groups = pci_drv_groups,
-1
drivers/s390/block/dasd_genhd.c
···
 	block->tag_set.cmd_size = sizeof(struct dasd_ccw_req);
 	block->tag_set.nr_hw_queues = nr_hw_queues;
 	block->tag_set.queue_depth = queue_depth;
-	block->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
 	block->tag_set.numa_node = NUMA_NO_NODE;
 	rc = blk_mq_alloc_tag_set(&block->tag_set);
 	if (rc)
-1
drivers/s390/block/scm_blk.c
···
 	bdev->tag_set.cmd_size = sizeof(blk_status_t);
 	bdev->tag_set.nr_hw_queues = nr_requests;
 	bdev->tag_set.queue_depth = nr_requests_per_io * nr_requests;
-	bdev->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
 	bdev->tag_set.numa_node = NUMA_NO_NODE;
 
 	ret = blk_mq_alloc_tag_set(&bdev->tag_set);
+1 -2
drivers/scsi/fnic/fnic_main.c
···
 #include <linux/spinlock.h>
 #include <linux/workqueue.h>
 #include <linux/if_ether.h>
-#include <linux/blk-mq-pci.h>
 #include <scsi/fc/fc_fip.h>
 #include <scsi/scsi_host.h>
 #include <scsi/scsi_transport.h>
···
 		return;
 	}
 
-	blk_mq_pci_map_queues(qmap, l_pdev, FNIC_PCI_OFFSET);
+	blk_mq_map_hw_queues(qmap, &l_pdev->dev, FNIC_PCI_OFFSET);
 }
 
 static int fnic_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
-1
drivers/scsi/hisi_sas/hisi_sas.h
···
 
 #include <linux/acpi.h>
 #include <linux/blk-mq.h>
-#include <linux/blk-mq-pci.h>
 #include <linux/clk.h>
 #include <linux/debugfs.h>
 #include <linux/dmapool.h>
+3 -3
drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
···
 		if (i == HCTX_TYPE_POLL)
 			blk_mq_map_queues(qmap);
 		else
-			blk_mq_pci_map_queues(qmap, hisi_hba->pci_dev,
-					      BASE_VECTORS_V3_HW);
+			blk_mq_map_hw_queues(qmap, hisi_hba->dev,
+					     BASE_VECTORS_V3_HW);
 		qoff += qmap->nr_queues;
 	}
 }
···
 	.slave_alloc = hisi_sas_slave_alloc,
 	.shost_groups = host_v3_hw_groups,
 	.sdev_groups = sdev_groups_v3_hw,
-	.tag_alloc_policy = BLK_TAG_ALLOC_RR,
+	.tag_alloc_policy_rr = true,
 	.host_reset = hisi_sas_host_reset,
 	.host_tagset = 1,
 	.mq_poll = queue_complete_v3_hw,
+1 -2
drivers/scsi/megaraid/megaraid_sas_base.c
···
 #include <linux/poll.h>
 #include <linux/vmalloc.h>
 #include <linux/irq_poll.h>
-#include <linux/blk-mq-pci.h>
 
 #include <scsi/scsi.h>
 #include <scsi/scsi_cmnd.h>
···
 		map = &shost->tag_set.map[HCTX_TYPE_DEFAULT];
 		map->nr_queues = instance->msix_vectors - offset;
 		map->queue_offset = 0;
-		blk_mq_pci_map_queues(map, instance->pdev, offset);
+		blk_mq_map_hw_queues(map, &instance->pdev->dev, offset);
 		qoff += map->nr_queues;
 		offset += map->nr_queues;
 
-1
drivers/scsi/mpi3mr/mpi3mr.h
···
 
 #include <linux/blkdev.h>
 #include <linux/blk-mq.h>
-#include <linux/blk-mq-pci.h>
 #include <linux/delay.h>
 #include <linux/dmapool.h>
 #include <linux/errno.h>
+1 -1
drivers/scsi/mpi3mr/mpi3mr_os.c
···
 	 */
 	map->queue_offset = qoff;
 	if (i != HCTX_TYPE_POLL)
-		blk_mq_pci_map_queues(map, mrioc->pdev, offset);
+		blk_mq_map_hw_queues(map, &mrioc->pdev->dev, offset);
 	else
 		blk_mq_map_queues(map);
 
+1 -2
drivers/scsi/mpt3sas/mpt3sas_scsih.c
···
 #include <linux/pci.h>
 #include <linux/interrupt.h>
 #include <linux/raid_class.h>
-#include <linux/blk-mq-pci.h>
 #include <linux/unaligned.h>
 
 #include "mpt3sas_base.h"
···
 	 */
 	map->queue_offset = qoff;
 	if (i != HCTX_TYPE_POLL)
-		blk_mq_pci_map_queues(map, ioc->pdev, offset);
+		blk_mq_map_hw_queues(map, &ioc->pdev->dev, offset);
 	else
 		blk_mq_map_queues(map);
 
+1 -1
drivers/scsi/pm8001/pm8001_init.c
···
 	struct blk_mq_queue_map *qmap = &shost->tag_set.map[HCTX_TYPE_DEFAULT];
 
 	if (pm8001_ha->number_of_intr > 1) {
-		blk_mq_pci_map_queues(qmap, pm8001_ha->pdev, 1);
+		blk_mq_map_hw_queues(qmap, &pm8001_ha->pdev->dev, 1);
 		return;
 	}
 
-1
drivers/scsi/pm8001/pm8001_sas.h
···
 #include <scsi/sas_ata.h>
 #include <linux/atomic.h>
 #include <linux/blk-mq.h>
-#include <linux/blk-mq-pci.h>
 #include "pm8001_defs.h"
 
 #define DRV_NAME	"pm80xx"
+1 -2
drivers/scsi/qla2xxx/qla_nvme.c
···
 #include <linux/delay.h>
 #include <linux/nvme.h>
 #include <linux/nvme-fc.h>
-#include <linux/blk-mq-pci.h>
 #include <linux/blk-mq.h>
 
 static struct nvme_fc_port_template qla_nvme_fc_transport;
···
 {
 	struct scsi_qla_host *vha = lport->private;
 
-	blk_mq_pci_map_queues(map, vha->hw->pdev, vha->irq_offset);
+	blk_mq_map_hw_queues(map, &vha->hw->pdev->dev, vha->irq_offset);
 }
 
 static void qla_nvme_localport_delete(struct nvme_fc_local_port *lport)
+2 -2
drivers/scsi/qla2xxx/qla_os.c
···
 #include <linux/mutex.h>
 #include <linux/kobject.h>
 #include <linux/slab.h>
-#include <linux/blk-mq-pci.h>
 #include <linux/refcount.h>
 #include <linux/crash_dump.h>
 #include <linux/trace_events.h>
···
 	if (USER_CTRL_IRQ(vha->hw) || !vha->hw->mqiobase)
 		blk_mq_map_queues(qmap);
 	else
-		blk_mq_pci_map_queues(qmap, vha->hw->pdev, vha->irq_offset);
+		blk_mq_map_hw_queues(qmap, &vha->hw->pdev->dev,
+				     vha->irq_offset);
 }
 
 struct scsi_host_template qla2xxx_driver_template = {
+2 -3
drivers/scsi/scsi_lib.c
···
 	tag_set->queue_depth = shost->can_queue;
 	tag_set->cmd_size = cmd_size;
 	tag_set->numa_node = dev_to_node(shost->dma_dev);
-	tag_set->flags = BLK_MQ_F_SHOULD_MERGE;
-	tag_set->flags |=
-		BLK_ALLOC_POLICY_TO_MQ_FLAG(shost->hostt->tag_alloc_policy);
+	if (shost->hostt->tag_alloc_policy_rr)
+		tag_set->flags |= BLK_MQ_F_TAG_RR;
 	if (shost->queuecommand_may_block)
 		tag_set->flags |= BLK_MQ_F_BLOCKING;
 	tag_set->driver_data = shost;
+6 -12
drivers/scsi/sd.c
···
 
 	lim = queue_limits_start_update(sdkp->disk->queue);
 	sd_set_flush_flag(sdkp, &lim);
-	blk_mq_freeze_queue(sdkp->disk->queue);
-	ret = queue_limits_commit_update(sdkp->disk->queue, &lim);
-	blk_mq_unfreeze_queue(sdkp->disk->queue);
+	ret = queue_limits_commit_update_frozen(sdkp->disk->queue,
+						&lim);
 	if (ret)
 		return ret;
 	return count;
···
 
 	lim = queue_limits_start_update(sdkp->disk->queue);
 	sd_config_discard(sdkp, &lim, mode);
-	blk_mq_freeze_queue(sdkp->disk->queue);
-	err = queue_limits_commit_update(sdkp->disk->queue, &lim);
-	blk_mq_unfreeze_queue(sdkp->disk->queue);
+	err = queue_limits_commit_update_frozen(sdkp->disk->queue, &lim);
 	if (err)
 		return err;
 	return count;
···
 
 	lim = queue_limits_start_update(sdkp->disk->queue);
 	sd_config_write_same(sdkp, &lim);
-	blk_mq_freeze_queue(sdkp->disk->queue);
-	err = queue_limits_commit_update(sdkp->disk->queue, &lim);
-	blk_mq_unfreeze_queue(sdkp->disk->queue);
+	err = queue_limits_commit_update_frozen(sdkp->disk->queue, &lim);
 	if (err)
 		return err;
 	return count;
···
 	lim->atomic_write_hw_boundary = 0;
 	lim->atomic_write_hw_unit_min = unit_min * logical_block_size;
 	lim->atomic_write_hw_unit_max = unit_max * logical_block_size;
+	lim->features |= BLK_FEAT_ATOMIC_WRITES;
 }
 
 static blk_status_t sd_setup_write_same16_cmnd(struct scsi_cmnd *cmd,
···
 	sd_config_write_same(sdkp, &lim);
 	kfree(buffer);
 
-	blk_mq_freeze_queue(sdkp->disk->queue);
-	err = queue_limits_commit_update(sdkp->disk->queue, &lim);
-	blk_mq_unfreeze_queue(sdkp->disk->queue);
+	err = queue_limits_commit_update_frozen(sdkp->disk->queue, &lim);
 	if (err)
 		return err;
 
+3 -4
drivers/scsi/smartpqi/smartpqi_init.c
···
 #include <linux/bcd.h>
 #include <linux/reboot.h>
 #include <linux/cciss_ioctl.h>
-#include <linux/blk-mq-pci.h>
 #include <scsi/scsi_host.h>
 #include <scsi/scsi_cmnd.h>
 #include <scsi/scsi_device.h>
···
 	struct pqi_ctrl_info *ctrl_info = shost_to_hba(shost);
 
 	if (!ctrl_info->disable_managed_interrupts)
-		return blk_mq_pci_map_queues(&shost->tag_set.map[HCTX_TYPE_DEFAULT],
-					     ctrl_info->pci_dev, 0);
+		blk_mq_map_hw_queues(&shost->tag_set.map[HCTX_TYPE_DEFAULT],
+				     &ctrl_info->pci_dev->dev, 0);
 	else
-		return blk_mq_map_queues(&shost->tag_set.map[HCTX_TYPE_DEFAULT]);
+		blk_mq_map_queues(&shost->tag_set.map[HCTX_TYPE_DEFAULT]);
 }
 
 static inline bool pqi_is_tape_changer_device(struct pqi_scsi_dev *device)
+1 -4
drivers/scsi/sr.c
···
 
 	lim = queue_limits_start_update(q);
 	lim.logical_block_size = sector_size;
-	blk_mq_freeze_queue(q);
-	err = queue_limits_commit_update(q, &lim);
-	blk_mq_unfreeze_queue(q);
-	return err;
+	return queue_limits_commit_update_frozen(q, &lim);
 }
 
 static int get_capabilities(struct scsi_cd *cd)
+1 -2
drivers/scsi/virtio_scsi.c
···
 #include <scsi/scsi_tcq.h>
 #include <scsi/scsi_devinfo.h>
 #include <linux/seqlock.h>
-#include <linux/blk-mq-virtio.h>
 
 #include "sd.h"
 
···
 		if (i == HCTX_TYPE_POLL)
 			blk_mq_map_queues(map);
 		else
-			blk_mq_virtio_map_queues(map, vscsi->vdev, 2);
+			blk_mq_map_hw_queues(map, &vscsi->vdev->dev, 2);
 	}
 }
 
+2 -4
drivers/target/target_core_pscsi.c
···
 pscsi_map_sg(struct se_cmd *cmd, struct scatterlist *sgl, u32 sgl_nents,
 	     struct request *req)
 {
-	struct pscsi_dev_virt *pdv = PSCSI_DEV(cmd->se_dev);
 	struct bio *bio = NULL;
 	struct page *page;
 	struct scatterlist *sg;
···
 					(rw) ? "rw" : "r", nr_vecs);
 			}
 
 			pr_debug("PSCSI: Calling bio_add_page() i: %d"
 				" bio: %p page: %p len: %d off: %d\n", i, bio,
 				page, len, off);
 
-			rc = bio_add_pc_page(pdv->pdv_sd->request_queue,
-					bio, page, bytes, off);
+			rc = bio_add_page(bio, page, bytes, off);
 			pr_debug("PSCSI: bio->bi_vcnt: %d nr_vecs: %d\n",
 				bio_segments(bio), nr_vecs);
 			if (rc != bytes) {
-1
drivers/ufs/core/ufshcd.c
···
 		.nr_hw_queues = 1,
 		.queue_depth = hba->nutmrs,
 		.ops = &ufshcd_tmf_ops,
-		.flags = BLK_MQ_F_NO_SCHED,
 	};
 	err = blk_mq_alloc_tag_set(&hba->tmf_tag_set);
 	if (err < 0)
+1 -4
drivers/usb/storage/scsiglue.c
···
 	if (sscanf(buf, "%hu", &ms) <= 0)
 		return -EINVAL;
 
-	blk_mq_freeze_queue(sdev->request_queue);
 	lim = queue_limits_start_update(sdev->request_queue);
 	lim.max_hw_sectors = ms;
-	ret = queue_limits_commit_update(sdev->request_queue, &lim);
-	blk_mq_unfreeze_queue(sdev->request_queue);
-
+	ret = queue_limits_commit_update_frozen(sdev->request_queue, &lim);
 	if (ret)
 		return ret;
 	return count;
+19
drivers/virtio/virtio.c
···
 		of_node_put(dev->dev.of_node);
 }
 
+/*
+ * virtio_irq_get_affinity - get IRQ affinity mask for device
+ * @_d: ptr to dev structure
+ * @irq_vec: interrupt vector number
+ *
+ * Return the CPU affinity mask for @_d and @irq_vec.
+ */
+static const struct cpumask *virtio_irq_get_affinity(struct device *_d,
+						     unsigned int irq_vec)
+{
+	struct virtio_device *dev = dev_to_virtio(_d);
+
+	if (!dev->config->get_vq_affinity)
+		return NULL;
+
+	return dev->config->get_vq_affinity(dev, irq_vec);
+}
+
 static const struct bus_type virtio_bus = {
 	.name = "virtio",
 	.match = virtio_dev_match,
···
 	.uevent = virtio_uevent,
 	.probe = virtio_dev_probe,
 	.remove = virtio_dev_remove,
+	.irq_get_affinity = virtio_irq_get_affinity,
 };
 
 int __register_virtio_driver(struct virtio_driver *driver, struct module *owner)
+3 -3
fs/bcachefs/move.c
···
 	io->write_sectors = k.k->size;
 
 	bio_init(&io->write.op.wbio.bio, NULL, io->bi_inline_vecs, pages, 0);
-	bio_set_prio(&io->write.op.wbio.bio,
-		     IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0));
+	io->write.op.wbio.bio.bi_ioprio =
+		IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0);
 
 	if (bch2_bio_alloc_pages(&io->write.op.wbio.bio, sectors << 9,
 				 GFP_KERNEL))
···
 	io->rbio.opts = io_opts;
 	bio_init(&io->rbio.bio, NULL, io->bi_inline_vecs, pages, 0);
 	io->rbio.bio.bi_vcnt = pages;
-	bio_set_prio(&io->rbio.bio, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0));
+	io->rbio.bio.bi_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0);
 	io->rbio.bio.bi_iter.bi_size = sectors << 9;
 
 	io->rbio.bio.bi_opf = REQ_OP_READ;
-5
include/linux/bio.h
···
 	return min(nr_segs, BIO_MAX_VECS);
 }
 
-#define bio_prio(bio)			(bio)->bi_ioprio
-#define bio_set_prio(bio, prio)		((bio)->bi_ioprio = prio)
-
 #define bio_iter_iovec(bio, iter)				\
 	bvec_iter_bvec((bio)->bi_io_vec, (iter))
 
···
 		unsigned off);
 bool __must_check bio_add_folio(struct bio *bio, struct folio *folio,
 				size_t len, size_t off);
-extern int bio_add_pc_page(struct request_queue *, struct bio *, struct page *,
-			   unsigned int, unsigned int);
 void __bio_add_page(struct bio *bio, struct page *page,
 		unsigned int len, unsigned int off);
 void bio_add_folio_nofail(struct bio *bio, struct folio *folio, size_t len,
-11
include/linux/blk-mq-pci.h
···
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _LINUX_BLK_MQ_PCI_H
-#define _LINUX_BLK_MQ_PCI_H
-
-struct blk_mq_queue_map;
-struct pci_dev;
-
-void blk_mq_pci_map_queues(struct blk_mq_queue_map *qmap, struct pci_dev *pdev,
-			   int offset);
-
-#endif /* _LINUX_BLK_MQ_PCI_H */
-11
include/linux/blk-mq-virtio.h
···
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _LINUX_BLK_MQ_VIRTIO_H
-#define _LINUX_BLK_MQ_VIRTIO_H
-
-struct blk_mq_queue_map;
-struct virtio_device;
-
-void blk_mq_virtio_map_queues(struct blk_mq_queue_map *qmap,
-			      struct virtio_device *vdev, int first_vec);
-
-#endif /* _LINUX_BLK_MQ_VIRTIO_H */
+9 -26
include/linux/blk-mq.h
···
 	BLK_EH_RESET_TIMER,
 };
 
-/* Keep alloc_policy_name[] in sync with the definitions below */
-enum {
-	BLK_TAG_ALLOC_FIFO,	/* allocate starting from 0 */
-	BLK_TAG_ALLOC_RR,	/* allocate starting from last allocated tag */
-	BLK_TAG_ALLOC_MAX
-};
-
 /**
  * struct blk_mq_hw_ctx - State for a hardware queue facing the hardware
  * block device
···
 
 /* Keep hctx_flag_name[] in sync with the definitions below */
 enum {
-	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
 	BLK_MQ_F_TAG_QUEUE_SHARED = 1 << 1,
 	/*
 	 * Set when this device requires underlying blk-mq device for
···
 	BLK_MQ_F_STACKING	= 1 << 2,
 	BLK_MQ_F_TAG_HCTX_SHARED = 1 << 3,
 	BLK_MQ_F_BLOCKING	= 1 << 4,
-	/* Do not allow an I/O scheduler to be configured. */
-	BLK_MQ_F_NO_SCHED	= 1 << 5,
+
+	/*
+	 * Alloc tags on a round-robin base instead of the first available one.
+	 */
+	BLK_MQ_F_TAG_RR		= 1 << 5,
 
 	/*
 	 * Select 'none' during queue registration in case of a single hwq
 	 * or shared hwqs instead of 'mq-deadline'.
 	 */
 	BLK_MQ_F_NO_SCHED_BY_DEFAULT	= 1 << 6,
-	BLK_MQ_F_ALLOC_POLICY_START_BIT = 7,
-	BLK_MQ_F_ALLOC_POLICY_BITS = 1,
+
+	BLK_MQ_F_MAX = 1 << 7,
 };
-#define BLK_MQ_FLAG_TO_ALLOC_POLICY(flags) \
-	((flags >> BLK_MQ_F_ALLOC_POLICY_START_BIT) & \
-		((1 << BLK_MQ_F_ALLOC_POLICY_BITS) - 1))
-#define BLK_ALLOC_POLICY_TO_MQ_FLAG(policy) \
-	((policy & ((1 << BLK_MQ_F_ALLOC_POLICY_BITS) - 1)) \
-		<< BLK_MQ_F_ALLOC_POLICY_START_BIT)
 
 #define BLK_MQ_MAX_DEPTH	(10240)
 #define BLK_MQ_NO_HCTX_IDX	(-1U)
···
 void blk_freeze_queue_start_non_owner(struct request_queue *q);
 
 void blk_mq_map_queues(struct blk_mq_queue_map *qmap);
+void blk_mq_map_hw_queues(struct blk_mq_queue_map *qmap,
+			  struct device *dev, unsigned int offset);
 void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);
 
 void blk_mq_quiesce_queue_nowait(struct request_queue *q);
···
 {
 	if (rq->q->mq_ops->cleanup_rq)
 		rq->q->mq_ops->cleanup_rq(rq);
-}
-
-static inline void blk_rq_bio_prep(struct request *rq, struct bio *bio,
-		unsigned int nr_segs)
-{
-	rq->nr_phys_segments = nr_segs;
-	rq->__data_len = bio->bi_iter.bi_size;
-	rq->bio = rq->biotail = bio;
 }
 
 void blk_mq_hctx_set_fq_lock_class(struct blk_mq_hw_ctx *hctx,
+23 -13
include/linux/blkdev.h
···
 #define BLK_FEAT_RAID_PARTIAL_STRIPES_EXPENSIVE \
 	((__force blk_features_t)(1u << 15))
 
-/* stacked device can/does support atomic writes */
-#define BLK_FEAT_ATOMIC_WRITES_STACKED \
+/* atomic writes enabled */
+#define BLK_FEAT_ATOMIC_WRITES \
 	((__force blk_features_t)(1u << 16))
 
 /*
···
 #ifdef CONFIG_LOCKDEP
 	struct task_struct	*mq_freeze_owner;
 	int			mq_freeze_owner_depth;
+	/*
+	 * Records disk & queue state in current context, used in unfreeze
+	 * queue
+	 */
+	bool			mq_freeze_disk_dead;
+	bool			mq_freeze_queue_dying;
 #endif
 	wait_queue_head_t	mq_freeze_wq;
 	/*
···
 * the caller can modify. The caller must call queue_limits_commit_update()
 * to finish the update.
 *
-* Context: process context. The caller must have frozen the queue or ensured
-* that there is outstanding I/O by other means.
+* Context: process context.
 */
 static inline struct queue_limits
 queue_limits_start_update(struct request_queue *q)
···
 	mutex_lock(&q->limits_lock);
 	return q->limits;
 }
+int queue_limits_commit_update_frozen(struct request_queue *q,
+		struct queue_limits *lim);
 int queue_limits_commit_update(struct request_queue *q,
 		struct queue_limits *lim);
 int queue_limits_set(struct request_queue *q, struct queue_limits *lim);
···
 	void (*complete)(struct io_comp_batch *);
 };
 
+static inline bool blk_atomic_write_start_sect_aligned(sector_t sector,
+						struct queue_limits *limits)
+{
+	unsigned int alignment = max(limits->atomic_write_hw_unit_min,
+				limits->atomic_write_hw_boundary);
+
+	return IS_ALIGNED(sector, alignment >> SECTOR_SHIFT);
+}
+
 static inline bool bdev_can_atomic_write(struct block_device *bdev)
 {
 	struct request_queue *bd_queue = bdev->bd_queue;
···
 	if (!limits->atomic_write_unit_min)
 		return false;
 
-	if (bdev_is_partition(bdev)) {
-		sector_t bd_start_sect = bdev->bd_start_sect;
-		unsigned int alignment =
-			max(limits->atomic_write_unit_min,
-			    limits->atomic_write_hw_boundary);
-
-		if (!IS_ALIGNED(bd_start_sect, alignment >> SECTOR_SHIFT))
-			return false;
-	}
+	if (bdev_is_partition(bdev))
+		return blk_atomic_write_start_sect_aligned(bdev->bd_start_sect,
+						limits);
 
 	return true;
 }
+1 -6
include/linux/bvec.h
···
 */
static inline phys_addr_t bvec_phys(const struct bio_vec *bvec)
{
-	/*
-	 * Note this open codes page_to_phys because page_to_phys is defined in
-	 * <asm/io.h>, which we don't want to pull in here. If it ever moves to
-	 * a sensible place we should start using it.
-	 */
-	return PFN_PHYS(page_to_pfn(bvec->bv_page)) + bvec->bv_offset;
+	return page_to_phys(bvec->bv_page) + bvec->bv_offset;
}

#endif /* __LINUX_BVEC_H */
+3
include/linux/device/bus.h
···
 *		will never get called until they do.
 * @remove:	Called when a device removed from this bus.
 * @shutdown:	Called at shut-down time to quiesce the device.
+ * @irq_get_affinity:	Get IRQ affinity mask for the device on this bus.
 *
 * @online:	Called to put the device back online (after offlining it).
 * @offline:	Called to put the device offline for hot-removal. May fail.
···
 	void (*sync_state)(struct device *dev);
 	void (*remove)(struct device *dev);
 	void (*shutdown)(struct device *dev);
+	const struct cpumask *(*irq_get_affinity)(struct device *dev,
+						  unsigned int irq_vec);
 
 	int (*online)(struct device *dev);
 	int (*offline)(struct device *dev);
+2 -2
include/linux/libata.h
···
 #define ATA_SUBBASE_SHT(drv_name)				\
 	__ATA_BASE_SHT(drv_name),				\
 	.can_queue		= ATA_DEF_QUEUE,		\
-	.tag_alloc_policy	= BLK_TAG_ALLOC_RR,		\
+	.tag_alloc_policy_rr	= true,				\
 	.device_configure	= ata_scsi_device_configure
 
 #define ATA_SUBBASE_SHT_QD(drv_name, drv_qd)			\
 	__ATA_BASE_SHT(drv_name),				\
 	.can_queue		= drv_qd,			\
-	.tag_alloc_policy	= BLK_TAG_ALLOC_RR,		\
+	.tag_alloc_policy_rr	= true,				\
 	.device_configure	= ata_scsi_device_configure
 
 #define ATA_BASE_SHT(drv_name)					\
+42
include/linux/nvme.h
/* Transport Type codes for Discovery Log Page entry TRTYPE field */
enum {
+	NVMF_TRTYPE_PCI		= 0,	/* PCI */
	NVMF_TRTYPE_RDMA	= 1,	/* RDMA */
	NVMF_TRTYPE_FC		= 2,	/* Fibre Channel */
	NVMF_TRTYPE_TCP		= 3,	/* TCP/IP */
···
	NVME_CTRL_ATTR_HID_128_BIT	= (1 << 0),
	NVME_CTRL_ATTR_TBKAS		= (1 << 6),
	NVME_CTRL_ATTR_ELBAS		= (1 << 15),
+	NVME_CTRL_ATTR_RHII		= (1 << 18),
};

struct nvme_id_ctrl {
···
static inline bool nvme_is_fabrics(const struct nvme_command *cmd)
{
	return cmd->common.opcode == nvme_fabrics_command;
+}
+
+#ifdef CONFIG_NVME_VERBOSE_ERRORS
+const char *nvme_get_error_status_str(u16 status);
+const char *nvme_get_opcode_str(u8 opcode);
+const char *nvme_get_admin_opcode_str(u8 opcode);
+const char *nvme_get_fabrics_opcode_str(u8 opcode);
+#else /* CONFIG_NVME_VERBOSE_ERRORS */
+static inline const char *nvme_get_error_status_str(u16 status)
+{
+	return "I/O Error";
+}
+static inline const char *nvme_get_opcode_str(u8 opcode)
+{
+	return "I/O Cmd";
+}
+static inline const char *nvme_get_admin_opcode_str(u8 opcode)
+{
+	return "Admin Cmd";
+}
+
+static inline const char *nvme_get_fabrics_opcode_str(u8 opcode)
+{
+	return "Fabrics Cmd";
+}
+#endif /* CONFIG_NVME_VERBOSE_ERRORS */
+
+static inline const char *nvme_opcode_str(int qid, u8 opcode)
+{
+	return qid ? nvme_get_opcode_str(opcode) :
+		nvme_get_admin_opcode_str(opcode);
+}
+
+static inline const char *nvme_fabrics_opcode_str(
+		int qid, const struct nvme_command *cmd)
+{
+	if (nvme_is_fabrics(cmd))
+		return nvme_get_fabrics_opcode_str(cmd->fabrics.fctype);
+
+	return nvme_opcode_str(qid, cmd->common.opcode);
}

struct nvme_error_slot {
+4 -2
include/scsi/scsi_host.h
 */
	short cmd_per_lun;

-	/* If use block layer to manage tags, this is tag allocation policy */
-	int tag_alloc_policy;
+	/*
+	 * Allocate tags starting from last allocated tag.
+	 */
+	bool tag_alloc_policy_rr : 1;

	/*
	 * Track QUEUE_FULL events and reduce queue depth on demand.
+1 -1
include/uapi/linux/raid/md_p.h
	char	set_name[32];	/* set and interpreted by user-space */

	__le64	ctime;		/* lo 40 bits are seconds, top 24 are microseconds or 0*/
-	__le32	level;		/* 0,1,4,5 */
+	__le32	level;		/* 0,1,4,5, -1 (linear) */
	__le32	layout;		/* only for raid5 and raid10 currently */
	__le64	size;		/* used size of component devices, in 512byte sectors */
+2
include/uapi/linux/raid/md_u.h
} mdu_array_info_t;

+#define LEVEL_LINEAR		(-1)
+
/* we need a value for 'no level specified' and 0
 * means 'raid0', so we need something else. This is
 * for internal use only
+12 -24
kernel/trace/blktrace.c
	return ret;
}

-static int __blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
-			     struct block_device *bdev, char __user *arg)
+int blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
+		    struct block_device *bdev,
+		    char __user *arg)
{
	struct blk_user_trace_setup buts;
	int ret;
···
	if (ret)
		return -EFAULT;

+	mutex_lock(&q->debugfs_mutex);
	ret = do_blk_trace_setup(q, name, dev, bdev, &buts);
+	mutex_unlock(&q->debugfs_mutex);
	if (ret)
		return ret;

	if (copy_to_user(arg, &buts, sizeof(buts))) {
-		__blk_trace_remove(q);
+		blk_trace_remove(q);
		return -EFAULT;
	}
	return 0;
-}
-
-int blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
-		    struct block_device *bdev,
-		    char __user *arg)
-{
-	int ret;
-
-	mutex_lock(&q->debugfs_mutex);
-	ret = __blk_trace_setup(q, name, dev, bdev, arg);
-	mutex_unlock(&q->debugfs_mutex);
-
-	return ret;
}
EXPORT_SYMBOL_GPL(blk_trace_setup);
···
		.pid = cbuts.pid,
	};

+	mutex_lock(&q->debugfs_mutex);
	ret = do_blk_trace_setup(q, name, dev, bdev, &buts);
+	mutex_unlock(&q->debugfs_mutex);
	if (ret)
		return ret;

	if (copy_to_user(arg, &buts.name, ARRAY_SIZE(buts.name))) {
-		__blk_trace_remove(q);
+		blk_trace_remove(q);
		return -EFAULT;
	}
···
	int ret, start = 0;
	char b[BDEVNAME_SIZE];

-	mutex_lock(&q->debugfs_mutex);
-
	switch (cmd) {
	case BLKTRACESETUP:
		snprintf(b, sizeof(b), "%pg", bdev);
-		ret = __blk_trace_setup(q, b, bdev->bd_dev, bdev, arg);
+		ret = blk_trace_setup(q, b, bdev->bd_dev, bdev, arg);
		break;
#if defined(CONFIG_COMPAT) && defined(CONFIG_X86_64)
	case BLKTRACESETUP32:
		start = 1;
		fallthrough;
	case BLKTRACESTOP:
-		ret = __blk_trace_startstop(q, start);
+		ret = blk_trace_startstop(q, start);
		break;
	case BLKTRACETEARDOWN:
-		ret = __blk_trace_remove(q);
+		ret = blk_trace_remove(q);
		break;
	default:
		ret = -ENOTTY;
		break;
	}
-
-	mutex_unlock(&q->debugfs_mutex);
	return ret;
}
+1 -1
rust/kernel/block/mq/tag_set.rs
            numa_node: bindings::NUMA_NO_NODE,
            queue_depth: num_tags,
            cmd_size,
-            flags: bindings::BLK_MQ_F_SHOULD_MERGE,
+            flags: 0,
            driver_data: core::ptr::null_mut::<crate::ffi::c_void>(),
            nr_maps: num_maps,
            ..tag_set