Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'for-6.2/block-2022-12-08' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

- NVMe pull requests via Christoph:
     - Support some passthrough commands without CAP_SYS_ADMIN (Kanchan
       Joshi)
     - Refactor PCIe probing and reset (Christoph Hellwig)
     - Various fabrics authentication fixes and improvements (Sagi
       Grimberg)
     - Avoid fallback to sequential scan due to transient issues (Uday
       Shankar)
     - Implement support for the DEAC bit in Write Zeroes (Christoph
       Hellwig)
     - Allow overriding the IEEE OUI and firmware revision in configfs
       for nvmet (Aleksandr Miloserdov)
     - Force reconnect when number of queue changes in nvmet (Daniel
       Wagner)
     - Minor fixes and improvements (Uros Bizjak, Joel Granados, Sagi
       Grimberg, Christoph Hellwig, Christophe JAILLET)
     - Fix and cleanup nvme-fc req allocation (Chaitanya Kulkarni)
     - Use the common tagset helpers in nvme-pci driver (Christoph
       Hellwig)
     - Cleanup the nvme-pci removal path (Christoph Hellwig)
     - Use kstrtobool() instead of strtobool (Christophe JAILLET)
     - Allow unprivileged passthrough of Identify Controller (Joel
       Granados)
     - Support io stats on the mpath device (Sagi Grimberg)
     - Minor nvmet cleanup (Sagi Grimberg)

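As an aside on the kstrtobool() conversion mentioned above: the kernel helper (in lib/kstrtox.c) parses the usual boolean spellings. A minimal userspace re-implementation of its accepted inputs, for illustration only (the real helper also deals with kernel error codes and trailing newlines):

```c
#include <stdbool.h>

/* Userspace sketch of kstrtobool() semantics: "y"/"Y"/"1"/"on" parse as
 * true, "n"/"N"/"0"/"off" as false. Returns 0 on success, -1 (standing
 * in for -EINVAL) on anything else. */
static int kstrtobool_sketch(const char *s, bool *res)
{
	if (!s)
		return -1;
	switch (s[0]) {
	case 'y': case 'Y': case '1':
		*res = true;
		return 0;
	case 'n': case 'N': case '0':
		*res = false;
		return 0;
	case 'o': case 'O':
		/* "on" -> true, "off" -> false */
		switch (s[1]) {
		case 'n': case 'N':
			*res = true;
			return 0;
		case 'f': case 'F':
			*res = false;
			return 0;
		}
		break;
	}
	return -1;
}
```
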
- MD pull requests via Song:
     - Code cleanups (Christoph)
     - Various fixes

- Floppy pull request from Denis:
     - Fix a memory leak in the init error path (Yuan)

- Series fixing some batch wakeup issues with sbitmap (Gabriel)

- Removal of the pktcdvd driver that was deprecated more than 5 years
ago, and subsequent removal of the devnode callback in struct
block_device_operations as no users are now left (Greg)

- Fix for partition read on an exclusively opened bdev (Jan)

- Series of elevator API cleanups (Jinlong, Christoph)

- Series of fixes and cleanups for blk-iocost (Kemeng)

- Series of fixes and cleanups for blk-throttle (Kemeng)

- Series adding concurrent support for sync queues in BFQ (Yu)

- Series bringing drbd a bit closer to the out-of-tree maintained
version (Christian, Joel, Lars, Philipp)

- Misc drbd fixes (Wang)

- blk-wbt fixes and tweaks for enable/disable (Yu)

- Fixes for mq-deadline for zoned devices (Damien)

- Add support for read-only and offline zones for null_blk
(Shin'ichiro)

- Series fixing the delayed holder tracking, as used by DM (Yu,
Christoph)

- Series enabling bio alloc caching for IRQ based IO (Pavel)

- Series enabling userspace peer-to-peer DMA (Logan)

- BFQ waker fixes (Khazhismel)

- Series fixing elevator refcount issues (Christoph, Jinlong)

- Series cleaning up references around queue destruction (Christoph)

- Series doing quiesce by tagset, enabling cleanups in drivers
(Christoph, Chao)

- Series untangling the queue kobject and queue references (Christoph)

- Misc fixes and cleanups (Bart, David, Dawei, Jinlong, Kemeng, Ye,
Yang, Waiman, Shin'ichiro, Randy, Pankaj, Christoph)

* tag 'for-6.2/block-2022-12-08' of git://git.kernel.dk/linux: (247 commits)
blktrace: Fix output non-blktrace event when blk_classic option enabled
block: sed-opal: Don't include <linux/kernel.h>
sed-opal: allow using IOC_OPAL_SAVE for locking too
blk-cgroup: Fix typo in comment
block: remove bio_set_op_attrs
nvmet: don't open-code NVME_NS_ATTR_RO enumeration
nvme-pci: use the tagset alloc/free helpers
nvme: add the Apple shared tag workaround to nvme_alloc_io_tag_set
nvme: only set reserved_tags in nvme_alloc_io_tag_set for fabrics controllers
nvme: consolidate setting the tagset flags
nvme: pass nr_maps explicitly to nvme_alloc_io_tag_set
block: bio_copy_data_iter
nvme-pci: split out a nvme_pci_ctrl_is_dead helper
nvme-pci: return early on ctrl state mismatch in nvme_reset_work
nvme-pci: rename nvme_disable_io_queues
nvme-pci: cleanup nvme_suspend_queue
nvme-pci: remove nvme_pci_disable
nvme-pci: remove nvme_disable_admin_queue
nvme: merge nvme_shutdown_ctrl into nvme_disable_ctrl
nvme: use nvme_wait_ready in nvme_shutdown_ctrl
...

+3235 -5913
-18
Documentation/ABI/testing/debugfs-pktcdvd
-What:		/sys/kernel/debug/pktcdvd/pktcdvd[0-7]
-Date:		Oct. 2006
-KernelVersion:	2.6.20
-Contact:	Thomas Maier <balagi@justmail.de>
-Description:
-
-The pktcdvd module (packet writing driver) creates
-these files in debugfs:
-
-/sys/kernel/debug/pktcdvd/pktcdvd[0-7]/
-
-	====	======	====================================
-	info	0444	Lots of driver statistics and infos.
-	====	======	====================================
-
-Example::
-
-	cat /sys/kernel/debug/pktcdvd/pktcdvd0/info
+10
Documentation/ABI/testing/sysfs-bus-pci
···
 		file contains a '1' if the memory has been published for
 		use outside the driver that owns the device.
 
+What:		/sys/bus/pci/devices/.../p2pmem/allocate
+Date:		August 2022
+Contact:	Logan Gunthorpe <logang@deltatee.com>
+Description:
+		This file allows mapping p2pmem into userspace. For each
+		mmap() call on this file, the kernel will allocate a chunk
+		of Peer-to-Peer memory for use in Peer-to-Peer transactions.
+		This memory can be used in O_DIRECT calls to NVMe backed
+		files for Peer-to-Peer copies.
+
 What:		/sys/bus/pci/devices/.../link/clkpm
 		/sys/bus/pci/devices/.../link/l0s_aspm
 		/sys/bus/pci/devices/.../link/l1_aspm
-97
Documentation/ABI/testing/sysfs-class-pktcdvd
-sysfs interface
----------------
-The pktcdvd module (packet writing driver) creates the following files in the
-sysfs: (<devid> is in the format major:minor)
-
-What:		/sys/class/pktcdvd/add
-What:		/sys/class/pktcdvd/remove
-What:		/sys/class/pktcdvd/device_map
-Date:		Oct. 2006
-KernelVersion:	2.6.20
-Contact:	Thomas Maier <balagi@justmail.de>
-Description:
-
-		==========	==============================================
-		add		(WO) Write a block device id (major:minor) to
-				create a new pktcdvd device and map it to the
-				block device.
-
-		remove		(WO) Write the pktcdvd device id (major:minor)
-				to remove the pktcdvd device.
-
-		device_map	(RO) Shows the device mapping in format:
-				pktcdvd[0-7] <pktdevid> <blkdevid>
-		==========	==============================================
-
-
-What:		/sys/class/pktcdvd/pktcdvd[0-7]/dev
-What:		/sys/class/pktcdvd/pktcdvd[0-7]/uevent
-Date:		Oct. 2006
-KernelVersion:	2.6.20
-Contact:	Thomas Maier <balagi@justmail.de>
-Description:
-		dev:	(RO) Device id
-
-		uevent:	(WO) To send a uevent
-
-
-What:		/sys/class/pktcdvd/pktcdvd[0-7]/stat/packets_started
-What:		/sys/class/pktcdvd/pktcdvd[0-7]/stat/packets_finished
-What:		/sys/class/pktcdvd/pktcdvd[0-7]/stat/kb_written
-What:		/sys/class/pktcdvd/pktcdvd[0-7]/stat/kb_read
-What:		/sys/class/pktcdvd/pktcdvd[0-7]/stat/kb_read_gather
-What:		/sys/class/pktcdvd/pktcdvd[0-7]/stat/reset
-Date:		Oct. 2006
-KernelVersion:	2.6.20
-Contact:	Thomas Maier <balagi@justmail.de>
-Description:
-		packets_started:	(RO) Number of started packets.
-
-		packets_finished:	(RO) Number of finished packets.
-
-		kb_written:		(RO) kBytes written.
-
-		kb_read:		(RO) kBytes read.
-
-		kb_read_gather:		(RO) kBytes read to fill write packets.
-
-		reset:			(WO) Write any value to it to reset
-					pktcdvd device statistic values, like
-					bytes read/written.
-
-
-What:		/sys/class/pktcdvd/pktcdvd[0-7]/write_queue/size
-What:		/sys/class/pktcdvd/pktcdvd[0-7]/write_queue/congestion_off
-What:		/sys/class/pktcdvd/pktcdvd[0-7]/write_queue/congestion_on
-Date:		Oct. 2006
-KernelVersion:	2.6.20
-Contact:	Thomas Maier <balagi@justmail.de>
-Description:
-		==============	================================================
-		size		(RO) Contains the size of the bio write queue.
-
-		congestion_off	(RW) If bio write queue size is below this mark,
-				accept new bio requests from the block layer.
-
-		congestion_on	(RW) If bio write queue size is higher as this
-				mark, do no longer accept bio write requests
-				from the block layer and wait till the pktcdvd
-				device has processed enough bio's so that bio
-				write queue size is below congestion off mark.
-				A value of <= 0 disables congestion control.
-		==============	================================================
-
-
-Example:
continue---------
-To use the pktcdvd sysfs interface directly, you can do::
-
-	# create a new pktcdvd device mapped to /dev/hdc
-	echo "22:0" >/sys/class/pktcdvd/add
-	cat /sys/class/pktcdvd/device_map
-	# assuming device pktcdvd0 was created, look at stat's
-	cat /sys/class/pktcdvd/pktcdvd0/stat/kb_written
-	# print the device id of the mapped block device
-	fgrep pktcdvd0 /sys/class/pktcdvd/device_map
-	# remove device, using pktcdvd0 device id 253:0
-	echo "253:0" >/sys/class/pktcdvd/remove
+6 -6
Documentation/block/inline-encryption.rst
···
 of inline encryption using the kernel crypto API. blk-crypto-fallback is built
 into the block layer, so it works on any block device without any special setup.
 Essentially, when a bio with an encryption context is submitted to a
-request_queue that doesn't support that encryption context, the block layer will
+block_device that doesn't support that encryption context, the block layer will
 handle en/decryption of the bio using blk-crypto-fallback.
 
 For encryption, the data cannot be encrypted in-place, as callers usually rely
···
 
 ``blk_crypto_config_supported()`` allows users to check ahead of time whether
 inline encryption with particular crypto settings will work on a particular
-request_queue -- either via hardware or via blk-crypto-fallback. This function
+block_device -- either via hardware or via blk-crypto-fallback. This function
 takes in a ``struct blk_crypto_config`` which is like blk_crypto_key, but omits
 the actual bytes of the key and instead just contains the algorithm, data unit
 size, etc. This function can be useful if blk-crypto-fallback is disabled.
···
 ``blk_crypto_init_key()`` allows users to initialize a blk_crypto_key.
 
 Users must call ``blk_crypto_start_using_key()`` before actually starting to use
-a blk_crypto_key on a request_queue (even if ``blk_crypto_config_supported()``
+a blk_crypto_key on a block_device (even if ``blk_crypto_config_supported()``
 was called earlier). This is needed to initialize blk-crypto-fallback if it
 will be needed. This must not be called from the data path, as this may have to
 allocate resources, which may deadlock in that case.
···
 later, as that happens automatically when the bio is freed or reset.
 
 Finally, when done using inline encryption with a blk_crypto_key on a
-request_queue, users must call ``blk_crypto_evict_key()``. This ensures that
+block_device, users must call ``blk_crypto_evict_key()``. This ensures that
 the key is evicted from all keyslots it may be programmed into and unlinked from
 any kernel data structures it may be linked into.
···
 5. ``blk_crypto_evict_key()`` (after all I/O has completed)
 6. Zeroize the blk_crypto_key (this has no dedicated function)
 
-If a blk_crypto_key is being used on multiple request_queues, then
+If a blk_crypto_key is being used on multiple block_devices, then
 ``blk_crypto_config_supported()`` (if used), ``blk_crypto_start_using_key()``,
-and ``blk_crypto_evict_key()`` must be called on each request_queue.
+and ``blk_crypto_evict_key()`` must be called on each block_device.
 
 API presented to device drivers
 ===============================
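The key lifecycle documented above, condensed into kernel-side pseudocode (not compilable as-is; error handling elided and the exact signatures hedged, since they differ slightly across kernel versions):

```c
/* Pseudocode: lifecycle of a blk_crypto_key on a block_device. */
struct blk_crypto_key key;

blk_crypto_init_key(&key, raw_key, mode, dun_bytes, data_unit_size);

/* Optional up-front check: can hardware or blk-crypto-fallback
 * handle this configuration? */
if (!blk_crypto_config_supported(bdev, &cfg))
	goto unsupported;

/* Must happen outside the data path; may allocate resources
 * (e.g. initialize blk-crypto-fallback). */
blk_crypto_start_using_key(bdev, &key);

/* ... submit bios whose encryption context references &key ... */

/* After all I/O completes: evict from keyslots, then zeroize. */
blk_crypto_evict_key(bdev, &key);
memzero_explicit(&key, sizeof(key));
```

If the key is used on multiple block_devices, the supported-check, start-using, and evict steps repeat per device, as the text notes.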
-7
MAINTAINERS
···
 F:	Documentation/devicetree/bindings/input/pine64,pinephone-keyboard.yaml
 F:	drivers/input/keyboard/pinephone-keyboard.c
 
-PKTCDVD DRIVER
-M:	linux-block@vger.kernel.org
-S:	Orphan
-F:	drivers/block/pktcdvd.c
-F:	include/linux/pktcdvd.h
-F:	include/uapi/linux/pktcdvd.h
-
 PLANTOWER PMS7003 AIR POLLUTION SENSOR DRIVER
 M:	Tomasz Duszynski <tduszyns@gmail.com>
 S:	Maintained
+2 -2
block/bdev.c
···
 EXPORT_SYMBOL(fsync_bdev);
 
 /**
- * freeze_bdev -- lock a filesystem and force it into a consistent state
+ * freeze_bdev - lock a filesystem and force it into a consistent state
  * @bdev: blockdevice to lock
  *
  * If a superblock is found on this device, we take the s_umount semaphore
···
 EXPORT_SYMBOL(freeze_bdev);
 
 /**
- * thaw_bdev -- unlock filesystem
+ * thaw_bdev - unlock filesystem
  * @bdev: blockdevice to unlock
  *
  * Unlocks the filesystem and marks it writeable again after freeze_bdev().
+11 -1
block/bfq-cgroup.c
···
 {
 	blkg_rwstat_add(&bfqg->stats.queued, opf, 1);
 	bfqg_stats_end_empty_time(&bfqg->stats);
-	if (!(bfqq == ((struct bfq_data *)bfqg->bfqd)->in_service_queue))
+	if (!(bfqq == bfqg->bfqd->in_service_queue))
 		bfqg_stats_set_start_group_wait_time(bfqg, bfqq_group(bfqq));
 }
 
···
 	 */
 	bfqg->bfqd = bfqd;
 	bfqg->active_entities = 0;
+	bfqg->num_queues_with_pending_reqs = 0;
 	bfqg->online = true;
 	bfqg->rq_pos_tree = RB_ROOT;
 }
···
 {
 	struct bfq_entity *entity = &bfqq->entity;
 	struct bfq_group *old_parent = bfqq_group(bfqq);
+	bool has_pending_reqs = false;
 
 	/*
 	 * No point to move bfqq to the same group, which can happen when
···
 	 * next possible expire or deactivate.
 	 */
 	bfqq->ref++;
+
+	if (entity->in_groups_with_pending_reqs) {
+		has_pending_reqs = true;
+		bfq_del_bfqq_in_groups_with_pending_reqs(bfqq);
+	}
 
 	/* If bfqq is empty, then bfq_bfqq_expire also invokes
 	 * bfq_del_bfqq_busy, thereby removing bfqq and its entity
···
 	entity->sched_data = &bfqg->sched_data;
 	/* pin down bfqg and its associated blkg */
 	bfqg_and_blkg_get(bfqg);
+
+	if (has_pending_reqs)
+		bfq_add_bfqq_in_groups_with_pending_reqs(bfqq);
 
 	if (bfq_bfqq_busy(bfqq)) {
 		if (unlikely(!bfqd->nonrot_with_queueing))
+22 -80
block/bfq-iosched.c
···
  * much easier to maintain the needed state:
  * 1) all active queues have the same weight,
  * 2) all active queues belong to the same I/O-priority class,
- * 3) there are no active groups.
+ * 3) there is at most one active group.
  * In particular, the last condition is always true if hierarchical
  * support or the cgroups interface are not enabled, thus no state
  * needs to be maintained in this case.
···
 
 	return varied_queue_weights || multiple_classes_busy
 #ifdef CONFIG_BFQ_GROUP_IOSCHED
-	       || bfqd->num_groups_with_pending_reqs > 0
+	       || bfqd->num_groups_with_pending_reqs > 1
 #endif
 	;
 }
···
  * In most scenarios, the rate at which nodes are created/destroyed
  * should be low too.
  */
-void bfq_weights_tree_add(struct bfq_data *bfqd, struct bfq_queue *bfqq,
-			  struct rb_root_cached *root)
+void bfq_weights_tree_add(struct bfq_queue *bfqq)
 {
+	struct rb_root_cached *root = &bfqq->bfqd->queue_weights_tree;
 	struct bfq_entity *entity = &bfqq->entity;
 	struct rb_node **new = &(root->rb_root.rb_node), *parent = NULL;
 	bool leftmost = true;
···
  * See the comments to the function bfq_weights_tree_add() for considerations
  * about overhead.
  */
-void __bfq_weights_tree_remove(struct bfq_data *bfqd,
-			       struct bfq_queue *bfqq,
-			       struct rb_root_cached *root)
+void bfq_weights_tree_remove(struct bfq_queue *bfqq)
 {
+	struct rb_root_cached *root;
+
 	if (!bfqq->weight_counter)
 		return;
 
+	root = &bfqq->bfqd->queue_weights_tree;
 	bfqq->weight_counter->num_active--;
 	if (bfqq->weight_counter->num_active > 0)
 		goto reset_entity_pointer;
···
 reset_entity_pointer:
 	bfqq->weight_counter = NULL;
 	bfq_put_queue(bfqq);
-}
-
-/*
- * Invoke __bfq_weights_tree_remove on bfqq and decrement the number
- * of active groups for each queue's inactive parent entity.
- */
-void bfq_weights_tree_remove(struct bfq_data *bfqd,
-			     struct bfq_queue *bfqq)
-{
-	struct bfq_entity *entity = bfqq->entity.parent;
-
-	for_each_entity(entity) {
-		struct bfq_sched_data *sd = entity->my_sched_data;
-
-		if (sd->next_in_service || sd->in_service_entity) {
-			/*
-			 * entity is still active, because either
-			 * next_in_service or in_service_entity is not
-			 * NULL (see the comments on the definition of
-			 * next_in_service for details on why
-			 * in_service_entity must be checked too).
-			 *
-			 * As a consequence, its parent entities are
-			 * active as well, and thus this loop must
-			 * stop here.
-			 */
-			break;
-		}
-
-		/*
-		 * The decrement of num_groups_with_pending_reqs is
-		 * not performed immediately upon the deactivation of
-		 * entity, but it is delayed to when it also happens
-		 * that the first leaf descendant bfqq of entity gets
-		 * all its pending requests completed. The following
-		 * instructions perform this delayed decrement, if
-		 * needed. See the comments on
-		 * num_groups_with_pending_reqs for details.
-		 */
-		if (entity->in_groups_with_pending_reqs) {
-			entity->in_groups_with_pending_reqs = false;
-			bfqd->num_groups_with_pending_reqs--;
-		}
-	}
-
-	/*
-	 * Next function is invoked last, because it causes bfqq to be
-	 * freed if the following holds: bfqq is not in service and
-	 * has no dispatched request. DO NOT use bfqq after the next
-	 * function invocation.
-	 */
-	__bfq_weights_tree_remove(bfqd, bfqq,
-				  &bfqd->queue_weights_tree);
 }
 
 /*
···
 	if (!bfqd->last_completed_rq_bfqq ||
 	    bfqd->last_completed_rq_bfqq == bfqq ||
 	    bfq_bfqq_has_short_ttime(bfqq) ||
-	    now_ns - bfqd->last_completion >= 4 * NSEC_PER_MSEC)
+	    now_ns - bfqd->last_completion >= 4 * NSEC_PER_MSEC ||
+	    bfqd->last_completed_rq_bfqq == &bfqd->oom_bfqq ||
+	    bfqq == &bfqd->oom_bfqq)
 		return;
 
 	/*
···
 
 	return 0;
 }
-
-#if 0 /* Still not clear if we can do without next two functions */
-static void bfq_activate_request(struct request_queue *q, struct request *rq)
-{
-	struct bfq_data *bfqd = q->elevator->elevator_data;
-
-	bfqd->rq_in_driver++;
-}
-
-static void bfq_deactivate_request(struct request_queue *q, struct request *rq)
-{
-	struct bfq_data *bfqd = q->elevator->elevator_data;
-
-	bfqd->rq_in_driver--;
-}
-#endif
 
 static void bfq_remove_request(struct request_queue *q,
 			       struct request *rq)
···
 		 */
 		bfqq->budget_timeout = jiffies;
 
-		bfq_weights_tree_remove(bfqd, bfqq);
+		bfq_del_bfqq_in_groups_with_pending_reqs(bfqq);
+		bfq_weights_tree_remove(bfqq);
 	}
 
 	now_ns = ktime_get_ns();
···
 			bfqq = bfq_get_bfqq_handle_split(bfqd, bic, bio,
 							 true, is_sync,
 							 NULL);
+			if (unlikely(bfqq == &bfqd->oom_bfqq))
+				bfqq_already_existing = true;
+		} else
+			bfqq_already_existing = true;
+
+		if (!bfqq_already_existing) {
 			bfqq->waker_bfqq = old_bfqq->waker_bfqq;
 			bfqq->tentative_waker_bfqq = NULL;
 
···
 			if (bfqq->waker_bfqq)
 				hlist_add_head(&bfqq->woken_list_node,
 					       &bfqq->waker_bfqq->woken_list);
-		} else
-			bfqq_already_existing = true;
+		}
 	}
 }
 
···
 #endif
 
 	blk_stat_disable_accounting(bfqd->queue);
+	clear_bit(ELEVATOR_FLAG_DISABLE_WBT, &e->flags);
 	wbt_enable_default(bfqd->queue);
 
 	kfree(bfqd);
···
 	/* We dispatch from request queue wide instead of hw queue */
 	blk_queue_flag_set(QUEUE_FLAG_SQ_SCHED, q);
 
+	set_bit(ELEVATOR_FLAG_DISABLE_WBT, &eq->flags);
 	wbt_disable_default(q);
 	blk_stat_enable_accounting(q);
 
+15 -17
block/bfq-iosched.h
···
 	struct rb_root_cached queue_weights_tree;
 
 	/*
-	 * Number of groups with at least one descendant process that
+	 * Number of groups with at least one process that
 	 * has at least one request waiting for completion. Note that
 	 * this accounts for also requests already dispatched, but not
 	 * yet completed. Therefore this number of groups may differ
 	 * (be larger) than the number of active groups, as a group is
 	 * considered active only if its corresponding entity has
-	 * descendant queues with at least one request queued. This
+	 * queues with at least one request queued. This
 	 * number is used to decide whether a scenario is symmetric.
 	 * For a detailed explanation see comments on the computation
 	 * of the variable asymmetric_scenario in the function
 	 * bfq_better_to_idle().
 	 *
 	 * However, it is hard to compute this number exactly, for
-	 * groups with multiple descendant processes. Consider a group
-	 * that is inactive, i.e., that has no descendant process with
+	 * groups with multiple processes. Consider a group
+	 * that is inactive, i.e., that has no process with
 	 * pending I/O inside BFQ queues. Then suppose that
 	 * num_groups_with_pending_reqs is still accounting for this
-	 * group, because the group has descendant processes with some
+	 * group, because the group has processes with some
 	 * I/O request still in flight. num_groups_with_pending_reqs
 	 * should be decremented when the in-flight request of the
-	 * last descendant process is finally completed (assuming that
+	 * last process is finally completed (assuming that
 	 * nothing else has changed for the group in the meantime, in
 	 * terms of composition of the group and active/inactive state of child
 	 * groups and processes). To accomplish this, an additional
···
 	 * we resort to the following tradeoff between simplicity and
 	 * accuracy: for an inactive group that is still counted in
 	 * num_groups_with_pending_reqs, we decrement
-	 * num_groups_with_pending_reqs when the first descendant
+	 * num_groups_with_pending_reqs when the first
 	 * process of the group remains with no request waiting for
 	 * completion.
 	 *
···
 	 * carefulness: to avoid multiple decrements, we flag a group,
 	 * more precisely an entity representing a group, as still
 	 * counted in num_groups_with_pending_reqs when it becomes
-	 * inactive. Then, when the first descendant queue of the
+	 * inactive. Then, when the first queue of the
 	 * entity remains with no request waiting for completion,
 	 * num_groups_with_pending_reqs is decremented, and this flag
 	 * is reset. After this flag is reset for the entity,
 	 * num_groups_with_pending_reqs won't be decremented any
-	 * longer in case a new descendant queue of the entity remains
+	 * longer in case a new queue of the entity remains
 	 * with no request waiting for completion.
 	 */
 	unsigned int num_groups_with_pending_reqs;
···
 	struct bfq_entity entity;
 	struct bfq_sched_data sched_data;
 
-	void *bfqd;
+	struct bfq_data *bfqd;
 
 	struct bfq_queue *async_bfqq[2][IOPRIO_NR_LEVELS];
 	struct bfq_queue *async_idle_bfqq;
···
 	struct bfq_entity *my_entity;
 
 	int active_entities;
+	int num_queues_with_pending_reqs;
 
 	struct rb_root rq_pos_tree;
 
···
 void bic_set_bfqq(struct bfq_io_cq *bic, struct bfq_queue *bfqq, bool is_sync);
 struct bfq_data *bic_to_bfqd(struct bfq_io_cq *bic);
 void bfq_pos_tree_add_move(struct bfq_data *bfqd, struct bfq_queue *bfqq);
-void bfq_weights_tree_add(struct bfq_data *bfqd, struct bfq_queue *bfqq,
-			  struct rb_root_cached *root);
-void __bfq_weights_tree_remove(struct bfq_data *bfqd,
-			       struct bfq_queue *bfqq,
-			       struct rb_root_cached *root);
-void bfq_weights_tree_remove(struct bfq_data *bfqd,
-			     struct bfq_queue *bfqq);
+void bfq_weights_tree_add(struct bfq_queue *bfqq);
+void bfq_weights_tree_remove(struct bfq_queue *bfqq);
 void bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 		     bool compensate, enum bfqq_expiration reason);
 void bfq_put_queue(struct bfq_queue *bfqq);
···
 		     bool expiration);
 void bfq_del_bfqq_busy(struct bfq_queue *bfqq, bool expiration);
 void bfq_add_bfqq_busy(struct bfq_queue *bfqq);
+void bfq_add_bfqq_in_groups_with_pending_reqs(struct bfq_queue *bfqq);
+void bfq_del_bfqq_in_groups_with_pending_reqs(struct bfq_queue *bfqq);
 
 /* --------------- end of interface of B-WF2Q+ ---------------- */
 
+77 -80
block/bfq-wf2q.c
···
 	return false;
 }
 
+static void bfq_inc_active_entities(struct bfq_entity *entity)
+{
+	struct bfq_sched_data *sd = entity->sched_data;
+	struct bfq_group *bfqg = container_of(sd, struct bfq_group, sched_data);
+
+	if (bfqg != bfqg->bfqd->root_group)
+		bfqg->active_entities++;
+}
+
+static void bfq_dec_active_entities(struct bfq_entity *entity)
+{
+	struct bfq_sched_data *sd = entity->sched_data;
+	struct bfq_group *bfqg = container_of(sd, struct bfq_group, sched_data);
+
+	if (bfqg != bfqg->bfqd->root_group)
+		bfqg->active_entities--;
+}
+
 #else /* CONFIG_BFQ_GROUP_IOSCHED */
 
 static bool bfq_update_parent_budget(struct bfq_entity *next_in_service)
···
 static bool bfq_no_longer_next_in_service(struct bfq_entity *entity)
 {
 	return true;
+}
+
+static void bfq_inc_active_entities(struct bfq_entity *entity)
+{
+}
+
+static void bfq_dec_active_entities(struct bfq_entity *entity)
+{
 }
 
 #endif /* CONFIG_BFQ_GROUP_IOSCHED */
···
 {
 	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
 	struct rb_node *node = &entity->rb_node;
-#ifdef CONFIG_BFQ_GROUP_IOSCHED
-	struct bfq_sched_data *sd = NULL;
-	struct bfq_group *bfqg = NULL;
-	struct bfq_data *bfqd = NULL;
-#endif
 
 	bfq_insert(&st->active, entity);
 
···
 
 	bfq_update_active_tree(node);
 
-#ifdef CONFIG_BFQ_GROUP_IOSCHED
-	sd = entity->sched_data;
-	bfqg = container_of(sd, struct bfq_group, sched_data);
-	bfqd = (struct bfq_data *)bfqg->bfqd;
-#endif
 	if (bfqq)
 		list_add(&bfqq->bfqq_list, &bfqq->bfqd->active_list);
-#ifdef CONFIG_BFQ_GROUP_IOSCHED
-	if (bfqg != bfqd->root_group)
-		bfqg->active_entities++;
-#endif
+
+	bfq_inc_active_entities(entity);
 }
 
 /**
···
 {
 	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
 	struct rb_node *node;
-#ifdef CONFIG_BFQ_GROUP_IOSCHED
-	struct bfq_sched_data *sd = NULL;
-	struct bfq_group *bfqg = NULL;
-	struct bfq_data *bfqd = NULL;
-#endif
 
 	node = bfq_find_deepest(&entity->rb_node);
 	bfq_extract(&st->active, entity);
 
 	if (node)
 		bfq_update_active_tree(node);
-
-#ifdef CONFIG_BFQ_GROUP_IOSCHED
-	sd = entity->sched_data;
-	bfqg = container_of(sd, struct bfq_group, sched_data);
-	bfqd = (struct bfq_data *)bfqg->bfqd;
-#endif
 	if (bfqq)
 		list_del(&bfqq->bfqq_list);
-#ifdef CONFIG_BFQ_GROUP_IOSCHED
-	if (bfqg != bfqd->root_group)
-		bfqg->active_entities--;
-#endif
+
+	bfq_dec_active_entities(entity);
 }
 
 /**
···
 	if (entity->prio_changed) {
 		struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
 		unsigned int prev_weight, new_weight;
-		struct bfq_data *bfqd = NULL;
-		struct rb_root_cached *root;
-#ifdef CONFIG_BFQ_GROUP_IOSCHED
-		struct bfq_sched_data *sd;
-		struct bfq_group *bfqg;
-#endif
-
-		if (bfqq)
-			bfqd = bfqq->bfqd;
-#ifdef CONFIG_BFQ_GROUP_IOSCHED
-		else {
-			sd = entity->my_sched_data;
-			bfqg = container_of(sd, struct bfq_group, sched_data);
-			bfqd = (struct bfq_data *)bfqg->bfqd;
-		}
-#endif
 
 		/* Matches the smp_wmb() in bfq_group_set_weight. */
 		smp_rmb();
···
 	 * queue, remove the entity from its old weight counter (if
 	 * there is a counter associated with the entity).
 	 */
-	if (prev_weight != new_weight && bfqq) {
-		root = &bfqd->queue_weights_tree;
-		__bfq_weights_tree_remove(bfqd, bfqq, root);
-	}
+	if (prev_weight != new_weight && bfqq)
+		bfq_weights_tree_remove(bfqq);
 	entity->weight = new_weight;
 	/*
 	 * Add the entity, if it is not a weight-raised queue,
 	 * to the counter associated with its new weight.
 	 */
-	if (prev_weight != new_weight && bfqq && bfqq->wr_coeff == 1) {
-		/* If we get here, root has been initialized. */
-		bfq_weights_tree_add(bfqd, bfqq, root);
-	}
+	if (prev_weight != new_weight && bfqq && bfqq->wr_coeff == 1)
+		bfq_weights_tree_add(bfqq);
 
 	new_st->wsum += entity->weight;
 
···
 		entity->on_st_or_in_serv = true;
 	}
 
-#ifdef CONFIG_BFQ_GROUP_IOSCHED
-	if (!bfq_entity_to_bfqq(entity)) { /* bfq_group */
-		struct bfq_group *bfqg =
-			container_of(entity, struct bfq_group, entity);
-		struct bfq_data *bfqd = bfqg->bfqd;
-
-		if (!entity->in_groups_with_pending_reqs) {
-			entity->in_groups_with_pending_reqs = true;
-			bfqd->num_groups_with_pending_reqs++;
-		}
-	}
-#endif
-
 	bfq_update_fin_time_enqueue(entity, st, backshifted);
 }
···
 }
 
 static void __bfq_activate_requeue_entity(struct bfq_entity *entity,
-					  struct bfq_sched_data *sd,
 					  bool non_blocking_wait_rq)
 {
 	struct bfq_service_tree *st = bfq_entity_service_tree(entity);
 
-	if (sd->in_service_entity == entity || entity->tree == &st->active)
+	if (entity->sched_data->in_service_entity == entity ||
+	    entity->tree == &st->active)
 		/*
 		 * in service or already queued on the active tree,
 		 * requeue or reposition
···
 					bool non_blocking_wait_rq,
 					bool requeue, bool expiration)
 {
-	struct bfq_sched_data *sd;
-
 	for_each_entity(entity) {
-		sd = entity->sched_data;
-		__bfq_activate_requeue_entity(entity, sd, non_blocking_wait_rq);
-
-		if (!bfq_update_next_in_service(sd, entity, expiration) &&
-		    !requeue)
+		__bfq_activate_requeue_entity(entity, non_blocking_wait_rq);
+		if (!bfq_update_next_in_service(entity->sched_data, entity,
+						expiration) && !requeue)
 			break;
 	}
 }
···
 			     bfqq == bfqd->in_service_queue, expiration);
 }
 
+void bfq_add_bfqq_in_groups_with_pending_reqs(struct bfq_queue *bfqq)
+{
+	struct bfq_entity *entity = &bfqq->entity;
+
+	if (!entity->in_groups_with_pending_reqs) {
+		entity->in_groups_with_pending_reqs = true;
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+		if (!(bfqq_group(bfqq)->num_queues_with_pending_reqs++))
+			bfqq->bfqd->num_groups_with_pending_reqs++;
+#endif
+	}
+}
+
+void bfq_del_bfqq_in_groups_with_pending_reqs(struct bfq_queue *bfqq)
+{
+	struct bfq_entity *entity = &bfqq->entity;
+
+	if (entity->in_groups_with_pending_reqs) {
+		entity->in_groups_with_pending_reqs = false;
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+		if (!(--bfqq_group(bfqq)->num_queues_with_pending_reqs))
+			bfqq->bfqd->num_groups_with_pending_reqs--;
+#endif
+	}
+}
+
 /*
 * Called when the bfqq no longer has requests pending, remove it from
 * the service tree. As a special case, it can be invoked during an
···
 
 	bfq_deactivate_bfqq(bfqd, bfqq, true, expiration);
 
-	if (!bfqq->dispatched)
-		bfq_weights_tree_remove(bfqd, bfqq);
+	if (!bfqq->dispatched) {
+		bfq_del_bfqq_in_groups_with_pending_reqs(bfqq);
+		/*
+		 * Next function is invoked last, because it causes bfqq to be
+		 * freed. DO NOT use bfqq after the next function invocation.
+		 */
+		bfq_weights_tree_remove(bfqq);
+	}
 }
 
 /*
···
 	bfq_mark_bfqq_busy(bfqq);
 	bfqd->busy_queues[bfqq->ioprio_class - 1]++;
 
-	if (!bfqq->dispatched)
+	if (!bfqq->dispatched) {
+		bfq_add_bfqq_in_groups_with_pending_reqs(bfqq);
 		if (bfqq->wr_coeff == 1)
-			bfq_weights_tree_add(bfqd, bfqq,
-					     &bfqd->queue_weights_tree);
+			bfq_weights_tree_add(bfqq);
+	}
 
 	if (bfqq->wr_coeff > 1)
 		bfqd->wr_busy_queues++;
+99 -47
block/bio.c
···
 #include "blk-rq-qos.h"
 #include "blk-cgroup.h"

+#define ALLOC_CACHE_THRESHOLD	16
+#define ALLOC_CACHE_SLACK	64
+#define ALLOC_CACHE_MAX		256
+
 struct bio_alloc_cache {
	struct bio		*free_list;
+	struct bio		*free_list_irq;
	unsigned int		nr;
+	unsigned int		nr_irq;
 };

 static struct biovec_slab {
···
		queue_work(bs->rescue_workqueue, &bs->rescue_work);
 }

+static void bio_alloc_irq_cache_splice(struct bio_alloc_cache *cache)
+{
+	unsigned long flags;
+
+	/* cache->free_list must be empty */
+	if (WARN_ON_ONCE(cache->free_list))
+		return;
+
+	local_irq_save(flags);
+	cache->free_list = cache->free_list_irq;
+	cache->free_list_irq = NULL;
+	cache->nr += cache->nr_irq;
+	cache->nr_irq = 0;
+	local_irq_restore(flags);
+}
+
 static struct bio *bio_alloc_percpu_cache(struct block_device *bdev,
		unsigned short nr_vecs, blk_opf_t opf, gfp_t gfp,
		struct bio_set *bs)
···

	cache = per_cpu_ptr(bs->cache, get_cpu());
	if (!cache->free_list) {
-		put_cpu();
-		return NULL;
+		if (READ_ONCE(cache->nr_irq) >= ALLOC_CACHE_THRESHOLD)
+			bio_alloc_irq_cache_splice(cache);
+		if (!cache->free_list) {
+			put_cpu();
+			return NULL;
+		}
	}
	bio = cache->free_list;
	cache->free_list = bio->bi_next;
···
  * mempools. Doing multiple allocations from the same mempool under
  * submit_bio_noacct() should be avoided - instead, use bio_set's front_pad
  * for per bio allocations.
- *
- * If REQ_ALLOC_CACHE is set, the final put of the bio MUST be done from process
- * context, not hard/soft IRQ.
  *
  * Returns: Pointer to new bio on success, NULL on failure.
  */
···
	}
	if (unlikely(!p))
		return NULL;
+	if (!mempool_is_saturated(&bs->bio_pool))
+		opf &= ~REQ_ALLOC_CACHE;

	bio = p + bs->front_pad;
	if (nr_vecs > BIO_INLINE_VECS) {
···
		bio_truncate(bio, maxsector << 9);
 }

-#define ALLOC_CACHE_MAX		512
-#define ALLOC_CACHE_SLACK	64
-
-static void bio_alloc_cache_prune(struct bio_alloc_cache *cache,
-				  unsigned int nr)
+static int __bio_alloc_cache_prune(struct bio_alloc_cache *cache,
+				   unsigned int nr)
 {
	unsigned int i = 0;
	struct bio *bio;
···
		bio_free(bio);
		if (++i == nr)
			break;
+	}
+	return i;
+}
+
+static void bio_alloc_cache_prune(struct bio_alloc_cache *cache,
+				  unsigned int nr)
+{
+	nr -= __bio_alloc_cache_prune(cache, nr);
+	if (!READ_ONCE(cache->free_list)) {
+		bio_alloc_irq_cache_splice(cache);
+		__bio_alloc_cache_prune(cache, nr);
	}
 }
···
	bs->cache = NULL;
 }

+static inline void bio_put_percpu_cache(struct bio *bio)
+{
+	struct bio_alloc_cache *cache;
+
+	cache = per_cpu_ptr(bio->bi_pool->cache, get_cpu());
+	if (READ_ONCE(cache->nr_irq) + cache->nr > ALLOC_CACHE_MAX) {
+		put_cpu();
+		bio_free(bio);
+		return;
+	}
+
+	bio_uninit(bio);
+
+	if ((bio->bi_opf & REQ_POLLED) && !WARN_ON_ONCE(in_interrupt())) {
+		bio->bi_next = cache->free_list;
+		cache->free_list = bio;
+		cache->nr++;
+	} else {
+		unsigned long flags;
+
+		local_irq_save(flags);
+		bio->bi_next = cache->free_list_irq;
+		cache->free_list_irq = bio;
+		cache->nr_irq++;
+		local_irq_restore(flags);
+	}
+	put_cpu();
+}
+
 /**
  * bio_put - release a reference to a bio
  * @bio: bio to release reference to
···
		if (!atomic_dec_and_test(&bio->__bi_cnt))
			return;
	}
-
-	if ((bio->bi_opf & REQ_ALLOC_CACHE) && !WARN_ON_ONCE(in_interrupt())) {
-		struct bio_alloc_cache *cache;
-
-		bio_uninit(bio);
-		cache = per_cpu_ptr(bio->bi_pool->cache, get_cpu());
-		bio->bi_next = cache->free_list;
-		cache->free_list = bio;
-		if (++cache->nr > ALLOC_CACHE_MAX + ALLOC_CACHE_SLACK)
-			bio_alloc_cache_prune(cache, ALLOC_CACHE_SLACK);
-		put_cpu();
-	} else {
+	if (bio->bi_opf & REQ_ALLOC_CACHE)
+		bio_put_percpu_cache(bio);
+	else
		bio_free(bio);
-	}
 }
 EXPORT_SYMBOL(bio_put);
···
	if (vec_end_addr + 1 != page_addr + off)
		return false;
	if (xen_domain() && !xen_biovec_phys_mergeable(bv, page))
+		return false;
+	if (!zone_device_pages_have_same_pgmap(bv->bv_page, page))
		return false;

	*same_page = ((vec_end_addr & PAGE_MASK) == page_addr);
···
	unsigned short entries_left = bio->bi_max_vecs - bio->bi_vcnt;
	struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
	struct page **pages = (struct page **)bv;
+	unsigned int gup_flags = 0;
	ssize_t size, left;
	unsigned len, i = 0;
	size_t offset, trim;
···
	BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
	pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);

+	if (bio->bi_bdev && blk_queue_pci_p2pdma(bio->bi_bdev->bd_disk->queue))
+		gup_flags |= FOLL_PCI_P2PDMA;
+
	/*
	 * Each segment in the iov is required to be a block size multiple.
	 * However, we may not be able to get the entire segment if it spans
···
	 * result to ensure the bio's total size is correct. The remainder of
	 * the iov data will be picked up in the next bio iteration.
	 */
-	size = iov_iter_get_pages2(iter, pages, UINT_MAX - bio->bi_iter.bi_size,
-				   nr_pages, &offset);
+	size = iov_iter_get_pages(iter, pages,
+				  UINT_MAX - bio->bi_iter.bi_size,
+				  nr_pages, &offset, gup_flags);
	if (unlikely(size <= 0))
		return size ? size : -EFAULT;
···
 }
 EXPORT_SYMBOL(__bio_advance);

-void bio_copy_data_iter(struct bio *dst, struct bvec_iter *dst_iter,
-			struct bio *src, struct bvec_iter *src_iter)
-{
-	while (src_iter->bi_size && dst_iter->bi_size) {
-		struct bio_vec src_bv = bio_iter_iovec(src, *src_iter);
-		struct bio_vec dst_bv = bio_iter_iovec(dst, *dst_iter);
-		unsigned int bytes = min(src_bv.bv_len, dst_bv.bv_len);
-		void *src_buf = bvec_kmap_local(&src_bv);
-		void *dst_buf = bvec_kmap_local(&dst_bv);
-
-		memcpy(dst_buf, src_buf, bytes);
-
-		kunmap_local(dst_buf);
-		kunmap_local(src_buf);
-
-		bio_advance_iter_single(src, src_iter, bytes);
-		bio_advance_iter_single(dst, dst_iter, bytes);
-	}
-}
-EXPORT_SYMBOL(bio_copy_data_iter);
-
 /**
  * bio_copy_data - copy contents of data buffers from one bio to another
  * @src: source bio
···
	struct bvec_iter src_iter = src->bi_iter;
	struct bvec_iter dst_iter = dst->bi_iter;

-	bio_copy_data_iter(dst, &dst_iter, src, &src_iter);
+	while (src_iter.bi_size && dst_iter.bi_size) {
+		struct bio_vec src_bv = bio_iter_iovec(src, src_iter);
+		struct bio_vec dst_bv = bio_iter_iovec(dst, dst_iter);
+		unsigned int bytes = min(src_bv.bv_len, dst_bv.bv_len);
+		void *src_buf = bvec_kmap_local(&src_bv);
+		void *dst_buf = bvec_kmap_local(&dst_bv);
+
+		memcpy(dst_buf, src_buf, bytes);
+
+		kunmap_local(dst_buf);
+		kunmap_local(src_buf);
+
+		bio_advance_iter_single(src, &src_iter, bytes);
+		bio_advance_iter_single(dst, &dst_iter, bytes);
+	}
 }
 EXPORT_SYMBOL(bio_copy_data);
+77 -17
block/blk-cgroup.c
···

 #define BLKG_DESTROY_BATCH_SIZE	64

+/*
+ * Lockless lists for tracking IO stats update
+ *
+ * New IO stats are stored in the percpu iostat_cpu within blkcg_gq (blkg).
+ * There are multiple blkg's (one for each block device) attached to each
+ * blkcg. The rstat code keeps track of which cpu has IO stats updated,
+ * but it doesn't know which blkg has the updated stats. If there are many
+ * block devices in a system, the cost of iterating all the blkg's to flush
+ * out the IO stats can be high. To reduce such overhead, a set of percpu
+ * lockless lists (lhead) per blkcg are used to track the set of recently
+ * updated iostat_cpu's since the last flush. An iostat_cpu will be put
+ * onto the lockless list on the update side [blk_cgroup_bio_start()] if
+ * not there yet and then removed when being flushed [blkcg_rstat_flush()].
+ * References to blkg are gotten and then put back in the process to
+ * protect against blkg removal.
+ *
+ * Return: 0 if successful or -ENOMEM if allocation fails.
+ */
+static int init_blkcg_llists(struct blkcg *blkcg)
+{
+	int cpu;
+
+	blkcg->lhead = alloc_percpu_gfp(struct llist_head, GFP_KERNEL);
+	if (!blkcg->lhead)
+		return -ENOMEM;
+
+	for_each_possible_cpu(cpu)
+		init_llist_head(per_cpu_ptr(blkcg->lhead, cpu));
+	return 0;
+}
+
 /**
  * blkcg_css - find the current css
  *
···
	blkg->blkcg = blkcg;

	u64_stats_init(&blkg->iostat.sync);
-	for_each_possible_cpu(cpu)
+	for_each_possible_cpu(cpu) {
		u64_stats_init(&per_cpu_ptr(blkg->iostat_cpu, cpu)->sync);
+		per_cpu_ptr(blkg->iostat_cpu, cpu)->blkg = blkg;
+	}

	for (i = 0; i < BLKCG_MAX_POLS; i++) {
		struct blkcg_policy *pol = blkcg_policy[i];
···
  * @pd: policy private data of interest
  * @v: value to print
  *
- * Print @v to @sf for the device assocaited with @pd.
+ * Print @v to @sf for the device associated with @pd.
  */
 u64 __blkg_prfill_u64(struct seq_file *sf, struct blkg_policy_data *pd, u64 v)
 {
···

 /**
  * blkg_conf_finish - finish up per-blkg config update
- * @ctx: blkg_conf_ctx intiailized by blkg_conf_prep()
+ * @ctx: blkg_conf_ctx initialized by blkg_conf_prep()
  *
  * Finish up after per-blkg config update. This function must be paired
  * with blkg_conf_prep().
···
 static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 {
	struct blkcg *blkcg = css_to_blkcg(css);
-	struct blkcg_gq *blkg;
+	struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
+	struct llist_node *lnode;
+	struct blkg_iostat_set *bisc, *next_bisc;

	/* Root-level stats are sourced from system-wide IO stats */
	if (!cgroup_parent(css->cgroup))
···

	rcu_read_lock();

-	hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
+	lnode = llist_del_all(lhead);
+	if (!lnode)
+		goto out;
+
+	/*
+	 * Iterate only the iostat_cpu's queued in the lockless list.
+	 */
+	llist_for_each_entry_safe(bisc, next_bisc, lnode, lnode) {
+		struct blkcg_gq *blkg = bisc->blkg;
		struct blkcg_gq *parent = blkg->parent;
-		struct blkg_iostat_set *bisc = per_cpu_ptr(blkg->iostat_cpu, cpu);
		struct blkg_iostat cur;
		unsigned int seq;
+
+		WRITE_ONCE(bisc->lqueued, false);

		/* fetch the current per-cpu values */
		do {
···
		if (parent && parent->parent)
			blkcg_iostat_update(parent, &blkg->iostat.cur,
					    &blkg->iostat.last);
+		percpu_ref_put(&blkg->refcnt);
	}

+out:
	rcu_read_unlock();
 }
···

	mutex_unlock(&blkcg_pol_mutex);

+	free_percpu(blkcg->lhead);
	kfree(blkcg);
 }
···
 blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
 {
	struct blkcg *blkcg;
-	struct cgroup_subsys_state *ret;
	int i;

	mutex_lock(&blkcg_pol_mutex);
···
		blkcg = &blkcg_root;
	} else {
		blkcg = kzalloc(sizeof(*blkcg), GFP_KERNEL);
-		if (!blkcg) {
-			ret = ERR_PTR(-ENOMEM);
+		if (!blkcg)
			goto unlock;
-		}
	}
+
+	if (init_blkcg_llists(blkcg))
+		goto free_blkcg;

	for (i = 0; i < BLKCG_MAX_POLS ; i++) {
		struct blkcg_policy *pol = blkcg_policy[i];
···
			continue;

		cpd = pol->cpd_alloc_fn(GFP_KERNEL);
-		if (!cpd) {
-			ret = ERR_PTR(-ENOMEM);
+		if (!cpd)
			goto free_pd_blkcg;
-		}
+
		blkcg->cpd[i] = cpd;
		cpd->blkcg = blkcg;
		cpd->plid = i;
···
	for (i--; i >= 0; i--)
		if (blkcg->cpd[i])
			blkcg_policy[i]->cpd_free_fn(blkcg->cpd[i]);
-
+	free_percpu(blkcg->lhead);
+free_blkcg:
	if (blkcg != &blkcg_root)
		kfree(blkcg);
 unlock:
	mutex_unlock(&blkcg_pol_mutex);
-	return ret;
+	return ERR_PTR(-ENOMEM);
 }

 static int blkcg_css_online(struct cgroup_subsys_state *css)
···

 /**
  * blkcg_schedule_throttle - this task needs to check for throttling
- * @gendisk: disk to throttle
+ * @disk: disk to throttle
  * @use_memdelay: do we charge this to memory delay for PSI
  *
  * This is called by the IO controller when we know there's delay accumulated
···

 void blk_cgroup_bio_start(struct bio *bio)
 {
+	struct blkcg *blkcg = bio->bi_blkg->blkcg;
	int rwd = blk_cgroup_io_type(bio), cpu;
	struct blkg_iostat_set *bis;
	unsigned long flags;
···
	}
	bis->cur.ios[rwd]++;

+	/*
+	 * If the iostat_cpu isn't in a lockless list, put it into the
+	 * list to indicate that a stat update is pending.
+	 */
+	if (!READ_ONCE(bis->lqueued)) {
+		struct llist_head *lhead = this_cpu_ptr(blkcg->lhead);
+
+		llist_add(&bis->lnode, lhead);
+		WRITE_ONCE(bis->lqueued, true);
+		percpu_ref_get(&bis->blkg->refcnt);
+	}
+
	u64_stats_update_end_irqrestore(&bis->sync, flags);
	if (cgroup_subsys_on_dfl(io_cgrp_subsys))
-		cgroup_rstat_updated(bio->bi_blkg->blkcg->css.cgroup, cpu);
+		cgroup_rstat_updated(blkcg->css.cgroup, cpu);
	put_cpu();
 }
+10
block/blk-cgroup.h
···
 #include <linux/cgroup.h>
 #include <linux/kthread.h>
 #include <linux/blk-mq.h>
+#include <linux/llist.h>

 struct blkcg_gq;
 struct blkg_policy_data;
···

 struct blkg_iostat_set {
	struct u64_stats_sync		sync;
+	struct blkcg_gq		       *blkg;
+	struct llist_node		lnode;
+	int				lqueued;	/* queued in llist */
	struct blkg_iostat		cur;
	struct blkg_iostat		last;
 };
···
	struct blkcg_policy_data	*cpd[BLKCG_MAX_POLS];

	struct list_head		all_blkcgs_node;
+
+	/*
+	 * List of updated percpu blkg_iostat_set's since the last flush.
+	 */
+	struct llist_head __percpu	*lhead;
+
 #ifdef CONFIG_BLK_CGROUP_FC_APPID
	char				fc_app_id[FC_APPID_LEN];
 #endif
+39 -44
block/blk-core.c
···
 EXPORT_TRACEPOINT_SYMBOL_GPL(block_unplug);
 EXPORT_TRACEPOINT_SYMBOL_GPL(block_rq_insert);

-DEFINE_IDA(blk_queue_ida);
+static DEFINE_IDA(blk_queue_ida);

 /*
  * For queue allocation
  */
-struct kmem_cache *blk_requestq_cachep;
-struct kmem_cache *blk_requestq_srcu_cachep;
+static struct kmem_cache *blk_requestq_cachep;

 /*
  * Controlling structure to kblockd
···
 }
 EXPORT_SYMBOL_GPL(blk_clear_pm_only);

+static void blk_free_queue_rcu(struct rcu_head *rcu_head)
+{
+	kmem_cache_free(blk_requestq_cachep,
+			container_of(rcu_head, struct request_queue, rcu_head));
+}
+
+static void blk_free_queue(struct request_queue *q)
+{
+	percpu_ref_exit(&q->q_usage_counter);
+
+	if (q->poll_stat)
+		blk_stat_remove_callback(q, q->poll_cb);
+	blk_stat_free_callback(q->poll_cb);
+
+	blk_free_queue_stats(q->stats);
+	kfree(q->poll_stat);
+
+	if (queue_is_mq(q))
+		blk_mq_release(q);
+
+	ida_free(&blk_queue_ida, q->id);
+	call_rcu(&q->rcu_head, blk_free_queue_rcu);
+}
+
 /**
  * blk_put_queue - decrement the request_queue refcount
  * @q: the request_queue structure to decrement the refcount for
  *
- * Decrements the refcount of the request_queue kobject. When this reaches 0
- * we'll have blk_release_queue() called.
+ * Decrements the refcount of the request_queue and free it when the refcount
+ * reaches 0.
  *
- * Context: Any context, but the last reference must not be dropped from
- * atomic context.
+ * Context: Can sleep.
  */
 void blk_put_queue(struct request_queue *q)
 {
-	kobject_put(&q->kobj);
+	might_sleep();
+	if (refcount_dec_and_test(&q->refs))
+		blk_free_queue(q);
 }
 EXPORT_SYMBOL(blk_put_queue);
···
 {
 }

-struct request_queue *blk_alloc_queue(int node_id, bool alloc_srcu)
+struct request_queue *blk_alloc_queue(int node_id)
 {
	struct request_queue *q;

-	q = kmem_cache_alloc_node(blk_get_queue_kmem_cache(alloc_srcu),
-			GFP_KERNEL | __GFP_ZERO, node_id);
+	q = kmem_cache_alloc_node(blk_requestq_cachep, GFP_KERNEL | __GFP_ZERO,
+				  node_id);
	if (!q)
		return NULL;
-
-	if (alloc_srcu) {
-		blk_queue_flag_set(QUEUE_FLAG_HAS_SRCU, q);
-		if (init_srcu_struct(q->srcu) != 0)
-			goto fail_q;
-	}

	q->last_merge = NULL;

	q->id = ida_alloc(&blk_queue_ida, GFP_KERNEL);
	if (q->id < 0)
-		goto fail_srcu;
+		goto fail_q;

	q->stats = blk_alloc_queue_stats();
	if (!q->stats)
···
	INIT_WORK(&q->timeout_work, blk_timeout_work);
	INIT_LIST_HEAD(&q->icq_list);

-	kobject_init(&q->kobj, &blk_queue_ktype);
-
+	refcount_set(&q->refs, 1);
	mutex_init(&q->debugfs_mutex);
	mutex_init(&q->sysfs_lock);
	mutex_init(&q->sysfs_dir_lock);
···
	blk_free_queue_stats(q->stats);
 fail_id:
	ida_free(&blk_queue_ida, q->id);
-fail_srcu:
-	if (alloc_srcu)
-		cleanup_srcu_struct(q->srcu);
 fail_q:
-	kmem_cache_free(blk_get_queue_kmem_cache(alloc_srcu), q);
+	kmem_cache_free(blk_requestq_cachep, q);
	return NULL;
 }
···
 {
	if (unlikely(blk_queue_dying(q)))
		return false;
-	kobject_get(&q->kobj);
+	refcount_inc(&q->refs);
	return true;
 }
 EXPORT_SYMBOL(blk_get_queue);
···
 EXPORT_SYMBOL(bdev_start_io_acct);

 /**
- * bio_start_io_acct_time - start I/O accounting for bio based drivers
- * @bio: bio to start account for
- * @start_time: start time that should be passed back to bio_end_io_acct().
- */
-void bio_start_io_acct_time(struct bio *bio, unsigned long start_time)
-{
-	bdev_start_io_acct(bio->bi_bdev, bio_sectors(bio),
-			   bio_op(bio), start_time);
-}
-EXPORT_SYMBOL_GPL(bio_start_io_acct_time);
-
-/**
  * bio_start_io_acct - start I/O accounting for bio based drivers
  * @bio: bio to start account for
  *
···
		     sizeof_field(struct request, cmd_flags));
	BUILD_BUG_ON(REQ_OP_BITS + REQ_FLAG_BITS > 8 *
		     sizeof_field(struct bio, bi_opf));
-	BUILD_BUG_ON(ALIGN(offsetof(struct request_queue, srcu),
-			   __alignof__(struct request_queue)) !=
-		     sizeof(struct request_queue));

	/* used for unplugging and affects IO latency/throughput - HIGHPRI */
	kblockd_workqueue = alloc_workqueue("kblockd",
···

	blk_requestq_cachep = kmem_cache_create("request_queue",
			sizeof(struct request_queue), 0, SLAB_PANIC, NULL);
-
-	blk_requestq_srcu_cachep = kmem_cache_create("request_queue_srcu",
-			sizeof(struct request_queue) +
-			sizeof(struct srcu_struct), 0, SLAB_PANIC, NULL);

	blk_debugfs_root = debugfs_create_dir("block", NULL);
+18 -4
block/blk-crypto-internal.h
···

 #ifdef CONFIG_BLK_INLINE_ENCRYPTION

-int blk_crypto_sysfs_register(struct request_queue *q);
+int blk_crypto_sysfs_register(struct gendisk *disk);

-void blk_crypto_sysfs_unregister(struct request_queue *q);
+void blk_crypto_sysfs_unregister(struct gendisk *disk);

 void bio_crypt_dun_increment(u64 dun[BLK_CRYPTO_DUN_ARRAY_SIZE],
			     unsigned int inc);
···
	return rq->crypt_ctx;
 }

+blk_status_t blk_crypto_get_keyslot(struct blk_crypto_profile *profile,
+				    const struct blk_crypto_key *key,
+				    struct blk_crypto_keyslot **slot_ptr);
+
+void blk_crypto_put_keyslot(struct blk_crypto_keyslot *slot);
+
+int __blk_crypto_evict_key(struct blk_crypto_profile *profile,
+			   const struct blk_crypto_key *key);
+
+bool __blk_crypto_cfg_supported(struct blk_crypto_profile *profile,
+				const struct blk_crypto_config *cfg);
+
 #else /* CONFIG_BLK_INLINE_ENCRYPTION */

-static inline int blk_crypto_sysfs_register(struct request_queue *q)
+static inline int blk_crypto_sysfs_register(struct gendisk *disk)
 {
	return 0;
 }

-static inline void blk_crypto_sysfs_unregister(struct request_queue *q) { }
+static inline void blk_crypto_sysfs_unregister(struct gendisk *disk)
+{
+}

 static inline bool bio_crypt_rq_ctx_compatible(struct request *rq,
					       struct bio *bio)
+1
block/blk-crypto-profile.c
···
 #include <linux/wait.h>
 #include <linux/blkdev.h>
 #include <linux/blk-integrity.h>
+#include "blk-crypto-internal.h"

 struct blk_crypto_keyslot {
	atomic_t slot_refs;
+6 -5
block/blk-crypto-sysfs.c
···
  * If the request_queue has a blk_crypto_profile, create the "crypto"
  * subdirectory in sysfs (/sys/block/$disk/queue/crypto/).
  */
-int blk_crypto_sysfs_register(struct request_queue *q)
+int blk_crypto_sysfs_register(struct gendisk *disk)
 {
+	struct request_queue *q = disk->queue;
	struct blk_crypto_kobj *obj;
	int err;
···
		return -ENOMEM;
	obj->profile = q->crypto_profile;

-	err = kobject_init_and_add(&obj->kobj, &blk_crypto_ktype, &q->kobj,
-				   "crypto");
+	err = kobject_init_and_add(&obj->kobj, &blk_crypto_ktype,
+				   &disk->queue_kobj, "crypto");
	if (err) {
		kobject_put(&obj->kobj);
		return err;
···
	return 0;
 }

-void blk_crypto_sysfs_unregister(struct request_queue *q)
+void blk_crypto_sysfs_unregister(struct gendisk *disk)
 {
-	kobject_put(q->crypto_kobject);
+	kobject_put(disk->queue->crypto_kobject);
 }

 static int __init blk_crypto_sysfs_init(void)
+22 -15
block/blk-crypto.c
···
 {
	struct bio *bio = *bio_ptr;
	const struct blk_crypto_key *bc_key = bio->bi_crypt_context->bc_key;
-	struct blk_crypto_profile *profile;

	/* Error if bio has no data. */
	if (WARN_ON_ONCE(!bio_has_data(bio))) {
···
	 * Success if device supports the encryption context, or if we succeeded
	 * in falling back to the crypto API.
	 */
-	profile = bdev_get_queue(bio->bi_bdev)->crypto_profile;
-	if (__blk_crypto_cfg_supported(profile, &bc_key->crypto_cfg))
+	if (blk_crypto_config_supported_natively(bio->bi_bdev,
+						 &bc_key->crypto_cfg))
		return true;
-
	if (blk_crypto_fallback_bio_prep(bio_ptr))
		return true;
 fail:
···
	return 0;
 }

+bool blk_crypto_config_supported_natively(struct block_device *bdev,
+					  const struct blk_crypto_config *cfg)
+{
+	return __blk_crypto_cfg_supported(bdev_get_queue(bdev)->crypto_profile,
+					  cfg);
+}
+
 /*
  * Check if bios with @cfg can be en/decrypted by blk-crypto (i.e. either the
- * request queue it's submitted to supports inline crypto, or the
+ * block_device it's submitted to supports inline crypto, or the
  * blk-crypto-fallback is enabled and supports the cfg).
  */
-bool blk_crypto_config_supported(struct request_queue *q,
+bool blk_crypto_config_supported(struct block_device *bdev,
				 const struct blk_crypto_config *cfg)
 {
	return IS_ENABLED(CONFIG_BLK_INLINE_ENCRYPTION_FALLBACK) ||
-	       __blk_crypto_cfg_supported(q->crypto_profile, cfg);
+	       blk_crypto_config_supported_natively(bdev, cfg);
 }

 /**
  * blk_crypto_start_using_key() - Start using a blk_crypto_key on a device
+ * @bdev: block device to operate on
  * @key: A key to use on the device
- * @q: the request queue for the device
  *
  * Upper layers must call this function to ensure that either the hardware
  * supports the key's crypto settings, or the crypto API fallback has transforms
···
  *	   blk-crypto-fallback is either disabled or the needed algorithm
  *	   is disabled in the crypto API; or another -errno code.
  */
-int blk_crypto_start_using_key(const struct blk_crypto_key *key,
-			       struct request_queue *q)
+int blk_crypto_start_using_key(struct block_device *bdev,
+			       const struct blk_crypto_key *key)
 {
-	if (__blk_crypto_cfg_supported(q->crypto_profile, &key->crypto_cfg))
+	if (blk_crypto_config_supported_natively(bdev, &key->crypto_cfg))
		return 0;
	return blk_crypto_fallback_start_using_mode(key->crypto_cfg.crypto_mode);
 }
···
 /**
  * blk_crypto_evict_key() - Evict a key from any inline encryption hardware
  *			    it may have been programmed into
- * @q: The request queue who's associated inline encryption hardware this key
+ * @bdev: The block_device who's associated inline encryption hardware this key
  *     might have been programmed into
  * @key: The key to evict
  *
···
  *
  * Return: 0 on success or if the key wasn't in any keyslot; -errno on error.
  */
-int blk_crypto_evict_key(struct request_queue *q,
+int blk_crypto_evict_key(struct block_device *bdev,
			 const struct blk_crypto_key *key)
 {
-	if (__blk_crypto_cfg_supported(q->crypto_profile, &key->crypto_cfg))
+	struct request_queue *q = bdev_get_queue(bdev);
+
+	if (blk_crypto_config_supported_natively(bdev, &key->crypto_cfg))
		return __blk_crypto_evict_key(q->crypto_profile, key);

	/*
-	 * If the request_queue didn't support the key, then blk-crypto-fallback
+	 * If the block_device didn't support the key, then blk-crypto-fallback
	 * may have been used, so try to evict the key from blk-crypto-fallback.
	 */
	return blk_crypto_fallback_evict_key(key);
+2 -1
block/blk-ia-ranges.c
···
	 */
	WARN_ON(iars->sysfs_registered);
	ret = kobject_init_and_add(&iars->kobj, &blk_ia_ranges_ktype,
-				   &q->kobj, "%s", "independent_access_ranges");
+				   &disk->queue_kobj, "%s",
+				   "independent_access_ranges");
	if (ret) {
		disk->ia_ranges = NULL;
		kobject_put(&iars->kobj);
+41 -16
block/blk-iocost.c
···
  * busy signal.
  *
  * As devices can have deep queues and be unfair in how the queued commands
- * are executed, soley depending on rq wait may not result in satisfactory
+ * are executed, solely depending on rq wait may not result in satisfactory
  * control quality. For a better control quality, completion latency QoS
  * parameters can be configured so that the device is considered saturated
  * if N'th percentile completion latency rises above the set point.
···
	u64 now_ns;
	u64 now;
	u64 vnow;
-	u64 vrate;
 };

 struct iocg_wait {
···
	if (idx == ioc->autop_idx && !force)
		return false;

-	if (idx != ioc->autop_idx)
+	if (idx != ioc->autop_idx) {
		atomic64_set(&ioc->vtime_rate, VTIME_PER_USEC);
+		ioc->vtime_base_rate = VTIME_PER_USEC;
+	}

	ioc->autop_idx = idx;
	ioc->autop_too_fast_at = 0;
···

	if (!ioc->busy_level || (ioc->busy_level < 0 && nr_lagging)) {
		if (ioc->busy_level != prev_busy_level || nr_lagging)
-			trace_iocost_ioc_vrate_adj(ioc, atomic64_read(&ioc->vtime_rate),
+			trace_iocost_ioc_vrate_adj(ioc, vrate,
						   missed_ppm, rq_wait_pct,
						   nr_lagging, nr_shortages);

···
 static void ioc_now(struct ioc *ioc, struct ioc_now *now)
 {
	unsigned seq;
+	u64 vrate;

	now->now_ns = ktime_get();
	now->now = ktime_to_us(now->now_ns);
-	now->vrate = atomic64_read(&ioc->vtime_rate);
+	vrate = atomic64_read(&ioc->vtime_rate);

	/*
	 * The current vtime is
···
	do {
		seq = read_seqcount_begin(&ioc->period_seqcount);
		now->vnow = ioc->period_at_vtime +
-			(now->now - ioc->period_at) * now->vrate;
+			(now->now - ioc->period_at) * vrate;
	} while (read_seqcount_retry(&ioc->period_seqcount, seq));
 }
···
	LIST_HEAD(surpluses);
	int nr_debtors, nr_shortages = 0, nr_lagging = 0;
	u64 usage_us_sum = 0;
-	u32 ppm_rthr = MILLION - ioc->params.qos[QOS_RPPM];
-	u32 ppm_wthr = MILLION - ioc->params.qos[QOS_WPPM];
+	u32 ppm_rthr;
+	u32 ppm_wthr;
	u32 missed_ppm[2], rq_wait_pct;
	u64 period_vtime;
	int prev_busy_level;
···
	/* take care of active iocgs */
	spin_lock_irq(&ioc->lock);

+	ppm_rthr = MILLION - ioc->params.qos[QOS_RPPM];
+	ppm_wthr = MILLION - ioc->params.qos[QOS_WPPM];
	ioc_now(ioc, &now);

	period_vtime = now.vnow - ioc->period_at_vtime;
···
	spin_unlock_irq(&ioc->lock);

	/*
-	 * rqos must be added before activation to allow iocg_pd_init() to
+	 * rqos must be added before activation to allow ioc_pd_init() to
	 * lookup the ioc from q. This means that the rqos methods may get
	 * called before policy activation completion, can't assume that the
	 * target bio has an iocg associated and need to test for NULL iocg.
···
		ioc = q_to_ioc(disk->queue);
	}

+	blk_mq_freeze_queue(disk->queue);
+	blk_mq_quiesce_queue(disk->queue);
+
	spin_lock_irq(&ioc->lock);
	memcpy(qos, ioc->params.qos, sizeof(qos));
	enable = ioc->enabled;
	user = ioc->user_qos_params;
-	spin_unlock_irq(&ioc->lock);

	while ((p = strsep(&input, " \t\n"))) {
		substring_t args[MAX_OPT_ARGS];
···
	if (qos[QOS_MIN] > qos[QOS_MAX])
		goto einval;

-	spin_lock_irq(&ioc->lock);
-
	if (enable) {
		blk_stat_enable_accounting(disk->queue);
		blk_queue_flag_set(QUEUE_FLAG_RQ_ALLOC_TIME, disk->queue);
		ioc->enabled = true;
+		wbt_disable_default(disk->queue);
	} else {
		blk_queue_flag_clear(QUEUE_FLAG_RQ_ALLOC_TIME, disk->queue);
		ioc->enabled = false;
+		wbt_enable_default(disk->queue);
	}

	if (user) {
···
	ioc_refresh_params(ioc, true);
	spin_unlock_irq(&ioc->lock);

+	blk_mq_unquiesce_queue(disk->queue);
+	blk_mq_unfreeze_queue(disk->queue);
+
	blkdev_put_no_open(bdev);
	return nbytes;
 einval:
+	spin_unlock_irq(&ioc->lock);
+
+	blk_mq_unquiesce_queue(disk->queue);
+	blk_mq_unfreeze_queue(disk->queue);
+
	ret = -EINVAL;
 err:
	blkdev_put_no_open(bdev);
···
				    size_t nbytes, loff_t off)
 {
	struct block_device *bdev;
+	struct request_queue *q;
	struct ioc *ioc;
	u64 u[NR_I_LCOEFS];
	bool user;
···
	if (IS_ERR(bdev))
		return PTR_ERR(bdev);

-	ioc = q_to_ioc(bdev_get_queue(bdev));
+	q = bdev_get_queue(bdev);
+	ioc = q_to_ioc(q);
	if (!ioc) {
		ret = blk_iocost_init(bdev->bd_disk);
		if (ret)
			goto err;
-		ioc = q_to_ioc(bdev_get_queue(bdev));
+		ioc = q_to_ioc(q);
	}
+
+	blk_mq_freeze_queue(q);
+	blk_mq_quiesce_queue(q);

	spin_lock_irq(&ioc->lock);
	memcpy(u, ioc->params.i_lcoefs, sizeof(u));
	user = ioc->user_cost_model;
-	spin_unlock_irq(&ioc->lock);

	while ((p = strsep(&input, " \t\n"))) {
		substring_t args[MAX_OPT_ARGS];
···
		user = true;
	}

-	spin_lock_irq(&ioc->lock);
	if (user) {
		memcpy(ioc->params.i_lcoefs, u, sizeof(u));
		ioc->user_cost_model = true;
···
	ioc_refresh_params(ioc, true);
	spin_unlock_irq(&ioc->lock);

+	blk_mq_unquiesce_queue(q);
+	blk_mq_unfreeze_queue(q);
+
	blkdev_put_no_open(bdev);
	return nbytes;

 einval:
+	spin_unlock_irq(&ioc->lock);
+
+	blk_mq_unquiesce_queue(q);
+	blk_mq_unfreeze_queue(q);
+
	ret = -EINVAL;
 err:
	blkdev_put_no_open(bdev);
+17 -20
block/blk-iolatency.c
··· 141 141 struct latency_stat __percpu *stats; 142 142 struct latency_stat cur_stat; 143 143 struct blk_iolatency *blkiolat; 144 - struct rq_depth rq_depth; 144 + unsigned int max_depth; 145 145 struct rq_wait rq_wait; 146 146 atomic64_t window_start; 147 147 atomic_t scale_cookie; ··· 280 280 static bool iolat_acquire_inflight(struct rq_wait *rqw, void *private_data) 281 281 { 282 282 struct iolatency_grp *iolat = private_data; 283 - return rq_wait_inc_below(rqw, iolat->rq_depth.max_depth); 283 + return rq_wait_inc_below(rqw, iolat->max_depth); 284 284 } 285 285 286 286 static void __blkcg_iolatency_throttle(struct rq_qos *rqos, ··· 364 364 } 365 365 366 366 /* 367 - * Change the queue depth of the iolatency_grp. We add/subtract 1/16th of the 367 + * Change the queue depth of the iolatency_grp. We add 1/16th of the 368 368 * queue depth at a time so we don't get wild swings and hopefully dial in to 369 - * fairer distribution of the overall queue depth. 369 + * fairer distribution of the overall queue depth. We halve the queue depth 370 + * at a time so we can scale down queue depth quickly from default unlimited 371 + * to target. 
370 372 */ 371 373 static void scale_change(struct iolatency_grp *iolat, bool up) 372 374 { 373 375 unsigned long qd = iolat->blkiolat->rqos.q->nr_requests; 374 376 unsigned long scale = scale_amount(qd, up); 375 - unsigned long old = iolat->rq_depth.max_depth; 377 + unsigned long old = iolat->max_depth; 376 378 377 379 if (old > qd) 378 380 old = qd; ··· 386 384 if (old < qd) { 387 385 old += scale; 388 386 old = min(old, qd); 389 - iolat->rq_depth.max_depth = old; 387 + iolat->max_depth = old; 390 388 wake_up_all(&iolat->rq_wait.wait); 391 389 } 392 390 } else { 393 391 old >>= 1; 394 - iolat->rq_depth.max_depth = max(old, 1UL); 392 + iolat->max_depth = max(old, 1UL); 395 393 } 396 394 } 397 395 ··· 404 402 unsigned int our_cookie = atomic_read(&iolat->scale_cookie); 405 403 u64 scale_lat; 406 404 int direction = 0; 407 - 408 - if (lat_to_blkg(iolat)->parent == NULL) 409 - return; 410 405 411 406 parent = blkg_to_lat(lat_to_blkg(iolat)->parent); 412 407 if (!parent) ··· 444 445 } 445 446 446 447 /* We're as low as we can go. */ 447 - if (iolat->rq_depth.max_depth == 1 && direction < 0) { 448 + if (iolat->max_depth == 1 && direction < 0) { 448 449 blkcg_use_delay(lat_to_blkg(iolat)); 449 450 return; 450 451 } ··· 452 453 /* We're back to the default cookie, unthrottle all the things. */ 453 454 if (cur_cookie == DEFAULT_SCALE_COOKIE) { 454 455 blkcg_clear_delay(lat_to_blkg(iolat)); 455 - iolat->rq_depth.max_depth = UINT_MAX; 456 + iolat->max_depth = UINT_MAX; 456 457 wake_up_all(&iolat->rq_wait.wait); 457 458 return; 458 459 } ··· 507 508 * We don't want to count issue_as_root bio's in the cgroups latency 508 509 * statistics as it could skew the numbers downwards. 
509 510 */ 510 - if (unlikely(issue_as_root && iolat->rq_depth.max_depth != UINT_MAX)) { 511 + if (unlikely(issue_as_root && iolat->max_depth != UINT_MAX)) { 511 512 u64 sub = iolat->min_lat_nsec; 512 513 if (req_time < sub) 513 514 blkcg_add_delay(lat_to_blkg(iolat), now, sub - req_time); ··· 919 920 } 920 921 preempt_enable(); 921 922 922 - if (iolat->rq_depth.max_depth == UINT_MAX) 923 + if (iolat->max_depth == UINT_MAX) 923 924 seq_printf(s, " missed=%llu total=%llu depth=max", 924 925 (unsigned long long)stat.ps.missed, 925 926 (unsigned long long)stat.ps.total); ··· 927 928 seq_printf(s, " missed=%llu total=%llu depth=%u", 928 929 (unsigned long long)stat.ps.missed, 929 930 (unsigned long long)stat.ps.total, 930 - iolat->rq_depth.max_depth); 931 + iolat->max_depth); 931 932 } 932 933 933 934 static void iolatency_pd_stat(struct blkg_policy_data *pd, struct seq_file *s) ··· 944 945 945 946 avg_lat = div64_u64(iolat->lat_avg, NSEC_PER_USEC); 946 947 cur_win = div64_u64(iolat->cur_win_nsec, NSEC_PER_MSEC); 947 - if (iolat->rq_depth.max_depth == UINT_MAX) 948 + if (iolat->max_depth == UINT_MAX) 948 949 seq_printf(s, " depth=max avg_lat=%llu win=%llu", 949 950 avg_lat, cur_win); 950 951 else 951 952 seq_printf(s, " depth=%u avg_lat=%llu win=%llu", 952 - iolat->rq_depth.max_depth, avg_lat, cur_win); 953 + iolat->max_depth, avg_lat, cur_win); 953 954 } 954 955 955 956 static struct blkg_policy_data *iolatency_pd_alloc(gfp_t gfp, ··· 993 994 latency_stat_init(iolat, &iolat->cur_stat); 994 995 rq_wait_init(&iolat->rq_wait); 995 996 spin_lock_init(&iolat->child_lat.lock); 996 - iolat->rq_depth.queue_depth = blkg->q->nr_requests; 997 - iolat->rq_depth.max_depth = UINT_MAX; 998 - iolat->rq_depth.default_depth = iolat->rq_depth.queue_depth; 997 + iolat->max_depth = UINT_MAX; 999 998 iolat->blkiolat = blkiolat; 1000 999 iolat->cur_win_nsec = 100 * NSEC_PER_MSEC; 1001 1000 atomic64_set(&iolat->window_start, now);
+9 -5
block/blk-map.c
··· 267 267 { 268 268 unsigned int max_sectors = queue_max_hw_sectors(rq->q); 269 269 unsigned int nr_vecs = iov_iter_npages(iter, BIO_MAX_VECS); 270 + unsigned int gup_flags = 0; 270 271 struct bio *bio; 271 272 int ret; 272 273 int j; ··· 279 278 if (bio == NULL) 280 279 return -ENOMEM; 281 280 281 + if (blk_queue_pci_p2pdma(rq->q)) 282 + gup_flags |= FOLL_PCI_P2PDMA; 283 + 282 284 while (iov_iter_count(iter)) { 283 285 struct page **pages, *stack_pages[UIO_FASTIOV]; 284 286 ssize_t bytes; ··· 290 286 291 287 if (nr_vecs <= ARRAY_SIZE(stack_pages)) { 292 288 pages = stack_pages; 293 - bytes = iov_iter_get_pages2(iter, pages, LONG_MAX, 294 - nr_vecs, &offs); 289 + bytes = iov_iter_get_pages(iter, pages, LONG_MAX, 290 + nr_vecs, &offs, gup_flags); 295 291 } else { 296 - bytes = iov_iter_get_pages_alloc2(iter, &pages, 297 - LONG_MAX, &offs); 292 + bytes = iov_iter_get_pages_alloc(iter, &pages, 293 + LONG_MAX, &offs, gup_flags); 298 294 } 299 295 if (unlikely(bytes <= 0)) { 300 296 ret = bytes ? bytes : -EFAULT; ··· 559 555 size_t nr_iter = iov_iter_count(iter); 560 556 size_t nr_segs = iter->nr_segs; 561 557 struct bio_vec *bvecs, *bvprvp = NULL; 562 - struct queue_limits *lim = &q->limits; 558 + const struct queue_limits *lim = &q->limits; 563 559 unsigned int nsegs = 0, bytes = 0; 564 560 struct bio *bio; 565 561 size_t i;
+27 -17
block/blk-merge.c
··· 100 100 * is defined as 'unsigned int', meantime it has to be aligned to with the 101 101 * logical block size, which is the minimum accepted unit by hardware. 102 102 */ 103 - static unsigned int bio_allowed_max_sectors(struct queue_limits *lim) 103 + static unsigned int bio_allowed_max_sectors(const struct queue_limits *lim) 104 104 { 105 105 return round_down(UINT_MAX, lim->logical_block_size) >> SECTOR_SHIFT; 106 106 } 107 107 108 - static struct bio *bio_split_discard(struct bio *bio, struct queue_limits *lim, 109 - unsigned *nsegs, struct bio_set *bs) 108 + static struct bio *bio_split_discard(struct bio *bio, 109 + const struct queue_limits *lim, 110 + unsigned *nsegs, struct bio_set *bs) 110 111 { 111 112 unsigned int max_discard_sectors, granularity; 112 113 sector_t tmp; ··· 147 146 } 148 147 149 148 static struct bio *bio_split_write_zeroes(struct bio *bio, 150 - struct queue_limits *lim, unsigned *nsegs, struct bio_set *bs) 149 + const struct queue_limits *lim, 150 + unsigned *nsegs, struct bio_set *bs) 151 151 { 152 152 *nsegs = 0; 153 153 if (!lim->max_write_zeroes_sectors) ··· 167 165 * aligned to a physical block boundary. 168 166 */ 169 167 static inline unsigned get_max_io_size(struct bio *bio, 170 - struct queue_limits *lim) 168 + const struct queue_limits *lim) 171 169 { 172 170 unsigned pbs = lim->physical_block_size >> SECTOR_SHIFT; 173 171 unsigned lbs = lim->logical_block_size >> SECTOR_SHIFT; ··· 186 184 return max_sectors & ~(lbs - 1); 187 185 } 188 186 189 - static inline unsigned get_max_segment_size(struct queue_limits *lim, 187 + /** 188 + * get_max_segment_size() - maximum number of bytes to add as a single segment 189 + * @lim: Request queue limits. 190 + * @start_page: See below. 191 + * @offset: Offset from @start_page where to add a segment. 192 + * 193 + * Returns the maximum number of bytes that can be added as a single segment. 
194 + */ 195 + static inline unsigned get_max_segment_size(const struct queue_limits *lim, 190 196 struct page *start_page, unsigned long offset) 191 197 { 192 198 unsigned long mask = lim->seg_boundary_mask; ··· 202 192 offset = mask & (page_to_phys(start_page) + offset); 203 193 204 194 /* 205 - * overflow may be triggered in case of zero page physical address 206 - * on 32bit arch, use queue's max segment size when that happens. 195 + * Prevent an overflow if mask = ULONG_MAX and offset = 0 by adding 1 196 + * after having calculated the minimum. 207 197 */ 208 - return min_not_zero(mask - offset + 1, 209 - (unsigned long)lim->max_segment_size); 198 + return min(mask - offset, (unsigned long)lim->max_segment_size - 1) + 1; 210 199 } 211 200 212 201 /** ··· 228 219 * *@nsegs segments and *@sectors sectors would make that bio unacceptable for 229 220 * the block driver. 230 221 */ 231 - static bool bvec_split_segs(struct queue_limits *lim, const struct bio_vec *bv, 232 - unsigned *nsegs, unsigned *bytes, unsigned max_segs, 233 - unsigned max_bytes) 222 + static bool bvec_split_segs(const struct queue_limits *lim, 223 + const struct bio_vec *bv, unsigned *nsegs, unsigned *bytes, 224 + unsigned max_segs, unsigned max_bytes) 234 225 { 235 226 unsigned max_len = min(max_bytes, UINT_MAX) - *bytes; 236 227 unsigned len = min(bv->bv_len, max_len); ··· 276 267 * responsible for ensuring that @bs is only destroyed after processing of the 277 268 * split bio has finished. 278 269 */ 279 - static struct bio *bio_split_rw(struct bio *bio, struct queue_limits *lim, 270 + static struct bio *bio_split_rw(struct bio *bio, const struct queue_limits *lim, 280 271 unsigned *segs, struct bio_set *bs, unsigned max_bytes) 281 272 { 282 273 struct bio_vec bv, bvprv, *bvprvp = NULL; ··· 340 331 * The split bio is allocated from @q->bio_split, which is provided by the 341 332 * block layer. 
342 333 */ 343 - struct bio *__bio_split_to_limits(struct bio *bio, struct queue_limits *lim, 344 - unsigned int *nr_segs) 334 + struct bio *__bio_split_to_limits(struct bio *bio, 335 + const struct queue_limits *lim, 336 + unsigned int *nr_segs) 345 337 { 346 338 struct bio_set *bs = &bio->bi_bdev->bd_disk->bio_split; 347 339 struct bio *split; ··· 387 377 */ 388 378 struct bio *bio_split_to_limits(struct bio *bio) 389 379 { 390 - struct queue_limits *lim = &bdev_get_queue(bio->bi_bdev)->limits; 380 + const struct queue_limits *lim = &bdev_get_queue(bio->bi_bdev)->limits; 391 381 unsigned int nr_segs; 392 382 393 383 if (bio_may_exceed_limits(bio, lim))
+1 -7
block/blk-mq-sched.c
··· 555 555 return 0; 556 556 } 557 557 558 + /* caller must have a reference to @e, will grab another one if successful */ 558 559 int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e) 559 560 { 560 561 unsigned int flags = q->tag_set->flags; ··· 563 562 struct elevator_queue *eq; 564 563 unsigned long i; 565 564 int ret; 566 - 567 - if (!e) { 568 - blk_queue_flag_clear(QUEUE_FLAG_SQ_SCHED, q); 569 - q->elevator = NULL; 570 - q->nr_requests = q->tag_set->queue_depth; 571 - return 0; 572 - } 573 565 574 566 /* 575 567 * Default to double of smaller one between hw queue_depth and 128,
+9 -2
block/blk-mq-sysfs.c
··· 185 185 { 186 186 struct request_queue *q = hctx->queue; 187 187 struct blk_mq_ctx *ctx; 188 - int i, ret; 188 + int i, j, ret; 189 189 190 190 if (!hctx->nr_ctx) 191 191 return 0; ··· 197 197 hctx_for_each_ctx(hctx, ctx, i) { 198 198 ret = kobject_add(&ctx->kobj, &hctx->kobj, "cpu%u", ctx->cpu); 199 199 if (ret) 200 - break; 200 + goto out; 201 201 } 202 202 203 + return 0; 204 + out: 205 + hctx_for_each_ctx(hctx, ctx, j) { 206 + if (j < i) 207 + kobject_del(&ctx->kobj); 208 + } 209 + kobject_del(&hctx->kobj); 203 210 return ret; 204 211 } 205 212
+148 -81
block/blk-mq.c
··· 254 254 255 255 /** 256 256 * blk_mq_wait_quiesce_done() - wait until in-progress quiesce is done 257 - * @q: request queue. 257 + * @set: tag_set to wait on 258 258 * 259 259 * Note: it is driver's responsibility for making sure that quiesce has 260 - * been started. 260 + * been started on or more of the request_queues of the tag_set. This 261 + * function only waits for the quiesce on those request_queues that had 262 + * the quiesce flag set using blk_mq_quiesce_queue_nowait. 261 263 */ 262 - void blk_mq_wait_quiesce_done(struct request_queue *q) 264 + void blk_mq_wait_quiesce_done(struct blk_mq_tag_set *set) 263 265 { 264 - if (blk_queue_has_srcu(q)) 265 - synchronize_srcu(q->srcu); 266 + if (set->flags & BLK_MQ_F_BLOCKING) 267 + synchronize_srcu(set->srcu); 266 268 else 267 269 synchronize_rcu(); 268 270 } ··· 282 280 void blk_mq_quiesce_queue(struct request_queue *q) 283 281 { 284 282 blk_mq_quiesce_queue_nowait(q); 285 - blk_mq_wait_quiesce_done(q); 283 + /* nothing to wait for non-mq queues */ 284 + if (queue_is_mq(q)) 285 + blk_mq_wait_quiesce_done(q->tag_set); 286 286 } 287 287 EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue); 288 288 ··· 314 310 blk_mq_run_hw_queues(q, true); 315 311 } 316 312 EXPORT_SYMBOL_GPL(blk_mq_unquiesce_queue); 313 + 314 + void blk_mq_quiesce_tagset(struct blk_mq_tag_set *set) 315 + { 316 + struct request_queue *q; 317 + 318 + mutex_lock(&set->tag_list_lock); 319 + list_for_each_entry(q, &set->tag_list, tag_set_list) { 320 + if (!blk_queue_skip_tagset_quiesce(q)) 321 + blk_mq_quiesce_queue_nowait(q); 322 + } 323 + blk_mq_wait_quiesce_done(set); 324 + mutex_unlock(&set->tag_list_lock); 325 + } 326 + EXPORT_SYMBOL_GPL(blk_mq_quiesce_tagset); 327 + 328 + void blk_mq_unquiesce_tagset(struct blk_mq_tag_set *set) 329 + { 330 + struct request_queue *q; 331 + 332 + mutex_lock(&set->tag_list_lock); 333 + list_for_each_entry(q, &set->tag_list, tag_set_list) { 334 + if (!blk_queue_skip_tagset_quiesce(q)) 335 + blk_mq_unquiesce_queue(q); 336 + } 
337 + mutex_unlock(&set->tag_list_lock); 338 + } 339 + EXPORT_SYMBOL_GPL(blk_mq_unquiesce_tagset); 317 340 318 341 void blk_mq_wake_waiters(struct request_queue *q) 319 342 { ··· 575 544 576 545 if (!plug) 577 546 return NULL; 547 + 578 548 if (rq_list_empty(plug->cached_rq)) { 579 549 if (plug->nr_ios == 1) 580 550 return NULL; 581 551 rq = blk_mq_rq_cache_fill(q, plug, opf, flags); 582 - if (rq) 583 - goto got_it; 584 - return NULL; 552 + if (!rq) 553 + return NULL; 554 + } else { 555 + rq = rq_list_peek(&plug->cached_rq); 556 + if (!rq || rq->q != q) 557 + return NULL; 558 + 559 + if (blk_mq_get_hctx_type(opf) != rq->mq_hctx->type) 560 + return NULL; 561 + if (op_is_flush(rq->cmd_flags) != op_is_flush(opf)) 562 + return NULL; 563 + 564 + plug->cached_rq = rq_list_next(rq); 585 565 } 586 - rq = rq_list_peek(&plug->cached_rq); 587 - if (!rq || rq->q != q) 588 - return NULL; 589 566 590 - if (blk_mq_get_hctx_type(opf) != rq->mq_hctx->type) 591 - return NULL; 592 - if (op_is_flush(rq->cmd_flags) != op_is_flush(opf)) 593 - return NULL; 594 - 595 - plug->cached_rq = rq_list_next(rq); 596 - got_it: 597 567 rq->cmd_flags = opf; 598 568 INIT_LIST_HEAD(&rq->queuelist); 599 569 return rq; ··· 1561 1529 blk_add_timer(req); 1562 1530 } 1563 1531 1564 - static bool blk_mq_req_expired(struct request *rq, unsigned long *next) 1532 + struct blk_expired_data { 1533 + bool has_timedout_rq; 1534 + unsigned long next; 1535 + unsigned long timeout_start; 1536 + }; 1537 + 1538 + static bool blk_mq_req_expired(struct request *rq, struct blk_expired_data *expired) 1565 1539 { 1566 1540 unsigned long deadline; 1567 1541 ··· 1577 1539 return false; 1578 1540 1579 1541 deadline = READ_ONCE(rq->deadline); 1580 - if (time_after_eq(jiffies, deadline)) 1542 + if (time_after_eq(expired->timeout_start, deadline)) 1581 1543 return true; 1582 1544 1583 - if (*next == 0) 1584 - *next = deadline; 1585 - else if (time_after(*next, deadline)) 1586 - *next = deadline; 1545 + if (expired->next == 0) 
1546 + expired->next = deadline; 1547 + else if (time_after(expired->next, deadline)) 1548 + expired->next = deadline; 1587 1549 return false; 1588 1550 } 1589 1551 ··· 1599 1561 1600 1562 static bool blk_mq_check_expired(struct request *rq, void *priv) 1601 1563 { 1602 - unsigned long *next = priv; 1564 + struct blk_expired_data *expired = priv; 1603 1565 1604 1566 /* 1605 1567 * blk_mq_queue_tag_busy_iter() has locked the request, so it cannot ··· 1608 1570 * it was completed and reallocated as a new request after returning 1609 1571 * from blk_mq_check_expired(). 1610 1572 */ 1611 - if (blk_mq_req_expired(rq, next)) 1573 + if (blk_mq_req_expired(rq, expired)) { 1574 + expired->has_timedout_rq = true; 1575 + return false; 1576 + } 1577 + return true; 1578 + } 1579 + 1580 + static bool blk_mq_handle_expired(struct request *rq, void *priv) 1581 + { 1582 + struct blk_expired_data *expired = priv; 1583 + 1584 + if (blk_mq_req_expired(rq, expired)) 1612 1585 blk_mq_rq_timed_out(rq); 1613 1586 return true; 1614 1587 } ··· 1628 1579 { 1629 1580 struct request_queue *q = 1630 1581 container_of(work, struct request_queue, timeout_work); 1631 - unsigned long next = 0; 1582 + struct blk_expired_data expired = { 1583 + .timeout_start = jiffies, 1584 + }; 1632 1585 struct blk_mq_hw_ctx *hctx; 1633 1586 unsigned long i; 1634 1587 ··· 1650 1599 if (!percpu_ref_tryget(&q->q_usage_counter)) 1651 1600 return; 1652 1601 1653 - blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &next); 1602 + /* check if there is any timed-out request */ 1603 + blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &expired); 1604 + if (expired.has_timedout_rq) { 1605 + /* 1606 + * Before walking tags, we must ensure any submit started 1607 + * before the current time has finished. 
Since the submit 1608 + * uses srcu or rcu, wait for a synchronization point to 1609 + * ensure all running submits have finished 1610 + */ 1611 + blk_mq_wait_quiesce_done(q->tag_set); 1654 1612 1655 - if (next != 0) { 1656 - mod_timer(&q->timeout, next); 1613 + expired.next = 0; 1614 + blk_mq_queue_tag_busy_iter(q, blk_mq_handle_expired, &expired); 1615 + } 1616 + 1617 + if (expired.next != 0) { 1618 + mod_timer(&q->timeout, expired.next); 1657 1619 } else { 1658 1620 /* 1659 1621 * Request timeouts are handled as a forward rolling timer. If ··· 3312 3248 tags->rqs = kcalloc_node(nr_tags, sizeof(struct request *), 3313 3249 GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY, 3314 3250 node); 3315 - if (!tags->rqs) { 3316 - blk_mq_free_tags(tags); 3317 - return NULL; 3318 - } 3251 + if (!tags->rqs) 3252 + goto err_free_tags; 3319 3253 3320 3254 tags->static_rqs = kcalloc_node(nr_tags, sizeof(struct request *), 3321 3255 GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY, 3322 3256 node); 3323 - if (!tags->static_rqs) { 3324 - kfree(tags->rqs); 3325 - blk_mq_free_tags(tags); 3326 - return NULL; 3327 - } 3257 + if (!tags->static_rqs) 3258 + goto err_free_rqs; 3328 3259 3329 3260 return tags; 3261 + 3262 + err_free_rqs: 3263 + kfree(tags->rqs); 3264 + err_free_tags: 3265 + blk_mq_free_tags(tags); 3266 + return NULL; 3330 3267 } 3331 3268 3332 3269 static int blk_mq_init_request(struct blk_mq_tag_set *set, struct request *rq, ··· 4040 3975 struct request_queue *q; 4041 3976 int ret; 4042 3977 4043 - q = blk_alloc_queue(set->numa_node, set->flags & BLK_MQ_F_BLOCKING); 3978 + q = blk_alloc_queue(set->numa_node); 4044 3979 if (!q) 4045 3980 return ERR_PTR(-ENOMEM); 4046 3981 q->queuedata = queuedata; ··· 4076 4011 4077 4012 blk_queue_flag_set(QUEUE_FLAG_DYING, q); 4078 4013 blk_queue_start_drain(q); 4079 - blk_freeze_queue(q); 4014 + blk_mq_freeze_queue_wait(q); 4080 4015 4081 4016 blk_sync_queue(q); 4082 4017 blk_mq_cancel_work_sync(q); 4083 4018 blk_mq_exit_queue(q); 4084 - 4085 - /* @q is 
and will stay empty, shutdown and put */ 4086 - blk_put_queue(q); 4087 4019 } 4088 4020 EXPORT_SYMBOL(blk_mq_destroy_queue); 4089 4021 ··· 4097 4035 disk = __alloc_disk_node(q, set->numa_node, lkclass); 4098 4036 if (!disk) { 4099 4037 blk_mq_destroy_queue(q); 4038 + blk_put_queue(q); 4100 4039 return ERR_PTR(-ENOMEM); 4101 4040 } 4102 4041 set_bit(GD_OWNS_QUEUE, &disk->state); ··· 4210 4147 int blk_mq_init_allocated_queue(struct blk_mq_tag_set *set, 4211 4148 struct request_queue *q) 4212 4149 { 4213 - WARN_ON_ONCE(blk_queue_has_srcu(q) != 4214 - !!(set->flags & BLK_MQ_F_BLOCKING)); 4215 - 4216 4150 /* mark the queue as mq asap */ 4217 4151 q->mq_ops = set->ops; 4218 4152 ··· 4385 4325 } 4386 4326 4387 4327 static int blk_mq_realloc_tag_set_tags(struct blk_mq_tag_set *set, 4388 - int cur_nr_hw_queues, int new_nr_hw_queues) 4328 + int new_nr_hw_queues) 4389 4329 { 4390 4330 struct blk_mq_tags **new_tags; 4391 4331 4392 - if (cur_nr_hw_queues >= new_nr_hw_queues) 4393 - return 0; 4332 + if (set->nr_hw_queues >= new_nr_hw_queues) 4333 + goto done; 4394 4334 4395 4335 new_tags = kcalloc_node(new_nr_hw_queues, sizeof(struct blk_mq_tags *), 4396 4336 GFP_KERNEL, set->numa_node); ··· 4398 4338 return -ENOMEM; 4399 4339 4400 4340 if (set->tags) 4401 - memcpy(new_tags, set->tags, cur_nr_hw_queues * 4341 + memcpy(new_tags, set->tags, set->nr_hw_queues * 4402 4342 sizeof(*set->tags)); 4403 4343 kfree(set->tags); 4404 4344 set->tags = new_tags; 4345 + done: 4405 4346 set->nr_hw_queues = new_nr_hw_queues; 4406 - 4407 4347 return 0; 4408 - } 4409 - 4410 - static int blk_mq_alloc_tag_set_tags(struct blk_mq_tag_set *set, 4411 - int new_nr_hw_queues) 4412 - { 4413 - return blk_mq_realloc_tag_set_tags(set, 0, new_nr_hw_queues); 4414 4348 } 4415 4349 4416 4350 /* ··· 4460 4406 if (set->nr_maps == 1 && set->nr_hw_queues > nr_cpu_ids) 4461 4407 set->nr_hw_queues = nr_cpu_ids; 4462 4408 4463 - if (blk_mq_alloc_tag_set_tags(set, set->nr_hw_queues) < 0) 4464 - return -ENOMEM; 4409 + if 
(set->flags & BLK_MQ_F_BLOCKING) { 4410 + set->srcu = kmalloc(sizeof(*set->srcu), GFP_KERNEL); 4411 + if (!set->srcu) 4412 + return -ENOMEM; 4413 + ret = init_srcu_struct(set->srcu); 4414 + if (ret) 4415 + goto out_free_srcu; 4416 + } 4465 4417 4466 4418 ret = -ENOMEM; 4419 + set->tags = kcalloc_node(set->nr_hw_queues, 4420 + sizeof(struct blk_mq_tags *), GFP_KERNEL, 4421 + set->numa_node); 4422 + if (!set->tags) 4423 + goto out_cleanup_srcu; 4424 + 4467 4425 for (i = 0; i < set->nr_maps; i++) { 4468 4426 set->map[i].mq_map = kcalloc_node(nr_cpu_ids, 4469 4427 sizeof(set->map[i].mq_map[0]), ··· 4503 4437 } 4504 4438 kfree(set->tags); 4505 4439 set->tags = NULL; 4440 + out_cleanup_srcu: 4441 + if (set->flags & BLK_MQ_F_BLOCKING) 4442 + cleanup_srcu_struct(set->srcu); 4443 + out_free_srcu: 4444 + if (set->flags & BLK_MQ_F_BLOCKING) 4445 + kfree(set->srcu); 4506 4446 return ret; 4507 4447 } 4508 4448 EXPORT_SYMBOL(blk_mq_alloc_tag_set); ··· 4548 4476 4549 4477 kfree(set->tags); 4550 4478 set->tags = NULL; 4479 + if (set->flags & BLK_MQ_F_BLOCKING) { 4480 + cleanup_srcu_struct(set->srcu); 4481 + kfree(set->srcu); 4482 + } 4551 4483 } 4552 4484 EXPORT_SYMBOL(blk_mq_free_tag_set); 4553 4485 ··· 4640 4564 INIT_LIST_HEAD(&qe->node); 4641 4565 qe->q = q; 4642 4566 qe->type = q->elevator->type; 4567 + /* keep a reference to the elevator module as we'll switch back */ 4568 + __elevator_get(qe->type); 4643 4569 list_add(&qe->node, head); 4644 - 4645 - /* 4646 - * After elevator_switch, the previous elevator_queue will be 4647 - * released by elevator_release. The reference of the io scheduler 4648 - * module get by elevator_get will also be put. So we need to get 4649 - * a reference of the io scheduler module here to prevent it to be 4650 - * removed. 
4651 - */ 4652 - __module_get(qe->type->elevator_owner); 4653 - elevator_switch(q, NULL); 4570 + elevator_disable(q); 4654 4571 mutex_unlock(&q->sysfs_lock); 4655 4572 4656 4573 return true; ··· 4676 4607 4677 4608 mutex_lock(&q->sysfs_lock); 4678 4609 elevator_switch(q, t); 4610 + /* drop the reference acquired in blk_mq_elv_switch_none */ 4611 + elevator_put(t); 4679 4612 mutex_unlock(&q->sysfs_lock); 4680 4613 } 4681 4614 ··· 4714 4643 } 4715 4644 4716 4645 prev_nr_hw_queues = set->nr_hw_queues; 4717 - if (blk_mq_realloc_tag_set_tags(set, set->nr_hw_queues, nr_hw_queues) < 4718 - 0) 4646 + if (blk_mq_realloc_tag_set_tags(set, nr_hw_queues) < 0) 4719 4647 goto reregister; 4720 4648 4721 - set->nr_hw_queues = nr_hw_queues; 4722 4649 fallback: 4723 4650 blk_mq_update_queue_map(set); 4724 4651 list_for_each_entry(q, &set->tag_list, tag_set_list) { ··· 4936 4867 4937 4868 void blk_mq_cancel_work_sync(struct request_queue *q) 4938 4869 { 4939 - if (queue_is_mq(q)) { 4940 - struct blk_mq_hw_ctx *hctx; 4941 - unsigned long i; 4870 + struct blk_mq_hw_ctx *hctx; 4871 + unsigned long i; 4942 4872 4943 - cancel_delayed_work_sync(&q->requeue_work); 4873 + cancel_delayed_work_sync(&q->requeue_work); 4944 4874 4945 - queue_for_each_hw_ctx(q, hctx, i) 4946 - cancel_delayed_work_sync(&hctx->run_work); 4947 - } 4875 + queue_for_each_hw_ctx(q, hctx, i) 4876 + cancel_delayed_work_sync(&hctx->run_work); 4948 4877 } 4949 4878 4950 4879 static int __init blk_mq_init(void)
+7 -7
block/blk-mq.h
··· 377 377 /* run the code block in @dispatch_ops with rcu/srcu read lock held */ 378 378 #define __blk_mq_run_dispatch_ops(q, check_sleep, dispatch_ops) \ 379 379 do { \ 380 - if (!blk_queue_has_srcu(q)) { \ 381 - rcu_read_lock(); \ 382 - (dispatch_ops); \ 383 - rcu_read_unlock(); \ 384 - } else { \ 380 + if ((q)->tag_set->flags & BLK_MQ_F_BLOCKING) { \ 385 381 int srcu_idx; \ 386 382 \ 387 383 might_sleep_if(check_sleep); \ 388 - srcu_idx = srcu_read_lock((q)->srcu); \ 384 + srcu_idx = srcu_read_lock((q)->tag_set->srcu); \ 389 385 (dispatch_ops); \ 390 - srcu_read_unlock((q)->srcu, srcu_idx); \ 386 + srcu_read_unlock((q)->tag_set->srcu, srcu_idx); \ 387 + } else { \ 388 + rcu_read_lock(); \ 389 + (dispatch_ops); \ 390 + rcu_read_unlock(); \ 391 391 } \ 392 392 } while (0) 393 393
+3 -3
block/blk-settings.c
··· 481 481 } 482 482 EXPORT_SYMBOL(blk_queue_io_opt); 483 483 484 - static int queue_limit_alignment_offset(struct queue_limits *lim, 484 + static int queue_limit_alignment_offset(const struct queue_limits *lim, 485 485 sector_t sector) 486 486 { 487 487 unsigned int granularity = max(lim->physical_block_size, lim->io_min); ··· 491 491 return (granularity + lim->alignment_offset - alignment) % granularity; 492 492 } 493 493 494 - static unsigned int queue_limit_discard_alignment(struct queue_limits *lim, 495 - sector_t sector) 494 + static unsigned int queue_limit_discard_alignment( 495 + const struct queue_limits *lim, sector_t sector) 496 496 { 497 497 unsigned int alignment, granularity, offset; 498 498
+53 -84
block/blk-sysfs.c
··· 470 470 if (!wbt_rq_qos(q)) 471 471 return -EINVAL; 472 472 473 + if (wbt_disabled(q)) 474 + return sprintf(page, "0\n"); 475 + 473 476 return sprintf(page, "%llu\n", div_u64(wbt_get_min_lat(q), 1000)); 474 477 } 475 478 ··· 683 680 static umode_t queue_attr_visible(struct kobject *kobj, struct attribute *attr, 684 681 int n) 685 682 { 686 - struct request_queue *q = 687 - container_of(kobj, struct request_queue, kobj); 683 + struct gendisk *disk = container_of(kobj, struct gendisk, queue_kobj); 684 + struct request_queue *q = disk->queue; 688 685 689 686 if (attr == &queue_io_timeout_entry.attr && 690 687 (!q->mq_ops || !q->mq_ops->timeout)) ··· 710 707 queue_attr_show(struct kobject *kobj, struct attribute *attr, char *page) 711 708 { 712 709 struct queue_sysfs_entry *entry = to_queue(attr); 713 - struct request_queue *q = 714 - container_of(kobj, struct request_queue, kobj); 710 + struct gendisk *disk = container_of(kobj, struct gendisk, queue_kobj); 711 + struct request_queue *q = disk->queue; 715 712 ssize_t res; 716 713 717 714 if (!entry->show) ··· 727 724 const char *page, size_t length) 728 725 { 729 726 struct queue_sysfs_entry *entry = to_queue(attr); 730 - struct request_queue *q; 727 + struct gendisk *disk = container_of(kobj, struct gendisk, queue_kobj); 728 + struct request_queue *q = disk->queue; 731 729 ssize_t res; 732 730 733 731 if (!entry->store) 734 732 return -EIO; 735 733 736 - q = container_of(kobj, struct request_queue, kobj); 737 734 mutex_lock(&q->sysfs_lock); 738 735 res = entry->store(q, page, length); 739 736 mutex_unlock(&q->sysfs_lock); 740 737 return res; 741 - } 742 - 743 - static void blk_free_queue_rcu(struct rcu_head *rcu_head) 744 - { 745 - struct request_queue *q = container_of(rcu_head, struct request_queue, 746 - rcu_head); 747 - 748 - kmem_cache_free(blk_get_queue_kmem_cache(blk_queue_has_srcu(q)), q); 749 - } 750 - 751 - /** 752 - * blk_release_queue - releases all allocated resources of the request_queue 753 - * 
@kobj: pointer to a kobject, whose container is a request_queue 754 - * 755 - * This function releases all allocated resources of the request queue. 756 - * 757 - * The struct request_queue refcount is incremented with blk_get_queue() and 758 - * decremented with blk_put_queue(). Once the refcount reaches 0 this function 759 - * is called. 760 - * 761 - * Drivers exist which depend on the release of the request_queue to be 762 - * synchronous, it should not be deferred. 763 - * 764 - * Context: can sleep 765 - */ 766 - static void blk_release_queue(struct kobject *kobj) 767 - { 768 - struct request_queue *q = 769 - container_of(kobj, struct request_queue, kobj); 770 - 771 - might_sleep(); 772 - 773 - percpu_ref_exit(&q->q_usage_counter); 774 - 775 - if (q->poll_stat) 776 - blk_stat_remove_callback(q, q->poll_cb); 777 - blk_stat_free_callback(q->poll_cb); 778 - 779 - blk_free_queue_stats(q->stats); 780 - kfree(q->poll_stat); 781 - 782 - if (queue_is_mq(q)) 783 - blk_mq_release(q); 784 - 785 - if (blk_queue_has_srcu(q)) 786 - cleanup_srcu_struct(q->srcu); 787 - 788 - ida_free(&blk_queue_ida, q->id); 789 - call_rcu(&q->rcu_head, blk_free_queue_rcu); 790 738 } 791 739 792 740 static const struct sysfs_ops queue_sysfs_ops = { ··· 750 796 NULL 751 797 }; 752 798 753 - struct kobj_type blk_queue_ktype = { 799 + static void blk_queue_release(struct kobject *kobj) 800 + { 801 + /* nothing to do here, all data is associated with the parent gendisk */ 802 + } 803 + 804 + static struct kobj_type blk_queue_ktype = { 754 805 .default_groups = blk_queue_attr_groups, 755 806 .sysfs_ops = &queue_sysfs_ops, 756 - .release = blk_release_queue, 807 + .release = blk_queue_release, 757 808 }; 809 + 810 + static void blk_debugfs_remove(struct gendisk *disk) 811 + { 812 + struct request_queue *q = disk->queue; 813 + 814 + mutex_lock(&q->debugfs_mutex); 815 + blk_trace_shutdown(q); 816 + debugfs_remove_recursive(q->debugfs_dir); 817 + q->debugfs_dir = NULL; 818 + q->sched_debugfs_dir = 
NULL; 819 + q->rqos_debugfs_dir = NULL; 820 + mutex_unlock(&q->debugfs_mutex); 821 + } 758 822 759 823 /** 760 824 * blk_register_queue - register a block layer queue with sysfs ··· 784 812 int ret; 785 813 786 814 mutex_lock(&q->sysfs_dir_lock); 787 - 788 - ret = kobject_add(&q->kobj, &disk_to_dev(disk)->kobj, "queue"); 815 + kobject_init(&disk->queue_kobj, &blk_queue_ktype); 816 + ret = kobject_add(&disk->queue_kobj, &disk_to_dev(disk)->kobj, "queue"); 789 817 if (ret < 0) 790 - goto unlock; 818 + goto out_put_queue_kobj; 791 819 792 - if (queue_is_mq(q)) 793 - blk_mq_sysfs_register(disk); 820 + if (queue_is_mq(q)) { 821 + ret = blk_mq_sysfs_register(disk); 822 + if (ret) 823 + goto out_put_queue_kobj; 824 + } 794 825 mutex_lock(&q->sysfs_lock); 795 826 796 827 mutex_lock(&q->debugfs_mutex); 797 - q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent), 798 - blk_debugfs_root); 828 + q->debugfs_dir = debugfs_create_dir(disk->disk_name, blk_debugfs_root); 799 829 if (queue_is_mq(q)) 800 830 blk_mq_debugfs_register(q); 801 831 mutex_unlock(&q->debugfs_mutex); 802 832 803 833 ret = disk_register_independent_access_ranges(disk); 804 834 if (ret) 805 - goto put_dev; 835 + goto out_debugfs_remove; 806 836 807 837 if (q->elevator) { 808 838 ret = elv_register_queue(q, false); 809 839 if (ret) 810 - goto put_dev; 840 + goto out_unregister_ia_ranges; 811 841 } 812 842 813 - ret = blk_crypto_sysfs_register(q); 843 + ret = blk_crypto_sysfs_register(disk); 814 844 if (ret) 815 - goto put_dev; 845 + goto out_elv_unregister; 816 846 817 847 blk_queue_flag_set(QUEUE_FLAG_REGISTERED, q); 818 848 wbt_enable_default(q); 819 849 blk_throtl_register(disk); 820 850 821 851 /* Now everything is ready and send out KOBJ_ADD uevent */ 822 - kobject_uevent(&q->kobj, KOBJ_ADD); 852 + kobject_uevent(&disk->queue_kobj, KOBJ_ADD); 823 853 if (q->elevator) 824 854 kobject_uevent(&q->elevator->kobj, KOBJ_ADD); 825 855 mutex_unlock(&q->sysfs_lock); 826 - 827 - unlock: 828 856 
mutex_unlock(&q->sysfs_dir_lock); 829 857 830 858 /* ··· 843 871 844 872 return ret; 845 873 846 - put_dev: 874 + out_elv_unregister: 847 875 elv_unregister_queue(q); 876 + out_unregister_ia_ranges: 848 877 disk_unregister_independent_access_ranges(disk); 878 + out_debugfs_remove: 879 + blk_debugfs_remove(disk); 849 880 mutex_unlock(&q->sysfs_lock); 881 + out_put_queue_kobj: 882 + kobject_put(&disk->queue_kobj); 850 883 mutex_unlock(&q->sysfs_dir_lock); 851 - kobject_del(&q->kobj); 852 - 853 884 return ret; 854 885 } 855 886 ··· 890 915 */ 891 916 if (queue_is_mq(q)) 892 917 blk_mq_sysfs_unregister(disk); 893 - blk_crypto_sysfs_unregister(q); 918 + blk_crypto_sysfs_unregister(disk); 894 919 895 920 mutex_lock(&q->sysfs_lock); 896 921 elv_unregister_queue(q); ··· 898 923 mutex_unlock(&q->sysfs_lock); 899 924 900 925 /* Now that we've deleted all child objects, we can delete the queue. */ 901 - kobject_uevent(&q->kobj, KOBJ_REMOVE); 902 - kobject_del(&q->kobj); 926 + kobject_uevent(&disk->queue_kobj, KOBJ_REMOVE); 927 + kobject_del(&disk->queue_kobj); 903 928 mutex_unlock(&q->sysfs_dir_lock); 904 929 905 - mutex_lock(&q->debugfs_mutex); 906 - blk_trace_shutdown(q); 907 - debugfs_remove_recursive(q->debugfs_dir); 908 - q->debugfs_dir = NULL; 909 - q->sched_debugfs_dir = NULL; 910 - q->rqos_debugfs_dir = NULL; 911 - mutex_unlock(&q->debugfs_mutex); 930 + blk_debugfs_remove(disk); 912 931 }
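The reworked blk_register_queue() above replaces the single `unlock`/`put_dev` exit with a ladder of `out_*` labels, one per completed setup step, so a failure unwinds exactly what already succeeded. A minimal userspace sketch of that goto-unwind pattern (step names are illustrative stand-ins, not kernel API):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Each setup step gets a matching out_* label that undoes only the
 * steps that already succeeded, in reverse order.
 */
struct fake_queue {
	bool kobj_added;
	bool mq_registered;
	bool ranges_registered;
};

static int failing_step;		/* test knob: which step fails, 0 = none */

static int do_step(int n, bool *done)
{
	if (failing_step == n)
		return -1;		/* simulate -ENOMEM and friends */
	*done = true;
	return 0;
}

int register_queue(struct fake_queue *q)
{
	int ret;

	ret = do_step(1, &q->kobj_added);	/* kobject_add() */
	if (ret)
		goto out;
	ret = do_step(2, &q->mq_registered);	/* blk_mq_sysfs_register() */
	if (ret)
		goto out_put_queue_kobj;
	ret = do_step(3, &q->ranges_registered); /* access ranges etc. */
	if (ret)
		goto out_unregister_mq;
	return 0;

out_unregister_mq:
	q->mq_registered = false;		/* undo step 2 */
out_put_queue_kobj:
	q->kobj_added = false;			/* undo step 1 */
out:
	return ret;
}
```

Falling through the labels is what makes the ordering correct: jumping to the label for step N undoes steps N-1 down to 1.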
+55 -47
block/blk-throttle.c
··· 129 129 /* 130 130 * cgroup's limit in LIMIT_MAX is scaled if low limit is set. This scale is to 131 131 * make the IO dispatch more smooth. 132 - * Scale up: linearly scale up according to lapsed time since upgrade. For 132 + * Scale up: linearly scale up according to elapsed time since upgrade. For 133 133 * every throtl_slice, the limit scales up 1/2 .low limit till the 134 134 * limit hits .max limit 135 135 * Scale down: exponentially scale down if a cgroup doesn't hit its .low limit ··· 395 395 * If on the default hierarchy, we switch to properly hierarchical 396 396 * behavior where limits on a given throtl_grp are applied to the 397 397 * whole subtree rather than just the group itself. e.g. If 16M 398 - * read_bps limit is set on the root group, the whole system can't 399 - * exceed 16M for the device. 398 + * read_bps limit is set on a parent group, summary bps of 399 + * parent group and its subtree groups can't exceed 16M for the 400 + * device. 400 401 * 401 402 * If not on the default hierarchy, the broken flat hierarchy 402 403 * behavior is retained where all throtl_grps are treated as if ··· 645 644 * that bandwidth. Do try to make use of that bandwidth while giving 646 645 * credit. 
647 646 */ 648 - if (time_after_eq(start, tg->slice_start[rw])) 647 + if (time_after(start, tg->slice_start[rw])) 649 648 tg->slice_start[rw] = start; 650 649 651 650 tg->slice_end[rw] = jiffies + tg->td->throtl_slice; ··· 822 821 tg->carryover_ios[READ], tg->carryover_ios[WRITE]); 823 822 } 824 823 825 - static bool tg_within_iops_limit(struct throtl_grp *tg, struct bio *bio, 826 - u32 iops_limit, unsigned long *wait) 824 + static unsigned long tg_within_iops_limit(struct throtl_grp *tg, struct bio *bio, 825 + u32 iops_limit) 827 826 { 828 827 bool rw = bio_data_dir(bio); 829 828 unsigned int io_allowed; 830 829 unsigned long jiffy_elapsed, jiffy_wait, jiffy_elapsed_rnd; 831 830 832 831 if (iops_limit == UINT_MAX) { 833 - if (wait) 834 - *wait = 0; 835 - return true; 832 + return 0; 836 833 } 837 834 838 835 jiffy_elapsed = jiffies - tg->slice_start[rw]; ··· 840 841 io_allowed = calculate_io_allowed(iops_limit, jiffy_elapsed_rnd) + 841 842 tg->carryover_ios[rw]; 842 843 if (tg->io_disp[rw] + 1 <= io_allowed) { 843 - if (wait) 844 - *wait = 0; 845 - return true; 844 + return 0; 846 845 } 847 846 848 847 /* Calc approx time to dispatch */ 849 848 jiffy_wait = jiffy_elapsed_rnd - jiffy_elapsed; 850 - 851 - if (wait) 852 - *wait = jiffy_wait; 853 - return false; 849 + return jiffy_wait; 854 850 } 855 851 856 - static bool tg_within_bps_limit(struct throtl_grp *tg, struct bio *bio, 857 - u64 bps_limit, unsigned long *wait) 852 + static unsigned long tg_within_bps_limit(struct throtl_grp *tg, struct bio *bio, 853 + u64 bps_limit) 858 854 { 859 855 bool rw = bio_data_dir(bio); 860 856 u64 bytes_allowed, extra_bytes; ··· 858 864 859 865 /* no need to throttle if this bio's bytes have been accounted */ 860 866 if (bps_limit == U64_MAX || bio_flagged(bio, BIO_BPS_THROTTLED)) { 861 - if (wait) 862 - *wait = 0; 863 - return true; 867 + return 0; 864 868 } 865 869 866 870 jiffy_elapsed = jiffy_elapsed_rnd = jiffies - tg->slice_start[rw]; ··· 871 879 bytes_allowed = 
calculate_bytes_allowed(bps_limit, jiffy_elapsed_rnd) + 872 880 tg->carryover_bytes[rw]; 873 881 if (tg->bytes_disp[rw] + bio_size <= bytes_allowed) { 874 - if (wait) 875 - *wait = 0; 876 - return true; 882 + return 0; 877 883 } 878 884 879 885 /* Calc approx time to dispatch */ ··· 886 896 * up we did. Add that time also. 887 897 */ 888 898 jiffy_wait = jiffy_wait + (jiffy_elapsed_rnd - jiffy_elapsed); 889 - if (wait) 890 - *wait = jiffy_wait; 891 - return false; 899 + return jiffy_wait; 892 900 } 893 901 894 902 /* ··· 934 946 jiffies + tg->td->throtl_slice); 935 947 } 936 948 937 - if (tg_within_bps_limit(tg, bio, bps_limit, &bps_wait) && 938 - tg_within_iops_limit(tg, bio, iops_limit, &iops_wait)) { 949 + bps_wait = tg_within_bps_limit(tg, bio, bps_limit); 950 + iops_wait = tg_within_iops_limit(tg, bio, iops_limit); 951 + if (bps_wait + iops_wait == 0) { 939 952 if (wait) 940 953 *wait = 0; 941 954 return true; ··· 1055 1066 sq->nr_queued[rw]--; 1056 1067 1057 1068 throtl_charge_bio(tg, bio); 1058 - bio_set_flag(bio, BIO_BPS_THROTTLED); 1059 1069 1060 1070 /* 1061 1071 * If our parent is another tg, we just need to transfer @bio to ··· 1067 1079 throtl_add_bio_tg(bio, &tg->qnode_on_parent[rw], parent_tg); 1068 1080 start_parent_slice_with_credit(tg, parent_tg, rw); 1069 1081 } else { 1082 + bio_set_flag(bio, BIO_BPS_THROTTLED); 1070 1083 throtl_qnode_add_bio(bio, &tg->qnode_on_parent[rw], 1071 1084 &parent_sq->queued[rw]); 1072 1085 BUG_ON(tg->td->nr_queued[rw] <= 0); ··· 1726 1737 * Set the flag to make sure throtl_pending_timer_fn() won't 1727 1738 * stop until all throttled bios are dispatched. 1728 1739 */ 1729 - blkg_to_tg(blkg)->flags |= THROTL_TG_CANCELING; 1740 + tg->flags |= THROTL_TG_CANCELING; 1741 + 1742 + /* 1743 + * Do not dispatch cgroup without THROTL_TG_PENDING or cgroup 1744 + * will be inserted to service queue without THROTL_TG_PENDING 1745 + * set in tg_update_disptime below. 
Then IO dispatched from 1746 + * child in tg_dispatch_one_bio will trigger double insertion 1747 + * and corrupt the tree. 1748 + */ 1749 + if (!(tg->flags & THROTL_TG_PENDING)) 1750 + continue; 1751 + 1730 1752 /* 1731 1753 * Update disptime after setting the above flag to make sure 1732 1754 * throtl_select_dispatch() won't exit without dispatching. ··· 1762 1762 return min(rtime, wtime); 1763 1763 } 1764 1764 1765 - /* tg should not be an intermediate node */ 1766 1765 static unsigned long tg_last_low_overflow_time(struct throtl_grp *tg) 1767 1766 { 1768 1767 struct throtl_service_queue *parent_sq; ··· 1815 1816 return ret; 1816 1817 } 1817 1818 1818 - static bool throtl_tg_can_upgrade(struct throtl_grp *tg) 1819 + static bool throtl_low_limit_reached(struct throtl_grp *tg, int rw) 1819 1820 { 1820 1821 struct throtl_service_queue *sq = &tg->service_queue; 1821 - bool read_limit, write_limit; 1822 + bool limit = tg->bps[rw][LIMIT_LOW] || tg->iops[rw][LIMIT_LOW]; 1822 1823 1823 1824 /* 1824 - * if cgroup reaches low limit (if low limit is 0, the cgroup always 1825 - * reaches), it's ok to upgrade to next limit 1825 + * if low limit is zero, low limit is always reached. 1826 + * if low limit is non-zero, we can check if there is any request 1827 + * is queued to determine if low limit is reached as we throttle 1828 + * request according to limit. 
1826 1829 */ 1827 - read_limit = tg->bps[READ][LIMIT_LOW] || tg->iops[READ][LIMIT_LOW]; 1828 - write_limit = tg->bps[WRITE][LIMIT_LOW] || tg->iops[WRITE][LIMIT_LOW]; 1829 - if (!read_limit && !write_limit) 1830 - return true; 1831 - if (read_limit && sq->nr_queued[READ] && 1832 - (!write_limit || sq->nr_queued[WRITE])) 1833 - return true; 1834 - if (write_limit && sq->nr_queued[WRITE] && 1835 - (!read_limit || sq->nr_queued[READ])) 1830 + return !limit || sq->nr_queued[rw]; 1831 + } 1832 + 1833 + static bool throtl_tg_can_upgrade(struct throtl_grp *tg) 1834 + { 1835 + /* 1836 + * cgroup reaches low limit when low limit of READ and WRITE are 1837 + * both reached, it's ok to upgrade to next limit if cgroup reaches 1838 + * low limit 1839 + */ 1840 + if (throtl_low_limit_reached(tg, READ) && 1841 + throtl_low_limit_reached(tg, WRITE)) 1836 1842 return true; 1837 1843 1838 1844 if (time_after_eq(jiffies, ··· 1955 1951 * If cgroup is below low limit, consider downgrade and throttle other 1956 1952 * cgroups 1957 1953 */ 1958 - if (time_after_eq(now, td->low_upgrade_time + td->throtl_slice) && 1959 - time_after_eq(now, tg_last_low_overflow_time(tg) + 1954 + if (time_after_eq(now, tg_last_low_overflow_time(tg) + 1960 1955 td->throtl_slice) && 1961 1956 (!throtl_tg_is_idle(tg) || 1962 1957 !list_empty(&tg_to_blkg(tg)->blkcg->css.children))) ··· 1965 1962 1966 1963 static bool throtl_hierarchy_can_downgrade(struct throtl_grp *tg) 1967 1964 { 1965 + struct throtl_data *td = tg->td; 1966 + 1967 + if (time_before(jiffies, td->low_upgrade_time + td->throtl_slice)) 1968 + return false; 1969 + 1968 1970 while (true) { 1969 1971 if (!throtl_tg_can_downgrade(tg)) 1970 1972 return false;
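A recurring cleanup in the blk-throttle hunks above is dropping the `bool` return plus `unsigned long *wait` out-parameter from `tg_within_bps_limit()`/`tg_within_iops_limit()` in favour of returning the wait time directly, with 0 meaning "within limit". A simplified stand-in (the real helpers also handle slice extension and carryover credit):

```c
#include <assert.h>

/* Returns the approximate wait in jiffies; 0 means the bio fits now. */
static unsigned long within_limit(unsigned long dispatched,
				  unsigned long allowed,
				  unsigned long jiffy_wait)
{
	if (dispatched < allowed)
		return 0;		/* fits: no wait needed */
	return jiffy_wait;		/* time until it would fit */
}

/* Caller side, as in tg_may_dispatch(): the two results combine with
 * "bps_wait + iops_wait == 0" instead of two boolean calls. */
static int may_dispatch(unsigned long bps_wait, unsigned long iops_wait,
			unsigned long *wait)
{
	if (bps_wait + iops_wait == 0) {
		*wait = 0;
		return 1;		/* dispatch now */
	}
	*wait = bps_wait > iops_wait ? bps_wait : iops_wait;
	return 0;
}
```

This keeps one caller responsible for combining the limits, rather than threading the out-parameter through every helper.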
+22 -4
block/blk-wbt.c
··· 27 27 28 28 #include "blk-wbt.h" 29 29 #include "blk-rq-qos.h" 30 + #include "elevator.h" 30 31 31 32 #define CREATE_TRACE_POINTS 32 33 #include <trace/events/wbt.h> ··· 423 422 rwb_wake_all(rwb); 424 423 } 425 424 425 + bool wbt_disabled(struct request_queue *q) 426 + { 427 + struct rq_qos *rqos = wbt_rq_qos(q); 428 + 429 + return !rqos || RQWB(rqos)->enable_state == WBT_STATE_OFF_DEFAULT || 430 + RQWB(rqos)->enable_state == WBT_STATE_OFF_MANUAL; 431 + } 432 + 426 433 u64 wbt_get_min_lat(struct request_queue *q) 427 434 { 428 435 struct rq_qos *rqos = wbt_rq_qos(q); ··· 444 435 struct rq_qos *rqos = wbt_rq_qos(q); 445 436 if (!rqos) 446 437 return; 438 + 447 439 RQWB(rqos)->min_lat_nsec = val; 448 - RQWB(rqos)->enable_state = WBT_STATE_ON_MANUAL; 440 + if (val) 441 + RQWB(rqos)->enable_state = WBT_STATE_ON_MANUAL; 442 + else 443 + RQWB(rqos)->enable_state = WBT_STATE_OFF_MANUAL; 444 + 449 445 wbt_update_limits(RQWB(rqos)); 450 446 } 451 447 ··· 652 638 */ 653 639 void wbt_enable_default(struct request_queue *q) 654 640 { 655 - struct rq_qos *rqos = wbt_rq_qos(q); 641 + struct rq_qos *rqos; 642 + bool disable_flag = q->elevator && 643 + test_bit(ELEVATOR_FLAG_DISABLE_WBT, &q->elevator->flags); 656 644 657 645 /* Throttling already enabled? */ 646 + rqos = wbt_rq_qos(q); 658 647 if (rqos) { 659 - if (RQWB(rqos)->enable_state == WBT_STATE_OFF_DEFAULT) 648 + if (!disable_flag && 649 + RQWB(rqos)->enable_state == WBT_STATE_OFF_DEFAULT) 660 650 RQWB(rqos)->enable_state = WBT_STATE_ON_DEFAULT; 661 651 return; 662 652 } ··· 669 651 if (!blk_queue_registered(q)) 670 652 return; 671 653 672 - if (queue_is_mq(q) && IS_ENABLED(CONFIG_BLK_WBT_MQ)) 654 + if (queue_is_mq(q) && !disable_flag) 673 655 wbt_init(q); 674 656 } 675 657 EXPORT_SYMBOL_GPL(wbt_enable_default);
+12 -5
block/blk-wbt.h
··· 28 28 }; 29 29 30 30 /* 31 - * Enable states. Either off, or on by default (done at init time), 32 - * or on through manual setup in sysfs. 31 + * If current state is WBT_STATE_ON/OFF_DEFAULT, it can be covered to any other 32 + * state, if current state is WBT_STATE_ON/OFF_MANUAL, it can only be covered 33 + * to WBT_STATE_OFF/ON_MANUAL. 33 34 */ 34 35 enum { 35 - WBT_STATE_ON_DEFAULT = 1, 36 - WBT_STATE_ON_MANUAL = 2, 37 - WBT_STATE_OFF_DEFAULT 36 + WBT_STATE_ON_DEFAULT = 1, /* on by default */ 37 + WBT_STATE_ON_MANUAL = 2, /* on manually by sysfs */ 38 + WBT_STATE_OFF_DEFAULT = 3, /* off by default */ 39 + WBT_STATE_OFF_MANUAL = 4, /* off manually by sysfs */ 38 40 }; 39 41 40 42 struct rq_wb { ··· 96 94 97 95 u64 wbt_get_min_lat(struct request_queue *q); 98 96 void wbt_set_min_lat(struct request_queue *q, u64 val); 97 + bool wbt_disabled(struct request_queue *); 99 98 100 99 void wbt_set_write_cache(struct request_queue *, bool); 101 100 ··· 127 124 static inline u64 wbt_default_latency_nsec(struct request_queue *q) 128 125 { 129 126 return 0; 127 + } 128 + static inline bool wbt_disabled(struct request_queue *q) 129 + { 130 + return true; 130 131 } 131 132 132 133 #endif /* CONFIG_BLK_WBT */
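The blk-wbt changes above grow the enable state from three values to four so that a manual off (writing 0 to `wbt_lat_usec`) is remembered and not undone by `wbt_enable_default()`. A sketch of the resulting state machine; the enum values mirror blk-wbt.h, while the transition helpers are simplified stand-ins:

```c
#include <assert.h>

enum wbt_state {
	WBT_STATE_ON_DEFAULT  = 1,	/* on by default */
	WBT_STATE_ON_MANUAL   = 2,	/* on manually by sysfs */
	WBT_STATE_OFF_DEFAULT = 3,	/* off by default */
	WBT_STATE_OFF_MANUAL  = 4,	/* off manually by sysfs */
};

/* Writing wbt_lat_usec: non-zero enables; zero now records an explicit
 * manual off instead of leaving the state as ON_MANUAL. */
static enum wbt_state set_min_lat(unsigned long long val)
{
	return val ? WBT_STATE_ON_MANUAL : WBT_STATE_OFF_MANUAL;
}

/* wbt_enable_default() only flips a *default* off back on; manual
 * choices survive, e.g. across an elevator switch. */
static enum wbt_state enable_default(enum wbt_state cur)
{
	return cur == WBT_STATE_OFF_DEFAULT ? WBT_STATE_ON_DEFAULT : cur;
}

/* wbt_disabled() treats both off states as disabled. */
static int wbt_off(enum wbt_state cur)
{
	return cur == WBT_STATE_OFF_DEFAULT || cur == WBT_STATE_OFF_MANUAL;
}
```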
+9 -18
block/blk.h
··· 26 26 spinlock_t mq_flush_lock; 27 27 }; 28 28 29 - extern struct kmem_cache *blk_requestq_cachep; 30 - extern struct kmem_cache *blk_requestq_srcu_cachep; 31 - extern struct kobj_type blk_queue_ktype; 32 - extern struct ida blk_queue_ida; 33 - 34 29 bool is_flush_rq(struct request *req); 35 30 36 31 struct blk_flush_queue *blk_alloc_flush_queue(int node, int cmd_size, ··· 99 104 return true; 100 105 } 101 106 102 - static inline bool __bvec_gap_to_prev(struct queue_limits *lim, 107 + static inline bool __bvec_gap_to_prev(const struct queue_limits *lim, 103 108 struct bio_vec *bprv, unsigned int offset) 104 109 { 105 110 return (offset & lim->virt_boundary_mask) || ··· 110 115 * Check if adding a bio_vec after bprv with offset would create a gap in 111 116 * the SG list. Most drivers don't care about this, but some do. 112 117 */ 113 - static inline bool bvec_gap_to_prev(struct queue_limits *lim, 118 + static inline bool bvec_gap_to_prev(const struct queue_limits *lim, 114 119 struct bio_vec *bprv, unsigned int offset) 115 120 { 116 121 if (!lim->virt_boundary_mask) ··· 273 278 void blk_insert_flush(struct request *rq); 274 279 275 280 int elevator_switch(struct request_queue *q, struct elevator_type *new_e); 281 + void elevator_disable(struct request_queue *q); 276 282 void elevator_exit(struct request_queue *q); 277 283 int elv_register_queue(struct request_queue *q, bool uevent); 278 284 void elv_unregister_queue(struct request_queue *q); ··· 293 297 const char *, size_t); 294 298 295 299 static inline bool bio_may_exceed_limits(struct bio *bio, 296 - struct queue_limits *lim) 300 + const struct queue_limits *lim) 297 301 { 298 302 switch (bio_op(bio)) { 299 303 case REQ_OP_DISCARD: ··· 316 320 bio->bi_io_vec->bv_len + bio->bi_io_vec->bv_offset > PAGE_SIZE; 317 321 } 318 322 319 - struct bio *__bio_split_to_limits(struct bio *bio, struct queue_limits *lim, 320 - unsigned int *nr_segs); 323 + struct bio *__bio_split_to_limits(struct bio *bio, 324 + const 
struct queue_limits *lim, 325 + unsigned int *nr_segs); 321 326 int ll_back_merge_fn(struct request *req, struct bio *bio, 322 327 unsigned int nr_segs); 323 328 bool blk_attempt_req_merge(struct request_queue *q, struct request *rq, ··· 425 428 struct page *page, unsigned int len, unsigned int offset, 426 429 unsigned int max_sectors, bool *same_page); 427 430 428 - static inline struct kmem_cache *blk_get_queue_kmem_cache(bool srcu) 429 - { 430 - if (srcu) 431 - return blk_requestq_srcu_cachep; 432 - return blk_requestq_cachep; 433 - } 434 - struct request_queue *blk_alloc_queue(int node_id, bool alloc_srcu); 431 + struct request_queue *blk_alloc_queue(int node_id); 435 432 436 - int disk_scan_partitions(struct gendisk *disk, fmode_t mode); 433 + int disk_scan_partitions(struct gendisk *disk, fmode_t mode, void *owner); 437 434 438 435 int disk_alloc_events(struct gendisk *disk); 439 436 void disk_add_events(struct gendisk *disk);
+2
block/bsg-lib.c
··· 325 325 326 326 bsg_unregister_queue(bset->bd); 327 327 blk_mq_destroy_queue(q); 328 + blk_put_queue(q); 328 329 blk_mq_free_tag_set(&bset->tag_set); 329 330 kfree(bset); 330 331 } ··· 401 400 return q; 402 401 out_cleanup_queue: 403 402 blk_mq_destroy_queue(q); 403 + blk_put_queue(q); 404 404 out_queue: 405 405 blk_mq_free_tag_set(set); 406 406 out_tag_set:
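The bsg-lib fix above pairs `blk_mq_destroy_queue()` (shut the queue down) with an explicit `blk_put_queue()` to drop the final reference; destroying no longer implies freeing. A toy refcount model of that split, not kernel API:

```c
#include <assert.h>
#include <stdbool.h>

struct toy_queue {
	int refs;
	bool destroyed;
	bool freed;
};

/* "destroy": stop accepting I/O; outstanding references may remain. */
static void toy_destroy(struct toy_queue *q)
{
	q->destroyed = true;
}

/* "put": drop one reference; the last put frees the object (in the
 * kernel, the kobject release callback runs). */
static void toy_put(struct toy_queue *q)
{
	if (--q->refs == 0)
		q->freed = true;
}
```

Separating the two means a caller that still holds a reference after teardown must drop it explicitly, which is exactly the leak the `+ blk_put_queue(q);` lines plug.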
+7 -4
block/bsg.c
··· 175 175 176 176 void bsg_unregister_queue(struct bsg_device *bd) 177 177 { 178 - if (bd->queue->kobj.sd) 179 - sysfs_remove_link(&bd->queue->kobj, "bsg"); 178 + struct gendisk *disk = bd->queue->disk; 179 + 180 + if (disk && disk->queue_kobj.sd) 181 + sysfs_remove_link(&disk->queue_kobj, "bsg"); 180 182 cdev_device_del(&bd->cdev, &bd->device); 181 183 put_device(&bd->device); 182 184 } ··· 218 216 if (ret) 219 217 goto out_put_device; 220 218 221 - if (q->kobj.sd) { 222 - ret = sysfs_create_link(&q->kobj, &bd->device.kobj, "bsg"); 219 + if (q->disk && q->disk->queue_kobj.sd) { 220 + ret = sysfs_create_link(&q->disk->queue_kobj, &bd->device.kobj, 221 + "bsg"); 223 222 if (ret) 224 223 goto out_device_del; 225 224 }
+109 -147
block/elevator.c
··· 57 57 * Query io scheduler to see if the current process issuing bio may be 58 58 * merged with rq. 59 59 */ 60 - static int elv_iosched_allow_bio_merge(struct request *rq, struct bio *bio) 60 + static bool elv_iosched_allow_bio_merge(struct request *rq, struct bio *bio) 61 61 { 62 62 struct request_queue *q = rq->q; 63 63 struct elevator_queue *e = q->elevator; ··· 65 65 if (e->type->ops.allow_merge) 66 66 return e->type->ops.allow_merge(q, rq, bio); 67 67 68 - return 1; 68 + return true; 69 69 } 70 70 71 71 /* ··· 83 83 } 84 84 EXPORT_SYMBOL(elv_bio_merge_ok); 85 85 86 - static inline bool elv_support_features(unsigned int elv_features, 87 - unsigned int required_features) 86 + static inline bool elv_support_features(struct request_queue *q, 87 + const struct elevator_type *e) 88 88 { 89 - return (required_features & elv_features) == required_features; 89 + return (q->required_elevator_features & e->elevator_features) == 90 + q->required_elevator_features; 90 91 } 91 92 92 93 /** 93 - * elevator_match - Test an elevator name and features 94 + * elevator_match - Check whether @e's name or alias matches @name 94 95 * @e: Scheduler to test 95 96 * @name: Elevator name to test 96 - * @required_features: Features that the elevator must provide 97 97 * 98 - * Return true if the elevator @e name matches @name and if @e provides all 99 - * the features specified by @required_features. 98 + * Return true if the elevator @e's name or alias matches @name. 
100 99 */ 101 - static bool elevator_match(const struct elevator_type *e, const char *name, 102 - unsigned int required_features) 100 + static bool elevator_match(const struct elevator_type *e, const char *name) 103 101 { 104 - if (!elv_support_features(e->elevator_features, required_features)) 105 - return false; 106 - if (!strcmp(e->elevator_name, name)) 107 - return true; 108 - if (e->elevator_alias && !strcmp(e->elevator_alias, name)) 109 - return true; 110 - 111 - return false; 102 + return !strcmp(e->elevator_name, name) || 103 + (e->elevator_alias && !strcmp(e->elevator_alias, name)); 112 104 } 113 105 114 - /** 115 - * elevator_find - Find an elevator 116 - * @name: Name of the elevator to find 117 - * @required_features: Features that the elevator must provide 118 - * 119 - * Return the first registered scheduler with name @name and supporting the 120 - * features @required_features and NULL otherwise. 121 - */ 122 - static struct elevator_type *elevator_find(const char *name, 123 - unsigned int required_features) 106 + static struct elevator_type *__elevator_find(const char *name) 124 107 { 125 108 struct elevator_type *e; 126 109 127 - list_for_each_entry(e, &elv_list, list) { 128 - if (elevator_match(e, name, required_features)) 110 + list_for_each_entry(e, &elv_list, list) 111 + if (elevator_match(e, name)) 129 112 return e; 130 - } 131 - 132 113 return NULL; 133 114 } 134 115 135 - static void elevator_put(struct elevator_type *e) 136 - { 137 - module_put(e->elevator_owner); 138 - } 139 - 140 - static struct elevator_type *elevator_get(struct request_queue *q, 141 - const char *name, bool try_loading) 116 + static struct elevator_type *elevator_find_get(struct request_queue *q, 117 + const char *name) 142 118 { 143 119 struct elevator_type *e; 144 120 145 121 spin_lock(&elv_list_lock); 146 - 147 - e = elevator_find(name, q->required_elevator_features); 148 - if (!e && try_loading) { 149 - spin_unlock(&elv_list_lock); 150 - request_module("%s-iosched", 
name); 151 - spin_lock(&elv_list_lock); 152 - e = elevator_find(name, q->required_elevator_features); 153 - } 154 - 155 - if (e && !try_module_get(e->elevator_owner)) 122 + e = __elevator_find(name); 123 + if (e && (!elv_support_features(q, e) || !elevator_tryget(e))) 156 124 e = NULL; 157 - 158 125 spin_unlock(&elv_list_lock); 159 126 return e; 160 127 } ··· 137 170 if (unlikely(!eq)) 138 171 return NULL; 139 172 173 + __elevator_get(e); 140 174 eq->type = e; 141 175 kobject_init(&eq->kobj, &elv_ktype); 142 176 mutex_init(&eq->sysfs_lock); ··· 467 499 468 500 lockdep_assert_held(&q->sysfs_lock); 469 501 470 - error = kobject_add(&e->kobj, &q->kobj, "%s", "iosched"); 502 + error = kobject_add(&e->kobj, &q->disk->queue_kobj, "iosched"); 471 503 if (!error) { 472 504 struct elv_fs_entry *attr = e->type->elevator_attrs; 473 505 if (attr) { ··· 480 512 if (uevent) 481 513 kobject_uevent(&e->kobj, KOBJ_ADD); 482 514 483 - e->registered = 1; 515 + set_bit(ELEVATOR_FLAG_REGISTERED, &e->flags); 484 516 } 485 517 return error; 486 518 } ··· 491 523 492 524 lockdep_assert_held(&q->sysfs_lock); 493 525 494 - if (e && e->registered) { 495 - struct elevator_queue *e = q->elevator; 496 - 526 + if (e && test_and_clear_bit(ELEVATOR_FLAG_REGISTERED, &e->flags)) { 497 527 kobject_uevent(&e->kobj, KOBJ_REMOVE); 498 528 kobject_del(&e->kobj); 499 - 500 - e->registered = 0; 501 529 } 502 530 } 503 531 ··· 519 555 520 556 /* register, don't allow duplicate names */ 521 557 spin_lock(&elv_list_lock); 522 - if (elevator_find(e->elevator_name, 0)) { 558 + if (__elevator_find(e->elevator_name)) { 523 559 spin_unlock(&elv_list_lock); 524 560 kmem_cache_destroy(e->icq_cache); 525 561 return -EBUSY; ··· 552 588 } 553 589 EXPORT_SYMBOL_GPL(elv_unregister); 554 590 555 - static int elevator_switch_mq(struct request_queue *q, 556 - struct elevator_type *new_e) 557 - { 558 - int ret; 559 - 560 - lockdep_assert_held(&q->sysfs_lock); 561 - 562 - if (q->elevator) { 563 - elv_unregister_queue(q); 564 
- elevator_exit(q); 565 - } 566 - 567 - ret = blk_mq_init_sched(q, new_e); 568 - if (ret) 569 - goto out; 570 - 571 - if (new_e) { 572 - ret = elv_register_queue(q, true); 573 - if (ret) { 574 - elevator_exit(q); 575 - goto out; 576 - } 577 - } 578 - 579 - if (new_e) 580 - blk_add_trace_msg(q, "elv switch: %s", new_e->elevator_name); 581 - else 582 - blk_add_trace_msg(q, "elv switch: none"); 583 - 584 - out: 585 - return ret; 586 - } 587 - 588 591 static inline bool elv_support_iosched(struct request_queue *q) 589 592 { 590 593 if (!queue_is_mq(q) || ··· 573 642 !blk_mq_is_shared_tags(q->tag_set->flags)) 574 643 return NULL; 575 644 576 - return elevator_get(q, "mq-deadline", false); 645 + return elevator_find_get(q, "mq-deadline"); 577 646 } 578 647 579 648 /* ··· 587 656 spin_lock(&elv_list_lock); 588 657 589 658 list_for_each_entry(e, &elv_list, list) { 590 - if (elv_support_features(e->elevator_features, 591 - q->required_elevator_features)) { 659 + if (elv_support_features(q, e)) { 592 660 found = e; 593 661 break; 594 662 } 595 663 } 596 664 597 - if (found && !try_module_get(found->elevator_owner)) 665 + if (found && !elevator_tryget(found)) 598 666 found = NULL; 599 667 600 668 spin_unlock(&elv_list_lock); ··· 643 713 if (err) { 644 714 pr_warn("\"%s\" elevator initialization failed, " 645 715 "falling back to \"none\"\n", e->elevator_name); 646 - elevator_put(e); 647 716 } 717 + 718 + elevator_put(e); 648 719 } 649 720 650 721 /* 651 - * switch to new_e io scheduler. be careful not to introduce deadlocks - 652 - * we don't free the old io scheduler, before we have allocated what we 653 - * need for the new one. this way we have a chance of going back to the old 654 - * one, if the new one fails init for some reason. 722 + * Switch to new_e io scheduler. 723 + * 724 + * If switching fails, we are most likely running out of memory and not able 725 + * to restore the old io scheduler, so leaving the io scheduler being none. 
655 726 */ 656 727 int elevator_switch(struct request_queue *q, struct elevator_type *new_e) 657 728 { 658 - int err; 729 + int ret; 659 730 660 731 lockdep_assert_held(&q->sysfs_lock); 661 732 662 733 blk_mq_freeze_queue(q); 663 734 blk_mq_quiesce_queue(q); 664 735 665 - err = elevator_switch_mq(q, new_e); 736 + if (q->elevator) { 737 + elv_unregister_queue(q); 738 + elevator_exit(q); 739 + } 666 740 741 + ret = blk_mq_init_sched(q, new_e); 742 + if (ret) 743 + goto out_unfreeze; 744 + 745 + ret = elv_register_queue(q, true); 746 + if (ret) { 747 + elevator_exit(q); 748 + goto out_unfreeze; 749 + } 750 + blk_add_trace_msg(q, "elv switch: %s", new_e->elevator_name); 751 + 752 + out_unfreeze: 667 753 blk_mq_unquiesce_queue(q); 668 754 blk_mq_unfreeze_queue(q); 669 755 670 - return err; 756 + if (ret) { 757 + pr_warn("elv: switch to \"%s\" failed, falling back to \"none\"\n", 758 + new_e->elevator_name); 759 + } 760 + 761 + return ret; 762 + } 763 + 764 + void elevator_disable(struct request_queue *q) 765 + { 766 + lockdep_assert_held(&q->sysfs_lock); 767 + 768 + blk_mq_freeze_queue(q); 769 + blk_mq_quiesce_queue(q); 770 + 771 + elv_unregister_queue(q); 772 + elevator_exit(q); 773 + blk_queue_flag_clear(QUEUE_FLAG_SQ_SCHED, q); 774 + q->elevator = NULL; 775 + q->nr_requests = q->tag_set->queue_depth; 776 + blk_add_trace_msg(q, "elv switch: none"); 777 + 778 + blk_mq_unquiesce_queue(q); 779 + blk_mq_unfreeze_queue(q); 671 780 } 672 781 673 782 /* 674 783 * Switch this queue to the given IO scheduler. 
675 784 */ 676 - static int __elevator_change(struct request_queue *q, const char *name) 785 + static int elevator_change(struct request_queue *q, const char *elevator_name) 677 786 { 678 - char elevator_name[ELV_NAME_MAX]; 679 787 struct elevator_type *e; 788 + int ret; 680 789 681 790 /* Make sure queue is not in the middle of being removed */ 682 791 if (!blk_queue_registered(q)) 683 792 return -ENOENT; 684 793 685 - /* 686 - * Special case for mq, turn off scheduling 687 - */ 688 - if (!strncmp(name, "none", 4)) { 689 - if (!q->elevator) 690 - return 0; 691 - return elevator_switch(q, NULL); 692 - } 693 - 694 - strlcpy(elevator_name, name, sizeof(elevator_name)); 695 - e = elevator_get(q, strstrip(elevator_name), true); 696 - if (!e) 697 - return -EINVAL; 698 - 699 - if (q->elevator && 700 - elevator_match(q->elevator->type, elevator_name, 0)) { 701 - elevator_put(e); 794 + if (!strncmp(elevator_name, "none", 4)) { 795 + if (q->elevator) 796 + elevator_disable(q); 702 797 return 0; 703 798 } 704 799 705 - return elevator_switch(q, e); 800 + if (q->elevator && elevator_match(q->elevator->type, elevator_name)) 801 + return 0; 802 + 803 + e = elevator_find_get(q, elevator_name); 804 + if (!e) { 805 + request_module("%s-iosched", elevator_name); 806 + e = elevator_find_get(q, elevator_name); 807 + if (!e) 808 + return -EINVAL; 809 + } 810 + ret = elevator_switch(q, e); 811 + elevator_put(e); 812 + return ret; 706 813 } 707 814 708 - ssize_t elv_iosched_store(struct request_queue *q, const char *name, 815 + ssize_t elv_iosched_store(struct request_queue *q, const char *buf, 709 816 size_t count) 710 817 { 818 + char elevator_name[ELV_NAME_MAX]; 711 819 int ret; 712 820 713 821 if (!elv_support_iosched(q)) 714 822 return count; 715 823 716 - ret = __elevator_change(q, name); 824 + strlcpy(elevator_name, buf, sizeof(elevator_name)); 825 + ret = elevator_change(q, strstrip(elevator_name)); 717 826 if (!ret) 718 827 return count; 719 - 720 828 return ret; 721 829 } 722 
830 723 831 ssize_t elv_iosched_show(struct request_queue *q, char *name) 724 832 { 725 - struct elevator_queue *e = q->elevator; 726 - struct elevator_type *elv = NULL; 727 - struct elevator_type *__e; 833 + struct elevator_queue *eq = q->elevator; 834 + struct elevator_type *cur = NULL, *e; 728 835 int len = 0; 729 836 730 - if (!queue_is_mq(q)) 837 + if (!elv_support_iosched(q)) 731 838 return sprintf(name, "none\n"); 732 839 733 - if (!q->elevator) 840 + if (!q->elevator) { 734 841 len += sprintf(name+len, "[none] "); 735 - else 736 - elv = e->type; 842 + } else { 843 + len += sprintf(name+len, "none "); 844 + cur = eq->type; 845 + } 737 846 738 847 spin_lock(&elv_list_lock); 739 - list_for_each_entry(__e, &elv_list, list) { 740 - if (elv && elevator_match(elv, __e->elevator_name, 0)) { 741 - len += sprintf(name+len, "[%s] ", elv->elevator_name); 742 - continue; 743 - } 744 - if (elv_support_iosched(q) && 745 - elevator_match(__e, __e->elevator_name, 746 - q->required_elevator_features)) 747 - len += sprintf(name+len, "%s ", __e->elevator_name); 848 + list_for_each_entry(e, &elv_list, list) { 849 + if (e == cur) 850 + len += sprintf(name+len, "[%s] ", e->elevator_name); 851 + else if (elv_support_features(q, e)) 852 + len += sprintf(name+len, "%s ", e->elevator_name); 748 853 } 749 854 spin_unlock(&elv_list_lock); 750 855 751 - if (q->elevator) 752 - len += sprintf(name+len, "none"); 753 - 754 - len += sprintf(len+name, "\n"); 856 + len += sprintf(name+len, "\n"); 755 857 return len; 756 858 } 757 859
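The elevator.c rework above settles on a find-get / use / put discipline: `elevator_find_get()` takes a module reference for the lookup, `elevator_alloc()` takes its own via `__elevator_get()`, so `elevator_change()` can drop the lookup reference unconditionally after the switch. A toy refcount model with a single built-in scheduler:

```c
#include <assert.h>
#include <string.h>

struct toy_sched {
	const char *name;
	int refs;
};

static struct toy_sched deadline = { "mq-deadline", 0 };

static struct toy_sched *find_get(const char *name)
{
	if (strcmp(deadline.name, name) != 0)
		return NULL;		/* unknown scheduler */
	deadline.refs++;		/* try_module_get() stand-in */
	return &deadline;
}

static int do_switch(struct toy_sched *e)
{
	e->refs++;			/* elevator_alloc() takes its own ref */
	return 0;
}

static int change(const char *name)
{
	struct toy_sched *e = find_get(name);
	int ret;

	if (!e)
		return -1;		/* -EINVAL after a failed lookup */
	ret = do_switch(e);
	e->refs--;			/* elevator_put(): drop the lookup ref */
	return ret;
}
```

After a successful switch exactly one reference remains, owned by the active elevator queue; on any failure the count returns to zero.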
+19 -1
block/elevator.h
··· 84 84 struct list_head list; 85 85 }; 86 86 87 + static inline bool elevator_tryget(struct elevator_type *e) 88 + { 89 + return try_module_get(e->elevator_owner); 90 + } 91 + 92 + static inline void __elevator_get(struct elevator_type *e) 93 + { 94 + __module_get(e->elevator_owner); 95 + } 96 + 97 + static inline void elevator_put(struct elevator_type *e) 98 + { 99 + module_put(e->elevator_owner); 100 + } 101 + 87 102 #define ELV_HASH_BITS 6 88 103 89 104 void elv_rqhash_del(struct request_queue *q, struct request *rq); ··· 115 100 void *elevator_data; 116 101 struct kobject kobj; 117 102 struct mutex sysfs_lock; 118 - unsigned int registered:1; 103 + unsigned long flags; 119 104 DECLARE_HASHTABLE(hash, ELV_HASH_BITS); 120 105 }; 106 + 107 + #define ELEVATOR_FLAG_REGISTERED 0 108 + #define ELEVATOR_FLAG_DISABLE_WBT 1 121 109 122 110 /* 123 111 * block elevator interface
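The elevator.h hunk above replaces the `unsigned int registered:1` bitfield with an `unsigned long flags` word plus bit numbers, so the kernel's atomic bit helpers like `test_and_clear_bit()` can be used (as elv_unregister_queue() now does). Non-atomic userspace stand-ins for those helpers:

```c
#include <assert.h>

#define ELEVATOR_FLAG_REGISTERED  0
#define ELEVATOR_FLAG_DISABLE_WBT 1

/* Set one bit in the flags word. */
static void set_bit_flag(int bit, unsigned long *flags)
{
	*flags |= 1UL << bit;
}

/* Clear one bit and report whether it was set; the kernel version is
 * atomic, which is the point of moving away from a C bitfield. */
static int test_and_clear_bit_flag(int bit, unsigned long *flags)
{
	int old = (*flags >> bit) & 1;

	*flags &= ~(1UL << bit);
	return old;
}
```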
-7
block/fops.c
··· 405 405 return ret; 406 406 } 407 407 408 - static int blkdev_writepages(struct address_space *mapping, 409 - struct writeback_control *wbc) 410 - { 411 - return generic_writepages(mapping, wbc); 412 - } 413 - 414 408 const struct address_space_operations def_blk_aops = { 415 409 .dirty_folio = block_dirty_folio, 416 410 .invalidate_folio = block_invalidate_folio, ··· 413 419 .writepage = blkdev_writepage, 414 420 .write_begin = blkdev_write_begin, 415 421 .write_end = blkdev_write_end, 416 - .writepages = blkdev_writepages, 417 422 .direct_IO = blkdev_direct_IO, 418 423 .migrate_folio = buffer_migrate_folio_norefs, 419 424 .is_dirty_writeback = buffer_check_dirty_writeback,
+16 -19
block/genhd.c
··· 356 356 } 357 357 EXPORT_SYMBOL_GPL(disk_uevent); 358 358 359 - int disk_scan_partitions(struct gendisk *disk, fmode_t mode) 359 + int disk_scan_partitions(struct gendisk *disk, fmode_t mode, void *owner) 360 360 { 361 361 struct block_device *bdev; 362 362 ··· 365 365 if (test_bit(GD_SUPPRESS_PART_SCAN, &disk->state)) 366 366 return -EINVAL; 367 367 if (disk->open_partitions) 368 + return -EBUSY; 369 + /* Someone else has bdev exclusively open? */ 370 + if (disk->part0->bd_holder && disk->part0->bd_holder != owner) 368 371 return -EBUSY; 369 372 370 373 set_bit(GD_NEED_PART_SCAN, &disk->state); ··· 482 479 goto out_put_holder_dir; 483 480 } 484 481 485 - ret = bd_register_pending_holders(disk); 486 - if (ret < 0) 487 - goto out_put_slave_dir; 488 - 489 482 ret = blk_register_queue(disk); 490 483 if (ret) 491 484 goto out_put_slave_dir; ··· 499 500 500 501 bdev_add(disk->part0, ddev->devt); 501 502 if (get_capacity(disk)) 502 - disk_scan_partitions(disk, FMODE_READ); 503 + disk_scan_partitions(disk, FMODE_READ, NULL); 503 504 504 505 /* 505 506 * Announce the disk and partitions after all partitions are ··· 529 530 rq_qos_exit(disk->queue); 530 531 out_put_slave_dir: 531 532 kobject_put(disk->slave_dir); 533 + disk->slave_dir = NULL; 532 534 out_put_holder_dir: 533 535 kobject_put(disk->part0->bd_holder_dir); 534 536 out_del_integrity: ··· 560 560 { 561 561 set_bit(GD_DEAD, &disk->state); 562 562 blk_queue_start_drain(disk->queue); 563 + 564 + /* 565 + * Stop buffered writers from dirtying pages that can't be written out. 
566 + */ 567 + set_capacity_and_notify(disk, 0); 563 568 } 564 569 EXPORT_SYMBOL_GPL(blk_mark_disk_dead); 565 570 ··· 634 629 635 630 kobject_put(disk->part0->bd_holder_dir); 636 631 kobject_put(disk->slave_dir); 632 + disk->slave_dir = NULL; 637 633 638 634 part_stat_set_all(disk->part0, 0); 639 635 disk->part0->bd_stamp = 0; ··· 649 643 650 644 blk_sync_queue(q); 651 645 blk_flush_integrity(); 652 - blk_mq_cancel_work_sync(q); 646 + 647 + if (queue_is_mq(q)) 648 + blk_mq_cancel_work_sync(q); 653 649 654 650 blk_mq_quiesce_queue(q); 655 651 if (q->elevator) { ··· 1201 1193 .dev_uevent = block_uevent, 1202 1194 }; 1203 1195 1204 - static char *block_devnode(struct device *dev, umode_t *mode, 1205 - kuid_t *uid, kgid_t *gid) 1206 - { 1207 - struct gendisk *disk = dev_to_disk(dev); 1208 - 1209 - if (disk->fops->devnode) 1210 - return disk->fops->devnode(disk, mode); 1211 - return NULL; 1212 - } 1213 - 1214 1196 const struct device_type disk_type = { 1215 1197 .name = "disk", 1216 1198 .groups = disk_attr_groups, 1217 1199 .release = disk_release, 1218 - .devnode = block_devnode, 1219 1200 }; 1220 1201 1221 1202 #ifdef CONFIG_PROC_FS ··· 1409 1412 struct request_queue *q; 1410 1413 struct gendisk *disk; 1411 1414 1412 - q = blk_alloc_queue(node, false); 1415 + q = blk_alloc_queue(node); 1413 1416 if (!q) 1414 1417 return NULL; 1415 1418
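The genhd.c hunk above makes disk_scan_partitions() refuse a rescan when someone *else* holds the whole device exclusively, while still allowing the exclusive holder itself to trigger one. The gatekeeping can be modeled as a small userspace sketch; the struct and names below are illustrative stand-ins, not the kernel's types:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative model of the checks at the top of disk_scan_partitions()
 * after this change; fields are stand-ins for the kernel's gendisk state. */
struct fake_disk {
	bool suppress_part_scan;   /* GD_SUPPRESS_PART_SCAN */
	int open_partitions;       /* partitions currently open */
	const void *bd_holder;     /* exclusive holder of the whole device, or NULL */
};

static int scan_partitions_check(const struct fake_disk *d, const void *owner)
{
	if (d->suppress_part_scan)
		return -EINVAL;
	if (d->open_partitions)
		return -EBUSY;
	/* New in this series: reject only when a *different* caller has the
	 * bdev exclusively open; the holder itself may rescan. */
	if (d->bd_holder && d->bd_holder != owner)
		return -EBUSY;
	return 0;
}
```

This is why blkdev_common_ioctl() in the ioctl.c hunk now passes the `file` down: it serves as the owner cookie compared against `bd_holder`.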
+43 -60
block/holder.c
··· 4 4 5 5 struct bd_holder_disk { 6 6 struct list_head list; 7 - struct block_device *bdev; 7 + struct kobject *holder_dir; 8 8 int refcnt; 9 9 }; 10 10 ··· 14 14 struct bd_holder_disk *holder; 15 15 16 16 list_for_each_entry(holder, &disk->slave_bdevs, list) 17 - if (holder->bdev == bdev) 17 + if (holder->holder_dir == bdev->bd_holder_dir) 18 18 return holder; 19 19 return NULL; 20 20 } ··· 27 27 static void del_symlink(struct kobject *from, struct kobject *to) 28 28 { 29 29 sysfs_remove_link(from, kobject_name(to)); 30 - } 31 - 32 - static int __link_disk_holder(struct block_device *bdev, struct gendisk *disk) 33 - { 34 - int ret; 35 - 36 - ret = add_symlink(disk->slave_dir, bdev_kobj(bdev)); 37 - if (ret) 38 - return ret; 39 - ret = add_symlink(bdev->bd_holder_dir, &disk_to_dev(disk)->kobj); 40 - if (ret) 41 - del_symlink(disk->slave_dir, bdev_kobj(bdev)); 42 - return ret; 43 30 } 44 31 45 32 /** ··· 62 75 struct bd_holder_disk *holder; 63 76 int ret = 0; 64 77 65 - mutex_lock(&disk->open_mutex); 78 + if (WARN_ON_ONCE(!disk->slave_dir)) 79 + return -EINVAL; 66 80 81 + if (bdev->bd_disk == disk) 82 + return -EINVAL; 83 + 84 + /* 85 + * del_gendisk drops the initial reference to bd_holder_dir, so we 86 + * need to keep our own here to allow for cleanup past that point. 
87 + */ 88 + mutex_lock(&bdev->bd_disk->open_mutex); 89 + if (!disk_live(bdev->bd_disk)) { 90 + mutex_unlock(&bdev->bd_disk->open_mutex); 91 + return -ENODEV; 92 + } 93 + kobject_get(bdev->bd_holder_dir); 94 + mutex_unlock(&bdev->bd_disk->open_mutex); 95 + 96 + mutex_lock(&disk->open_mutex); 67 97 WARN_ON_ONCE(!bdev->bd_holder); 68 98 69 99 holder = bd_find_holder_disk(bdev, disk); 70 100 if (holder) { 101 + kobject_put(bdev->bd_holder_dir); 71 102 holder->refcnt++; 72 103 goto out_unlock; 73 104 } ··· 97 92 } 98 93 99 94 INIT_LIST_HEAD(&holder->list); 100 - holder->bdev = bdev; 101 95 holder->refcnt = 1; 102 - if (disk->slave_dir) { 103 - ret = __link_disk_holder(bdev, disk); 104 - if (ret) { 105 - kfree(holder); 106 - goto out_unlock; 107 - } 108 - } 96 + holder->holder_dir = bdev->bd_holder_dir; 109 97 98 + ret = add_symlink(disk->slave_dir, bdev_kobj(bdev)); 99 + if (ret) 100 + goto out_free_holder; 101 + ret = add_symlink(bdev->bd_holder_dir, &disk_to_dev(disk)->kobj); 102 + if (ret) 103 + goto out_del_symlink; 110 104 list_add(&holder->list, &disk->slave_bdevs); 111 - /* 112 - * del_gendisk drops the initial reference to bd_holder_dir, so we need 113 - * to keep our own here to allow for cleanup past that point. 
114 - */ 115 - kobject_get(bdev->bd_holder_dir); 116 105 106 + mutex_unlock(&disk->open_mutex); 107 + return 0; 108 + 109 + out_del_symlink: 110 + del_symlink(disk->slave_dir, bdev_kobj(bdev)); 111 + out_free_holder: 112 + kfree(holder); 117 113 out_unlock: 118 114 mutex_unlock(&disk->open_mutex); 115 + if (ret) 116 + kobject_put(bdev->bd_holder_dir); 119 117 return ret; 120 118 } 121 119 EXPORT_SYMBOL_GPL(bd_link_disk_holder); 122 - 123 - static void __unlink_disk_holder(struct block_device *bdev, 124 - struct gendisk *disk) 125 - { 126 - del_symlink(disk->slave_dir, bdev_kobj(bdev)); 127 - del_symlink(bdev->bd_holder_dir, &disk_to_dev(disk)->kobj); 128 - } 129 120 130 121 /** 131 122 * bd_unlink_disk_holder - destroy symlinks created by bd_link_disk_holder() ··· 137 136 { 138 137 struct bd_holder_disk *holder; 139 138 139 + if (WARN_ON_ONCE(!disk->slave_dir)) 140 + return; 141 + 140 142 mutex_lock(&disk->open_mutex); 141 143 holder = bd_find_holder_disk(bdev, disk); 142 144 if (!WARN_ON_ONCE(holder == NULL) && !--holder->refcnt) { 143 - if (disk->slave_dir) 144 - __unlink_disk_holder(bdev, disk); 145 - kobject_put(bdev->bd_holder_dir); 145 + del_symlink(disk->slave_dir, bdev_kobj(bdev)); 146 + del_symlink(holder->holder_dir, &disk_to_dev(disk)->kobj); 147 + kobject_put(holder->holder_dir); 146 148 list_del_init(&holder->list); 147 149 kfree(holder); 148 150 } 149 151 mutex_unlock(&disk->open_mutex); 150 152 } 151 153 EXPORT_SYMBOL_GPL(bd_unlink_disk_holder); 152 - 153 - int bd_register_pending_holders(struct gendisk *disk) 154 - { 155 - struct bd_holder_disk *holder; 156 - int ret; 157 - 158 - mutex_lock(&disk->open_mutex); 159 - list_for_each_entry(holder, &disk->slave_bdevs, list) { 160 - ret = __link_disk_holder(holder->bdev, disk); 161 - if (ret) 162 - goto out_undo; 163 - } 164 - mutex_unlock(&disk->open_mutex); 165 - return 0; 166 - 167 - out_undo: 168 - list_for_each_entry_continue_reverse(holder, &disk->slave_bdevs, list) 169 - __unlink_disk_holder(holder->bdev, disk); 170 - mutex_unlock(&disk->open_mutex); 171 - return ret; 172 - }
+7 -5
block/ioctl.c
··· 467 467 * user space. Note the separate arg/argp parameters that are needed 468 468 * to deal with the compat_ptr() conversion. 469 469 */ 470 - static int blkdev_common_ioctl(struct block_device *bdev, fmode_t mode, 471 - unsigned cmd, unsigned long arg, void __user *argp) 470 + static int blkdev_common_ioctl(struct file *file, fmode_t mode, unsigned cmd, 471 + unsigned long arg, void __user *argp) 472 472 { 473 + struct block_device *bdev = I_BDEV(file->f_mapping->host); 473 474 unsigned int max_sectors; 474 475 475 476 switch (cmd) { ··· 528 527 return -EACCES; 529 528 if (bdev_is_partition(bdev)) 530 529 return -EINVAL; 531 - return disk_scan_partitions(bdev->bd_disk, mode & ~FMODE_EXCL); 530 + return disk_scan_partitions(bdev->bd_disk, mode & ~FMODE_EXCL, 531 + file); 532 532 case BLKTRACESTART: 533 533 case BLKTRACESTOP: 534 534 case BLKTRACETEARDOWN: ··· 607 605 break; 608 606 } 609 607 610 - ret = blkdev_common_ioctl(bdev, mode, cmd, arg, argp); 608 + ret = blkdev_common_ioctl(file, mode, cmd, arg, argp); 611 609 if (ret != -ENOIOCTLCMD) 612 610 return ret; 613 611 ··· 676 674 break; 677 675 } 678 676 679 - ret = blkdev_common_ioctl(bdev, mode, cmd, arg, argp); 677 + ret = blkdev_common_ioctl(file, mode, cmd, arg, argp); 680 678 if (ret == -ENOIOCTLCMD && disk->fops->compat_ioctl) 681 679 ret = disk->fops->compat_ioctl(bdev, mode, cmd, arg); 682 680
+77 -6
block/mq-deadline.c
··· 131 131 } 132 132 133 133 /* 134 + * get the request before `rq' in sector-sorted order 135 + */ 136 + static inline struct request * 137 + deadline_earlier_request(struct request *rq) 138 + { 139 + struct rb_node *node = rb_prev(&rq->rb_node); 140 + 141 + if (node) 142 + return rb_entry_rq(node); 143 + 144 + return NULL; 145 + } 146 + 147 + /* 134 148 * get the request after `rq' in sector-sorted order 135 149 */ 136 150 static inline struct request * ··· 292 278 } 293 279 294 280 /* 281 + * Check if rq has a sequential request preceding it. 282 + */ 283 + static bool deadline_is_seq_write(struct deadline_data *dd, struct request *rq) 284 + { 285 + struct request *prev = deadline_earlier_request(rq); 286 + 287 + if (!prev) 288 + return false; 289 + 290 + return blk_rq_pos(prev) + blk_rq_sectors(prev) == blk_rq_pos(rq); 291 + } 292 + 293 + /* 294 + * Skip all write requests that are sequential from @rq, even if we cross 295 + * a zone boundary. 296 + */ 297 + static struct request *deadline_skip_seq_writes(struct deadline_data *dd, 298 + struct request *rq) 299 + { 300 + sector_t pos = blk_rq_pos(rq); 301 + sector_t skipped_sectors = 0; 302 + 303 + while (rq) { 304 + if (blk_rq_pos(rq) != pos + skipped_sectors) 305 + break; 306 + skipped_sectors += blk_rq_sectors(rq); 307 + rq = deadline_latter_request(rq); 308 + } 309 + 310 + return rq; 311 + } 312 + 313 + /* 295 314 * For the specified data direction, return the next request to 296 315 * dispatch using arrival ordered lists. 297 316 */ ··· 344 297 345 298 /* 346 299 * Look for a write request that can be dispatched, that is one with 347 - * an unlocked target zone. 300 + * an unlocked target zone. For some HDDs, breaking a sequential 301 + * write stream can lead to lower throughput, so make sure to preserve 302 + * sequential write streams, even if that stream crosses into the next 303 + * zones and these zones are unlocked. 
348 304 */ 349 305 spin_lock_irqsave(&dd->zone_lock, flags); 350 306 list_for_each_entry(rq, &per_prio->fifo_list[DD_WRITE], queuelist) { 351 - if (blk_req_can_dispatch_to_zone(rq)) 307 + if (blk_req_can_dispatch_to_zone(rq) && 308 + (blk_queue_nonrot(rq->q) || 309 + !deadline_is_seq_write(dd, rq))) 352 310 goto out; 353 311 } 354 312 rq = NULL; ··· 383 331 384 332 /* 385 333 * Look for a write request that can be dispatched, that is one with 386 - * an unlocked target zone. 334 + * an unlocked target zone. For some HDDs, breaking a sequential 335 + * write stream can lead to lower throughput, so make sure to preserve 336 + * sequential write streams, even if that stream crosses into the next 337 + * zones and these zones are unlocked. 387 338 */ 388 339 spin_lock_irqsave(&dd->zone_lock, flags); 389 340 while (rq) { 390 341 if (blk_req_can_dispatch_to_zone(rq)) 391 342 break; 392 - rq = deadline_latter_request(rq); 343 + if (blk_queue_nonrot(rq->q)) 344 + rq = deadline_latter_request(rq); 345 + else 346 + rq = deadline_skip_seq_writes(dd, rq); 393 347 } 394 348 spin_unlock_irqrestore(&dd->zone_lock, flags); 395 349 ··· 847 789 rq->elv.priv[0] = NULL; 848 790 } 849 791 792 + static bool dd_has_write_work(struct blk_mq_hw_ctx *hctx) 793 + { 794 + struct deadline_data *dd = hctx->queue->elevator->elevator_data; 795 + enum dd_prio p; 796 + 797 + for (p = 0; p <= DD_PRIO_MAX; p++) 798 + if (!list_empty_careful(&dd->per_prio[p].fifo_list[DD_WRITE])) 799 + return true; 800 + 801 + return false; 802 + } 803 + 850 804 /* 851 805 * Callback from inside blk_mq_free_request(). 852 806 * ··· 898 828 899 829 spin_lock_irqsave(&dd->zone_lock, flags); 900 830 blk_req_zone_write_unlock(rq); 901 - if (!list_empty(&per_prio->fifo_list[DD_WRITE])) 902 - blk_mq_sched_mark_restart_hctx(rq->mq_hctx); 903 831 spin_unlock_irqrestore(&dd->zone_lock, flags); 832 + 833 + if (dd_has_write_work(rq->mq_hctx)) 834 + blk_mq_sched_mark_restart_hctx(rq->mq_hctx); 904 835 } 905 836 } 906 837
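The mq-deadline hunks above avoid breaking sequential write streams on rotational (zoned) disks: a write is skipped only together with the whole contiguous run it belongs to. The two helpers can be sketched against a sector-sorted array instead of the scheduler's rbtree; this is a userspace illustration, not kernel code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-in for a request: start sector and length in sectors. */
struct req {
	unsigned long long pos;
	unsigned int sectors;
};

/* Mirrors deadline_is_seq_write(): does reqs[i] start exactly where its
 * predecessor in sector-sorted order ends? */
static bool is_seq_write(const struct req *reqs, size_t i)
{
	if (i == 0)
		return false;
	return reqs[i - 1].pos + reqs[i - 1].sectors == reqs[i].pos;
}

/* Mirrors deadline_skip_seq_writes(): starting at index i, walk past the
 * whole contiguous run and return the index of the first request that
 * breaks it (or n if the run reaches the end). */
static size_t skip_seq_writes(const struct req *reqs, size_t n, size_t i)
{
	unsigned long long pos = reqs[i].pos;
	unsigned long long skipped = 0;

	while (i < n && reqs[i].pos == pos + skipped) {
		skipped += reqs[i].sectors;
		i++;
	}
	return i;
}
```

On non-rotational queues (`blk_queue_nonrot()`) the scheduler keeps the old behaviour and steps one request at a time.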
+39
block/sed-opal.c
··· 2461 2461 return execute_steps(dev, mbrdone_step, ARRAY_SIZE(mbrdone_step)); 2462 2462 } 2463 2463 2464 + static void opal_lock_check_for_saved_key(struct opal_dev *dev, 2465 + struct opal_lock_unlock *lk_unlk) 2466 + { 2467 + struct opal_suspend_data *iter; 2468 + 2469 + if (lk_unlk->l_state != OPAL_LK || 2470 + lk_unlk->session.opal_key.key_len > 0) 2471 + return; 2472 + 2473 + /* 2474 + * Usually when closing a crypto device (eg: dm-crypt with LUKS) the 2475 + * volume key is not required, as it requires root privileges anyway, 2476 + * and root can deny access to a disk in many ways regardless. 2477 + * Requiring the volume key to lock the device is a peculiarity of the 2478 + * OPAL specification. Given we might already have saved the key if 2479 + * the user requested it via the 'IOC_OPAL_SAVE' ioctl, we can use 2480 + * that key to lock the device if no key was provided here, the 2481 + * locking range matches and the appropriate flag was passed with 2482 + * 'IOC_OPAL_SAVE'. 2483 + * This allows integrating OPAL with tools and libraries that are used 2484 + * to the common behaviour and do not ask for the volume key when 2485 + * closing a device. 
2486 + */ 2487 + setup_opal_dev(dev); 2488 + list_for_each_entry(iter, &dev->unlk_lst, node) { 2489 + if ((iter->unlk.flags & OPAL_SAVE_FOR_LOCK) && 2490 + iter->lr == lk_unlk->session.opal_key.lr && 2491 + iter->unlk.session.opal_key.key_len > 0) { 2492 + lk_unlk->session.opal_key.key_len = 2493 + iter->unlk.session.opal_key.key_len; 2494 + memcpy(lk_unlk->session.opal_key.key, 2495 + iter->unlk.session.opal_key.key, 2496 + iter->unlk.session.opal_key.key_len); 2497 + break; 2498 + } 2499 + } 2500 + } 2501 + 2464 2502 static int opal_lock_unlock(struct opal_dev *dev, 2465 2503 struct opal_lock_unlock *lk_unlk) 2466 2504 { ··· 2508 2470 return -EINVAL; 2509 2471 2510 2472 mutex_lock(&dev->dev_lock); 2473 + opal_lock_check_for_saved_key(dev, lk_unlk); 2511 2474 ret = __opal_lock_unlock(dev, lk_unlk); 2512 2475 mutex_unlock(&dev->dev_lock); 2513 2476
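The sed-opal hunk above lets a lock request without a key fall back to a key previously saved via the `IOC_OPAL_SAVE` ioctl, provided the locking range matches and the save was flagged for locking. The lookup over the saved-key list can be sketched in userspace C; the types, flag value, and function name below are illustrative stand-ins for the kernel's:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SAVE_FOR_LOCK 0x1	/* stand-in for OPAL_SAVE_FOR_LOCK */

/* Illustrative stand-in for one saved unlock entry on dev->unlk_lst. */
struct saved_key {
	unsigned int flags;
	unsigned int lr;	/* locking range */
	size_t key_len;
	char key[32];
};

/* Mirrors the search in opal_lock_check_for_saved_key(): find a key saved
 * with SAVE_FOR_LOCK for the same locking range, copy it out, and return
 * its length; return 0 when nothing usable was saved. */
static size_t fill_key_from_saved(const struct saved_key *saved, size_t n,
				  unsigned int lr, char *out, size_t out_len)
{
	for (size_t i = 0; i < n; i++) {
		if ((saved[i].flags & SAVE_FOR_LOCK) && saved[i].lr == lr &&
		    saved[i].key_len > 0 && saved[i].key_len <= out_len) {
			memcpy(out, saved[i].key, saved[i].key_len);
			return saved[i].key_len;
		}
	}
	return 0;
}
```

The kernel applies this only when `l_state == OPAL_LK` and no key was passed in, so explicit keys always win over saved ones.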
-43
drivers/block/Kconfig
··· 285 285 The default value is 4096 kilobytes. Only change this if you know 286 286 what you are doing. 287 287 288 - config CDROM_PKTCDVD 289 - tristate "Packet writing on CD/DVD media (DEPRECATED)" 290 - depends on !UML 291 - depends on SCSI 292 - select CDROM 293 - help 294 - Note: This driver is deprecated and will be removed from the 295 - kernel in the near future! 296 - 297 - If you have a CDROM/DVD drive that supports packet writing, say 298 - Y to include support. It should work with any MMC/Mt Fuji 299 - compliant ATAPI or SCSI drive, which is just about any newer 300 - DVD/CD writer. 301 - 302 - Currently only writing to CD-RW, DVD-RW, DVD+RW and DVDRAM discs 303 - is possible. 304 - DVD-RW disks must be in restricted overwrite mode. 305 - 306 - See the file <file:Documentation/cdrom/packet-writing.rst> 307 - for further information on the use of this driver. 308 - 309 - To compile this driver as a module, choose M here: the 310 - module will be called pktcdvd. 311 - 312 - config CDROM_PKTCDVD_BUFFERS 313 - int "Free buffers for data gathering" 314 - depends on CDROM_PKTCDVD 315 - default "8" 316 - help 317 - This controls the maximum number of active concurrent packets. More 318 - concurrent packets can increase write performance, but also require 319 - more memory. Each concurrent packet will require approximately 64Kb 320 - of non-swappable kernel memory, memory which will be allocated when 321 - a disc is opened for writing. 322 - 323 - config CDROM_PKTCDVD_WCACHE 324 - bool "Enable write caching" 325 - depends on CDROM_PKTCDVD 326 - help 327 - If enabled, write caching will be set for the CD-R/W device. For now 328 - this option is dangerous unless the CD-RW media is known good, as we 329 - don't do deferred write error handling yet. 330 - 331 288 config ATA_OVER_ETH 332 289 tristate "ATA over Ethernet support" 333 290 depends on NET
-1
drivers/block/Makefile
··· 20 20 obj-$(CONFIG_N64CART) += n64cart.o 21 21 obj-$(CONFIG_BLK_DEV_RAM) += brd.o 22 22 obj-$(CONFIG_BLK_DEV_LOOP) += loop.o 23 - obj-$(CONFIG_CDROM_PKTCDVD) += pktcdvd.o 24 23 obj-$(CONFIG_SUNVDC) += sunvdc.o 25 24 26 25 obj-$(CONFIG_BLK_DEV_NBD) += nbd.o
+1 -1
drivers/block/drbd/Kconfig
··· 1 - # SPDX-License-Identifier: GPL-2.0 1 + # SPDX-License-Identifier: GPL-2.0-only 2 2 # 3 3 # DRBD device driver configuration 4 4 #
+1 -1
drivers/block/drbd/Makefile
··· 1 - # SPDX-License-Identifier: GPL-2.0 1 + # SPDX-License-Identifier: GPL-2.0-only 2 2 drbd-y := drbd_bitmap.o drbd_proc.o 3 3 drbd-y += drbd_worker.o drbd_receiver.o drbd_req.o drbd_actlog.o 4 4 drbd-y += drbd_main.o drbd_strings.o drbd_nl.o
+4 -4
drivers/block/drbd/drbd_actlog.c
··· 1 - // SPDX-License-Identifier: GPL-2.0-or-later 1 + // SPDX-License-Identifier: GPL-2.0-only 2 2 /* 3 3 drbd_actlog.c 4 4 ··· 868 868 nr_sectors = get_capacity(device->vdisk); 869 869 esector = sector + (size >> 9) - 1; 870 870 871 - if (!expect(sector < nr_sectors)) 871 + if (!expect(device, sector < nr_sectors)) 872 872 goto out; 873 - if (!expect(esector < nr_sectors)) 873 + if (!expect(device, esector < nr_sectors)) 874 874 esector = nr_sectors - 1; 875 875 876 876 lbnr = BM_SECT_TO_BIT(nr_sectors-1); ··· 1143 1143 bm_ext = e ? lc_entry(e, struct bm_extent, lce) : NULL; 1144 1144 if (!bm_ext) { 1145 1145 spin_unlock_irqrestore(&device->al_lock, flags); 1146 - if (__ratelimit(&drbd_ratelimit_state)) 1146 + if (drbd_ratelimit()) 1147 1147 drbd_err(device, "drbd_rs_complete_io() called, but extent not found\n"); 1148 1148 return; 1149 1149 }
+31 -31
drivers/block/drbd/drbd_bitmap.c
··· 1 - // SPDX-License-Identifier: GPL-2.0-or-later 1 + // SPDX-License-Identifier: GPL-2.0-only 2 2 /* 3 3 drbd_bitmap.c 4 4 ··· 113 113 static void __bm_print_lock_info(struct drbd_device *device, const char *func) 114 114 { 115 115 struct drbd_bitmap *b = device->bitmap; 116 - if (!__ratelimit(&drbd_ratelimit_state)) 116 + if (!drbd_ratelimit()) 117 117 return; 118 118 drbd_err(device, "FIXME %s[%d] in %s, bitmap locked for '%s' by %s[%d]\n", 119 119 current->comm, task_pid_nr(current), ··· 448 448 449 449 sector_t drbd_bm_capacity(struct drbd_device *device) 450 450 { 451 - if (!expect(device->bitmap)) 451 + if (!expect(device, device->bitmap)) 452 452 return 0; 453 453 return device->bitmap->bm_dev_capacity; 454 454 } ··· 457 457 */ 458 458 void drbd_bm_cleanup(struct drbd_device *device) 459 459 { 460 - if (!expect(device->bitmap)) 460 + if (!expect(device, device->bitmap)) 461 461 return; 462 462 bm_free_pages(device->bitmap->bm_pages, device->bitmap->bm_number_of_pages); 463 463 bm_vk_free(device->bitmap->bm_pages); ··· 636 636 int err = 0; 637 637 bool growing; 638 638 639 - if (!expect(b)) 639 + if (!expect(device, b)) 640 640 return -ENOMEM; 641 641 642 642 drbd_bm_lock(device, "resize", BM_LOCKED_MASK); ··· 757 757 unsigned long s; 758 758 unsigned long flags; 759 759 760 - if (!expect(b)) 760 + if (!expect(device, b)) 761 761 return 0; 762 - if (!expect(b->bm_pages)) 762 + if (!expect(device, b->bm_pages)) 763 763 return 0; 764 764 765 765 spin_lock_irqsave(&b->bm_lock, flags); ··· 783 783 size_t drbd_bm_words(struct drbd_device *device) 784 784 { 785 785 struct drbd_bitmap *b = device->bitmap; 786 - if (!expect(b)) 786 + if (!expect(device, b)) 787 787 return 0; 788 - if (!expect(b->bm_pages)) 788 + if (!expect(device, b->bm_pages)) 789 789 return 0; 790 790 791 791 return b->bm_words; ··· 794 794 unsigned long drbd_bm_bits(struct drbd_device *device) 795 795 { 796 796 struct drbd_bitmap *b = device->bitmap; 797 - if (!expect(b)) 797 + if (!expect(device, b)) 798 798 return 0; 799 799 800 800 return b->bm_bits; ··· 816 816 817 817 end = offset + number; 818 818 819 - if (!expect(b)) 819 + if (!expect(device, b)) 820 820 return; 821 - if (!expect(b->bm_pages)) 821 + if (!expect(device, b->bm_pages)) 822 822 return; 823 823 if (number == 0) 824 824 return; ··· 863 863 864 864 end = offset + number; 865 865 866 - if (!expect(b)) 866 + if (!expect(device, b)) 867 867 return; 868 - if (!expect(b->bm_pages)) 868 + if (!expect(device, b->bm_pages)) 869 869 return; 870 870 871 871 spin_lock_irq(&b->bm_lock); ··· 894 894 void drbd_bm_set_all(struct drbd_device *device) 895 895 { 896 896 struct drbd_bitmap *b = device->bitmap; 897 - if (!expect(b)) 897 + if (!expect(device, b)) 898 898 return; 899 - if (!expect(b->bm_pages)) 899 + if (!expect(device, b->bm_pages)) 900 900 return; 901 901 902 902 spin_lock_irq(&b->bm_lock); ··· 910 910 void drbd_bm_clear_all(struct drbd_device *device) 911 911 { 912 912 struct drbd_bitmap *b = device->bitmap; 913 - if (!expect(b)) 913 + if (!expect(device, b)) 914 914 return; 915 - if (!expect(b->bm_pages)) 915 + if (!expect(device, b->bm_pages)) 916 916 return; 917 917 918 918 spin_lock_irq(&b->bm_lock); ··· 952 952 bm_set_page_io_err(b->bm_pages[idx]); 953 953 /* Not identical to on disk version of it. 954 954 * Is BM_PAGE_IO_ERROR enough? */ 955 - if (__ratelimit(&drbd_ratelimit_state)) 955 + if (drbd_ratelimit()) 956 956 drbd_err(device, "IO ERROR %d on bitmap page idx %u\n", 957 957 bio->bi_status, idx); 958 958 } else { ··· 1013 1013 else 1014 1014 len = PAGE_SIZE; 1015 1015 } else { 1016 - if (__ratelimit(&drbd_ratelimit_state)) { 1016 + if (drbd_ratelimit()) { 1017 1017 drbd_err(device, "Invalid offset during on-disk bitmap access: " 1018 1018 "page idx %u, sector %llu\n", page_nr, on_disk_sector); 1019 1019 } ··· 1332 1332 struct drbd_bitmap *b = device->bitmap; 1333 1333 unsigned long i = DRBD_END_OF_BITMAP; 1334 1334 1335 - if (!expect(b)) 1335 + if (!expect(device, b)) 1336 1336 return i; 1337 - if (!expect(b->bm_pages)) 1337 + if (!expect(device, b->bm_pages)) 1338 1338 return i; 1339 1339 1340 1340 spin_lock_irq(&b->bm_lock); ··· 1436 1436 struct drbd_bitmap *b = device->bitmap; 1437 1437 int c = 0; 1438 1438 1439 - if (!expect(b)) 1439 + if (!expect(device, b)) 1440 1440 return 1; 1441 - if (!expect(b->bm_pages)) 1441 + if (!expect(device, b->bm_pages)) 1442 1442 return 0; 1443 1443 1444 1444 spin_lock_irqsave(&b->bm_lock, flags); ··· 1582 1582 unsigned long *p_addr; 1583 1583 int i; 1584 1584 1585 - if (!expect(b)) 1585 + if (!expect(device, b)) 1586 1586 return 0; 1587 - if (!expect(b->bm_pages)) 1587 + if (!expect(device, b->bm_pages)) 1588 1588 return 0; 1589 1589 1590 1590 spin_lock_irqsave(&b->bm_lock, flags); ··· 1619 1619 * robust in case we screwed up elsewhere, in that case pretend there 1620 1620 * was one dirty bit in the requested area, so we won't try to do a 1621 1621 * local read there (no bitmap probably implies no disk) */ 1622 - if (!expect(b)) 1622 + if (!expect(device, b)) 1623 1623 return 1; 1624 - if (!expect(b->bm_pages)) 1624 + if (!expect(device, b->bm_pages)) 1625 1625 return 1; 1626 1626 1627 1627 spin_lock_irqsave(&b->bm_lock, flags); ··· 1635 1635 bm_unmap(p_addr); 1636 1636 p_addr = bm_map_pidx(b, idx); 1637 1637 } 1638 - if (expect(bitnr < b->bm_bits))
1638 + if (expect(device, bitnr < b->bm_bits)) 1639 1639 c += (0 != test_bit_le(bitnr - (page_nr << (PAGE_SHIFT+3)), p_addr)); 1640 1640 else 1641 1641 drbd_err(device, "bitnr=%lu bm_bits=%lu\n", bitnr, b->bm_bits); ··· 1668 1668 unsigned long flags; 1669 1669 unsigned long *p_addr, *bm; 1670 1670 1671 - if (!expect(b)) 1671 + if (!expect(device, b)) 1672 1672 return 0; 1673 - if (!expect(b->bm_pages)) 1673 + if (!expect(device, b->bm_pages)) 1674 1674 return 0; 1675 1675 1676 1676 spin_lock_irqsave(&b->bm_lock, flags);
+1 -1
drivers/block/drbd/drbd_debugfs.c
··· 1 - // SPDX-License-Identifier: GPL-2.0 1 + // SPDX-License-Identifier: GPL-2.0-only 2 2 #define pr_fmt(fmt) "drbd debugfs: " fmt 3 3 #include <linux/kernel.h> 4 4 #include <linux/module.h>
+1 -1
drivers/block/drbd/drbd_debugfs.h
··· 1 - /* SPDX-License-Identifier: GPL-2.0 */ 1 + /* SPDX-License-Identifier: GPL-2.0-only */ 2 2 #include <linux/kernel.h> 3 3 #include <linux/module.h> 4 4 #include <linux/debugfs.h>
+9 -69
drivers/block/drbd/drbd_int.h
··· 1 - /* SPDX-License-Identifier: GPL-2.0-or-later */ 1 + /* SPDX-License-Identifier: GPL-2.0-only */ 2 2 /* 3 3 drbd_int.h 4 4 ··· 37 37 #include "drbd_strings.h" 38 38 #include "drbd_state.h" 39 39 #include "drbd_protocol.h" 40 + #include "drbd_polymorph_printk.h" 40 41 41 42 #ifdef __CHECKER__ 42 43 # define __protected_by(x) __attribute__((require_context(x,1,999,"rdwr"))) ··· 75 74 76 75 struct drbd_device; 77 76 struct drbd_connection; 78 - 79 - #define __drbd_printk_device(level, device, fmt, args...) \ 80 - dev_printk(level, disk_to_dev((device)->vdisk), fmt, ## args) 81 - #define __drbd_printk_peer_device(level, peer_device, fmt, args...) \ 82 - dev_printk(level, disk_to_dev((peer_device)->device->vdisk), fmt, ## args) 83 - #define __drbd_printk_resource(level, resource, fmt, args...) \ 84 - printk(level "drbd %s: " fmt, (resource)->name, ## args) 85 - #define __drbd_printk_connection(level, connection, fmt, args...) \ 86 - printk(level "drbd %s: " fmt, (connection)->resource->name, ## args) 87 - 88 - void drbd_printk_with_wrong_object_type(void); 89 - 90 - #define __drbd_printk_if_same_type(obj, type, func, level, fmt, args...) \ 91 - (__builtin_types_compatible_p(typeof(obj), type) || \ 92 - __builtin_types_compatible_p(typeof(obj), const type)), \ 93 - func(level, (const type)(obj), fmt, ## args) 94 - 95 - #define drbd_printk(level, obj, fmt, args...) \ 96 - __builtin_choose_expr( \ 97 - __drbd_printk_if_same_type(obj, struct drbd_device *, \ 98 - __drbd_printk_device, level, fmt, ## args), \ 99 - __builtin_choose_expr( \ 100 - __drbd_printk_if_same_type(obj, struct drbd_resource *, \ 101 - __drbd_printk_resource, level, fmt, ## args), \ 102 - __builtin_choose_expr( \ 103 - __drbd_printk_if_same_type(obj, struct drbd_connection *, \ 104 - __drbd_printk_connection, level, fmt, ## args), \ 105 - __builtin_choose_expr( \ 106 - __drbd_printk_if_same_type(obj, struct drbd_peer_device *, \ 107 - __drbd_printk_peer_device, level, fmt, ## args), \ 108 - drbd_printk_with_wrong_object_type())))) 109 - 110 - #define drbd_dbg(obj, fmt, args...) \ 111 - drbd_printk(KERN_DEBUG, obj, fmt, ## args) 112 - #define drbd_alert(obj, fmt, args...) \ 113 - drbd_printk(KERN_ALERT, obj, fmt, ## args) 114 - #define drbd_err(obj, fmt, args...) \ 115 - drbd_printk(KERN_ERR, obj, fmt, ## args) 116 - #define drbd_warn(obj, fmt, args...) \ 117 - drbd_printk(KERN_WARNING, obj, fmt, ## args) 118 - #define drbd_info(obj, fmt, args...) \ 119 - drbd_printk(KERN_INFO, obj, fmt, ## args) 120 - #define drbd_emerg(obj, fmt, args...) \ 121 - drbd_printk(KERN_EMERG, obj, fmt, ## args) 122 - 123 - #define dynamic_drbd_dbg(device, fmt, args...) \ 124 - dynamic_dev_dbg(disk_to_dev(device->vdisk), fmt, ## args) 125 - 126 - #define D_ASSERT(device, exp) do { \ 127 - if (!(exp)) \ 128 - drbd_err(device, "ASSERT( " #exp " ) in %s:%d\n", __FILE__, __LINE__); \ 129 - } while (0) 130 - 131 - /** 132 - * expect - Make an assertion 133 - * 134 - * Unlike the assert macro, this macro returns a boolean result. 
135 - */ 136 - #define expect(exp) ({ \ 137 - bool _bool = (exp); \ 138 - if (!_bool) \ 139 - drbd_err(device, "ASSERTION %s FAILED in %s\n", \ 140 - #exp, __func__); \ 141 - _bool; \ 142 - }) 143 77 144 78 /* Defines to control fault insertion */ 145 79 enum { ··· 331 395 struct drbd_peer_device *peer_device; 332 396 struct drbd_epoch *epoch; /* for writes */ 333 397 struct page *pages; 398 + blk_opf_t opf; 334 399 atomic_t pending_bios; 335 400 struct drbd_interval i; 336 401 /* see comments on ee flag bits below */ ··· 342 405 struct digest_info *digest; 343 406 }; 344 407 }; 408 + 409 + /* Equivalent to bio_op and req_op. */ 410 + #define peer_req_op(peer_req) \ 411 + ((peer_req)->opf & REQ_OP_MASK) 345 412 346 413 /* ee flag bits. 347 414 * While corresponding bios are in flight, the only modification will be ··· 1486 1545 extern bool drbd_rs_c_min_rate_throttle(struct drbd_device *device); 1487 1546 extern bool drbd_rs_should_slow_down(struct drbd_device *device, sector_t sector, 1488 1547 bool throttle_if_app_is_waiting); 1489 - extern int drbd_submit_peer_request(struct drbd_device *, 1490 - struct drbd_peer_request *, blk_opf_t, int); 1548 + extern int drbd_submit_peer_request(struct drbd_peer_request *peer_req); 1491 1549 extern int drbd_free_peer_reqs(struct drbd_device *, struct list_head *); 1492 1550 extern struct drbd_peer_request *drbd_alloc_peer_req(struct drbd_peer_device *, u64, 1493 1551 sector_t, unsigned int, ··· 1658 1718 switch (ep) { 1659 1719 case EP_PASS_ON: /* FIXME would this be better named "Ignore"? */ 1660 1720 if (df == DRBD_READ_ERROR || df == DRBD_WRITE_ERROR) { 1661 - if (__ratelimit(&drbd_ratelimit_state)) 1721 + if (drbd_ratelimit()) 1662 1722 drbd_err(device, "Local IO failed in %s.\n", where); 1663 1723 if (device->state.disk > D_INCONSISTENT) 1664 1724 _drbd_set_state(_NS(device, disk, D_INCONSISTENT), CS_HARD, NULL);
+1 -1
drivers/block/drbd/drbd_interval.c
··· 1 - // SPDX-License-Identifier: GPL-2.0 1 + // SPDX-License-Identifier: GPL-2.0-only 2 2 #include <asm/bug.h> 3 3 #include <linux/rbtree_augmented.h> 4 4 #include "drbd_interval.h"
+1 -1
drivers/block/drbd/drbd_interval.h
··· 1 - /* SPDX-License-Identifier: GPL-2.0 */ 1 + /* SPDX-License-Identifier: GPL-2.0-only */ 2 2 #ifndef __DRBD_INTERVAL_H 3 3 #define __DRBD_INTERVAL_H 4 4
+11 -10
drivers/block/drbd/drbd_main.c
··· 1 - // SPDX-License-Identifier: GPL-2.0-or-later 1 + // SPDX-License-Identifier: GPL-2.0-only 2 2 /* 3 3 drbd.c 4 4 ··· 1259 1259 struct bm_xfer_ctx c; 1260 1260 int err; 1261 1261 1262 - if (!expect(device->bitmap)) 1262 + if (!expect(device, device->bitmap)) 1263 1263 return false; 1264 1264 1265 1265 if (get_ldev(device)) { ··· 2217 2217 kref_put(&peer_device->connection->kref, drbd_destroy_connection); 2218 2218 kfree(peer_device); 2219 2219 } 2220 - memset(device, 0xfd, sizeof(*device)); 2220 + if (device->submit.wq) 2221 + destroy_workqueue(device->submit.wq); 2221 2222 kfree(device); 2222 2223 kref_put(&resource->kref, drbd_destroy_resource); 2223 2224 } ··· 2250 2249 bool expected; 2251 2250 2252 2251 expected = 2253 - expect(atomic_read(&req->completion_ref) == 0) && 2254 - expect(req->rq_state & RQ_POSTPONED) && 2255 - expect((req->rq_state & RQ_LOCAL_PENDING) == 0 || 2252 + expect(device, atomic_read(&req->completion_ref) == 0) && 2253 + expect(device, req->rq_state & RQ_POSTPONED) && 2254 + expect(device, (req->rq_state & RQ_LOCAL_PENDING) == 0 || 2256 2255 (req->rq_state & RQ_LOCAL_ABORTED) != 0); 2257 2256 2258 2257 if (!expected) ··· 2310 2309 idr_destroy(&resource->devices); 2311 2310 free_cpumask_var(resource->cpu_mask); 2312 2311 kfree(resource->name); 2313 - memset(resource, 0xf2, sizeof(*resource)); 2314 2312 kfree(resource); 2315 2313 } ··· 2650 2650 drbd_free_socket(&connection->data); 2651 2651 kfree(connection->int_dig_in); 2652 2652 kfree(connection->int_dig_vv); 2653 - memset(connection, 0xfc, sizeof(*connection)); 2654 2653 kfree(connection); 2655 2654 kref_put(&resource->kref, drbd_destroy_resource); 2656 2655 } ··· 2773 2774 2774 2775 err = add_disk(disk); 2775 2776 if (err) 2776 - goto out_idr_remove_from_resource; 2777 + goto out_destroy_workqueue; 2777 2778 2778 2779 /* inherit the connection state */ 2779 2780 device->state.conn = first_connection(resource)->cstate; ··· 2787 2788 drbd_debugfs_device_add(device); 2788 2789 return NO_ERROR; 2789 2790 2791 + out_destroy_workqueue: 2792 + destroy_workqueue(device->submit.wq); 2790 2793 out_idr_remove_from_resource: 2791 2794 for_each_connection_safe(connection, n, resource) { 2792 2795 peer_device = idr_remove(&connection->peer_devices, vnr); ··· 3767 3766 if (ret) { 3768 3767 drbd_fault_count++; 3769 3768 3770 - if (__ratelimit(&drbd_ratelimit_state)) 3769 + if (drbd_ratelimit()) 3771 3770 drbd_warn(device, "***Simulating %s failure\n", 3772 3771 _drbd_fault_str(type)); 3773 3772 }
+20 -7
drivers/block/drbd/drbd_nl.c
···
-// SPDX-License-Identifier: GPL-2.0-or-later
+// SPDX-License-Identifier: GPL-2.0-only
 /*
    drbd_nl.c
···
     struct drbd_connection *connection =
         first_peer_device(device)->connection;
     struct request_queue *q = device->rq_queue;
+    unsigned int max_discard_sectors;
 
     if (bdev && !bdev_max_discard_sectors(bdev->backing_bdev))
         goto not_supported;
···
      * topology on all peers.
      */
     blk_queue_discard_granularity(q, 512);
-    q->limits.max_discard_sectors = drbd_max_discard_sectors(connection);
-    q->limits.max_write_zeroes_sectors =
-        drbd_max_discard_sectors(connection);
+    max_discard_sectors = drbd_max_discard_sectors(connection);
+    blk_queue_max_discard_sectors(q, max_discard_sectors);
+    blk_queue_max_write_zeroes_sectors(q, max_discard_sectors);
     return;
 
 not_supported:
     blk_queue_discard_granularity(q, 0);
-    q->limits.max_discard_sectors = 0;
-    q->limits.max_write_zeroes_sectors = 0;
+    blk_queue_max_discard_sectors(q, 0);
 }
 
 static void fixup_write_zeroes(struct drbd_device *device, struct request_queue *q)
···
         q->limits.max_write_zeroes_sectors = DRBD_MAX_BBIO_SECTORS;
     else
         q->limits.max_write_zeroes_sectors = 0;
+}
+
+static void fixup_discard_support(struct drbd_device *device, struct request_queue *q)
+{
+    unsigned int max_discard = device->rq_queue->limits.max_discard_sectors;
+    unsigned int discard_granularity =
+        device->rq_queue->limits.discard_granularity >> SECTOR_SHIFT;
+
+    if (discard_granularity > max_discard) {
+        blk_queue_discard_granularity(q, 0);
+        blk_queue_max_discard_sectors(q, 0);
+    }
 }
 
 static void drbd_setup_queue_param(struct drbd_device *device, struct drbd_backing_dev *bdev,
···
         disk_update_readahead(device->vdisk);
     }
     fixup_write_zeroes(device, q);
+    fixup_discard_support(device, q);
 }
 
 void drbd_reconsider_queue_parameters(struct drbd_device *device, struct drbd_backing_dev *bdev, struct o_qlim *o)
···
         goto fail_unlock;
     }
 
-    if (!expect(new_disk_conf->resync_rate >= 1))
+    if (!expect(device, new_disk_conf->resync_rate >= 1))
         new_disk_conf->resync_rate = 1;
 
     sanitize_disk_conf(device, new_disk_conf, device->ldev);
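The new fixup_discard_support() above enforces a simple invariant: if the discard granularity, expressed in sectors, exceeds the largest discard the queue will accept, a single discard can never cover even one granule, so discard support is switched off entirely. A minimal userspace sketch of that check (the struct and field names are stand-ins for the kernel's `struct queue_limits`):

```c
#include <assert.h>

/* Stand-in for the kernel's request_queue limits; names are illustrative. */
struct queue_limits {
    unsigned int discard_granularity; /* bytes */
    unsigned int max_discard_sectors; /* 512-byte sectors */
};

#define SECTOR_SHIFT 9

/* Mirror of the fixup: granularity larger than the maximum discard size
 * means discard can never be issued, so disable it outright. */
static void fixup_discard_support(struct queue_limits *lim)
{
    unsigned int granularity_sectors =
        lim->discard_granularity >> SECTOR_SHIFT;

    if (granularity_sectors > lim->max_discard_sectors) {
        lim->discard_granularity = 0;
        lim->max_discard_sectors = 0;
    }
}
```

A 4 KiB granularity against an 8192-sector maximum passes untouched; a 1 MiB granularity against a 16-sector maximum disables discard.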
+1 -1
drivers/block/drbd/drbd_nla.c
···
-// SPDX-License-Identifier: GPL-2.0
+// SPDX-License-Identifier: GPL-2.0-only
 #include <linux/kernel.h>
 #include <net/netlink.h>
 #include <linux/drbd_genl_api.h>
+1 -1
drivers/block/drbd/drbd_nla.h
···
-/* SPDX-License-Identifier: GPL-2.0 */
+/* SPDX-License-Identifier: GPL-2.0-only */
 #ifndef __DRBD_NLA_H
 #define __DRBD_NLA_H
+141
drivers/block/drbd/drbd_polymorph_printk.h
···
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef DRBD_POLYMORPH_PRINTK_H
+#define DRBD_POLYMORPH_PRINTK_H
+
+#if !defined(CONFIG_DYNAMIC_DEBUG)
+#undef DEFINE_DYNAMIC_DEBUG_METADATA
+#undef __dynamic_pr_debug
+#undef DYNAMIC_DEBUG_BRANCH
+#define DEFINE_DYNAMIC_DEBUG_METADATA(D, F) const char *D = F; ((void)D)
+#define __dynamic_pr_debug(D, F, args...) do { (void)(D); if (0) printk(F, ## args); } while (0)
+#define DYNAMIC_DEBUG_BRANCH(D) false
+#endif
+
+
+#define __drbd_printk_drbd_device_prep(device) \
+    const struct drbd_device *__d = (device); \
+    const struct drbd_resource *__r = __d->resource
+#define __drbd_printk_drbd_device_fmt(fmt) "drbd %s/%u drbd%u: " fmt
+#define __drbd_printk_drbd_device_args() __r->name, __d->vnr, __d->minor
+#define __drbd_printk_drbd_device_unprep()
+
+#define __drbd_printk_drbd_peer_device_prep(peer_device) \
+    const struct drbd_device *__d; \
+    const struct drbd_resource *__r; \
+    __d = (peer_device)->device; \
+    __r = __d->resource
+#define __drbd_printk_drbd_peer_device_fmt(fmt) \
+    "drbd %s/%u drbd%u: " fmt
+#define __drbd_printk_drbd_peer_device_args() \
+    __r->name, __d->vnr, __d->minor
+#define __drbd_printk_drbd_peer_device_unprep()
+
+#define __drbd_printk_drbd_resource_prep(resource) \
+    const struct drbd_resource *__r = resource
+#define __drbd_printk_drbd_resource_fmt(fmt) "drbd %s: " fmt
+#define __drbd_printk_drbd_resource_args() __r->name
+#define __drbd_printk_drbd_resource_unprep(resource)
+
+#define __drbd_printk_drbd_connection_prep(connection) \
+    const struct drbd_connection *__c = (connection); \
+    const struct drbd_resource *__r = __c->resource
+#define __drbd_printk_drbd_connection_fmt(fmt) \
+    "drbd %s: " fmt
+#define __drbd_printk_drbd_connection_args() \
+    __r->name
+#define __drbd_printk_drbd_connection_unprep()
+
+void drbd_printk_with_wrong_object_type(void);
+void drbd_dyn_dbg_with_wrong_object_type(void);
+
+#define __drbd_printk_choose_cond(obj, struct_name) \
+    (__builtin_types_compatible_p(typeof(obj), struct struct_name *) || \
+     __builtin_types_compatible_p(typeof(obj), const struct struct_name *))
+#define __drbd_printk_if_same_type(obj, struct_name, level, fmt, args...) \
+    __drbd_printk_choose_cond(obj, struct_name), \
+    ({ \
+        __drbd_printk_ ## struct_name ## _prep((const struct struct_name *)(obj)); \
+        printk(level __drbd_printk_ ## struct_name ## _fmt(fmt), \
+               __drbd_printk_ ## struct_name ## _args(), ## args); \
+        __drbd_printk_ ## struct_name ## _unprep(); \
+    })
+
+#define drbd_printk(level, obj, fmt, args...) \
+    __builtin_choose_expr( \
+      __drbd_printk_if_same_type(obj, drbd_device, level, fmt, ## args), \
+    __builtin_choose_expr( \
+      __drbd_printk_if_same_type(obj, drbd_resource, level, fmt, ## args), \
+    __builtin_choose_expr( \
+      __drbd_printk_if_same_type(obj, drbd_connection, level, fmt, ## args), \
+    __builtin_choose_expr( \
+      __drbd_printk_if_same_type(obj, drbd_peer_device, level, fmt, ## args), \
+      drbd_printk_with_wrong_object_type()))))
+
+#define __drbd_dyn_dbg_if_same_type(obj, struct_name, fmt, args...) \
+    __drbd_printk_choose_cond(obj, struct_name), \
+    ({ \
+        DEFINE_DYNAMIC_DEBUG_METADATA(descriptor, fmt); \
+        if (DYNAMIC_DEBUG_BRANCH(descriptor)) { \
+            __drbd_printk_ ## struct_name ## _prep((const struct struct_name *)(obj)); \
+            __dynamic_pr_debug(&descriptor, __drbd_printk_ ## struct_name ## _fmt(fmt), \
+                               __drbd_printk_ ## struct_name ## _args(), ## args); \
+            __drbd_printk_ ## struct_name ## _unprep(); \
+        } \
+    })
+
+#define dynamic_drbd_dbg(obj, fmt, args...) \
+    __builtin_choose_expr( \
+      __drbd_dyn_dbg_if_same_type(obj, drbd_device, fmt, ## args), \
+    __builtin_choose_expr( \
+      __drbd_dyn_dbg_if_same_type(obj, drbd_resource, fmt, ## args), \
+    __builtin_choose_expr( \
+      __drbd_dyn_dbg_if_same_type(obj, drbd_connection, fmt, ## args), \
+    __builtin_choose_expr( \
+      __drbd_dyn_dbg_if_same_type(obj, drbd_peer_device, fmt, ## args), \
+      drbd_dyn_dbg_with_wrong_object_type()))))
+
+#define drbd_emerg(device, fmt, args...) \
+    drbd_printk(KERN_EMERG, device, fmt, ## args)
+#define drbd_alert(device, fmt, args...) \
+    drbd_printk(KERN_ALERT, device, fmt, ## args)
+#define drbd_crit(device, fmt, args...) \
+    drbd_printk(KERN_CRIT, device, fmt, ## args)
+#define drbd_err(device, fmt, args...) \
+    drbd_printk(KERN_ERR, device, fmt, ## args)
+#define drbd_warn(device, fmt, args...) \
+    drbd_printk(KERN_WARNING, device, fmt, ## args)
+#define drbd_notice(device, fmt, args...) \
+    drbd_printk(KERN_NOTICE, device, fmt, ## args)
+#define drbd_info(device, fmt, args...) \
+    drbd_printk(KERN_INFO, device, fmt, ## args)
+
+
+#define drbd_ratelimit() \
+({ \
+    static DEFINE_RATELIMIT_STATE(_rs, \
+        DEFAULT_RATELIMIT_INTERVAL, \
+        DEFAULT_RATELIMIT_BURST); \
+    __ratelimit(&_rs); \
+})
+
+#define D_ASSERT(x, exp) \
+    do { \
+        if (!(exp)) \
+            drbd_err(x, "ASSERTION %s FAILED in %s\n", \
+                     #exp, __func__); \
+    } while (0)
+
+/**
+ * expect  -  Make an assertion
+ *
+ * Unlike the assert macro, this macro returns a boolean result.
+ */
+#define expect(x, exp) ({ \
+    bool _bool = (exp); \
+    if (!_bool && drbd_ratelimit()) \
+        drbd_err(x, "ASSERTION %s FAILED in %s\n", \
+                 #exp, __func__); \
+    _bool; \
+})
+
+#endif
+1 -1
drivers/block/drbd/drbd_proc.c
···
-// SPDX-License-Identifier: GPL-2.0-or-later
+// SPDX-License-Identifier: GPL-2.0-only
 /*
    drbd_proc.c
+1 -1
drivers/block/drbd/drbd_protocol.h
···
-/* SPDX-License-Identifier: GPL-2.0 */
+/* SPDX-License-Identifier: GPL-2.0-only */
 #ifndef __DRBD_PROTOCOL_H
 #define __DRBD_PROTOCOL_H
+54 -45
drivers/block/drbd/drbd_receiver.c
···
-// SPDX-License-Identifier: GPL-2.0-or-later
+// SPDX-License-Identifier: GPL-2.0-only
 /*
    drbd_receiver.c
···
     drbd_free_pages(device, peer_req->pages, is_net);
     D_ASSERT(device, atomic_read(&peer_req->pending_bios) == 0);
     D_ASSERT(device, drbd_interval_empty(&peer_req->i));
-    if (!expect(!(peer_req->flags & EE_CALL_AL_COMPLETE_IO))) {
+    if (!expect(device, !(peer_req->flags & EE_CALL_AL_COMPLETE_IO))) {
         peer_req->flags &= ~EE_CALL_AL_COMPLETE_IO;
         drbd_al_complete_io(device, &peer_req->i);
     }
···
     drbd_endio_write_sec_final(peer_req);
 }
 
+static int peer_request_fault_type(struct drbd_peer_request *peer_req)
+{
+    if (peer_req_op(peer_req) == REQ_OP_READ) {
+        return peer_req->flags & EE_APPLICATION ?
+            DRBD_FAULT_DT_RD : DRBD_FAULT_RS_RD;
+    } else {
+        return peer_req->flags & EE_APPLICATION ?
+            DRBD_FAULT_DT_WR : DRBD_FAULT_RS_WR;
+    }
+}
+
 /**
  * drbd_submit_peer_request()
- * @device:    DRBD device.
  * @peer_req:    peer request
  *
  * May spread the pages to multiple bios,
···
  * on certain Xen deployments.
  */
 /* TODO allocate from our own bio_set. */
-int drbd_submit_peer_request(struct drbd_device *device,
-                 struct drbd_peer_request *peer_req,
-                 const blk_opf_t opf, const int fault_type)
+int drbd_submit_peer_request(struct drbd_peer_request *peer_req)
 {
+    struct drbd_device *device = peer_req->peer_device->device;
     struct bio *bios = NULL;
     struct bio *bio;
     struct page *page = peer_req->pages;
···
      * generated bio, but a bio allocated on behalf of the peer.
      */
 next_bio:
-    bio = bio_alloc(device->ldev->backing_bdev, nr_pages, opf, GFP_NOIO);
+    /* _DISCARD, _WRITE_ZEROES handled above.
+     * REQ_OP_FLUSH (empty flush) not expected,
+     * should have been mapped to a "drbd protocol barrier".
+     * REQ_OP_SECURE_ERASE: I don't see how we could ever support that.
+     */
+    if (!(peer_req_op(peer_req) == REQ_OP_WRITE ||
+          peer_req_op(peer_req) == REQ_OP_READ)) {
+        drbd_err(device, "Invalid bio op received: 0x%x\n", peer_req->opf);
+        return -EINVAL;
+    }
+
+    bio = bio_alloc(device->ldev->backing_bdev, nr_pages, peer_req->opf, GFP_NOIO);
     /* > peer_req->i.sector, unless this is the first bio */
     bio->bi_iter.bi_sector = sector;
     bio->bi_private = peer_req;
···
         bios = bios->bi_next;
         bio->bi_next = NULL;
 
-        drbd_submit_bio_noacct(device, fault_type, bio);
+        drbd_submit_bio_noacct(device, peer_request_fault_type(peer_req), bio);
     } while (bios);
     return 0;
 }
···
     /* assume request_size == data_size, but special case trim. */
     ds = data_size;
     if (trim) {
-        if (!expect(data_size == 0))
+        if (!expect(peer_device, data_size == 0))
             return NULL;
         ds = be32_to_cpu(trim->size);
     } else if (zeroes) {
-        if (!expect(data_size == 0))
+        if (!expect(peer_device, data_size == 0))
             return NULL;
         ds = be32_to_cpu(zeroes->size);
     }
 
-    if (!expect(IS_ALIGNED(ds, 512)))
+    if (!expect(peer_device, IS_ALIGNED(ds, 512)))
         return NULL;
     if (trim || zeroes) {
-        if (!expect(ds <= (DRBD_MAX_BBIO_SECTORS << 9)))
+        if (!expect(peer_device, ds <= (DRBD_MAX_BBIO_SECTORS << 9)))
             return NULL;
-    } else if (!expect(ds <= DRBD_MAX_BIO_SIZE))
+    } else if (!expect(peer_device, ds <= DRBD_MAX_BIO_SIZE))
         return NULL;
 
     /* even though we trust out peer,
···
      * respective _drbd_clear_done_ee */
 
     peer_req->w.cb = e_end_resync_block;
+    peer_req->opf = REQ_OP_WRITE;
     peer_req->submit_jif = jiffies;
 
     spin_lock_irq(&device->resource->req_lock);
···
     spin_unlock_irq(&device->resource->req_lock);
 
     atomic_add(pi->size >> 9, &device->rs_sect_ev);
-    if (drbd_submit_peer_request(device, peer_req, REQ_OP_WRITE,
-                     DRBD_FAULT_RS_WR) == 0)
+    if (drbd_submit_peer_request(peer_req) == 0)
         return 0;
 
     /* don't care for the reason here */
···
          * or in drbd_peer_request_endio. */
         err = recv_resync_read(peer_device, sector, pi);
     } else {
-        if (__ratelimit(&drbd_ratelimit_state))
+        if (drbd_ratelimit())
             drbd_err(device, "Can not write resync data to local disk.\n");
 
         err = drbd_drain_block(peer_device, pi->size);
···
     return ret;
 }
 
-/* see also bio_flags_to_wire()
- * DRBD_REQ_*, because we need to semantically map the flags to data packet
- * flags and back. We may replicate to other kernel versions. */
-static blk_opf_t wire_flags_to_bio_flags(u32 dpf)
-{
-    return  (dpf & DP_RW_SYNC ? REQ_SYNC : 0) |
-        (dpf & DP_FUA ? REQ_FUA : 0) |
-        (dpf & DP_FLUSH ? REQ_PREFLUSH : 0);
-}
-
 static enum req_op wire_flags_to_bio_op(u32 dpf)
 {
     if (dpf & DP_ZEROES)
···
         return REQ_OP_DISCARD;
     else
         return REQ_OP_WRITE;
+}
+
+/* see also bio_flags_to_wire() */
+static blk_opf_t wire_flags_to_bio(struct drbd_connection *connection, u32 dpf)
+{
+    return wire_flags_to_bio_op(dpf) |
+        (dpf & DP_RW_SYNC ? REQ_SYNC : 0) |
+        (dpf & DP_FUA ? REQ_FUA : 0) |
+        (dpf & DP_FLUSH ? REQ_PREFLUSH : 0);
 }
 
 static void fail_postponed_requests(struct drbd_device *device, sector_t sector,
···
     struct drbd_peer_request *peer_req;
     struct p_data *p = pi->data;
     u32 peer_seq = be32_to_cpu(p->seq_num);
-    enum req_op op;
-    blk_opf_t op_flags;
     u32 dp_flags;
     int err, tp;
···
     peer_req->flags |= EE_APPLICATION;
 
     dp_flags = be32_to_cpu(p->dp_flags);
-    op = wire_flags_to_bio_op(dp_flags);
-    op_flags = wire_flags_to_bio_flags(dp_flags);
+    peer_req->opf = wire_flags_to_bio(connection, dp_flags);
     if (pi->cmd == P_TRIM) {
         D_ASSERT(peer_device, peer_req->i.size > 0);
-        D_ASSERT(peer_device, op == REQ_OP_DISCARD);
+        D_ASSERT(peer_device, peer_req_op(peer_req) == REQ_OP_DISCARD);
         D_ASSERT(peer_device, peer_req->pages == NULL);
         /* need to play safe: an older DRBD sender
          * may mean zero-out while sending P_TRIM. */
···
             peer_req->flags |= EE_ZEROOUT;
     } else if (pi->cmd == P_ZEROES) {
         D_ASSERT(peer_device, peer_req->i.size > 0);
-        D_ASSERT(peer_device, op == REQ_OP_WRITE_ZEROES);
+        D_ASSERT(peer_device, peer_req_op(peer_req) == REQ_OP_WRITE_ZEROES);
         D_ASSERT(peer_device, peer_req->pages == NULL);
         /* Do (not) pass down BLKDEV_ZERO_NOUNMAP? */
         if (dp_flags & DP_DISCARD)
···
         peer_req->flags |= EE_CALL_AL_COMPLETE_IO;
     }
 
-    err = drbd_submit_peer_request(device, peer_req, op | op_flags,
-                       DRBD_FAULT_DT_WR);
+    err = drbd_submit_peer_request(peer_req);
     if (!err)
         return 0;
···
     struct drbd_peer_request *peer_req;
     struct digest_info *di = NULL;
     int size, verb;
-    unsigned int fault_type;
     struct p_block_req *p = pi->data;
···
     default:
         BUG();
     }
-    if (verb && __ratelimit(&drbd_ratelimit_state))
+    if (verb && drbd_ratelimit())
         drbd_err(device, "Can not satisfy peer's read request, "
             "no local data.\n");
···
         put_ldev(device);
         return -ENOMEM;
     }
+    peer_req->opf = REQ_OP_READ;
 
     switch (pi->cmd) {
     case P_DATA_REQUEST:
         peer_req->w.cb = w_e_end_data_req;
-        fault_type = DRBD_FAULT_DT_RD;
         /* application IO, don't drbd_rs_begin_io */
         peer_req->flags |= EE_APPLICATION;
         goto submit;
···
         fallthrough;
     case P_RS_DATA_REQUEST:
         peer_req->w.cb = w_e_end_rsdata_req;
-        fault_type = DRBD_FAULT_RS_RD;
         /* used in the sector offset progress display */
         device->bm_resync_fo = BM_SECT_TO_BIT(sector);
         break;
 
     case P_OV_REPLY:
     case P_CSUM_RS_REQUEST:
-        fault_type = DRBD_FAULT_RS_RD;
         di = kmalloc(sizeof(*di) + pi->size, GFP_NOIO);
         if (!di)
             goto out_free_e;
···
                 (unsigned long long)sector);
         }
         peer_req->w.cb = w_e_end_ov_req;
-        fault_type = DRBD_FAULT_RS_RD;
         break;
 
     default:
···
 submit:
     update_receiver_timing_details(connection, drbd_submit_peer_request);
     inc_unacked(device);
-    if (drbd_submit_peer_request(device, peer_req, REQ_OP_READ,
-                     fault_type) == 0)
+    if (drbd_submit_peer_request(peer_req) == 0)
         return 0;
 
     /* don't care for the reason here */
···
 
     if (get_ldev(device)) {
         struct drbd_peer_request *peer_req;
-        const enum req_op op = REQ_OP_WRITE_ZEROES;
 
         peer_req = drbd_alloc_peer_req(peer_device, ID_SYNCER, sector,
                            size, 0, GFP_NOIO);
···
     }
 
     peer_req->w.cb = e_end_resync_block;
+    peer_req->opf = REQ_OP_DISCARD;
     peer_req->submit_jif = jiffies;
     peer_req->flags |= EE_TRIM;
···
     spin_unlock_irq(&device->resource->req_lock);
 
     atomic_add(pi->size >> 9, &device->rs_sect_ev);
-    err = drbd_submit_peer_request(device, peer_req, op,
-                       DRBD_FAULT_RS_WR);
+    err = drbd_submit_peer_request(peer_req);
 
     if (err) {
         spin_lock_irq(&device->resource->req_lock);
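The refactor above folds the removed wire_flags_to_bio_flags() into wire_flags_to_bio(), which decodes the operation first and then ORs in the modifier flags, so the complete opf value can be stored in the peer request in one step. A compilable sketch of that mapping, with illustrative stand-in values for the DP_ and REQ_ constants (the real ones live in drbd_protocol.h and blk_types.h):

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in wire flags and bio op/flag values; purely illustrative. */
enum {
    DP_RW_SYNC = 1 << 0, DP_FUA = 1 << 1, DP_FLUSH = 1 << 2,
    DP_DISCARD = 1 << 3, DP_ZEROES = 1 << 4,
};
enum { REQ_OP_WRITE = 0, REQ_OP_DISCARD = 1, REQ_OP_WRITE_ZEROES = 2 };
enum { REQ_SYNC = 1 << 8, REQ_FUA = 1 << 9, REQ_PREFLUSH = 1 << 10 };

typedef uint32_t blk_opf_t;

/* Mirror of wire_flags_to_bio(): op first (zeroes takes precedence over
 * discard, matching wire_flags_to_bio_op), then the modifier bits. */
static blk_opf_t wire_flags_to_bio(uint32_t dpf)
{
    blk_opf_t op = (dpf & DP_ZEROES)  ? REQ_OP_WRITE_ZEROES :
                   (dpf & DP_DISCARD) ? REQ_OP_DISCARD : REQ_OP_WRITE;

    return op |
        ((dpf & DP_RW_SYNC) ? REQ_SYNC : 0) |
        ((dpf & DP_FUA) ? REQ_FUA : 0) |
        ((dpf & DP_FLUSH) ? REQ_PREFLUSH : 0);
}
```

Because op and flags occupy disjoint bit ranges, the combined value round-trips cleanly into `peer_req->opf`, which is what lets drbd_submit_peer_request() drop its separate op and fault-type parameters.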
+4 -4
drivers/block/drbd/drbd_req.c
···
-// SPDX-License-Identifier: GPL-2.0-or-later
+// SPDX-License-Identifier: GPL-2.0-only
 /*
    drbd_req.c
···
     if (get_ldev_if_state(device, D_FAILED)) {
         drbd_al_complete_io(device, &req->i);
         put_ldev(device);
-    } else if (__ratelimit(&drbd_ratelimit_state)) {
+    } else if (drbd_ratelimit()) {
         drbd_warn(device, "Should have called drbd_al_complete_io(, %llu, %u), "
               "but my Disk seems to have failed :(\n",
               (unsigned long long) req->i.sector, req->i.size);
···
 static void drbd_report_io_error(struct drbd_device *device, struct drbd_request *req)
 {
-    if (!__ratelimit(&drbd_ratelimit_state))
+    if (!drbd_ratelimit())
         return;
 
     drbd_warn(device, "local %s IO error sector %llu+%u on %pg\n",
···
         submit_private_bio = true;
     } else if (no_remote) {
 nodata:
-        if (__ratelimit(&drbd_ratelimit_state))
+        if (drbd_ratelimit())
             drbd_err(device, "IO ERROR: neither local nor remote data, sector %llu+%u\n",
                  (unsigned long long)req->i.sector, req->i.size >> 9);
         /* A write may have been queued for send_oos, however.
+1 -1
drivers/block/drbd/drbd_req.h
···
-/* SPDX-License-Identifier: GPL-2.0-or-later */
+/* SPDX-License-Identifier: GPL-2.0-only */
 /*
    drbd_req.h
+1 -1
drivers/block/drbd/drbd_state.c
···
-// SPDX-License-Identifier: GPL-2.0-or-later
+// SPDX-License-Identifier: GPL-2.0-only
 /*
    drbd_state.c
+1 -1
drivers/block/drbd/drbd_state.h
···
-/* SPDX-License-Identifier: GPL-2.0 */
+/* SPDX-License-Identifier: GPL-2.0-only */
 #ifndef DRBD_STATE_H
 #define DRBD_STATE_H
+1 -1
drivers/block/drbd/drbd_state_change.h
···
-/* SPDX-License-Identifier: GPL-2.0 */
+/* SPDX-License-Identifier: GPL-2.0-only */
 #ifndef DRBD_STATE_CHANGE_H
 #define DRBD_STATE_CHANGE_H
+1 -1
drivers/block/drbd/drbd_strings.c
···
-// SPDX-License-Identifier: GPL-2.0-or-later
+// SPDX-License-Identifier: GPL-2.0-only
 /*
    drbd.h
+1 -1
drivers/block/drbd/drbd_strings.h
···
-/* SPDX-License-Identifier: GPL-2.0 */
+/* SPDX-License-Identifier: GPL-2.0-only */
 #ifndef __DRBD_STRINGS_H
 #define __DRBD_STRINGS_H
+1 -1
drivers/block/drbd/drbd_vli.h
···
-/* SPDX-License-Identifier: GPL-2.0-or-later */
+/* SPDX-License-Identifier: GPL-2.0-only */
 /*
    -*- linux-c -*-
    drbd_receiver.c
+9 -9
drivers/block/drbd/drbd_worker.c
···
-// SPDX-License-Identifier: GPL-2.0-or-later
+// SPDX-License-Identifier: GPL-2.0-only
 /*
    drbd_worker.c
···
     bool is_discard = bio_op(bio) == REQ_OP_WRITE_ZEROES ||
               bio_op(bio) == REQ_OP_DISCARD;
 
-    if (bio->bi_status && __ratelimit(&drbd_ratelimit_state))
+    if (bio->bi_status && drbd_ratelimit())
         drbd_warn(device, "%s: error=%d s=%llus\n",
             is_write ? (is_discard ? "discard" : "write")
                  : "read", bio->bi_status,
···
      * though we still will complain noisily about it.
      */
     if (unlikely(req->rq_state & RQ_LOCAL_ABORTED)) {
-        if (__ratelimit(&drbd_ratelimit_state))
+        if (drbd_ratelimit())
             drbd_emerg(device, "delayed completion of aborted local request; disk-timeout may be too aggressive\n");
 
         if (!bio->bi_status)
···
         goto defer;
 
     peer_req->w.cb = w_e_send_csum;
+    peer_req->opf = REQ_OP_READ;
     spin_lock_irq(&device->resource->req_lock);
     list_add_tail(&peer_req->w.list, &device->read_ee);
     spin_unlock_irq(&device->resource->req_lock);
 
     atomic_add(size >> 9, &device->rs_sect_ev);
-    if (drbd_submit_peer_request(device, peer_req, REQ_OP_READ,
-                     DRBD_FAULT_RS_RD) == 0)
+    if (drbd_submit_peer_request(peer_req) == 0)
         return 0;
 
     /* If it failed because of ENOMEM, retry should help. If it failed
···
     if (likely((peer_req->flags & EE_WAS_ERROR) == 0)) {
         err = drbd_send_block(peer_device, P_DATA_REPLY, peer_req);
     } else {
-        if (__ratelimit(&drbd_ratelimit_state))
+        if (drbd_ratelimit())
             drbd_err(device, "Sending NegDReply. sector=%llus.\n",
                 (unsigned long long)peer_req->i.sector);
···
         else
             err = drbd_send_block(peer_device, P_RS_DATA_REPLY, peer_req);
     } else {
-        if (__ratelimit(&drbd_ratelimit_state))
+        if (drbd_ratelimit())
             drbd_err(device, "Not sending RSDataReply, "
                 "partner DISKLESS!\n");
         err = 0;
         }
     } else {
-        if (__ratelimit(&drbd_ratelimit_state))
+        if (drbd_ratelimit())
             drbd_err(device, "Sending NegRSDReply. sector %llus.\n",
                 (unsigned long long)peer_req->i.sector);
···
     } else {
         err = drbd_send_ack(peer_device, P_NEG_RS_DREPLY, peer_req);
-        if (__ratelimit(&drbd_ratelimit_state))
+        if (drbd_ratelimit())
             drbd_err(device, "Sending NegDReply. I guess it gets messy.\n");
     }
 
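Every call site above now goes through drbd_ratelimit(), which gives each caller its own static ratelimit state instead of sharing the single global drbd_ratelimit_state. A time-free userspace model of the underlying window/burst logic (the kernel's __ratelimit() uses jiffies and a spinlock; here `now` is a fake clock advanced by the caller, and the name ratelimit_allow is a stand-in):

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model of the kernel ratelimit: each state allows at most
 * `burst` events per `interval` ticks of the caller-advanced clock. */
struct ratelimit_state {
    int interval; /* ticks per window */
    int burst;    /* events allowed per window */
    int begin;    /* start tick of the current window */
    int printed;  /* events granted in the current window */
    int now;      /* fake clock, advanced by the caller */
};

static bool ratelimit_allow(struct ratelimit_state *rs)
{
    if (rs->now - rs->begin >= rs->interval) {
        rs->begin = rs->now; /* a new window opens */
        rs->printed = 0;
    }
    if (rs->printed >= rs->burst)
        return false;        /* suppressed until the window rolls over */
    rs->printed++;
    return true;
}
```

Per-callsite state means one noisy error path can no longer eat the whole budget and silence unrelated warnings elsewhere, which is the point of replacing the shared drbd_ratelimit_state.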
+3 -1
drivers/block/floppy.c
···
         goto out_put_disk;
 
     err = floppy_alloc_disk(drive, 0);
-    if (err)
+    if (err) {
+        blk_mq_free_tag_set(&tag_sets[drive]);
         goto out_put_disk;
+    }
 
     timer_setup(&motor_off_timer[drive], motor_off_callback, 0);
 }
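The floppy fix above plugs a leak: when floppy_alloc_disk() fails, the tag set allocated earlier in the same loop iteration must be freed before jumping to the shared error path, which only unwinds fully set-up drives. A toy model of that unwind rule (alloc_tag_set()/free_tag_set() and `fail_at` are stand-ins used to inject and observe a failure):

```c
#include <assert.h>

/* Balance counter standing in for live blk_mq tag-set allocations. */
static int allocs_live;

static void alloc_tag_set(void) { allocs_live++; }
static void free_tag_set(void)  { allocs_live--; }

/* Per-iteration setup: if the step after the allocation fails, this
 * iteration's allocation must be undone here; earlier iterations are
 * torn down by the caller's shared error path. */
static int probe_drives(int n, int fail_at)
{
    for (int i = 0; i < n; i++) {
        alloc_tag_set();
        if (i == fail_at) {       /* simulated floppy_alloc_disk() failure */
            free_tag_set();       /* the fix: undo this iteration's alloc */
            return -1;
        }
        /* ... rest of per-drive setup ... */
    }
    return 0;
}
```

Without the `free_tag_set()` on the failure branch, the count after a failure at drive 2 would be 3 instead of 2: one tag set leaked, exactly the bug the hunk fixes.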
+21 -1
drivers/block/null_blk/main.c
···
 }
 CONFIGFS_ATTR(nullb_device_, badblocks);
 
+static ssize_t nullb_device_zone_readonly_store(struct config_item *item,
+        const char *page, size_t count)
+{
+    struct nullb_device *dev = to_nullb_device(item);
+
+    return zone_cond_store(dev, page, count, BLK_ZONE_COND_READONLY);
+}
+CONFIGFS_ATTR_WO(nullb_device_, zone_readonly);
+
+static ssize_t nullb_device_zone_offline_store(struct config_item *item,
+        const char *page, size_t count)
+{
+    struct nullb_device *dev = to_nullb_device(item);
+
+    return zone_cond_store(dev, page, count, BLK_ZONE_COND_OFFLINE);
+}
+CONFIGFS_ATTR_WO(nullb_device_, zone_offline);
+
 static struct configfs_attribute *nullb_device_attrs[] = {
     &nullb_device_attr_size,
     &nullb_device_attr_completion_nsec,
···
     &nullb_device_attr_zone_nr_conv,
     &nullb_device_attr_zone_max_open,
     &nullb_device_attr_zone_max_active,
+    &nullb_device_attr_zone_readonly,
+    &nullb_device_attr_zone_offline,
     &nullb_device_attr_virt_boundary,
     &nullb_device_attr_no_sched,
     &nullb_device_attr_shared_tag_bitmap,
···
             "poll_queues,power,queue_mode,shared_tag_bitmap,size,"
             "submit_queues,use_per_node_hctx,virt_boundary,zoned,"
             "zone_capacity,zone_max_active,zone_max_open,"
-            "zone_nr_conv,zone_size\n");
+            "zone_nr_conv,zone_offline,zone_readonly,zone_size\n");
 }
 
 CONFIGFS_ATTR_RO(memb_group_, features);
+8
drivers/block/null_blk/null_blk.h
···
             sector_t sector, sector_t nr_sectors);
 size_t null_zone_valid_read_len(struct nullb *nullb,
                 sector_t sector, unsigned int len);
+ssize_t zone_cond_store(struct nullb_device *dev, const char *page,
+            size_t count, enum blk_zone_cond cond);
 #else
 static inline int null_init_zoned_dev(struct nullb_device *dev,
                       struct request_queue *q)
···
                            unsigned int len)
 {
     return len;
+}
+static inline ssize_t zone_cond_store(struct nullb_device *dev,
+                      const char *page, size_t count,
+                      enum blk_zone_cond cond)
+{
+    return -EOPNOTSUPP;
 }
 #define null_report_zones    NULL
 #endif /* CONFIG_BLK_DEV_ZONED */
+92 -3
drivers/block/null_blk/zoned.c
···
 
     null_lock_zone(dev, zone);
 
-    if (zone->cond == BLK_ZONE_COND_FULL) {
-        /* Cannot write to a full zone */
+    if (zone->cond == BLK_ZONE_COND_FULL ||
+        zone->cond == BLK_ZONE_COND_READONLY ||
+        zone->cond == BLK_ZONE_COND_OFFLINE) {
+        /* Cannot write to the zone */
         ret = BLK_STS_IOERR;
         goto unlock;
     }
···
         for (i = dev->zone_nr_conv; i < dev->nr_zones; i++) {
             zone = &dev->zones[i];
             null_lock_zone(dev, zone);
-            if (zone->cond != BLK_ZONE_COND_EMPTY) {
+            if (zone->cond != BLK_ZONE_COND_EMPTY &&
+                zone->cond != BLK_ZONE_COND_READONLY &&
+                zone->cond != BLK_ZONE_COND_OFFLINE) {
                 null_reset_zone(dev, zone);
                 trace_nullb_zone_op(cmd, i, zone->cond);
             }
···
     zone = &dev->zones[zone_no];
 
     null_lock_zone(dev, zone);
+
+    if (zone->cond == BLK_ZONE_COND_READONLY ||
+        zone->cond == BLK_ZONE_COND_OFFLINE) {
+        ret = BLK_STS_IOERR;
+        goto unlock;
+    }
 
     switch (op) {
     case REQ_OP_ZONE_RESET:
···
     if (ret == BLK_STS_OK)
         trace_nullb_zone_op(cmd, zone_no, zone->cond);
 
+unlock:
     null_unlock_zone(dev, zone);
 
     return ret;
···
     default:
         dev = cmd->nq->dev;
         zone = &dev->zones[null_zone_no(dev, sector)];
+        if (zone->cond == BLK_ZONE_COND_OFFLINE)
+            return BLK_STS_IOERR;
 
         null_lock_zone(dev, zone);
         sts = null_process_cmd(cmd, op, sector, nr_sectors);
         null_unlock_zone(dev, zone);
         return sts;
     }
+}
+
+/*
+ * Set a zone in the read-only or offline condition.
+ */
+static void null_set_zone_cond(struct nullb_device *dev,
+                   struct nullb_zone *zone, enum blk_zone_cond cond)
+{
+    if (WARN_ON_ONCE(cond != BLK_ZONE_COND_READONLY &&
+             cond != BLK_ZONE_COND_OFFLINE))
+        return;
+
+    null_lock_zone(dev, zone);
+
+    /*
+     * If the read-only condition is requested again to zones already in
+     * read-only condition, restore back normal empty condition. Do the same
+     * if the offline condition is requested for offline zones. Otherwise,
+     * set the specified zone condition to the zones. Finish the zones
+     * beforehand to free up zone resources.
+     */
+    if (zone->cond == cond) {
+        zone->cond = BLK_ZONE_COND_EMPTY;
+        zone->wp = zone->start;
+        if (dev->memory_backed)
+            null_handle_discard(dev, zone->start, zone->len);
+    } else {
+        if (zone->cond != BLK_ZONE_COND_READONLY &&
+            zone->cond != BLK_ZONE_COND_OFFLINE)
+            null_finish_zone(dev, zone);
+        zone->cond = cond;
+        zone->wp = (sector_t)-1;
+    }
+
+    null_unlock_zone(dev, zone);
+}
+
+/*
+ * Identify a zone from the sector written to configfs file. Then set zone
+ * condition to the zone.
+ */
+ssize_t zone_cond_store(struct nullb_device *dev, const char *page,
+            size_t count, enum blk_zone_cond cond)
+{
+    unsigned long long sector;
+    unsigned int zone_no;
+    int ret;
+
+    if (!dev->zoned) {
+        pr_err("null_blk device is not zoned\n");
+        return -EINVAL;
+    }
+
+    if (!dev->zones) {
+        pr_err("null_blk device is not yet powered\n");
+        return -EINVAL;
+    }
+
+    ret = kstrtoull(page, 0, &sector);
+    if (ret < 0)
+        return ret;
+
+    zone_no = null_zone_no(dev, sector);
+    if (zone_no >= dev->nr_zones) {
+        pr_err("Sector out of range\n");
+        return -EINVAL;
+    }
+
+    if (dev->zones[zone_no].type == BLK_ZONE_TYPE_CONVENTIONAL) {
+        pr_err("Can not change condition of conventional zones\n");
+        return -EINVAL;
+    }
+
+    null_set_zone_cond(dev, &dev->zones[zone_no], cond);
+
+    return count;
 }
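null_set_zone_cond() above behaves as a toggle: requesting the condition a zone is already in restores it to empty, while any other request puts the zone in the new condition with an invalidated write pointer. A reduced model of that state change (the enum values and fields are stand-ins for the kernel's `blk_zone_cond` and `struct nullb_zone`; locking and the memory-backed discard are omitted):

```c
#include <assert.h>

/* Stand-ins for blk_zone_cond values and null_blk's zone bookkeeping. */
enum zone_cond { COND_EMPTY, COND_FULL, COND_READONLY, COND_OFFLINE };

struct zone {
    enum zone_cond cond;
    unsigned long long start, len, wp;
};

/* Mirror of the toggle in null_set_zone_cond(): same condition again
 * means "back to normal", otherwise enter the requested condition. */
static void set_zone_cond(struct zone *z, enum zone_cond cond)
{
    if (z->cond == cond) {
        z->cond = COND_EMPTY;              /* restore normal operation */
        z->wp = z->start;                  /* write pointer rewinds */
    } else {
        z->cond = cond;
        z->wp = (unsigned long long)-1;    /* write pointer now invalid */
    }
}
```

The invalid write pointer is what makes subsequent writes fail with an I/O error in the hunks above, which is exactly the behavior the new configfs knobs exist to exercise in tests.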
-2944
drivers/block/pktcdvd.c
···
-/*
- * Copyright (C) 2000 Jens Axboe <axboe@suse.de>
- * Copyright (C) 2001-2004 Peter Osterlund <petero2@telia.com>
- * Copyright (C) 2006 Thomas Maier <balagi@justmail.de>
- *
- * May be copied or modified under the terms of the GNU General Public
- * License.  See linux/COPYING for more information.
- *
- * Packet writing layer for ATAPI and SCSI CD-RW, DVD+RW, DVD-RW and
- * DVD-RAM devices.
- *
- * Theory of operation:
- *
- * At the lowest level, there is the standard driver for the CD/DVD device,
- * such as drivers/scsi/sr.c. This driver can handle read and write requests,
- * but it doesn't know anything about the special restrictions that apply to
- * packet writing. One restriction is that write requests must be aligned to
- * packet boundaries on the physical media, and the size of a write request
- * must be equal to the packet size. Another restriction is that a
- * GPCMD_FLUSH_CACHE command has to be issued to the drive before a read
- * command, if the previous command was a write.
- *
- * The purpose of the packet writing driver is to hide these restrictions from
- * higher layers, such as file systems, and present a block device that can be
- * randomly read and written using 2kB-sized blocks.
- *
- * The lowest layer in the packet writing driver is the packet I/O scheduler.
- * Its data is defined by the struct packet_iosched and includes two bio
- * queues with pending read and write requests. These queues are processed
- * by the pkt_iosched_process_queue() function. The write requests in this
- * queue are already properly aligned and sized. This layer is responsible for
- * issuing the flush cache commands and scheduling the I/O in a good order.
- *
- * The next layer transforms unaligned write requests to aligned writes. This
- * transformation requires reading missing pieces of data from the underlying
- * block device, assembling the pieces to full packets and queuing them to the
- * packet I/O scheduler.
- *
- * At the top layer there is a custom ->submit_bio function that forwards
- * read requests directly to the iosched queue and puts write requests in the
- * unaligned write queue. A kernel thread performs the necessary read
- * gathering to convert the unaligned writes to aligned writes and then feeds
- * them to the packet I/O scheduler.
- *
- *************************************************************************/
-
-#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
-
-#include <linux/pktcdvd.h>
-#include <linux/module.h>
-#include <linux/types.h>
-#include <linux/kernel.h>
-#include <linux/compat.h>
-#include <linux/kthread.h>
-#include <linux/errno.h>
-#include <linux/spinlock.h>
-#include <linux/file.h>
-#include <linux/proc_fs.h>
-#include <linux/seq_file.h>
-#include <linux/miscdevice.h>
-#include <linux/freezer.h>
-#include <linux/mutex.h>
-#include <linux/slab.h>
-#include <linux/backing-dev.h>
-#include <scsi/scsi_cmnd.h>
-#include <scsi/scsi_ioctl.h>
-#include <scsi/scsi.h>
-#include <linux/debugfs.h>
-#include <linux/device.h>
-#include <linux/nospec.h>
-#include <linux/uaccess.h>
-
-#define DRIVER_NAME    "pktcdvd"
-
-#define pkt_err(pd, fmt, ...) \
-    pr_err("%s: " fmt, pd->name, ##__VA_ARGS__)
-#define pkt_notice(pd, fmt, ...) \
-    pr_notice("%s: " fmt, pd->name, ##__VA_ARGS__)
-#define pkt_info(pd, fmt, ...) \
-    pr_info("%s: " fmt, pd->name, ##__VA_ARGS__)
-
-#define pkt_dbg(level, pd, fmt, ...) \
-do { \
-    if (level == 2 && PACKET_DEBUG >= 2) \
-        pr_notice("%s: %s():" fmt, \
-              pd->name, __func__, ##__VA_ARGS__); \
-    else if (level == 1 && PACKET_DEBUG >= 1) \
-        pr_notice("%s: " fmt, pd->name, ##__VA_ARGS__); \
-} while (0)
-
-#define MAX_SPEED 0xffff
-
-static DEFINE_MUTEX(pktcdvd_mutex);
-static struct pktcdvd_device *pkt_devs[MAX_WRITERS];
-static struct proc_dir_entry *pkt_proc;
-static int pktdev_major;
-static int write_congestion_on  = PKT_WRITE_CONGESTION_ON;
-static int write_congestion_off = PKT_WRITE_CONGESTION_OFF;
-static struct mutex ctl_mutex;    /* Serialize open/close/setup/teardown */
-static mempool_t psd_pool;
-static struct bio_set pkt_bio_set;
-
-static struct class    *class_pktcdvd = NULL;    /* /sys/class/pktcdvd */
-static struct dentry    *pkt_debugfs_root = NULL; /* /sys/kernel/debug/pktcdvd */
-
-/* forward declaration */
-static int pkt_setup_dev(dev_t dev, dev_t* pkt_dev);
-static int pkt_remove_dev(dev_t pkt_dev);
-static int pkt_seq_show(struct seq_file *m, void *p);
-
-static sector_t get_zone(sector_t sector, struct pktcdvd_device *pd)
-{
-    return (sector + pd->offset) & ~(sector_t)(pd->settings.size - 1);
-}
-
-/**********************************************************
- * sysfs interface for pktcdvd
- * by (C) 2006  Thomas Maier <balagi@justmail.de>
-
-  /sys/class/pktcdvd/pktcdvd[0-7]/
-                     stat/reset
-                     stat/packets_started
-                     stat/packets_finished
-                     stat/kb_written
-                     stat/kb_read
-                     stat/kb_read_gather
-                     write_queue/size
-                     write_queue/congestion_off
-                     write_queue/congestion_on
- **********************************************************/
-
-static ssize_t packets_started_show(struct device *dev,
-                    struct device_attribute *attr, char *buf)
-{
-    struct pktcdvd_device *pd = dev_get_drvdata(dev);
-
-    return sysfs_emit(buf, "%lu\n", pd->stats.pkt_started);
-}
-static DEVICE_ATTR_RO(packets_started);
-
-static ssize_t packets_finished_show(struct device *dev,
-                     struct device_attribute *attr, char *buf)
-{
-    struct pktcdvd_device *pd = dev_get_drvdata(dev);
-
-    return sysfs_emit(buf, "%lu\n", pd->stats.pkt_ended);
-}
-static DEVICE_ATTR_RO(packets_finished);
-
-static ssize_t kb_written_show(struct device *dev,
-                   struct device_attribute *attr, char *buf)
-{
-    struct pktcdvd_device *pd = dev_get_drvdata(dev);
-
-    return sysfs_emit(buf, "%lu\n", pd->stats.secs_w >> 1);
-}
-static DEVICE_ATTR_RO(kb_written);
-
-static ssize_t kb_read_show(struct device *dev,
-                struct device_attribute *attr, char *buf)
-{
-    struct pktcdvd_device *pd = dev_get_drvdata(dev);
-
-    return sysfs_emit(buf, "%lu\n", pd->stats.secs_r >> 1);
-}
-static DEVICE_ATTR_RO(kb_read);
-
-static ssize_t kb_read_gather_show(struct device *dev,
-                   struct device_attribute *attr, char *buf)
-{
-    struct pktcdvd_device *pd = dev_get_drvdata(dev);
-
-    return sysfs_emit(buf, "%lu\n", pd->stats.secs_rg >> 1);
-}
-static DEVICE_ATTR_RO(kb_read_gather);
-
-static ssize_t reset_store(struct device *dev, struct device_attribute *attr,
-               const char *buf, size_t len)
-{
-    struct pktcdvd_device *pd = dev_get_drvdata(dev);
-
-    if (len > 0) {
-        pd->stats.pkt_started = 0;
-        pd->stats.pkt_ended = 0;
-        pd->stats.secs_w = 0;
-        pd->stats.secs_rg = 0;
-        pd->stats.secs_r = 0;
-    }
-    return len;
-}
-static DEVICE_ATTR_WO(reset);
-
-static struct attribute *pkt_stat_attrs[] = {
-    &dev_attr_packets_finished.attr,
-    &dev_attr_packets_started.attr,
-    &dev_attr_kb_read.attr,
-    &dev_attr_kb_written.attr,
-    &dev_attr_kb_read_gather.attr,
-    &dev_attr_reset.attr,
-    NULL,
}; 202 - 203 - static const struct attribute_group pkt_stat_group = { 204 - .name = "stat", 205 - .attrs = pkt_stat_attrs, 206 - }; 207 - 208 - static ssize_t size_show(struct device *dev, 209 - struct device_attribute *attr, char *buf) 210 - { 211 - struct pktcdvd_device *pd = dev_get_drvdata(dev); 212 - int n; 213 - 214 - spin_lock(&pd->lock); 215 - n = sysfs_emit(buf, "%d\n", pd->bio_queue_size); 216 - spin_unlock(&pd->lock); 217 - return n; 218 - } 219 - static DEVICE_ATTR_RO(size); 220 - 221 - static void init_write_congestion_marks(int* lo, int* hi) 222 - { 223 - if (*hi > 0) { 224 - *hi = max(*hi, 500); 225 - *hi = min(*hi, 1000000); 226 - if (*lo <= 0) 227 - *lo = *hi - 100; 228 - else { 229 - *lo = min(*lo, *hi - 100); 230 - *lo = max(*lo, 100); 231 - } 232 - } else { 233 - *hi = -1; 234 - *lo = -1; 235 - } 236 - } 237 - 238 - static ssize_t congestion_off_show(struct device *dev, 239 - struct device_attribute *attr, char *buf) 240 - { 241 - struct pktcdvd_device *pd = dev_get_drvdata(dev); 242 - int n; 243 - 244 - spin_lock(&pd->lock); 245 - n = sysfs_emit(buf, "%d\n", pd->write_congestion_off); 246 - spin_unlock(&pd->lock); 247 - return n; 248 - } 249 - 250 - static ssize_t congestion_off_store(struct device *dev, 251 - struct device_attribute *attr, 252 - const char *buf, size_t len) 253 - { 254 - struct pktcdvd_device *pd = dev_get_drvdata(dev); 255 - int val; 256 - 257 - if (sscanf(buf, "%d", &val) == 1) { 258 - spin_lock(&pd->lock); 259 - pd->write_congestion_off = val; 260 - init_write_congestion_marks(&pd->write_congestion_off, 261 - &pd->write_congestion_on); 262 - spin_unlock(&pd->lock); 263 - } 264 - return len; 265 - } 266 - static DEVICE_ATTR_RW(congestion_off); 267 - 268 - static ssize_t congestion_on_show(struct device *dev, 269 - struct device_attribute *attr, char *buf) 270 - { 271 - struct pktcdvd_device *pd = dev_get_drvdata(dev); 272 - int n; 273 - 274 - spin_lock(&pd->lock); 275 - n = sysfs_emit(buf, "%d\n", pd->write_congestion_on); 
276 - spin_unlock(&pd->lock); 277 - return n; 278 - } 279 - 280 - static ssize_t congestion_on_store(struct device *dev, 281 - struct device_attribute *attr, 282 - const char *buf, size_t len) 283 - { 284 - struct pktcdvd_device *pd = dev_get_drvdata(dev); 285 - int val; 286 - 287 - if (sscanf(buf, "%d", &val) == 1) { 288 - spin_lock(&pd->lock); 289 - pd->write_congestion_on = val; 290 - init_write_congestion_marks(&pd->write_congestion_off, 291 - &pd->write_congestion_on); 292 - spin_unlock(&pd->lock); 293 - } 294 - return len; 295 - } 296 - static DEVICE_ATTR_RW(congestion_on); 297 - 298 - static struct attribute *pkt_wq_attrs[] = { 299 - &dev_attr_congestion_on.attr, 300 - &dev_attr_congestion_off.attr, 301 - &dev_attr_size.attr, 302 - NULL, 303 - }; 304 - 305 - static const struct attribute_group pkt_wq_group = { 306 - .name = "write_queue", 307 - .attrs = pkt_wq_attrs, 308 - }; 309 - 310 - static const struct attribute_group *pkt_groups[] = { 311 - &pkt_stat_group, 312 - &pkt_wq_group, 313 - NULL, 314 - }; 315 - 316 - static void pkt_sysfs_dev_new(struct pktcdvd_device *pd) 317 - { 318 - if (class_pktcdvd) { 319 - pd->dev = device_create_with_groups(class_pktcdvd, NULL, 320 - MKDEV(0, 0), pd, pkt_groups, 321 - "%s", pd->name); 322 - if (IS_ERR(pd->dev)) 323 - pd->dev = NULL; 324 - } 325 - } 326 - 327 - static void pkt_sysfs_dev_remove(struct pktcdvd_device *pd) 328 - { 329 - if (class_pktcdvd) 330 - device_unregister(pd->dev); 331 - } 332 - 333 - 334 - /******************************************************************** 335 - /sys/class/pktcdvd/ 336 - add map block device 337 - remove unmap packet dev 338 - device_map show mappings 339 - *******************************************************************/ 340 - 341 - static void class_pktcdvd_release(struct class *cls) 342 - { 343 - kfree(cls); 344 - } 345 - 346 - static ssize_t device_map_show(struct class *c, struct class_attribute *attr, 347 - char *data) 348 - { 349 - int n = 0; 350 - int idx; 351 - 
mutex_lock_nested(&ctl_mutex, SINGLE_DEPTH_NESTING); 352 - for (idx = 0; idx < MAX_WRITERS; idx++) { 353 - struct pktcdvd_device *pd = pkt_devs[idx]; 354 - if (!pd) 355 - continue; 356 - n += sprintf(data+n, "%s %u:%u %u:%u\n", 357 - pd->name, 358 - MAJOR(pd->pkt_dev), MINOR(pd->pkt_dev), 359 - MAJOR(pd->bdev->bd_dev), 360 - MINOR(pd->bdev->bd_dev)); 361 - } 362 - mutex_unlock(&ctl_mutex); 363 - return n; 364 - } 365 - static CLASS_ATTR_RO(device_map); 366 - 367 - static ssize_t add_store(struct class *c, struct class_attribute *attr, 368 - const char *buf, size_t count) 369 - { 370 - unsigned int major, minor; 371 - 372 - if (sscanf(buf, "%u:%u", &major, &minor) == 2) { 373 - /* pkt_setup_dev() expects caller to hold reference to self */ 374 - if (!try_module_get(THIS_MODULE)) 375 - return -ENODEV; 376 - 377 - pkt_setup_dev(MKDEV(major, minor), NULL); 378 - 379 - module_put(THIS_MODULE); 380 - 381 - return count; 382 - } 383 - 384 - return -EINVAL; 385 - } 386 - static CLASS_ATTR_WO(add); 387 - 388 - static ssize_t remove_store(struct class *c, struct class_attribute *attr, 389 - const char *buf, size_t count) 390 - { 391 - unsigned int major, minor; 392 - if (sscanf(buf, "%u:%u", &major, &minor) == 2) { 393 - pkt_remove_dev(MKDEV(major, minor)); 394 - return count; 395 - } 396 - return -EINVAL; 397 - } 398 - static CLASS_ATTR_WO(remove); 399 - 400 - static struct attribute *class_pktcdvd_attrs[] = { 401 - &class_attr_add.attr, 402 - &class_attr_remove.attr, 403 - &class_attr_device_map.attr, 404 - NULL, 405 - }; 406 - ATTRIBUTE_GROUPS(class_pktcdvd); 407 - 408 - static int pkt_sysfs_init(void) 409 - { 410 - int ret = 0; 411 - 412 - /* 413 - * create control files in sysfs 414 - * /sys/class/pktcdvd/... 
415 - */ 416 - class_pktcdvd = kzalloc(sizeof(*class_pktcdvd), GFP_KERNEL); 417 - if (!class_pktcdvd) 418 - return -ENOMEM; 419 - class_pktcdvd->name = DRIVER_NAME; 420 - class_pktcdvd->owner = THIS_MODULE; 421 - class_pktcdvd->class_release = class_pktcdvd_release; 422 - class_pktcdvd->class_groups = class_pktcdvd_groups; 423 - ret = class_register(class_pktcdvd); 424 - if (ret) { 425 - kfree(class_pktcdvd); 426 - class_pktcdvd = NULL; 427 - pr_err("failed to create class pktcdvd\n"); 428 - return ret; 429 - } 430 - return 0; 431 - } 432 - 433 - static void pkt_sysfs_cleanup(void) 434 - { 435 - if (class_pktcdvd) 436 - class_destroy(class_pktcdvd); 437 - class_pktcdvd = NULL; 438 - } 439 - 440 - /******************************************************************** 441 - entries in debugfs 442 - 443 - /sys/kernel/debug/pktcdvd[0-7]/ 444 - info 445 - 446 - *******************************************************************/ 447 - 448 - static int pkt_debugfs_seq_show(struct seq_file *m, void *p) 449 - { 450 - return pkt_seq_show(m, p); 451 - } 452 - 453 - static int pkt_debugfs_fops_open(struct inode *inode, struct file *file) 454 - { 455 - return single_open(file, pkt_debugfs_seq_show, inode->i_private); 456 - } 457 - 458 - static const struct file_operations debug_fops = { 459 - .open = pkt_debugfs_fops_open, 460 - .read = seq_read, 461 - .llseek = seq_lseek, 462 - .release = single_release, 463 - .owner = THIS_MODULE, 464 - }; 465 - 466 - static void pkt_debugfs_dev_new(struct pktcdvd_device *pd) 467 - { 468 - if (!pkt_debugfs_root) 469 - return; 470 - pd->dfs_d_root = debugfs_create_dir(pd->name, pkt_debugfs_root); 471 - if (!pd->dfs_d_root) 472 - return; 473 - 474 - pd->dfs_f_info = debugfs_create_file("info", 0444, 475 - pd->dfs_d_root, pd, &debug_fops); 476 - } 477 - 478 - static void pkt_debugfs_dev_remove(struct pktcdvd_device *pd) 479 - { 480 - if (!pkt_debugfs_root) 481 - return; 482 - debugfs_remove(pd->dfs_f_info); 483 - debugfs_remove(pd->dfs_d_root); 
484 - pd->dfs_f_info = NULL; 485 - pd->dfs_d_root = NULL; 486 - } 487 - 488 - static void pkt_debugfs_init(void) 489 - { 490 - pkt_debugfs_root = debugfs_create_dir(DRIVER_NAME, NULL); 491 - } 492 - 493 - static void pkt_debugfs_cleanup(void) 494 - { 495 - debugfs_remove(pkt_debugfs_root); 496 - pkt_debugfs_root = NULL; 497 - } 498 - 499 - /* ----------------------------------------------------------*/ 500 - 501 - 502 - static void pkt_bio_finished(struct pktcdvd_device *pd) 503 - { 504 - BUG_ON(atomic_read(&pd->cdrw.pending_bios) <= 0); 505 - if (atomic_dec_and_test(&pd->cdrw.pending_bios)) { 506 - pkt_dbg(2, pd, "queue empty\n"); 507 - atomic_set(&pd->iosched.attention, 1); 508 - wake_up(&pd->wqueue); 509 - } 510 - } 511 - 512 - /* 513 - * Allocate a packet_data struct 514 - */ 515 - static struct packet_data *pkt_alloc_packet_data(int frames) 516 - { 517 - int i; 518 - struct packet_data *pkt; 519 - 520 - pkt = kzalloc(sizeof(struct packet_data), GFP_KERNEL); 521 - if (!pkt) 522 - goto no_pkt; 523 - 524 - pkt->frames = frames; 525 - pkt->w_bio = bio_kmalloc(frames, GFP_KERNEL); 526 - if (!pkt->w_bio) 527 - goto no_bio; 528 - 529 - for (i = 0; i < frames / FRAMES_PER_PAGE; i++) { 530 - pkt->pages[i] = alloc_page(GFP_KERNEL|__GFP_ZERO); 531 - if (!pkt->pages[i]) 532 - goto no_page; 533 - } 534 - 535 - spin_lock_init(&pkt->lock); 536 - bio_list_init(&pkt->orig_bios); 537 - 538 - for (i = 0; i < frames; i++) { 539 - pkt->r_bios[i] = bio_kmalloc(1, GFP_KERNEL); 540 - if (!pkt->r_bios[i]) 541 - goto no_rd_bio; 542 - } 543 - 544 - return pkt; 545 - 546 - no_rd_bio: 547 - for (i = 0; i < frames; i++) 548 - kfree(pkt->r_bios[i]); 549 - no_page: 550 - for (i = 0; i < frames / FRAMES_PER_PAGE; i++) 551 - if (pkt->pages[i]) 552 - __free_page(pkt->pages[i]); 553 - kfree(pkt->w_bio); 554 - no_bio: 555 - kfree(pkt); 556 - no_pkt: 557 - return NULL; 558 - } 559 - 560 - /* 561 - * Free a packet_data struct 562 - */ 563 - static void pkt_free_packet_data(struct packet_data *pkt) 
564 - { 565 - int i; 566 - 567 - for (i = 0; i < pkt->frames; i++) 568 - kfree(pkt->r_bios[i]); 569 - for (i = 0; i < pkt->frames / FRAMES_PER_PAGE; i++) 570 - __free_page(pkt->pages[i]); 571 - kfree(pkt->w_bio); 572 - kfree(pkt); 573 - } 574 - 575 - static void pkt_shrink_pktlist(struct pktcdvd_device *pd) 576 - { 577 - struct packet_data *pkt, *next; 578 - 579 - BUG_ON(!list_empty(&pd->cdrw.pkt_active_list)); 580 - 581 - list_for_each_entry_safe(pkt, next, &pd->cdrw.pkt_free_list, list) { 582 - pkt_free_packet_data(pkt); 583 - } 584 - INIT_LIST_HEAD(&pd->cdrw.pkt_free_list); 585 - } 586 - 587 - static int pkt_grow_pktlist(struct pktcdvd_device *pd, int nr_packets) 588 - { 589 - struct packet_data *pkt; 590 - 591 - BUG_ON(!list_empty(&pd->cdrw.pkt_free_list)); 592 - 593 - while (nr_packets > 0) { 594 - pkt = pkt_alloc_packet_data(pd->settings.size >> 2); 595 - if (!pkt) { 596 - pkt_shrink_pktlist(pd); 597 - return 0; 598 - } 599 - pkt->id = nr_packets; 600 - pkt->pd = pd; 601 - list_add(&pkt->list, &pd->cdrw.pkt_free_list); 602 - nr_packets--; 603 - } 604 - return 1; 605 - } 606 - 607 - static inline struct pkt_rb_node *pkt_rbtree_next(struct pkt_rb_node *node) 608 - { 609 - struct rb_node *n = rb_next(&node->rb_node); 610 - if (!n) 611 - return NULL; 612 - return rb_entry(n, struct pkt_rb_node, rb_node); 613 - } 614 - 615 - static void pkt_rbtree_erase(struct pktcdvd_device *pd, struct pkt_rb_node *node) 616 - { 617 - rb_erase(&node->rb_node, &pd->bio_queue); 618 - mempool_free(node, &pd->rb_pool); 619 - pd->bio_queue_size--; 620 - BUG_ON(pd->bio_queue_size < 0); 621 - } 622 - 623 - /* 624 - * Find the first node in the pd->bio_queue rb tree with a starting sector >= s. 
625 - */ 626 - static struct pkt_rb_node *pkt_rbtree_find(struct pktcdvd_device *pd, sector_t s) 627 - { 628 - struct rb_node *n = pd->bio_queue.rb_node; 629 - struct rb_node *next; 630 - struct pkt_rb_node *tmp; 631 - 632 - if (!n) { 633 - BUG_ON(pd->bio_queue_size > 0); 634 - return NULL; 635 - } 636 - 637 - for (;;) { 638 - tmp = rb_entry(n, struct pkt_rb_node, rb_node); 639 - if (s <= tmp->bio->bi_iter.bi_sector) 640 - next = n->rb_left; 641 - else 642 - next = n->rb_right; 643 - if (!next) 644 - break; 645 - n = next; 646 - } 647 - 648 - if (s > tmp->bio->bi_iter.bi_sector) { 649 - tmp = pkt_rbtree_next(tmp); 650 - if (!tmp) 651 - return NULL; 652 - } 653 - BUG_ON(s > tmp->bio->bi_iter.bi_sector); 654 - return tmp; 655 - } 656 - 657 - /* 658 - * Insert a node into the pd->bio_queue rb tree. 659 - */ 660 - static void pkt_rbtree_insert(struct pktcdvd_device *pd, struct pkt_rb_node *node) 661 - { 662 - struct rb_node **p = &pd->bio_queue.rb_node; 663 - struct rb_node *parent = NULL; 664 - sector_t s = node->bio->bi_iter.bi_sector; 665 - struct pkt_rb_node *tmp; 666 - 667 - while (*p) { 668 - parent = *p; 669 - tmp = rb_entry(parent, struct pkt_rb_node, rb_node); 670 - if (s < tmp->bio->bi_iter.bi_sector) 671 - p = &(*p)->rb_left; 672 - else 673 - p = &(*p)->rb_right; 674 - } 675 - rb_link_node(&node->rb_node, parent, p); 676 - rb_insert_color(&node->rb_node, &pd->bio_queue); 677 - pd->bio_queue_size++; 678 - } 679 - 680 - /* 681 - * Send a packet_command to the underlying block device and 682 - * wait for completion. 683 - */ 684 - static int pkt_generic_packet(struct pktcdvd_device *pd, struct packet_command *cgc) 685 - { 686 - struct request_queue *q = bdev_get_queue(pd->bdev); 687 - struct scsi_cmnd *scmd; 688 - struct request *rq; 689 - int ret = 0; 690 - 691 - rq = scsi_alloc_request(q, (cgc->data_direction == CGC_DATA_WRITE) ? 
692 - REQ_OP_DRV_OUT : REQ_OP_DRV_IN, 0); 693 - if (IS_ERR(rq)) 694 - return PTR_ERR(rq); 695 - scmd = blk_mq_rq_to_pdu(rq); 696 - 697 - if (cgc->buflen) { 698 - ret = blk_rq_map_kern(q, rq, cgc->buffer, cgc->buflen, 699 - GFP_NOIO); 700 - if (ret) 701 - goto out; 702 - } 703 - 704 - scmd->cmd_len = COMMAND_SIZE(cgc->cmd[0]); 705 - memcpy(scmd->cmnd, cgc->cmd, CDROM_PACKET_SIZE); 706 - 707 - rq->timeout = 60*HZ; 708 - if (cgc->quiet) 709 - rq->rq_flags |= RQF_QUIET; 710 - 711 - blk_execute_rq(rq, false); 712 - if (scmd->result) 713 - ret = -EIO; 714 - out: 715 - blk_mq_free_request(rq); 716 - return ret; 717 - } 718 - 719 - static const char *sense_key_string(__u8 index) 720 - { 721 - static const char * const info[] = { 722 - "No sense", "Recovered error", "Not ready", 723 - "Medium error", "Hardware error", "Illegal request", 724 - "Unit attention", "Data protect", "Blank check", 725 - }; 726 - 727 - return index < ARRAY_SIZE(info) ? info[index] : "INVALID"; 728 - } 729 - 730 - /* 731 - * A generic sense dump / resolve mechanism should be implemented across 732 - * all ATAPI + SCSI devices. 
733 - */ 734 - static void pkt_dump_sense(struct pktcdvd_device *pd, 735 - struct packet_command *cgc) 736 - { 737 - struct scsi_sense_hdr *sshdr = cgc->sshdr; 738 - 739 - if (sshdr) 740 - pkt_err(pd, "%*ph - sense %02x.%02x.%02x (%s)\n", 741 - CDROM_PACKET_SIZE, cgc->cmd, 742 - sshdr->sense_key, sshdr->asc, sshdr->ascq, 743 - sense_key_string(sshdr->sense_key)); 744 - else 745 - pkt_err(pd, "%*ph - no sense\n", CDROM_PACKET_SIZE, cgc->cmd); 746 - } 747 - 748 - /* 749 - * flush the drive cache to media 750 - */ 751 - static int pkt_flush_cache(struct pktcdvd_device *pd) 752 - { 753 - struct packet_command cgc; 754 - 755 - init_cdrom_command(&cgc, NULL, 0, CGC_DATA_NONE); 756 - cgc.cmd[0] = GPCMD_FLUSH_CACHE; 757 - cgc.quiet = 1; 758 - 759 - /* 760 - * the IMMED bit -- we default to not setting it, although that 761 - * would allow a much faster close, this is safer 762 - */ 763 - #if 0 764 - cgc.cmd[1] = 1 << 1; 765 - #endif 766 - return pkt_generic_packet(pd, &cgc); 767 - } 768 - 769 - /* 770 - * speed is given as the normal factor, e.g. 4 for 4x 771 - */ 772 - static noinline_for_stack int pkt_set_speed(struct pktcdvd_device *pd, 773 - unsigned write_speed, unsigned read_speed) 774 - { 775 - struct packet_command cgc; 776 - struct scsi_sense_hdr sshdr; 777 - int ret; 778 - 779 - init_cdrom_command(&cgc, NULL, 0, CGC_DATA_NONE); 780 - cgc.sshdr = &sshdr; 781 - cgc.cmd[0] = GPCMD_SET_SPEED; 782 - cgc.cmd[2] = (read_speed >> 8) & 0xff; 783 - cgc.cmd[3] = read_speed & 0xff; 784 - cgc.cmd[4] = (write_speed >> 8) & 0xff; 785 - cgc.cmd[5] = write_speed & 0xff; 786 - 787 - ret = pkt_generic_packet(pd, &cgc); 788 - if (ret) 789 - pkt_dump_sense(pd, &cgc); 790 - 791 - return ret; 792 - } 793 - 794 - /* 795 - * Queue a bio for processing by the low-level CD device. Must be called 796 - * from process context. 
797 - */ 798 - static void pkt_queue_bio(struct pktcdvd_device *pd, struct bio *bio) 799 - { 800 - spin_lock(&pd->iosched.lock); 801 - if (bio_data_dir(bio) == READ) 802 - bio_list_add(&pd->iosched.read_queue, bio); 803 - else 804 - bio_list_add(&pd->iosched.write_queue, bio); 805 - spin_unlock(&pd->iosched.lock); 806 - 807 - atomic_set(&pd->iosched.attention, 1); 808 - wake_up(&pd->wqueue); 809 - } 810 - 811 - /* 812 - * Process the queued read/write requests. This function handles special 813 - * requirements for CDRW drives: 814 - * - A cache flush command must be inserted before a read request if the 815 - * previous request was a write. 816 - * - Switching between reading and writing is slow, so don't do it more often 817 - * than necessary. 818 - * - Optimize for throughput at the expense of latency. This means that streaming 819 - * writes will never be interrupted by a read, but if the drive has to seek 820 - * before the next write, switch to reading instead if there are any pending 821 - * read requests. 822 - * - Set the read speed according to current usage pattern. When only reading 823 - * from the device, it's best to use the highest possible read speed, but 824 - * when switching often between reading and writing, it's better to have the 825 - * same read and write speeds. 
826 - */ 827 - static void pkt_iosched_process_queue(struct pktcdvd_device *pd) 828 - { 829 - 830 - if (atomic_read(&pd->iosched.attention) == 0) 831 - return; 832 - atomic_set(&pd->iosched.attention, 0); 833 - 834 - for (;;) { 835 - struct bio *bio; 836 - int reads_queued, writes_queued; 837 - 838 - spin_lock(&pd->iosched.lock); 839 - reads_queued = !bio_list_empty(&pd->iosched.read_queue); 840 - writes_queued = !bio_list_empty(&pd->iosched.write_queue); 841 - spin_unlock(&pd->iosched.lock); 842 - 843 - if (!reads_queued && !writes_queued) 844 - break; 845 - 846 - if (pd->iosched.writing) { 847 - int need_write_seek = 1; 848 - spin_lock(&pd->iosched.lock); 849 - bio = bio_list_peek(&pd->iosched.write_queue); 850 - spin_unlock(&pd->iosched.lock); 851 - if (bio && (bio->bi_iter.bi_sector == 852 - pd->iosched.last_write)) 853 - need_write_seek = 0; 854 - if (need_write_seek && reads_queued) { 855 - if (atomic_read(&pd->cdrw.pending_bios) > 0) { 856 - pkt_dbg(2, pd, "write, waiting\n"); 857 - break; 858 - } 859 - pkt_flush_cache(pd); 860 - pd->iosched.writing = 0; 861 - } 862 - } else { 863 - if (!reads_queued && writes_queued) { 864 - if (atomic_read(&pd->cdrw.pending_bios) > 0) { 865 - pkt_dbg(2, pd, "read, waiting\n"); 866 - break; 867 - } 868 - pd->iosched.writing = 1; 869 - } 870 - } 871 - 872 - spin_lock(&pd->iosched.lock); 873 - if (pd->iosched.writing) 874 - bio = bio_list_pop(&pd->iosched.write_queue); 875 - else 876 - bio = bio_list_pop(&pd->iosched.read_queue); 877 - spin_unlock(&pd->iosched.lock); 878 - 879 - if (!bio) 880 - continue; 881 - 882 - if (bio_data_dir(bio) == READ) 883 - pd->iosched.successive_reads += 884 - bio->bi_iter.bi_size >> 10; 885 - else { 886 - pd->iosched.successive_reads = 0; 887 - pd->iosched.last_write = bio_end_sector(bio); 888 - } 889 - if (pd->iosched.successive_reads >= HI_SPEED_SWITCH) { 890 - if (pd->read_speed == pd->write_speed) { 891 - pd->read_speed = MAX_SPEED; 892 - pkt_set_speed(pd, pd->write_speed, pd->read_speed); 
893 - } 894 - } else { 895 - if (pd->read_speed != pd->write_speed) { 896 - pd->read_speed = pd->write_speed; 897 - pkt_set_speed(pd, pd->write_speed, pd->read_speed); 898 - } 899 - } 900 - 901 - atomic_inc(&pd->cdrw.pending_bios); 902 - submit_bio_noacct(bio); 903 - } 904 - } 905 - 906 - /* 907 - * Special care is needed if the underlying block device has a small 908 - * max_phys_segments value. 909 - */ 910 - static int pkt_set_segment_merging(struct pktcdvd_device *pd, struct request_queue *q) 911 - { 912 - if ((pd->settings.size << 9) / CD_FRAMESIZE 913 - <= queue_max_segments(q)) { 914 - /* 915 - * The cdrom device can handle one segment/frame 916 - */ 917 - clear_bit(PACKET_MERGE_SEGS, &pd->flags); 918 - return 0; 919 - } else if ((pd->settings.size << 9) / PAGE_SIZE 920 - <= queue_max_segments(q)) { 921 - /* 922 - * We can handle this case at the expense of some extra memory 923 - * copies during write operations 924 - */ 925 - set_bit(PACKET_MERGE_SEGS, &pd->flags); 926 - return 0; 927 - } else { 928 - pkt_err(pd, "cdrom max_phys_segments too small\n"); 929 - return -EIO; 930 - } 931 - } 932 - 933 - static void pkt_end_io_read(struct bio *bio) 934 - { 935 - struct packet_data *pkt = bio->bi_private; 936 - struct pktcdvd_device *pd = pkt->pd; 937 - BUG_ON(!pd); 938 - 939 - pkt_dbg(2, pd, "bio=%p sec0=%llx sec=%llx err=%d\n", 940 - bio, (unsigned long long)pkt->sector, 941 - (unsigned long long)bio->bi_iter.bi_sector, bio->bi_status); 942 - 943 - if (bio->bi_status) 944 - atomic_inc(&pkt->io_errors); 945 - bio_uninit(bio); 946 - if (atomic_dec_and_test(&pkt->io_wait)) { 947 - atomic_inc(&pkt->run_sm); 948 - wake_up(&pd->wqueue); 949 - } 950 - pkt_bio_finished(pd); 951 - } 952 - 953 - static void pkt_end_io_packet_write(struct bio *bio) 954 - { 955 - struct packet_data *pkt = bio->bi_private; 956 - struct pktcdvd_device *pd = pkt->pd; 957 - BUG_ON(!pd); 958 - 959 - pkt_dbg(2, pd, "id=%d, err=%d\n", pkt->id, bio->bi_status); 960 - 961 - pd->stats.pkt_ended++; 
962 - 963 - bio_uninit(bio); 964 - pkt_bio_finished(pd); 965 - atomic_dec(&pkt->io_wait); 966 - atomic_inc(&pkt->run_sm); 967 - wake_up(&pd->wqueue); 968 - } 969 - 970 - /* 971 - * Schedule reads for the holes in a packet 972 - */ 973 - static void pkt_gather_data(struct pktcdvd_device *pd, struct packet_data *pkt) 974 - { 975 - int frames_read = 0; 976 - struct bio *bio; 977 - int f; 978 - char written[PACKET_MAX_SIZE]; 979 - 980 - BUG_ON(bio_list_empty(&pkt->orig_bios)); 981 - 982 - atomic_set(&pkt->io_wait, 0); 983 - atomic_set(&pkt->io_errors, 0); 984 - 985 - /* 986 - * Figure out which frames we need to read before we can write. 987 - */ 988 - memset(written, 0, sizeof(written)); 989 - spin_lock(&pkt->lock); 990 - bio_list_for_each(bio, &pkt->orig_bios) { 991 - int first_frame = (bio->bi_iter.bi_sector - pkt->sector) / 992 - (CD_FRAMESIZE >> 9); 993 - int num_frames = bio->bi_iter.bi_size / CD_FRAMESIZE; 994 - pd->stats.secs_w += num_frames * (CD_FRAMESIZE >> 9); 995 - BUG_ON(first_frame < 0); 996 - BUG_ON(first_frame + num_frames > pkt->frames); 997 - for (f = first_frame; f < first_frame + num_frames; f++) 998 - written[f] = 1; 999 - } 1000 - spin_unlock(&pkt->lock); 1001 - 1002 - if (pkt->cache_valid) { 1003 - pkt_dbg(2, pd, "zone %llx cached\n", 1004 - (unsigned long long)pkt->sector); 1005 - goto out_account; 1006 - } 1007 - 1008 - /* 1009 - * Schedule reads for missing parts of the packet. 
1010 - */ 1011 - for (f = 0; f < pkt->frames; f++) { 1012 - int p, offset; 1013 - 1014 - if (written[f]) 1015 - continue; 1016 - 1017 - bio = pkt->r_bios[f]; 1018 - bio_init(bio, pd->bdev, bio->bi_inline_vecs, 1, REQ_OP_READ); 1019 - bio->bi_iter.bi_sector = pkt->sector + f * (CD_FRAMESIZE >> 9); 1020 - bio->bi_end_io = pkt_end_io_read; 1021 - bio->bi_private = pkt; 1022 - 1023 - p = (f * CD_FRAMESIZE) / PAGE_SIZE; 1024 - offset = (f * CD_FRAMESIZE) % PAGE_SIZE; 1025 - pkt_dbg(2, pd, "Adding frame %d, page:%p offs:%d\n", 1026 - f, pkt->pages[p], offset); 1027 - if (!bio_add_page(bio, pkt->pages[p], CD_FRAMESIZE, offset)) 1028 - BUG(); 1029 - 1030 - atomic_inc(&pkt->io_wait); 1031 - pkt_queue_bio(pd, bio); 1032 - frames_read++; 1033 - } 1034 - 1035 - out_account: 1036 - pkt_dbg(2, pd, "need %d frames for zone %llx\n", 1037 - frames_read, (unsigned long long)pkt->sector); 1038 - pd->stats.pkt_started++; 1039 - pd->stats.secs_rg += frames_read * (CD_FRAMESIZE >> 9); 1040 - } 1041 - 1042 - /* 1043 - * Find a packet matching zone, or the least recently used packet if 1044 - * there is no match. 
1045 - */ 1046 - static struct packet_data *pkt_get_packet_data(struct pktcdvd_device *pd, int zone) 1047 - { 1048 - struct packet_data *pkt; 1049 - 1050 - list_for_each_entry(pkt, &pd->cdrw.pkt_free_list, list) { 1051 - if (pkt->sector == zone || pkt->list.next == &pd->cdrw.pkt_free_list) { 1052 - list_del_init(&pkt->list); 1053 - if (pkt->sector != zone) 1054 - pkt->cache_valid = 0; 1055 - return pkt; 1056 - } 1057 - } 1058 - BUG(); 1059 - return NULL; 1060 - } 1061 - 1062 - static void pkt_put_packet_data(struct pktcdvd_device *pd, struct packet_data *pkt) 1063 - { 1064 - if (pkt->cache_valid) { 1065 - list_add(&pkt->list, &pd->cdrw.pkt_free_list); 1066 - } else { 1067 - list_add_tail(&pkt->list, &pd->cdrw.pkt_free_list); 1068 - } 1069 - } 1070 - 1071 - static inline void pkt_set_state(struct packet_data *pkt, enum packet_data_state state) 1072 - { 1073 - #if PACKET_DEBUG > 1 1074 - static const char *state_name[] = { 1075 - "IDLE", "WAITING", "READ_WAIT", "WRITE_WAIT", "RECOVERY", "FINISHED" 1076 - }; 1077 - enum packet_data_state old_state = pkt->state; 1078 - pkt_dbg(2, pd, "pkt %2d : s=%6llx %s -> %s\n", 1079 - pkt->id, (unsigned long long)pkt->sector, 1080 - state_name[old_state], state_name[state]); 1081 - #endif 1082 - pkt->state = state; 1083 - } 1084 - 1085 - /* 1086 - * Scan the work queue to see if we can start a new packet. 1087 - * returns non-zero if any work was done. 1088 - */ 1089 - static int pkt_handle_queue(struct pktcdvd_device *pd) 1090 - { 1091 - struct packet_data *pkt, *p; 1092 - struct bio *bio = NULL; 1093 - sector_t zone = 0; /* Suppress gcc warning */ 1094 - struct pkt_rb_node *node, *first_node; 1095 - struct rb_node *n; 1096 - 1097 - atomic_set(&pd->scan_queue, 0); 1098 - 1099 - if (list_empty(&pd->cdrw.pkt_free_list)) { 1100 - pkt_dbg(2, pd, "no pkt\n"); 1101 - return 0; 1102 - } 1103 - 1104 - /* 1105 - * Try to find a zone we are not already working on. 
- */
-	spin_lock(&pd->lock);
-	first_node = pkt_rbtree_find(pd, pd->current_sector);
-	if (!first_node) {
-		n = rb_first(&pd->bio_queue);
-		if (n)
-			first_node = rb_entry(n, struct pkt_rb_node, rb_node);
-	}
-	node = first_node;
-	while (node) {
-		bio = node->bio;
-		zone = get_zone(bio->bi_iter.bi_sector, pd);
-		list_for_each_entry(p, &pd->cdrw.pkt_active_list, list) {
-			if (p->sector == zone) {
-				bio = NULL;
-				goto try_next_bio;
-			}
-		}
-		break;
-try_next_bio:
-		node = pkt_rbtree_next(node);
-		if (!node) {
-			n = rb_first(&pd->bio_queue);
-			if (n)
-				node = rb_entry(n, struct pkt_rb_node, rb_node);
-		}
-		if (node == first_node)
-			node = NULL;
-	}
-	spin_unlock(&pd->lock);
-	if (!bio) {
-		pkt_dbg(2, pd, "no bio\n");
-		return 0;
-	}
-
-	pkt = pkt_get_packet_data(pd, zone);
-
-	pd->current_sector = zone + pd->settings.size;
-	pkt->sector = zone;
-	BUG_ON(pkt->frames != pd->settings.size >> 2);
-	pkt->write_size = 0;
-
-	/*
-	 * Scan work queue for bios in the same zone and link them
-	 * to this packet.
-	 */
-	spin_lock(&pd->lock);
-	pkt_dbg(2, pd, "looking for zone %llx\n", (unsigned long long)zone);
-	while ((node = pkt_rbtree_find(pd, zone)) != NULL) {
-		bio = node->bio;
-		pkt_dbg(2, pd, "found zone=%llx\n", (unsigned long long)
-			get_zone(bio->bi_iter.bi_sector, pd));
-		if (get_zone(bio->bi_iter.bi_sector, pd) != zone)
-			break;
-		pkt_rbtree_erase(pd, node);
-		spin_lock(&pkt->lock);
-		bio_list_add(&pkt->orig_bios, bio);
-		pkt->write_size += bio->bi_iter.bi_size / CD_FRAMESIZE;
-		spin_unlock(&pkt->lock);
-	}
-	/* check write congestion marks, and if bio_queue_size is
-	 * below, wake up any waiters
-	 */
-	if (pd->congested &&
-	    pd->bio_queue_size <= pd->write_congestion_off) {
-		pd->congested = false;
-		wake_up_var(&pd->congested);
-	}
-	spin_unlock(&pd->lock);
-
-	pkt->sleep_time = max(PACKET_WAIT_TIME, 1);
-	pkt_set_state(pkt, PACKET_WAITING_STATE);
-	atomic_set(&pkt->run_sm, 1);
-
-	spin_lock(&pd->cdrw.active_list_lock);
-	list_add(&pkt->list, &pd->cdrw.pkt_active_list);
-	spin_unlock(&pd->cdrw.active_list_lock);
-
-	return 1;
-}
-
-/**
- * bio_list_copy_data - copy contents of data buffers from one chain of bios to
- * another
- * @src: source bio list
- * @dst: destination bio list
- *
- * Stops when it reaches the end of either the @src list or @dst list - that is,
- * copies min(src->bi_size, dst->bi_size) bytes (or the equivalent for lists of
- * bios).
- */
-static void bio_list_copy_data(struct bio *dst, struct bio *src)
-{
-	struct bvec_iter src_iter = src->bi_iter;
-	struct bvec_iter dst_iter = dst->bi_iter;
-
-	while (1) {
-		if (!src_iter.bi_size) {
-			src = src->bi_next;
-			if (!src)
-				break;
-
-			src_iter = src->bi_iter;
-		}
-
-		if (!dst_iter.bi_size) {
-			dst = dst->bi_next;
-			if (!dst)
-				break;
-
-			dst_iter = dst->bi_iter;
-		}
-
-		bio_copy_data_iter(dst, &dst_iter, src, &src_iter);
-	}
-}
-
-/*
- * Assemble a bio to write one packet and queue the bio for processing
- * by the underlying block device.
- */
-static void pkt_start_write(struct pktcdvd_device *pd, struct packet_data *pkt)
-{
-	int f;
-
-	bio_init(pkt->w_bio, pd->bdev, pkt->w_bio->bi_inline_vecs, pkt->frames,
-		 REQ_OP_WRITE);
-	pkt->w_bio->bi_iter.bi_sector = pkt->sector;
-	pkt->w_bio->bi_end_io = pkt_end_io_packet_write;
-	pkt->w_bio->bi_private = pkt;
-
-	/* XXX: locking? */
-	for (f = 0; f < pkt->frames; f++) {
-		struct page *page = pkt->pages[(f * CD_FRAMESIZE) / PAGE_SIZE];
-		unsigned offset = (f * CD_FRAMESIZE) % PAGE_SIZE;
-
-		if (!bio_add_page(pkt->w_bio, page, CD_FRAMESIZE, offset))
-			BUG();
-	}
-	pkt_dbg(2, pd, "vcnt=%d\n", pkt->w_bio->bi_vcnt);
-
-	/*
-	 * Fill-in bvec with data from orig_bios.
-	 */
-	spin_lock(&pkt->lock);
-	bio_list_copy_data(pkt->w_bio, pkt->orig_bios.head);
-
-	pkt_set_state(pkt, PACKET_WRITE_WAIT_STATE);
-	spin_unlock(&pkt->lock);
-
-	pkt_dbg(2, pd, "Writing %d frames for zone %llx\n",
-		pkt->write_size, (unsigned long long)pkt->sector);
-
-	if (test_bit(PACKET_MERGE_SEGS, &pd->flags) || (pkt->write_size < pkt->frames))
-		pkt->cache_valid = 1;
-	else
-		pkt->cache_valid = 0;
-
-	/* Start the write request */
-	atomic_set(&pkt->io_wait, 1);
-	pkt_queue_bio(pd, pkt->w_bio);
-}
-
-static void pkt_finish_packet(struct packet_data *pkt, blk_status_t status)
-{
-	struct bio *bio;
-
-	if (status)
-		pkt->cache_valid = 0;
-
-	/* Finish all bios corresponding to this packet */
-	while ((bio = bio_list_pop(&pkt->orig_bios))) {
-		bio->bi_status = status;
-		bio_endio(bio);
-	}
-}
-
-static void pkt_run_state_machine(struct pktcdvd_device *pd, struct packet_data *pkt)
-{
-	pkt_dbg(2, pd, "pkt %d\n", pkt->id);
-
-	for (;;) {
-		switch (pkt->state) {
-		case PACKET_WAITING_STATE:
-			if ((pkt->write_size < pkt->frames) && (pkt->sleep_time > 0))
-				return;
-
-			pkt->sleep_time = 0;
-			pkt_gather_data(pd, pkt);
-			pkt_set_state(pkt, PACKET_READ_WAIT_STATE);
-			break;
-
-		case PACKET_READ_WAIT_STATE:
-			if (atomic_read(&pkt->io_wait) > 0)
-				return;
-
-			if (atomic_read(&pkt->io_errors) > 0) {
-				pkt_set_state(pkt, PACKET_RECOVERY_STATE);
-			} else {
-				pkt_start_write(pd, pkt);
-			}
-			break;
-
-		case PACKET_WRITE_WAIT_STATE:
-			if (atomic_read(&pkt->io_wait) > 0)
-				return;
-
-			if (!pkt->w_bio->bi_status) {
-				pkt_set_state(pkt, PACKET_FINISHED_STATE);
-			} else {
-				pkt_set_state(pkt, PACKET_RECOVERY_STATE);
-			}
-			break;
-
-		case PACKET_RECOVERY_STATE:
-			pkt_dbg(2, pd, "No recovery possible\n");
-			pkt_set_state(pkt, PACKET_FINISHED_STATE);
-			break;
-
-		case PACKET_FINISHED_STATE:
-			pkt_finish_packet(pkt, pkt->w_bio->bi_status);
-			return;
-
-		default:
-			BUG();
-			break;
-		}
-	}
-}
-
-static void pkt_handle_packets(struct pktcdvd_device *pd)
-{
-	struct packet_data *pkt, *next;
-
-	/*
-	 * Run state machine for active packets
-	 */
-	list_for_each_entry(pkt, &pd->cdrw.pkt_active_list, list) {
-		if (atomic_read(&pkt->run_sm) > 0) {
-			atomic_set(&pkt->run_sm, 0);
-			pkt_run_state_machine(pd, pkt);
-		}
-	}
-
-	/*
-	 * Move no longer active packets to the free list
-	 */
-	spin_lock(&pd->cdrw.active_list_lock);
-	list_for_each_entry_safe(pkt, next, &pd->cdrw.pkt_active_list, list) {
-		if (pkt->state == PACKET_FINISHED_STATE) {
-			list_del(&pkt->list);
-			pkt_put_packet_data(pd, pkt);
-			pkt_set_state(pkt, PACKET_IDLE_STATE);
-			atomic_set(&pd->scan_queue, 1);
-		}
-	}
-	spin_unlock(&pd->cdrw.active_list_lock);
-}
-
-static void pkt_count_states(struct pktcdvd_device *pd, int *states)
-{
-	struct packet_data *pkt;
-	int i;
-
-	for (i = 0; i < PACKET_NUM_STATES; i++)
-		states[i] = 0;
-
-	spin_lock(&pd->cdrw.active_list_lock);
-	list_for_each_entry(pkt, &pd->cdrw.pkt_active_list, list) {
-		states[pkt->state]++;
-	}
-	spin_unlock(&pd->cdrw.active_list_lock);
-}
-
-/*
- * kcdrwd is woken up when writes have been queued for one of our
- * registered devices
- */
-static int kcdrwd(void *foobar)
-{
-	struct pktcdvd_device *pd = foobar;
-	struct packet_data *pkt;
-	long min_sleep_time, residue;
-
-	set_user_nice(current, MIN_NICE);
-	set_freezable();
-
-	for (;;) {
-		DECLARE_WAITQUEUE(wait, current);
-
-		/*
-		 * Wait until there is something to do
-		 */
-		add_wait_queue(&pd->wqueue, &wait);
-		for (;;) {
-			set_current_state(TASK_INTERRUPTIBLE);
-
-			/* Check if we need to run pkt_handle_queue */
-			if (atomic_read(&pd->scan_queue) > 0)
-				goto work_to_do;
-
-			/* Check if we need to run the state machine for some packet */
-			list_for_each_entry(pkt, &pd->cdrw.pkt_active_list, list) {
-				if (atomic_read(&pkt->run_sm) > 0)
-					goto work_to_do;
-			}
-
-			/* Check if we need to process the iosched queues */
-			if (atomic_read(&pd->iosched.attention) != 0)
-				goto work_to_do;
-
-			/* Otherwise, go to sleep */
-			if (PACKET_DEBUG > 1) {
-				int states[PACKET_NUM_STATES];
-				pkt_count_states(pd, states);
-				pkt_dbg(2, pd, "i:%d ow:%d rw:%d ww:%d rec:%d fin:%d\n",
-					states[0], states[1], states[2],
-					states[3], states[4], states[5]);
-			}
-
-			min_sleep_time = MAX_SCHEDULE_TIMEOUT;
-			list_for_each_entry(pkt, &pd->cdrw.pkt_active_list, list) {
-				if (pkt->sleep_time && pkt->sleep_time < min_sleep_time)
-					min_sleep_time = pkt->sleep_time;
-			}
-
-			pkt_dbg(2, pd, "sleeping\n");
-			residue = schedule_timeout(min_sleep_time);
-			pkt_dbg(2, pd, "wake up\n");
-
-			/* make swsusp happy with our thread */
-			try_to_freeze();
-
-			list_for_each_entry(pkt, &pd->cdrw.pkt_active_list, list) {
-				if (!pkt->sleep_time)
-					continue;
-				pkt->sleep_time -= min_sleep_time - residue;
-				if (pkt->sleep_time <= 0) {
-					pkt->sleep_time = 0;
-					atomic_inc(&pkt->run_sm);
-				}
-			}
-
-			if (kthread_should_stop())
-				break;
-		}
-work_to_do:
-		set_current_state(TASK_RUNNING);
-		remove_wait_queue(&pd->wqueue, &wait);
-
-		if (kthread_should_stop())
-			break;
-
-		/*
-		 * if pkt_handle_queue returns true, we can queue
-		 * another request.
-		 */
-		while (pkt_handle_queue(pd))
-			;
-
-		/*
-		 * Handle packet state machine
-		 */
-		pkt_handle_packets(pd);
-
-		/*
-		 * Handle iosched queues
-		 */
-		pkt_iosched_process_queue(pd);
-	}
-
-	return 0;
-}
-
-static void pkt_print_settings(struct pktcdvd_device *pd)
-{
-	pkt_info(pd, "%s packets, %u blocks, Mode-%c disc\n",
-		 pd->settings.fp ? "Fixed" : "Variable",
-		 pd->settings.size >> 2,
-		 pd->settings.block_mode == 8 ? '1' : '2');
-}
-
-static int pkt_mode_sense(struct pktcdvd_device *pd, struct packet_command *cgc, int page_code, int page_control)
-{
-	memset(cgc->cmd, 0, sizeof(cgc->cmd));
-
-	cgc->cmd[0] = GPCMD_MODE_SENSE_10;
-	cgc->cmd[2] = page_code | (page_control << 6);
-	cgc->cmd[7] = cgc->buflen >> 8;
-	cgc->cmd[8] = cgc->buflen & 0xff;
-	cgc->data_direction = CGC_DATA_READ;
-	return pkt_generic_packet(pd, cgc);
-}
-
-static int pkt_mode_select(struct pktcdvd_device *pd, struct packet_command *cgc)
-{
-	memset(cgc->cmd, 0, sizeof(cgc->cmd));
-	memset(cgc->buffer, 0, 2);
-	cgc->cmd[0] = GPCMD_MODE_SELECT_10;
-	cgc->cmd[1] = 0x10;		/* PF */
-	cgc->cmd[7] = cgc->buflen >> 8;
-	cgc->cmd[8] = cgc->buflen & 0xff;
-	cgc->data_direction = CGC_DATA_WRITE;
-	return pkt_generic_packet(pd, cgc);
-}
-
-static int pkt_get_disc_info(struct pktcdvd_device *pd, disc_information *di)
-{
-	struct packet_command cgc;
-	int ret;
-
-	/* set up command and get the disc info */
-	init_cdrom_command(&cgc, di, sizeof(*di), CGC_DATA_READ);
-	cgc.cmd[0] = GPCMD_READ_DISC_INFO;
-	cgc.cmd[8] = cgc.buflen = 2;
-	cgc.quiet = 1;
-
-	ret = pkt_generic_packet(pd, &cgc);
-	if (ret)
-		return ret;
-
-	/* not all drives have the same disc_info length, so requeue
-	 * packet with the length the drive tells us it can supply
-	 */
-	cgc.buflen = be16_to_cpu(di->disc_information_length) +
-		     sizeof(di->disc_information_length);
-
-	if (cgc.buflen > sizeof(disc_information))
-		cgc.buflen = sizeof(disc_information);
-
-	cgc.cmd[8] = cgc.buflen;
-	return pkt_generic_packet(pd, &cgc);
-}
-
-static int pkt_get_track_info(struct pktcdvd_device *pd, __u16 track, __u8 type, track_information *ti)
-{
-	struct packet_command cgc;
-	int ret;
-
-	init_cdrom_command(&cgc, ti, 8, CGC_DATA_READ);
-	cgc.cmd[0] = GPCMD_READ_TRACK_RZONE_INFO;
-	cgc.cmd[1] = type & 3;
-	cgc.cmd[4] = (track & 0xff00) >> 8;
-	cgc.cmd[5] = track & 0xff;
-	cgc.cmd[8] = 8;
-	cgc.quiet = 1;
-
-	ret = pkt_generic_packet(pd, &cgc);
-	if (ret)
-		return ret;
-
-	cgc.buflen = be16_to_cpu(ti->track_information_length) +
-		     sizeof(ti->track_information_length);
-
-	if (cgc.buflen > sizeof(track_information))
-		cgc.buflen = sizeof(track_information);
-
-	cgc.cmd[8] = cgc.buflen;
-	return pkt_generic_packet(pd, &cgc);
-}
-
-static noinline_for_stack int pkt_get_last_written(struct pktcdvd_device *pd,
-						   long *last_written)
-{
-	disc_information di;
-	track_information ti;
-	__u32 last_track;
-	int ret;
-
-	ret = pkt_get_disc_info(pd, &di);
-	if (ret)
-		return ret;
-
-	last_track = (di.last_track_msb << 8) | di.last_track_lsb;
-	ret = pkt_get_track_info(pd, last_track, 1, &ti);
-	if (ret)
-		return ret;
-
-	/* if this track is blank, try the previous. */
-	if (ti.blank) {
-		last_track--;
-		ret = pkt_get_track_info(pd, last_track, 1, &ti);
-		if (ret)
-			return ret;
-	}
-
-	/* if last recorded field is valid, return it. */
-	if (ti.lra_v) {
-		*last_written = be32_to_cpu(ti.last_rec_address);
-	} else {
-		/* make it up instead */
-		*last_written = be32_to_cpu(ti.track_start) +
-				be32_to_cpu(ti.track_size);
-		if (ti.free_blocks)
-			*last_written -= (be32_to_cpu(ti.free_blocks) + 7);
-	}
-	return 0;
-}
-
-/*
- * write mode select package based on pd->settings
- */
-static noinline_for_stack int pkt_set_write_settings(struct pktcdvd_device *pd)
-{
-	struct packet_command cgc;
-	struct scsi_sense_hdr sshdr;
-	write_param_page *wp;
-	char buffer[128];
-	int ret, size;
-
-	/* doesn't apply to DVD+RW or DVD-RAM */
-	if ((pd->mmc3_profile == 0x1a) || (pd->mmc3_profile == 0x12))
-		return 0;
-
-	memset(buffer, 0, sizeof(buffer));
-	init_cdrom_command(&cgc, buffer, sizeof(*wp), CGC_DATA_READ);
-	cgc.sshdr = &sshdr;
-	ret = pkt_mode_sense(pd, &cgc, GPMODE_WRITE_PARMS_PAGE, 0);
-	if (ret) {
-		pkt_dump_sense(pd, &cgc);
-		return ret;
-	}
-
-	size = 2 + ((buffer[0] << 8) | (buffer[1] & 0xff));
-	pd->mode_offset = (buffer[6] << 8) | (buffer[7] & 0xff);
-	if (size > sizeof(buffer))
-		size = sizeof(buffer);
-
-	/*
-	 * now get it all
-	 */
-	init_cdrom_command(&cgc, buffer, size, CGC_DATA_READ);
-	cgc.sshdr = &sshdr;
-	ret = pkt_mode_sense(pd, &cgc, GPMODE_WRITE_PARMS_PAGE, 0);
-	if (ret) {
-		pkt_dump_sense(pd, &cgc);
-		return ret;
-	}
-
-	/*
-	 * write page is offset header + block descriptor length
-	 */
-	wp = (write_param_page *) &buffer[sizeof(struct mode_page_header) + pd->mode_offset];
-
-	wp->fp = pd->settings.fp;
-	wp->track_mode = pd->settings.track_mode;
-	wp->write_type = pd->settings.write_type;
-	wp->data_block_type = pd->settings.block_mode;
-
-	wp->multi_session = 0;
-
-#ifdef PACKET_USE_LS
-	wp->link_size = 7;
-	wp->ls_v = 1;
-#endif
-
-	if (wp->data_block_type == PACKET_BLOCK_MODE1) {
-		wp->session_format = 0;
-		wp->subhdr2 = 0x20;
-	} else if (wp->data_block_type == PACKET_BLOCK_MODE2) {
-		wp->session_format = 0x20;
-		wp->subhdr2 = 8;
-#if 0
-		wp->mcn[0] = 0x80;
-		memcpy(&wp->mcn[1], PACKET_MCN, sizeof(wp->mcn) - 1);
-#endif
-	} else {
-		/*
-		 * paranoia
-		 */
-		pkt_err(pd, "write mode wrong %d\n", wp->data_block_type);
-		return 1;
-	}
-	wp->packet_size = cpu_to_be32(pd->settings.size >> 2);
-
-	cgc.buflen = cgc.cmd[8] = size;
-	ret = pkt_mode_select(pd, &cgc);
-	if (ret) {
-		pkt_dump_sense(pd, &cgc);
-		return ret;
-	}
-
-	pkt_print_settings(pd);
-	return 0;
-}
-
-/*
- * 1 -- we can write to this track, 0 -- we can't
- */
-static int pkt_writable_track(struct pktcdvd_device *pd, track_information *ti)
-{
-	switch (pd->mmc3_profile) {
-	case 0x1a: /* DVD+RW */
-	case 0x12: /* DVD-RAM */
-		/* The track is always writable on DVD+RW/DVD-RAM */
-		return 1;
-	default:
-		break;
-	}
-
-	if (!ti->packet || !ti->fp)
-		return 0;
-
-	/*
-	 * "good" settings as per Mt Fuji.
-	 */
-	if (ti->rt == 0 && ti->blank == 0)
-		return 1;
-
-	if (ti->rt == 0 && ti->blank == 1)
-		return 1;
-
-	if (ti->rt == 1 && ti->blank == 0)
-		return 1;
-
-	pkt_err(pd, "bad state %d-%d-%d\n", ti->rt, ti->blank, ti->packet);
-	return 0;
-}
-
-/*
- * 1 -- we can write to this disc, 0 -- we can't
- */
-static int pkt_writable_disc(struct pktcdvd_device *pd, disc_information *di)
-{
-	switch (pd->mmc3_profile) {
-	case 0x0a: /* CD-RW */
-	case 0xffff: /* MMC3 not supported */
-		break;
-	case 0x1a: /* DVD+RW */
-	case 0x13: /* DVD-RW */
-	case 0x12: /* DVD-RAM */
-		return 1;
-	default:
-		pkt_dbg(2, pd, "Wrong disc profile (%x)\n",
-			pd->mmc3_profile);
-		return 0;
-	}
-
-	/*
-	 * for disc type 0xff we should probably reserve a new track.
-	 * but i'm not sure, should we leave this to user apps? probably.
-	 */
-	if (di->disc_type == 0xff) {
-		pkt_notice(pd, "unknown disc - no track?\n");
-		return 0;
-	}
-
-	if (di->disc_type != 0x20 && di->disc_type != 0) {
-		pkt_err(pd, "wrong disc type (%x)\n", di->disc_type);
-		return 0;
-	}
-
-	if (di->erasable == 0) {
-		pkt_notice(pd, "disc not erasable\n");
-		return 0;
-	}
-
-	if (di->border_status == PACKET_SESSION_RESERVED) {
-		pkt_err(pd, "can't write to last track (reserved)\n");
-		return 0;
-	}
-
-	return 1;
-}
-
-static noinline_for_stack int pkt_probe_settings(struct pktcdvd_device *pd)
-{
-	struct packet_command cgc;
-	unsigned char buf[12];
-	disc_information di;
-	track_information ti;
-	int ret, track;
-
-	init_cdrom_command(&cgc, buf, sizeof(buf), CGC_DATA_READ);
-	cgc.cmd[0] = GPCMD_GET_CONFIGURATION;
-	cgc.cmd[8] = 8;
-	ret = pkt_generic_packet(pd, &cgc);
-	pd->mmc3_profile = ret ? 0xffff : buf[6] << 8 | buf[7];
-
-	memset(&di, 0, sizeof(disc_information));
-	memset(&ti, 0, sizeof(track_information));
-
-	ret = pkt_get_disc_info(pd, &di);
-	if (ret) {
-		pkt_err(pd, "failed get_disc\n");
-		return ret;
-	}
-
-	if (!pkt_writable_disc(pd, &di))
-		return -EROFS;
-
-	pd->type = di.erasable ? PACKET_CDRW : PACKET_CDR;
-
-	track = 1; /* (di.last_track_msb << 8) | di.last_track_lsb; */
-	ret = pkt_get_track_info(pd, track, 1, &ti);
-	if (ret) {
-		pkt_err(pd, "failed get_track\n");
-		return ret;
-	}
-
-	if (!pkt_writable_track(pd, &ti)) {
-		pkt_err(pd, "can't write to this track\n");
-		return -EROFS;
-	}
-
-	/*
-	 * we keep packet size in 512 byte units, makes it easier to
-	 * deal with request calculations.
-	 */
-	pd->settings.size = be32_to_cpu(ti.fixed_packet_size) << 2;
-	if (pd->settings.size == 0) {
-		pkt_notice(pd, "detected zero packet size!\n");
-		return -ENXIO;
-	}
-	if (pd->settings.size > PACKET_MAX_SECTORS) {
-		pkt_err(pd, "packet size is too big\n");
-		return -EROFS;
-	}
-	pd->settings.fp = ti.fp;
-	pd->offset = (be32_to_cpu(ti.track_start) << 2) & (pd->settings.size - 1);
-
-	if (ti.nwa_v) {
-		pd->nwa = be32_to_cpu(ti.next_writable);
-		set_bit(PACKET_NWA_VALID, &pd->flags);
-	}
-
-	/*
-	 * in theory we could use lra on -RW media as well and just zero
-	 * blocks that haven't been written yet, but in practice that
-	 * is just a no-go. we'll use that for -R, naturally.
-	 */
-	if (ti.lra_v) {
-		pd->lra = be32_to_cpu(ti.last_rec_address);
-		set_bit(PACKET_LRA_VALID, &pd->flags);
-	} else {
-		pd->lra = 0xffffffff;
-		set_bit(PACKET_LRA_VALID, &pd->flags);
-	}
-
-	/*
-	 * fine for now
-	 */
-	pd->settings.link_loss = 7;
-	pd->settings.write_type = 0;	/* packet */
-	pd->settings.track_mode = ti.track_mode;
-
-	/*
-	 * mode1 or mode2 disc
-	 */
-	switch (ti.data_mode) {
-	case PACKET_MODE1:
-		pd->settings.block_mode = PACKET_BLOCK_MODE1;
-		break;
-	case PACKET_MODE2:
-		pd->settings.block_mode = PACKET_BLOCK_MODE2;
-		break;
-	default:
-		pkt_err(pd, "unknown data mode\n");
-		return -EROFS;
-	}
-	return 0;
-}
-
-/*
- * enable/disable write caching on drive
- */
-static noinline_for_stack int pkt_write_caching(struct pktcdvd_device *pd,
-						int set)
-{
-	struct packet_command cgc;
-	struct scsi_sense_hdr sshdr;
-	unsigned char buf[64];
-	int ret;
-
-	init_cdrom_command(&cgc, buf, sizeof(buf), CGC_DATA_READ);
-	cgc.sshdr = &sshdr;
-	cgc.buflen = pd->mode_offset + 12;
-
-	/*
-	 * caching mode page might not be there, so quiet this command
-	 */
-	cgc.quiet = 1;
-
-	ret = pkt_mode_sense(pd, &cgc, GPMODE_WCACHING_PAGE, 0);
-	if (ret)
-		return ret;
-
-	buf[pd->mode_offset + 10] |= (!!set << 2);
-
-	cgc.buflen = cgc.cmd[8] = 2 + ((buf[0] << 8) | (buf[1] & 0xff));
-	ret = pkt_mode_select(pd, &cgc);
-	if (ret) {
-		pkt_err(pd, "write caching control failed\n");
-		pkt_dump_sense(pd, &cgc);
-	} else if (!ret && set)
-		pkt_notice(pd, "enabled write caching\n");
-	return ret;
-}
-
-static int pkt_lock_door(struct pktcdvd_device *pd, int lockflag)
-{
-	struct packet_command cgc;
-
-	init_cdrom_command(&cgc, NULL, 0, CGC_DATA_NONE);
-	cgc.cmd[0] = GPCMD_PREVENT_ALLOW_MEDIUM_REMOVAL;
-	cgc.cmd[4] = lockflag ? 1 : 0;
-	return pkt_generic_packet(pd, &cgc);
-}
-
-/*
- * Returns drive maximum write speed
- */
-static noinline_for_stack int pkt_get_max_speed(struct pktcdvd_device *pd,
-						unsigned *write_speed)
-{
-	struct packet_command cgc;
-	struct scsi_sense_hdr sshdr;
-	unsigned char buf[256+18];
-	unsigned char *cap_buf;
-	int ret, offset;
-
-	cap_buf = &buf[sizeof(struct mode_page_header) + pd->mode_offset];
-	init_cdrom_command(&cgc, buf, sizeof(buf), CGC_DATA_UNKNOWN);
-	cgc.sshdr = &sshdr;
-
-	ret = pkt_mode_sense(pd, &cgc, GPMODE_CAPABILITIES_PAGE, 0);
-	if (ret) {
-		cgc.buflen = pd->mode_offset + cap_buf[1] + 2 +
-			     sizeof(struct mode_page_header);
-		ret = pkt_mode_sense(pd, &cgc, GPMODE_CAPABILITIES_PAGE, 0);
-		if (ret) {
-			pkt_dump_sense(pd, &cgc);
-			return ret;
-		}
-	}
-
-	offset = 20;	/* Obsoleted field, used by older drives */
-	if (cap_buf[1] >= 28)
-		offset = 28;	/* Current write speed selected */
-	if (cap_buf[1] >= 30) {
-		/* If the drive reports at least one "Logical Unit Write
-		 * Speed Performance Descriptor Block", use the information
-		 * in the first block. (contains the highest speed)
-		 */
-		int num_spdb = (cap_buf[30] << 8) + cap_buf[31];
-		if (num_spdb > 0)
-			offset = 34;
-	}
-
-	*write_speed = (cap_buf[offset] << 8) | cap_buf[offset + 1];
-	return 0;
-}
-
-/* These tables from cdrecord - I don't have orange book */
-/* standard speed CD-RW (1-4x) */
-static char clv_to_speed[16] = {
-	/* 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 */
-	   0, 2, 4, 6, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
-};
-/* high speed CD-RW (-10x) */
-static char hs_clv_to_speed[16] = {
-	/* 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 */
-	   0, 2, 4, 6, 10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
-};
-/* ultra high speed CD-RW */
-static char us_clv_to_speed[16] = {
-	/* 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 */
-	   0, 2, 4, 8, 0, 0, 16, 0, 24, 32, 40, 48, 0, 0, 0, 0
-};
-
-/*
- * reads the maximum media speed from ATIP
- */
-static noinline_for_stack int pkt_media_speed(struct pktcdvd_device *pd,
-					      unsigned *speed)
-{
-	struct packet_command cgc;
-	struct scsi_sense_hdr sshdr;
-	unsigned char buf[64];
-	unsigned int size, st, sp;
-	int ret;
-
-	init_cdrom_command(&cgc, buf, 2, CGC_DATA_READ);
-	cgc.sshdr = &sshdr;
-	cgc.cmd[0] = GPCMD_READ_TOC_PMA_ATIP;
-	cgc.cmd[1] = 2;
-	cgc.cmd[2] = 4; /* READ ATIP */
-	cgc.cmd[8] = 2;
-	ret = pkt_generic_packet(pd, &cgc);
-	if (ret) {
-		pkt_dump_sense(pd, &cgc);
-		return ret;
-	}
-	size = ((unsigned int) buf[0]<<8) + buf[1] + 2;
-	if (size > sizeof(buf))
-		size = sizeof(buf);
-
-	init_cdrom_command(&cgc, buf, size, CGC_DATA_READ);
-	cgc.sshdr = &sshdr;
-	cgc.cmd[0] = GPCMD_READ_TOC_PMA_ATIP;
-	cgc.cmd[1] = 2;
-	cgc.cmd[2] = 4;
-	cgc.cmd[8] = size;
-	ret = pkt_generic_packet(pd, &cgc);
-	if (ret) {
-		pkt_dump_sense(pd, &cgc);
-		return ret;
-	}
-
-	if (!(buf[6] & 0x40)) {
-		pkt_notice(pd, "disc type is not CD-RW\n");
-		return 1;
-	}
-	if (!(buf[6] & 0x4)) {
-		pkt_notice(pd, "A1 values on media are not valid, maybe not CDRW?\n");
-		return 1;
-	}
-
-	st = (buf[6] >> 3) & 0x7; /* disc sub-type */
-
-	sp = buf[16] & 0xf; /* max speed from ATIP A1 field */
-
-	/* Info from cdrecord */
-	switch (st) {
-	case 0: /* standard speed */
-		*speed = clv_to_speed[sp];
-		break;
-	case 1: /* high speed */
-		*speed = hs_clv_to_speed[sp];
-		break;
-	case 2: /* ultra high speed */
-		*speed = us_clv_to_speed[sp];
-		break;
-	default:
-		pkt_notice(pd, "unknown disc sub-type %d\n", st);
-		return 1;
-	}
-	if (*speed) {
-		pkt_info(pd, "maximum media speed: %d\n", *speed);
-		return 0;
-	} else {
-		pkt_notice(pd, "unknown speed %d for sub-type %d\n", sp, st);
-		return 1;
-	}
-}
-
-static noinline_for_stack int pkt_perform_opc(struct pktcdvd_device *pd)
-{
-	struct packet_command cgc;
-	struct scsi_sense_hdr sshdr;
-	int ret;
-
-	pkt_dbg(2, pd, "Performing OPC\n");
-
-	init_cdrom_command(&cgc, NULL, 0, CGC_DATA_NONE);
-	cgc.sshdr = &sshdr;
-	cgc.timeout = 60*HZ;
-	cgc.cmd[0] = GPCMD_SEND_OPC;
-	cgc.cmd[1] = 1;
-	ret = pkt_generic_packet(pd, &cgc);
-	if (ret)
-		pkt_dump_sense(pd, &cgc);
-	return ret;
-}
-
-static int pkt_open_write(struct pktcdvd_device *pd)
-{
-	int ret;
-	unsigned int write_speed, media_write_speed, read_speed;
-
-	ret = pkt_probe_settings(pd);
-	if (ret) {
-		pkt_dbg(2, pd, "failed probe\n");
-		return ret;
-	}
-
-	ret = pkt_set_write_settings(pd);
-	if (ret) {
-		pkt_dbg(1, pd, "failed saving write settings\n");
-		return -EIO;
-	}
-
-	pkt_write_caching(pd, USE_WCACHING);
-
-	ret = pkt_get_max_speed(pd, &write_speed);
-	if (ret)
-		write_speed = 16 * 177;
-	switch (pd->mmc3_profile) {
-	case 0x13: /* DVD-RW */
-	case 0x1a: /* DVD+RW */
-	case 0x12: /* DVD-RAM */
-		pkt_dbg(1, pd, "write speed %ukB/s\n", write_speed);
-		break;
-	default:
-		ret = pkt_media_speed(pd, &media_write_speed);
-		if (ret)
-			media_write_speed = 16;
-		write_speed = min(write_speed, media_write_speed * 177);
-		pkt_dbg(1, pd, "write speed %ux\n", write_speed / 176);
-		break;
-	}
-	read_speed = write_speed;
-
-	ret = pkt_set_speed(pd, write_speed, read_speed);
-	if (ret) {
-		pkt_dbg(1, pd, "couldn't set write speed\n");
-		return -EIO;
-	}
-	pd->write_speed = write_speed;
-	pd->read_speed = read_speed;
-
-	ret = pkt_perform_opc(pd);
-	if (ret) {
-		pkt_dbg(1, pd, "Optimum Power Calibration failed\n");
-	}
-
-	return 0;
-}
-
-/*
- * called at open time.
- */
-static int pkt_open_dev(struct pktcdvd_device *pd, fmode_t write)
-{
-	int ret;
-	long lba;
-	struct request_queue *q;
-	struct block_device *bdev;
-
-	/*
-	 * We need to re-open the cdrom device without O_NONBLOCK to be able
-	 * to read/write from/to it. It is already opened in O_NONBLOCK mode
-	 * so open should not fail.
-	 */
-	bdev = blkdev_get_by_dev(pd->bdev->bd_dev, FMODE_READ | FMODE_EXCL, pd);
-	if (IS_ERR(bdev)) {
-		ret = PTR_ERR(bdev);
-		goto out;
-	}
-
-	ret = pkt_get_last_written(pd, &lba);
-	if (ret) {
-		pkt_err(pd, "pkt_get_last_written failed\n");
-		goto out_putdev;
-	}
-
-	set_capacity(pd->disk, lba << 2);
-	set_capacity_and_notify(pd->bdev->bd_disk, lba << 2);
-
-	q = bdev_get_queue(pd->bdev);
-	if (write) {
-		ret = pkt_open_write(pd);
-		if (ret)
-			goto out_putdev;
-		/*
-		 * Some CDRW drives can not handle writes larger than one packet,
-		 * even if the size is a multiple of the packet size.
-		 */
-		blk_queue_max_hw_sectors(q, pd->settings.size);
-		set_bit(PACKET_WRITABLE, &pd->flags);
-	} else {
-		pkt_set_speed(pd, MAX_SPEED, MAX_SPEED);
-		clear_bit(PACKET_WRITABLE, &pd->flags);
-	}
-
-	ret = pkt_set_segment_merging(pd, q);
-	if (ret)
-		goto out_putdev;
-
-	if (write) {
-		if (!pkt_grow_pktlist(pd, CONFIG_CDROM_PKTCDVD_BUFFERS)) {
-			pkt_err(pd, "not enough memory for buffers\n");
-			ret = -ENOMEM;
-			goto out_putdev;
-		}
-		pkt_info(pd, "%lukB available on disc\n", lba << 1);
-	}
-
-	return 0;
-
-out_putdev:
-	blkdev_put(bdev, FMODE_READ | FMODE_EXCL);
-out:
-	return ret;
-}
-
-/*
- * called when the device is closed. makes sure that the device flushes
- * the internal cache before we close.
- */
-static void pkt_release_dev(struct pktcdvd_device *pd, int flush)
-{
-	if (flush && pkt_flush_cache(pd))
-		pkt_dbg(1, pd, "not flushing cache\n");
-
-	pkt_lock_door(pd, 0);
-
-	pkt_set_speed(pd, MAX_SPEED, MAX_SPEED);
-	blkdev_put(pd->bdev, FMODE_READ | FMODE_EXCL);
-
-	pkt_shrink_pktlist(pd);
-}
-
-static struct pktcdvd_device *pkt_find_dev_from_minor(unsigned int dev_minor)
-{
-	if (dev_minor >= MAX_WRITERS)
-		return NULL;
-
-	dev_minor = array_index_nospec(dev_minor, MAX_WRITERS);
-	return pkt_devs[dev_minor];
-}
-
-static int pkt_open(struct block_device *bdev, fmode_t mode)
-{
-	struct pktcdvd_device *pd = NULL;
-	int ret;
-
-	mutex_lock(&pktcdvd_mutex);
-	mutex_lock(&ctl_mutex);
-	pd = pkt_find_dev_from_minor(MINOR(bdev->bd_dev));
-	if (!pd) {
-		ret = -ENODEV;
-		goto out;
-	}
-	BUG_ON(pd->refcnt < 0);
-
-	pd->refcnt++;
-	if (pd->refcnt > 1) {
-		if ((mode & FMODE_WRITE) &&
-		    !test_bit(PACKET_WRITABLE, &pd->flags)) {
-			ret = -EBUSY;
-			goto out_dec;
-		}
-	} else {
-		ret = pkt_open_dev(pd, mode & FMODE_WRITE);
-		if (ret)
-			goto out_dec;
-		/*
-		 * needed here as well, since ext2 (among others) may change
-		 * the blocksize at mount time
-		 */
-		set_blocksize(bdev, CD_FRAMESIZE);
-	}
-
-	mutex_unlock(&ctl_mutex);
-	mutex_unlock(&pktcdvd_mutex);
-	return 0;
-
-out_dec:
-	pd->refcnt--;
-out:
-	mutex_unlock(&ctl_mutex);
-	mutex_unlock(&pktcdvd_mutex);
-	return ret;
-}
-
-static void pkt_close(struct gendisk *disk, fmode_t mode)
-{
-	struct pktcdvd_device *pd = disk->private_data;
-
-	mutex_lock(&pktcdvd_mutex);
-	mutex_lock(&ctl_mutex);
-	pd->refcnt--;
-	BUG_ON(pd->refcnt < 0);
-	if (pd->refcnt == 0) {
-		int flush = test_bit(PACKET_WRITABLE, &pd->flags);
-		pkt_release_dev(pd, flush);
-	}
-	mutex_unlock(&ctl_mutex);
-	mutex_unlock(&pktcdvd_mutex);
-}
-
-
-static void pkt_end_io_read_cloned(struct bio *bio)
-{
-	struct packet_stacked_data *psd = bio->bi_private;
-	struct pktcdvd_device *pd = psd->pd;
-
-	psd->bio->bi_status = bio->bi_status;
-	bio_put(bio);
-	bio_endio(psd->bio);
-	mempool_free(psd, &psd_pool);
-	pkt_bio_finished(pd);
-}
-
-static void pkt_make_request_read(struct pktcdvd_device *pd, struct bio *bio)
-{
-	struct bio *cloned_bio =
-		bio_alloc_clone(pd->bdev, bio, GFP_NOIO, &pkt_bio_set);
-	struct packet_stacked_data *psd = mempool_alloc(&psd_pool, GFP_NOIO);
-
-	psd->pd = pd;
-	psd->bio = bio;
-	cloned_bio->bi_private = psd;
-	cloned_bio->bi_end_io = pkt_end_io_read_cloned;
-	pd->stats.secs_r += bio_sectors(bio);
-	pkt_queue_bio(pd, cloned_bio);
-}
-
-static void pkt_make_request_write(struct request_queue *q, struct bio *bio)
-{
-	struct pktcdvd_device *pd = q->queuedata;
-	sector_t zone;
-	struct packet_data *pkt;
-	int was_empty, blocked_bio;
-	struct pkt_rb_node *node;
-
-	zone = get_zone(bio->bi_iter.bi_sector, pd);
-
-	/*
-	 * If we find a matching packet in state WAITING or READ_WAIT, we can
-	 * just append this bio to that packet.
-	 */
-	spin_lock(&pd->cdrw.active_list_lock);
-	blocked_bio = 0;
-	list_for_each_entry(pkt, &pd->cdrw.pkt_active_list, list) {
-		if (pkt->sector == zone) {
-			spin_lock(&pkt->lock);
-			if ((pkt->state == PACKET_WAITING_STATE) ||
-			    (pkt->state == PACKET_READ_WAIT_STATE)) {
-				bio_list_add(&pkt->orig_bios, bio);
-				pkt->write_size +=
-					bio->bi_iter.bi_size / CD_FRAMESIZE;
-				if ((pkt->write_size >= pkt->frames) &&
-				    (pkt->state == PACKET_WAITING_STATE)) {
-					atomic_inc(&pkt->run_sm);
-					wake_up(&pd->wqueue);
-				}
-				spin_unlock(&pkt->lock);
-				spin_unlock(&pd->cdrw.active_list_lock);
-				return;
-			} else {
-				blocked_bio = 1;
-			}
-			spin_unlock(&pkt->lock);
-		}
-	}
-	spin_unlock(&pd->cdrw.active_list_lock);
-
-	/*
-	 * Test if there is enough room left in the bio work queue
-	 * (queue size >= congestion on mark).
-	 * If not, wait till the work queue size is below the congestion off mark.
-	 */
-	spin_lock(&pd->lock);
-	if (pd->write_congestion_on > 0
-	    && pd->bio_queue_size >= pd->write_congestion_on) {
-		struct wait_bit_queue_entry wqe;
-
-		init_wait_var_entry(&wqe, &pd->congested, 0);
-		for (;;) {
-			prepare_to_wait_event(__var_waitqueue(&pd->congested),
-					      &wqe.wq_entry,
-					      TASK_UNINTERRUPTIBLE);
-			if (pd->bio_queue_size <= pd->write_congestion_off)
-				break;
-			pd->congested = true;
-			spin_unlock(&pd->lock);
-			schedule();
-			spin_lock(&pd->lock);
-		}
-	}
-	spin_unlock(&pd->lock);
-
-	/*
-	 * No matching packet found. Store the bio in the work queue.
-	 */
-	node = mempool_alloc(&pd->rb_pool, GFP_NOIO);
-	node->bio = bio;
-	spin_lock(&pd->lock);
-	BUG_ON(pd->bio_queue_size < 0);
-	was_empty = (pd->bio_queue_size == 0);
-	pkt_rbtree_insert(pd, node);
-	spin_unlock(&pd->lock);
-
-	/*
-	 * Wake up the worker thread.
-	 */
-	atomic_set(&pd->scan_queue, 1);
-	if (was_empty) {
-		/* This wake_up is required for correct operation */
-		wake_up(&pd->wqueue);
-	} else if (!list_empty(&pd->cdrw.pkt_free_list) && !blocked_bio) {
-		/*
-		 * This wake up is not required for correct operation,
-		 * but improves performance in some cases.
-		 */
-		wake_up(&pd->wqueue);
-	}
-}
-
-static void pkt_submit_bio(struct bio *bio)
-{
-	struct pktcdvd_device *pd = bio->bi_bdev->bd_disk->queue->queuedata;
-	struct bio *split;
-
-	bio = bio_split_to_limits(bio);
-
-	pkt_dbg(2, pd, "start = %6llx stop = %6llx\n",
-		(unsigned long long)bio->bi_iter.bi_sector,
-		(unsigned long long)bio_end_sector(bio));
-
-	/*
-	 * Clone READ bios so we can have our own bi_end_io callback.
2410 - */ 2411 - if (bio_data_dir(bio) == READ) { 2412 - pkt_make_request_read(pd, bio); 2413 - return; 2414 - } 2415 - 2416 - if (!test_bit(PACKET_WRITABLE, &pd->flags)) { 2417 - pkt_notice(pd, "WRITE for ro device (%llu)\n", 2418 - (unsigned long long)bio->bi_iter.bi_sector); 2419 - goto end_io; 2420 - } 2421 - 2422 - if (!bio->bi_iter.bi_size || (bio->bi_iter.bi_size % CD_FRAMESIZE)) { 2423 - pkt_err(pd, "wrong bio size\n"); 2424 - goto end_io; 2425 - } 2426 - 2427 - do { 2428 - sector_t zone = get_zone(bio->bi_iter.bi_sector, pd); 2429 - sector_t last_zone = get_zone(bio_end_sector(bio) - 1, pd); 2430 - 2431 - if (last_zone != zone) { 2432 - BUG_ON(last_zone != zone + pd->settings.size); 2433 - 2434 - split = bio_split(bio, last_zone - 2435 - bio->bi_iter.bi_sector, 2436 - GFP_NOIO, &pkt_bio_set); 2437 - bio_chain(split, bio); 2438 - } else { 2439 - split = bio; 2440 - } 2441 - 2442 - pkt_make_request_write(bio->bi_bdev->bd_disk->queue, split); 2443 - } while (split != bio); 2444 - 2445 - return; 2446 - end_io: 2447 - bio_io_error(bio); 2448 - } 2449 - 2450 - static void pkt_init_queue(struct pktcdvd_device *pd) 2451 - { 2452 - struct request_queue *q = pd->disk->queue; 2453 - 2454 - blk_queue_logical_block_size(q, CD_FRAMESIZE); 2455 - blk_queue_max_hw_sectors(q, PACKET_MAX_SECTORS); 2456 - q->queuedata = pd; 2457 - } 2458 - 2459 - static int pkt_seq_show(struct seq_file *m, void *p) 2460 - { 2461 - struct pktcdvd_device *pd = m->private; 2462 - char *msg; 2463 - int states[PACKET_NUM_STATES]; 2464 - 2465 - seq_printf(m, "Writer %s mapped to %pg:\n", pd->name, pd->bdev); 2466 - 2467 - seq_printf(m, "\nSettings:\n"); 2468 - seq_printf(m, "\tpacket size:\t\t%dkB\n", pd->settings.size / 2); 2469 - 2470 - if (pd->settings.write_type == 0) 2471 - msg = "Packet"; 2472 - else 2473 - msg = "Unknown"; 2474 - seq_printf(m, "\twrite type:\t\t%s\n", msg); 2475 - 2476 - seq_printf(m, "\tpacket type:\t\t%s\n", pd->settings.fp ? 
"Fixed" : "Variable"); 2477 - seq_printf(m, "\tlink loss:\t\t%d\n", pd->settings.link_loss); 2478 - 2479 - seq_printf(m, "\ttrack mode:\t\t%d\n", pd->settings.track_mode); 2480 - 2481 - if (pd->settings.block_mode == PACKET_BLOCK_MODE1) 2482 - msg = "Mode 1"; 2483 - else if (pd->settings.block_mode == PACKET_BLOCK_MODE2) 2484 - msg = "Mode 2"; 2485 - else 2486 - msg = "Unknown"; 2487 - seq_printf(m, "\tblock mode:\t\t%s\n", msg); 2488 - 2489 - seq_printf(m, "\nStatistics:\n"); 2490 - seq_printf(m, "\tpackets started:\t%lu\n", pd->stats.pkt_started); 2491 - seq_printf(m, "\tpackets ended:\t\t%lu\n", pd->stats.pkt_ended); 2492 - seq_printf(m, "\twritten:\t\t%lukB\n", pd->stats.secs_w >> 1); 2493 - seq_printf(m, "\tread gather:\t\t%lukB\n", pd->stats.secs_rg >> 1); 2494 - seq_printf(m, "\tread:\t\t\t%lukB\n", pd->stats.secs_r >> 1); 2495 - 2496 - seq_printf(m, "\nMisc:\n"); 2497 - seq_printf(m, "\treference count:\t%d\n", pd->refcnt); 2498 - seq_printf(m, "\tflags:\t\t\t0x%lx\n", pd->flags); 2499 - seq_printf(m, "\tread speed:\t\t%ukB/s\n", pd->read_speed); 2500 - seq_printf(m, "\twrite speed:\t\t%ukB/s\n", pd->write_speed); 2501 - seq_printf(m, "\tstart offset:\t\t%lu\n", pd->offset); 2502 - seq_printf(m, "\tmode page offset:\t%u\n", pd->mode_offset); 2503 - 2504 - seq_printf(m, "\nQueue state:\n"); 2505 - seq_printf(m, "\tbios queued:\t\t%d\n", pd->bio_queue_size); 2506 - seq_printf(m, "\tbios pending:\t\t%d\n", atomic_read(&pd->cdrw.pending_bios)); 2507 - seq_printf(m, "\tcurrent sector:\t\t0x%llx\n", (unsigned long long)pd->current_sector); 2508 - 2509 - pkt_count_states(pd, states); 2510 - seq_printf(m, "\tstate:\t\t\ti:%d ow:%d rw:%d ww:%d rec:%d fin:%d\n", 2511 - states[0], states[1], states[2], states[3], states[4], states[5]); 2512 - 2513 - seq_printf(m, "\twrite congestion marks:\toff=%d on=%d\n", 2514 - pd->write_congestion_off, 2515 - pd->write_congestion_on); 2516 - return 0; 2517 - } 2518 - 2519 - static int pkt_new_dev(struct pktcdvd_device *pd, dev_t 
dev) 2520 - { 2521 - int i; 2522 - struct block_device *bdev; 2523 - struct scsi_device *sdev; 2524 - 2525 - if (pd->pkt_dev == dev) { 2526 - pkt_err(pd, "recursive setup not allowed\n"); 2527 - return -EBUSY; 2528 - } 2529 - for (i = 0; i < MAX_WRITERS; i++) { 2530 - struct pktcdvd_device *pd2 = pkt_devs[i]; 2531 - if (!pd2) 2532 - continue; 2533 - if (pd2->bdev->bd_dev == dev) { 2534 - pkt_err(pd, "%pg already setup\n", pd2->bdev); 2535 - return -EBUSY; 2536 - } 2537 - if (pd2->pkt_dev == dev) { 2538 - pkt_err(pd, "can't chain pktcdvd devices\n"); 2539 - return -EBUSY; 2540 - } 2541 - } 2542 - 2543 - bdev = blkdev_get_by_dev(dev, FMODE_READ | FMODE_NDELAY, NULL); 2544 - if (IS_ERR(bdev)) 2545 - return PTR_ERR(bdev); 2546 - sdev = scsi_device_from_queue(bdev->bd_disk->queue); 2547 - if (!sdev) { 2548 - blkdev_put(bdev, FMODE_READ | FMODE_NDELAY); 2549 - return -EINVAL; 2550 - } 2551 - put_device(&sdev->sdev_gendev); 2552 - 2553 - /* This is safe, since we have a reference from open(). */ 2554 - __module_get(THIS_MODULE); 2555 - 2556 - pd->bdev = bdev; 2557 - set_blocksize(bdev, CD_FRAMESIZE); 2558 - 2559 - pkt_init_queue(pd); 2560 - 2561 - atomic_set(&pd->cdrw.pending_bios, 0); 2562 - pd->cdrw.thread = kthread_run(kcdrwd, pd, "%s", pd->name); 2563 - if (IS_ERR(pd->cdrw.thread)) { 2564 - pkt_err(pd, "can't start kernel thread\n"); 2565 - goto out_mem; 2566 - } 2567 - 2568 - proc_create_single_data(pd->name, 0, pkt_proc, pkt_seq_show, pd); 2569 - pkt_dbg(1, pd, "writer mapped to %pg\n", bdev); 2570 - return 0; 2571 - 2572 - out_mem: 2573 - blkdev_put(bdev, FMODE_READ | FMODE_NDELAY); 2574 - /* This is safe: open() is still holding a reference. 
*/ 2575 - module_put(THIS_MODULE); 2576 - return -ENOMEM; 2577 - } 2578 - 2579 - static int pkt_ioctl(struct block_device *bdev, fmode_t mode, unsigned int cmd, unsigned long arg) 2580 - { 2581 - struct pktcdvd_device *pd = bdev->bd_disk->private_data; 2582 - int ret; 2583 - 2584 - pkt_dbg(2, pd, "cmd %x, dev %d:%d\n", 2585 - cmd, MAJOR(bdev->bd_dev), MINOR(bdev->bd_dev)); 2586 - 2587 - mutex_lock(&pktcdvd_mutex); 2588 - switch (cmd) { 2589 - case CDROMEJECT: 2590 - /* 2591 - * The door gets locked when the device is opened, so we 2592 - * have to unlock it or else the eject command fails. 2593 - */ 2594 - if (pd->refcnt == 1) 2595 - pkt_lock_door(pd, 0); 2596 - fallthrough; 2597 - /* 2598 - * forward selected CDROM ioctls to CD-ROM, for UDF 2599 - */ 2600 - case CDROMMULTISESSION: 2601 - case CDROMREADTOCENTRY: 2602 - case CDROM_LAST_WRITTEN: 2603 - case CDROM_SEND_PACKET: 2604 - case SCSI_IOCTL_SEND_COMMAND: 2605 - if (!bdev->bd_disk->fops->ioctl) 2606 - ret = -ENOTTY; 2607 - else 2608 - ret = bdev->bd_disk->fops->ioctl(bdev, mode, cmd, arg); 2609 - break; 2610 - default: 2611 - pkt_dbg(2, pd, "Unknown ioctl (%x)\n", cmd); 2612 - ret = -ENOTTY; 2613 - } 2614 - mutex_unlock(&pktcdvd_mutex); 2615 - 2616 - return ret; 2617 - } 2618 - 2619 - static unsigned int pkt_check_events(struct gendisk *disk, 2620 - unsigned int clearing) 2621 - { 2622 - struct pktcdvd_device *pd = disk->private_data; 2623 - struct gendisk *attached_disk; 2624 - 2625 - if (!pd) 2626 - return 0; 2627 - if (!pd->bdev) 2628 - return 0; 2629 - attached_disk = pd->bdev->bd_disk; 2630 - if (!attached_disk || !attached_disk->fops->check_events) 2631 - return 0; 2632 - return attached_disk->fops->check_events(attached_disk, clearing); 2633 - } 2634 - 2635 - static char *pkt_devnode(struct gendisk *disk, umode_t *mode) 2636 - { 2637 - return kasprintf(GFP_KERNEL, "pktcdvd/%s", disk->disk_name); 2638 - } 2639 - 2640 - static const struct block_device_operations pktcdvd_ops = { 2641 - .owner = 
THIS_MODULE, 2642 - .submit_bio = pkt_submit_bio, 2643 - .open = pkt_open, 2644 - .release = pkt_close, 2645 - .ioctl = pkt_ioctl, 2646 - .compat_ioctl = blkdev_compat_ptr_ioctl, 2647 - .check_events = pkt_check_events, 2648 - .devnode = pkt_devnode, 2649 - }; 2650 - 2651 - /* 2652 - * Set up mapping from pktcdvd device to CD-ROM device. 2653 - */ 2654 - static int pkt_setup_dev(dev_t dev, dev_t* pkt_dev) 2655 - { 2656 - int idx; 2657 - int ret = -ENOMEM; 2658 - struct pktcdvd_device *pd; 2659 - struct gendisk *disk; 2660 - 2661 - mutex_lock_nested(&ctl_mutex, SINGLE_DEPTH_NESTING); 2662 - 2663 - for (idx = 0; idx < MAX_WRITERS; idx++) 2664 - if (!pkt_devs[idx]) 2665 - break; 2666 - if (idx == MAX_WRITERS) { 2667 - pr_err("max %d writers supported\n", MAX_WRITERS); 2668 - ret = -EBUSY; 2669 - goto out_mutex; 2670 - } 2671 - 2672 - pd = kzalloc(sizeof(struct pktcdvd_device), GFP_KERNEL); 2673 - if (!pd) 2674 - goto out_mutex; 2675 - 2676 - ret = mempool_init_kmalloc_pool(&pd->rb_pool, PKT_RB_POOL_SIZE, 2677 - sizeof(struct pkt_rb_node)); 2678 - if (ret) 2679 - goto out_mem; 2680 - 2681 - INIT_LIST_HEAD(&pd->cdrw.pkt_free_list); 2682 - INIT_LIST_HEAD(&pd->cdrw.pkt_active_list); 2683 - spin_lock_init(&pd->cdrw.active_list_lock); 2684 - 2685 - spin_lock_init(&pd->lock); 2686 - spin_lock_init(&pd->iosched.lock); 2687 - bio_list_init(&pd->iosched.read_queue); 2688 - bio_list_init(&pd->iosched.write_queue); 2689 - sprintf(pd->name, DRIVER_NAME"%d", idx); 2690 - init_waitqueue_head(&pd->wqueue); 2691 - pd->bio_queue = RB_ROOT; 2692 - 2693 - pd->write_congestion_on = write_congestion_on; 2694 - pd->write_congestion_off = write_congestion_off; 2695 - 2696 - ret = -ENOMEM; 2697 - disk = blk_alloc_disk(NUMA_NO_NODE); 2698 - if (!disk) 2699 - goto out_mem; 2700 - pd->disk = disk; 2701 - disk->major = pktdev_major; 2702 - disk->first_minor = idx; 2703 - disk->minors = 1; 2704 - disk->fops = &pktcdvd_ops; 2705 - disk->flags = GENHD_FL_REMOVABLE | GENHD_FL_NO_PART; 2706 - 
strcpy(disk->disk_name, pd->name); 2707 - disk->private_data = pd; 2708 - 2709 - pd->pkt_dev = MKDEV(pktdev_major, idx); 2710 - ret = pkt_new_dev(pd, dev); 2711 - if (ret) 2712 - goto out_mem2; 2713 - 2714 - /* inherit events of the host device */ 2715 - disk->events = pd->bdev->bd_disk->events; 2716 - 2717 - ret = add_disk(disk); 2718 - if (ret) 2719 - goto out_mem2; 2720 - 2721 - pkt_sysfs_dev_new(pd); 2722 - pkt_debugfs_dev_new(pd); 2723 - 2724 - pkt_devs[idx] = pd; 2725 - if (pkt_dev) 2726 - *pkt_dev = pd->pkt_dev; 2727 - 2728 - mutex_unlock(&ctl_mutex); 2729 - return 0; 2730 - 2731 - out_mem2: 2732 - put_disk(disk); 2733 - out_mem: 2734 - mempool_exit(&pd->rb_pool); 2735 - kfree(pd); 2736 - out_mutex: 2737 - mutex_unlock(&ctl_mutex); 2738 - pr_err("setup of pktcdvd device failed\n"); 2739 - return ret; 2740 - } 2741 - 2742 - /* 2743 - * Tear down mapping from pktcdvd device to CD-ROM device. 2744 - */ 2745 - static int pkt_remove_dev(dev_t pkt_dev) 2746 - { 2747 - struct pktcdvd_device *pd; 2748 - int idx; 2749 - int ret = 0; 2750 - 2751 - mutex_lock_nested(&ctl_mutex, SINGLE_DEPTH_NESTING); 2752 - 2753 - for (idx = 0; idx < MAX_WRITERS; idx++) { 2754 - pd = pkt_devs[idx]; 2755 - if (pd && (pd->pkt_dev == pkt_dev)) 2756 - break; 2757 - } 2758 - if (idx == MAX_WRITERS) { 2759 - pr_debug("dev not setup\n"); 2760 - ret = -ENXIO; 2761 - goto out; 2762 - } 2763 - 2764 - if (pd->refcnt > 0) { 2765 - ret = -EBUSY; 2766 - goto out; 2767 - } 2768 - if (!IS_ERR(pd->cdrw.thread)) 2769 - kthread_stop(pd->cdrw.thread); 2770 - 2771 - pkt_devs[idx] = NULL; 2772 - 2773 - pkt_debugfs_dev_remove(pd); 2774 - pkt_sysfs_dev_remove(pd); 2775 - 2776 - blkdev_put(pd->bdev, FMODE_READ | FMODE_NDELAY); 2777 - 2778 - remove_proc_entry(pd->name, pkt_proc); 2779 - pkt_dbg(1, pd, "writer unmapped\n"); 2780 - 2781 - del_gendisk(pd->disk); 2782 - put_disk(pd->disk); 2783 - 2784 - mempool_exit(&pd->rb_pool); 2785 - kfree(pd); 2786 - 2787 - /* This is safe: open() is still holding a reference. 
*/ 2788 - module_put(THIS_MODULE); 2789 - 2790 - out: 2791 - mutex_unlock(&ctl_mutex); 2792 - return ret; 2793 - } 2794 - 2795 - static void pkt_get_status(struct pkt_ctrl_command *ctrl_cmd) 2796 - { 2797 - struct pktcdvd_device *pd; 2798 - 2799 - mutex_lock_nested(&ctl_mutex, SINGLE_DEPTH_NESTING); 2800 - 2801 - pd = pkt_find_dev_from_minor(ctrl_cmd->dev_index); 2802 - if (pd) { 2803 - ctrl_cmd->dev = new_encode_dev(pd->bdev->bd_dev); 2804 - ctrl_cmd->pkt_dev = new_encode_dev(pd->pkt_dev); 2805 - } else { 2806 - ctrl_cmd->dev = 0; 2807 - ctrl_cmd->pkt_dev = 0; 2808 - } 2809 - ctrl_cmd->num_devices = MAX_WRITERS; 2810 - 2811 - mutex_unlock(&ctl_mutex); 2812 - } 2813 - 2814 - static long pkt_ctl_ioctl(struct file *file, unsigned int cmd, unsigned long arg) 2815 - { 2816 - void __user *argp = (void __user *)arg; 2817 - struct pkt_ctrl_command ctrl_cmd; 2818 - int ret = 0; 2819 - dev_t pkt_dev = 0; 2820 - 2821 - if (cmd != PACKET_CTRL_CMD) 2822 - return -ENOTTY; 2823 - 2824 - if (copy_from_user(&ctrl_cmd, argp, sizeof(struct pkt_ctrl_command))) 2825 - return -EFAULT; 2826 - 2827 - switch (ctrl_cmd.command) { 2828 - case PKT_CTRL_CMD_SETUP: 2829 - if (!capable(CAP_SYS_ADMIN)) 2830 - return -EPERM; 2831 - ret = pkt_setup_dev(new_decode_dev(ctrl_cmd.dev), &pkt_dev); 2832 - ctrl_cmd.pkt_dev = new_encode_dev(pkt_dev); 2833 - break; 2834 - case PKT_CTRL_CMD_TEARDOWN: 2835 - if (!capable(CAP_SYS_ADMIN)) 2836 - return -EPERM; 2837 - ret = pkt_remove_dev(new_decode_dev(ctrl_cmd.pkt_dev)); 2838 - break; 2839 - case PKT_CTRL_CMD_STATUS: 2840 - pkt_get_status(&ctrl_cmd); 2841 - break; 2842 - default: 2843 - return -ENOTTY; 2844 - } 2845 - 2846 - if (copy_to_user(argp, &ctrl_cmd, sizeof(struct pkt_ctrl_command))) 2847 - return -EFAULT; 2848 - return ret; 2849 - } 2850 - 2851 - #ifdef CONFIG_COMPAT 2852 - static long pkt_ctl_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg) 2853 - { 2854 - return pkt_ctl_ioctl(file, cmd, (unsigned long)compat_ptr(arg)); 2855 - } 
2856 - #endif 2857 - 2858 - static const struct file_operations pkt_ctl_fops = { 2859 - .open = nonseekable_open, 2860 - .unlocked_ioctl = pkt_ctl_ioctl, 2861 - #ifdef CONFIG_COMPAT 2862 - .compat_ioctl = pkt_ctl_compat_ioctl, 2863 - #endif 2864 - .owner = THIS_MODULE, 2865 - .llseek = no_llseek, 2866 - }; 2867 - 2868 - static struct miscdevice pkt_misc = { 2869 - .minor = MISC_DYNAMIC_MINOR, 2870 - .name = DRIVER_NAME, 2871 - .nodename = "pktcdvd/control", 2872 - .fops = &pkt_ctl_fops 2873 - }; 2874 - 2875 - static int __init pkt_init(void) 2876 - { 2877 - int ret; 2878 - 2879 - mutex_init(&ctl_mutex); 2880 - 2881 - ret = mempool_init_kmalloc_pool(&psd_pool, PSD_POOL_SIZE, 2882 - sizeof(struct packet_stacked_data)); 2883 - if (ret) 2884 - return ret; 2885 - ret = bioset_init(&pkt_bio_set, BIO_POOL_SIZE, 0, 0); 2886 - if (ret) { 2887 - mempool_exit(&psd_pool); 2888 - return ret; 2889 - } 2890 - 2891 - ret = register_blkdev(pktdev_major, DRIVER_NAME); 2892 - if (ret < 0) { 2893 - pr_err("unable to register block device\n"); 2894 - goto out2; 2895 - } 2896 - if (!pktdev_major) 2897 - pktdev_major = ret; 2898 - 2899 - ret = pkt_sysfs_init(); 2900 - if (ret) 2901 - goto out; 2902 - 2903 - pkt_debugfs_init(); 2904 - 2905 - ret = misc_register(&pkt_misc); 2906 - if (ret) { 2907 - pr_err("unable to register misc device\n"); 2908 - goto out_misc; 2909 - } 2910 - 2911 - pkt_proc = proc_mkdir("driver/"DRIVER_NAME, NULL); 2912 - 2913 - return 0; 2914 - 2915 - out_misc: 2916 - pkt_debugfs_cleanup(); 2917 - pkt_sysfs_cleanup(); 2918 - out: 2919 - unregister_blkdev(pktdev_major, DRIVER_NAME); 2920 - out2: 2921 - mempool_exit(&psd_pool); 2922 - bioset_exit(&pkt_bio_set); 2923 - return ret; 2924 - } 2925 - 2926 - static void __exit pkt_exit(void) 2927 - { 2928 - remove_proc_entry("driver/"DRIVER_NAME, NULL); 2929 - misc_deregister(&pkt_misc); 2930 - 2931 - pkt_debugfs_cleanup(); 2932 - pkt_sysfs_cleanup(); 2933 - 2934 - unregister_blkdev(pktdev_major, DRIVER_NAME); 2935 - 
mempool_exit(&psd_pool); 2936 - bioset_exit(&pkt_bio_set); 2937 - } 2938 - 2939 - MODULE_DESCRIPTION("Packet writing layer for CD/DVD drives"); 2940 - MODULE_AUTHOR("Jens Axboe <axboe@suse.de>"); 2941 - MODULE_LICENSE("GPL"); 2942 - 2943 - module_init(pkt_init); 2944 - module_exit(pkt_exit);
+4 -4
drivers/block/virtio_blk.c
···
 {
 	struct virtio_blk *vblk = disk->private_data;

-	ida_simple_remove(&vd_index_ida, vblk->index);
+	ida_free(&vd_index_ida, vblk->index);
 	mutex_destroy(&vblk->vdev_mutex);
 	kfree(vblk);
 }
···
 		return -EINVAL;
 	}

-	err = ida_simple_get(&vd_index_ida, 0, minor_to_index(1 << MINORBITS),
-			     GFP_KERNEL);
+	err = ida_alloc_range(&vd_index_ida, 0,
+			      minor_to_index(1 << MINORBITS) - 1, GFP_KERNEL);
 	if (err < 0)
 		goto out;
 	index = err;
···
 out_free_vblk:
 	kfree(vblk);
 out_free_index:
-	ida_simple_remove(&vd_index_ida, index);
+	ida_free(&vd_index_ida, index);
 out:
 	return err;
 }
-1
drivers/block/xen-blkfront.c
···
 	if (info->rq && info->gd) {
 		blk_mq_stop_hw_queues(info->rq);
 		blk_mark_disk_dead(info->gd);
-		set_capacity(info->gd, 0);
 	}

 	for_each_rinfo(info, rinfo, i) {
+1 -1
drivers/md/bcache/movinggc.c
···
 	moving_init(io);
 	bio = &io->bio.bio;

-	bio_set_op_attrs(bio, REQ_OP_READ, 0);
+	bio->bi_opf = REQ_OP_READ;
 	bio->bi_end_io = read_moving_endio;

 	if (bch_bio_alloc_pages(bio, GFP_KERNEL))
+1 -1
drivers/md/bcache/request.c
···
 		trace_bcache_cache_insert(k);
 		bch_keylist_push(&op->insert_keys);

-		bio_set_op_attrs(n, REQ_OP_WRITE, 0);
+		n->bi_opf = REQ_OP_WRITE;
 		bch_submit_bbio(n, op->c, k, 0);
 	} while (n != bio);

+2 -2
drivers/md/bcache/writeback.c
···
 	 */
 	if (KEY_DIRTY(&w->key)) {
 		dirty_init(w);
-		bio_set_op_attrs(&io->bio, REQ_OP_WRITE, 0);
+		io->bio.bi_opf = REQ_OP_WRITE;
 		io->bio.bi_iter.bi_sector = KEY_START(&w->key);
 		bio_set_dev(&io->bio, io->dc->bdev);
 		io->bio.bi_end_io = dirty_endio;
···
 		io->sequence = sequence++;

 		dirty_init(w);
-		bio_set_op_attrs(&io->bio, REQ_OP_READ, 0);
+		io->bio.bi_opf = REQ_OP_READ;
 		io->bio.bi_iter.bi_sector = PTR_OFFSET(&w->key, 0);
 		bio_set_dev(&io->bio, dc->disk.c->cache->bdev);
 		io->bio.bi_end_io = read_dirty_endio;
+1 -1
drivers/md/dm-table.c
···
 	struct dm_keyslot_evict_args *args = data;
 	int err;

-	err = blk_crypto_evict_key(bdev_get_queue(dev->bdev), args->key);
+	err = blk_crypto_evict_key(dev->bdev, args->key);
 	if (!args->err)
 		args->err = err;
 	/* Always try to evict the key from all devices. */
+1 -1
drivers/md/dm-thin.c
···
 	 * need to wait for the chain to complete.
 	 */
 	bio_chain(op->bio, op->parent_bio);
-	bio_set_op_attrs(op->bio, REQ_OP_DISCARD, 0);
+	op->bio->bi_opf = REQ_OP_DISCARD;
 	submit_bio(op->bio);
 }

+82 -58
drivers/md/dm.c
···
 /*
  * Open a table device so we can use it as a map destination.
  */
-static int open_table_device(struct table_device *td, dev_t dev,
-			     struct mapped_device *md)
+static struct table_device *open_table_device(struct mapped_device *md,
+		dev_t dev, fmode_t mode)
 {
+	struct table_device *td;
 	struct block_device *bdev;
 	u64 part_off;
 	int r;

-	BUG_ON(td->dm_dev.bdev);
+	td = kmalloc_node(sizeof(*td), GFP_KERNEL, md->numa_node_id);
+	if (!td)
+		return ERR_PTR(-ENOMEM);
+	refcount_set(&td->count, 1);

-	bdev = blkdev_get_by_dev(dev, td->dm_dev.mode | FMODE_EXCL, _dm_claim_ptr);
-	if (IS_ERR(bdev))
-		return PTR_ERR(bdev);
-
-	r = bd_link_disk_holder(bdev, dm_disk(md));
-	if (r) {
-		blkdev_put(bdev, td->dm_dev.mode | FMODE_EXCL);
-		return r;
+	bdev = blkdev_get_by_dev(dev, mode | FMODE_EXCL, _dm_claim_ptr);
+	if (IS_ERR(bdev)) {
+		r = PTR_ERR(bdev);
+		goto out_free_td;
 	}

+	/*
+	 * We can be called before the dm disk is added.  In that case we can't
+	 * register the holder relation here.  It will be done once add_disk was
+	 * called.
+	 */
+	if (md->disk->slave_dir) {
+		r = bd_link_disk_holder(bdev, md->disk);
+		if (r)
+			goto out_blkdev_put;
+	}
+
+	td->dm_dev.mode = mode;
 	td->dm_dev.bdev = bdev;
 	td->dm_dev.dax_dev = fs_dax_get_by_bdev(bdev, &part_off, NULL, NULL);
-	return 0;
+	format_dev_t(td->dm_dev.name, dev);
+	list_add(&td->list, &md->table_devices);
+	return td;
+
+out_blkdev_put:
+	blkdev_put(bdev, mode | FMODE_EXCL);
+out_free_td:
+	kfree(td);
+	return ERR_PTR(r);
 }

 /*
···
  */
 static void close_table_device(struct table_device *td, struct mapped_device *md)
 {
-	if (!td->dm_dev.bdev)
-		return;
-
-	bd_unlink_disk_holder(td->dm_dev.bdev, dm_disk(md));
+	if (md->disk->slave_dir)
+		bd_unlink_disk_holder(td->dm_dev.bdev, md->disk);
 	blkdev_put(td->dm_dev.bdev, td->dm_dev.mode | FMODE_EXCL);
 	put_dax(td->dm_dev.dax_dev);
-	td->dm_dev.bdev = NULL;
-	td->dm_dev.dax_dev = NULL;
+	list_del(&td->list);
+	kfree(td);
 }

 static struct table_device *find_table_device(struct list_head *l, dev_t dev,
···
 int dm_get_table_device(struct mapped_device *md, dev_t dev, fmode_t mode,
 			struct dm_dev **result)
 {
-	int r;
 	struct table_device *td;

 	mutex_lock(&md->table_devices_lock);
 	td = find_table_device(&md->table_devices, dev, mode);
 	if (!td) {
-		td = kmalloc_node(sizeof(*td), GFP_KERNEL, md->numa_node_id);
-		if (!td) {
+		td = open_table_device(md, dev, mode);
+		if (IS_ERR(td)) {
 			mutex_unlock(&md->table_devices_lock);
-			return -ENOMEM;
+			return PTR_ERR(td);
 		}
-
-		td->dm_dev.mode = mode;
-		td->dm_dev.bdev = NULL;
-
-		if ((r = open_table_device(td, dev, md))) {
-			mutex_unlock(&md->table_devices_lock);
-			kfree(td);
-			return r;
-		}
-
-		format_dev_t(td->dm_dev.name, dev);
-
-		refcount_set(&td->count, 1);
-		list_add(&td->list, &md->table_devices);
 	} else {
 		refcount_inc(&td->count);
 	}
···
 	struct table_device *td = container_of(d, struct table_device, dm_dev);

 	mutex_lock(&md->table_devices_lock);
-	if (refcount_dec_and_test(&td->count)) {
+	if (refcount_dec_and_test(&td->count))
 		close_table_device(td, md);
-		list_del(&td->list);
-		kfree(td);
-	}
 	mutex_unlock(&md->table_devices_lock);
-}
-
-static void free_table_devices(struct list_head *devices)
-{
-	struct list_head *tmp, *next;
-
-	list_for_each_safe(tmp, next, devices) {
-		struct table_device *td = list_entry(tmp, struct table_device, list);
-
-		DMWARN("dm_destroy: %s still exists with %d references",
-		       td->dm_dev.name, refcount_read(&td->count));
-		kfree(td);
-	}
 }

 /*
···
 	md->disk->private_data = NULL;
 	spin_unlock(&_minor_lock);
 	if (dm_get_md_type(md) != DM_TYPE_NONE) {
+		struct table_device *td;
+
 		dm_sysfs_exit(md);
+		list_for_each_entry(td, &md->table_devices, list) {
+			bd_unlink_disk_holder(td->dm_dev.bdev,
+					      md->disk);
+		}
+
+		/*
+		 * Hold lock to make sure del_gendisk() won't concurrent
+		 * with open/close_table_device().
+		 */
+		mutex_lock(&md->table_devices_lock);
 		del_gendisk(md->disk);
+		mutex_unlock(&md->table_devices_lock);
 	}
 	dm_queue_destroy_crypto_profile(md->queue);
 	put_disk(md->disk);
···

 	cleanup_mapped_device(md);

-	free_table_devices(&md->table_devices);
+	WARN_ON_ONCE(!list_empty(&md->table_devices));
 	dm_stats_cleanup(&md->stats);
 	free_minor(minor);

···
 {
 	enum dm_queue_mode type = dm_table_get_type(t);
 	struct queue_limits limits;
+	struct table_device *td;
 	int r;

 	switch (type) {
···
 	if (r)
 		return r;

+	/*
+	 * Hold lock to make sure add_disk() and del_gendisk() won't concurrent
+	 * with open_table_device() and close_table_device().
+	 */
+	mutex_lock(&md->table_devices_lock);
 	r = add_disk(md->disk);
+	mutex_unlock(&md->table_devices_lock);
 	if (r)
 		return r;

-	r = dm_sysfs_init(md);
-	if (r) {
-		del_gendisk(md->disk);
-		return r;
+	/*
+	 * Register the holder relationship for devices added before the disk
+	 * was live.
+	 */
+	list_for_each_entry(td, &md->table_devices, list) {
+		r = bd_link_disk_holder(td->dm_dev.bdev, md->disk);
+		if (r)
+			goto out_undo_holders;
 	}
+
+	r = dm_sysfs_init(md);
+	if (r)
+		goto out_undo_holders;
+
 	md->type = type;
 	return 0;
+
+out_undo_holders:
+	list_for_each_entry_continue_reverse(td, &md->table_devices, list)
+		bd_unlink_disk_holder(td->dm_dev.bdev, md->disk);
+	mutex_lock(&md->table_devices_lock);
+	del_gendisk(md->disk);
+	mutex_unlock(&md->table_devices_lock);
+	return r;
 }

 struct mapped_device *dm_get_md(dev_t dev)
+27 -20
drivers/md/md-bitmap.c
···
 	sb = kmap_atomic(bitmap->storage.sb_page);
 	pr_debug("%s: bitmap file superblock:\n", bmname(bitmap));
 	pr_debug("         magic: %08x\n", le32_to_cpu(sb->magic));
-	pr_debug("       version: %d\n", le32_to_cpu(sb->version));
+	pr_debug("       version: %u\n", le32_to_cpu(sb->version));
 	pr_debug("          uuid: %08x.%08x.%08x.%08x\n",
 		 le32_to_cpu(*(__le32 *)(sb->uuid+0)),
 		 le32_to_cpu(*(__le32 *)(sb->uuid+4)),
···
 	pr_debug("events cleared: %llu\n",
 		 (unsigned long long) le64_to_cpu(sb->events_cleared));
 	pr_debug("         state: %08x\n", le32_to_cpu(sb->state));
-	pr_debug("     chunksize: %d B\n", le32_to_cpu(sb->chunksize));
-	pr_debug("  daemon sleep: %ds\n", le32_to_cpu(sb->daemon_sleep));
+	pr_debug("     chunksize: %u B\n", le32_to_cpu(sb->chunksize));
+	pr_debug("  daemon sleep: %us\n", le32_to_cpu(sb->daemon_sleep));
 	pr_debug("     sync size: %llu KB\n",
 		 (unsigned long long)le64_to_cpu(sb->sync_size)/2);
-	pr_debug("max write behind: %d\n", le32_to_cpu(sb->write_behind));
+	pr_debug("max write behind: %u\n", le32_to_cpu(sb->write_behind));
 	kunmap_atomic(sb);
 }
···
 			bytes = DIV_ROUND_UP(chunks, 8);
 			if (!bitmap->mddev->bitmap_info.external)
 				bytes += sizeof(bitmap_super_t);
-		} while (bytes > (space << 9));
+		} while (bytes > (space << 9) && (chunkshift + BITMAP_BLOCK_SHIFT) <
+			(BITS_PER_BYTE * sizeof(((bitmap_super_t *)0)->chunksize) - 1));
 	} else
 		chunkshift = ffz(~chunksize) - BITMAP_BLOCK_SHIFT;

···
 	bitmap->counts.missing_pages = pages;
 	bitmap->counts.chunkshift = chunkshift;
 	bitmap->counts.chunks = chunks;
-	bitmap->mddev->bitmap_info.chunksize = 1 << (chunkshift +
+	bitmap->mddev->bitmap_info.chunksize = 1UL << (chunkshift +
 						     BITMAP_BLOCK_SHIFT);

 	blocks = min(old_counts.chunks << old_counts.chunkshift,
···
 		bitmap->counts.missing_pages = old_counts.pages;
 		bitmap->counts.chunkshift = old_counts.chunkshift;
 		bitmap->counts.chunks = old_counts.chunks;
-		bitmap->mddev->bitmap_info.chunksize = 1 << (old_counts.chunkshift +
-							     BITMAP_BLOCK_SHIFT);
+		bitmap->mddev->bitmap_info.chunksize =
+			1UL << (old_counts.chunkshift + BITMAP_BLOCK_SHIFT);
 		blocks = old_counts.chunks << old_counts.chunkshift;
 		pr_warn("Could not pre-allocate in-memory bitmap for cluster raid\n");
 		break;
···

 		if (set) {
 			bmc_new = md_bitmap_get_counter(&bitmap->counts, block, &new_blocks, 1);
-			if (*bmc_new == 0) {
-				/* need to set on-disk bits too. */
-				sector_t end = block + new_blocks;
-				sector_t start = block >> chunkshift;
-				start <<= chunkshift;
-				while (start < end) {
-					md_bitmap_file_set_bit(bitmap, block);
-					start += 1 << chunkshift;
+			if (bmc_new) {
+				if (*bmc_new == 0) {
+					/* need to set on-disk bits too. */
+					sector_t end = block + new_blocks;
+					sector_t start = block >> chunkshift;
+
+					start <<= chunkshift;
+					while (start < end) {
+						md_bitmap_file_set_bit(bitmap, block);
+						start += 1 << chunkshift;
+					}
+					*bmc_new = 2;
+					md_bitmap_count_page(&bitmap->counts, block, 1);
+					md_bitmap_set_pending(&bitmap->counts, block);
 				}
-				*bmc_new = 2;
-				md_bitmap_count_page(&bitmap->counts, block, 1);
-				md_bitmap_set_pending(&bitmap->counts, block);
+				*bmc_new |= NEEDED_MASK;
 			}
-			*bmc_new |= NEEDED_MASK;
 			if (new_blocks < old_blocks)
 				old_blocks = new_blocks;
 		}
···
 		if (csize < 512 ||
 		    !is_power_of_2(csize))
 			return -EINVAL;
+		if (BITS_PER_LONG > 32 && csize >= (1ULL << (BITS_PER_BYTE *
+				sizeof(((bitmap_super_t *)0)->chunksize))))
+			return -EOVERFLOW;
 		mddev->bitmap_info.chunksize = csize;
 		return len;
 	}
+159 -164
drivers/md/md.c
···
 		struct md_rdev *this);
 static void mddev_detach(struct mddev *mddev);
 
+enum md_ro_state {
+	MD_RDWR,
+	MD_RDONLY,
+	MD_AUTO_READ,
+	MD_MAX_STATE
+};
+
+static bool md_is_rdwr(struct mddev *mddev)
+{
+	return (mddev->ro == MD_RDWR);
+}
+
 /*
  * Default number of read corrections we'll attempt on an rdev
  * before ejecting it from the array. We divide the read error
···
 	bio = bio_split_to_limits(bio);
 
-	if (mddev->ro == 1 && unlikely(rw == WRITE)) {
+	if (mddev->ro == MD_RDONLY && unlikely(rw == WRITE)) {
 		if (bio_sectors(bio) != 0)
 			bio->bi_status = BLK_STS_IOERR;
 		bio_endio(bio);
···
 	struct md_rdev *rdev = bio->bi_private;
 	struct mddev *mddev = rdev->mddev;
 
+	bio_put(bio);
+
 	rdev_dec_pending(rdev, mddev);
 
 	if (atomic_dec_and_test(&mddev->flush_pending)) {
 		/* The pre-request flush has finished */
 		queue_work(md_wq, &mddev->flush_work);
 	}
-	bio_put(bio);
 }
 
 static void md_submit_flush_data(struct work_struct *ws);
···
 	} else
 		clear_bit(LastDev, &rdev->flags);
 
+	bio_put(bio);
+
+	rdev_dec_pending(rdev, mddev);
+
 	if (atomic_dec_and_test(&mddev->pending_writes))
 		wake_up(&mddev->sb_wait);
-	rdev_dec_pending(rdev, mddev);
-	bio_put(bio);
 }
 
 void md_super_write(struct mddev *mddev, struct md_rdev *rdev,
···
 	kobject_put(&rdev->kobj);
 }
 
-static void unbind_rdev_from_array(struct md_rdev *rdev)
+void md_autodetect_dev(dev_t dev);
+
+static void export_rdev(struct md_rdev *rdev)
+{
+	pr_debug("md: export_rdev(%pg)\n", rdev->bdev);
+	md_rdev_clear(rdev);
+#ifndef MODULE
+	if (test_bit(AutoDetected, &rdev->flags))
+		md_autodetect_dev(rdev->bdev->bd_dev);
+#endif
+	blkdev_put(rdev->bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL);
+	rdev->bdev = NULL;
+	kobject_put(&rdev->kobj);
+}
+
+static void md_kick_rdev_from_array(struct md_rdev *rdev)
 {
 	bd_unlink_disk_holder(rdev->bdev, rdev->mddev->gendisk);
 	list_del_rcu(&rdev->same_set);
···
 	INIT_WORK(&rdev->del_work, rdev_delayed_delete);
 	kobject_get(&rdev->kobj);
 	queue_work(md_rdev_misc_wq, &rdev->del_work);
-}
-
-/*
- * prevent the device from being mounted, repartitioned or
- * otherwise reused by a RAID array (or any other kernel
- * subsystem), by bd_claiming the device.
- */
-static int lock_rdev(struct md_rdev *rdev, dev_t dev, int shared)
-{
-	int err = 0;
-	struct block_device *bdev;
-
-	bdev = blkdev_get_by_dev(dev, FMODE_READ|FMODE_WRITE|FMODE_EXCL,
-				 shared ? (struct md_rdev *)lock_rdev : rdev);
-	if (IS_ERR(bdev)) {
-		pr_warn("md: could not open device unknown-block(%u,%u).\n",
-			MAJOR(dev), MINOR(dev));
-		return PTR_ERR(bdev);
-	}
-	rdev->bdev = bdev;
-	return err;
-}
-
-static void unlock_rdev(struct md_rdev *rdev)
-{
-	struct block_device *bdev = rdev->bdev;
-	rdev->bdev = NULL;
-	blkdev_put(bdev, FMODE_READ|FMODE_WRITE|FMODE_EXCL);
-}
-
-void md_autodetect_dev(dev_t dev);
-
-static void export_rdev(struct md_rdev *rdev)
-{
-	pr_debug("md: export_rdev(%pg)\n", rdev->bdev);
-	md_rdev_clear(rdev);
-#ifndef MODULE
-	if (test_bit(AutoDetected, &rdev->flags))
-		md_autodetect_dev(rdev->bdev->bd_dev);
-#endif
-	unlock_rdev(rdev);
-	kobject_put(&rdev->kobj);
-}
-
-void md_kick_rdev_from_array(struct md_rdev *rdev)
-{
-	unbind_rdev_from_array(rdev);
 	export_rdev(rdev);
 }
-EXPORT_SYMBOL_GPL(md_kick_rdev_from_array);
 
 static void export_array(struct mddev *mddev)
 {
···
 	int any_badblocks_changed = 0;
 	int ret = -1;
 
-	if (mddev->ro) {
+	if (!md_is_rdwr(mddev)) {
 		if (force_change)
 			set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
 		return;
···
  */
 static struct md_rdev *md_import_device(dev_t newdev, int super_format, int super_minor)
 {
-	int err;
+	static struct md_rdev *claim_rdev; /* just for claiming the bdev */
 	struct md_rdev *rdev;
 	sector_t size;
+	int err;
 
 	rdev = kzalloc(sizeof(*rdev), GFP_KERNEL);
 	if (!rdev)
···
 	err = md_rdev_init(rdev);
 	if (err)
-		goto abort_free;
+		goto out_free_rdev;
 	err = alloc_disk_sb(rdev);
 	if (err)
-		goto abort_free;
+		goto out_clear_rdev;
 
-	err = lock_rdev(rdev, newdev, super_format == -2);
-	if (err)
-		goto abort_free;
+	rdev->bdev = blkdev_get_by_dev(newdev,
+			FMODE_READ | FMODE_WRITE | FMODE_EXCL,
+			super_format == -2 ? claim_rdev : rdev);
+	if (IS_ERR(rdev->bdev)) {
+		pr_warn("md: could not open device unknown-block(%u,%u).\n",
+			MAJOR(newdev), MINOR(newdev));
+		err = PTR_ERR(rdev->bdev);
+		goto out_clear_rdev;
+	}
 
 	kobject_init(&rdev->kobj, &rdev_ktype);
 
···
 		pr_warn("md: %pg has zero or unknown size, marking faulty!\n",
 			rdev->bdev);
 		err = -EINVAL;
-		goto abort_free;
+		goto out_blkdev_put;
 	}
 
 	if (super_format >= 0) {
···
 			pr_warn("md: %pg does not have a valid v%d.%d superblock, not importing!\n",
 				rdev->bdev,
 				super_format, super_minor);
-			goto abort_free;
+			goto out_blkdev_put;
 		}
 		if (err < 0) {
 			pr_warn("md: could not read %pg's sb, not importing!\n",
 				rdev->bdev);
-			goto abort_free;
+			goto out_blkdev_put;
 		}
 	}
 
 	return rdev;
 
-abort_free:
-	if (rdev->bdev)
-		unlock_rdev(rdev);
+out_blkdev_put:
+	blkdev_put(rdev->bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL);
+out_clear_rdev:
 	md_rdev_clear(rdev);
+out_free_rdev:
 	kfree(rdev);
 	return ERR_PTR(err);
 }
···
 		goto out_unlock;
 	}
 	rv = -EROFS;
-	if (mddev->ro)
+	if (!md_is_rdwr(mddev))
 		goto out_unlock;
 
 	/* request to change the personality. Need to ensure:
···
 	if (mddev->pers) {
 		if (mddev->pers->check_reshape == NULL)
 			err = -EBUSY;
-		else if (mddev->ro)
+		else if (!md_is_rdwr(mddev))
 			err = -EROFS;
 		else {
 			mddev->new_layout = n;
···
 	if (mddev->pers) {
 		if (mddev->pers->check_reshape == NULL)
 			err = -EBUSY;
-		else if (mddev->ro)
+		else if (!md_is_rdwr(mddev))
 			err = -EROFS;
 		else {
 			mddev->new_chunk_sectors = n >> 9;
···
 	if (mddev->pers && !test_bit(MD_NOT_READY, &mddev->flags)) {
 		switch(mddev->ro) {
-		case 1:
+		case MD_RDONLY:
 			st = readonly;
 			break;
-		case 2:
+		case MD_AUTO_READ:
 			st = read_auto;
 			break;
-		case 0:
+		case MD_RDWR:
 			spin_lock(&mddev->lock);
 			if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
 				st = write_pending;
···
 	int err = 0;
 	enum array_state st = match_word(buf, array_states);
 
-	if (mddev->pers && (st == active || st == clean) && mddev->ro != 1) {
+	if (mddev->pers && (st == active || st == clean) &&
+	    mddev->ro != MD_RDONLY) {
 		/* don't take reconfig_mutex when toggling between
 		 * clean and active
 		 */
···
 		if (mddev->pers)
 			err = md_set_readonly(mddev, NULL);
 		else {
-			mddev->ro = 1;
+			mddev->ro = MD_RDONLY;
 			set_disk_ro(mddev->gendisk, 1);
 			err = do_md_run(mddev);
 		}
 		break;
 	case read_auto:
 		if (mddev->pers) {
-			if (mddev->ro == 0)
+			if (md_is_rdwr(mddev))
 				err = md_set_readonly(mddev, NULL);
-			else if (mddev->ro == 1)
+			else if (mddev->ro == MD_RDONLY)
 				err = restart_array(mddev);
 			if (err == 0) {
-				mddev->ro = 2;
+				mddev->ro = MD_AUTO_READ;
 				set_disk_ro(mddev->gendisk, 0);
 			}
 		} else {
-			mddev->ro = 2;
+			mddev->ro = MD_AUTO_READ;
 			err = do_md_run(mddev);
 		}
 		break;
···
 			wake_up(&mddev->sb_wait);
 			err = 0;
 		} else {
-			mddev->ro = 0;
+			mddev->ro = MD_RDWR;
 			set_disk_ro(mddev->gendisk, 0);
 			err = do_md_run(mddev);
 		}
···
 	if (test_bit(MD_RECOVERY_FROZEN, &recovery))
 		type = "frozen";
 	else if (test_bit(MD_RECOVERY_RUNNING, &recovery) ||
-		 (!mddev->ro && test_bit(MD_RECOVERY_NEEDED, &recovery))) {
+		 (md_is_rdwr(mddev) && test_bit(MD_RECOVERY_NEEDED, &recovery))) {
 		if (test_bit(MD_RECOVERY_RESHAPE, &recovery))
 			type = "reshape";
 		else if (test_bit(MD_RECOVERY_SYNC, &recovery)) {
···
 		set_bit(MD_RECOVERY_REQUESTED, &mddev->recovery);
 		set_bit(MD_RECOVERY_SYNC, &mddev->recovery);
 	}
-	if (mddev->ro == 2) {
+	if (mddev->ro == MD_AUTO_READ) {
 		/* A write to sync_action is enough to justify
 		 * canceling read-auto mode
 		 */
-		mddev->ro = 0;
+		mddev->ro = MD_RDWR;
 		md_wakeup_thread(mddev->sync_thread);
 	}
 	set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
···
 		goto out_unlock;
 
 	err = -EBUSY;
-	if (max < mddev->resync_max &&
-	    mddev->ro == 0 &&
+	if (max < mddev->resync_max && md_is_rdwr(mddev) &&
 	    test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
 		goto out_unlock;
···
 			continue;
 		sync_blockdev(rdev->bdev);
 		invalidate_bdev(rdev->bdev);
-		if (mddev->ro != 1 && rdev_read_only(rdev)) {
-			mddev->ro = 1;
+		if (mddev->ro != MD_RDONLY && rdev_read_only(rdev)) {
+			mddev->ro = MD_RDONLY;
 			if (mddev->gendisk)
 				set_disk_ro(mddev->gendisk, 1);
 		}
···
 	mddev->ok_start_degraded = start_dirty_degraded;
 
-	if (start_readonly && mddev->ro == 0)
-		mddev->ro = 2; /* read-only, but switch on first write */
+	if (start_readonly && md_is_rdwr(mddev))
+		mddev->ro = MD_AUTO_READ; /* read-only, but switch on first write */
 
 	err = pers->run(mddev);
 	if (err)
···
 		mddev->sysfs_action = sysfs_get_dirent_safe(mddev->kobj.sd, "sync_action");
 		mddev->sysfs_completed = sysfs_get_dirent_safe(mddev->kobj.sd, "sync_completed");
 		mddev->sysfs_degraded = sysfs_get_dirent_safe(mddev->kobj.sd, "degraded");
-	} else if (mddev->ro == 2) /* auto-readonly not meaningful */
-		mddev->ro = 0;
+	} else if (mddev->ro == MD_AUTO_READ)
+		mddev->ro = MD_RDWR;
 
 	atomic_set(&mddev->max_corr_read_errors,
 		   MD_DEFAULT_MAX_CORRECTED_READ_ERRORS);
···
 		if (rdev->raid_disk >= 0)
 			sysfs_link_rdev(mddev, rdev); /* failure here is OK */
 
-	if (mddev->degraded && !mddev->ro)
+	if (mddev->degraded && md_is_rdwr(mddev))
 		/* This ensures that recovering status is reported immediately
 		 * via sysfs - until a lack of spares is confirmed.
 		 */
···
 		return -ENXIO;
 	if (!mddev->pers)
 		return -EINVAL;
-	if (!mddev->ro)
+	if (md_is_rdwr(mddev))
 		return -EBUSY;
 
 	rcu_read_lock();
···
 		return -EROFS;
 
 	mddev->safemode = 0;
-	mddev->ro = 0;
+	mddev->ro = MD_RDWR;
 	set_disk_ro(disk, 0);
 	pr_debug("md: %s switched to read-write mode.\n", mdname(mddev));
 	/* Kick recovery or resync if necessary */
···
 	mddev->clevel[0] = 0;
 	mddev->flags = 0;
 	mddev->sb_flags = 0;
-	mddev->ro = 0;
+	mddev->ro = MD_RDWR;
 	mddev->metadata_type[0] = 0;
 	mddev->chunk_sectors = 0;
 	mddev->ctime = mddev->utime = 0;
···
 	}
 	md_bitmap_flush(mddev);
 
-	if (mddev->ro == 0 &&
+	if (md_is_rdwr(mddev) &&
 	    ((!mddev->in_sync && !mddev_is_clustered(mddev)) ||
 	     mddev->sb_flags)) {
 		/* mark array as shutdown cleanly */
···
 	__md_stop_writes(mddev);
 
 	err = -ENXIO;
-	if (mddev->ro==1)
+	if (mddev->ro == MD_RDONLY)
 		goto out;
-	mddev->ro = 1;
+	mddev->ro = MD_RDONLY;
 	set_disk_ro(mddev->gendisk, 1);
 	clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
 	set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
···
 		return -EBUSY;
 	}
 	if (mddev->pers) {
-		if (mddev->ro)
+		if (!md_is_rdwr(mddev))
 			set_disk_ro(disk, 0);
 
 		__md_stop_writes(mddev);
···
 		mutex_unlock(&mddev->open_mutex);
 		mddev->changed = 1;
 
-		if (mddev->ro)
-			mddev->ro = 0;
+		if (!md_is_rdwr(mddev))
+			mddev->ro = MD_RDWR;
 	} else
 		mutex_unlock(&mddev->open_mutex);
 	/*
···
 	if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) ||
 	    mddev->sync_thread)
 		return -EBUSY;
-	if (mddev->ro)
+	if (!md_is_rdwr(mddev))
 		return -EROFS;
 
 	rdev_for_each(rdev, mddev) {
···
 	/* change the number of raid disks */
 	if (mddev->pers->check_reshape == NULL)
 		return -EINVAL;
-	if (mddev->ro)
+	if (!md_is_rdwr(mddev))
 		return -EROFS;
 	if (raid_disks <= 0 ||
 	    (mddev->max_disks && raid_disks >= mddev->max_disks))
···
 	}
 }
 
+static int __md_set_array_info(struct mddev *mddev, void __user *argp)
+{
+	mdu_array_info_t info;
+	int err;
+
+	if (!argp)
+		memset(&info, 0, sizeof(info));
+	else if (copy_from_user(&info, argp, sizeof(info)))
+		return -EFAULT;
+
+	if (mddev->pers) {
+		err = update_array_info(mddev, &info);
+		if (err)
+			pr_warn("md: couldn't update array info. %d\n", err);
+		return err;
+	}
+
+	if (!list_empty(&mddev->disks)) {
+		pr_warn("md: array %s already has disks!\n", mdname(mddev));
+		return -EBUSY;
+	}
+
+	if (mddev->raid_disks) {
+		pr_warn("md: array %s already initialised!\n", mdname(mddev));
+		return -EBUSY;
+	}
+
+	err = md_set_array_info(mddev, &info);
+	if (err)
+		pr_warn("md: couldn't set array info. %d\n", err);
+
+	return err;
+}
+
 static int md_ioctl(struct block_device *bdev, fmode_t mode,
 			unsigned int cmd, unsigned long arg)
 {
···
 	}
 
 	if (cmd == SET_ARRAY_INFO) {
-		mdu_array_info_t info;
-		if (!arg)
-			memset(&info, 0, sizeof(info));
-		else if (copy_from_user(&info, argp, sizeof(info))) {
-			err = -EFAULT;
-			goto unlock;
-		}
-		if (mddev->pers) {
-			err = update_array_info(mddev, &info);
-			if (err) {
-				pr_warn("md: couldn't update array info. %d\n", err);
-				goto unlock;
-			}
-			goto unlock;
-		}
-		if (!list_empty(&mddev->disks)) {
-			pr_warn("md: array %s already has disks!\n", mdname(mddev));
-			err = -EBUSY;
-			goto unlock;
-		}
-		if (mddev->raid_disks) {
-			pr_warn("md: array %s already initialised!\n", mdname(mddev));
-			err = -EBUSY;
-			goto unlock;
-		}
-		err = md_set_array_info(mddev, &info);
-		if (err) {
-			pr_warn("md: couldn't set array info. %d\n", err);
-			goto unlock;
-		}
+		err = __md_set_array_info(mddev, argp);
 		goto unlock;
 	}
···
 	 * The remaining ioctls are changing the state of the
 	 * superblock, so we do not allow them on read-only arrays.
 	 */
-	if (mddev->ro && mddev->pers) {
-		if (mddev->ro == 2) {
-			mddev->ro = 0;
-			sysfs_notify_dirent_safe(mddev->sysfs_state);
-			set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
-			/* mddev_unlock will wake thread */
-			/* If a device failed while we were read-only, we
-			 * need to make sure the metadata is updated now.
-			 */
-			if (test_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags)) {
-				mddev_unlock(mddev);
-				wait_event(mddev->sb_wait,
-					   !test_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags) &&
-					   !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags));
-				mddev_lock_nointr(mddev);
-			}
-		} else {
+	if (!md_is_rdwr(mddev) && mddev->pers) {
+		if (mddev->ro != MD_AUTO_READ) {
 			err = -EROFS;
 			goto unlock;
+		}
+		mddev->ro = MD_RDWR;
+		sysfs_notify_dirent_safe(mddev->sysfs_state);
+		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
+		/* mddev_unlock will wake thread */
+		/* If a device failed while we were read-only, we
+		 * need to make sure the metadata is updated now.
+		 */
+		if (test_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags)) {
+			mddev_unlock(mddev);
+			wait_event(mddev->sb_wait,
+				   !test_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags) &&
+				   !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags));
+			mddev_lock_nointr(mddev);
 		}
 	}
···
 	 * Transitioning to read-auto need only happen for arrays that call
 	 * md_write_start and which are not ready for writes yet.
 	 */
-	if (!ro && mddev->ro == 1 && mddev->pers) {
+	if (!ro && mddev->ro == MD_RDONLY && mddev->pers) {
 		err = restart_array(mddev);
 		if (err)
 			goto out_unlock;
-		mddev->ro = 2;
+		mddev->ro = MD_AUTO_READ;
 	}
 
 out_unlock:
···
 	seq_printf(seq, "%s : %sactive", mdname(mddev),
 		   mddev->pers ? "" : "in");
 	if (mddev->pers) {
-		if (mddev->ro==1)
+		if (mddev->ro == MD_RDONLY)
 			seq_printf(seq, " (read-only)");
-		if (mddev->ro==2)
+		if (mddev->ro == MD_AUTO_READ)
 			seq_printf(seq, " (auto-read-only)");
 		seq_printf(seq, " %s", mddev->pers->name);
 	}
···
 	if (bio_data_dir(bi) != WRITE)
 		return true;
 
-	BUG_ON(mddev->ro == 1);
-	if (mddev->ro == 2) {
+	BUG_ON(mddev->ro == MD_RDONLY);
+	if (mddev->ro == MD_AUTO_READ) {
 		/* need to switch to read/write */
-		mddev->ro = 0;
+		mddev->ro = MD_RDWR;
 		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
 		md_wakeup_thread(mddev->thread);
 		md_wakeup_thread(mddev->sync_thread);
···
 {
 	if (bio_data_dir(bi) != WRITE)
 		return;
-	WARN_ON_ONCE(mddev->in_sync || mddev->ro);
+	WARN_ON_ONCE(mddev->in_sync || !md_is_rdwr(mddev));
 	percpu_ref_get(&mddev->writes_pending);
 }
 EXPORT_SYMBOL(md_write_inc);
···
 {
 	if (!mddev->pers)
 		return;
-	if (mddev->ro)
+	if (!md_is_rdwr(mddev))
 		return;
 	if (!mddev->pers->sync_request)
 		return;
···
 	if (test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
 	    test_bit(MD_RECOVERY_WAIT, &mddev->recovery))
 		return;
-	if (mddev->ro) {/* never try to sync a read-only array */
+	if (!md_is_rdwr(mddev)) {/* never try to sync a read-only array */
 		set_bit(MD_RECOVERY_INTR, &mddev->recovery);
 		return;
 	}
···
 		if (test_bit(Faulty, &rdev->flags))
 			continue;
 		if (!test_bit(Journal, &rdev->flags)) {
-			if (mddev->ro &&
-			    ! (rdev->saved_raid_disk >= 0 &&
-			       !test_bit(Bitmap_sync, &rdev->flags)))
+			if (!md_is_rdwr(mddev) &&
+			    !(rdev->saved_raid_disk >= 0 &&
+			      !test_bit(Bitmap_sync, &rdev->flags)))
 				continue;
 
 			rdev->recovery_offset = 0;
···
 		flush_signals(current);
 	}
 
-	if (mddev->ro && !test_bit(MD_RECOVERY_NEEDED, &mddev->recovery))
+	if (!md_is_rdwr(mddev) &&
+	    !test_bit(MD_RECOVERY_NEEDED, &mddev->recovery))
 		return;
 	if ( ! (
 		(mddev->sb_flags & ~ (1<<MD_SB_CHANGE_PENDING)) ||
···
 	if (!mddev->external && mddev->safemode == 1)
 		mddev->safemode = 0;
 
-	if (mddev->ro) {
+	if (!md_is_rdwr(mddev)) {
 		struct md_rdev *rdev;
 		if (!mddev->external && mddev->in_sync)
 			/* 'Blocked' flag not needed as failed devices
-1
drivers/md/md.h
···
 
 extern void md_reload_sb(struct mddev *mddev, int raid_disk);
 extern void md_update_sb(struct mddev *mddev, int force);
-extern void md_kick_rdev_from_array(struct md_rdev * rdev);
 extern void mddev_create_serial_pool(struct mddev *mddev, struct md_rdev *rdev,
 				     bool is_suspend);
 extern void mddev_destroy_serial_pool(struct mddev *mddev, struct md_rdev *rdev,
-1
drivers/md/raid0.c
···
 
 	blk_queue_max_hw_sectors(mddev->queue, mddev->chunk_sectors);
 	blk_queue_max_write_zeroes_sectors(mddev->queue, mddev->chunk_sectors);
-	blk_queue_max_discard_sectors(mddev->queue, UINT_MAX);
 
 	blk_queue_io_min(mddev->queue, mddev->chunk_sectors << 9);
 	blk_queue_io_opt(mddev->queue,
+7 -6
drivers/md/raid1.c
···
 	read_bio->bi_iter.bi_sector = r1_bio->sector +
 		mirror->rdev->data_offset;
 	read_bio->bi_end_io = raid1_end_read_request;
-	bio_set_op_attrs(read_bio, op, do_sync);
+	read_bio->bi_opf = op | do_sync;
 	if (test_bit(FailFast, &mirror->rdev->flags) &&
 	    test_bit(R1BIO_FailFast, &r1_bio->state))
 		read_bio->bi_opf |= MD_FAILFAST;
···
 			continue;
 		}
 
-		bio_set_op_attrs(wbio, REQ_OP_WRITE, 0);
+		wbio->bi_opf = REQ_OP_WRITE;
 		if (test_bit(FailFast, &conf->mirrors[i].rdev->flags))
 			wbio->bi_opf |= MD_FAILFAST;
···
 					      GFP_NOIO, &mddev->bio_set);
 	}
 
-	bio_set_op_attrs(wbio, REQ_OP_WRITE, 0);
+	wbio->bi_opf = REQ_OP_WRITE;
 	wbio->bi_iter.bi_sector = r1_bio->sector;
 	wbio->bi_iter.bi_size = r1_bio->sectors << 9;
···
 			if (i < conf->raid_disks)
 				still_degraded = 1;
 		} else if (!test_bit(In_sync, &rdev->flags)) {
-			bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
+			bio->bi_opf = REQ_OP_WRITE;
 			bio->bi_end_io = end_sync_write;
 			write_targets ++;
 		} else {
···
 				if (disk < 0)
 					disk = i;
 			}
-			bio_set_op_attrs(bio, REQ_OP_READ, 0);
+			bio->bi_opf = REQ_OP_READ;
 			bio->bi_end_io = end_sync_read;
 			read_targets++;
 		} else if (!test_bit(WriteErrorSeen, &rdev->flags) &&
···
 			 * if we are doing resync or repair. Otherwise, leave
 			 * this device alone for this sync request.
 			 */
-			bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
+			bio->bi_opf = REQ_OP_WRITE;
 			bio->bi_end_io = end_sync_write;
 			write_targets++;
 		}
···
 	 * RAID1 needs at least one disk in active
 	 */
 	if (conf->raid_disks - mddev->degraded < 1) {
+		md_unregister_thread(&conf->thread);
 		ret = -EINVAL;
 		goto abort;
 	}
+9 -11
drivers/md/raid10.c
···
 	read_bio->bi_iter.bi_sector = r10_bio->devs[slot].addr +
 		choose_data_offset(r10_bio, rdev);
 	read_bio->bi_end_io = raid10_end_read_request;
-	bio_set_op_attrs(read_bio, op, do_sync);
+	read_bio->bi_opf = op | do_sync;
 	if (test_bit(FailFast, &rdev->flags) &&
 	    test_bit(R10BIO_FailFast, &r10_bio->state))
 		read_bio->bi_opf |= MD_FAILFAST;
···
 	mbio->bi_iter.bi_sector = (r10_bio->devs[n_copy].addr +
 			   choose_data_offset(r10_bio, rdev));
 	mbio->bi_end_io = raid10_end_write_request;
-	bio_set_op_attrs(mbio, op, do_sync | do_fua);
+	mbio->bi_opf = op | do_sync | do_fua;
 	if (!replacement && test_bit(FailFast,
 				     &conf->mirrors[devnum].rdev->flags)
 			 && enough(conf, devnum))
···
 		wsector = r10_bio->devs[i].addr + (sector - r10_bio->sector);
 		wbio->bi_iter.bi_sector = wsector +
 				   choose_data_offset(r10_bio, rdev);
-		bio_set_op_attrs(wbio, REQ_OP_WRITE, 0);
+		wbio->bi_opf = REQ_OP_WRITE;
 
 		if (submit_bio_wait(wbio) < 0)
 			/* Failure! */
···
 				bio->bi_next = biolist;
 				biolist = bio;
 				bio->bi_end_io = end_sync_read;
-				bio_set_op_attrs(bio, REQ_OP_READ, 0);
+				bio->bi_opf = REQ_OP_READ;
 				if (test_bit(FailFast, &rdev->flags))
 					bio->bi_opf |= MD_FAILFAST;
 				from_addr = r10_bio->devs[j].addr;
···
 					bio->bi_next = biolist;
 					biolist = bio;
 					bio->bi_end_io = end_sync_write;
-					bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
+					bio->bi_opf = REQ_OP_WRITE;
 					bio->bi_iter.bi_sector = to_addr
 						+ mrdev->data_offset;
 					bio_set_dev(bio, mrdev->bdev);
···
 				bio->bi_next = biolist;
 				biolist = bio;
 				bio->bi_end_io = end_sync_write;
-				bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
+				bio->bi_opf = REQ_OP_WRITE;
 				bio->bi_iter.bi_sector = to_addr +
 					mreplace->data_offset;
 				bio_set_dev(bio, mreplace->bdev);
···
 			bio->bi_next = biolist;
 			biolist = bio;
 			bio->bi_end_io = end_sync_read;
-			bio_set_op_attrs(bio, REQ_OP_READ, 0);
+			bio->bi_opf = REQ_OP_READ;
 			if (test_bit(FailFast, &rdev->flags))
 				bio->bi_opf |= MD_FAILFAST;
 			bio->bi_iter.bi_sector = sector + rdev->data_offset;
···
 			bio->bi_next = biolist;
 			biolist = bio;
 			bio->bi_end_io = end_sync_write;
-			bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
+			bio->bi_opf = REQ_OP_WRITE;
 			if (test_bit(FailFast, &rdev->flags))
 				bio->bi_opf |= MD_FAILFAST;
 			bio->bi_iter.bi_sector = sector + rdev->data_offset;
···
 	conf->thread = NULL;
 
 	if (mddev->queue) {
-		blk_queue_max_discard_sectors(mddev->queue,
-					      UINT_MAX);
 		blk_queue_max_write_zeroes_sectors(mddev->queue, 0);
 		blk_queue_io_min(mddev->queue, mddev->chunk_sectors << 9);
 		raid10_set_io_opt(conf);
···
 			b->bi_iter.bi_sector = r10_bio->devs[s/2].addr +
 				rdev2->new_data_offset;
 			b->bi_end_io = end_reshape_write;
-			bio_set_op_attrs(b, REQ_OP_WRITE, 0);
+			b->bi_opf = REQ_OP_WRITE;
 			b->bi_next = blist;
 			blist = b;
 		}
+4 -6
drivers/md/raid5-cache.c
···
 
 	if (!log)
 		return;
+
+	target = READ_ONCE(log->reclaim_target);
 	do {
-		target = log->reclaim_target;
 		if (new < target)
 			return;
-	} while (cmpxchg(&log->reclaim_target, target, new) != target);
+	} while (!try_cmpxchg(&log->reclaim_target, &target, new));
 	md_wakeup_thread(log->reclaim_thread);
 }
···
 
 int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
 {
-	struct request_queue *q = bdev_get_queue(rdev->bdev);
 	struct r5l_log *log;
 	int ret;
···
 	if (!log)
 		return -ENOMEM;
 	log->rdev = rdev;
-
-	log->need_cache_flush = test_bit(QUEUE_FLAG_WC, &q->queue_flags) != 0;
-
+	log->need_cache_flush = bdev_write_cache(rdev->bdev);
 	log->uuid_checksum = crc32c_le(~0, rdev->mddev->uuid,
 				       sizeof(rdev->mddev->uuid));
+1 -4
drivers/md/raid5-ppl.c
···
 
 static void ppl_init_child_log(struct ppl_log *log, struct md_rdev *rdev)
 {
-	struct request_queue *q;
-
 	if ((rdev->ppl.size << 9) >= (PPL_SPACE_SIZE +
 				      PPL_HEADER_SIZE) * 2) {
 		log->use_multippl = true;
···
 	}
 	log->next_io_sector = rdev->ppl.sector;
 
-	q = bdev_get_queue(rdev->bdev);
-	if (test_bit(QUEUE_FLAG_WC, &q->queue_flags))
+	if (bdev_write_cache(rdev->bdev))
 		log->wb_cache_on = true;
 }
+10 -20
drivers/nvme/host/apple.c
···
 		goto out_free_cmd;
 	}
 
-	blk_mq_start_request(req);
+	nvme_start_request(req);
 	apple_nvme_submit_cmd(q, cmnd);
 	return BLK_STS_OK;
···
 	if (!dead && shutdown && freeze)
 		nvme_wait_freeze_timeout(&anv->ctrl, NVME_IO_TIMEOUT);
 
-	nvme_stop_queues(&anv->ctrl);
+	nvme_quiesce_io_queues(&anv->ctrl);
 
 	if (!dead) {
 		if (READ_ONCE(anv->ioq.enabled)) {
···
 			apple_nvme_remove_cq(anv);
 		}
 
-		if (shutdown)
-			nvme_shutdown_ctrl(&anv->ctrl);
-		nvme_disable_ctrl(&anv->ctrl);
+		nvme_disable_ctrl(&anv->ctrl, shutdown);
 	}
 
 	WRITE_ONCE(anv->ioq.enabled, false);
 	WRITE_ONCE(anv->adminq.enabled, false);
 	mb(); /* ensure that nvme_queue_rq() sees that enabled is cleared */
-	nvme_stop_admin_queue(&anv->ctrl);
+	nvme_quiesce_admin_queue(&anv->ctrl);
 
 	/* last chance to complete any requests before nvme_cancel_request */
 	spin_lock_irqsave(&anv->lock, flags);
···
 	 * deadlocking blk-mq hot-cpu notifier.
 	 */
 	if (shutdown) {
-		nvme_start_queues(&anv->ctrl);
-		nvme_start_admin_queue(&anv->ctrl);
+		nvme_unquiesce_io_queues(&anv->ctrl);
+		nvme_unquiesce_admin_queue(&anv->ctrl);
 	}
 }
···
 
 	dev_dbg(anv->dev, "Starting admin queue");
 	apple_nvme_init_queue(&anv->adminq);
-	nvme_start_admin_queue(&anv->ctrl);
+	nvme_unquiesce_admin_queue(&anv->ctrl);
 
 	if (!nvme_change_ctrl_state(&anv->ctrl, NVME_CTRL_CONNECTING)) {
 		dev_warn(anv->ctrl.device,
···
 		goto out;
 	}
 
-	ret = nvme_init_ctrl_finish(&anv->ctrl);
+	ret = nvme_init_ctrl_finish(&anv->ctrl, false);
 	if (ret)
 		goto out;
···
 
 	anv->ctrl.queue_count = nr_io_queues + 1;
 
-	nvme_start_queues(&anv->ctrl);
+	nvme_unquiesce_io_queues(&anv->ctrl);
 	nvme_wait_freeze(&anv->ctrl);
 	blk_mq_update_nr_hw_queues(&anv->tagset, 1);
 	nvme_unfreeze(&anv->ctrl);
···
 	nvme_change_ctrl_state(&anv->ctrl, NVME_CTRL_DELETING);
 	nvme_get_ctrl(&anv->ctrl);
 	apple_nvme_disable(anv, false);
-	nvme_kill_queues(&anv->ctrl);
+	nvme_mark_namespaces_dead(&anv->ctrl);
 	if (!queue_work(nvme_wq, &anv->remove_work))
 		nvme_put_ctrl(&anv->ctrl);
···
 	anv->ctrl.admin_q = blk_mq_init_queue(&anv->admin_tagset);
 	if (IS_ERR(anv->ctrl.admin_q)) {
 		ret = -ENOMEM;
-		goto put_dev;
-	}
-
-	if (!blk_get_queue(anv->ctrl.admin_q)) {
-		nvme_start_admin_queue(&anv->ctrl);
-		blk_mq_destroy_queue(anv->ctrl.admin_q);
-		anv->ctrl.admin_q = NULL;
-		ret = -ENODEV;
 		goto put_dev;
 	}
+136 -122
drivers/nvme/host/auth.c
···
  #include "fabrics.h"
  #include <linux/nvme-auth.h>

+ #define CHAP_BUF_SIZE 4096
+ static struct kmem_cache *nvme_chap_buf_cache;
+ static mempool_t *nvme_chap_buf_pool;
+
  struct nvme_dhchap_queue_context {
      struct list_head entry;
      struct work_struct auth_work;
···
      struct crypto_shash *shash_tfm;
      struct crypto_kpp *dh_tfm;
      void *buf;
-     size_t buf_size;
      int qid;
      int error;
      u32 s1;
···
      (qid == 0) ? 0 : BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_RESERVED
  #define nvme_auth_queue_from_qid(ctrl, qid) \
      (qid == 0) ? (ctrl)->fabrics_q : (ctrl)->connect_q
+
+ static inline int ctrl_max_dhchaps(struct nvme_ctrl *ctrl)
+ {
+     return ctrl->opts->nr_io_queues + ctrl->opts->nr_write_queues +
+             ctrl->opts->nr_poll_queues + 1;
+ }

  static int nvme_auth_submit(struct nvme_ctrl *ctrl, int qid,
          void *data, size_t data_len, bool auth_send)
···
      struct nvmf_auth_dhchap_negotiate_data *data = chap->buf;
      size_t size = sizeof(*data) + sizeof(union nvmf_auth_protocol);

-     if (chap->buf_size < size) {
+     if (size > CHAP_BUF_SIZE) {
          chap->status = NVME_AUTH_DHCHAP_FAILURE_INCORRECT_PAYLOAD;
          return -EINVAL;
      }
···
      const char *gid_name = nvme_auth_dhgroup_name(data->dhgid);
      const char *hmac_name, *kpp_name;

-     if (chap->buf_size < size) {
+     if (size > CHAP_BUF_SIZE) {
          chap->status = NVME_AUTH_DHCHAP_FAILURE_INCORRECT_PAYLOAD;
          return NVME_SC_INVALID_FIELD;
      }
···
          return NVME_SC_AUTH_REQUIRED;
      }

-     /* Reset host response if the hash had been changed */
-     if (chap->hash_id != data->hashid) {
-         kfree(chap->host_response);
-         chap->host_response = NULL;
-     }
-
      chap->hash_id = data->hashid;
      chap->hash_len = data->hl;
      dev_dbg(ctrl->device, "qid %d: selected hash %s\n",
···
          /* Leave previous dh_tfm intact */
          return NVME_SC_AUTH_REQUIRED;
      }
-
-     /* Clear host and controller key to avoid accidental reuse */
-     kfree_sensitive(chap->host_key);
-     chap->host_key = NULL;
-     chap->host_key_len = 0;
-     kfree_sensitive(chap->ctrl_key);
-     chap->ctrl_key = NULL;
-     chap->ctrl_key_len = 0;

      if (chap->dhgroup_id == data->dhgid &&
          (data->dhgid == NVME_AUTH_DHGROUP_NULL || chap->dh_tfm)) {
···
      if (chap->host_key_len)
          size += chap->host_key_len;

-     if (chap->buf_size < size) {
+     if (size > CHAP_BUF_SIZE) {
          chap->status = NVME_AUTH_DHCHAP_FAILURE_INCORRECT_PAYLOAD;
          return -EINVAL;
      }
···
      struct nvmf_auth_dhchap_success1_data *data = chap->buf;
      size_t size = sizeof(*data);

-     if (ctrl->ctrl_key)
+     if (chap->ctrl_key)
          size += chap->hash_len;

-     if (chap->buf_size < size) {
+     if (size > CHAP_BUF_SIZE) {
          chap->status = NVME_AUTH_DHCHAP_FAILURE_INCORRECT_PAYLOAD;
          return NVME_SC_INVALID_FIELD;
      }
···
          ret = PTR_ERR(ctrl_response);
          return ret;
      }
+
      ret = crypto_shash_setkey(chap->shash_tfm,
              ctrl_response, ctrl->ctrl_key->len);
      if (ret) {
···
      if (ret) {
          dev_dbg(ctrl->device,
              "failed to generate public key, error %d\n", ret);
-         kfree(chap->host_key);
-         chap->host_key = NULL;
-         chap->host_key_len = 0;
          chap->status = NVME_AUTH_DHCHAP_FAILURE_INCORRECT_PAYLOAD;
          return ret;
      }
···
      if (ret) {
          dev_dbg(ctrl->device,
              "failed to generate shared secret, error %d\n", ret);
-         kfree_sensitive(chap->sess_key);
-         chap->sess_key = NULL;
-         chap->sess_key_len = 0;
          chap->status = NVME_AUTH_DHCHAP_FAILURE_INCORRECT_PAYLOAD;
          return ret;
      }
···
      return 0;
  }

- static void __nvme_auth_reset(struct nvme_dhchap_queue_context *chap)
+ static void nvme_auth_reset_dhchap(struct nvme_dhchap_queue_context *chap)
  {
      kfree_sensitive(chap->host_response);
      chap->host_response = NULL;
···
      chap->transaction = 0;
      memset(chap->c1, 0, sizeof(chap->c1));
      memset(chap->c2, 0, sizeof(chap->c2));
+     mempool_free(chap->buf, nvme_chap_buf_pool);
+     chap->buf = NULL;
  }

- static void __nvme_auth_free(struct nvme_dhchap_queue_context *chap)
+ static void nvme_auth_free_dhchap(struct nvme_dhchap_queue_context *chap)
  {
-     __nvme_auth_reset(chap);
+     nvme_auth_reset_dhchap(chap);
      if (chap->shash_tfm)
          crypto_free_shash(chap->shash_tfm);
      if (chap->dh_tfm)
          crypto_free_kpp(chap->dh_tfm);
-     kfree_sensitive(chap->ctrl_key);
-     kfree_sensitive(chap->host_key);
-     kfree_sensitive(chap->sess_key);
-     kfree_sensitive(chap->host_response);
-     kfree(chap->buf);
-     kfree(chap);
  }

- static void __nvme_auth_work(struct work_struct *work)
+ static void nvme_queue_auth_work(struct work_struct *work)
  {
      struct nvme_dhchap_queue_context *chap =
          container_of(work, struct nvme_dhchap_queue_context, auth_work);
      struct nvme_ctrl *ctrl = chap->ctrl;
      size_t tl;
      int ret = 0;
+
+     /*
+      * Allocate a large enough buffer for the entire negotiation:
+      * 4k is enough for ffdhe8192.
+      */
+     chap->buf = mempool_alloc(nvme_chap_buf_pool, GFP_KERNEL);
+     if (!chap->buf) {
+         chap->error = -ENOMEM;
+         return;
+     }

      chap->transaction = ctrl->transaction++;
···
      dev_dbg(ctrl->device, "%s: qid %d receive challenge\n",
          __func__, chap->qid);

-     memset(chap->buf, 0, chap->buf_size);
-     ret = nvme_auth_submit(ctrl, chap->qid, chap->buf, chap->buf_size, false);
+     memset(chap->buf, 0, CHAP_BUF_SIZE);
+     ret = nvme_auth_submit(ctrl, chap->qid, chap->buf, CHAP_BUF_SIZE,
+                    false);
      if (ret) {
          dev_warn(ctrl->device,
               "qid %d failed to receive challenge, %s %d\n",
···

      dev_dbg(ctrl->device, "%s: qid %d host response\n",
          __func__, chap->qid);
+     mutex_lock(&ctrl->dhchap_auth_mutex);
      ret = nvme_auth_dhchap_setup_host_response(ctrl, chap);
      if (ret) {
+         mutex_unlock(&ctrl->dhchap_auth_mutex);
          chap->error = ret;
          goto fail2;
      }
+     mutex_unlock(&ctrl->dhchap_auth_mutex);

      /* DH-HMAC-CHAP Step 3: send reply */
      dev_dbg(ctrl->device, "%s: qid %d send reply\n",
···
      dev_dbg(ctrl->device, "%s: qid %d receive success1\n",
          __func__, chap->qid);

-     memset(chap->buf, 0, chap->buf_size);
-     ret = nvme_auth_submit(ctrl, chap->qid, chap->buf, chap->buf_size, false);
+     memset(chap->buf, 0, CHAP_BUF_SIZE);
+     ret = nvme_auth_submit(ctrl, chap->qid, chap->buf, CHAP_BUF_SIZE,
+                    false);
      if (ret) {
          dev_warn(ctrl->device,
               "qid %d failed to receive success1, %s %d\n",
···
          return;
      }

+     mutex_lock(&ctrl->dhchap_auth_mutex);
      if (ctrl->ctrl_key) {
          dev_dbg(ctrl->device,
              "%s: qid %d controller response\n",
              __func__, chap->qid);
          ret = nvme_auth_dhchap_setup_ctrl_response(ctrl, chap);
          if (ret) {
+             mutex_unlock(&ctrl->dhchap_auth_mutex);
              chap->error = ret;
              goto fail2;
          }
      }
+     mutex_unlock(&ctrl->dhchap_auth_mutex);

      ret = nvme_auth_process_dhchap_success1(ctrl, chap);
      if (ret) {
···
          goto fail2;
      }

-     if (ctrl->ctrl_key) {
+     if (chap->ctrl_key) {
          /* DH-HMAC-CHAP Step 5: send success2 */
          dev_dbg(ctrl->device, "%s: qid %d send success2\n",
              __func__, chap->qid);
···
          return -ENOKEY;
      }

-     mutex_lock(&ctrl->dhchap_auth_mutex);
-     /* Check if the context is already queued */
-     list_for_each_entry(chap, &ctrl->dhchap_auth_list, entry) {
-         WARN_ON(!chap->buf);
-         if (chap->qid == qid) {
-             dev_dbg(ctrl->device, "qid %d: re-using context\n", qid);
-             mutex_unlock(&ctrl->dhchap_auth_mutex);
-             flush_work(&chap->auth_work);
-             __nvme_auth_reset(chap);
-             queue_work(nvme_wq, &chap->auth_work);
-             return 0;
-         }
-     }
-     chap = kzalloc(sizeof(*chap), GFP_KERNEL);
-     if (!chap) {
-         mutex_unlock(&ctrl->dhchap_auth_mutex);
-         return -ENOMEM;
-     }
-     chap->qid = (qid == NVME_QID_ANY) ? 0 : qid;
-     chap->ctrl = ctrl;
-
-     /*
-      * Allocate a large enough buffer for the entire negotiation:
-      * 4k should be enough to ffdhe8192.
-      */
-     chap->buf_size = 4096;
-     chap->buf = kzalloc(chap->buf_size, GFP_KERNEL);
-     if (!chap->buf) {
-         mutex_unlock(&ctrl->dhchap_auth_mutex);
-         kfree(chap);
-         return -ENOMEM;
-     }
-
-     INIT_WORK(&chap->auth_work, __nvme_auth_work);
-     list_add(&chap->entry, &ctrl->dhchap_auth_list);
-     mutex_unlock(&ctrl->dhchap_auth_mutex);
+     chap = &ctrl->dhchap_ctxs[qid];
+     cancel_work_sync(&chap->auth_work);
      queue_work(nvme_wq, &chap->auth_work);
      return 0;
  }
···
      struct nvme_dhchap_queue_context *chap;
      int ret;

-     mutex_lock(&ctrl->dhchap_auth_mutex);
-     list_for_each_entry(chap, &ctrl->dhchap_auth_list, entry) {
-         if (chap->qid != qid)
-             continue;
-         mutex_unlock(&ctrl->dhchap_auth_mutex);
-         flush_work(&chap->auth_work);
-         ret = chap->error;
-         return ret;
-     }
-     mutex_unlock(&ctrl->dhchap_auth_mutex);
-     return -ENXIO;
+     chap = &ctrl->dhchap_ctxs[qid];
+     flush_work(&chap->auth_work);
+     ret = chap->error;
+     /* clear sensitive info */
+     nvme_auth_reset_dhchap(chap);
+     return ret;
  }
  EXPORT_SYMBOL_GPL(nvme_auth_wait);

- void nvme_auth_reset(struct nvme_ctrl *ctrl)
- {
-     struct nvme_dhchap_queue_context *chap;
-
-     mutex_lock(&ctrl->dhchap_auth_mutex);
-     list_for_each_entry(chap, &ctrl->dhchap_auth_list, entry) {
-         mutex_unlock(&ctrl->dhchap_auth_mutex);
-         flush_work(&chap->auth_work);
-         __nvme_auth_reset(chap);
-     }
-     mutex_unlock(&ctrl->dhchap_auth_mutex);
- }
- EXPORT_SYMBOL_GPL(nvme_auth_reset);
-
- static void nvme_dhchap_auth_work(struct work_struct *work)
+ static void nvme_ctrl_auth_work(struct work_struct *work)
  {
      struct nvme_ctrl *ctrl =
          container_of(work, struct nvme_ctrl, dhchap_auth_work);
      int ret, q;
+
+     /*
+      * If the ctrl is not connected, bail as reconnect will handle
+      * authentication.
+      */
+     if (ctrl->state != NVME_CTRL_LIVE)
+         return;

      /* Authenticate admin queue first */
      ret = nvme_auth_negotiate(ctrl, 0);
···
       * Failure is a soft-state; credentials remain valid until
       * the controller terminates the connection.
       */
+     for (q = 1; q < ctrl->queue_count; q++) {
+         ret = nvme_auth_wait(ctrl, q);
+         if (ret)
+             dev_warn(ctrl->device,
+                  "qid %d: authentication failed\n", q);
+     }
  }

- void nvme_auth_init_ctrl(struct nvme_ctrl *ctrl)
+ int nvme_auth_init_ctrl(struct nvme_ctrl *ctrl)
  {
-     INIT_LIST_HEAD(&ctrl->dhchap_auth_list);
-     INIT_WORK(&ctrl->dhchap_auth_work, nvme_dhchap_auth_work);
+     struct nvme_dhchap_queue_context *chap;
+     int i, ret;
+
      mutex_init(&ctrl->dhchap_auth_mutex);
+     INIT_WORK(&ctrl->dhchap_auth_work, nvme_ctrl_auth_work);
      if (!ctrl->opts)
-         return;
-     nvme_auth_generate_key(ctrl->opts->dhchap_secret, &ctrl->host_key);
-     nvme_auth_generate_key(ctrl->opts->dhchap_ctrl_secret, &ctrl->ctrl_key);
+         return 0;
+     ret = nvme_auth_generate_key(ctrl->opts->dhchap_secret,
+             &ctrl->host_key);
+     if (ret)
+         return ret;
+     ret = nvme_auth_generate_key(ctrl->opts->dhchap_ctrl_secret,
+             &ctrl->ctrl_key);
+     if (ret)
+         goto err_free_dhchap_secret;
+
+     if (!ctrl->opts->dhchap_secret && !ctrl->opts->dhchap_ctrl_secret)
+         return ret;
+
+     ctrl->dhchap_ctxs = kvcalloc(ctrl_max_dhchaps(ctrl),
+             sizeof(*chap), GFP_KERNEL);
+     if (!ctrl->dhchap_ctxs) {
+         ret = -ENOMEM;
+         goto err_free_dhchap_ctrl_secret;
+     }
+
+     for (i = 0; i < ctrl_max_dhchaps(ctrl); i++) {
+         chap = &ctrl->dhchap_ctxs[i];
+         chap->qid = i;
+         chap->ctrl = ctrl;
+         INIT_WORK(&chap->auth_work, nvme_queue_auth_work);
+     }
+
+     return 0;
+ err_free_dhchap_ctrl_secret:
+     nvme_auth_free_key(ctrl->ctrl_key);
+     ctrl->ctrl_key = NULL;
+ err_free_dhchap_secret:
+     nvme_auth_free_key(ctrl->host_key);
+     ctrl->host_key = NULL;
+     return ret;
  }
  EXPORT_SYMBOL_GPL(nvme_auth_init_ctrl);

  void nvme_auth_stop(struct nvme_ctrl *ctrl)
  {
-     struct nvme_dhchap_queue_context *chap = NULL, *tmp;
-
      cancel_work_sync(&ctrl->dhchap_auth_work);
-     mutex_lock(&ctrl->dhchap_auth_mutex);
-     list_for_each_entry_safe(chap, tmp, &ctrl->dhchap_auth_list, entry)
-         cancel_work_sync(&chap->auth_work);
-     mutex_unlock(&ctrl->dhchap_auth_mutex);
  }
  EXPORT_SYMBOL_GPL(nvme_auth_stop);

  void nvme_auth_free(struct nvme_ctrl *ctrl)
  {
-     struct nvme_dhchap_queue_context *chap = NULL, *tmp;
+     int i;

-     mutex_lock(&ctrl->dhchap_auth_mutex);
-     list_for_each_entry_safe(chap, tmp, &ctrl->dhchap_auth_list, entry) {
-         list_del_init(&chap->entry);
-         flush_work(&chap->auth_work);
-         __nvme_auth_free(chap);
+     if (ctrl->dhchap_ctxs) {
+         for (i = 0; i < ctrl_max_dhchaps(ctrl); i++)
+             nvme_auth_free_dhchap(&ctrl->dhchap_ctxs[i]);
+         kfree(ctrl->dhchap_ctxs);
      }
-     mutex_unlock(&ctrl->dhchap_auth_mutex);
      if (ctrl->host_key) {
          nvme_auth_free_key(ctrl->host_key);
          ctrl->host_key = NULL;
···
      }
  }
  EXPORT_SYMBOL_GPL(nvme_auth_free);
+
+ int __init nvme_init_auth(void)
+ {
+     nvme_chap_buf_cache = kmem_cache_create("nvme-chap-buf-cache",
+                 CHAP_BUF_SIZE, 0, SLAB_HWCACHE_ALIGN, NULL);
+     if (!nvme_chap_buf_cache)
+         return -ENOMEM;
+
+     nvme_chap_buf_pool = mempool_create(16, mempool_alloc_slab,
+             mempool_free_slab, nvme_chap_buf_cache);
+     if (!nvme_chap_buf_pool)
+         goto err_destroy_chap_buf_cache;
+
+     return 0;
+ err_destroy_chap_buf_cache:
+     kmem_cache_destroy(nvme_chap_buf_cache);
+     return -ENOMEM;
+ }
+
+ void __exit nvme_exit_auth(void)
+ {
+     mempool_destroy(nvme_chap_buf_pool);
+     kmem_cache_destroy(nvme_chap_buf_cache);
+ }
+161 -158
drivers/nvme/host/core.c
···
      nvme_log_error(req);
      nvme_end_req_zoned(req);
      nvme_trace_bio_complete(req);
+     if (req->cmd_flags & REQ_NVME_MPATH)
+         nvme_mpath_end_request(req);
      blk_mq_end_request(req, status);
  }
···
      cmnd->write_zeroes.length =
          cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);

+     if (!(req->cmd_flags & REQ_NOUNMAP) && (ns->features & NVME_NS_DEAC))
+         cmnd->write_zeroes.control |= cpu_to_le16(NVME_WZ_DEAC);
+
      if (nvme_ns_has_pi(ns)) {
-         cmnd->write_zeroes.control = cpu_to_le16(NVME_RW_PRINFO_PRACT);
+         cmnd->write_zeroes.control |= cpu_to_le16(NVME_RW_PRINFO_PRACT);

          switch (ns->pi_type) {
          case NVME_NS_DPS_PI_TYPE1:
···
          nvme_unfreeze(ctrl);
          nvme_mpath_unfreeze(ctrl->subsys);
          mutex_unlock(&ctrl->subsys->lock);
-         nvme_remove_invalid_namespaces(ctrl, NVME_NSID_ALL);
          mutex_unlock(&ctrl->scan_lock);
      }
-     if (effects & NVME_CMD_EFFECTS_CCC)
-         nvme_init_ctrl_finish(ctrl);
+     if (effects & NVME_CMD_EFFECTS_CCC) {
+         dev_info(ctrl->device,
+ "controller capabilities changed, reset may be required to take effect.\n");
+     }
      if (effects & (NVME_CMD_EFFECTS_NIC | NVME_CMD_EFFECTS_NCC)) {
          nvme_queue_scan(ctrl);
          flush_work(&ctrl->scan_work);
···
          }
      }

+     /*
+      * Only set the DEAC bit if the device guarantees that reads from
+      * deallocated data return zeroes.  While the DEAC bit does not
+      * require that, it must be a no-op if reads from deallocated data
+      * do not return zeroes.
+      */
+     if ((id->dlfeat & 0x7) == 0x1 && (id->dlfeat & (1 << 3)))
+         ns->features |= NVME_NS_DEAC;
      set_disk_ro(ns->disk, nvme_ns_is_readonly(ns, info));
      set_bit(NVME_NS_READY, &ns->flags);
      blk_mq_unfreeze_queue(ns->disk->queue);
···
  };

  #ifdef CONFIG_BLK_SED_OPAL
- int nvme_sec_submit(void *data, u16 spsp, u8 secp, void *buffer, size_t len,
+ static int nvme_sec_submit(void *data, u16 spsp, u8 secp, void *buffer, size_t len,
          bool send)
  {
      struct nvme_ctrl *ctrl = data;
···
      return __nvme_submit_sync_cmd(ctrl->admin_q, &cmd, NULL, buffer, len,
              NVME_QID_ANY, 1, 0);
  }
- EXPORT_SYMBOL_GPL(nvme_sec_submit);
+
+ static void nvme_configure_opal(struct nvme_ctrl *ctrl, bool was_suspended)
+ {
+     if (ctrl->oacs & NVME_CTRL_OACS_SEC_SUPP) {
+         if (!ctrl->opal_dev)
+             ctrl->opal_dev = init_opal_dev(ctrl, &nvme_sec_submit);
+         else if (was_suspended)
+             opal_unlock_from_suspend(ctrl->opal_dev);
+     } else {
+         free_opal_dev(ctrl->opal_dev);
+         ctrl->opal_dev = NULL;
+     }
+ }
+ #else
+ static void nvme_configure_opal(struct nvme_ctrl *ctrl, bool was_suspended)
+ {
+ }
  #endif /* CONFIG_BLK_SED_OPAL */

  #ifdef CONFIG_BLK_DEV_ZONED
···
      .pr_ops = &nvme_pr_ops,
  };

- static int nvme_wait_ready(struct nvme_ctrl *ctrl, u32 timeout, bool enabled)
+ static int nvme_wait_ready(struct nvme_ctrl *ctrl, u32 mask, u32 val,
+         u32 timeout, const char *op)
  {
-     unsigned long timeout_jiffies = ((timeout + 1) * HZ / 2) + jiffies;
-     u32 csts, bit = enabled ? NVME_CSTS_RDY : 0;
+     unsigned long timeout_jiffies = jiffies + timeout * HZ;
+     u32 csts;
      int ret;

      while ((ret = ctrl->ops->reg_read32(ctrl, NVME_REG_CSTS, &csts)) == 0) {
          if (csts == ~0)
              return -ENODEV;
-         if ((csts & NVME_CSTS_RDY) == bit)
+         if ((csts & mask) == val)
              break;

          usleep_range(1000, 2000);
···
          if (time_after(jiffies, timeout_jiffies)) {
              dev_err(ctrl->device,
                  "Device not ready; aborting %s, CSTS=0x%x\n",
-                 enabled ? "initialisation" : "reset", csts);
+                 op, csts);
              return -ENODEV;
          }
      }
···
      return ret;
  }

- /*
-  * If the device has been passed off to us in an enabled state, just clear
-  * the enabled bit.  The spec says we should set the 'shutdown notification
-  * bits', but doing so may cause the device to complete commands to the
-  * admin queue ... and we don't know what memory that might be pointing at!
-  */
- int nvme_disable_ctrl(struct nvme_ctrl *ctrl)
+ int nvme_disable_ctrl(struct nvme_ctrl *ctrl, bool shutdown)
  {
      int ret;

      ctrl->ctrl_config &= ~NVME_CC_SHN_MASK;
-     ctrl->ctrl_config &= ~NVME_CC_ENABLE;
+     if (shutdown)
+         ctrl->ctrl_config |= NVME_CC_SHN_NORMAL;
+     else
+         ctrl->ctrl_config &= ~NVME_CC_ENABLE;

      ret = ctrl->ops->reg_write32(ctrl, NVME_REG_CC, ctrl->ctrl_config);
      if (ret)
          return ret;

+     if (shutdown) {
+         return nvme_wait_ready(ctrl, NVME_CSTS_SHST_MASK,
+                        NVME_CSTS_SHST_CMPLT,
+                        ctrl->shutdown_timeout, "shutdown");
+     }
      if (ctrl->quirks & NVME_QUIRK_DELAY_BEFORE_CHK_RDY)
          msleep(NVME_QUIRK_DELAY_AMOUNT);
-
-     return nvme_wait_ready(ctrl, NVME_CAP_TIMEOUT(ctrl->cap), false);
+     return nvme_wait_ready(ctrl, NVME_CSTS_RDY, 0,
+                    (NVME_CAP_TIMEOUT(ctrl->cap) + 1) / 2, "reset");
  }
  EXPORT_SYMBOL_GPL(nvme_disable_ctrl);
···
      ret = ctrl->ops->reg_write32(ctrl, NVME_REG_CC, ctrl->ctrl_config);
      if (ret)
          return ret;
-     return nvme_wait_ready(ctrl, timeout, true);
+     return nvme_wait_ready(ctrl, NVME_CSTS_RDY, NVME_CSTS_RDY,
+                    (timeout + 1) / 2, "initialisation");
  }
  EXPORT_SYMBOL_GPL(nvme_enable_ctrl);

- int nvme_shutdown_ctrl(struct nvme_ctrl *ctrl)
- {
-     unsigned long timeout = jiffies + (ctrl->shutdown_timeout * HZ);
-     u32 csts;
-     int ret;
-
-     ctrl->ctrl_config &= ~NVME_CC_SHN_MASK;
-     ctrl->ctrl_config |= NVME_CC_SHN_NORMAL;
-
-     ret = ctrl->ops->reg_write32(ctrl, NVME_REG_CC, ctrl->ctrl_config);
-     if (ret)
-         return ret;
-
-     while ((ret = ctrl->ops->reg_read32(ctrl, NVME_REG_CSTS, &csts)) == 0) {
-         if ((csts & NVME_CSTS_SHST_MASK) == NVME_CSTS_SHST_CMPLT)
-             break;
-
-         msleep(100);
-         if (fatal_signal_pending(current))
-             return -EINTR;
-         if (time_after(jiffies, timeout)) {
-             dev_err(ctrl->device,
-                 "Device shutdown incomplete; abort shutdown\n");
-             return -ENODEV;
-         }
-     }
-
-     return ret;
- }
- EXPORT_SYMBOL_GPL(nvme_shutdown_ctrl);

  static int nvme_configure_timestamp(struct nvme_ctrl *ctrl)
  {
···
      id = kzalloc(sizeof(*id), GFP_KERNEL);
      if (!id)
-         return 0;
+         return -ENOMEM;

      c.identify.opcode = nvme_admin_identify;
      c.identify.cns = NVME_ID_CNS_CS_CTRL;
···
   * register in our nvme_ctrl structure.  This should be called as soon as
   * the admin queue is fully up and running.
   */
- int nvme_init_ctrl_finish(struct nvme_ctrl *ctrl)
+ int nvme_init_ctrl_finish(struct nvme_ctrl *ctrl, bool was_suspended)
  {
      int ret;
···
      ret = nvme_configure_host_options(ctrl);
      if (ret < 0)
          return ret;
+
+     nvme_configure_opal(ctrl, was_suspended);

      if (!ctrl->identified && !nvme_discovery_ctrl(ctrl)) {
          /*
···
      memcpy(dhchap_secret, buf, count);
      nvme_auth_stop(ctrl);
      if (strcmp(dhchap_secret, opts->dhchap_secret)) {
+         struct nvme_dhchap_key *key, *host_key;
          int ret;

-         ret = nvme_auth_generate_key(dhchap_secret, &ctrl->host_key);
+         ret = nvme_auth_generate_key(dhchap_secret, &key);
          if (ret)
              return ret;
          kfree(opts->dhchap_secret);
          opts->dhchap_secret = dhchap_secret;
-         /* Key has changed; re-authentication with new key */
-         nvme_auth_reset(ctrl);
+         host_key = ctrl->host_key;
+         mutex_lock(&ctrl->dhchap_auth_mutex);
+         ctrl->host_key = key;
+         mutex_unlock(&ctrl->dhchap_auth_mutex);
+         nvme_auth_free_key(host_key);
      }
      /* Start re-authentication */
      dev_info(ctrl->device, "re-authenticating controller\n");
···
      memcpy(dhchap_secret, buf, count);
      nvme_auth_stop(ctrl);
      if (strcmp(dhchap_secret, opts->dhchap_ctrl_secret)) {
+         struct nvme_dhchap_key *key, *ctrl_key;
          int ret;

-         ret = nvme_auth_generate_key(dhchap_secret, &ctrl->ctrl_key);
+         ret = nvme_auth_generate_key(dhchap_secret, &key);
          if (ret)
              return ret;
          kfree(opts->dhchap_ctrl_secret);
          opts->dhchap_ctrl_secret = dhchap_secret;
-         /* Key has changed; re-authentication with new key */
-         nvme_auth_reset(ctrl);
+         ctrl_key = ctrl->ctrl_key;
+         mutex_lock(&ctrl->dhchap_auth_mutex);
+         ctrl->ctrl_key = key;
+         mutex_unlock(&ctrl->dhchap_auth_mutex);
+         nvme_auth_free_key(ctrl_key);
      }
      /* Start re-authentication */
      dev_info(ctrl->device, "re-authenticating controller\n");
···
      return a->mode;
  }

- static const struct attribute_group nvme_dev_attrs_group = {
+ const struct attribute_group nvme_dev_attrs_group = {
      .attrs      = nvme_dev_attrs,
      .is_visible = nvme_dev_attrs_are_visible,
  };
+ EXPORT_SYMBOL_GPL(nvme_dev_attrs_group);

  static const struct attribute_group *nvme_dev_attr_groups[] = {
      &nvme_dev_attrs_group,
···
  {
      int ret = NVME_SC_INVALID_NS | NVME_SC_DNR;

-     if (test_bit(NVME_NS_DEAD, &ns->flags))
-         goto out;
-
-     ret = NVME_SC_INVALID_NS | NVME_SC_DNR;
      if (!nvme_ns_ids_equal(&ns->head->ids, &info->ids)) {
          dev_err(ns->ctrl->device,
              "identifiers changed for nsid %d\n", ns->head->ns_id);
···
      down_write(&ctrl->namespaces_rwsem);
      list_for_each_entry_safe(ns, next, &ctrl->namespaces, list) {
-         if (ns->head->ns_id > nsid || test_bit(NVME_NS_DEAD, &ns->flags))
+         if (ns->head->ns_id > nsid)
              list_move_tail(&ns->list, &rm_list);
      }
      up_write(&ctrl->namespaces_rwsem);
···
      __le32 *ns_list;
      u32 prev = 0;
      int ret = 0, i;
-
-     if (nvme_ctrl_limited_cns(ctrl))
-         return -EOPNOTSUPP;

      ns_list = kzalloc(NVME_IDENTIFY_DATA_SIZE, GFP_KERNEL);
      if (!ns_list)
···
      }

      mutex_lock(&ctrl->scan_lock);
-     if (nvme_scan_ns_list(ctrl) != 0)
+     if (nvme_ctrl_limited_cns(ctrl)) {
          nvme_scan_ns_sequential(ctrl);
+     } else {
+         /*
+          * Fall back to sequential scan if DNR is set to handle broken
+          * devices which should support Identify NS List (as per the VS
+          * they report) but don't actually support it.
+          */
+         ret = nvme_scan_ns_list(ctrl);
+         if (ret > 0 && ret & NVME_SC_DNR)
+             nvme_scan_ns_sequential(ctrl);
+     }
      mutex_unlock(&ctrl->scan_lock);
  }
···
       * removing the namespaces' disks; fail all the queues now to avoid
       * potentially having to clean up the failed sync later.
       */
-     if (ctrl->state == NVME_CTRL_DEAD)
-         nvme_kill_queues(ctrl);
+     if (ctrl->state == NVME_CTRL_DEAD) {
+         nvme_mark_namespaces_dead(ctrl);
+         nvme_unquiesce_io_queues(ctrl);
+     }

      /* this is a no-op when called from the controller reset handler */
      nvme_change_ctrl_state(ctrl, NVME_CTRL_DELETING_NOIO);
···
      fw_act_timeout = jiffies +
          msecs_to_jiffies(admin_timeout * 1000);

-     nvme_stop_queues(ctrl);
+     nvme_quiesce_io_queues(ctrl);
      while (nvme_ctrl_pp_status(ctrl)) {
          if (time_after(jiffies, fw_act_timeout)) {
              dev_warn(ctrl->device,
···
      if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_LIVE))
          return;

-     nvme_start_queues(ctrl);
+     nvme_unquiesce_io_queues(ctrl);
      /* read FW slot information to clear the AER */
      nvme_get_fw_slot_info(ctrl);
···
  EXPORT_SYMBOL_GPL(nvme_complete_async_event);

  int nvme_alloc_admin_tag_set(struct nvme_ctrl *ctrl, struct blk_mq_tag_set *set,
-         const struct blk_mq_ops *ops, unsigned int flags,
-         unsigned int cmd_size)
+         const struct blk_mq_ops *ops, unsigned int cmd_size)
  {
      int ret;
···
      if (ctrl->ops->flags & NVME_F_FABRICS)
          set->reserved_tags = NVMF_RESERVED_TAGS;
      set->numa_node = ctrl->numa_node;
-     set->flags = flags;
+     set->flags = BLK_MQ_F_NO_SCHED;
+     if (ctrl->ops->flags & NVME_F_BLOCKING)
+         set->flags |= BLK_MQ_F_BLOCKING;
      set->cmd_size = cmd_size;
      set->driver_data = ctrl;
      set->nr_hw_queues = 1;
···
  out_cleanup_admin_q:
      blk_mq_destroy_queue(ctrl->admin_q);
+     blk_put_queue(ctrl->admin_q);
  out_free_tagset:
      blk_mq_free_tag_set(ctrl->admin_tagset);
      return ret;
···
  void nvme_remove_admin_tag_set(struct nvme_ctrl *ctrl)
  {
      blk_mq_destroy_queue(ctrl->admin_q);
-     if (ctrl->ops->flags & NVME_F_FABRICS)
+     blk_put_queue(ctrl->admin_q);
+     if (ctrl->ops->flags & NVME_F_FABRICS) {
          blk_mq_destroy_queue(ctrl->fabrics_q);
+         blk_put_queue(ctrl->fabrics_q);
+     }
      blk_mq_free_tag_set(ctrl->admin_tagset);
  }
  EXPORT_SYMBOL_GPL(nvme_remove_admin_tag_set);

  int nvme_alloc_io_tag_set(struct nvme_ctrl *ctrl, struct blk_mq_tag_set *set,
-         const struct blk_mq_ops *ops, unsigned int flags,
+         const struct blk_mq_ops *ops, unsigned int nr_maps,
          unsigned int cmd_size)
  {
      int ret;
···
      memset(set, 0, sizeof(*set));
      set->ops = ops;
      set->queue_depth = ctrl->sqsize + 1;
-     set->reserved_tags = NVMF_RESERVED_TAGS;
+     /*
+      * Some Apple controllers require tags to be unique across admin and
+      * the (only) I/O queue, so reserve the first 32 tags of the I/O queue.
+      */
+     if (ctrl->quirks & NVME_QUIRK_SHARED_TAGS)
+         set->reserved_tags = NVME_AQ_DEPTH;
+     else if (ctrl->ops->flags & NVME_F_FABRICS)
+         set->reserved_tags = NVMF_RESERVED_TAGS;
      set->numa_node = ctrl->numa_node;
-     set->flags = flags;
+     set->flags = BLK_MQ_F_SHOULD_MERGE;
+     if (ctrl->ops->flags & NVME_F_BLOCKING)
+         set->flags |= BLK_MQ_F_BLOCKING;
      set->cmd_size = cmd_size,
      set->driver_data = ctrl;
      set->nr_hw_queues = ctrl->queue_count - 1;
      set->timeout = NVME_IO_TIMEOUT;
-     if (ops->map_queues)
-         set->nr_maps = ctrl->opts->nr_poll_queues ? HCTX_MAX_TYPES : 2;
+     set->nr_maps = nr_maps;
      ret = blk_mq_alloc_tag_set(set);
      if (ret)
          return ret;
···
              ret = PTR_ERR(ctrl->connect_q);
              goto out_free_tag_set;
          }
+         blk_queue_flag_set(QUEUE_FLAG_SKIP_TAGSET_QUIESCE,
+                    ctrl->connect_q);
      }

      ctrl->tagset = set;
···
  void nvme_remove_io_tag_set(struct nvme_ctrl *ctrl)
  {
-     if (ctrl->ops->flags & NVME_F_FABRICS)
+     if (ctrl->ops->flags & NVME_F_FABRICS) {
          blk_mq_destroy_queue(ctrl->connect_q);
+         blk_put_queue(ctrl->connect_q);
+     }
      blk_mq_free_tag_set(ctrl->tagset);
  }
  EXPORT_SYMBOL_GPL(nvme_remove_io_tag_set);
···
      if (ctrl->queue_count > 1) {
          nvme_queue_scan(ctrl);
-         nvme_start_queues(ctrl);
+         nvme_unquiesce_io_queues(ctrl);
          nvme_mpath_update(ctrl);
      }
···
      nvme_auth_stop(ctrl);
      nvme_auth_free(ctrl);
      __free_page(ctrl->discard_page);
+     free_opal_dev(ctrl->opal_dev);

      if (subsys) {
          mutex_lock(&nvme_subsystems_lock);
···
              ctrl->instance);
      ctrl->device->class = nvme_class;
      ctrl->device->parent = ctrl->dev;
-     ctrl->device->groups = nvme_dev_attr_groups;
+     if (ops->dev_attr_groups)
+         ctrl->device->groups = ops->dev_attr_groups;
+     else
+         ctrl->device->groups = nvme_dev_attr_groups;
      ctrl->device->release = nvme_free_ctrl;
      dev_set_drvdata(ctrl->device, ctrl);
      ret = dev_set_name(ctrl->device, "nvme%d", ctrl->instance);
···
      nvme_fault_inject_init(&ctrl->fault_inject, dev_name(ctrl->device));
      nvme_mpath_init_ctrl(ctrl);
-     nvme_auth_init_ctrl(ctrl);
+     ret = nvme_auth_init_ctrl(ctrl);
+     if (ret)
+         goto out_free_cdev;

      return 0;
+ out_free_cdev:
+     cdev_device_del(&ctrl->cdev, ctrl->device);
  out_free_name:
      nvme_put_ctrl(ctrl);
      kfree_const(ctrl->device->kobj.name);
···
  }
  EXPORT_SYMBOL_GPL(nvme_init_ctrl);

- static void nvme_start_ns_queue(struct nvme_ns *ns)
- {
-     if (test_and_clear_bit(NVME_NS_STOPPED, &ns->flags))
-         blk_mq_unquiesce_queue(ns->queue);
- }
-
- static void nvme_stop_ns_queue(struct nvme_ns *ns)
- {
-     if (!test_and_set_bit(NVME_NS_STOPPED, &ns->flags))
-         blk_mq_quiesce_queue(ns->queue);
-     else
-         blk_mq_wait_quiesce_done(ns->queue);
- }
-
- /*
-  * Prepare a queue for teardown.
-  *
-  * This must forcibly unquiesce queues to avoid blocking dispatch, and only set
-  * the capacity to 0 after that to avoid blocking dispatchers that may be
-  * holding bd_mutex.  This will end buffered writers dirtying pages that can't
-  * be synced.
-  */
- static void nvme_set_queue_dying(struct nvme_ns *ns)
- {
-     if (test_and_set_bit(NVME_NS_DEAD, &ns->flags))
-         return;
-
-     blk_mark_disk_dead(ns->disk);
-     nvme_start_ns_queue(ns);
-
-     set_capacity_and_notify(ns->disk, 0);
- }
-
- /**
-  * nvme_kill_queues(): Ends all namespace queues
-  * @ctrl: the dead controller that needs to end
-  *
-  * Call this function when the driver determines it is unable to get the
-  * controller in a state capable of servicing IO.
-  */
- void nvme_kill_queues(struct nvme_ctrl *ctrl)
+ /* let I/O to all namespaces fail in preparation for surprise removal */
+ void nvme_mark_namespaces_dead(struct nvme_ctrl *ctrl)
  {
      struct nvme_ns *ns;

      down_read(&ctrl->namespaces_rwsem);
-
-     /* Forcibly unquiesce queues to avoid blocking dispatch */
-     if (ctrl->admin_q && !blk_queue_dying(ctrl->admin_q))
-         nvme_start_admin_queue(ctrl);
-
      list_for_each_entry(ns, &ctrl->namespaces, list)
-         nvme_set_queue_dying(ns);
-
+         blk_mark_disk_dead(ns->disk);
      up_read(&ctrl->namespaces_rwsem);
  }
- EXPORT_SYMBOL_GPL(nvme_kill_queues);
+ EXPORT_SYMBOL_GPL(nvme_mark_namespaces_dead);

  void nvme_unfreeze(struct nvme_ctrl *ctrl)
  {
···
  }
  EXPORT_SYMBOL_GPL(nvme_start_freeze);

- void nvme_stop_queues(struct nvme_ctrl *ctrl)
+ void nvme_quiesce_io_queues(struct nvme_ctrl *ctrl)
  {
-     struct nvme_ns *ns;
-
-     down_read(&ctrl->namespaces_rwsem);
-     list_for_each_entry(ns, &ctrl->namespaces, list)
-         nvme_stop_ns_queue(ns);
-     up_read(&ctrl->namespaces_rwsem);
+     if (!ctrl->tagset)
+         return;
+     if (!test_and_set_bit(NVME_CTRL_STOPPED, &ctrl->flags))
+         blk_mq_quiesce_tagset(ctrl->tagset);
+     else
+         blk_mq_wait_quiesce_done(ctrl->tagset);
  }
- EXPORT_SYMBOL_GPL(nvme_stop_queues);
+ EXPORT_SYMBOL_GPL(nvme_quiesce_io_queues);

- void nvme_start_queues(struct nvme_ctrl *ctrl)
+ void nvme_unquiesce_io_queues(struct nvme_ctrl *ctrl)
  {
-     struct nvme_ns *ns;
-
-     down_read(&ctrl->namespaces_rwsem);
-     list_for_each_entry(ns, &ctrl->namespaces, list)
-         nvme_start_ns_queue(ns);
-     up_read(&ctrl->namespaces_rwsem);
+     if (!ctrl->tagset)
+         return;
+     if (test_and_clear_bit(NVME_CTRL_STOPPED, &ctrl->flags))
+         blk_mq_unquiesce_tagset(ctrl->tagset);
  }
- EXPORT_SYMBOL_GPL(nvme_start_queues);
+ EXPORT_SYMBOL_GPL(nvme_unquiesce_io_queues);

- void nvme_stop_admin_queue(struct nvme_ctrl *ctrl)
+ void nvme_quiesce_admin_queue(struct nvme_ctrl *ctrl)
  {
      if (!test_and_set_bit(NVME_CTRL_ADMIN_Q_STOPPED, &ctrl->flags))
          blk_mq_quiesce_queue(ctrl->admin_q);
      else
-         blk_mq_wait_quiesce_done(ctrl->admin_q);
+         blk_mq_wait_quiesce_done(ctrl->admin_q->tag_set);
  }
- EXPORT_SYMBOL_GPL(nvme_stop_admin_queue);
+ EXPORT_SYMBOL_GPL(nvme_quiesce_admin_queue);

- void nvme_start_admin_queue(struct nvme_ctrl *ctrl)
+ void nvme_unquiesce_admin_queue(struct nvme_ctrl *ctrl)
  {
      if (test_and_clear_bit(NVME_CTRL_ADMIN_Q_STOPPED, &ctrl->flags))
          blk_mq_unquiesce_queue(ctrl->admin_q);
  }
- EXPORT_SYMBOL_GPL(nvme_start_admin_queue);
+ EXPORT_SYMBOL_GPL(nvme_unquiesce_admin_queue);

  void nvme_sync_io_queues(struct nvme_ctrl *ctrl)
  {
···
          goto unregister_generic_ns;
      }

+     result = nvme_init_auth();
+     if (result)
+         goto destroy_ns_chr;
      return 0;

+ destroy_ns_chr:
+     class_destroy(nvme_ns_chr_class);
  unregister_generic_ns:
      unregister_chrdev_region(nvme_ns_chr_devt, NVME_MINORS);
  destroy_subsys_class:
···

  static void __exit nvme_core_exit(void)
  {
+     nvme_exit_auth();
      class_destroy(nvme_ns_chr_class);
      class_destroy(nvme_subsys_class);
      class_destroy(nvme_class);
+36 -23
drivers/nvme/host/fc.c
··· 1475 1475 fc_dma_unmap_single(lport->dev, lsop->rspdma, 1476 1476 sizeof(*lsop->rspbuf), DMA_TO_DEVICE); 1477 1477 1478 + kfree(lsop->rspbuf); 1479 + kfree(lsop->rqstbuf); 1478 1480 kfree(lsop); 1479 1481 1480 1482 nvme_fc_rport_put(rport); ··· 1701 1699 spin_unlock_irqrestore(&rport->lock, flags); 1702 1700 } 1703 1701 1702 + static 1703 + void nvme_fc_rcv_ls_req_err_msg(struct nvme_fc_lport *lport, 1704 + struct fcnvme_ls_rqst_w0 *w0) 1705 + { 1706 + dev_info(lport->dev, "RCV %s LS failed: No memory\n", 1707 + (w0->ls_cmd <= NVME_FC_LAST_LS_CMD_VALUE) ? 1708 + nvmefc_ls_names[w0->ls_cmd] : ""); 1709 + } 1710 + 1704 1711 /** 1705 1712 * nvme_fc_rcv_ls_req - transport entry point called by an LLDD 1706 1713 * upon the reception of a NVME LS request. ··· 1762 1751 goto out_put; 1763 1752 } 1764 1753 1765 - lsop = kzalloc(sizeof(*lsop) + 1766 - sizeof(union nvmefc_ls_requests) + 1767 - sizeof(union nvmefc_ls_responses), 1768 - GFP_KERNEL); 1754 + lsop = kzalloc(sizeof(*lsop), GFP_KERNEL); 1769 1755 if (!lsop) { 1770 - dev_info(lport->dev, 1771 - "RCV %s LS failed: No memory\n", 1772 - (w0->ls_cmd <= NVME_FC_LAST_LS_CMD_VALUE) ? 
1773 - nvmefc_ls_names[w0->ls_cmd] : ""); 1756 + nvme_fc_rcv_ls_req_err_msg(lport, w0); 1774 1757 ret = -ENOMEM; 1775 1758 goto out_put; 1776 1759 } 1777 - lsop->rqstbuf = (union nvmefc_ls_requests *)&lsop[1]; 1778 - lsop->rspbuf = (union nvmefc_ls_responses *)&lsop->rqstbuf[1]; 1760 + 1761 + lsop->rqstbuf = kzalloc(sizeof(*lsop->rqstbuf), GFP_KERNEL); 1762 + lsop->rspbuf = kzalloc(sizeof(*lsop->rspbuf), GFP_KERNEL); 1763 + if (!lsop->rqstbuf || !lsop->rspbuf) { 1764 + nvme_fc_rcv_ls_req_err_msg(lport, w0); 1765 + ret = -ENOMEM; 1766 + goto out_free; 1767 + } 1779 1768 1780 1769 lsop->rspdma = fc_dma_map_single(lport->dev, lsop->rspbuf, 1781 1770 sizeof(*lsop->rspbuf), ··· 1812 1801 fc_dma_unmap_single(lport->dev, lsop->rspdma, 1813 1802 sizeof(*lsop->rspbuf), DMA_TO_DEVICE); 1814 1803 out_free: 1804 + kfree(lsop->rspbuf); 1805 + kfree(lsop->rqstbuf); 1815 1806 kfree(lsop); 1816 1807 out_put: 1817 1808 nvme_fc_rport_put(rport); ··· 2404 2391 list_del(&ctrl->ctrl_list); 2405 2392 spin_unlock_irqrestore(&ctrl->rport->lock, flags); 2406 2393 2407 - nvme_start_admin_queue(&ctrl->ctrl); 2394 + nvme_unquiesce_admin_queue(&ctrl->ctrl); 2408 2395 nvme_remove_admin_tag_set(&ctrl->ctrl); 2409 2396 2410 2397 kfree(ctrl->queues); ··· 2505 2492 * (but with error status). 
2506 2493 */ 2507 2494 if (ctrl->ctrl.queue_count > 1) { 2508 - nvme_stop_queues(&ctrl->ctrl); 2495 + nvme_quiesce_io_queues(&ctrl->ctrl); 2509 2496 nvme_sync_io_queues(&ctrl->ctrl); 2510 2497 blk_mq_tagset_busy_iter(&ctrl->tag_set, 2511 2498 nvme_fc_terminate_exchange, &ctrl->ctrl); 2512 2499 blk_mq_tagset_wait_completed_request(&ctrl->tag_set); 2513 2500 if (start_queues) 2514 2501 nvme_unquiesce_io_queues(&ctrl->ctrl); 2515 2502 } 2516 2503 2517 2504 /* 2518 2505 * Other transports, which don't have link-level contexts bound 2519 2506 * to sqe's, would try to gracefully shutdown the controller by 2520 2507 * writing the registers for shutdown and polling (call 2521 - * nvme_shutdown_ctrl()). Given a bunch of i/o was potentially 2508 + * nvme_disable_ctrl()). Given a bunch of i/o was potentially 2522 2509 just aborted and we will wait on those contexts, and given 2523 2510 there was no indication of how live the controller is on the 2524 2511 link, don't send more io to create more contexts for the ··· 2529 2516 /* 2530 2517 * clean up the admin queue. Same thing as above. 
2531 2518 */ 2532 - nvme_stop_admin_queue(&ctrl->ctrl); 2519 + nvme_quiesce_admin_queue(&ctrl->ctrl); 2533 2520 blk_sync_queue(ctrl->ctrl.admin_q); 2534 2521 blk_mq_tagset_busy_iter(&ctrl->admin_tag_set, 2535 2522 nvme_fc_terminate_exchange, &ctrl->ctrl); 2536 2523 blk_mq_tagset_wait_completed_request(&ctrl->admin_tag_set); 2537 2524 if (start_queues) 2538 - nvme_start_admin_queue(&ctrl->ctrl); 2525 + nvme_unquiesce_admin_queue(&ctrl->ctrl); 2539 2526 } 2540 2527 2541 2528 static void ··· 2745 2732 atomic_set(&op->state, FCPOP_STATE_ACTIVE); 2746 2733 2747 2734 if (!(op->flags & FCOP_FLAGS_AEN)) 2748 - blk_mq_start_request(op->rq); 2735 + nvme_start_request(op->rq); 2749 2736 2750 2737 cmdiu->csn = cpu_to_be32(atomic_inc_return(&queue->csn)); 2751 2738 ret = ctrl->lport->ops->fcp_io(&ctrl->lport->localport, ··· 2916 2903 nvme_fc_init_io_queues(ctrl); 2917 2904 2918 2905 ret = nvme_alloc_io_tag_set(&ctrl->ctrl, &ctrl->tag_set, 2919 - &nvme_fc_mq_ops, BLK_MQ_F_SHOULD_MERGE, 2906 + &nvme_fc_mq_ops, 1, 2920 2907 struct_size((struct nvme_fcp_op_w_sgl *)NULL, priv, 2921 2908 ctrl->lport->ops->fcprqst_priv_sz)); 2922 2909 if (ret) ··· 3117 3104 ctrl->ctrl.max_hw_sectors = ctrl->ctrl.max_segments << 3118 3105 (ilog2(SZ_4K) - 9); 3119 3106 3120 - nvme_start_admin_queue(&ctrl->ctrl); 3107 + nvme_unquiesce_admin_queue(&ctrl->ctrl); 3121 3108 3122 - ret = nvme_init_ctrl_finish(&ctrl->ctrl); 3109 + ret = nvme_init_ctrl_finish(&ctrl->ctrl, false); 3123 3110 if (ret || test_bit(ASSOC_FAILED, &ctrl->flags)) 3124 3111 goto out_disconnect_admin_queue; 3125 3112 ··· 3263 3250 nvme_fc_free_queue(&ctrl->queues[0]); 3264 3251 3265 3252 /* re-enable the admin_q so anything new can fast fail */ 3266 - nvme_start_admin_queue(&ctrl->ctrl); 3253 + nvme_unquiesce_admin_queue(&ctrl->ctrl); 3267 3254 3268 3255 /* resume the io queues so that things will fast fail */ 3269 - nvme_start_queues(&ctrl->ctrl); 3256 + nvme_unquiesce_io_queues(&ctrl->ctrl); 3270 3257 3271 3258 
nvme_fc_ctlr_inactive_on_rport(ctrl); 3272 3259 } ··· 3522 3509 nvme_fc_init_queue(ctrl, 0); 3523 3510 3524 3511 ret = nvme_alloc_admin_tag_set(&ctrl->ctrl, &ctrl->admin_tag_set, 3525 - &nvme_fc_admin_mq_ops, BLK_MQ_F_NO_SCHED, 3512 + &nvme_fc_admin_mq_ops, 3526 3513 struct_size((struct nvme_fcp_op_w_sgl *)NULL, priv, 3527 3514 ctrl->lport->ops->fcprqst_priv_sz)); 3528 3515 if (ret)
+85 -33
drivers/nvme/host/ioctl.c
··· 8 8 #include <linux/io_uring.h> 9 9 #include "nvme.h" 10 10 11 + static bool nvme_cmd_allowed(struct nvme_ns *ns, struct nvme_command *c, 12 + fmode_t mode) 13 + { 14 + if (capable(CAP_SYS_ADMIN)) 15 + return true; 16 + 17 + /* 18 + * Do not allow unprivileged processes to send vendor specific or fabrics 19 + * commands as we can't be sure about their effects. 20 + */ 21 + if (c->common.opcode >= nvme_cmd_vendor_start || 22 + c->common.opcode == nvme_fabrics_command) 23 + return false; 24 + 25 + /* 26 + * Do not allow unprivileged passthrough of admin commands except 27 + * for a subset of identify commands that contain information required 28 + * to form proper I/O commands in userspace and do not expose any 29 + * potentially sensitive information. 30 + */ 31 + if (!ns) { 32 + if (c->common.opcode == nvme_admin_identify) { 33 + switch (c->identify.cns) { 34 + case NVME_ID_CNS_NS: 35 + case NVME_ID_CNS_CS_NS: 36 + case NVME_ID_CNS_NS_CS_INDEP: 37 + case NVME_ID_CNS_CS_CTRL: 38 + case NVME_ID_CNS_CTRL: 39 + return true; 40 + } 41 + } 42 + return false; 43 + } 44 + 45 + /* 46 + * Only allow I/O commands that transfer data to the controller if the 47 + * special file is open for writing, but always allow I/O commands that 48 + * transfer data from the controller. 
49 + */ 50 + if (nvme_is_write(c)) 51 + return mode & FMODE_WRITE; 52 + return true; 53 + } 54 + 11 55 /* 12 56 * Convert integer values from ioctl structures to user pointers, silently 13 57 * ignoring the upper bits in the compat case to match behaviour of 32-bit ··· 305 261 } 306 262 307 263 static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns, 308 - struct nvme_passthru_cmd __user *ucmd) 264 + struct nvme_passthru_cmd __user *ucmd, fmode_t mode) 309 265 { 310 266 struct nvme_passthru_cmd cmd; 311 267 struct nvme_command c; ··· 313 269 u64 result; 314 270 int status; 315 271 316 - if (!capable(CAP_SYS_ADMIN)) 317 - return -EACCES; 318 272 if (copy_from_user(&cmd, ucmd, sizeof(cmd))) 319 273 return -EFAULT; 320 274 if (cmd.flags) ··· 332 290 c.common.cdw13 = cpu_to_le32(cmd.cdw13); 333 291 c.common.cdw14 = cpu_to_le32(cmd.cdw14); 334 292 c.common.cdw15 = cpu_to_le32(cmd.cdw15); 293 + 294 + if (!nvme_cmd_allowed(ns, &c, mode)) 295 + return -EACCES; 335 296 336 297 if (cmd.timeout_ms) 337 298 timeout = msecs_to_jiffies(cmd.timeout_ms); ··· 353 308 } 354 309 355 310 static int nvme_user_cmd64(struct nvme_ctrl *ctrl, struct nvme_ns *ns, 356 - struct nvme_passthru_cmd64 __user *ucmd, bool vec) 311 + struct nvme_passthru_cmd64 __user *ucmd, bool vec, 312 + fmode_t mode) 357 313 { 358 314 struct nvme_passthru_cmd64 cmd; 359 315 struct nvme_command c; 360 316 unsigned timeout = 0; 361 317 int status; 362 318 363 - if (!capable(CAP_SYS_ADMIN)) 364 - return -EACCES; 365 319 if (copy_from_user(&cmd, ucmd, sizeof(cmd))) 366 320 return -EFAULT; 367 321 if (cmd.flags) ··· 380 336 c.common.cdw13 = cpu_to_le32(cmd.cdw13); 381 337 c.common.cdw14 = cpu_to_le32(cmd.cdw14); 382 338 c.common.cdw15 = cpu_to_le32(cmd.cdw15); 339 + 340 + if (!nvme_cmd_allowed(ns, &c, mode)) 341 + return -EACCES; 383 342 384 343 if (cmd.timeout_ms) 385 344 timeout = msecs_to_jiffies(cmd.timeout_ms); ··· 530 483 void *meta = NULL; 531 484 int ret; 532 485 533 - if (!capable(CAP_SYS_ADMIN)) 
534 - return -EACCES; 535 - 536 486 c.common.opcode = READ_ONCE(cmd->opcode); 537 487 c.common.flags = READ_ONCE(cmd->flags); 538 488 if (c.common.flags) ··· 550 506 c.common.cdw13 = cpu_to_le32(READ_ONCE(cmd->cdw13)); 551 507 c.common.cdw14 = cpu_to_le32(READ_ONCE(cmd->cdw14)); 552 508 c.common.cdw15 = cpu_to_le32(READ_ONCE(cmd->cdw15)); 509 + 510 + if (!nvme_cmd_allowed(ns, &c, ioucmd->file->f_mode)) 511 + return -EACCES; 553 512 554 513 d.metadata = READ_ONCE(cmd->metadata); 555 514 d.addr = READ_ONCE(cmd->addr); ··· 617 570 } 618 571 619 572 static int nvme_ctrl_ioctl(struct nvme_ctrl *ctrl, unsigned int cmd, 620 - void __user *argp) 573 + void __user *argp, fmode_t mode) 621 574 { 622 575 switch (cmd) { 623 576 case NVME_IOCTL_ADMIN_CMD: 624 - return nvme_user_cmd(ctrl, NULL, argp); 577 + return nvme_user_cmd(ctrl, NULL, argp, mode); 625 578 case NVME_IOCTL_ADMIN64_CMD: 626 - return nvme_user_cmd64(ctrl, NULL, argp, false); 579 + return nvme_user_cmd64(ctrl, NULL, argp, false, mode); 627 580 default: 628 581 return sed_ioctl(ctrl->opal_dev, cmd, argp); 629 582 } ··· 648 601 #endif /* COMPAT_FOR_U64_ALIGNMENT */ 649 602 650 603 static int nvme_ns_ioctl(struct nvme_ns *ns, unsigned int cmd, 651 - void __user *argp) 604 + void __user *argp, fmode_t mode) 652 605 { 653 606 switch (cmd) { 654 607 case NVME_IOCTL_ID: 655 608 force_successful_syscall_return(); 656 609 return ns->head->ns_id; 657 610 case NVME_IOCTL_IO_CMD: 658 - return nvme_user_cmd(ns->ctrl, ns, argp); 611 + return nvme_user_cmd(ns->ctrl, ns, argp, mode); 659 612 /* 660 613 * struct nvme_user_io can have different padding on some 32-bit ABIs. 
661 614 * Just accept the compat version as all fields that are used are the ··· 667 620 case NVME_IOCTL_SUBMIT_IO: 668 621 return nvme_submit_io(ns, argp); 669 622 case NVME_IOCTL_IO64_CMD: 670 - return nvme_user_cmd64(ns->ctrl, ns, argp, false); 623 + return nvme_user_cmd64(ns->ctrl, ns, argp, false, mode); 671 624 case NVME_IOCTL_IO64_CMD_VEC: 672 - return nvme_user_cmd64(ns->ctrl, ns, argp, true); 625 + return nvme_user_cmd64(ns->ctrl, ns, argp, true, mode); 673 626 default: 674 627 return -ENOTTY; 675 628 } 676 629 } 677 630 678 - static int __nvme_ioctl(struct nvme_ns *ns, unsigned int cmd, void __user *arg) 631 + static int __nvme_ioctl(struct nvme_ns *ns, unsigned int cmd, void __user *arg, 632 + fmode_t mode) 679 633 { 680 - if (is_ctrl_ioctl(cmd)) 681 - return nvme_ctrl_ioctl(ns->ctrl, cmd, arg); 682 - return nvme_ns_ioctl(ns, cmd, arg); 634 + if (is_ctrl_ioctl(cmd)) 635 + return nvme_ctrl_ioctl(ns->ctrl, cmd, arg, mode); 636 + return nvme_ns_ioctl(ns, cmd, arg, mode); 683 637 } 684 638 685 639 int nvme_ioctl(struct block_device *bdev, fmode_t mode, ··· 688 640 { 689 641 struct nvme_ns *ns = bdev->bd_disk->private_data; 690 642 691 - return __nvme_ioctl(ns, cmd, (void __user *)arg); 643 + return __nvme_ioctl(ns, cmd, (void __user *)arg, mode); 692 644 } 693 645 694 646 long nvme_ns_chr_ioctl(struct file *file, unsigned int cmd, unsigned long arg) ··· 696 648 struct nvme_ns *ns = 697 649 container_of(file_inode(file)->i_cdev, struct nvme_ns, cdev); 698 650 699 - return __nvme_ioctl(ns, cmd, (void __user *)arg); 651 + return __nvme_ioctl(ns, cmd, (void __user *)arg, file->f_mode); 700 652 } 701 653 702 654 static int nvme_uring_cmd_checks(unsigned int issue_flags) ··· 764 716 } 765 717 #ifdef CONFIG_NVME_MULTIPATH 766 718 static int nvme_ns_head_ctrl_ioctl(struct nvme_ns *ns, unsigned int cmd, 767 - void __user *argp, struct nvme_ns_head *head, int srcu_idx) 719 + void __user *argp, struct nvme_ns_head *head, int srcu_idx, 720 + fmode_t mode) 768 721 
__releases(&head->srcu) 769 722 { 770 723 struct nvme_ctrl *ctrl = ns->ctrl; ··· 773 724 774 725 nvme_get_ctrl(ns->ctrl); 775 726 srcu_read_unlock(&head->srcu, srcu_idx); 776 - ret = nvme_ctrl_ioctl(ns->ctrl, cmd, argp); 727 + ret = nvme_ctrl_ioctl(ns->ctrl, cmd, argp, mode); 777 728 778 729 nvme_put_ctrl(ctrl); 779 730 return ret; ··· 798 749 * deadlock when deleting namespaces using the passthrough interface. 799 750 */ 800 751 if (is_ctrl_ioctl(cmd)) 801 - return nvme_ns_head_ctrl_ioctl(ns, cmd, argp, head, srcu_idx); 752 + return nvme_ns_head_ctrl_ioctl(ns, cmd, argp, head, srcu_idx, 753 + mode); 802 754 803 - ret = nvme_ns_ioctl(ns, cmd, argp); 755 + ret = nvme_ns_ioctl(ns, cmd, argp, mode); 804 756 out_unlock: 805 757 srcu_read_unlock(&head->srcu, srcu_idx); 806 758 return ret; ··· 823 773 goto out_unlock; 824 774 825 775 if (is_ctrl_ioctl(cmd)) 826 - return nvme_ns_head_ctrl_ioctl(ns, cmd, argp, head, srcu_idx); 776 + return nvme_ns_head_ctrl_ioctl(ns, cmd, argp, head, srcu_idx, 777 + file->f_mode); 827 778 828 - ret = nvme_ns_ioctl(ns, cmd, argp); 779 + ret = nvme_ns_ioctl(ns, cmd, argp, file->f_mode); 829 780 out_unlock: 830 781 srcu_read_unlock(&head->srcu, srcu_idx); 831 782 return ret; ··· 900 849 return ret; 901 850 } 902 851 903 - static int nvme_dev_user_cmd(struct nvme_ctrl *ctrl, void __user *argp) 852 + static int nvme_dev_user_cmd(struct nvme_ctrl *ctrl, void __user *argp, 853 + fmode_t mode) 904 854 { 905 855 struct nvme_ns *ns; 906 856 int ret; ··· 925 873 kref_get(&ns->kref); 926 874 up_read(&ctrl->namespaces_rwsem); 927 875 928 - ret = nvme_user_cmd(ctrl, ns, argp); 876 + ret = nvme_user_cmd(ctrl, ns, argp, mode); 929 877 nvme_put_ns(ns); 930 878 return ret; 931 879 ··· 942 890 943 891 switch (cmd) { 944 892 case NVME_IOCTL_ADMIN_CMD: 945 - return nvme_user_cmd(ctrl, NULL, argp); 893 + return nvme_user_cmd(ctrl, NULL, argp, file->f_mode); 946 894 case NVME_IOCTL_ADMIN64_CMD: 947 - return nvme_user_cmd64(ctrl, NULL, argp, false); 895 + return 
nvme_user_cmd64(ctrl, NULL, argp, false, file->f_mode); 948 896 case NVME_IOCTL_IO_CMD: 949 - return nvme_dev_user_cmd(ctrl, argp); 897 + return nvme_dev_user_cmd(ctrl, argp, file->f_mode); 950 898 case NVME_IOCTL_RESET: 951 899 if (!capable(CAP_SYS_ADMIN)) 952 900 return -EACCES;
+26
drivers/nvme/host/multipath.c
··· 114 114 kblockd_schedule_work(&ns->head->requeue_work); 115 115 } 116 116 117 + void nvme_mpath_start_request(struct request *rq) 118 + { 119 + struct nvme_ns *ns = rq->q->queuedata; 120 + struct gendisk *disk = ns->head->disk; 121 + 122 + if (!blk_queue_io_stat(disk->queue) || blk_rq_is_passthrough(rq)) 123 + return; 124 + 125 + nvme_req(rq)->flags |= NVME_MPATH_IO_STATS; 126 + nvme_req(rq)->start_time = bdev_start_io_acct(disk->part0, 127 + blk_rq_bytes(rq) >> SECTOR_SHIFT, 128 + req_op(rq), jiffies); 129 + } 130 + EXPORT_SYMBOL_GPL(nvme_mpath_start_request); 131 + 132 + void nvme_mpath_end_request(struct request *rq) 133 + { 134 + struct nvme_ns *ns = rq->q->queuedata; 135 + 136 + if (!(nvme_req(rq)->flags & NVME_MPATH_IO_STATS)) 137 + return; 138 + bdev_end_io_acct(ns->head->disk->part0, req_op(rq), 139 + nvme_req(rq)->start_time); 140 + } 141 + 117 142 void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl) 118 143 { 119 144 struct nvme_ns *ns; ··· 531 506 532 507 blk_queue_flag_set(QUEUE_FLAG_NONROT, head->disk->queue); 533 508 blk_queue_flag_set(QUEUE_FLAG_NOWAIT, head->disk->queue); 509 + blk_queue_flag_set(QUEUE_FLAG_IO_STAT, head->disk->queue); 534 510 /* 535 511 * This assumes all controllers that refer to a namespace either 536 512 * support poll queues or not. That is not a strict guarantee,
+49 -20
drivers/nvme/host/nvme.h
··· 162 162 u8 retries; 163 163 u8 flags; 164 164 u16 status; 165 + #ifdef CONFIG_NVME_MULTIPATH 166 + unsigned long start_time; 167 + #endif 165 168 struct nvme_ctrl *ctrl; 166 169 }; 167 170 ··· 176 173 enum { 177 174 NVME_REQ_CANCELLED = (1 << 0), 178 175 NVME_REQ_USERCMD = (1 << 1), 176 + NVME_MPATH_IO_STATS = (1 << 2), 179 177 }; 180 178 181 179 static inline struct nvme_request *nvme_req(struct request *req) ··· 241 237 NVME_CTRL_FAILFAST_EXPIRED = 0, 242 238 NVME_CTRL_ADMIN_Q_STOPPED = 1, 243 239 NVME_CTRL_STARTED_ONCE = 2, 240 + NVME_CTRL_STOPPED = 3, 244 241 }; 245 242 246 243 struct nvme_ctrl { ··· 341 336 342 337 #ifdef CONFIG_NVME_AUTH 343 338 struct work_struct dhchap_auth_work; 344 - struct list_head dhchap_auth_list; 345 339 struct mutex dhchap_auth_mutex; 340 + struct nvme_dhchap_queue_context *dhchap_ctxs; 346 341 struct nvme_dhchap_key *host_key; 347 342 struct nvme_dhchap_key *ctrl_key; 348 343 u16 transaction; ··· 459 454 enum nvme_ns_features { 460 455 NVME_NS_EXT_LBAS = 1 << 0, /* support extended LBA format */ 461 456 NVME_NS_METADATA_SUPPORTED = 1 << 1, /* support getting generated md */ 457 + NVME_NS_DEAC, /* DEAC bit in Write Zeroes supported */ 462 458 }; 463 459 464 460 struct nvme_ns { ··· 489 483 unsigned long features; 490 484 unsigned long flags; 491 485 #define NVME_NS_REMOVING 0 492 - #define NVME_NS_DEAD 1 493 486 #define NVME_NS_ANA_PENDING 2 494 487 #define NVME_NS_FORCE_RO 3 495 488 #define NVME_NS_READY 4 496 - #define NVME_NS_STOPPED 5 497 489 498 490 struct cdev cdev; 499 491 struct device cdev_device; ··· 512 508 unsigned int flags; 513 509 #define NVME_F_FABRICS (1 << 0) 514 510 #define NVME_F_METADATA_SUPPORTED (1 << 1) 511 + #define NVME_F_BLOCKING (1 << 2) 512 + 513 + const struct attribute_group **dev_attr_groups; 515 514 int (*reg_read32)(struct nvme_ctrl *ctrl, u32 off, u32 *val); 516 515 int (*reg_write32)(struct nvme_ctrl *ctrl, u32 off, u32 val); 517 516 int (*reg_read64)(struct nvme_ctrl *ctrl, u32 off, u64 
*val); ··· 735 728 void nvme_cancel_admin_tagset(struct nvme_ctrl *ctrl); 736 729 bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl, 737 730 enum nvme_ctrl_state new_state); 738 - int nvme_disable_ctrl(struct nvme_ctrl *ctrl); 731 + int nvme_disable_ctrl(struct nvme_ctrl *ctrl, bool shutdown); 739 732 int nvme_enable_ctrl(struct nvme_ctrl *ctrl); 740 - int nvme_shutdown_ctrl(struct nvme_ctrl *ctrl); 741 733 int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev, 742 734 const struct nvme_ctrl_ops *ops, unsigned long quirks); 743 735 void nvme_uninit_ctrl(struct nvme_ctrl *ctrl); 744 736 void nvme_start_ctrl(struct nvme_ctrl *ctrl); 745 737 void nvme_stop_ctrl(struct nvme_ctrl *ctrl); 746 - int nvme_init_ctrl_finish(struct nvme_ctrl *ctrl); 738 + int nvme_init_ctrl_finish(struct nvme_ctrl *ctrl, bool was_suspended); 747 739 int nvme_alloc_admin_tag_set(struct nvme_ctrl *ctrl, struct blk_mq_tag_set *set, 748 - const struct blk_mq_ops *ops, unsigned int flags, 749 - unsigned int cmd_size); 740 + const struct blk_mq_ops *ops, unsigned int cmd_size); 750 741 void nvme_remove_admin_tag_set(struct nvme_ctrl *ctrl); 751 742 int nvme_alloc_io_tag_set(struct nvme_ctrl *ctrl, struct blk_mq_tag_set *set, 752 - const struct blk_mq_ops *ops, unsigned int flags, 743 + const struct blk_mq_ops *ops, unsigned int nr_maps, 753 744 unsigned int cmd_size); 754 745 void nvme_remove_io_tag_set(struct nvme_ctrl *ctrl); 755 746 756 747 void nvme_remove_namespaces(struct nvme_ctrl *ctrl); 757 748 758 - int nvme_sec_submit(void *data, u16 spsp, u8 secp, void *buffer, size_t len, 759 - bool send); 760 - 761 749 void nvme_complete_async_event(struct nvme_ctrl *ctrl, __le16 status, 762 750 volatile union nvme_result *res); 763 751 764 - void nvme_stop_queues(struct nvme_ctrl *ctrl); 765 - void nvme_start_queues(struct nvme_ctrl *ctrl); 766 - void nvme_stop_admin_queue(struct nvme_ctrl *ctrl); 767 - void nvme_start_admin_queue(struct nvme_ctrl *ctrl); 768 - void nvme_kill_queues(struct 
nvme_ctrl *ctrl); 752 + void nvme_quiesce_io_queues(struct nvme_ctrl *ctrl); 753 + void nvme_unquiesce_io_queues(struct nvme_ctrl *ctrl); 754 + void nvme_quiesce_admin_queue(struct nvme_ctrl *ctrl); 755 + void nvme_unquiesce_admin_queue(struct nvme_ctrl *ctrl); 756 + void nvme_mark_namespaces_dead(struct nvme_ctrl *ctrl); 769 757 void nvme_sync_queues(struct nvme_ctrl *ctrl); 770 758 void nvme_sync_io_queues(struct nvme_ctrl *ctrl); 771 759 void nvme_unfreeze(struct nvme_ctrl *ctrl); ··· 859 857 extern const struct attribute_group *nvme_ns_id_attr_groups[]; 860 858 extern const struct pr_ops nvme_pr_ops; 861 859 extern const struct block_device_operations nvme_ns_head_ops; 860 + extern const struct attribute_group nvme_dev_attrs_group; 862 861 863 862 struct nvme_ns *nvme_find_path(struct nvme_ns_head *head); 864 863 #ifdef CONFIG_NVME_MULTIPATH ··· 886 883 void nvme_mpath_revalidate_paths(struct nvme_ns *ns); 887 884 void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl); 888 885 void nvme_mpath_shutdown_disk(struct nvme_ns_head *head); 886 + void nvme_mpath_start_request(struct request *rq); 887 + void nvme_mpath_end_request(struct request *rq); 889 888 890 889 static inline void nvme_trace_bio_complete(struct request *req) 891 890 { ··· 973 968 static inline void nvme_mpath_default_iopolicy(struct nvme_subsystem *subsys) 974 969 { 975 970 } 971 + static inline void nvme_mpath_start_request(struct request *rq) 972 + { 973 + } 974 + static inline void nvme_mpath_end_request(struct request *rq) 975 + { 976 + } 976 977 #endif /* CONFIG_NVME_MULTIPATH */ 977 978 978 979 int nvme_revalidate_zones(struct nvme_ns *ns); ··· 1024 1013 } 1025 1014 #endif 1026 1015 1016 + static inline void nvme_start_request(struct request *rq) 1017 + { 1018 + if (rq->cmd_flags & REQ_NVME_MPATH) 1019 + nvme_mpath_start_request(rq); 1020 + blk_mq_start_request(rq); 1021 + } 1022 + 1027 1023 static inline bool nvme_ctrl_sgl_supported(struct nvme_ctrl *ctrl) 1028 1024 { 1029 1025 return 
ctrl->sgls & ((1 << 0) | (1 << 1)); 1030 1026 } 1031 1027 1032 1028 #ifdef CONFIG_NVME_AUTH 1033 - void nvme_auth_init_ctrl(struct nvme_ctrl *ctrl); 1029 + int __init nvme_init_auth(void); 1030 + void __exit nvme_exit_auth(void); 1031 + int nvme_auth_init_ctrl(struct nvme_ctrl *ctrl); 1034 1032 void nvme_auth_stop(struct nvme_ctrl *ctrl); 1035 1033 int nvme_auth_negotiate(struct nvme_ctrl *ctrl, int qid); 1036 1034 int nvme_auth_wait(struct nvme_ctrl *ctrl, int qid); 1037 - void nvme_auth_reset(struct nvme_ctrl *ctrl); 1038 1035 void nvme_auth_free(struct nvme_ctrl *ctrl); 1039 1036 #else 1040 - static inline void nvme_auth_init_ctrl(struct nvme_ctrl *ctrl) {}; 1037 + static inline int nvme_auth_init_ctrl(struct nvme_ctrl *ctrl) 1038 + { 1039 + return 0; 1040 + } 1041 + static inline int __init nvme_init_auth(void) 1042 + { 1043 + return 0; 1044 + } 1045 + static inline void __exit nvme_exit_auth(void) 1046 + { 1047 + } 1041 1048 static inline void nvme_auth_stop(struct nvme_ctrl *ctrl) {}; 1042 1049 static inline int nvme_auth_negotiate(struct nvme_ctrl *ctrl, int qid) 1043 1050 {
+268 -338
drivers/nvme/host/pci.c
··· 15 15 #include <linux/init.h> 16 16 #include <linux/interrupt.h> 17 17 #include <linux/io.h> 18 + #include <linux/kstrtox.h> 18 19 #include <linux/memremap.h> 19 20 #include <linux/mm.h> 20 21 #include <linux/module.h> ··· 109 108 struct nvme_queue; 110 109 111 110 static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown); 112 - static bool __nvme_disable_io_queues(struct nvme_dev *dev, u8 opcode); 111 + static void nvme_delete_io_queues(struct nvme_dev *dev); 113 112 114 113 /* 115 114 * Represents an NVM Express device. Each nvme_dev is a PCI function. ··· 131 130 u32 db_stride; 132 131 void __iomem *bar; 133 132 unsigned long bar_mapped_size; 134 - struct work_struct remove_work; 135 133 struct mutex shutdown_lock; 136 134 bool subsystem; 137 135 u64 cmb_size; ··· 158 158 unsigned int nr_allocated_queues; 159 159 unsigned int nr_write_queues; 160 160 unsigned int nr_poll_queues; 161 - 162 - bool attrs_added; 163 161 }; 164 162 165 163 static int io_queue_depth_set(const char *val, const struct kernel_param *kp) ··· 239 241 return dev->nr_allocated_queues * 8 * dev->db_stride; 240 242 } 241 243 242 - static int nvme_dbbuf_dma_alloc(struct nvme_dev *dev) 244 + static void nvme_dbbuf_dma_alloc(struct nvme_dev *dev) 243 245 { 244 246 unsigned int mem_size = nvme_dbbuf_size(dev); 247 + 248 + if (!(dev->ctrl.oacs & NVME_CTRL_OACS_DBBUF_SUPP)) 249 + return; 245 250 246 251 if (dev->dbbuf_dbs) { 247 252 /* ··· 253 252 */ 254 253 memset(dev->dbbuf_dbs, 0, mem_size); 255 254 memset(dev->dbbuf_eis, 0, mem_size); 256 - return 0; 255 + return; 257 256 } 258 257 259 258 dev->dbbuf_dbs = dma_alloc_coherent(dev->dev, mem_size, 260 259 &dev->dbbuf_dbs_dma_addr, 261 260 GFP_KERNEL); 262 261 if (!dev->dbbuf_dbs) 263 - return -ENOMEM; 262 + goto fail; 264 263 dev->dbbuf_eis = dma_alloc_coherent(dev->dev, mem_size, 265 264 &dev->dbbuf_eis_dma_addr, 266 265 GFP_KERNEL); 267 - if (!dev->dbbuf_eis) { 268 - dma_free_coherent(dev->dev, mem_size, 269 - dev->dbbuf_dbs, 
dev->dbbuf_dbs_dma_addr); 270 - dev->dbbuf_dbs = NULL; 271 - return -ENOMEM; 272 - } 266 + if (!dev->dbbuf_eis) 267 + goto fail_free_dbbuf_dbs; 268 + return; 273 269 274 - return 0; 270 + fail_free_dbbuf_dbs: 271 + dma_free_coherent(dev->dev, mem_size, dev->dbbuf_dbs, 272 + dev->dbbuf_dbs_dma_addr); 273 + dev->dbbuf_dbs = NULL; 274 + fail: 275 + dev_warn(dev->dev, "unable to allocate dma for dbbuf\n"); 275 276 } 276 277 277 278 static void nvme_dbbuf_dma_free(struct nvme_dev *dev) ··· 395 392 PAGE_SIZE); 396 393 } 397 394 398 - static size_t nvme_pci_iod_alloc_size(void) 399 - { 400 - size_t npages = max(nvme_pci_npages_prp(), nvme_pci_npages_sgl()); 401 - 402 - return sizeof(__le64 *) * npages + 403 - sizeof(struct scatterlist) * NVME_MAX_SEGS; 404 - } 405 - 406 395 static int nvme_admin_init_hctx(struct blk_mq_hw_ctx *hctx, void *data, 407 396 unsigned int hctx_idx) 408 397 { 409 - struct nvme_dev *dev = data; 398 + struct nvme_dev *dev = to_nvme_dev(data); 410 399 struct nvme_queue *nvmeq = &dev->queues[0]; 411 400 412 401 WARN_ON(hctx_idx != 0); ··· 411 416 static int nvme_init_hctx(struct blk_mq_hw_ctx *hctx, void *data, 412 417 unsigned int hctx_idx) 413 418 { 414 - struct nvme_dev *dev = data; 419 + struct nvme_dev *dev = to_nvme_dev(data); 415 420 struct nvme_queue *nvmeq = &dev->queues[hctx_idx + 1]; 416 421 417 422 WARN_ON(dev->tagset.tags[hctx_idx] != hctx->tags); ··· 423 428 struct request *req, unsigned int hctx_idx, 424 429 unsigned int numa_node) 425 430 { 426 - struct nvme_dev *dev = set->driver_data; 431 + struct nvme_dev *dev = to_nvme_dev(set->driver_data); 427 432 struct nvme_iod *iod = blk_mq_rq_to_pdu(req); 428 433 429 434 nvme_req(req)->ctrl = &dev->ctrl; ··· 442 447 443 448 static void nvme_pci_map_queues(struct blk_mq_tag_set *set) 444 449 { 445 - struct nvme_dev *dev = set->driver_data; 450 + struct nvme_dev *dev = to_nvme_dev(set->driver_data); 446 451 int i, qoff, offset; 447 452 448 453 offset = queue_irq_offset(dev); ··· 909 914 goto 
out_unmap_data; 910 915 } 911 916 912 - blk_mq_start_request(req); 917 + nvme_start_request(req); 913 918 return BLK_STS_OK; 914 919 out_unmap_data: 915 920 nvme_unmap_data(dev, req); ··· 1469 1474 } 1470 1475 } 1471 1476 1472 - /** 1473 - * nvme_suspend_queue - put queue into suspended state 1474 - * @nvmeq: queue to suspend 1475 - */ 1476 - static int nvme_suspend_queue(struct nvme_queue *nvmeq) 1477 + static void nvme_suspend_queue(struct nvme_dev *dev, unsigned int qid) 1477 1478 { 1479 + struct nvme_queue *nvmeq = &dev->queues[qid]; 1480 + 1478 1481 if (!test_and_clear_bit(NVMEQ_ENABLED, &nvmeq->flags)) 1479 - return 1; 1482 + return; 1480 1483 1481 1484 /* ensure that nvme_queue_rq() sees NVMEQ_ENABLED cleared */ 1482 1485 mb(); 1483 1486 1484 1487 nvmeq->dev->online_queues--; 1485 1488 if (!nvmeq->qid && nvmeq->dev->ctrl.admin_q) 1486 - nvme_stop_admin_queue(&nvmeq->dev->ctrl); 1489 + nvme_quiesce_admin_queue(&nvmeq->dev->ctrl); 1487 1490 if (!test_and_clear_bit(NVMEQ_POLLED, &nvmeq->flags)) 1488 - pci_free_irq(to_pci_dev(nvmeq->dev->dev), nvmeq->cq_vector, nvmeq); 1489 - return 0; 1491 + pci_free_irq(to_pci_dev(dev->dev), nvmeq->cq_vector, nvmeq); 1490 1492 } 1491 1493 1492 1494 static void nvme_suspend_io_queues(struct nvme_dev *dev) ··· 1491 1499 int i; 1492 1500 1493 1501 for (i = dev->ctrl.queue_count - 1; i > 0; i--) 1494 - nvme_suspend_queue(&dev->queues[i]); 1495 - } 1496 - 1497 - static void nvme_disable_admin_queue(struct nvme_dev *dev, bool shutdown) 1498 - { 1499 - struct nvme_queue *nvmeq = &dev->queues[0]; 1500 - 1501 - if (shutdown) 1502 - nvme_shutdown_ctrl(&dev->ctrl); 1503 - else 1504 - nvme_disable_ctrl(&dev->ctrl); 1505 - 1506 - nvme_poll_irqdisable(nvmeq); 1502 + nvme_suspend_queue(dev, i); 1507 1503 } 1508 1504 1509 1505 /* ··· 1728 1748 * user requests may be waiting on a stopped queue. Start the 1729 1749 * queue to flush these to completion. 
1730 1750 */ 1731 - nvme_start_admin_queue(&dev->ctrl); 1732 - blk_mq_destroy_queue(dev->ctrl.admin_q); 1733 - blk_mq_free_tag_set(&dev->admin_tagset); 1751 + nvme_unquiesce_admin_queue(&dev->ctrl); 1752 + nvme_remove_admin_tag_set(&dev->ctrl); 1734 1753 } 1735 - } 1736 - 1737 - static int nvme_pci_alloc_admin_tag_set(struct nvme_dev *dev) 1738 - { 1739 - struct blk_mq_tag_set *set = &dev->admin_tagset; 1740 - 1741 - set->ops = &nvme_mq_admin_ops; 1742 - set->nr_hw_queues = 1; 1743 - 1744 - set->queue_depth = NVME_AQ_MQ_TAG_DEPTH; 1745 - set->timeout = NVME_ADMIN_TIMEOUT; 1746 - set->numa_node = dev->ctrl.numa_node; 1747 - set->cmd_size = sizeof(struct nvme_iod); 1748 - set->flags = BLK_MQ_F_NO_SCHED; 1749 - set->driver_data = dev; 1750 - 1751 - if (blk_mq_alloc_tag_set(set)) 1752 - return -ENOMEM; 1753 - dev->ctrl.admin_tagset = set; 1754 - 1755 - dev->ctrl.admin_q = blk_mq_init_queue(set); 1756 - if (IS_ERR(dev->ctrl.admin_q)) { 1757 - blk_mq_free_tag_set(set); 1758 - dev->ctrl.admin_q = NULL; 1759 - return -ENOMEM; 1760 - } 1761 - if (!blk_get_queue(dev->ctrl.admin_q)) { 1762 - nvme_dev_remove_admin(dev); 1763 - dev->ctrl.admin_q = NULL; 1764 - return -ENODEV; 1765 - } 1766 - return 0; 1767 1754 } 1768 1755 1769 1756 static unsigned long db_bar_size(struct nvme_dev *dev, unsigned nr_io_queues) ··· 1776 1829 (readl(dev->bar + NVME_REG_CSTS) & NVME_CSTS_NSSRO)) 1777 1830 writel(NVME_CSTS_NSSRO, dev->bar + NVME_REG_CSTS); 1778 1831 1779 - result = nvme_disable_ctrl(&dev->ctrl); 1832 + /* 1833 + * If the device has been passed off to us in an enabled state, just 1834 + * clear the enabled bit. The spec says we should set the 'shutdown 1835 + * notification bits', but doing so may cause the device to complete 1836 + * commands to the admin queue ... and we don't know what memory that 1837 + * might be pointing at! 
1838 + */ 1839 + result = nvme_disable_ctrl(&dev->ctrl, false); 1780 1840 if (result < 0) 1781 1841 return result; 1782 1842 ··· 2066 2112 u32 enable_bits = NVME_HOST_MEM_ENABLE; 2067 2113 int ret; 2068 2114 2115 + if (!dev->ctrl.hmpre) 2116 + return 0; 2117 + 2069 2118 preferred = min(preferred, max); 2070 2119 if (min > max) { 2071 2120 dev_warn(dev->ctrl.device, ··· 2149 2192 bool new; 2150 2193 int ret; 2151 2194 2152 - if (strtobool(buf, &new) < 0) 2195 + if (kstrtobool(buf, &new) < 0) 2153 2196 return -EINVAL; 2154 2197 2155 2198 if (new == ndev->hmb) ··· 2197 2240 NULL, 2198 2241 }; 2199 2242 2200 - static const struct attribute_group nvme_pci_attr_group = { 2243 + static const struct attribute_group nvme_pci_dev_attrs_group = { 2201 2244 .attrs = nvme_pci_attrs, 2202 2245 .is_visible = nvme_pci_attrs_are_visible, 2246 + }; 2247 + 2248 + static const struct attribute_group *nvme_pci_dev_attr_groups[] = { 2249 + &nvme_dev_attrs_group, 2250 + &nvme_pci_dev_attrs_group, 2251 + NULL, 2203 2252 }; 2204 2253 2205 2254 /* ··· 2280 2317 irq_queues += (nr_io_queues - poll_queues); 2281 2318 return pci_alloc_irq_vectors_affinity(pdev, 1, irq_queues, 2282 2319 PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd); 2283 - } 2284 - 2285 - static void nvme_disable_io_queues(struct nvme_dev *dev) 2286 - { 2287 - if (__nvme_disable_io_queues(dev, nvme_admin_delete_sq)) 2288 - __nvme_disable_io_queues(dev, nvme_admin_delete_cq); 2289 2320 } 2290 2321 2291 2322 static unsigned int nvme_max_io_queues(struct nvme_dev *dev) ··· 2389 2432 2390 2433 if (dev->online_queues - 1 < dev->max_qid) { 2391 2434 nr_io_queues = dev->online_queues - 1; 2392 - nvme_disable_io_queues(dev); 2435 + nvme_delete_io_queues(dev); 2393 2436 result = nvme_setup_io_queues_trylock(dev); 2394 2437 if (result) 2395 2438 return result; ··· 2452 2495 return 0; 2453 2496 } 2454 2497 2455 - static bool __nvme_disable_io_queues(struct nvme_dev *dev, u8 opcode) 2498 + static bool __nvme_delete_io_queues(struct nvme_dev 
*dev, u8 opcode) 2456 2499 { 2457 2500 int nr_queues = dev->online_queues - 1, sent = 0; 2458 2501 unsigned long timeout; ··· 2480 2523 return true; 2481 2524 } 2482 2525 2483 - static void nvme_pci_alloc_tag_set(struct nvme_dev *dev) 2526 + static void nvme_delete_io_queues(struct nvme_dev *dev) 2484 2527 { 2485 - struct blk_mq_tag_set * set = &dev->tagset; 2486 - int ret; 2528 + if (__nvme_delete_io_queues(dev, nvme_admin_delete_sq)) 2529 + __nvme_delete_io_queues(dev, nvme_admin_delete_cq); 2530 + } 2487 2531 2488 - set->ops = &nvme_mq_ops; 2489 - set->nr_hw_queues = dev->online_queues - 1; 2490 - set->nr_maps = 1; 2491 - if (dev->io_queues[HCTX_TYPE_READ]) 2492 - set->nr_maps = 2; 2532 + static unsigned int nvme_pci_nr_maps(struct nvme_dev *dev) 2533 + { 2493 2534 if (dev->io_queues[HCTX_TYPE_POLL]) 2494 - set->nr_maps = 3; 2495 - set->timeout = NVME_IO_TIMEOUT; 2496 - set->numa_node = dev->ctrl.numa_node; 2497 - set->queue_depth = min_t(unsigned, dev->q_depth, BLK_MQ_MAX_DEPTH) - 1; 2498 - set->cmd_size = sizeof(struct nvme_iod); 2499 - set->flags = BLK_MQ_F_SHOULD_MERGE; 2500 - set->driver_data = dev; 2501 - 2502 - /* 2503 - * Some Apple controllers requires tags to be unique 2504 - * across admin and IO queue, so reserve the first 32 2505 - * tags of the IO queue. 
2506 - */ 2507 - if (dev->ctrl.quirks & NVME_QUIRK_SHARED_TAGS) 2508 - set->reserved_tags = NVME_AQ_DEPTH; 2509 - 2510 - ret = blk_mq_alloc_tag_set(set); 2511 - if (ret) { 2512 - dev_warn(dev->ctrl.device, 2513 - "IO queues tagset allocation failed %d\n", ret); 2514 - return; 2515 - } 2516 - dev->ctrl.tagset = set; 2535 + return 3; 2536 + if (dev->io_queues[HCTX_TYPE_READ]) 2537 + return 2; 2538 + return 1; 2517 2539 } 2518 2540 2519 2541 static void nvme_pci_update_nr_queues(struct nvme_dev *dev) ··· 2583 2647 2584 2648 pci_enable_pcie_error_reporting(pdev); 2585 2649 pci_save_state(pdev); 2586 - return 0; 2650 + 2651 + return nvme_pci_configure_admin_queue(dev); 2587 2652 2588 2653 disable: 2589 2654 pci_disable_device(pdev); ··· 2598 2661 pci_release_mem_regions(to_pci_dev(dev->dev)); 2599 2662 } 2600 2663 2601 - static void nvme_pci_disable(struct nvme_dev *dev) 2664 + static bool nvme_pci_ctrl_is_dead(struct nvme_dev *dev) 2602 2665 { 2603 2666 struct pci_dev *pdev = to_pci_dev(dev->dev); 2667 + u32 csts; 2604 2668 2605 - pci_free_irq_vectors(pdev); 2669 + if (!pci_is_enabled(pdev) || !pci_device_is_present(pdev)) 2670 + return true; 2671 + if (pdev->error_state != pci_channel_io_normal) 2672 + return true; 2606 2673 2607 - if (pci_is_enabled(pdev)) { 2608 - pci_disable_pcie_error_reporting(pdev); 2609 - pci_disable_device(pdev); 2610 - } 2674 + csts = readl(dev->bar + NVME_REG_CSTS); 2675 + return (csts & NVME_CSTS_CFS) || !(csts & NVME_CSTS_RDY); 2611 2676 } 2612 2677 2613 2678 static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown) 2614 2679 { 2615 - bool dead = true, freeze = false; 2616 2680 struct pci_dev *pdev = to_pci_dev(dev->dev); 2681 + bool dead; 2617 2682 2618 2683 mutex_lock(&dev->shutdown_lock); 2619 - if (pci_is_enabled(pdev)) { 2620 - u32 csts; 2621 - 2622 - if (pci_device_is_present(pdev)) 2623 - csts = readl(dev->bar + NVME_REG_CSTS); 2624 - else 2625 - csts = ~0; 2626 - 2627 - if (dev->ctrl.state == NVME_CTRL_LIVE || 2628 - 
dev->ctrl.state == NVME_CTRL_RESETTING) { 2629 - freeze = true; 2684 + dead = nvme_pci_ctrl_is_dead(dev); 2685 + if (dev->ctrl.state == NVME_CTRL_LIVE || 2686 + dev->ctrl.state == NVME_CTRL_RESETTING) { 2687 + if (pci_is_enabled(pdev)) 2630 2688 nvme_start_freeze(&dev->ctrl); 2631 - } 2632 - dead = !!((csts & NVME_CSTS_CFS) || !(csts & NVME_CSTS_RDY) || 2633 - pdev->error_state != pci_channel_io_normal); 2689 + /* 2690 + * Give the controller a chance to complete all entered requests 2691 + * if doing a safe shutdown. 2692 + */ 2693 + if (!dead && shutdown) 2694 + nvme_wait_freeze_timeout(&dev->ctrl, NVME_IO_TIMEOUT); 2634 2695 } 2635 2696 2636 - /* 2637 - * Give the controller a chance to complete all entered requests if 2638 - * doing a safe shutdown. 2639 - */ 2640 - if (!dead && shutdown && freeze) 2641 - nvme_wait_freeze_timeout(&dev->ctrl, NVME_IO_TIMEOUT); 2642 - 2643 - nvme_stop_queues(&dev->ctrl); 2697 + nvme_quiesce_io_queues(&dev->ctrl); 2644 2698 2645 2699 if (!dead && dev->ctrl.queue_count > 0) { 2646 - nvme_disable_io_queues(dev); 2647 - nvme_disable_admin_queue(dev, shutdown); 2700 + nvme_delete_io_queues(dev); 2701 + nvme_disable_ctrl(&dev->ctrl, shutdown); 2702 + nvme_poll_irqdisable(&dev->queues[0]); 2648 2703 } 2649 2704 nvme_suspend_io_queues(dev); 2650 - nvme_suspend_queue(&dev->queues[0]); 2651 - nvme_pci_disable(dev); 2705 + nvme_suspend_queue(dev, 0); 2706 + pci_free_irq_vectors(pdev); 2707 + if (pci_is_enabled(pdev)) { 2708 + pci_disable_pcie_error_reporting(pdev); 2709 + pci_disable_device(pdev); 2710 + } 2652 2711 nvme_reap_pending_cqes(dev); 2653 2712 2654 2713 nvme_cancel_tagset(&dev->ctrl); ··· 2656 2723 * deadlocking blk-mq hot-cpu notifier. 
2657 2724 */ 2658 2725 if (shutdown) { 2659 - nvme_start_queues(&dev->ctrl); 2726 + nvme_unquiesce_io_queues(&dev->ctrl); 2660 2727 if (dev->ctrl.admin_q && !blk_queue_dying(dev->ctrl.admin_q)) 2661 - nvme_start_admin_queue(&dev->ctrl); 2728 + nvme_unquiesce_admin_queue(&dev->ctrl); 2662 2729 } 2663 2730 mutex_unlock(&dev->shutdown_lock); 2664 2731 } ··· 2695 2762 dma_pool_destroy(dev->prp_small_pool); 2696 2763 } 2697 2764 2765 + static int nvme_pci_alloc_iod_mempool(struct nvme_dev *dev) 2766 + { 2767 + size_t npages = max(nvme_pci_npages_prp(), nvme_pci_npages_sgl()); 2768 + size_t alloc_size = sizeof(__le64 *) * npages + 2769 + sizeof(struct scatterlist) * NVME_MAX_SEGS; 2770 + 2771 + WARN_ON_ONCE(alloc_size > PAGE_SIZE); 2772 + dev->iod_mempool = mempool_create_node(1, 2773 + mempool_kmalloc, mempool_kfree, 2774 + (void *)alloc_size, GFP_KERNEL, 2775 + dev_to_node(dev->dev)); 2776 + if (!dev->iod_mempool) 2777 + return -ENOMEM; 2778 + return 0; 2779 + } 2780 + 2698 2781 static void nvme_free_tagset(struct nvme_dev *dev) 2699 2782 { 2700 2783 if (dev->tagset.tags) 2701 - blk_mq_free_tag_set(&dev->tagset); 2784 + nvme_remove_io_tag_set(&dev->ctrl); 2702 2785 dev->ctrl.tagset = NULL; 2703 2786 } 2704 2787 2788 + /* pairs with nvme_pci_alloc_dev */ 2705 2789 static void nvme_pci_free_ctrl(struct nvme_ctrl *ctrl) 2706 2790 { 2707 2791 struct nvme_dev *dev = to_nvme_dev(ctrl); 2708 2792 2709 - nvme_dbbuf_dma_free(dev); 2710 2793 nvme_free_tagset(dev); 2711 - if (dev->ctrl.admin_q) 2712 - blk_put_queue(dev->ctrl.admin_q); 2713 - free_opal_dev(dev->ctrl.opal_dev); 2714 - mempool_destroy(dev->iod_mempool); 2715 2794 put_device(dev->dev); 2716 2795 kfree(dev->queues); 2717 2796 kfree(dev); 2718 - } 2719 - 2720 - static void nvme_remove_dead_ctrl(struct nvme_dev *dev) 2721 - { 2722 - /* 2723 - * Set state to deleting now to avoid blocking nvme_wait_reset(), which 2724 - * may be holding this pci_dev's device lock. 
2725 - */ 2726 - nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING); 2727 - nvme_get_ctrl(&dev->ctrl); 2728 - nvme_dev_disable(dev, false); 2729 - nvme_kill_queues(&dev->ctrl); 2730 - if (!queue_work(nvme_wq, &dev->remove_work)) 2731 - nvme_put_ctrl(&dev->ctrl); 2732 2797 } 2733 2798 2734 2799 static void nvme_reset_work(struct work_struct *work) ··· 2739 2808 if (dev->ctrl.state != NVME_CTRL_RESETTING) { 2740 2809 dev_warn(dev->ctrl.device, "ctrl state %d is not RESETTING\n", 2741 2810 dev->ctrl.state); 2742 - result = -ENODEV; 2743 - goto out; 2811 + return; 2744 2812 } 2745 2813 2746 2814 /* ··· 2754 2824 result = nvme_pci_enable(dev); 2755 2825 if (result) 2756 2826 goto out_unlock; 2757 - 2758 - result = nvme_pci_configure_admin_queue(dev); 2759 - if (result) 2760 - goto out_unlock; 2761 - 2762 - if (!dev->ctrl.admin_q) { 2763 - result = nvme_pci_alloc_admin_tag_set(dev); 2764 - if (result) 2765 - goto out_unlock; 2766 - } else { 2767 - nvme_start_admin_queue(&dev->ctrl); 2768 - } 2769 - 2770 - dma_set_min_align_mask(dev->dev, NVME_CTRL_PAGE_SIZE - 1); 2771 - 2772 - /* 2773 - * Limit the max command size to prevent iod->sg allocations going 2774 - * over a single page. 2775 - */ 2776 - dev->ctrl.max_hw_sectors = min_t(u32, 2777 - NVME_MAX_KB_SZ << 1, dma_max_mapping_size(dev->dev) >> 9); 2778 - dev->ctrl.max_segments = NVME_MAX_SEGS; 2779 - 2780 - /* 2781 - * Don't limit the IOMMU merged segment size. 2782 - */ 2783 - dma_set_max_seg_size(dev->dev, 0xffffffff); 2784 - 2827 + nvme_unquiesce_admin_queue(&dev->ctrl); 2785 2828 mutex_unlock(&dev->shutdown_lock); 2786 2829 2787 2830 /* ··· 2768 2865 goto out; 2769 2866 } 2770 2867 2771 - /* 2772 - * We do not support an SGL for metadata (yet), so we are limited to a 2773 - * single integrity segment for the separate metadata pointer. 
2774 - */ 2775 - dev->ctrl.max_integrity_segments = 1; 2776 - 2777 - result = nvme_init_ctrl_finish(&dev->ctrl); 2868 + result = nvme_init_ctrl_finish(&dev->ctrl, was_suspend); 2778 2869 if (result) 2779 2870 goto out; 2780 2871 2781 - if (dev->ctrl.oacs & NVME_CTRL_OACS_SEC_SUPP) { 2782 - if (!dev->ctrl.opal_dev) 2783 - dev->ctrl.opal_dev = 2784 - init_opal_dev(&dev->ctrl, &nvme_sec_submit); 2785 - else if (was_suspend) 2786 - opal_unlock_from_suspend(dev->ctrl.opal_dev); 2787 - } else { 2788 - free_opal_dev(dev->ctrl.opal_dev); 2789 - dev->ctrl.opal_dev = NULL; 2790 - } 2872 + nvme_dbbuf_dma_alloc(dev); 2791 2873 2792 - if (dev->ctrl.oacs & NVME_CTRL_OACS_DBBUF_SUPP) { 2793 - result = nvme_dbbuf_dma_alloc(dev); 2794 - if (result) 2795 - dev_warn(dev->dev, 2796 - "unable to allocate dma for dbbuf\n"); 2797 - } 2798 - 2799 - if (dev->ctrl.hmpre) { 2800 - result = nvme_setup_host_mem(dev); 2801 - if (result < 0) 2802 - goto out; 2803 - } 2874 + result = nvme_setup_host_mem(dev); 2875 + if (result < 0) 2876 + goto out; 2804 2877 2805 2878 result = nvme_setup_io_queues(dev); 2806 2879 if (result) 2807 2880 goto out; 2808 2881 2809 2882 /* 2810 - * Keep the controller around but remove all namespaces if we don't have 2811 - * any working I/O queue. 2883 + * Freeze and update the number of I/O queues as those might have 2884 + * changed. If there are no I/O queues left after this reset, keep the 2885 + * controller around but remove all namespaces.
2812 2886 */ 2813 - if (dev->online_queues < 2) { 2814 - dev_warn(dev->ctrl.device, "IO queues not created\n"); 2815 - nvme_kill_queues(&dev->ctrl); 2816 - nvme_remove_namespaces(&dev->ctrl); 2817 - nvme_free_tagset(dev); 2818 - } else { 2819 - nvme_start_queues(&dev->ctrl); 2887 + if (dev->online_queues > 1) { 2888 + nvme_unquiesce_io_queues(&dev->ctrl); 2820 2889 nvme_wait_freeze(&dev->ctrl); 2821 - if (!dev->ctrl.tagset) 2822 - nvme_pci_alloc_tag_set(dev); 2823 - else 2824 - nvme_pci_update_nr_queues(dev); 2890 + nvme_pci_update_nr_queues(dev); 2825 2891 nvme_dbbuf_set(dev); 2826 2892 nvme_unfreeze(&dev->ctrl); 2893 + } else { 2894 + dev_warn(dev->ctrl.device, "IO queues lost\n"); 2895 + nvme_mark_namespaces_dead(&dev->ctrl); 2896 + nvme_unquiesce_io_queues(&dev->ctrl); 2897 + nvme_remove_namespaces(&dev->ctrl); 2898 + nvme_free_tagset(dev); 2827 2899 } 2828 2900 2829 2901 /* ··· 2812 2934 goto out; 2813 2935 } 2814 2936 2815 - if (!dev->attrs_added && !sysfs_create_group(&dev->ctrl.device->kobj, 2816 - &nvme_pci_attr_group)) 2817 - dev->attrs_added = true; 2818 - 2819 2937 nvme_start_ctrl(&dev->ctrl); 2820 2938 return; 2821 2939 2822 2940 out_unlock: 2823 2941 mutex_unlock(&dev->shutdown_lock); 2824 2942 out: 2825 - if (result) 2826 - dev_warn(dev->ctrl.device, 2827 - "Removing after probe failure status: %d\n", result); 2828 - nvme_remove_dead_ctrl(dev); 2829 - } 2830 - 2831 - static void nvme_remove_dead_ctrl_work(struct work_struct *work) 2832 - { 2833 - struct nvme_dev *dev = container_of(work, struct nvme_dev, remove_work); 2834 - struct pci_dev *pdev = to_pci_dev(dev->dev); 2835 - 2836 - if (pci_get_drvdata(pdev)) 2837 - device_release_driver(&pdev->dev); 2838 - nvme_put_ctrl(&dev->ctrl); 2943 + /* 2944 + * Set state to deleting now to avoid blocking nvme_wait_reset(), which 2945 + * may be holding this pci_dev's device lock. 
2946 + */ 2947 + dev_warn(dev->ctrl.device, "Disabling device after reset failure: %d\n", 2948 + result); 2949 + nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING); 2950 + nvme_dev_disable(dev, true); 2951 + nvme_mark_namespaces_dead(&dev->ctrl); 2952 + nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DEAD); 2839 2953 } 2840 2954 2841 2955 static int nvme_pci_reg_read32(struct nvme_ctrl *ctrl, u32 off, u32 *val) ··· 2880 3010 .name = "pcie", 2881 3011 .module = THIS_MODULE, 2882 3012 .flags = NVME_F_METADATA_SUPPORTED, 3013 + .dev_attr_groups = nvme_pci_dev_attr_groups, 2883 3014 .reg_read32 = nvme_pci_reg_read32, 2884 3015 .reg_write32 = nvme_pci_reg_write32, 2885 3016 .reg_read64 = nvme_pci_reg_read64, ··· 2950 3079 return 0; 2951 3080 } 2952 3081 2953 - static void nvme_async_probe(void *data, async_cookie_t cookie) 3082 + static struct nvme_dev *nvme_pci_alloc_dev(struct pci_dev *pdev, 3083 + const struct pci_device_id *id) 2954 3084 { 2955 - struct nvme_dev *dev = data; 2956 - 2957 - flush_work(&dev->ctrl.reset_work); 2958 - flush_work(&dev->ctrl.scan_work); 2959 - nvme_put_ctrl(&dev->ctrl); 2960 - } 2961 - 2962 - static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id) 2963 - { 2964 - int node, result = -ENOMEM; 2965 - struct nvme_dev *dev; 2966 3085 unsigned long quirks = id->driver_data; 2967 - size_t alloc_size; 3086 + int node = dev_to_node(&pdev->dev); 3087 + struct nvme_dev *dev; 3088 + int ret = -ENOMEM; 2968 3089 2969 - node = dev_to_node(&pdev->dev); 2970 3090 if (node == NUMA_NO_NODE) 2971 3091 set_dev_node(&pdev->dev, first_memory_node); 2972 3092 2973 3093 dev = kzalloc_node(sizeof(*dev), GFP_KERNEL, node); 2974 3094 if (!dev) 2975 - return -ENOMEM; 3095 + return NULL; 3096 + INIT_WORK(&dev->ctrl.reset_work, nvme_reset_work); 3097 + mutex_init(&dev->shutdown_lock); 2976 3098 2977 3099 dev->nr_write_queues = write_queues; 2978 3100 dev->nr_poll_queues = poll_queues; ··· 2973 3109 dev->queues = 
kcalloc_node(dev->nr_allocated_queues, 2974 3110 sizeof(struct nvme_queue), GFP_KERNEL, node); 2975 3111 if (!dev->queues) 2976 - goto free; 3112 + goto out_free_dev; 2977 3113 2978 3114 dev->dev = get_device(&pdev->dev); 2979 - pci_set_drvdata(pdev, dev); 2980 - 2981 - result = nvme_dev_map(dev); 2982 - if (result) 2983 - goto put_pci; 2984 - 2985 - INIT_WORK(&dev->ctrl.reset_work, nvme_reset_work); 2986 - INIT_WORK(&dev->remove_work, nvme_remove_dead_ctrl_work); 2987 - mutex_init(&dev->shutdown_lock); 2988 - 2989 - result = nvme_setup_prp_pools(dev); 2990 - if (result) 2991 - goto unmap; 2992 3115 2993 3116 quirks |= check_vendor_combination_bug(pdev); 2994 - 2995 3117 if (!noacpi && acpi_storage_d3(&pdev->dev)) { 2996 3118 /* 2997 3119 * Some systems use a bios work around to ask for D3 on ··· 2987 3137 "platform quirk: setting simple suspend\n"); 2988 3138 quirks |= NVME_QUIRK_SIMPLE_SUSPEND; 2989 3139 } 3140 + ret = nvme_init_ctrl(&dev->ctrl, &pdev->dev, &nvme_pci_ctrl_ops, 3141 + quirks); 3142 + if (ret) 3143 + goto out_put_device; 3144 + 3145 + dma_set_min_align_mask(&pdev->dev, NVME_CTRL_PAGE_SIZE - 1); 3146 + dma_set_max_seg_size(&pdev->dev, 0xffffffff); 2990 3147 2991 3148 /* 2992 - * Double check that our mempool alloc size will cover the biggest 2993 - * command we support. 3149 + * Limit the max command size to prevent iod->sg allocations going 3150 + * over a single page. 
2994 3151 */ 2995 - alloc_size = nvme_pci_iod_alloc_size(); 2996 - WARN_ON_ONCE(alloc_size > PAGE_SIZE); 3152 + dev->ctrl.max_hw_sectors = min_t(u32, 3153 + NVME_MAX_KB_SZ << 1, dma_max_mapping_size(&pdev->dev) >> 9); 3154 + dev->ctrl.max_segments = NVME_MAX_SEGS; 2997 3155 2998 - dev->iod_mempool = mempool_create_node(1, mempool_kmalloc, 2999 - mempool_kfree, 3000 - (void *) alloc_size, 3001 - GFP_KERNEL, node); 3002 - if (!dev->iod_mempool) { 3003 - result = -ENOMEM; 3004 - goto release_pools; 3005 - } 3156 + /* 3157 + * There is no support for SGLs for metadata (yet), so we are limited to 3158 + * a single integrity segment for the separate metadata pointer. 3159 + */ 3160 + dev->ctrl.max_integrity_segments = 1; 3161 + return dev; 3006 3162 3007 - result = nvme_init_ctrl(&dev->ctrl, &pdev->dev, &nvme_pci_ctrl_ops, 3008 - quirks); 3163 + out_put_device: 3164 + put_device(dev->dev); 3165 + kfree(dev->queues); 3166 + out_free_dev: 3167 + kfree(dev); 3168 + return ERR_PTR(ret); 3169 + } 3170 + 3171 + static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id) 3172 + { 3173 + struct nvme_dev *dev; 3174 + int result = -ENOMEM; 3175 + 3176 + dev = nvme_pci_alloc_dev(pdev, id); 3177 + if (!dev) 3178 + return -ENOMEM; 3179 + 3180 + result = nvme_dev_map(dev); 3009 3181 if (result) 3010 - goto release_mempool; 3182 + goto out_uninit_ctrl; 3183 + 3184 + result = nvme_setup_prp_pools(dev); 3185 + if (result) 3186 + goto out_dev_unmap; 3187 + 3188 + result = nvme_pci_alloc_iod_mempool(dev); 3189 + if (result) 3190 + goto out_release_prp_pools; 3011 3191 3012 3192 dev_info(dev->ctrl.device, "pci function %s\n", dev_name(&pdev->dev)); 3013 3193 3014 - nvme_reset_ctrl(&dev->ctrl); 3015 - async_schedule(nvme_async_probe, dev); 3194 + result = nvme_pci_enable(dev); 3195 + if (result) 3196 + goto out_release_iod_mempool; 3016 3197 3198 + result = nvme_alloc_admin_tag_set(&dev->ctrl, &dev->admin_tagset, 3199 + &nvme_mq_admin_ops, sizeof(struct nvme_iod)); 3200 + if 
(result) 3201 + goto out_disable; 3202 + 3203 + /* 3204 + * Mark the controller as connecting before sending admin commands to 3205 + * allow the timeout handler to do the right thing. 3206 + */ 3207 + if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_CONNECTING)) { 3208 + dev_warn(dev->ctrl.device, 3209 + "failed to mark controller CONNECTING\n"); 3210 + result = -EBUSY; 3211 + goto out_disable; 3212 + } 3213 + 3214 + result = nvme_init_ctrl_finish(&dev->ctrl, false); 3215 + if (result) 3216 + goto out_disable; 3217 + 3218 + nvme_dbbuf_dma_alloc(dev); 3219 + 3220 + result = nvme_setup_host_mem(dev); 3221 + if (result < 0) 3222 + goto out_disable; 3223 + 3224 + result = nvme_setup_io_queues(dev); 3225 + if (result) 3226 + goto out_disable; 3227 + 3228 + if (dev->online_queues > 1) { 3229 + nvme_alloc_io_tag_set(&dev->ctrl, &dev->tagset, &nvme_mq_ops, 3230 + nvme_pci_nr_maps(dev), sizeof(struct nvme_iod)); 3231 + nvme_dbbuf_set(dev); 3232 + } 3233 + 3234 + if (!dev->ctrl.tagset) 3235 + dev_warn(dev->ctrl.device, "IO queues not created\n"); 3236 + 3237 + if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_LIVE)) { 3238 + dev_warn(dev->ctrl.device, 3239 + "failed to mark controller live state\n"); 3240 + result = -ENODEV; 3241 + goto out_disable; 3242 + } 3243 + 3244 + pci_set_drvdata(pdev, dev); 3245 + 3246 + nvme_start_ctrl(&dev->ctrl); 3247 + nvme_put_ctrl(&dev->ctrl); 3017 3248 return 0; 3018 3249 3019 - release_mempool: 3250 + out_disable: 3251 + nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING); 3252 + nvme_dev_disable(dev, true); 3253 + nvme_free_host_mem(dev); 3254 + nvme_dev_remove_admin(dev); 3255 + nvme_dbbuf_dma_free(dev); 3256 + nvme_free_queues(dev, 0); 3257 + out_release_iod_mempool: 3020 3258 mempool_destroy(dev->iod_mempool); 3021 - release_pools: 3259 + out_release_prp_pools: 3022 3260 nvme_release_prp_pools(dev); 3023 - unmap: 3261 + out_dev_unmap: 3024 3262 nvme_dev_unmap(dev); 3025 - put_pci: 3026 - put_device(dev->dev); 3027 - free: 3028 - 
kfree(dev->queues); 3029 - kfree(dev); 3263 + out_uninit_ctrl: 3264 + nvme_uninit_ctrl(&dev->ctrl); 3030 3265 return result; 3031 3266 } 3032 3267 ··· 3143 3208 nvme_disable_prepare_reset(dev, true); 3144 3209 } 3145 3210 3146 - static void nvme_remove_attrs(struct nvme_dev *dev) 3147 - { 3148 - if (dev->attrs_added) 3149 - sysfs_remove_group(&dev->ctrl.device->kobj, 3150 - &nvme_pci_attr_group); 3151 - } 3152 - 3153 3211 /* 3154 3212 * The driver's remove may be called on a device in a partially initialized 3155 3213 * state. This function must not have any dependencies on the device state in ··· 3164 3236 nvme_stop_ctrl(&dev->ctrl); 3165 3237 nvme_remove_namespaces(&dev->ctrl); 3166 3238 nvme_dev_disable(dev, true); 3167 - nvme_remove_attrs(dev); 3168 3239 nvme_free_host_mem(dev); 3169 3240 nvme_dev_remove_admin(dev); 3241 + nvme_dbbuf_dma_free(dev); 3170 3242 nvme_free_queues(dev, 0); 3243 + mempool_destroy(dev->iod_mempool); 3171 3244 nvme_release_prp_pools(dev); 3172 3245 nvme_dev_unmap(dev); 3173 3246 nvme_uninit_ctrl(&dev->ctrl); ··· 3505 3576 .probe = nvme_probe, 3506 3577 .remove = nvme_remove, 3507 3578 .shutdown = nvme_shutdown, 3508 - #ifdef CONFIG_PM_SLEEP 3509 3579 .driver = { 3510 - .pm = &nvme_dev_pm_ops, 3511 - }, 3580 + .probe_type = PROBE_PREFER_ASYNCHRONOUS, 3581 + #ifdef CONFIG_PM_SLEEP 3582 + .pm = &nvme_dev_pm_ops, 3512 3583 #endif 3584 + }, 3513 3585 .sriov_configure = pci_sriov_configure_simple, 3514 3586 .err_handler = &nvme_err_handler, 3515 3587 };
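The pci.c rework above replaces the async-probe-plus-reset path with a synchronous nvme_probe() that unwinds through labelled out_* targets (out_disable, out_release_iod_mempool, out_release_prp_pools, out_dev_unmap, out_uninit_ctrl). A minimal userspace sketch of that goto-unwind idiom; the step names are hypothetical stand-ins for the real driver calls, not the driver's code:

```c
#include <assert.h>
#include <string.h>

/*
 * Sketch of the goto-unwind error-handling idiom the reworked
 * nvme_probe() follows: each setup step gets a matching out_* label,
 * and a failure at step N unwinds steps N-1..1 in reverse order.
 * log_buf records what "ran" so the ordering can be checked.
 */
static char log_buf[64];

static void note(const char *s) { strcat(log_buf, s); }

static int setup(int fail_at)
{
	int step = 1;

	if (step++ == fail_at)		/* e.g. device mapping fails */
		goto out;
	note("map ");

	if (step++ == fail_at)		/* e.g. DMA pool setup fails */
		goto out_unmap;
	note("pools ");

	if (step++ == fail_at)		/* e.g. mempool allocation fails */
		goto out_release_pools;
	note("mempool ");

	return 0;			/* fully initialized */

out_release_pools:
	note("unpools ");
out_unmap:
	note("unmap ");
out:
	return -1;
}
```

A failure at step N releases only what steps 1..N-1 acquired, in reverse order, which mirrors how the new synchronous probe path can tear itself down directly instead of scheduling a separate remove-dead-controller work item.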
+20 -22
drivers/nvme/host/rdma.c
··· 798 798 NVME_RDMA_METADATA_SGL_SIZE; 799 799 800 800 return nvme_alloc_io_tag_set(ctrl, &to_rdma_ctrl(ctrl)->tag_set, 801 - &nvme_rdma_mq_ops, BLK_MQ_F_SHOULD_MERGE, cmd_size); 801 + &nvme_rdma_mq_ops, 802 + ctrl->opts->nr_poll_queues ? HCTX_MAX_TYPES : 2, 803 + cmd_size); 802 804 } 803 805 804 806 static void nvme_rdma_destroy_admin_queue(struct nvme_rdma_ctrl *ctrl) ··· 848 846 if (new) { 849 847 error = nvme_alloc_admin_tag_set(&ctrl->ctrl, 850 848 &ctrl->admin_tag_set, &nvme_rdma_admin_mq_ops, 851 - BLK_MQ_F_NO_SCHED, 852 849 sizeof(struct nvme_rdma_request) + 853 850 NVME_RDMA_DATA_SGL_SIZE); 854 851 if (error) ··· 870 869 else 871 870 ctrl->ctrl.max_integrity_segments = 0; 872 871 873 - nvme_start_admin_queue(&ctrl->ctrl); 872 + nvme_unquiesce_admin_queue(&ctrl->ctrl); 874 873 875 - error = nvme_init_ctrl_finish(&ctrl->ctrl); 874 + error = nvme_init_ctrl_finish(&ctrl->ctrl, false); 876 875 if (error) 877 876 goto out_quiesce_queue; 878 877 879 878 return 0; 880 879 881 880 out_quiesce_queue: 882 - nvme_stop_admin_queue(&ctrl->ctrl); 881 + nvme_quiesce_admin_queue(&ctrl->ctrl); 883 882 blk_sync_queue(ctrl->ctrl.admin_q); 884 883 out_stop_queue: 885 884 nvme_rdma_stop_queue(&ctrl->queues[0]); ··· 923 922 goto out_cleanup_tagset; 924 923 925 924 if (!new) { 926 - nvme_start_queues(&ctrl->ctrl); 925 + nvme_unquiesce_io_queues(&ctrl->ctrl); 927 926 if (!nvme_wait_freeze_timeout(&ctrl->ctrl, NVME_IO_TIMEOUT)) { 928 927 /* 929 928 * If we timed out waiting for freeze we are likely to ··· 950 949 return 0; 951 950 952 951 out_wait_freeze_timed_out: 953 - nvme_stop_queues(&ctrl->ctrl); 952 + nvme_quiesce_io_queues(&ctrl->ctrl); 954 953 nvme_sync_io_queues(&ctrl->ctrl); 955 954 nvme_rdma_stop_io_queues(ctrl); 956 955 out_cleanup_tagset: ··· 965 964 static void nvme_rdma_teardown_admin_queue(struct nvme_rdma_ctrl *ctrl, 966 965 bool remove) 967 966 { 968 - nvme_stop_admin_queue(&ctrl->ctrl); 967 + nvme_quiesce_admin_queue(&ctrl->ctrl); 969 968 
blk_sync_queue(ctrl->ctrl.admin_q); 970 969 nvme_rdma_stop_queue(&ctrl->queues[0]); 971 970 nvme_cancel_admin_tagset(&ctrl->ctrl); 972 971 if (remove) { 973 - nvme_start_admin_queue(&ctrl->ctrl); 972 + nvme_unquiesce_admin_queue(&ctrl->ctrl); 974 973 nvme_remove_admin_tag_set(&ctrl->ctrl); 975 974 } 976 975 nvme_rdma_destroy_admin_queue(ctrl); ··· 981 980 { 982 981 if (ctrl->ctrl.queue_count > 1) { 983 982 nvme_start_freeze(&ctrl->ctrl); 984 - nvme_stop_queues(&ctrl->ctrl); 983 + nvme_quiesce_io_queues(&ctrl->ctrl); 985 984 nvme_sync_io_queues(&ctrl->ctrl); 986 985 nvme_rdma_stop_io_queues(ctrl); 987 986 nvme_cancel_tagset(&ctrl->ctrl); 988 987 if (remove) { 989 - nvme_start_queues(&ctrl->ctrl); 988 + nvme_unquiesce_io_queues(&ctrl->ctrl); 990 989 nvme_remove_io_tag_set(&ctrl->ctrl); 991 990 } 992 991 nvme_rdma_free_io_queues(ctrl); ··· 1107 1106 1108 1107 destroy_io: 1109 1108 if (ctrl->ctrl.queue_count > 1) { 1110 - nvme_stop_queues(&ctrl->ctrl); 1109 + nvme_quiesce_io_queues(&ctrl->ctrl); 1111 1110 nvme_sync_io_queues(&ctrl->ctrl); 1112 1111 nvme_rdma_stop_io_queues(ctrl); 1113 1112 nvme_cancel_tagset(&ctrl->ctrl); ··· 1116 1115 nvme_rdma_free_io_queues(ctrl); 1117 1116 } 1118 1117 destroy_admin: 1119 - nvme_stop_admin_queue(&ctrl->ctrl); 1118 + nvme_quiesce_admin_queue(&ctrl->ctrl); 1120 1119 blk_sync_queue(ctrl->ctrl.admin_q); 1121 1120 nvme_rdma_stop_queue(&ctrl->queues[0]); 1122 1121 nvme_cancel_admin_tagset(&ctrl->ctrl); ··· 1154 1153 struct nvme_rdma_ctrl *ctrl = container_of(work, 1155 1154 struct nvme_rdma_ctrl, err_work); 1156 1155 1157 - nvme_auth_stop(&ctrl->ctrl); 1158 1156 nvme_stop_keep_alive(&ctrl->ctrl); 1159 1157 flush_work(&ctrl->ctrl.async_event_work); 1160 1158 nvme_rdma_teardown_io_queues(ctrl, false); 1161 - nvme_start_queues(&ctrl->ctrl); 1159 + nvme_unquiesce_io_queues(&ctrl->ctrl); 1162 1160 nvme_rdma_teardown_admin_queue(ctrl, false); 1163 - nvme_start_admin_queue(&ctrl->ctrl); 1161 + nvme_unquiesce_admin_queue(&ctrl->ctrl); 1162 + 
nvme_auth_stop(&ctrl->ctrl); 1164 1163 1165 1164 if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING)) { 1166 1165 /* state change failure is ok if we started ctrl delete */ ··· 2041 2040 if (ret) 2042 2041 goto unmap_qe; 2043 2042 2044 - blk_mq_start_request(rq); 2043 + nvme_start_request(rq); 2045 2044 2046 2045 if (IS_ENABLED(CONFIG_BLK_DEV_INTEGRITY) && 2047 2046 queue->pi_support && ··· 2208 2207 static void nvme_rdma_shutdown_ctrl(struct nvme_rdma_ctrl *ctrl, bool shutdown) 2209 2208 { 2210 2209 nvme_rdma_teardown_io_queues(ctrl, shutdown); 2211 - nvme_stop_admin_queue(&ctrl->ctrl); 2212 - if (shutdown) 2213 - nvme_shutdown_ctrl(&ctrl->ctrl); 2214 - else 2215 - nvme_disable_ctrl(&ctrl->ctrl); 2210 + nvme_quiesce_admin_queue(&ctrl->ctrl); 2211 + nvme_disable_ctrl(&ctrl->ctrl, shutdown); 2216 2212 nvme_rdma_teardown_admin_queue(ctrl, shutdown); 2217 2213 } 2218 2214
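The rdma.c shutdown path above (and the pci.c and tcp.c ones in this same series) folds the former `if (shutdown) nvme_shutdown_ctrl() else nvme_disable_ctrl()` pair into one nvme_disable_ctrl(ctrl, shutdown) call. A hedged userspace sketch of the register-level difference the flag selects, per the NVMe base spec's CC register layout; `disable_ctrl` and the plain `cc` value are illustrative stand-ins, not the driver's implementation:

```c
#include <assert.h>
#include <stdint.h>

/* NVMe Controller Configuration (CC) register fields (NVMe base spec):
 * EN is bit 0, SHN "normal shutdown" is 01b at bits 15:14. */
#define NVME_CC_ENABLE      (1u << 0)
#define NVME_CC_SHN_NORMAL  (1u << 14)

/*
 * Sketch of the consolidated disable helper: a safe shutdown requests
 * an orderly shutdown via CC.SHN, while a plain disable just clears
 * CC.EN without notifying the controller.
 */
static uint32_t disable_ctrl(uint32_t cc, int shutdown)
{
	if (shutdown)
		cc |= NVME_CC_SHN_NORMAL;   /* orderly shutdown notification */
	else
		cc &= ~NVME_CC_ENABLE;      /* just clear the enable bit */
	return cc;
}
```

Collapsing the two helpers into one keeps every caller's shutdown-vs-reset decision in a single bool instead of duplicated if/else blocks across pci, rdma, and tcp.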
+21 -24
drivers/nvme/host/tcp.c
··· 1867 1867 if (new) { 1868 1868 ret = nvme_alloc_io_tag_set(ctrl, &to_tcp_ctrl(ctrl)->tag_set, 1869 1869 &nvme_tcp_mq_ops, 1870 - BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_BLOCKING, 1870 + ctrl->opts->nr_poll_queues ? HCTX_MAX_TYPES : 2, 1871 1871 sizeof(struct nvme_tcp_request)); 1872 1872 if (ret) 1873 1873 goto out_free_io_queues; ··· 1884 1884 goto out_cleanup_connect_q; 1885 1885 1886 1886 if (!new) { 1887 - nvme_start_queues(ctrl); 1887 + nvme_unquiesce_io_queues(ctrl); 1888 1888 if (!nvme_wait_freeze_timeout(ctrl, NVME_IO_TIMEOUT)) { 1889 1889 /* 1890 1890 * If we timed out waiting for freeze we are likely to ··· 1911 1911 return 0; 1912 1912 1913 1913 out_wait_freeze_timed_out: 1914 - nvme_stop_queues(ctrl); 1914 + nvme_quiesce_io_queues(ctrl); 1915 1915 nvme_sync_io_queues(ctrl); 1916 1916 nvme_tcp_stop_io_queues(ctrl); 1917 1917 out_cleanup_connect_q: ··· 1942 1942 if (new) { 1943 1943 error = nvme_alloc_admin_tag_set(ctrl, 1944 1944 &to_tcp_ctrl(ctrl)->admin_tag_set, 1945 - &nvme_tcp_admin_mq_ops, BLK_MQ_F_BLOCKING, 1945 + &nvme_tcp_admin_mq_ops, 1946 1946 sizeof(struct nvme_tcp_request)); 1947 1947 if (error) 1948 1948 goto out_free_queue; ··· 1956 1956 if (error) 1957 1957 goto out_stop_queue; 1958 1958 1959 - nvme_start_admin_queue(ctrl); 1959 + nvme_unquiesce_admin_queue(ctrl); 1960 1960 1961 - error = nvme_init_ctrl_finish(ctrl); 1961 + error = nvme_init_ctrl_finish(ctrl, false); 1962 1962 if (error) 1963 1963 goto out_quiesce_queue; 1964 1964 1965 1965 return 0; 1966 1966 1967 1967 out_quiesce_queue: 1968 - nvme_stop_admin_queue(ctrl); 1968 + nvme_quiesce_admin_queue(ctrl); 1969 1969 blk_sync_queue(ctrl->admin_q); 1970 1970 out_stop_queue: 1971 1971 nvme_tcp_stop_queue(ctrl, 0); ··· 1981 1981 static void nvme_tcp_teardown_admin_queue(struct nvme_ctrl *ctrl, 1982 1982 bool remove) 1983 1983 { 1984 - nvme_stop_admin_queue(ctrl); 1984 + nvme_quiesce_admin_queue(ctrl); 1985 1985 blk_sync_queue(ctrl->admin_q); 1986 1986 nvme_tcp_stop_queue(ctrl, 0); 1987 1987 
nvme_cancel_admin_tagset(ctrl); 1988 1988 if (remove) 1989 - nvme_start_admin_queue(ctrl); 1989 + nvme_unquiesce_admin_queue(ctrl); 1990 1990 nvme_tcp_destroy_admin_queue(ctrl, remove); 1991 1991 } 1992 1992 ··· 1995 1995 { 1996 1996 if (ctrl->queue_count <= 1) 1997 1997 return; 1998 - nvme_stop_admin_queue(ctrl); 1998 + nvme_quiesce_admin_queue(ctrl); 1999 1999 nvme_start_freeze(ctrl); 2000 - nvme_stop_queues(ctrl); 2000 + nvme_quiesce_io_queues(ctrl); 2001 2001 nvme_sync_io_queues(ctrl); 2002 2002 nvme_tcp_stop_io_queues(ctrl); 2003 2003 nvme_cancel_tagset(ctrl); 2004 2004 if (remove) 2005 - nvme_start_queues(ctrl); 2005 + nvme_unquiesce_io_queues(ctrl); 2006 2006 nvme_tcp_destroy_io_queues(ctrl, remove); 2007 2007 } 2008 2008 ··· 2083 2083 2084 2084 destroy_io: 2085 2085 if (ctrl->queue_count > 1) { 2086 - nvme_stop_queues(ctrl); 2086 + nvme_quiesce_io_queues(ctrl); 2087 2087 nvme_sync_io_queues(ctrl); 2088 2088 nvme_tcp_stop_io_queues(ctrl); 2089 2089 nvme_cancel_tagset(ctrl); 2090 2090 nvme_tcp_destroy_io_queues(ctrl, new); 2091 2091 } 2092 2092 destroy_admin: 2093 - nvme_stop_admin_queue(ctrl); 2093 + nvme_quiesce_admin_queue(ctrl); 2094 2094 blk_sync_queue(ctrl->admin_q); 2095 2095 nvme_tcp_stop_queue(ctrl, 0); 2096 2096 nvme_cancel_admin_tagset(ctrl); ··· 2128 2128 struct nvme_tcp_ctrl, err_work); 2129 2129 struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl; 2130 2130 2131 - nvme_auth_stop(ctrl); 2132 2131 nvme_stop_keep_alive(ctrl); 2133 2132 flush_work(&ctrl->async_event_work); 2134 2133 nvme_tcp_teardown_io_queues(ctrl, false); 2135 2134 /* unquiesce to fail fast pending requests */ 2136 - nvme_start_queues(ctrl); 2135 + nvme_unquiesce_io_queues(ctrl); 2137 2136 nvme_tcp_teardown_admin_queue(ctrl, false); 2138 - nvme_start_admin_queue(ctrl); 2137 + nvme_unquiesce_admin_queue(ctrl); 2138 + nvme_auth_stop(ctrl); 2139 2139 2140 2140 if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_CONNECTING)) { 2141 2141 /* state change failure is ok if we started ctrl delete */ ··· 2150 
2150 static void nvme_tcp_teardown_ctrl(struct nvme_ctrl *ctrl, bool shutdown) 2151 2151 { 2152 2152 nvme_tcp_teardown_io_queues(ctrl, shutdown); 2153 - nvme_stop_admin_queue(ctrl); 2154 - if (shutdown) 2155 - nvme_shutdown_ctrl(ctrl); 2156 - else 2157 - nvme_disable_ctrl(ctrl); 2153 + nvme_quiesce_admin_queue(ctrl); 2154 + nvme_disable_ctrl(ctrl, shutdown); 2158 2155 nvme_tcp_teardown_admin_queue(ctrl, shutdown); 2159 2156 } 2160 2157 ··· 2411 2414 if (unlikely(ret)) 2412 2415 return ret; 2413 2416 2414 - blk_mq_start_request(rq); 2417 + nvme_start_request(rq); 2415 2418 2416 2419 nvme_tcp_queue_request(req, true, bd->last); 2417 2420 ··· 2520 2523 static const struct nvme_ctrl_ops nvme_tcp_ctrl_ops = { 2521 2524 .name = "tcp", 2522 2525 .module = THIS_MODULE, 2523 - .flags = NVME_F_FABRICS, 2526 + .flags = NVME_F_FABRICS | NVME_F_BLOCKING, 2524 2527 .reg_read32 = nvmf_reg_read32, 2525 2528 .reg_read64 = nvmf_reg_read64, 2526 2529 .reg_write32 = nvmf_reg_write32,
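The tag-set allocation changes above pass an explicit queue-map count instead of BLK_MQ_F_* flags: rdma and tcp pass HCTX_MAX_TYPES when poll queues are requested and 2 otherwise, while pci derives it from the configured queue types (nvme_pci_nr_maps() earlier in this series). A small sketch of that selection; the function name and parameters here are illustrative:

```c
#include <assert.h>

/*
 * Illustrative re-statement of nvme_pci_nr_maps() from this series:
 * the number of blk-mq queue maps depends on which optional queue
 * types (read, poll) were configured.
 */
static unsigned int nr_maps(unsigned int read_queues, unsigned int poll_queues)
{
	if (poll_queues)
		return 3;	/* default + read + poll maps */
	if (read_queues)
		return 2;	/* default + read maps */
	return 1;		/* default map only */
}
```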
drivers/nvme/target/admin-cmd.c | +4 -7
 	memcpy_and_pad(id->mn, sizeof(id->mn), subsys->model_number,
 			strlen(subsys->model_number), ' ');
 	memcpy_and_pad(id->fr, sizeof(id->fr),
-			UTS_RELEASE, strlen(UTS_RELEASE), ' ');
+			subsys->firmware_rev, strlen(subsys->firmware_rev), ' ');
+
+	put_unaligned_le24(subsys->ieee_oui, id->ieee);

 	id->rab = 6;

···
 		id->cntrltype = NVME_CTRL_DISC;
 	else
 		id->cntrltype = NVME_CTRL_IO;
-
-	/*
-	 * XXX: figure out how we can assign a IEEE OUI, but until then
-	 * the safest is to leave it as zeroes.
-	 */

 	/* we support multiple ports, multiples hosts and ANA: */
 	id->cmic = NVME_CTRL_CMIC_MULTI_PORT | NVME_CTRL_CMIC_MULTI_CTRL |
···
 	}

 	if (req->ns->readonly)
-		id->nsattr |= (1 << 0);
+		id->nsattr |= NVME_NS_ATTR_RO;
 done:
 	if (!status)
 		status = nvmet_copy_to_sgl(req, 0, id, sizeof(*id));
drivers/nvme/target/configfs.c | +129 -9
  * Copyright (c) 2015-2016 HGST, a Western Digital Company.
  */
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/kstrtox.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/slab.h>
···
 	struct nvmet_port *port = to_nvmet_port(item);
 	bool val;

-	if (strtobool(page, &val))
+	if (kstrtobool(page, &val))
 		return -EINVAL;

 	if (nvmet_is_port_enabled(port, __func__))
···
 	bool enable;
 	int ret = 0;

-	if (strtobool(page, &enable))
+	if (kstrtobool(page, &enable))
 		return -EINVAL;

 	if (enable)
···
 	struct nvmet_ns *ns = to_nvmet_ns(item);
 	bool val;

-	if (strtobool(page, &val))
+	if (kstrtobool(page, &val))
 		return -EINVAL;

 	mutex_lock(&ns->subsys->lock);
···
 	struct nvmet_ns *ns = to_nvmet_ns(item);
 	bool val;

-	if (strtobool(page, &val))
+	if (kstrtobool(page, &val))
 		return -EINVAL;

 	if (!val)
···
 	bool enable;
 	int ret = 0;

-	if (strtobool(page, &enable))
+	if (kstrtobool(page, &enable))
 		return -EINVAL;

 	if (enable)
···
 	bool allow_any_host;
 	int ret = 0;

-	if (strtobool(page, &allow_any_host))
+	if (kstrtobool(page, &allow_any_host))
 		return -EINVAL;

 	down_write(&nvmet_config_sem);
···
 }
 CONFIGFS_ATTR(nvmet_subsys_, attr_model);

+static ssize_t nvmet_subsys_attr_ieee_oui_show(struct config_item *item,
+					       char *page)
+{
+	struct nvmet_subsys *subsys = to_subsys(item);
+
+	return sysfs_emit(page, "0x%06x\n", subsys->ieee_oui);
+}
+
+static ssize_t nvmet_subsys_attr_ieee_oui_store_locked(struct nvmet_subsys *subsys,
+		const char *page, size_t count)
+{
+	uint32_t val = 0;
+	int ret;
+
+	if (subsys->subsys_discovered) {
+		pr_err("Can't set IEEE OUI. 0x%06x is already assigned\n",
+		       subsys->ieee_oui);
+		return -EINVAL;
+	}
+
+	ret = kstrtou32(page, 0, &val);
+	if (ret < 0)
+		return ret;
+
+	if (val >= 0x1000000)
+		return -EINVAL;
+
+	subsys->ieee_oui = val;
+
+	return count;
+}
+
+static ssize_t nvmet_subsys_attr_ieee_oui_store(struct config_item *item,
+						const char *page, size_t count)
+{
+	struct nvmet_subsys *subsys = to_subsys(item);
+	ssize_t ret;
+
+	down_write(&nvmet_config_sem);
+	mutex_lock(&subsys->lock);
+	ret = nvmet_subsys_attr_ieee_oui_store_locked(subsys, page, count);
+	mutex_unlock(&subsys->lock);
+	up_write(&nvmet_config_sem);
+
+	return ret;
+}
+CONFIGFS_ATTR(nvmet_subsys_, attr_ieee_oui);
+
+static ssize_t nvmet_subsys_attr_firmware_show(struct config_item *item,
+					       char *page)
+{
+	struct nvmet_subsys *subsys = to_subsys(item);
+
+	return sysfs_emit(page, "%s\n", subsys->firmware_rev);
+}
+
+static ssize_t nvmet_subsys_attr_firmware_store_locked(struct nvmet_subsys *subsys,
+		const char *page, size_t count)
+{
+	int pos = 0, len;
+	char *val;
+
+	if (subsys->subsys_discovered) {
+		pr_err("Can't set firmware revision. %s is already assigned\n",
+		       subsys->firmware_rev);
+		return -EINVAL;
+	}
+
+	len = strcspn(page, "\n");
+	if (!len)
+		return -EINVAL;
+
+	if (len > NVMET_FR_MAX_SIZE) {
+		pr_err("Firmware revision size can not exceed %d Bytes\n",
+		       NVMET_FR_MAX_SIZE);
+		return -EINVAL;
+	}
+
+	for (pos = 0; pos < len; pos++) {
+		if (!nvmet_is_ascii(page[pos]))
+			return -EINVAL;
+	}
+
+	val = kmemdup_nul(page, len, GFP_KERNEL);
+	if (!val)
+		return -ENOMEM;
+
+	kfree(subsys->firmware_rev);
+
+	subsys->firmware_rev = val;
+
+	return count;
+}
+
+static ssize_t nvmet_subsys_attr_firmware_store(struct config_item *item,
+						const char *page, size_t count)
+{
+	struct nvmet_subsys *subsys = to_subsys(item);
+	ssize_t ret;
+
+	down_write(&nvmet_config_sem);
+	mutex_lock(&subsys->lock);
+	ret = nvmet_subsys_attr_firmware_store_locked(subsys, page, count);
+	mutex_unlock(&subsys->lock);
+	up_write(&nvmet_config_sem);
+
+	return ret;
+}
+CONFIGFS_ATTR(nvmet_subsys_, attr_firmware);
+
 #ifdef CONFIG_BLK_DEV_INTEGRITY
 static ssize_t nvmet_subsys_attr_pi_enable_show(struct config_item *item,
 						char *page)
···
 	struct nvmet_subsys *subsys = to_subsys(item);
 	bool pi_enable;

-	if (strtobool(page, &pi_enable))
+	if (kstrtobool(page, &pi_enable))
 		return -EINVAL;

 	subsys->pi_support = pi_enable;
···
 static ssize_t nvmet_subsys_attr_qid_max_store(struct config_item *item,
 					       const char *page, size_t cnt)
 {
+	struct nvmet_subsys *subsys = to_subsys(item);
+	struct nvmet_ctrl *ctrl;
 	u16 qid_max;

 	if (sscanf(page, "%hu\n", &qid_max) != 1)
···
 		return -EINVAL;

 	down_write(&nvmet_config_sem);
-	to_subsys(item)->max_qid = qid_max;
+	subsys->max_qid = qid_max;
+
+	/* Force reconnect */
+	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
+		ctrl->ops->delete_ctrl(ctrl);
 	up_write(&nvmet_config_sem);
+
 	return cnt;
 }
 CONFIGFS_ATTR(nvmet_subsys_, attr_qid_max);
···
 	&nvmet_subsys_attr_attr_cntlid_max,
 	&nvmet_subsys_attr_attr_model,
 	&nvmet_subsys_attr_attr_qid_max,
+	&nvmet_subsys_attr_attr_ieee_oui,
+	&nvmet_subsys_attr_attr_firmware,
 #ifdef CONFIG_BLK_DEV_INTEGRITY
 	&nvmet_subsys_attr_attr_pi_enable,
 #endif
···
 	struct nvmet_port *port = to_nvmet_port(item);
 	bool enable;

-	if (strtobool(page, &enable))
+	if (kstrtobool(page, &enable))
 		goto inval;

 	if (enable)
drivers/nvme/target/core.c | +31 -13
 #include <linux/pci-p2pdma.h>
 #include <linux/scatterlist.h>

+#include <generated/utsrelease.h>
+
 #define CREATE_TRACE_POINTS
 #include "trace.h"

 #include "nvmet.h"

+struct kmem_cache *nvmet_bvec_cache;
 struct workqueue_struct *buffered_io_wq;
 struct workqueue_struct *zbd_wq;
 static const struct nvmet_fabrics_ops *nvmet_transports[NVMF_TRTYPE_MAX];
···
 	if (req->sq->size) {
 		u32 old_sqhd, new_sqhd;

+		old_sqhd = READ_ONCE(req->sq->sqhd);
 		do {
-			old_sqhd = req->sq->sqhd;
 			new_sqhd = (old_sqhd + 1) % req->sq->size;
-		} while (cmpxchg(&req->sq->sqhd, old_sqhd, new_sqhd) !=
-			 old_sqhd);
+		} while (!try_cmpxchg(&req->sq->sqhd, &old_sqhd, new_sqhd));
 	}
 	req->cqe->sq_head = cpu_to_le16(req->sq->sqhd & 0x0000FFFF);
 }
···
 		goto free_subsys;
 	}

+	subsys->ieee_oui = 0;
+
+	subsys->firmware_rev = kstrndup(UTS_RELEASE, NVMET_FR_MAX_SIZE, GFP_KERNEL);
+	if (!subsys->firmware_rev) {
+		ret = -ENOMEM;
+		goto free_mn;
+	}
+
 	switch (type) {
 	case NVME_NQN_NVME:
 		subsys->max_qid = NVMET_NR_QUEUES;
···
 	default:
 		pr_err("%s: Unknown Subsystem type - %d\n", __func__, type);
 		ret = -EINVAL;
-		goto free_mn;
+		goto free_fr;
 	}
 	subsys->type = type;
 	subsys->subsysnqn = kstrndup(subsysnqn, NVMF_NQN_SIZE,
 			GFP_KERNEL);
 	if (!subsys->subsysnqn) {
 		ret = -ENOMEM;
-		goto free_mn;
+		goto free_fr;
 	}
 	subsys->cntlid_min = NVME_CNTLID_MIN;
 	subsys->cntlid_max = NVME_CNTLID_MAX;
···

 	return subsys;

+free_fr:
+	kfree(subsys->firmware_rev);
 free_mn:
 	kfree(subsys->model_number);
 free_subsys:
···

 	kfree(subsys->subsysnqn);
 	kfree(subsys->model_number);
+	kfree(subsys->firmware_rev);
 	kfree(subsys);
 }
···

 static int __init nvmet_init(void)
 {
-	int error;
+	int error = -ENOMEM;

 	nvmet_ana_group_enabled[NVMET_DEFAULT_ANA_GRPID] = 1;

+	nvmet_bvec_cache = kmem_cache_create("nvmet-bvec",
+			NVMET_MAX_MPOOL_BVEC * sizeof(struct bio_vec), 0,
+			SLAB_HWCACHE_ALIGN, NULL);
+	if (!nvmet_bvec_cache)
+		return -ENOMEM;
+
 	zbd_wq = alloc_workqueue("nvmet-zbd-wq", WQ_MEM_RECLAIM, 0);
 	if (!zbd_wq)
-		return -ENOMEM;
+		goto out_destroy_bvec_cache;

 	buffered_io_wq = alloc_workqueue("nvmet-buffered-io-wq",
 			WQ_MEM_RECLAIM, 0);
-	if (!buffered_io_wq) {
-		error = -ENOMEM;
+	if (!buffered_io_wq)
 		goto out_free_zbd_work_queue;
-	}

 	nvmet_wq = alloc_workqueue("nvmet-wq", WQ_MEM_RECLAIM, 0);
-	if (!nvmet_wq) {
-		error = -ENOMEM;
+	if (!nvmet_wq)
 		goto out_free_buffered_work_queue;
-	}

 	error = nvmet_init_discovery();
 	if (error)
···
 	destroy_workqueue(buffered_io_wq);
 out_free_zbd_work_queue:
 	destroy_workqueue(zbd_wq);
+out_destroy_bvec_cache:
+	kmem_cache_destroy(nvmet_bvec_cache);
 	return error;
 }
···
 	destroy_workqueue(nvmet_wq);
 	destroy_workqueue(buffered_io_wq);
 	destroy_workqueue(zbd_wq);
+	kmem_cache_destroy(nvmet_bvec_cache);

 	BUILD_BUG_ON(sizeof(struct nvmf_disc_rsp_page_entry) != 1024);
 	BUILD_BUG_ON(sizeof(struct nvmf_disc_rsp_page_hdr) != 1024);
drivers/nvme/target/io-cmd-file.c | +3 -13
 #include <linux/fs.h>
 #include "nvmet.h"

-#define NVMET_MAX_MPOOL_BVEC		16
 #define NVMET_MIN_MPOOL_OBJ		16

 void nvmet_file_ns_revalidate(struct nvmet_ns *ns)
···
 	flush_workqueue(buffered_io_wq);
 	mempool_destroy(ns->bvec_pool);
 	ns->bvec_pool = NULL;
-	kmem_cache_destroy(ns->bvec_cache);
-	ns->bvec_cache = NULL;
 	fput(ns->file);
 	ns->file = NULL;
 }
···
 	ns->blksize_shift = min_t(u8,
 			file_inode(ns->file)->i_blkbits, 12);

-	ns->bvec_cache = kmem_cache_create("nvmet-bvec",
-			NVMET_MAX_MPOOL_BVEC * sizeof(struct bio_vec),
-			0, SLAB_HWCACHE_ALIGN, NULL);
-	if (!ns->bvec_cache) {
-		ret = -ENOMEM;
-		goto err;
-	}
-
 	ns->bvec_pool = mempool_create(NVMET_MIN_MPOOL_OBJ, mempool_alloc_slab,
-			mempool_free_slab, ns->bvec_cache);
+			mempool_free_slab, nvmet_bvec_cache);

 	if (!ns->bvec_pool) {
 		ret = -ENOMEM;
···

 	return ret;
 err:
+	fput(ns->file);
+	ns->file = NULL;
 	ns->size = 0;
 	ns->blksize_shift = 0;
-	nvmet_file_ns_disable(ns);
 	return ret;
 }
drivers/nvme/target/loop.c | +8 -8
 	if (ret)
 		return ret;

-	blk_mq_start_request(req);
+	nvme_start_request(req);
 	iod->cmd.common.flags |= NVME_CMD_SGL_METABUF;
 	iod->req.port = queue->ctrl->port;
 	if (!nvmet_req_init(&iod->req, &queue->nvme_cq,
···
 	ctrl->ctrl.queue_count = 1;

 	error = nvme_alloc_admin_tag_set(&ctrl->ctrl, &ctrl->admin_tag_set,
-			&nvme_loop_admin_mq_ops, BLK_MQ_F_NO_SCHED,
+			&nvme_loop_admin_mq_ops,
 			sizeof(struct nvme_loop_iod) +
 			NVME_INLINE_SG_CNT * sizeof(struct scatterlist));
 	if (error)
···
 	ctrl->ctrl.max_hw_sectors =
 		(NVME_LOOP_MAX_SEGMENTS - 1) << (PAGE_SHIFT - 9);

-	nvme_start_admin_queue(&ctrl->ctrl);
+	nvme_unquiesce_admin_queue(&ctrl->ctrl);

-	error = nvme_init_ctrl_finish(&ctrl->ctrl);
+	error = nvme_init_ctrl_finish(&ctrl->ctrl, false);
 	if (error)
 		goto out_cleanup_tagset;
···
 static void nvme_loop_shutdown_ctrl(struct nvme_loop_ctrl *ctrl)
 {
 	if (ctrl->ctrl.queue_count > 1) {
-		nvme_stop_queues(&ctrl->ctrl);
+		nvme_quiesce_io_queues(&ctrl->ctrl);
 		nvme_cancel_tagset(&ctrl->ctrl);
 		nvme_loop_destroy_io_queues(ctrl);
 	}

-	nvme_stop_admin_queue(&ctrl->ctrl);
+	nvme_quiesce_admin_queue(&ctrl->ctrl);
 	if (ctrl->ctrl.state == NVME_CTRL_LIVE)
-		nvme_shutdown_ctrl(&ctrl->ctrl);
+		nvme_disable_ctrl(&ctrl->ctrl, true);

 	nvme_cancel_admin_tagset(&ctrl->ctrl);
 	nvme_loop_destroy_admin_queue(ctrl);
···
 		return ret;

 	ret = nvme_alloc_io_tag_set(&ctrl->ctrl, &ctrl->tag_set,
-			&nvme_loop_mq_ops, BLK_MQ_F_SHOULD_MERGE,
+			&nvme_loop_mq_ops, 1,
 			sizeof(struct nvme_loop_iod) +
 			NVME_INLINE_SG_CNT * sizeof(struct scatterlist));
 	if (ret)
drivers/nvme/target/nvmet.h | +5 -1
 #define NVMET_DEFAULT_CTRL_MODEL	"Linux"
 #define NVMET_MN_MAX_SIZE		40
 #define NVMET_SN_MAX_SIZE		20
+#define NVMET_FR_MAX_SIZE		8

 /*
  * Supported optional AENs:
···

 	struct completion	disable_done;
 	mempool_t		*bvec_pool;
-	struct kmem_cache	*bvec_cache;

 	int			use_p2pmem;
 	struct pci_dev		*p2p_dev;
···
 	struct config_group	allowed_hosts_group;

 	char			*model_number;
+	u32			ieee_oui;
+	char			*firmware_rev;

 #ifdef CONFIG_NVME_TARGET_PASSTHRU
 	struct nvme_ctrl	*passthru_ctrl;
···
 	u64			error_slba;
 };

+#define NVMET_MAX_MPOOL_BVEC	16
+extern struct kmem_cache *nvmet_bvec_cache;
 extern struct workqueue_struct *buffered_io_wq;
 extern struct workqueue_struct *zbd_wq;
 extern struct workqueue_struct *nvmet_wq;
drivers/pci/p2pdma.c | +124
 }
 static DEVICE_ATTR_RO(published);

+static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
+		struct bin_attribute *attr, struct vm_area_struct *vma)
+{
+	struct pci_dev *pdev = to_pci_dev(kobj_to_dev(kobj));
+	size_t len = vma->vm_end - vma->vm_start;
+	struct pci_p2pdma *p2pdma;
+	struct percpu_ref *ref;
+	unsigned long vaddr;
+	void *kaddr;
+	int ret;
+
+	/* prevent private mappings from being established */
+	if ((vma->vm_flags & VM_MAYSHARE) != VM_MAYSHARE) {
+		pci_info_ratelimited(pdev,
+				     "%s: fail, attempted private mapping\n",
+				     current->comm);
+		return -EINVAL;
+	}
+
+	if (vma->vm_pgoff) {
+		pci_info_ratelimited(pdev,
+				     "%s: fail, attempted mapping with non-zero offset\n",
+				     current->comm);
+		return -EINVAL;
+	}
+
+	rcu_read_lock();
+	p2pdma = rcu_dereference(pdev->p2pdma);
+	if (!p2pdma) {
+		ret = -ENODEV;
+		goto out;
+	}
+
+	kaddr = (void *)gen_pool_alloc_owner(p2pdma->pool, len, (void **)&ref);
+	if (!kaddr) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	/*
+	 * vm_insert_page() can sleep, so a reference is taken to mapping
+	 * such that rcu_read_unlock() can be done before inserting the
+	 * pages
+	 */
+	if (unlikely(!percpu_ref_tryget_live_rcu(ref))) {
+		ret = -ENODEV;
+		goto out_free_mem;
+	}
+	rcu_read_unlock();
+
+	for (vaddr = vma->vm_start; vaddr < vma->vm_end; vaddr += PAGE_SIZE) {
+		ret = vm_insert_page(vma, vaddr, virt_to_page(kaddr));
+		if (ret) {
+			gen_pool_free(p2pdma->pool, (uintptr_t)kaddr, len);
+			return ret;
+		}
+		percpu_ref_get(ref);
+		put_page(virt_to_page(kaddr));
+		kaddr += PAGE_SIZE;
+		len -= PAGE_SIZE;
+	}
+
+	percpu_ref_put(ref);
+
+	return 0;
+out_free_mem:
+	gen_pool_free(p2pdma->pool, (uintptr_t)kaddr, len);
+out:
+	rcu_read_unlock();
+	return ret;
+}
+
+static struct bin_attribute p2pmem_alloc_attr = {
+	.attr = { .name = "allocate", .mode = 0660 },
+	.mmap = p2pmem_alloc_mmap,
+	/*
+	 * Some places where we want to call mmap (ie. python) will check
+	 * that the file size is greater than the mmap size before allowing
+	 * the mmap to continue. To work around this, just set the size
+	 * to be very large.
+	 */
+	.size = SZ_1T,
+};
+
 static struct attribute *p2pmem_attrs[] = {
 	&dev_attr_size.attr,
 	&dev_attr_available.attr,
···
 	NULL,
 };

+static struct bin_attribute *p2pmem_bin_attrs[] = {
+	&p2pmem_alloc_attr,
+	NULL,
+};
+
 static const struct attribute_group p2pmem_group = {
 	.attrs = p2pmem_attrs,
+	.bin_attrs = p2pmem_bin_attrs,
 	.name = "p2pmem",
 };

+static void p2pdma_page_free(struct page *page)
+{
+	struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page->pgmap);
+	struct percpu_ref *ref;
+
+	gen_pool_free_owner(pgmap->provider->p2pdma->pool,
+			    (uintptr_t)page_to_virt(page), PAGE_SIZE,
+			    (void **)&ref);
+	percpu_ref_put(ref);
+}
+
+static const struct dev_pagemap_ops p2pdma_pgmap_ops = {
+	.page_free = p2pdma_page_free,
+};
+
 static void pci_p2pdma_release(void *data)
···
 	return error;
 }

+static void pci_p2pdma_unmap_mappings(void *data)
+{
+	struct pci_dev *pdev = data;
+
+	/*
+	 * Removing the alloc attribute from sysfs will call
+	 * unmap_mapping_range() on the inode, teardown any existing userspace
+	 * mappings and prevent new ones from being created.
+	 */
+	sysfs_remove_file_from_group(&pdev->dev.kobj, &p2pmem_alloc_attr.attr,
+				     p2pmem_group.name);
+}
+
 /**
  * pci_p2pdma_add_resource - add memory for use as p2p memory
  * @pdev: the device to add the memory to
···
 	pgmap->range.end = pgmap->range.start + size - 1;
 	pgmap->nr_range = 1;
 	pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
+	pgmap->ops = &p2pdma_pgmap_ops;

 	p2p_pgmap->provider = pdev;
 	p2p_pgmap->bus_offset = pci_bus_address(pdev, bar) -
···
 		error = PTR_ERR(addr);
 		goto pgmap_free;
 	}
+
+	error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_unmap_mappings,
+					 pdev);
+	if (error)
+		goto pages_free;

 	p2pdma = rcu_dereference_protected(pdev->p2pdma, 1);
 	error = gen_pool_add_owner(p2pdma->pool, (unsigned long)addr,
drivers/scsi/scsi_lib.c | +1 -1
 		blk_mq_quiesce_queue(sdev->request_queue);
 	} else {
 		if (!nowait)
-			blk_mq_wait_quiesce_done(sdev->request_queue);
+			blk_mq_wait_quiesce_done(sdev->request_queue->tag_set);
 	}
 }
drivers/scsi/scsi_scan.c | -1
 	sdev->request_queue = q;
 	q->queuedata = sdev;
 	__scsi_init_queue(sdev->host, q);
-	WARN_ON_ONCE(!blk_get_queue(q));

 	depth = sdev->host->cmd_per_lun ?: 1;
drivers/ufs/core/ufshcd.c | +2
 	ufshpb_remove(hba);
 	ufs_sysfs_remove_nodes(hba->dev);
 	blk_mq_destroy_queue(hba->tmf_queue);
+	blk_put_queue(hba->tmf_queue);
 	blk_mq_free_tag_set(&hba->tmf_tag_set);
 	scsi_remove_host(hba->host);
 	/* disable interrupts */
···

 free_tmf_queue:
 	blk_mq_destroy_queue(hba->tmf_queue);
+	blk_put_queue(hba->tmf_queue);
 free_tmf_tag_set:
 	blk_mq_free_tag_set(&hba->tmf_tag_set);
 out_remove_scsi_host:
fs/crypto/inline_crypt.c | +5 -9
  * provides the key and IV to use.
  */

-#include <linux/blk-crypto-profile.h>
+#include <linux/blk-crypto.h>
 #include <linux/blkdev.h>
 #include <linux/buffer_head.h>
 #include <linux/sched/mm.h>
···
 	unsigned int i;

 	for (i = 0; i < num_devs; i++) {
-		struct request_queue *q = bdev_get_queue(devs[i]);
-
 		if (!IS_ENABLED(CONFIG_BLK_INLINE_ENCRYPTION_FALLBACK) ||
-		    __blk_crypto_cfg_supported(q->crypto_profile, cfg)) {
+		    blk_crypto_config_supported_natively(devs[i], cfg)) {
 			if (!xchg(&mode->logged_blk_crypto_native, 1))
 				pr_info("fscrypt: %s using blk-crypto (native)\n",
 					mode->friendly_name);
···
 		return PTR_ERR(devs);

 	for (i = 0; i < num_devs; i++) {
-		if (!blk_crypto_config_supported(bdev_get_queue(devs[i]),
-						 &crypto_cfg))
+		if (!blk_crypto_config_supported(devs[i], &crypto_cfg))
 			goto out_free_devs;
 	}

···
 		goto fail;
 	}
 	for (i = 0; i < num_devs; i++) {
-		err = blk_crypto_start_using_key(blk_key,
-						 bdev_get_queue(devs[i]));
+		err = blk_crypto_start_using_key(devs[i], blk_key);
 		if (err)
 			break;
 	}
···
 	devs = fscrypt_get_devices(sb, &num_devs);
 	if (!IS_ERR(devs)) {
 		for (i = 0; i < num_devs; i++)
-			blk_crypto_evict_key(bdev_get_queue(devs[i]), blk_key);
+			blk_crypto_evict_key(devs[i], blk_key);
 		kfree(devs);
 	}
 	kfree_sensitive(blk_key);
include/linux/bio.h | -2
 extern void bio_set_pages_dirty(struct bio *bio);
 extern void bio_check_pages_dirty(struct bio *bio);

-extern void bio_copy_data_iter(struct bio *dst, struct bvec_iter *dst_iter,
-			       struct bio *src, struct bvec_iter *src_iter);
 extern void bio_copy_data(struct bio *dst, struct bio *src);
 extern void bio_free_pages(struct bio *bio);
 void guard_bio_eod(struct bio *bio);
include/linux/blk-crypto-profile.h | -12

 unsigned int blk_crypto_keyslot_index(struct blk_crypto_keyslot *slot);

-blk_status_t blk_crypto_get_keyslot(struct blk_crypto_profile *profile,
-				    const struct blk_crypto_key *key,
-				    struct blk_crypto_keyslot **slot_ptr);
-
-void blk_crypto_put_keyslot(struct blk_crypto_keyslot *slot);
-
-bool __blk_crypto_cfg_supported(struct blk_crypto_profile *profile,
-				const struct blk_crypto_config *cfg);
-
-int __blk_crypto_evict_key(struct blk_crypto_profile *profile,
-			   const struct blk_crypto_key *key);
-
 void blk_crypto_reprogram_all_keys(struct blk_crypto_profile *profile);

 void blk_crypto_profile_destroy(struct blk_crypto_profile *profile);
include/linux/blk-crypto.h | +6 -7
 #include <linux/blk_types.h>
 #include <linux/blkdev.h>

-struct request;
-struct request_queue;
-
 #ifdef CONFIG_BLK_INLINE_ENCRYPTION

 static inline bool bio_has_crypt_ctx(struct bio *bio)
···
 			unsigned int dun_bytes,
 			unsigned int data_unit_size);

-int blk_crypto_start_using_key(const struct blk_crypto_key *key,
-			       struct request_queue *q);
+int blk_crypto_start_using_key(struct block_device *bdev,
+			       const struct blk_crypto_key *key);

-int blk_crypto_evict_key(struct request_queue *q,
+int blk_crypto_evict_key(struct block_device *bdev,
 			 const struct blk_crypto_key *key);

-bool blk_crypto_config_supported(struct request_queue *q,
+bool blk_crypto_config_supported_natively(struct block_device *bdev,
+					  const struct blk_crypto_config *cfg);
+bool blk_crypto_config_supported(struct block_device *bdev,
 				 const struct blk_crypto_config *cfg);

 #else /* CONFIG_BLK_INLINE_ENCRYPTION */
include/linux/blk-mq.h | +7 -2
 #include <linux/lockdep.h>
 #include <linux/scatterlist.h>
 #include <linux/prefetch.h>
+#include <linux/srcu.h>

 struct blk_mq_tags;
 struct blk_flush_queue;
···
 	struct blk_crypto_keyslot *crypt_keyslot;
 #endif

-	unsigned short write_hint;
 	unsigned short ioprio;

 	enum mq_rq_state state;
···
  * @tag_list_lock: Serializes tag_list accesses.
  * @tag_list:	   List of the request queues that use this tag set. See also
  *		   request_queue.tag_set_list.
+ * @srcu:	   Use as lock when type of the request queue is blocking
+ *		   (BLK_MQ_F_BLOCKING).
  */
 struct blk_mq_tag_set {
 	struct blk_mq_queue_map	map[HCTX_MAX_TYPES];
···

 	struct mutex		tag_list_lock;
 	struct list_head	tag_list;
+	struct srcu_struct	*srcu;
 };

 /**
···
 void blk_mq_start_stopped_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
 void blk_mq_start_stopped_hw_queues(struct request_queue *q, bool async);
 void blk_mq_quiesce_queue(struct request_queue *q);
-void blk_mq_wait_quiesce_done(struct request_queue *q);
+void blk_mq_wait_quiesce_done(struct blk_mq_tag_set *set);
+void blk_mq_quiesce_tagset(struct blk_mq_tag_set *set);
+void blk_mq_unquiesce_tagset(struct blk_mq_tag_set *set);
 void blk_mq_unquiesce_queue(struct request_queue *q);
 void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs);
 void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
include/linux/blk_types.h | -7
 	return bio->bi_opf & REQ_OP_MASK;
 }

-/* obsolete, don't use in new code */
-static inline void bio_set_op_attrs(struct bio *bio, enum req_op op,
-				    blk_opf_t op_flags)
-{
-	bio->bi_opf = op | op_flags;
-}
-
 static inline bool op_is_write(blk_opf_t op)
 {
 	return !!(op & (__force blk_opf_t)1);
include/linux/blkdev.h | +6 -26
 #include <linux/blkzoned.h>
 #include <linux/sched.h>
 #include <linux/sbitmap.h>
-#include <linux/srcu.h>
 #include <linux/uuid.h>
 #include <linux/xarray.h>
···
 	unsigned open_partitions;	/* number of open partitions */

 	struct backing_dev_info	*bdi;
+	struct kobject queue_kobj;	/* the queue/ directory */
 	struct kobject *slave_dir;
 #ifdef CONFIG_BLOCK_HOLDER_DEPRECATED
 	struct list_head slave_bdevs;
···

 	struct gendisk		*disk;

-	/*
-	 * queue kobject
-	 */
-	struct kobject kobj;
+	refcount_t		refs;

 	/*
 	 * mq queue kobject
···
 	struct mutex		debugfs_mutex;

 	bool			mq_sysfs_init_done;
-
-	/**
-	 * @srcu: Sleepable RCU. Use as lock when type of the request queue
-	 * is blocking (BLK_MQ_F_BLOCKING). Must be the last member
-	 */
-	struct srcu_struct	srcu[];
 };

 /* Keep blk_queue_flag_name[] in sync with the definitions below */
 #define QUEUE_FLAG_STOPPED	0	/* queue is stopped */
 #define QUEUE_FLAG_DYING	1	/* queue being torn down */
-#define QUEUE_FLAG_HAS_SRCU	2	/* SRCU is allocated */
 #define QUEUE_FLAG_NOMERGES	3	/* disable merge attempts */
 #define QUEUE_FLAG_SAME_COMP	4	/* complete on same CPU-group */
 #define QUEUE_FLAG_FAIL_IO	5	/* fake timeout */
···
 #define QUEUE_FLAG_HCTX_ACTIVE	28	/* at least one blk-mq hctx is active */
 #define QUEUE_FLAG_NOWAIT	29	/* device supports NOWAIT */
 #define QUEUE_FLAG_SQ_SCHED	30	/* single queue style io dispatch */
+#define QUEUE_FLAG_SKIP_TAGSET_QUIESCE	31 /* quiesce_tagset skip the queue*/

 #define QUEUE_FLAG_MQ_DEFAULT	((1UL << QUEUE_FLAG_IO_STAT) |	\
 				 (1UL << QUEUE_FLAG_SAME_COMP) |	\
···

 #define blk_queue_stopped(q)	test_bit(QUEUE_FLAG_STOPPED, &(q)->queue_flags)
 #define blk_queue_dying(q)	test_bit(QUEUE_FLAG_DYING, &(q)->queue_flags)
-#define blk_queue_has_srcu(q)	test_bit(QUEUE_FLAG_HAS_SRCU, &(q)->queue_flags)
 #define blk_queue_init_done(q)	test_bit(QUEUE_FLAG_INIT_DONE, &(q)->queue_flags)
 #define blk_queue_nomerges(q)	test_bit(QUEUE_FLAG_NOMERGES, &(q)->queue_flags)
 #define blk_queue_noxmerges(q)	\
···
 #define blk_queue_pm_only(q)	atomic_read(&(q)->pm_only)
 #define blk_queue_registered(q)	test_bit(QUEUE_FLAG_REGISTERED, &(q)->queue_flags)
 #define blk_queue_sq_sched(q)	test_bit(QUEUE_FLAG_SQ_SCHED, &(q)->queue_flags)
+#define blk_queue_skip_tagset_quiesce(q) \
+	test_bit(QUEUE_FLAG_SKIP_TAGSET_QUIESCE, &(q)->queue_flags)

 extern void blk_set_pm_only(struct request_queue *q);
 extern void blk_clear_pm_only(struct request_queue *q);
···
 #ifdef CONFIG_BLOCK_HOLDER_DEPRECATED
 int bd_link_disk_holder(struct block_device *bdev, struct gendisk *disk);
 void bd_unlink_disk_holder(struct block_device *bdev, struct gendisk *disk);
-int bd_register_pending_holders(struct gendisk *disk);
 #else
 static inline int bd_link_disk_holder(struct block_device *bdev,
 				      struct gendisk *disk)
···
 static inline void bd_unlink_disk_holder(struct block_device *bdev,
 					 struct gendisk *disk)
 {
 }
-static inline int bd_register_pending_holders(struct gendisk *disk)
-{
-	return 0;
-}
 #endif /* CONFIG_BLOCK_HOLDER_DEPRECATED */
···
 /* assumes size > 256 */
 static inline unsigned int blksize_bits(unsigned int size)
 {
-	unsigned int bits = 8;
-	do {
-		bits++;
-		size >>= 1;
-	} while (size > 256);
-	return bits;
+	return order_base_2(size >> SECTOR_SHIFT) + SECTOR_SHIFT;
 }

 static inline unsigned int block_size(struct block_device *bdev)
···
 	void (*swap_slot_free_notify) (struct block_device *, unsigned long);
 	int (*report_zones)(struct gendisk *, sector_t sector,
 			unsigned int nr_zones, report_zones_cb cb, void *data);
-	char *(*devnode)(struct gendisk *disk, umode_t *mode);
 	/* returns the length of the identifier or a negative errno: */
 	int (*get_unique_id)(struct gendisk *disk, u8 id[16],
 			enum blk_unique_id id_type);
···
 void bdev_end_io_acct(struct block_device *bdev, enum req_op op,
 		      unsigned long start_time);

-void bio_start_io_acct_time(struct bio *bio, unsigned long start_time);
 unsigned long bio_start_io_acct(struct bio *bio);
 void bio_end_io_acct_remapped(struct bio *bio, unsigned long start_time,
 			      struct block_device *orig_bdev);
include/linux/lru_cache.h | -3
 	unsigned long flags;


-	void  *lc_private;
 	const char *name;

 	/* nr_elements there */
···
 		unsigned e_count, size_t e_size, size_t e_off);
 extern void lc_reset(struct lru_cache *lc);
 extern void lc_destroy(struct lru_cache *lc);
-extern void lc_set(struct lru_cache *lc, unsigned int enr, int index);
 extern void lc_del(struct lru_cache *lc, struct lc_element *element);

 extern struct lc_element *lc_get_cumulative(struct lru_cache *lc, unsigned int enr);
···
 	container_of(ptr, type, member)

 extern struct lc_element *lc_element_by_index(struct lru_cache *lc, unsigned i);
-extern unsigned int lc_index_of(struct lru_cache *lc, struct lc_element *e);

 #endif
+5
include/linux/mempool.h
··· 30 30 return pool->elements != NULL; 31 31 } 32 32 33 + static inline bool mempool_is_saturated(mempool_t *pool) 34 + { 35 + return READ_ONCE(pool->curr_nr) >= pool->min_nr; 36 + } 37 + 33 38 void mempool_exit(mempool_t *pool); 34 39 int mempool_init_node(mempool_t *pool, int min_nr, mempool_alloc_t *alloc_fn, 35 40 mempool_free_t *free_fn, void *pool_data,
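The new `mempool_is_saturated()` helper lets callers skip returning an element to a pool whose reserve is already full. A minimal userspace sketch with a stand-in `mempool_t` holding only the two fields the helper reads (the kernel version uses `READ_ONCE()` on `curr_nr` since the pool can be touched concurrently):

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for struct mempool_s: only the fields the helper needs. */
typedef struct {
	int curr_nr;	/* elements currently held in the reserve */
	int min_nr;	/* reserve size the pool guarantees */
} mempool_t;

/* True when the reserve is full: freeing back to the pool would be
 * pointless, so the element can go straight back to its allocator. */
static bool mempool_is_saturated(const mempool_t *pool)
{
	return pool->curr_nr >= pool->min_nr;
}
```

The block layer uses this to avoid cycling elements through a full pool on the completion path.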
+2 -1
include/linux/mm.h
··· 1129 1129 folio_get(page_folio(page)); 1130 1130 } 1131 1131 1132 - bool __must_check try_grab_page(struct page *page, unsigned int flags); 1132 + int __must_check try_grab_page(struct page *page, unsigned int flags); 1133 1133 1134 1134 static inline __must_check bool try_get_page(struct page *page) 1135 1135 { ··· 2979 2979 #define FOLL_SPLIT_PMD 0x20000 /* split huge pmd before returning */ 2980 2980 #define FOLL_PIN 0x40000 /* pages must be released via unpin_user_page */ 2981 2981 #define FOLL_FAST_ONLY 0x80000 /* gup_fast: prevent fall-back to slow gup */ 2982 + #define FOLL_PCI_P2PDMA 0x100000 /* allow returning PCI P2PDMA pages */ 2982 2983 2983 2984 /* 2984 2985 * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
+24
include/linux/mmzone.h
··· 986 986 { 987 987 return page_zonenum(page) == ZONE_DEVICE; 988 988 } 989 + 990 + /* 991 + * Consecutive zone device pages should not be merged into the same sgl 992 + * or bvec segment with other types of pages or if they belong to different 993 + * pgmaps. Otherwise getting the pgmap of a given segment is not possible 994 + * without scanning the entire segment. This helper returns true either if 995 + * both pages are not zone device pages or both pages are zone device pages 996 + * with the same pgmap. 997 + */ 998 + static inline bool zone_device_pages_have_same_pgmap(const struct page *a, 999 + const struct page *b) 1000 + { 1001 + if (is_zone_device_page(a) != is_zone_device_page(b)) 1002 + return false; 1003 + if (!is_zone_device_page(a)) 1004 + return true; 1005 + return a->pgmap == b->pgmap; 1006 + } 1007 + 989 1008 extern void memmap_init_zone_device(struct zone *, unsigned long, 990 1009 unsigned long, struct dev_pagemap *); 991 1010 #else 992 1011 static inline bool is_zone_device_page(const struct page *page) 993 1012 { 994 1013 return false; 1014 + } 1015 + static inline bool zone_device_pages_have_same_pgmap(const struct page *a, 1016 + const struct page *b) 1017 + { 1018 + return true; 995 1019 } 996 1020 #endif 997 1021
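The comment in the hunk above spells out the rule: two pages may share an sgl/bvec segment only if neither is a ZONE_DEVICE page, or both are ZONE_DEVICE pages from the same `pgmap`. A self-contained sketch of that predicate, with toy stand-ins for `struct page` and `struct dev_pagemap` (in the toy model a non-NULL `pgmap` marks a device page; the kernel instead checks the zone number):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct dev_pagemap { int id; };
struct page { struct dev_pagemap *pgmap; };

static bool is_zone_device_page(const struct page *p)
{
	return p->pgmap != NULL;
}

/* Mirrors the helper in the diff: mixed ordinary/device pages never
 * merge; device pages merge only when they share a pgmap, so the
 * pgmap of a segment can be read from any one of its pages. */
static bool zone_device_pages_have_same_pgmap(const struct page *a,
					      const struct page *b)
{
	if (is_zone_device_page(a) != is_zone_device_page(b))
		return false;
	if (!is_zone_device_page(a))
		return true;
	return a->pgmap == b->pgmap;
}
```

lib/scatterlist.c (further down in this pull) builds its `pages_are_mergeable()` check on exactly this helper.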
+2
include/linux/nvme.h
··· 797 797 nvme_cmd_zone_mgmt_send = 0x79, 798 798 nvme_cmd_zone_mgmt_recv = 0x7a, 799 799 nvme_cmd_zone_append = 0x7d, 800 + nvme_cmd_vendor_start = 0x80, 800 801 }; 801 802 802 803 #define nvme_opcode_name(opcode) { opcode, #opcode } ··· 964 963 NVME_RW_PRINFO_PRCHK_GUARD = 1 << 12, 965 964 NVME_RW_PRINFO_PRACT = 1 << 13, 966 965 NVME_RW_DTYPE_STREAMS = 1 << 4, 966 + NVME_WZ_DEAC = 1 << 9, 967 967 }; 968 968 969 969 struct nvme_dsm_cmd {
-197
include/linux/pktcdvd.h
··· 1 - /* 2 - * Copyright (C) 2000 Jens Axboe <axboe@suse.de> 3 - * Copyright (C) 2001-2004 Peter Osterlund <petero2@telia.com> 4 - * 5 - * May be copied or modified under the terms of the GNU General Public 6 - * License. See linux/COPYING for more information. 7 - * 8 - * Packet writing layer for ATAPI and SCSI CD-R, CD-RW, DVD-R, and 9 - * DVD-RW devices. 10 - * 11 - */ 12 - #ifndef __PKTCDVD_H 13 - #define __PKTCDVD_H 14 - 15 - #include <linux/blkdev.h> 16 - #include <linux/completion.h> 17 - #include <linux/cdrom.h> 18 - #include <linux/kobject.h> 19 - #include <linux/sysfs.h> 20 - #include <linux/mempool.h> 21 - #include <uapi/linux/pktcdvd.h> 22 - 23 - /* default bio write queue congestion marks */ 24 - #define PKT_WRITE_CONGESTION_ON 10000 25 - #define PKT_WRITE_CONGESTION_OFF 9000 26 - 27 - 28 - struct packet_settings 29 - { 30 - __u32 size; /* packet size in (512 byte) sectors */ 31 - __u8 fp; /* fixed packets */ 32 - __u8 link_loss; /* the rest is specified 33 - * as per Mt Fuji */ 34 - __u8 write_type; 35 - __u8 track_mode; 36 - __u8 block_mode; 37 - }; 38 - 39 - /* 40 - * Very crude stats for now 41 - */ 42 - struct packet_stats 43 - { 44 - unsigned long pkt_started; 45 - unsigned long pkt_ended; 46 - unsigned long secs_w; 47 - unsigned long secs_rg; 48 - unsigned long secs_r; 49 - }; 50 - 51 - struct packet_cdrw 52 - { 53 - struct list_head pkt_free_list; 54 - struct list_head pkt_active_list; 55 - spinlock_t active_list_lock; /* Serialize access to pkt_active_list */ 56 - struct task_struct *thread; 57 - atomic_t pending_bios; 58 - }; 59 - 60 - /* 61 - * Switch to high speed reading after reading this many kilobytes 62 - * with no interspersed writes. 
63 - */ 64 - #define HI_SPEED_SWITCH 512 65 - 66 - struct packet_iosched 67 - { 68 - atomic_t attention; /* Set to non-zero when queue processing is needed */ 69 - int writing; /* Non-zero when writing, zero when reading */ 70 - spinlock_t lock; /* Protecting read/write queue manipulations */ 71 - struct bio_list read_queue; 72 - struct bio_list write_queue; 73 - sector_t last_write; /* The sector where the last write ended */ 74 - int successive_reads; 75 - }; 76 - 77 - /* 78 - * 32 buffers of 2048 bytes 79 - */ 80 - #if (PAGE_SIZE % CD_FRAMESIZE) != 0 81 - #error "PAGE_SIZE must be a multiple of CD_FRAMESIZE" 82 - #endif 83 - #define PACKET_MAX_SIZE 128 84 - #define FRAMES_PER_PAGE (PAGE_SIZE / CD_FRAMESIZE) 85 - #define PACKET_MAX_SECTORS (PACKET_MAX_SIZE * CD_FRAMESIZE >> 9) 86 - 87 - enum packet_data_state { 88 - PACKET_IDLE_STATE, /* Not used at the moment */ 89 - PACKET_WAITING_STATE, /* Waiting for more bios to arrive, so */ 90 - /* we don't have to do as much */ 91 - /* data gathering */ 92 - PACKET_READ_WAIT_STATE, /* Waiting for reads to fill in holes */ 93 - PACKET_WRITE_WAIT_STATE, /* Waiting for the write to complete */ 94 - PACKET_RECOVERY_STATE, /* Recover after read/write errors */ 95 - PACKET_FINISHED_STATE, /* After write has finished */ 96 - 97 - PACKET_NUM_STATES /* Number of possible states */ 98 - }; 99 - 100 - /* 101 - * Information needed for writing a single packet 102 - */ 103 - struct pktcdvd_device; 104 - 105 - struct packet_data 106 - { 107 - struct list_head list; 108 - 109 - spinlock_t lock; /* Lock protecting state transitions and */ 110 - /* orig_bios list */ 111 - 112 - struct bio_list orig_bios; /* Original bios passed to pkt_make_request */ 113 - /* that will be handled by this packet */ 114 - int write_size; /* Total size of all bios in the orig_bios */ 115 - /* list, measured in number of frames */ 116 - 117 - struct bio *w_bio; /* The bio we will send to the real CD */ 118 - /* device once we have all data for the */ 119 - /* 
packet we are going to write */ 120 - sector_t sector; /* First sector in this packet */ 121 - int frames; /* Number of frames in this packet */ 122 - 123 - enum packet_data_state state; /* Current state */ 124 - atomic_t run_sm; /* Incremented whenever the state */ 125 - /* machine needs to be run */ 126 - long sleep_time; /* Set this to non-zero to make the state */ 127 - /* machine run after this many jiffies. */ 128 - 129 - atomic_t io_wait; /* Number of pending IO operations */ 130 - atomic_t io_errors; /* Number of read/write errors during IO */ 131 - 132 - struct bio *r_bios[PACKET_MAX_SIZE]; /* bios to use during data gathering */ 133 - struct page *pages[PACKET_MAX_SIZE / FRAMES_PER_PAGE]; 134 - 135 - int cache_valid; /* If non-zero, the data for the zone defined */ 136 - /* by the sector variable is completely cached */ 137 - /* in the pages[] vector. */ 138 - 139 - int id; /* ID number for debugging */ 140 - struct pktcdvd_device *pd; 141 - }; 142 - 143 - struct pkt_rb_node { 144 - struct rb_node rb_node; 145 - struct bio *bio; 146 - }; 147 - 148 - struct packet_stacked_data 149 - { 150 - struct bio *bio; /* Original read request bio */ 151 - struct pktcdvd_device *pd; 152 - }; 153 - #define PSD_POOL_SIZE 64 154 - 155 - struct pktcdvd_device 156 - { 157 - struct block_device *bdev; /* dev attached */ 158 - dev_t pkt_dev; /* our dev */ 159 - char name[20]; 160 - struct packet_settings settings; 161 - struct packet_stats stats; 162 - int refcnt; /* Open count */ 163 - int write_speed; /* current write speed, kB/s */ 164 - int read_speed; /* current read speed, kB/s */ 165 - unsigned long offset; /* start offset */ 166 - __u8 mode_offset; /* 0 / 8 */ 167 - __u8 type; 168 - unsigned long flags; 169 - __u16 mmc3_profile; 170 - __u32 nwa; /* next writable address */ 171 - __u32 lra; /* last recorded address */ 172 - struct packet_cdrw cdrw; 173 - wait_queue_head_t wqueue; 174 - 175 - spinlock_t lock; /* Serialize access to bio_queue */ 176 - struct rb_root 
bio_queue; /* Work queue of bios we need to handle */ 177 - int bio_queue_size; /* Number of nodes in bio_queue */ 178 - bool congested; /* Someone is waiting for bio_queue_size 179 - * to drop. */ 180 - sector_t current_sector; /* Keep track of where the elevator is */ 181 - atomic_t scan_queue; /* Set to non-zero when pkt_handle_queue */ 182 - /* needs to be run. */ 183 - mempool_t rb_pool; /* mempool for pkt_rb_node allocations */ 184 - 185 - struct packet_iosched iosched; 186 - struct gendisk *disk; 187 - 188 - int write_congestion_off; 189 - int write_congestion_on; 190 - 191 - struct device *dev; /* sysfs pktcdvd[0-7] dev */ 192 - 193 - struct dentry *dfs_d_root; /* debugfs: devname directory */ 194 - struct dentry *dfs_f_info; /* debugfs: info file */ 195 - }; 196 - 197 - #endif /* __PKTCDVD_H */
-8
include/linux/raid/pq.h
··· 10 10 11 11 #ifdef __KERNEL__ 12 12 13 - /* Set to 1 to use kernel-wide empty_zero_page */ 14 - #define RAID6_USE_EMPTY_ZERO_PAGE 0 15 13 #include <linux/blkdev.h> 16 14 17 - /* We need a pre-zeroed page... if we don't want to use the kernel-provided 18 - one define it here */ 19 - #if RAID6_USE_EMPTY_ZERO_PAGE 20 - # define raid6_empty_zero_page empty_zero_page 21 - #else 22 15 extern const char raid6_empty_zero_page[PAGE_SIZE]; 23 - #endif 24 16 25 17 #else /* ! __KERNEL__ */ 26 18 /* Used for testing in user space */
+11 -5
include/linux/sbitmap.h
··· 87 87 */ 88 88 struct sbq_wait_state { 89 89 /** 90 - * @wait_cnt: Number of frees remaining before we wake up. 91 - */ 92 - atomic_t wait_cnt; 93 - 94 - /** 95 90 * @wait: Wait queue. 96 91 */ 97 92 wait_queue_head_t wait; ··· 133 138 * sbitmap_queue_get_shallow() 134 139 */ 135 140 unsigned int min_shallow_depth; 141 + 142 + /** 143 + * @completion_cnt: Number of bits cleared passed to the 144 + * wakeup function. 145 + */ 146 + atomic_t completion_cnt; 147 + 148 + /** 149 + * @wakeup_cnt: Number of thread wake ups issued. 150 + */ 151 + atomic_t wakeup_cnt; 136 152 }; 137 153 138 154 /**
+2 -1
include/linux/sed-opal.h
··· 11 11 #define LINUX_OPAL_H 12 12 13 13 #include <uapi/linux/sed-opal.h> 14 - #include <linux/kernel.h> 14 + #include <linux/compiler_types.h> 15 + #include <linux/types.h> 15 16 16 17 struct opal_dev; 17 18
+6
include/linux/uio.h
··· 250 250 void iov_iter_discard(struct iov_iter *i, unsigned int direction, size_t count); 251 251 void iov_iter_xarray(struct iov_iter *i, unsigned int direction, struct xarray *xarray, 252 252 loff_t start, size_t count); 253 + ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages, 254 + size_t maxsize, unsigned maxpages, size_t *start, 255 + unsigned gup_flags); 253 256 ssize_t iov_iter_get_pages2(struct iov_iter *i, struct page **pages, 254 257 size_t maxsize, unsigned maxpages, size_t *start); 258 + ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, 259 + struct page ***pages, size_t maxsize, size_t *start, 260 + unsigned gup_flags); 255 261 ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i, struct page ***pages, 256 262 size_t maxsize, size_t *start); 257 263 int iov_iter_npages(const struct iov_iter *i, int maxpages);
+1 -1
include/linux/wait.h
··· 209 209 list_del(&wq_entry->entry); 210 210 } 211 211 212 - void __wake_up(struct wait_queue_head *wq_head, unsigned int mode, int nr, void *key); 212 + int __wake_up(struct wait_queue_head *wq_head, unsigned int mode, int nr, void *key); 213 213 void __wake_up_locked_key(struct wait_queue_head *wq_head, unsigned int mode, void *key); 214 214 void __wake_up_locked_key_bookmark(struct wait_queue_head *wq_head, 215 215 unsigned int mode, void *key, wait_queue_entry_t *bookmark);
+2 -2
include/trace/events/iocost.h
··· 38 38 __assign_str(cgroup, path); 39 39 __entry->now = now->now; 40 40 __entry->vnow = now->vnow; 41 - __entry->vrate = now->vrate; 41 + __entry->vrate = iocg->ioc->vtime_base_rate; 42 42 __entry->last_period = last_period; 43 43 __entry->cur_period = cur_period; 44 44 __entry->vtime = vtime; ··· 160 160 161 161 TP_fast_assign( 162 162 __assign_str(devname, ioc_name(ioc)); 163 - __entry->old_vrate = atomic64_read(&ioc->vtime_rate); 163 + __entry->old_vrate = ioc->vtime_base_rate; 164 164 __entry->new_vrate = new_vrate; 165 165 __entry->busy_level = ioc->busy_level; 166 166 __entry->read_missed_ppm = missed_ppm[READ];
-112
include/uapi/linux/pktcdvd.h
··· 1 - /* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ 2 - /* 3 - * Copyright (C) 2000 Jens Axboe <axboe@suse.de> 4 - * Copyright (C) 2001-2004 Peter Osterlund <petero2@telia.com> 5 - * 6 - * May be copied or modified under the terms of the GNU General Public 7 - * License. See linux/COPYING for more information. 8 - * 9 - * Packet writing layer for ATAPI and SCSI CD-R, CD-RW, DVD-R, and 10 - * DVD-RW devices. 11 - * 12 - */ 13 - #ifndef _UAPI__PKTCDVD_H 14 - #define _UAPI__PKTCDVD_H 15 - 16 - #include <linux/types.h> 17 - 18 - /* 19 - * 1 for normal debug messages, 2 is very verbose. 0 to turn it off. 20 - */ 21 - #define PACKET_DEBUG 1 22 - 23 - #define MAX_WRITERS 8 24 - 25 - #define PKT_RB_POOL_SIZE 512 26 - 27 - /* 28 - * How long we should hold a non-full packet before starting data gathering. 29 - */ 30 - #define PACKET_WAIT_TIME (HZ * 5 / 1000) 31 - 32 - /* 33 - * use drive write caching -- we need deferred error handling to be 34 - * able to successfully recover with this option (drive will return good 35 - * status as soon as the cdb is validated). 
36 - */ 37 - #if defined(CONFIG_CDROM_PKTCDVD_WCACHE) 38 - #define USE_WCACHING 1 39 - #else 40 - #define USE_WCACHING 0 41 - #endif 42 - 43 - /* 44 - * No user-servicable parts beyond this point -> 45 - */ 46 - 47 - /* 48 - * device types 49 - */ 50 - #define PACKET_CDR 1 51 - #define PACKET_CDRW 2 52 - #define PACKET_DVDR 3 53 - #define PACKET_DVDRW 4 54 - 55 - /* 56 - * flags 57 - */ 58 - #define PACKET_WRITABLE 1 /* pd is writable */ 59 - #define PACKET_NWA_VALID 2 /* next writable address valid */ 60 - #define PACKET_LRA_VALID 3 /* last recorded address valid */ 61 - #define PACKET_MERGE_SEGS 4 /* perform segment merging to keep */ 62 - /* underlying cdrom device happy */ 63 - 64 - /* 65 - * Disc status -- from READ_DISC_INFO 66 - */ 67 - #define PACKET_DISC_EMPTY 0 68 - #define PACKET_DISC_INCOMPLETE 1 69 - #define PACKET_DISC_COMPLETE 2 70 - #define PACKET_DISC_OTHER 3 71 - 72 - /* 73 - * write type, and corresponding data block type 74 - */ 75 - #define PACKET_MODE1 1 76 - #define PACKET_MODE2 2 77 - #define PACKET_BLOCK_MODE1 8 78 - #define PACKET_BLOCK_MODE2 10 79 - 80 - /* 81 - * Last session/border status 82 - */ 83 - #define PACKET_SESSION_EMPTY 0 84 - #define PACKET_SESSION_INCOMPLETE 1 85 - #define PACKET_SESSION_RESERVED 2 86 - #define PACKET_SESSION_COMPLETE 3 87 - 88 - #define PACKET_MCN "4a656e734178626f65323030300000" 89 - 90 - #undef PACKET_USE_LS 91 - 92 - #define PKT_CTRL_CMD_SETUP 0 93 - #define PKT_CTRL_CMD_TEARDOWN 1 94 - #define PKT_CTRL_CMD_STATUS 2 95 - 96 - struct pkt_ctrl_command { 97 - __u32 command; /* in: Setup, teardown, status */ 98 - __u32 dev_index; /* in/out: Device index */ 99 - __u32 dev; /* in/out: Device nr for cdrw device */ 100 - __u32 pkt_dev; /* in/out: Device nr for packet device */ 101 - __u32 num_devices; /* out: Largest device index + 1 */ 102 - __u32 padding; /* Not used */ 103 - }; 104 - 105 - /* 106 - * packet ioctls 107 - */ 108 - #define PACKET_IOCTL_MAGIC ('X') 109 - #define PACKET_CTRL_CMD 
_IOWR(PACKET_IOCTL_MAGIC, 1, struct pkt_ctrl_command) 110 - 111 - 112 - #endif /* _UAPI__PKTCDVD_H */
+7 -1
include/uapi/linux/sed-opal.h
··· 44 44 OPAL_LK = 0x04, /* 0100 */ 45 45 }; 46 46 47 + enum opal_lock_flags { 48 + /* IOC_OPAL_SAVE will also store the provided key for locking */ 49 + OPAL_SAVE_FOR_LOCK = 0x01, 50 + }; 51 + 47 52 struct opal_key { 48 53 __u8 lr; 49 54 __u8 key_len; ··· 81 76 struct opal_lock_unlock { 82 77 struct opal_session_info session; 83 78 __u32 l_state; 84 - __u8 __align[4]; 79 + __u16 flags; 80 + __u8 __align[2]; 85 81 }; 86 82 87 83 struct opal_new_pw {
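The `opal_lock_unlock` change above carves the new 16-bit `flags` field (for `OPAL_SAVE_FOR_LOCK`) out of the existing 4-byte pad, so the uapi struct size, and hence the ioctl ABI, is unchanged. A quick layout sketch with the uapi `__u32`/`__u16`/`__u8` types mapped to stdint equivalents (only the tail of the struct is modeled here):

```c
#include <assert.h>
#include <stdint.h>

/* Old tail: 32-bit state plus a 4-byte pad */
struct tail_old {
	uint32_t l_state;
	uint8_t  __align[4];
};

/* New tail: two pad bytes become a flags field; total size stays 8 */
struct tail_new {
	uint32_t l_state;
	uint16_t flags;		/* e.g. OPAL_SAVE_FOR_LOCK */
	uint8_t  __align[2];
};
```

Reusing padding this way is the standard trick for extending a uapi struct without breaking existing userspace.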
+2 -1
io_uring/rw.c
··· 671 671 ret = kiocb_set_rw_flags(kiocb, rw->flags); 672 672 if (unlikely(ret)) 673 673 return ret; 674 + kiocb->ki_flags |= IOCB_ALLOC_CACHE; 674 675 675 676 /* 676 677 * If the file is marked O_NONBLOCK, still allow retry for it if it ··· 687 686 return -EOPNOTSUPP; 688 687 689 688 kiocb->private = NULL; 690 - kiocb->ki_flags |= IOCB_HIPRI | IOCB_ALLOC_CACHE; 689 + kiocb->ki_flags |= IOCB_HIPRI; 691 690 kiocb->ki_complete = io_complete_rw_iopoll; 692 691 req->iopoll_completed = 0; 693 692 } else {
+11 -7
kernel/sched/wait.c
··· 121 121 return nr_exclusive; 122 122 } 123 123 124 - static void __wake_up_common_lock(struct wait_queue_head *wq_head, unsigned int mode, 124 + static int __wake_up_common_lock(struct wait_queue_head *wq_head, unsigned int mode, 125 125 int nr_exclusive, int wake_flags, void *key) 126 126 { 127 127 unsigned long flags; 128 128 wait_queue_entry_t bookmark; 129 + int remaining = nr_exclusive; 129 130 130 131 bookmark.flags = 0; 131 132 bookmark.private = NULL; ··· 135 134 136 135 do { 137 136 spin_lock_irqsave(&wq_head->lock, flags); 138 - nr_exclusive = __wake_up_common(wq_head, mode, nr_exclusive, 137 + remaining = __wake_up_common(wq_head, mode, remaining, 139 138 wake_flags, key, &bookmark); 140 139 spin_unlock_irqrestore(&wq_head->lock, flags); 141 140 } while (bookmark.flags & WQ_FLAG_BOOKMARK); 141 + 142 + return nr_exclusive - remaining; 142 143 } 143 144 144 145 /** ··· 150 147 * @nr_exclusive: how many wake-one or wake-many threads to wake up 151 148 * @key: is directly passed to the wakeup function 152 149 * 153 - * If this function wakes up a task, it executes a full memory barrier before 154 - * accessing the task state. 150 + * If this function wakes up a task, it executes a full memory barrier 151 + * before accessing the task state. Returns the number of exclusive 152 + * tasks that were awaken. 155 153 */ 156 - void __wake_up(struct wait_queue_head *wq_head, unsigned int mode, 157 - int nr_exclusive, void *key) 154 + int __wake_up(struct wait_queue_head *wq_head, unsigned int mode, 155 + int nr_exclusive, void *key) 158 156 { 159 - __wake_up_common_lock(wq_head, mode, nr_exclusive, 0, key); 157 + return __wake_up_common_lock(wq_head, mode, nr_exclusive, 0, key); 160 158 } 161 159 EXPORT_SYMBOL(__wake_up); 162 160
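The `__wake_up()` change threads a count back to callers: the bookmark loop now tracks how much of the `nr_exclusive` budget the inner pass left unconsumed, and returns the difference as the number of exclusive tasks actually woken (sbitmap uses this via `wake_up_nr()`). A toy model of just that accounting, with a dummy inner pass standing in for `__wake_up_common()`:

```c
#include <assert.h>

/* Stand-in for __wake_up_common(): consumes the wake budget one
 * waiter at a time and returns whatever budget is left over. */
static int wake_common(int waiters_present, int budget)
{
	while (budget > 0 && waiters_present > 0) {
		waiters_present--;	/* "wake" one exclusive waiter */
		budget--;
	}
	return budget;
}

/* Mirrors __wake_up_common_lock(): woken = budget in - budget out */
static int wake_up_count(int waiters_present, int nr_exclusive)
{
	int remaining = wake_common(waiters_present, nr_exclusive);

	return nr_exclusive - remaining;
}
```

So a wake-3 on a queue holding only two waiters reports 2, letting the caller know part of its budget went unused.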
+4 -3
kernel/trace/blktrace.c
··· 721 721 */ 722 722 723 723 /** 724 - * blk_trace_ioctl: - handle the ioctls associated with tracing 724 + * blk_trace_ioctl - handle the ioctls associated with tracing 725 725 * @bdev: the block device 726 726 * @cmd: the ioctl cmd 727 727 * @arg: the argument data, if any ··· 769 769 } 770 770 771 771 /** 772 - * blk_trace_shutdown: - stop and cleanup trace structures 772 + * blk_trace_shutdown - stop and cleanup trace structures 773 773 * @q: the request queue associated with the device 774 774 * 775 775 **/ ··· 1548 1548 1549 1549 static enum print_line_t blk_tracer_print_line(struct trace_iterator *iter) 1550 1550 { 1551 - if (!(blk_tracer_flags.val & TRACE_BLK_OPT_CLASSIC)) 1551 + if ((iter->ent->type != TRACE_BLK) || 1552 + !(blk_tracer_flags.val & TRACE_BLK_OPT_CLASSIC)) 1552 1553 return TRACE_TYPE_UNHANDLED; 1553 1554 1554 1555 return print_one_line(iter, true);
+24 -8
lib/iov_iter.c
··· 1431 1431 1432 1432 static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i, 1433 1433 struct page ***pages, size_t maxsize, 1434 - unsigned int maxpages, size_t *start) 1434 + unsigned int maxpages, size_t *start, 1435 + unsigned int gup_flags) 1435 1436 { 1436 1437 unsigned int n; 1437 1438 ··· 1444 1443 maxsize = MAX_RW_COUNT; 1445 1444 1446 1445 if (likely(user_backed_iter(i))) { 1447 - unsigned int gup_flags = 0; 1448 1446 unsigned long addr; 1449 1447 int res; 1450 1448 ··· 1493 1493 return -EFAULT; 1494 1494 } 1495 1495 1496 - ssize_t iov_iter_get_pages2(struct iov_iter *i, 1496 + ssize_t iov_iter_get_pages(struct iov_iter *i, 1497 1497 struct page **pages, size_t maxsize, unsigned maxpages, 1498 - size_t *start) 1498 + size_t *start, unsigned gup_flags) 1499 1499 { 1500 1500 if (!maxpages) 1501 1501 return 0; 1502 1502 BUG_ON(!pages); 1503 1503 1504 - return __iov_iter_get_pages_alloc(i, &pages, maxsize, maxpages, start); 1504 + return __iov_iter_get_pages_alloc(i, &pages, maxsize, maxpages, 1505 + start, gup_flags); 1506 + } 1507 + EXPORT_SYMBOL_GPL(iov_iter_get_pages); 1508 + 1509 + ssize_t iov_iter_get_pages2(struct iov_iter *i, struct page **pages, 1510 + size_t maxsize, unsigned maxpages, size_t *start) 1511 + { 1512 + return iov_iter_get_pages(i, pages, maxsize, maxpages, start, 0); 1505 1513 } 1506 1514 EXPORT_SYMBOL(iov_iter_get_pages2); 1507 1515 1508 - ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i, 1516 + ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, 1509 1517 struct page ***pages, size_t maxsize, 1510 - size_t *start) 1518 + size_t *start, unsigned gup_flags) 1511 1519 { 1512 1520 ssize_t len; 1513 1521 1514 1522 *pages = NULL; 1515 1523 1516 - len = __iov_iter_get_pages_alloc(i, pages, maxsize, ~0U, start); 1524 + len = __iov_iter_get_pages_alloc(i, pages, maxsize, ~0U, start, 1525 + gup_flags); 1517 1526 if (len <= 0) { 1518 1527 kvfree(*pages); 1519 1528 *pages = NULL; 1520 1529 } 1521 1530 return len; 1531 + } 1532 + 
EXPORT_SYMBOL_GPL(iov_iter_get_pages_alloc); 1533 + 1534 + ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i, 1535 + struct page ***pages, size_t maxsize, size_t *start) 1536 + { 1537 + return iov_iter_get_pages_alloc(i, pages, maxsize, start, 0); 1522 1538 } 1523 1539 EXPORT_SYMBOL(iov_iter_get_pages_alloc2); 1524 1540
+2 -57
lib/lru_cache.c
··· 60 60 } while (unlikely (val == LC_PARANOIA)); 61 61 /* Spin until no-one is inside a PARANOIA_ENTRY()/RETURN() section. */ 62 62 return 0 == val; 63 - #if 0 64 - /* Alternative approach, spin in case someone enters or leaves a 65 - * PARANOIA_ENTRY()/RETURN() section. */ 66 - unsigned long old, new, val; 67 - do { 68 - old = lc->flags & LC_PARANOIA; 69 - new = old | LC_LOCKED; 70 - val = cmpxchg(&lc->flags, old, new); 71 - } while (unlikely (val == (old ^ LC_PARANOIA))); 72 - return old == val; 73 - #endif 74 63 } 75 64 76 65 /** ··· 353 364 struct lc_element *e; 354 365 355 366 PARANOIA_ENTRY(); 356 - if (lc->flags & LC_STARVING) { 367 + if (test_bit(__LC_STARVING, &lc->flags)) { 357 368 ++lc->starving; 358 369 RETURN(NULL); 359 370 } ··· 406 417 * the LRU element, we have to wait ... 407 418 */ 408 419 if (!lc_unused_element_available(lc)) { 409 - __set_bit(__LC_STARVING, &lc->flags); 420 + set_bit(__LC_STARVING, &lc->flags); 410 421 RETURN(NULL); 411 422 } 412 423 ··· 575 586 } 576 587 577 588 /** 578 - * lc_index_of 579 - * @lc: the lru cache to operate on 580 - * @e: the element to query for its index position in lc->element 581 - */ 582 - unsigned int lc_index_of(struct lru_cache *lc, struct lc_element *e) 583 - { 584 - PARANOIA_LC_ELEMENT(lc, e); 585 - return e->lc_index; 586 - } 587 - 588 - /** 589 - * lc_set - associate index with label 590 - * @lc: the lru cache to operate on 591 - * @enr: the label to set 592 - * @index: the element index to associate label with. 593 - * 594 - * Used to initialize the active set to some previously recorded state. 
595 - */ 596 - void lc_set(struct lru_cache *lc, unsigned int enr, int index) 597 - { 598 - struct lc_element *e; 599 - struct list_head *lh; 600 - 601 - if (index < 0 || index >= lc->nr_elements) 602 - return; 603 - 604 - e = lc_element_by_index(lc, index); 605 - BUG_ON(e->lc_number != e->lc_new_number); 606 - BUG_ON(e->refcnt != 0); 607 - 608 - e->lc_number = e->lc_new_number = enr; 609 - hlist_del_init(&e->colision); 610 - if (enr == LC_FREE) 611 - lh = &lc->free; 612 - else { 613 - hlist_add_head(&e->colision, lc_hash_slot(lc, enr)); 614 - lh = &lc->lru; 615 - } 616 - list_move(&e->list, lh); 617 - } 618 - 619 - /** 620 589 * lc_seq_dump_details - Dump a complete LRU cache to seq in textual form. 621 590 * @lc: the lru cache to operate on 622 591 * @seq: the &struct seq_file pointer to seq_printf into ··· 608 661 EXPORT_SYMBOL(lc_create); 609 662 EXPORT_SYMBOL(lc_reset); 610 663 EXPORT_SYMBOL(lc_destroy); 611 - EXPORT_SYMBOL(lc_set); 612 664 EXPORT_SYMBOL(lc_del); 613 665 EXPORT_SYMBOL(lc_try_get); 614 666 EXPORT_SYMBOL(lc_find); ··· 615 669 EXPORT_SYMBOL(lc_put); 616 670 EXPORT_SYMBOL(lc_committed); 617 671 EXPORT_SYMBOL(lc_element_by_index); 618 - EXPORT_SYMBOL(lc_index_of); 619 672 EXPORT_SYMBOL(lc_seq_printf_stats); 620 673 EXPORT_SYMBOL(lc_seq_dump_details); 621 674 EXPORT_SYMBOL(lc_try_lock);
-2
lib/raid6/algos.c
··· 18 18 #else 19 19 #include <linux/module.h> 20 20 #include <linux/gfp.h> 21 - #if !RAID6_USE_EMPTY_ZERO_PAGE 22 21 /* In .bss so it's zeroed */ 23 22 const char raid6_empty_zero_page[PAGE_SIZE] __attribute__((aligned(256))); 24 23 EXPORT_SYMBOL(raid6_empty_zero_page); 25 - #endif 26 24 #endif 27 25 28 26 struct raid6_calls raid6_call;
+42 -110
lib/sbitmap.c
··· 434 434 sbq->wake_batch = sbq_calc_wake_batch(sbq, depth); 435 435 atomic_set(&sbq->wake_index, 0); 436 436 atomic_set(&sbq->ws_active, 0); 437 + atomic_set(&sbq->completion_cnt, 0); 438 + atomic_set(&sbq->wakeup_cnt, 0); 437 439 438 440 sbq->ws = kzalloc_node(SBQ_WAIT_QUEUES * sizeof(*sbq->ws), flags, node); 439 441 if (!sbq->ws) { ··· 443 441 return -ENOMEM; 444 442 } 445 443 446 - for (i = 0; i < SBQ_WAIT_QUEUES; i++) { 444 + for (i = 0; i < SBQ_WAIT_QUEUES; i++) 447 445 init_waitqueue_head(&sbq->ws[i].wait); 448 - atomic_set(&sbq->ws[i].wait_cnt, sbq->wake_batch); 449 - } 450 446 451 447 return 0; 452 448 } 453 449 EXPORT_SYMBOL_GPL(sbitmap_queue_init_node); 454 - 455 - static inline void __sbitmap_queue_update_wake_batch(struct sbitmap_queue *sbq, 456 - unsigned int wake_batch) 457 - { 458 - int i; 459 - 460 - if (sbq->wake_batch != wake_batch) { 461 - WRITE_ONCE(sbq->wake_batch, wake_batch); 462 - /* 463 - * Pairs with the memory barrier in sbitmap_queue_wake_up() 464 - * to ensure that the batch size is updated before the wait 465 - * counts. 
466 - */ 467 - smp_mb(); 468 - for (i = 0; i < SBQ_WAIT_QUEUES; i++) 469 - atomic_set(&sbq->ws[i].wait_cnt, 1); 470 - } 471 - } 472 450 473 451 static void sbitmap_queue_update_wake_batch(struct sbitmap_queue *sbq, 474 452 unsigned int depth) ··· 456 474 unsigned int wake_batch; 457 475 458 476 wake_batch = sbq_calc_wake_batch(sbq, depth); 459 - __sbitmap_queue_update_wake_batch(sbq, wake_batch); 477 + if (sbq->wake_batch != wake_batch) 478 + WRITE_ONCE(sbq->wake_batch, wake_batch); 460 479 } 461 480 462 481 void sbitmap_queue_recalculate_wake_batch(struct sbitmap_queue *sbq, ··· 471 488 472 489 wake_batch = clamp_val(depth / SBQ_WAIT_QUEUES, 473 490 min_batch, SBQ_WAKE_BATCH); 474 - __sbitmap_queue_update_wake_batch(sbq, wake_batch); 491 + 492 + WRITE_ONCE(sbq->wake_batch, wake_batch); 475 493 } 476 494 EXPORT_SYMBOL_GPL(sbitmap_queue_recalculate_wake_batch); 477 495 ··· 560 576 } 561 577 EXPORT_SYMBOL_GPL(sbitmap_queue_min_shallow_depth); 562 578 563 - static struct sbq_wait_state *sbq_wake_ptr(struct sbitmap_queue *sbq) 579 + static void __sbitmap_queue_wake_up(struct sbitmap_queue *sbq, int nr) 564 580 { 565 581 int i, wake_index; 566 582 567 583 if (!atomic_read(&sbq->ws_active)) 568 - return NULL; 584 + return; 569 585 570 586 wake_index = atomic_read(&sbq->wake_index); 571 587 for (i = 0; i < SBQ_WAIT_QUEUES; i++) { 572 588 struct sbq_wait_state *ws = &sbq->ws[wake_index]; 573 589 574 - if (waitqueue_active(&ws->wait) && atomic_read(&ws->wait_cnt)) { 575 - if (wake_index != atomic_read(&sbq->wake_index)) 576 - atomic_set(&sbq->wake_index, wake_index); 577 - return ws; 578 - } 579 - 590 + /* 591 + * Advance the index before checking the current queue. 592 + * It improves fairness, by ensuring the queue doesn't 593 + * need to be fully emptied before trying to wake up 594 + * from the next one. 595 + */ 580 596 wake_index = sbq_index_inc(wake_index); 597 + 598 + /* 599 + * It is sufficient to wake up at least one waiter to 600 + * guarantee forward progress. 
601 + */ 602 + if (waitqueue_active(&ws->wait) && 603 + wake_up_nr(&ws->wait, nr)) 604 + break; 581 605 } 582 606 583 - return NULL; 584 - } 585 - 586 - static bool __sbq_wake_up(struct sbitmap_queue *sbq, int *nr) 587 - { 588 - struct sbq_wait_state *ws; 589 - unsigned int wake_batch; 590 - int wait_cnt, cur, sub; 591 - bool ret; 592 - 593 - if (*nr <= 0) 594 - return false; 595 - 596 - ws = sbq_wake_ptr(sbq); 597 - if (!ws) 598 - return false; 599 - 600 - cur = atomic_read(&ws->wait_cnt); 601 - do { 602 - /* 603 - * For concurrent callers of this, callers should call this 604 - * function again to wakeup a new batch on a different 'ws'. 605 - */ 606 - if (cur == 0) 607 - return true; 608 - sub = min(*nr, cur); 609 - wait_cnt = cur - sub; 610 - } while (!atomic_try_cmpxchg(&ws->wait_cnt, &cur, wait_cnt)); 611 - 612 - /* 613 - * If we decremented queue without waiters, retry to avoid lost 614 - * wakeups. 615 - */ 616 - if (wait_cnt > 0) 617 - return !waitqueue_active(&ws->wait); 618 - 619 - *nr -= sub; 620 - 621 - /* 622 - * When wait_cnt == 0, we have to be particularly careful as we are 623 - * responsible to reset wait_cnt regardless whether we've actually 624 - * woken up anybody. But in case we didn't wakeup anybody, we still 625 - * need to retry. 626 - */ 627 - ret = !waitqueue_active(&ws->wait); 628 - wake_batch = READ_ONCE(sbq->wake_batch); 629 - 630 - /* 631 - * Wake up first in case that concurrent callers decrease wait_cnt 632 - * while waitqueue is empty. 633 - */ 634 - wake_up_nr(&ws->wait, wake_batch); 635 - 636 - /* 637 - * Pairs with the memory barrier in sbitmap_queue_resize() to 638 - * ensure that we see the batch size update before the wait 639 - * count is reset. 640 - * 641 - * Also pairs with the implicit barrier between decrementing wait_cnt 642 - * and checking for waitqueue_active() to make sure waitqueue_active() 643 - * sees result of the wakeup if atomic_dec_return() has seen the result 644 - * of atomic_set(). 
645 - */ 646 - smp_mb__before_atomic(); 647 - 648 - /* 649 - * Increase wake_index before updating wait_cnt, otherwise concurrent 650 - * callers can see valid wait_cnt in old waitqueue, which can cause 651 - * invalid wakeup on the old waitqueue. 652 - */ 653 - sbq_index_atomic_inc(&sbq->wake_index); 654 - atomic_set(&ws->wait_cnt, wake_batch); 655 - 656 - return ret || *nr; 607 + if (wake_index != atomic_read(&sbq->wake_index)) 608 + atomic_set(&sbq->wake_index, wake_index); 657 609 } 658 610 659 611 void sbitmap_queue_wake_up(struct sbitmap_queue *sbq, int nr) 660 612 { 661 - while (__sbq_wake_up(sbq, &nr)) 662 - ; 613 + unsigned int wake_batch = READ_ONCE(sbq->wake_batch); 614 + unsigned int wakeups; 615 + 616 + if (!atomic_read(&sbq->ws_active)) 617 + return; 618 + 619 + atomic_add(nr, &sbq->completion_cnt); 620 + wakeups = atomic_read(&sbq->wakeup_cnt); 621 + 622 + do { 623 + if (atomic_read(&sbq->completion_cnt) - wakeups < wake_batch) 624 + return; 625 + } while (!atomic_try_cmpxchg(&sbq->wakeup_cnt, 626 + &wakeups, wakeups + wake_batch)); 627 + 628 + __sbitmap_queue_wake_up(sbq, wake_batch); 663 629 } 664 630 EXPORT_SYMBOL_GPL(sbitmap_queue_wake_up); 665 631 ··· 726 792 seq_puts(m, "ws={\n"); 727 793 for (i = 0; i < SBQ_WAIT_QUEUES; i++) { 728 794 struct sbq_wait_state *ws = &sbq->ws[i]; 729 - 730 - seq_printf(m, "\t{.wait_cnt=%d, .wait=%s},\n", 731 - atomic_read(&ws->wait_cnt), 795 + seq_printf(m, "\t{.wait=%s},\n", 732 796 waitqueue_active(&ws->wait) ? "active" : "inactive"); 733 797 } 734 798 seq_puts(m, "}\n");
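The sbitmap rework above drops the fragile per-waitqueue `wait_cnt` in favor of two global counters: completions accumulate in `completion_cnt`, and a waker fires only after claiming a full batch by advancing `wakeup_cnt` with a cmpxchg, so exactly one of several concurrent completers wakes per batch. A standalone C11 sketch of that claim protocol using `<stdatomic.h>` (toy names; single-threaded asserts only, though the cmpxchg loop is what makes it safe concurrently):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct batcher {
	atomic_int completion_cnt;	/* bits cleared so far */
	atomic_int wakeup_cnt;		/* completions already "paid for" */
	int wake_batch;
};

/* Returns true when this caller claimed a batch and should issue
 * the wakeup; mirrors the loop in sbitmap_queue_wake_up(). */
static bool note_completions(struct batcher *b, int nr)
{
	int wakeups;

	atomic_fetch_add(&b->completion_cnt, nr);
	wakeups = atomic_load(&b->wakeup_cnt);
	do {
		/* Not a full unclaimed batch yet: someone else's problem */
		if (atomic_load(&b->completion_cnt) - wakeups < b->wake_batch)
			return false;
	} while (!atomic_compare_exchange_weak(&b->wakeup_cnt, &wakeups,
					       wakeups + b->wake_batch));
	return true;
}
```

With a batch of 4, three completions are absorbed silently, the fourth claims the batch, and further completions start filling the next one.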
+15 -10
lib/scatterlist.c
···
        return new_sg;
 }
 
+static bool pages_are_mergeable(struct page *a, struct page *b)
+{
+        if (page_to_pfn(a) != page_to_pfn(b) + 1)
+                return false;
+        if (!zone_device_pages_have_same_pgmap(a, b))
+                return false;
+        return true;
+}
+
 /**
  * sg_alloc_append_table_from_pages - Allocate and initialize an append sg
  *                                    table from an array of pages
···
        unsigned int chunks, cur_page, seg_len, i, prv_len = 0;
        unsigned int added_nents = 0;
        struct scatterlist *s = sgt_append->prv;
+        struct page *last_pg;
 
        /*
         * The algorithm below requires max_segment to be aligned to PAGE_SIZE
···
                return -EOPNOTSUPP;
 
        if (sgt_append->prv) {
-                unsigned long paddr =
-                        (page_to_pfn(sg_page(sgt_append->prv)) * PAGE_SIZE +
-                         sgt_append->prv->offset + sgt_append->prv->length) /
-                        PAGE_SIZE;
-
                if (WARN_ON(offset))
                        return -EINVAL;
 
                /* Merge contiguous pages into the last SG */
                prv_len = sgt_append->prv->length;
-                while (n_pages && page_to_pfn(pages[0]) == paddr) {
+                last_pg = sg_page(sgt_append->prv);
+                while (n_pages && pages_are_mergeable(last_pg, pages[0])) {
                        if (sgt_append->prv->length + PAGE_SIZE > max_segment)
                                break;
                        sgt_append->prv->length += PAGE_SIZE;
-                        paddr++;
+                        last_pg = pages[0];
                        pages++;
                        n_pages--;
                }
···
        for (i = 1; i < n_pages; i++) {
                seg_len += PAGE_SIZE;
                if (seg_len >= max_segment ||
-                    page_to_pfn(pages[i]) != page_to_pfn(pages[i - 1]) + 1) {
+                    !pages_are_mergeable(pages[i], pages[i - 1])) {
                        chunks++;
                        seg_len = 0;
                }
···
                for (j = cur_page + 1; j < n_pages; j++) {
                        seg_len += PAGE_SIZE;
                        if (seg_len >= max_segment ||
-                            page_to_pfn(pages[j]) !=
-                            page_to_pfn(pages[j - 1]) + 1)
+                            !pages_are_mergeable(pages[j], pages[j - 1]))
                                break;
                }
 
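The effect of pages_are_mergeable() can be sketched outside the kernel: two pages may share a scatterlist segment only when they are physically contiguous (consecutive pfns) and, for ZONE_DEVICE pages, backed by the same dev_pagemap. In this toy model, `struct page_model` and its `pgmap_id` field are invented stand-ins for `page_to_pfn()` and `zone_device_pages_have_same_pgmap()`:

```c
#include <assert.h>
#include <stdbool.h>

/* Invented stand-in for struct page: a pfn plus an id identifying
 * which dev_pagemap (if any) the page belongs to. */
struct page_model {
        unsigned long pfn;
        int pgmap_id;
};

/* Mirrors the merge check: b must immediately precede a physically,
 * and both must come from the same pagemap. */
static bool pages_are_mergeable(const struct page_model *a,
                                const struct page_model *b)
{
        if (a->pfn != b->pfn + 1)
                return false;
        if (a->pgmap_id != b->pgmap_id)
                return false;
        return true;
}
```

The second test is the one the patch adds: without it, two physically adjacent pages from different device pagemaps would be wrongly folded into one segment.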
+32 -13
mm/gup.c
···
  */
 struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
 {
+        if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))
+                return NULL;
+
        if (flags & FOLL_GET)
                return try_get_folio(page, refs);
        else if (flags & FOLL_PIN) {
···
  * time. Cases: please see the try_grab_folio() documentation, with
  * "refs=1".
  *
- * Return: true for success, or if no action was required (if neither FOLL_PIN
- * nor FOLL_GET was set, nothing is done). False for failure: FOLL_GET or
- * FOLL_PIN was set, but the page could not be grabbed.
+ * Return: 0 for success, or if no action was required (if neither FOLL_PIN
+ * nor FOLL_GET was set, nothing is done). A negative error code for failure:
+ *
+ *   -ENOMEM            FOLL_GET or FOLL_PIN was set, but the page could not
+ *                      be grabbed.
  */
-bool __must_check try_grab_page(struct page *page, unsigned int flags)
+int __must_check try_grab_page(struct page *page, unsigned int flags)
 {
        struct folio *folio = page_folio(page);
 
        WARN_ON_ONCE((flags & (FOLL_GET | FOLL_PIN)) == (FOLL_GET | FOLL_PIN));
        if (WARN_ON_ONCE(folio_ref_count(folio) <= 0))
-                return false;
+                return -ENOMEM;
+
+        if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))
+                return -EREMOTEIO;
 
        if (flags & FOLL_GET)
                folio_ref_inc(folio);
···
                node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, 1);
        }
 
-        return true;
+        return 0;
 }
 
 /**
···
                       !PageAnonExclusive(page), page);
 
        /* try_grab_page() does nothing unless FOLL_GET or FOLL_PIN is set. */
-        if (unlikely(!try_grab_page(page, flags))) {
-                page = ERR_PTR(-ENOMEM);
+        ret = try_grab_page(page, flags);
+        if (unlikely(ret)) {
+                page = ERR_PTR(ret);
                goto out;
        }
+
        /*
         * We need to make the page accessible if and only if we are going
         * to access its content (the FOLL_PIN case). Please see
···
                        goto unmap;
                *page = pte_page(*pte);
        }
-        if (unlikely(!try_grab_page(*page, gup_flags))) {
-                ret = -ENOMEM;
+        ret = try_grab_page(*page, gup_flags);
+        if (unlikely(ret))
                goto unmap;
-        }
 out:
        ret = 0;
 unmap:
···
                return -EFAULT;
 
        if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
+                return -EOPNOTSUPP;
+
+        if ((gup_flags & FOLL_LONGTERM) && (gup_flags & FOLL_PCI_P2PDMA))
                return -EOPNOTSUPP;
 
        if (vma_is_secretmem(vma))
···
                        undo_dev_pagemap(nr, nr_start, flags, pages);
                        break;
                }
+
+                if (!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)) {
+                        undo_dev_pagemap(nr, nr_start, flags, pages);
+                        break;
+                }
+
                SetPageReferenced(page);
                pages[*nr] = page;
-                if (unlikely(!try_grab_page(page, flags))) {
+                if (unlikely(try_grab_page(page, flags))) {
                        undo_dev_pagemap(nr, nr_start, flags, pages);
                        break;
                }
···
 
        if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM |
                                       FOLL_FORCE | FOLL_PIN | FOLL_GET |
-                                       FOLL_FAST_ONLY | FOLL_NOFAULT)))
+                                       FOLL_FAST_ONLY | FOLL_NOFAULT |
+                                       FOLL_PCI_P2PDMA)))
                return -EINVAL;
 
        if (gup_flags & FOLL_PIN)
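The signature change above turns try_grab_page() from a bool into an errno return, so callers can distinguish a refcount failure (-ENOMEM) from a disallowed P2PDMA page (-EREMOTEIO) instead of hardcoding one error. A user-space model of that decision logic; the flag value and the page struct are made up, and only the control flow mirrors the patch:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical flag value; the real FOLL_PCI_P2PDMA lives in the
 * kernel's FOLL_* namespace. */
#define FOLL_PCI_P2PDMA 0x1

/* Toy stand-in for struct page. */
struct page_model {
        int refcount;
        bool is_p2pdma;
};

/* Mirrors the new int-returning try_grab_page(): each failure mode
 * gets its own errno instead of a bare false. */
static int try_grab_page_model(struct page_model *page, unsigned int flags)
{
        if (page->refcount <= 0)
                return -ENOMEM;         /* refcount would overflow/underflow */
        if (!(flags & FOLL_PCI_P2PDMA) && page->is_p2pdma)
                return -EREMOTEIO;      /* caller did not opt in to P2PDMA */
        page->refcount++;
        return 0;
}
```

With an errno return, a caller such as follow_page_pte() can simply do `page = ERR_PTR(ret)` and let the real failure reason propagate.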
+13 -6
mm/huge_memory.c
···
        unsigned long pfn = pmd_pfn(*pmd);
        struct mm_struct *mm = vma->vm_mm;
        struct page *page;
+        int ret;
 
        assert_spin_locked(pmd_lockptr(mm, pmd));
···
        if (!*pgmap)
                return ERR_PTR(-EFAULT);
        page = pfn_to_page(pfn);
-        if (!try_grab_page(page, flags))
-                page = ERR_PTR(-ENOMEM);
+        ret = try_grab_page(page, flags);
+        if (ret)
+                page = ERR_PTR(ret);
 
        return page;
 }
···
        unsigned long pfn = pud_pfn(*pud);
        struct mm_struct *mm = vma->vm_mm;
        struct page *page;
+        int ret;
 
        assert_spin_locked(pud_lockptr(mm, pud));
···
        if (!*pgmap)
                return ERR_PTR(-EFAULT);
        page = pfn_to_page(pfn);
-        if (!try_grab_page(page, flags))
-                page = ERR_PTR(-ENOMEM);
+
+        ret = try_grab_page(page, flags);
+        if (ret)
+                page = ERR_PTR(ret);
 
        return page;
 }
···
 {
        struct mm_struct *mm = vma->vm_mm;
        struct page *page;
+        int ret;
 
        assert_spin_locked(pmd_lockptr(mm, pmd));
···
        VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
                       !PageAnonExclusive(page), page);
 
-        if (!try_grab_page(page, flags))
-                return ERR_PTR(-ENOMEM);
+        ret = try_grab_page(page, flags);
+        if (ret)
+                return ERR_PTR(ret);
 
        if (flags & FOLL_TOUCH)
                touch_pmd(vma, addr, pmd, flags & FOLL_WRITE);
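The huge_memory.c callers above propagate try_grab_page()'s return value through ERR_PTR(ret) rather than hardcoding -ENOMEM. As a reminder of the idiom, here is a simplified user-space rendition of the kernel's ERR_PTR/PTR_ERR/IS_ERR helpers (the real versions live in include/linux/err.h; this sketch only illustrates the encoding):

```c
#include <assert.h>
#include <errno.h>

/* A negative errno is encoded directly into a pointer value: errnos
 * occupy the top MAX_ERRNO addresses, which no valid kernel pointer
 * ever uses, so error and success can share one return value. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
        return (void *)error;
}

static inline long PTR_ERR(const void *ptr)
{
        return (long)ptr;
}

static inline int IS_ERR(const void *ptr)
{
        return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}
```

This is why the switch from bool to int pays off: whatever errno try_grab_page() reports, the caller can hand it back as a page pointer without losing information.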
+13 -10
mm/hugetlb.c
···
                 * tables. If the huge page is present, then the tail
                 * pages must also be present. The ptl prevents the
                 * head page and tail pages from being rearranged in
-                 * any way. So this page must be available at this
-                 * point, unless the page refcount overflowed:
+                 * any way. As this is hugetlb, the pages will never
+                 * be p2pdma or not longterm pinable. So this page
+                 * must be available at this point, unless the page
+                 * refcount overflowed:
                 */
                if (WARN_ON_ONCE(!try_grab_folio(pages[i], refs,
                                                 flags))) {
···
        page = pte_page(pte) +
               ((address & ~huge_page_mask(h)) >> PAGE_SHIFT);
        /*
-         * try_grab_page() should always succeed here, because: a) we
-         * hold the pmd (ptl) lock, and b) we've just checked that the
-         * huge pmd (head) page is present in the page tables. The ptl
-         * prevents the head page and tail pages from being rearranged
-         * in any way. So this page must be available at this point,
-         * unless the page refcount overflowed:
+         * try_grab_page() should always be able to get the page here,
+         * because: a) we hold the pmd (ptl) lock, and b) we've just
+         * checked that the huge pmd (head) page is present in the
+         * page tables. The ptl prevents the head page and tail pages
+         * from being rearranged in any way. So this page must be
+         * available at this point, unless the page refcount
+         * overflowed:
         */
-        if (WARN_ON_ONCE(!try_grab_page(page, flags))) {
+        if (try_grab_page(page, flags)) {
                page = NULL;
                goto out;
        }
···
        pte = huge_ptep_get((pte_t *)pud);
        if (pte_present(pte)) {
                page = pud_page(*pud) + ((address & ~PUD_MASK) >> PAGE_SHIFT);
-                if (WARN_ON_ONCE(!try_grab_page(page, flags))) {
+                if (try_grab_page(page, flags)) {
                        page = NULL;
                        goto out;
                }