
Merge tag 'for-4.16/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

- DM core fixes to ensure that bio submission follows a depth-first
tree walk; this is critical to allow forward progress without the
need to use the bioset's BIOSET_NEED_RESCUER.

- Remove DM core's BIOSET_NEED_RESCUER based dm_offload infrastructure.

- DM core cleanups and improvements to make bio-based DM more efficient
(e.g. reduced memory footprint as well as leveraging per-bio-data more).

- Introduce new bio-based mode (DM_TYPE_NVME_BIO_BASED) that leverages
the more direct IO submission path in the block layer; this mode is
used by DM multipath and also optimizes targets like DM thin-pool
that stack directly on an NVMe data device.

- DM multipath improvements to factor out legacy SCSI-only (e.g.
scsi_dh) code paths to allow for more optimized support for NVMe
multipath.

- A fix for DM multipath path selectors (service-time and queue-length)
to select paths in a more balanced way; largely academic but doesn't
hurt.

- Numerous DM raid target fixes and improvements.

- Add a new DM "unstriped" target that enables Intel to work around
firmware limitations in some NVMe drives that are striped internally
(this target also works when stacked above the DM "striped" target).

- Various Documentation fixes and improvements.

- Misc cleanups and fixes across various DM infrastructure and targets
(e.g. bufio, flakey, log-writes, snapshot).

* tag 'for-4.16/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (69 commits)
dm cache: Documentation: update default migration_throttling value
dm mpath selector: more evenly distribute ties
dm unstripe: fix target length versus number of stripes size check
dm thin: fix trailing semicolon in __remap_and_issue_shared_cell
dm table: fix NVMe bio-based dm_table_determine_type() validation
dm: various cleanups to md->queue initialization code
dm mpath: delay the retry of a request if the target responded as busy
dm mpath: return DM_MAPIO_DELAY_REQUEUE if QUEUE_IO or PG_INIT_REQUIRED
dm mpath: return DM_MAPIO_REQUEUE on blk-mq rq allocation failure
dm log writes: fix max length used for kstrndup
dm: backfill missing calls to mutex_destroy()
dm snapshot: use mutex instead of rw_semaphore
dm flakey: check for null arg_name in parse_features()
dm thin: extend thinpool status format string with omitted fields
dm thin: fixes in thin-provisioning.txt
dm thin: document representation of <highest mapped sector> when there is none
dm thin: fix documentation relative to low water mark threshold
dm cache: be consistent in specifying sectors and SI units in cache.txt
dm cache: delete obsoleted paragraph in cache.txt
dm cache: fix grammar in cache-policies.txt
...

+1410 -672
+2 -2
Documentation/device-mapper/cache-policies.txt
···
60 60 The mq policy used a lot of memory; 88 bytes per cache block on a 64
61 61 bit machine.
62 62
63 - smq uses 28bit indexes to implement it's data structures rather than
63 + smq uses 28bit indexes to implement its data structures rather than
64 64 pointers. It avoids storing an explicit hit count for each block. It
65 65 has a 'hotspot' queue, rather than a pre-cache, which uses a quarter of
66 66 the entries (each hotspot block covers a larger area than a single
···
84 84
85 85 Adaptability:
86 86 The mq policy maintained a hit count for each cache block. For a
87 - different block to get promoted to the cache it's hit count has to
87 + different block to get promoted to the cache its hit count has to
88 88 exceed the lowest currently in the cache. This meant it could take a
89 89 long time for the cache to adapt between varying IO patterns.
90 90
+2 -7
Documentation/device-mapper/cache.txt
···
59 59 The origin is divided up into blocks of a fixed size. This block size
60 60 is configurable when you first create the cache. Typically we've been
61 61 using block sizes of 256KB - 1024KB. The block size must be between 64
62 - (32KB) and 2097152 (1GB) and a multiple of 64 (32KB).
62 + sectors (32KB) and 2097152 sectors (1GB) and a multiple of 64 sectors (32KB).
63 63
64 64 Having a fixed block size simplifies the target a lot. But it is
65 65 something of a compromise. For instance, a small part of a block may be
···
119 119
120 120 For the time being, a message "migration_threshold <#sectors>"
121 121 can be used to set the maximum number of sectors being migrated,
122 - the default being 204800 sectors (or 100MB).
122 + the default being 2048 sectors (1MB).
123 123
124 124 Updating on-disk metadata
125 125 -------------------------
···
142 142 the policy how big this chunk is, but it should be kept small. Like the
143 143 dirty flags this data is lost if there's a crash so a safe fallback
144 144 value should always be possible.
145 -
146 - For instance, the 'mq' policy, which is currently the default policy,
147 - uses this facility to store the hit count of the cache blocks. If
148 - there's a crash this information will be lost, which means the cache
149 - may be less efficient until those hit counts are regenerated.
150 145
151 146 Policy hints affect performance, not correctness.
152 147
+4 -1
Documentation/device-mapper/dm-raid.txt
···
343 343 1.11.0 Fix table line argument order
344 344 (wrong raid10_copies/raid10_format sequence)
345 345 1.11.1 Add raid4/5/6 journal write-back support via journal_mode option
346 - 1.12.1 fix for MD deadlock between mddev_suspend() and md_write_start() available
346 + 1.12.1 Fix for MD deadlock between mddev_suspend() and md_write_start() available
347 347 1.13.0 Fix dev_health status at end of "recover" (was 'a', now 'A')
348 + 1.13.1 Fix deadlock caused by early md_stop_writes(). Also fix size an
349 + state races.
350 + 1.13.2 Fix raid redundancy validation and avoid keeping raid set frozen
+4
Documentation/device-mapper/snapshot.txt
···
49 49 snapshots less metadata must be saved on disk - they can be kept in
50 50 memory by the kernel.
51 51
52 + When loading or unloading the snapshot target, the corresponding
53 + snapshot-origin or snapshot-merge target must be suspended. A failure to
54 + suspend the origin target could result in data corruption.
55 +
52 56
53 57 * snapshot-merge <origin> <COW device> <persistent> <chunksize>
54 58
+10 -4
Documentation/device-mapper/thin-provisioning.txt
···
112 112 free space on the data device drops below this level then a dm event
113 113 will be triggered which a userspace daemon should catch allowing it to
114 114 extend the pool device. Only one such event will be sent.
115 - Resuming a device with a new table itself triggers an event so the
116 - userspace daemon can use this to detect a situation where a new table
117 - already exceeds the threshold.
115 +
116 + No special event is triggered if a just resumed device's free space is below
117 + the low water mark. However, resuming a device always triggers an
118 + event; a userspace daemon should verify that free space exceeds the low
119 + water mark when handling this event.
118 120
119 121 A low water mark for the metadata device is maintained in the kernel and
120 122 will trigger a dm event if free space on the metadata device drops below
···
276 274
277 275 <transaction id> <used metadata blocks>/<total metadata blocks>
278 276 <used data blocks>/<total data blocks> <held metadata root>
279 - [no_]discard_passdown ro|rw
277 + ro|rw|out_of_data_space [no_]discard_passdown [error|queue]_if_no_space
278 + needs_check|-
280 279
281 280 transaction id:
282 281 A 64-bit number used by userspace to help synchronise with metadata
···
397 394 If the pool has encountered device errors and failed, the status
398 395 will just contain the string 'Fail'. The userspace recovery
399 396 tools should then be used.
397 +
398 + In the case where <nr mapped sectors> is 0, there is no highest
399 + mapped sector and the value of <highest mapped sector> is unspecified.
+124
Documentation/device-mapper/unstriped.txt
···
1 + Introduction
2 + ============
3 +
4 + The device-mapper "unstriped" target provides a transparent mechanism to
5 + unstripe a device-mapper "striped" target to access the underlying disks
6 + without having to touch the true backing block-device. It can also be
7 + used to unstripe a hardware RAID-0 to access backing disks.
8 +
9 + Parameters:
10 + <number of stripes> <chunk size> <stripe #> <dev_path> <offset>
11 +
12 + <number of stripes>
13 + The number of stripes in the RAID 0.
14 +
15 + <chunk size>
16 + The amount of 512B sectors in the chunk striping.
17 +
18 + <dev_path>
19 + The block device you wish to unstripe.
20 +
21 + <stripe #>
22 + The stripe number within the device that corresponds to physical
23 + drive you wish to unstripe. This must be 0 indexed.
24 +
25 +
26 + Why use this module?
27 + ====================
28 +
29 + An example of undoing an existing dm-stripe
30 + -------------------------------------------
31 +
32 + This small bash script will setup 4 loop devices and use the existing
33 + striped target to combine the 4 devices into one. It then will use
34 + the unstriped target ontop of the striped device to access the
35 + individual backing loop devices. We write data to the newly exposed
36 + unstriped devices and verify the data written matches the correct
37 + underlying device on the striped array.
38 +
39 + #!/bin/bash
40 +
41 + MEMBER_SIZE=$((128 * 1024 * 1024))
42 + NUM=4
43 + SEQ_END=$((${NUM}-1))
44 + CHUNK=256
45 + BS=4096
46 +
47 + RAID_SIZE=$((${MEMBER_SIZE}*${NUM}/512))
48 + DM_PARMS="0 ${RAID_SIZE} striped ${NUM} ${CHUNK}"
49 + COUNT=$((${MEMBER_SIZE} / ${BS}))
50 +
51 + for i in $(seq 0 ${SEQ_END}); do
52 + dd if=/dev/zero of=member-${i} bs=${MEMBER_SIZE} count=1 oflag=direct
53 + losetup /dev/loop${i} member-${i}
54 + DM_PARMS+=" /dev/loop${i} 0"
55 + done
56 +
57 + echo $DM_PARMS | dmsetup create raid0
58 + for i in $(seq 0 ${SEQ_END}); do
59 + echo "0 1 unstriped ${NUM} ${CHUNK} ${i} /dev/mapper/raid0 0" | dmsetup create set-${i}
60 + done;
61 +
62 + for i in $(seq 0 ${SEQ_END}); do
63 + dd if=/dev/urandom of=/dev/mapper/set-${i} bs=${BS} count=${COUNT} oflag=direct
64 + diff /dev/mapper/set-${i} member-${i}
65 + done;
66 +
67 + for i in $(seq 0 ${SEQ_END}); do
68 + dmsetup remove set-${i}
69 + done
70 +
71 + dmsetup remove raid0
72 +
73 + for i in $(seq 0 ${SEQ_END}); do
74 + losetup -d /dev/loop${i}
75 + rm -f member-${i}
76 + done
77 +
78 + Another example
79 + ---------------
80 +
81 + Intel NVMe drives contain two cores on the physical device.
82 + Each core of the drive has segregated access to its LBA range.
83 + The current LBA model has a RAID 0 128k chunk on each core, resulting
84 + in a 256k stripe across the two cores:
85 +
86 + Core 0: Core 1:
87 + __________ __________
88 + | LBA 512| | LBA 768|
89 + | LBA 0 | | LBA 256|
90 + ---------- ----------
91 +
92 + The purpose of this unstriping is to provide better QoS in noisy
93 + neighbor environments. When two partitions are created on the
94 + aggregate drive without this unstriping, reads on one partition
95 + can affect writes on another partition. This is because the partitions
96 + are striped across the two cores. When we unstripe this hardware RAID 0
97 + and make partitions on each new exposed device the two partitions are now
98 + physically separated.
99 +
100 + With the dm-unstriped target we're able to segregate an fio script that
101 + has read and write jobs that are independent of each other. Compared to
102 + when we run the test on a combined drive with partitions, we were able
103 + to get a 92% reduction in read latency using this device mapper target.
104 +
105 +
106 + Example dmsetup usage
107 + =====================
108 +
109 + unstriped ontop of Intel NVMe device that has 2 cores
110 + -----------------------------------------------------
111 + dmsetup create nvmset0 --table '0 512 unstriped 2 256 0 /dev/nvme0n1 0'
112 + dmsetup create nvmset1 --table '0 512 unstriped 2 256 1 /dev/nvme0n1 0'
113 +
114 + There will now be two devices that expose Intel NVMe core 0 and 1
115 + respectively:
116 + /dev/mapper/nvmset0
117 + /dev/mapper/nvmset1
118 +
119 + unstriped ontop of striped with 4 drives using 128K chunk size
120 + --------------------------------------------------------------
121 + dmsetup create raid_disk0 --table '0 512 unstriped 4 256 0 /dev/mapper/striped 0'
122 + dmsetup create raid_disk1 --table '0 512 unstriped 4 256 1 /dev/mapper/striped 0'
123 + dmsetup create raid_disk2 --table '0 512 unstriped 4 256 2 /dev/mapper/striped 0'
124 + dmsetup create raid_disk3 --table '0 512 unstriped 4 256 3 /dev/mapper/striped 0'
+7
drivers/md/Kconfig
···
269 269
270 270 source "drivers/md/persistent-data/Kconfig"
271 271
272 + config DM_UNSTRIPED
273 + tristate "Unstriped target"
274 + depends on BLK_DEV_DM
275 + ---help---
276 + Unstripes I/O so it is issued solely on a single drive in a HW
277 + RAID0 or dm-striped target.
278 +
272 279 config DM_CRYPT
273 280 tristate "Crypt target support"
274 281 depends on BLK_DEV_DM
+1
drivers/md/Makefile
···
43 43 obj-$(CONFIG_BLK_DEV_MD) += md-mod.o
44 44 obj-$(CONFIG_BLK_DEV_DM) += dm-mod.o
45 45 obj-$(CONFIG_BLK_DEV_DM_BUILTIN) += dm-builtin.o
46 + obj-$(CONFIG_DM_UNSTRIPED) += dm-unstripe.o
46 47 obj-$(CONFIG_DM_BUFIO) += dm-bufio.o
47 48 obj-$(CONFIG_DM_BIO_PRISON) += dm-bio-prison.o
48 49 obj-$(CONFIG_DM_CRYPT) += dm-crypt.o
+20 -17
drivers/md/dm-bufio.c
···
662 662
663 663
664 664 sector = (b->block << b->c->sectors_per_block_bits) + b->c->start;
665 - if (rw != WRITE) {
665 + if (rw != REQ_OP_WRITE) {
666 666 n_sectors = 1 << b->c->sectors_per_block_bits;
667 667 offset = 0;
668 668 } else {
···
740 740 b->write_end = b->dirty_end;
741 741
742 742 if (!write_list)
743 - submit_io(b, WRITE, write_endio);
743 + submit_io(b, REQ_OP_WRITE, write_endio);
744 744 else
745 745 list_add_tail(&b->write_list, write_list);
746 746 }
···
753 753 struct dm_buffer *b =
754 754 list_entry(write_list->next, struct dm_buffer, write_list);
755 755 list_del(&b->write_list);
756 - submit_io(b, WRITE, write_endio);
756 + submit_io(b, REQ_OP_WRITE, write_endio);
757 757 cond_resched();
758 758 }
759 759 blk_finish_plug(&plug);
···
1123 1123 return NULL;
1124 1124
1125 1125 if (need_submit)
1126 - submit_io(b, READ, read_endio);
1126 + submit_io(b, REQ_OP_READ, read_endio);
1127 1127
1128 1128 wait_on_bit_io(&b->state, B_READING, TASK_UNINTERRUPTIBLE);
1129 1129
···
1193 1193 dm_bufio_unlock(c);
1194 1194
1195 1195 if (need_submit)
1196 - submit_io(b, READ, read_endio);
1196 + submit_io(b, REQ_OP_READ, read_endio);
1197 1197 dm_bufio_release(b);
1198 1198
1199 1199 cond_resched();
···
1454 1454 old_block = b->block;
1455 1455 __unlink_buffer(b);
1456 1456 __link_buffer(b, new_block, b->list_mode);
1457 - submit_io(b, WRITE, write_endio);
1457 + submit_io(b, REQ_OP_WRITE, write_endio);
1458 1458 wait_on_bit_io(&b->state, B_WRITING,
1459 1459 TASK_UNINTERRUPTIBLE);
1460 1460 __unlink_buffer(b);
···
1716 1716 if (!DM_BUFIO_CACHE_NAME(c)) {
1717 1717 r = -ENOMEM;
1718 1718 mutex_unlock(&dm_bufio_clients_lock);
1719 - goto bad_cache;
1719 + goto bad;
1720 1720 }
1721 1721 }
1722 1722
···
1727 1727 if (!DM_BUFIO_CACHE(c)) {
1728 1728 r = -ENOMEM;
1729 1729 mutex_unlock(&dm_bufio_clients_lock);
1730 - goto bad_cache;
1730 + goto bad;
1731 1731 }
1732 1732 }
1733 1733 }
···
1738 1738
1739 1739 if (!b) {
1740 1740 r = -ENOMEM;
1741 - goto bad_buffer;
1741 + goto bad;
1742 1742 }
1743 1743 __free_buffer_wake(b);
1744 1744 }
1745 +
1746 + c->shrinker.count_objects = dm_bufio_shrink_count;
1747 + c->shrinker.scan_objects = dm_bufio_shrink_scan;
1748 + c->shrinker.seeks = 1;
1749 + c->shrinker.batch = 0;
1750 + r = register_shrinker(&c->shrinker);
1751 + if (r)
1752 + goto bad;
1745 1753
1746 1754 mutex_lock(&dm_bufio_clients_lock);
1747 1755 dm_bufio_client_count++;
···
1757 1749 __cache_size_refresh();
1758 1750 mutex_unlock(&dm_bufio_clients_lock);
1759 1751
1760 - c->shrinker.count_objects = dm_bufio_shrink_count;
1761 - c->shrinker.scan_objects = dm_bufio_shrink_scan;
1762 - c->shrinker.seeks = 1;
1763 - c->shrinker.batch = 0;
1764 - register_shrinker(&c->shrinker);
1765 -
1766 1752 return c;
1767 1753
1768 - bad_buffer:
1769 - bad_cache:
1754 + bad:
1770 1755 while (!list_empty(&c->reserved_buffers)) {
1771 1756 struct dm_buffer *b = list_entry(c->reserved_buffers.next,
1772 1757 struct dm_buffer, lru_list);
···
1768 1767 }
1769 1768 dm_io_client_destroy(c->dm_io);
1770 1769 bad_dm_io:
1770 + mutex_destroy(&c->lock);
1771 1771 kfree(c);
1772 1772 bad_client:
1773 1773 return ERR_PTR(r);
···
1813 1811 BUG_ON(c->n_buffers[i]);
1814 1812
1815 1813 dm_io_client_destroy(c->dm_io);
1814 + mutex_destroy(&c->lock);
1816 1815 kfree(c);
1817 1816 }
1818 1817 EXPORT_SYMBOL_GPL(dm_bufio_client_destroy);
+1 -4
drivers/md/dm-core.h
···
91 91 /*
92 92 * io objects are allocated from here.
93 93 */
94 - mempool_t *io_pool;
95 -
94 + struct bio_set *io_bs;
96 95 struct bio_set *bs;
97 96
98 97 /*
···
129 130 struct srcu_struct io_barrier;
130 131 };
131 132
132 - void dm_init_md_queue(struct mapped_device *md);
133 - void dm_init_normal_md_queue(struct mapped_device *md);
134 133 int md_in_flight(struct mapped_device *md);
135 134 void disable_write_same(struct mapped_device *md);
136 135 void disable_write_zeroes(struct mapped_device *md);
+3 -2
drivers/md/dm-crypt.c
···
2193 2193 kzfree(cc->cipher_auth);
2194 2194 kzfree(cc->authenc_key);
2195 2195
2196 + mutex_destroy(&cc->bio_alloc_lock);
2197 +
2196 2198 /* Must zero key material before freeing */
2197 2199 kzfree(cc);
2198 2200 }
···
2704 2702 goto bad;
2705 2703 }
2706 2704
2707 - cc->bs = bioset_create(MIN_IOS, 0, (BIOSET_NEED_BVECS |
2708 - BIOSET_NEED_RESCUER));
2705 + cc->bs = bioset_create(MIN_IOS, 0, BIOSET_NEED_BVECS);
2709 2706 if (!cc->bs) {
2710 2707 ti->error = "Cannot allocate crypt bioset";
2711 2708 goto bad;
+2
drivers/md/dm-delay.c
···
229 229 if (dc->dev_write)
230 230 dm_put_device(ti, dc->dev_write);
231 231
232 + mutex_destroy(&dc->timer_lock);
233 +
232 234 kfree(dc);
233 235 }
234 236
+5
drivers/md/dm-flakey.c
···
70 70 arg_name = dm_shift_arg(as);
71 71 argc--;
72 72
73 + if (!arg_name) {
74 + ti->error = "Insufficient feature arguments";
75 + return -EINVAL;
76 + }
77 +
73 78 /*
74 79 * drop_writes
75 80 */
+1 -2
drivers/md/dm-io.c
···
58 58 if (!client->pool)
59 59 goto bad;
60 60
61 - client->bios = bioset_create(min_ios, 0, (BIOSET_NEED_BVECS |
62 - BIOSET_NEED_RESCUER));
61 + client->bios = bioset_create(min_ios, 0, BIOSET_NEED_BVECS);
63 62 if (!client->bios)
64 63 goto bad;
65 64
+4 -2
drivers/md/dm-kcopyd.c
···
477 477 * If this is the master job, the sub jobs have already
478 478 * completed so we can free everything.
479 479 */
480 - if (job->master_job == job)
480 + if (job->master_job == job) {
481 + mutex_destroy(&job->lock);
481 482 mempool_free(job, kc->job_pool);
483 + }
482 484 fn(read_err, write_err, context);
483 485
484 486 if (atomic_dec_and_test(&kc->nr_jobs))
···
752 750 * followed by SPLIT_COUNT sub jobs.
753 751 */
754 752 job = mempool_alloc(kc->job_pool, GFP_NOIO);
753 + mutex_init(&job->lock);
755 754
756 755 /*
757 756 * set up for the read.
···
814 811 if (job->source.count <= SUB_JOB_SIZE)
815 812 dispatch_job(job);
816 813 else {
817 - mutex_init(&job->lock);
818 814 job->progress = 0;
819 815 split_job(job);
820 816 }
+1 -1
drivers/md/dm-log-writes.c
···
594 594 return -ENOMEM;
595 595 }
596 596
597 - block->data = kstrndup(data, maxsize, GFP_KERNEL);
597 + block->data = kstrndup(data, maxsize - 1, GFP_KERNEL);
598 598 if (!block->data) {
599 599 DMERR("Error copying mark data");
600 600 kfree(block);
+187 -110
drivers/md/dm-mpath.c
···
64 64
65 65 /* Multipath context */
66 66 struct multipath {
67 - struct list_head list;
68 - struct dm_target *ti;
69 -
70 - const char *hw_handler_name;
71 - char *hw_handler_params;
67 + unsigned long flags; /* Multipath state flags */
72 68
73 69 spinlock_t lock;
74 -
75 - unsigned nr_priority_groups;
76 - struct list_head priority_groups;
77 -
78 - wait_queue_head_t pg_init_wait; /* Wait for pg_init completion */
70 + enum dm_queue_mode queue_mode;
79 71
80 72 struct pgpath *current_pgpath;
81 73 struct priority_group *current_pg;
82 74 struct priority_group *next_pg; /* Switch to this PG if set */
83 75
84 - unsigned long flags; /* Multipath state flags */
76 + atomic_t nr_valid_paths; /* Total number of usable paths */
77 + unsigned nr_priority_groups;
78 + struct list_head priority_groups;
85 79
80 + const char *hw_handler_name;
81 + char *hw_handler_params;
82 + wait_queue_head_t pg_init_wait; /* Wait for pg_init completion */
86 83 unsigned pg_init_retries; /* Number of times to retry pg_init */
87 84 unsigned pg_init_delay_msecs; /* Number of msecs before pg_init retry */
88 -
89 - atomic_t nr_valid_paths; /* Total number of usable paths */
90 85 atomic_t pg_init_in_progress; /* Only one pg_init allowed at once */
91 86 atomic_t pg_init_count; /* Number of times pg_init called */
92 87
93 - enum dm_queue_mode queue_mode;
94 -
95 88 struct mutex work_mutex;
96 89 struct work_struct trigger_event;
90 + struct dm_target *ti;
97 91
98 92 struct work_struct process_queued_bios;
99 93 struct bio_list queued_bios;
···
129 135 {
130 136 struct pgpath *pgpath = kzalloc(sizeof(*pgpath), GFP_KERNEL);
131 137
132 - if (pgpath) {
133 - pgpath->is_active = true;
134 - INIT_DELAYED_WORK(&pgpath->activate_path, activate_path_work);
135 - }
138 + if (!pgpath)
139 + return NULL;
140 +
141 + pgpath->is_active = true;
136 142
137 143 return pgpath;
138 144 }
···
187 193 if (m) {
188 194 INIT_LIST_HEAD(&m->priority_groups);
189 195 spin_lock_init(&m->lock);
190 - set_bit(MPATHF_QUEUE_IO, &m->flags);
191 196 atomic_set(&m->nr_valid_paths, 0);
192 - atomic_set(&m->pg_init_in_progress, 0);
193 - atomic_set(&m->pg_init_count, 0);
194 - m->pg_init_delay_msecs = DM_PG_INIT_DELAY_DEFAULT;
195 197 INIT_WORK(&m->trigger_event, trigger_event);
196 - init_waitqueue_head(&m->pg_init_wait);
197 198 mutex_init(&m->work_mutex);
198 199
199 200 m->queue_mode = DM_TYPE_NONE;
···
210 221 m->queue_mode = DM_TYPE_MQ_REQUEST_BASED;
211 222 else
212 223 m->queue_mode = DM_TYPE_REQUEST_BASED;
213 - } else if (m->queue_mode == DM_TYPE_BIO_BASED) {
224 +
225 + } else if (m->queue_mode == DM_TYPE_BIO_BASED ||
226 + m->queue_mode == DM_TYPE_NVME_BIO_BASED) {
214 227 INIT_WORK(&m->process_queued_bios, process_queued_bios);
215 - /*
216 - * bio-based doesn't support any direct scsi_dh management;
217 - * it just discovers if a scsi_dh is attached.
218 - */
219 - set_bit(MPATHF_RETAIN_ATTACHED_HW_HANDLER, &m->flags);
228 +
229 + if (m->queue_mode == DM_TYPE_BIO_BASED) {
230 + /*
231 + * bio-based doesn't support any direct scsi_dh management;
232 + * it just discovers if a scsi_dh is attached.
233 + */
234 + set_bit(MPATHF_RETAIN_ATTACHED_HW_HANDLER, &m->flags);
235 + }
236 + }
237 +
238 + if (m->queue_mode != DM_TYPE_NVME_BIO_BASED) {
239 + set_bit(MPATHF_QUEUE_IO, &m->flags);
240 + atomic_set(&m->pg_init_in_progress, 0);
241 + atomic_set(&m->pg_init_count, 0);
242 + m->pg_init_delay_msecs = DM_PG_INIT_DELAY_DEFAULT;
243 + init_waitqueue_head(&m->pg_init_wait);
220 244 }
221 245
222 246 dm_table_set_type(ti->table, m->queue_mode);
···
248 246
249 247 kfree(m->hw_handler_name);
250 248 kfree(m->hw_handler_params);
249 + mutex_destroy(&m->work_mutex);
251 250 kfree(m);
252 251 }
253 252
···
267 264 return dm_per_bio_data(bio, multipath_per_bio_data_size());
268 265 }
269 266
270 - static struct dm_bio_details *get_bio_details_from_bio(struct bio *bio)
267 + static struct dm_bio_details *get_bio_details_from_mpio(struct dm_mpath_io *mpio)
271 268 {
272 269 /* dm_bio_details is immediately after the dm_mpath_io in bio's per-bio-data */
273 - struct dm_mpath_io *mpio = get_mpio_from_bio(bio);
274 270 void *bio_details = mpio + 1;
275 -
276 271 return bio_details;
277 272 }
278 273
279 - static void multipath_init_per_bio_data(struct bio *bio, struct dm_mpath_io **mpio_p,
280 - struct dm_bio_details **bio_details_p)
274 + static void multipath_init_per_bio_data(struct bio *bio, struct dm_mpath_io **mpio_p)
281 275 {
282 276 struct dm_mpath_io *mpio = get_mpio_from_bio(bio);
283 277 struct dm_bio_details *bio_details = get_bio_details_from_mpio(mpio);
284 278
285 - memset(mpio, 0, sizeof(*mpio));
286 - memset(bio_details, 0, sizeof(*bio_details));
279 + mpio->nr_bytes = bio->bi_iter.bi_size;
280 + mpio->pgpath = NULL;
281 + *mpio_p = mpio;
282 +
287 283 dm_bio_record(bio_details, bio);
288 -
289 - if (mpio_p)
290 - *mpio_p = mpio;
291 - if (bio_details_p)
292 - *bio_details_p = bio_details;
293 284 }
294 285
295 286 /*-----------------------------------------------
···
337 340 {
338 341 m->current_pg = pg;
339 342
343 + if (m->queue_mode == DM_TYPE_NVME_BIO_BASED)
344 + return;
345 +
340 346 /* Must we initialise the PG first, and queue I/O till it's ready? */
341 347 if (m->hw_handler_name) {
342 348 set_bit(MPATHF_PG_INIT_REQUIRED, &m->flags);
···
385 385 unsigned bypassed = 1;
386 386
387 387 if (!atomic_read(&m->nr_valid_paths)) {
388 - clear_bit(MPATHF_QUEUE_IO, &m->flags);
388 + if (m->queue_mode != DM_TYPE_NVME_BIO_BASED)
389 + clear_bit(MPATHF_QUEUE_IO, &m->flags);
389 390 goto failed;
390 391 }
391 392
···
517 516 return DM_MAPIO_KILL;
518 517 } else if (test_bit(MPATHF_QUEUE_IO, &m->flags) ||
519 518 test_bit(MPATHF_PG_INIT_REQUIRED, &m->flags)) {
520 - if (pg_init_all_paths(m))
521 - return DM_MAPIO_DELAY_REQUEUE;
522 - return DM_MAPIO_REQUEUE;
519 + pg_init_all_paths(m);
520 + return DM_MAPIO_DELAY_REQUEUE;
523 521 }
524 522
525 - memset(mpio, 0, sizeof(*mpio));
526 523 mpio->pgpath = pgpath;
527 524 mpio->nr_bytes = nr_bytes;
528 525
···
529 530 clone = blk_get_request(q, rq->cmd_flags | REQ_NOMERGE, GFP_ATOMIC);
530 531 if (IS_ERR(clone)) {
531 532 /* EBUSY, ENODEV or EWOULDBLOCK: requeue */
532 - bool queue_dying = blk_queue_dying(q);
533 - if (queue_dying) {
533 + if (blk_queue_dying(q)) {
534 534 atomic_inc(&m->pg_init_in_progress);
535 535 activate_or_offline_path(pgpath);
536 + return DM_MAPIO_DELAY_REQUEUE;
536 537 }
537 - return DM_MAPIO_DELAY_REQUEUE;
538 +
539 + /*
540 + * blk-mq's SCHED_RESTART can cover this requeue, so we
541 + * needn't deal with it by DELAY_REQUEUE. More importantly,
542 + * we have to return DM_MAPIO_REQUEUE so that blk-mq can
543 + * get the queue busy feedback (via BLK_STS_RESOURCE),
544 + * otherwise I/O merging can suffer.
545 + */
546 + if (q->mq_ops)
547 + return DM_MAPIO_REQUEUE;
548 + else
549 + return DM_MAPIO_DELAY_REQUEUE;
538 550 }
539 551 clone->bio = clone->biotail = NULL;
540 552 clone->rq_disk = bdev->bd_disk;
···
567 557 /*
568 558 * Map cloned bios (bio-based multipath)
569 559 */
570 - static int __multipath_map_bio(struct multipath *m, struct bio *bio, struct dm_mpath_io *mpio)
560 +
561 + static struct pgpath *__map_bio(struct multipath *m, struct bio *bio)
571 562 {
572 - size_t nr_bytes = bio->bi_iter.bi_size;
573 563 struct pgpath *pgpath;
574 564 unsigned long flags;
575 565 bool queue_io;
···
578 568 pgpath = READ_ONCE(m->current_pgpath);
579 569 queue_io = test_bit(MPATHF_QUEUE_IO, &m->flags);
580 570 if (!pgpath || !queue_io)
581 - pgpath = choose_pgpath(m, nr_bytes);
571 + pgpath = choose_pgpath(m, bio->bi_iter.bi_size);
582 572
583 573 if ((pgpath && queue_io) ||
584 574 (!pgpath && test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags))) {
···
586 576 spin_lock_irqsave(&m->lock, flags);
587 577 bio_list_add(&m->queued_bios, bio);
588 578 spin_unlock_irqrestore(&m->lock, flags);
579 +
589 580 /* PG_INIT_REQUIRED cannot be set without QUEUE_IO */
590 581 if (queue_io || test_bit(MPATHF_PG_INIT_REQUIRED, &m->flags))
591 582 pg_init_all_paths(m);
592 583 else if (!queue_io)
593 584 queue_work(kmultipathd, &m->process_queued_bios);
594 - return DM_MAPIO_SUBMITTED;
585 +
586 + return ERR_PTR(-EAGAIN);
595 587 }
588 +
589 + return pgpath;
590 + }
591 +
592 + static struct pgpath *__map_bio_nvme(struct multipath *m, struct bio *bio)
593 + {
594 + struct pgpath *pgpath;
595 + unsigned long flags;
596 +
597 + /* Do we need to select a new pgpath? */
598 + /*
599 + * FIXME: currently only switching path if no path (due to failure, etc)
600 + * - which negates the point of using a path selector
601 + */
602 + pgpath = READ_ONCE(m->current_pgpath);
603 + if (!pgpath)
604 + pgpath = choose_pgpath(m, bio->bi_iter.bi_size);
605 +
606 + if (!pgpath) {
607 + if (test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags)) {
608 + /* Queue for the daemon to resubmit */
609 + spin_lock_irqsave(&m->lock, flags);
610 + bio_list_add(&m->queued_bios, bio);
611 + spin_unlock_irqrestore(&m->lock, flags);
612 + queue_work(kmultipathd, &m->process_queued_bios);
613 +
614 + return ERR_PTR(-EAGAIN);
615 + }
616 + return NULL;
617 + }
618 +
619 + return pgpath;
620 + }
621 +
622 + static int __multipath_map_bio(struct multipath *m, struct bio *bio,
623 + struct dm_mpath_io *mpio)
624 + {
625 + struct pgpath *pgpath;
626 +
627 + if (m->queue_mode == DM_TYPE_NVME_BIO_BASED)
628 + pgpath = __map_bio_nvme(m, bio);
629 + else
630 + pgpath = __map_bio(m, bio);
631 +
632 + if (IS_ERR(pgpath))
633 + return DM_MAPIO_SUBMITTED;
596 634
597 635 if (!pgpath) {
598 636 if (must_push_back_bio(m))
···
650 592 }
651 593
652 594 mpio->pgpath = pgpath;
653 - mpio->nr_bytes = nr_bytes;
654 595
655 596 bio->bi_status = 0;
656 597 bio_set_dev(bio, pgpath->path.dev->bdev);
···
658 601 if (pgpath->pg->ps.type->start_io)
659 602 pgpath->pg->ps.type->start_io(&pgpath->pg->ps,
660 603 &pgpath->path,
661 - nr_bytes);
604 + mpio->nr_bytes);
662 605 return DM_MAPIO_REMAPPED;
663 606 }
664 607
···
667 610 struct multipath *m = ti->private;
668 611 struct dm_mpath_io *mpio = NULL;
669 612
670 - multipath_init_per_bio_data(bio, &mpio, NULL);
671 -
613 + multipath_init_per_bio_data(bio, &mpio);
672 614 return __multipath_map_bio(m, bio, mpio);
673 615 }
674 616
···
675 619 {
676 620 if (m->queue_mode == DM_TYPE_MQ_REQUEST_BASED)
677 621 dm_mq_kick_requeue_list(dm_table_get_md(m->ti->table));
678 - else if (m->queue_mode == DM_TYPE_BIO_BASED)
622 + else if (m->queue_mode == DM_TYPE_BIO_BASED ||
623 + m->queue_mode == DM_TYPE_NVME_BIO_BASED)
679 624 queue_work(kmultipathd, &m->process_queued_bios);
680 625 }
···
706 649
707 650 blk_start_plug(&plug);
708 651 while ((bio = bio_list_pop(&bios))) {
709 - r = __multipath_map_bio(m, bio, get_mpio_from_bio(bio));
652 + struct dm_mpath_io *mpio = get_mpio_from_bio(bio);
653 + dm_bio_restore(get_bio_details_from_mpio(mpio), bio);
654 + r = __multipath_map_bio(m, bio, mpio);
710 655 switch (r) {
711 656 case DM_MAPIO_KILL:
712 657 bio->bi_status = BLK_STS_IOERR;
···
811 752 return 0;
812 753 }
813 754
814 - static struct pgpath *parse_path(struct dm_arg_set *as, struct path_selector *ps,
815 - struct dm_target *ti)
755 + static int setup_scsi_dh(struct block_device *bdev, struct multipath *m, char **error)
816 756 {
817 - int r;
818 - struct pgpath *p;
819 - struct multipath *m = ti->private;
820 - struct request_queue *q = NULL;
757 + struct request_queue *q = bdev_get_queue(bdev);
821 758 const char *attached_handler_name;
822 -
823 - /* we need at least a path arg */
824 - if (as->argc < 1) {
825 - ti->error = "no device given";
826 - return ERR_PTR(-EINVAL);
827 - }
828 -
829 - p = alloc_pgpath();
830 - if (!p)
831 - return ERR_PTR(-ENOMEM);
832 -
833 - r = dm_get_device(ti, dm_shift_arg(as), dm_table_get_mode(ti->table),
834 - &p->path.dev);
835 - if (r) {
836 - ti->error = "error getting device";
837 - goto bad;
838 - }
839 -
840 - if (test_bit(MPATHF_RETAIN_ATTACHED_HW_HANDLER, &m->flags) || m->hw_handler_name)
841 - q = bdev_get_queue(p->path.dev->bdev);
759 + int r;
842 760
843 761 if (test_bit(MPATHF_RETAIN_ATTACHED_HW_HANDLER, &m->flags)) {
844 762 retain:
···
847 811 char b[BDEVNAME_SIZE];
848 812
849 813 printk(KERN_INFO "dm-mpath: retaining handler on device %s\n",
850 - bdevname(p->path.dev->bdev, b));
814 + bdevname(bdev, b));
851 815 goto retain;
852 816 }
853 817 if (r < 0) {
854 - ti->error = "error attaching hardware handler";
855 - dm_put_device(ti, p->path.dev);
856 - goto bad;
818 + *error = "error attaching hardware handler";
819 + return r;
857 820 }
858 821
859 822 if (m->hw_handler_params) {
860 823 r = scsi_dh_set_params(q, m->hw_handler_params);
861 824 if (r < 0) {
862 - ti->error = "unable to set hardware "
863 - "handler parameters";
864 - dm_put_device(ti, p->path.dev);
865 - goto bad;
825 + *error = "unable to set hardware handler parameters";
826 + return r;
866 827 }
828 + }
829 + }
830 +
831 + return 0;
832 + }
833 +
834 + static struct pgpath *parse_path(struct dm_arg_set *as, struct path_selector *ps,
835 + struct dm_target *ti)
836 + {
837 + int r;
838 + struct pgpath *p;
839 + struct multipath *m = ti->private;
840 +
841 + /* we need at least a path arg */
842 + if (as->argc < 1) {
843 + ti->error = "no device given";
844 + return ERR_PTR(-EINVAL);
845 + }
846 +
847 + p = alloc_pgpath();
848 + if (!p)
849 + return ERR_PTR(-ENOMEM);
850 +
851 + r = dm_get_device(ti, dm_shift_arg(as), dm_table_get_mode(ti->table),
852 + &p->path.dev);
853 + if (r) {
854 + ti->error = "error getting device";
855 + goto bad;
856 + }
857 +
858 + if (m->queue_mode != DM_TYPE_NVME_BIO_BASED) {
859 + INIT_DELAYED_WORK(&p->activate_path, activate_path_work);
860 + r = setup_scsi_dh(p->path.dev->bdev, m, &ti->error);
861 + if (r) {
862 + dm_put_device(ti, p->path.dev);
863 + goto bad;
867 864 }
868 865 }
···
907 838 }
908 839
909 840 return p;
910 -
911 841 bad:
912 842 free_pgpath(p);
913 843 return ERR_PTR(r);
···
1001 933 if (!hw_argc)
1002 934 return 0;
1003 935
1004 - if (m->queue_mode == DM_TYPE_BIO_BASED) {
936 + if (m->queue_mode == DM_TYPE_BIO_BASED ||
937 + m->queue_mode == DM_TYPE_NVME_BIO_BASED) {
1005 938 dm_consume_args(as, hw_argc);
1006 939 DMERR("bio-based multipath doesn't allow hardware handler args");
1007 940 return 0;
···
1091 1022
1092 1023 if (!strcasecmp(queue_mode_name, "bio"))
1093 1024 m->queue_mode = DM_TYPE_BIO_BASED;
1025 + else if (!strcasecmp(queue_mode_name, "nvme"))
1026 + m->queue_mode = 
DM_TYPE_NVME_BIO_BASED; 1094 1027 else if (!strcasecmp(queue_mode_name, "rq")) 1095 1028 m->queue_mode = DM_TYPE_REQUEST_BASED; 1096 1029 else if (!strcasecmp(queue_mode_name, "mq")) ··· 1193 1122 ti->num_discard_bios = 1; 1194 1123 ti->num_write_same_bios = 1; 1195 1124 ti->num_write_zeroes_bios = 1; 1196 - if (m->queue_mode == DM_TYPE_BIO_BASED) 1125 + if (m->queue_mode == DM_TYPE_BIO_BASED || m->queue_mode == DM_TYPE_NVME_BIO_BASED) 1197 1126 ti->per_io_data_size = multipath_per_bio_data_size(); 1198 1127 else 1199 1128 ti->per_io_data_size = sizeof(struct dm_mpath_io); ··· 1222 1151 1223 1152 static void flush_multipath_work(struct multipath *m) 1224 1153 { 1225 - set_bit(MPATHF_PG_INIT_DISABLED, &m->flags); 1226 - smp_mb__after_atomic(); 1154 + if (m->hw_handler_name) { 1155 + set_bit(MPATHF_PG_INIT_DISABLED, &m->flags); 1156 + smp_mb__after_atomic(); 1227 1157 1228 - flush_workqueue(kmpath_handlerd); 1229 - multipath_wait_for_pg_init_completion(m); 1158 + flush_workqueue(kmpath_handlerd); 1159 + multipath_wait_for_pg_init_completion(m); 1160 + 1161 + clear_bit(MPATHF_PG_INIT_DISABLED, &m->flags); 1162 + smp_mb__after_atomic(); 1163 + } 1164 + 1230 1165 flush_workqueue(kmultipathd); 1231 1166 flush_work(&m->trigger_event); 1232 - 1233 - clear_bit(MPATHF_PG_INIT_DISABLED, &m->flags); 1234 - smp_mb__after_atomic(); 1235 1167 } 1236 1168 1237 1169 static void multipath_dtr(struct dm_target *ti) ··· 1570 1496 if (error && blk_path_error(error)) { 1571 1497 struct multipath *m = ti->private; 1572 1498 1573 - r = DM_ENDIO_REQUEUE; 1499 + if (error == BLK_STS_RESOURCE) 1500 + r = DM_ENDIO_DELAY_REQUEUE; 1501 + else 1502 + r = DM_ENDIO_REQUEUE; 1574 1503 1575 1504 if (pgpath) 1576 1505 fail_path(pgpath); ··· 1598 1521 } 1599 1522 1600 1523 static int multipath_end_io_bio(struct dm_target *ti, struct bio *clone, 1601 - blk_status_t *error) 1524 + blk_status_t *error) 1602 1525 { 1603 1526 struct multipath *m = ti->private; 1604 1527 struct dm_mpath_io *mpio = 
get_mpio_from_bio(clone); ··· 1622 1545 } 1623 1546 goto done; 1624 1547 } 1625 - 1626 - /* Queue for the daemon to resubmit */ 1627 - dm_bio_restore(get_bio_details_from_bio(clone), clone); 1628 1548 1629 1549 spin_lock_irqsave(&m->lock, flags); 1630 1550 bio_list_add(&m->queued_bios, clone); ··· 1729 1655 switch(m->queue_mode) { 1730 1656 case DM_TYPE_BIO_BASED: 1731 1657 DMEMIT("queue_mode bio "); 1658 + break; 1659 + case DM_TYPE_NVME_BIO_BASED: 1660 + DMEMIT("queue_mode nvme "); 1732 1661 break; 1733 1662 case DM_TYPE_MQ_REQUEST_BASED: 1734 1663 DMEMIT("queue_mode mq ");
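The refactored bio mapping flow in the dm-mpath hunks above follows a simple pattern: try the cached current path, fall back to the path selector, and if no path exists either park the I/O for the daemon to resubmit (queue_if_no_path) or fail it. A minimal userspace sketch of that control flow; the types, selector callbacks, and names here are illustrative stand-ins, not the kernel's `struct multipath` API:

```c
#include <stdbool.h>
#include <stddef.h>

enum map_result { MAP_REMAPPED, MAP_QUEUED, MAP_FAILED };

struct mpath {
	const char *current_path;         /* cached last-good path */
	const char *(*choose_path)(void); /* path selector callback */
	bool queue_if_no_path;
	int queued;                       /* I/Os parked for the daemon */
};

static enum map_result map_io(struct mpath *m, const char **path_out)
{
	const char *path = m->current_path;

	if (!path)
		path = m->choose_path();  /* run the selector */

	if (!path) {
		if (m->queue_if_no_path) {
			m->queued++;      /* daemon resubmits later */
			return MAP_QUEUED;
		}
		return MAP_FAILED;        /* no path, no queueing */
	}

	m->current_path = path;
	*path_out = path;
	return MAP_REMAPPED;
}

/* toy selectors for demonstration */
static const char *pick_sda(void)  { return "sda"; }
static const char *pick_none(void) { return NULL; }
```

The kernel version additionally distinguishes the SCSI and NVMe bio-based variants before this common tail, which is what `__multipath_map_bio()` dispatches on.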
+3 -3
drivers/md/dm-queue-length.c
··· 195 195 if (list_empty(&s->valid_paths)) 196 196 goto out; 197 197 198 - /* Change preferred (first in list) path to evenly balance. */ 199 - list_move_tail(s->valid_paths.next, &s->valid_paths); 200 - 201 198 list_for_each_entry(pi, &s->valid_paths, list) { 202 199 if (!best || 203 200 (atomic_read(&pi->qlen) < atomic_read(&best->qlen))) ··· 206 209 207 210 if (!best) 208 211 goto out; 212 + 213 + /* Move most recently used to least preferred to evenly balance. */ 214 + list_move_tail(&best->list, &s->valid_paths); 209 215 210 216 ret = best->path; 211 217 out:
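The selector fix above ("more evenly distribute ties", applied to both the queue-length and service-time selectors) is easiest to see in isolation: pick the path with the lowest load, then move the *chosen* entry to the tail, rather than unconditionally rotating the list head, so equally loaded paths are used round-robin. A minimal sketch with a hand-rolled circular list standing in for the kernel's `list_head` (names are illustrative):

```c
#include <stddef.h>

struct path_info {
	const char *name;
	int qlen;                       /* in-flight I/O count */
	struct path_info *prev, *next;
};

/* circular doubly linked list with a sentinel head */
static void list_init(struct path_info *head)
{
	head->prev = head->next = head;
}

static void list_del_entry(struct path_info *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
}

static void list_add_tail(struct path_info *n, struct path_info *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static struct path_info *select_path(struct path_info *head)
{
	struct path_info *pi, *best = NULL;

	for (pi = head->next; pi != head; pi = pi->next)
		if (!best || pi->qlen < best->qlen)
			best = pi;

	if (best) {
		/* most recently used becomes least preferred */
		list_del_entry(best);
		list_add_tail(best, head);
	}
	return best;
}
```

With all queue lengths tied, repeated calls rotate through the paths in order instead of always returning the head entry, which is the behavioral change the hunk makes.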
+254 -126
drivers/md/dm-raid.c
··· 29 29 */ 30 30 #define MIN_RAID456_JOURNAL_SPACE (4*2048) 31 31 32 + /* Global list of all raid sets */ 33 + static LIST_HEAD(raid_sets); 34 + 32 35 static bool devices_handle_discard_safely = false; 33 36 34 37 /* ··· 107 104 #define CTR_FLAG_RAID10_USE_NEAR_SETS (1 << __CTR_FLAG_RAID10_USE_NEAR_SETS) 108 105 #define CTR_FLAG_JOURNAL_DEV (1 << __CTR_FLAG_JOURNAL_DEV) 109 106 #define CTR_FLAG_JOURNAL_MODE (1 << __CTR_FLAG_JOURNAL_MODE) 110 - 111 - #define RESUME_STAY_FROZEN_FLAGS (CTR_FLAG_DELTA_DISKS | CTR_FLAG_DATA_OFFSET) 112 107 113 108 /* 114 109 * Definitions of various constructor flags to ··· 210 209 #define RT_FLAG_UPDATE_SBS 3 211 210 #define RT_FLAG_RESHAPE_RS 4 212 211 #define RT_FLAG_RS_SUSPENDED 5 212 + #define RT_FLAG_RS_IN_SYNC 6 213 + #define RT_FLAG_RS_RESYNCING 7 213 214 214 215 /* Array elements of 64 bit needed for rebuild/failed disk bits */ 215 216 #define DISKS_ARRAY_ELEMS ((MAX_RAID_DEVICES + (sizeof(uint64_t) * 8 - 1)) / sizeof(uint64_t) / 8) ··· 227 224 228 225 struct raid_set { 229 226 struct dm_target *ti; 227 + struct list_head list; 230 228 231 - uint32_t bitmap_loaded; 232 229 uint32_t stripe_cache_entries; 233 230 unsigned long ctr_flags; 234 231 unsigned long runtime_flags; ··· 271 268 mddev->new_level = l->new_level; 272 269 mddev->new_layout = l->new_layout; 273 270 mddev->new_chunk_sectors = l->new_chunk_sectors; 271 + } 272 + 273 + /* Find any raid_set in active slot for @rs on global list */ 274 + static struct raid_set *rs_find_active(struct raid_set *rs) 275 + { 276 + struct raid_set *r; 277 + struct mapped_device *md = dm_table_get_md(rs->ti->table); 278 + 279 + list_for_each_entry(r, &raid_sets, list) 280 + if (r != rs && dm_table_get_md(r->ti->table) == md) 281 + return r; 282 + 283 + return NULL; 274 284 } 275 285 276 286 /* raid10 algorithms (i.e. 
formats) */ ··· 588 572 } 589 573 590 574 /* Return md raid10 algorithm for @name */ 591 - static int raid10_name_to_format(const char *name) 575 + static const int raid10_name_to_format(const char *name) 592 576 { 593 577 if (!strcasecmp(name, "near")) 594 578 return ALGORITHM_RAID10_NEAR; ··· 691 675 return NULL; 692 676 } 693 677 694 - /* 695 - * Conditionally change bdev capacity of @rs 696 - * in case of a disk add/remove reshape 697 - */ 698 - static void rs_set_capacity(struct raid_set *rs) 678 + /* Adjust rdev sectors */ 679 + static void rs_set_rdev_sectors(struct raid_set *rs) 699 680 { 700 681 struct mddev *mddev = &rs->md; 701 682 struct md_rdev *rdev; 702 - struct gendisk *gendisk = dm_disk(dm_table_get_md(rs->ti->table)); 703 683 704 684 /* 705 685 * raid10 sets rdev->sector to the device size, which ··· 704 692 rdev_for_each(rdev, mddev) 705 693 if (!test_bit(Journal, &rdev->flags)) 706 694 rdev->sectors = mddev->dev_sectors; 695 + } 707 696 708 - set_capacity(gendisk, mddev->array_sectors); 697 + /* 698 + * Change bdev capacity of @rs in case of a disk add/remove reshape 699 + */ 700 + static void rs_set_capacity(struct raid_set *rs) 701 + { 702 + struct gendisk *gendisk = dm_disk(dm_table_get_md(rs->ti->table)); 703 + 704 + set_capacity(gendisk, rs->md.array_sectors); 709 705 revalidate_disk(gendisk); 710 706 } 711 707 ··· 764 744 765 745 mddev_init(&rs->md); 766 746 747 + INIT_LIST_HEAD(&rs->list); 767 748 rs->raid_disks = raid_devs; 768 749 rs->delta_disks = 0; 769 750 ··· 782 761 for (i = 0; i < raid_devs; i++) 783 762 md_rdev_init(&rs->dev[i].rdev); 784 763 764 + /* Add @rs to global list. */ 765 + list_add(&rs->list, &raid_sets); 766 + 785 767 /* 786 768 * Remaining items to be initialized by further RAID params: 787 769 * rs->md.persistent ··· 797 773 return rs; 798 774 } 799 775 776 + /* Free all @rs allocations and remove it from global list. 
*/ 800 777 static void raid_set_free(struct raid_set *rs) 801 778 { 802 779 int i; ··· 814 789 if (rs->dev[i].data_dev) 815 790 dm_put_device(rs->ti, rs->dev[i].data_dev); 816 791 } 792 + 793 + list_del(&rs->list); 817 794 818 795 kfree(rs); 819 796 } ··· 1029 1002 !rs->dev[i].rdev.sb_page) 1030 1003 rebuild_cnt++; 1031 1004 1032 - switch (rs->raid_type->level) { 1005 + switch (rs->md.level) { 1033 1006 case 0: 1034 1007 break; 1035 1008 case 1: ··· 1044 1017 break; 1045 1018 case 10: 1046 1019 copies = raid10_md_layout_to_copies(rs->md.new_layout); 1020 + if (copies < 2) { 1021 + DMERR("Bogus raid10 data copies < 2!"); 1022 + return -EINVAL; 1023 + } 1024 + 1047 1025 if (rebuild_cnt < copies) 1048 1026 break; 1049 1027 ··· 1608 1576 return 0; 1609 1577 } 1610 1578 1579 + /* Check that calculated dev_sectors fits all component devices. */ 1580 + static int _check_data_dev_sectors(struct raid_set *rs) 1581 + { 1582 + sector_t ds = ~0; 1583 + struct md_rdev *rdev; 1584 + 1585 + rdev_for_each(rdev, &rs->md) 1586 + if (!test_bit(Journal, &rdev->flags) && rdev->bdev) { 1587 + ds = min(ds, to_sector(i_size_read(rdev->bdev->bd_inode))); 1588 + if (ds < rs->md.dev_sectors) { 1589 + rs->ti->error = "Component device(s) too small"; 1590 + return -EINVAL; 1591 + } 1592 + } 1593 + 1594 + return 0; 1595 + } 1596 + 1611 1597 /* Calculate the sectors per device and per array used for @rs */ 1612 1598 static int rs_set_dev_and_array_sectors(struct raid_set *rs, bool use_mddev) 1613 1599 { ··· 1675 1625 mddev->array_sectors = array_sectors; 1676 1626 mddev->dev_sectors = dev_sectors; 1677 1627 1678 - return 0; 1628 + return _check_data_dev_sectors(rs); 1679 1629 bad: 1680 1630 rs->ti->error = "Target length not divisible by number of data devices"; 1681 1631 return -EINVAL; ··· 1724 1674 struct raid_set *rs = container_of(ws, struct raid_set, md.event_work); 1725 1675 1726 1676 smp_rmb(); /* Make sure we access most actual mddev properties */ 1727 - if (!rs_is_reshaping(rs)) 1677 + 
if (!rs_is_reshaping(rs)) { 1678 + if (rs_is_raid10(rs)) 1679 + rs_set_rdev_sectors(rs); 1728 1680 rs_set_capacity(rs); 1681 + } 1729 1682 dm_table_event(rs->ti->table); 1730 1683 } 1731 1684 ··· 1913 1860 if (rs_takeover_requested(rs)) 1914 1861 return false; 1915 1862 1916 - if (!mddev->level) 1863 + if (rs_is_raid0(rs)) 1917 1864 return false; 1918 1865 1919 1866 change = mddev->new_layout != mddev->layout || ··· 1921 1868 rs->delta_disks; 1922 1869 1923 1870 /* Historical case to support raid1 reshape without delta disks */ 1924 - if (mddev->level == 1) { 1871 + if (rs_is_raid1(rs)) { 1925 1872 if (rs->delta_disks) 1926 1873 return !!rs->delta_disks; 1927 1874 ··· 1929 1876 mddev->raid_disks != rs->raid_disks; 1930 1877 } 1931 1878 1932 - if (mddev->level == 10) 1879 + if (rs_is_raid10(rs)) 1933 1880 return change && 1934 1881 !__is_raid10_far(mddev->new_layout) && 1935 1882 rs->delta_disks >= 0; ··· 2393 2340 DMERR("new device%s provided without 'rebuild'", 2394 2341 new_devs > 1 ? "s" : ""); 2395 2342 return -EINVAL; 2396 - } else if (rs_is_recovering(rs)) { 2343 + } else if (!test_bit(__CTR_FLAG_REBUILD, &rs->ctr_flags) && rs_is_recovering(rs)) { 2397 2344 DMERR("'rebuild' specified while raid set is not in-sync (recovery_cp=%llu)", 2398 2345 (unsigned long long) mddev->recovery_cp); 2399 2346 return -EINVAL; ··· 2693 2640 * Make sure we got a minimum amount of free sectors per device 2694 2641 */ 2695 2642 if (rs->data_offset && 2696 - to_sector(i_size_read(rdev->bdev->bd_inode)) - rdev->sectors < MIN_FREE_RESHAPE_SPACE) { 2643 + to_sector(i_size_read(rdev->bdev->bd_inode)) - rs->md.dev_sectors < MIN_FREE_RESHAPE_SPACE) { 2697 2644 rs->ti->error = data_offset ? "No space for forward reshape" : 2698 2645 "No space for backward reshape"; 2699 2646 return -ENOSPC; 2700 2647 } 2701 2648 out: 2649 + /* 2650 + * Raise recovery_cp in case data_offset != 0 to 2651 + * avoid false recovery positives in the constructor. 
2652 + */ 2653 + if (rs->md.recovery_cp < rs->md.dev_sectors) 2654 + rs->md.recovery_cp += rs->dev[0].rdev.data_offset; 2655 + 2702 2656 /* Adjust data offsets on all rdevs but on any raid4/5/6 journal device */ 2703 2657 rdev_for_each(rdev, &rs->md) { 2704 2658 if (!test_bit(Journal, &rdev->flags)) { ··· 2742 2682 sector_t new_data_offset = rs->dev[0].rdev.data_offset ? 0 : rs->data_offset; 2743 2683 2744 2684 if (rt_is_raid10(rs->raid_type)) { 2745 - if (mddev->level == 0) { 2685 + if (rs_is_raid0(rs)) { 2746 2686 /* Userpace reordered disks -> adjust raid_disk indexes */ 2747 2687 __reorder_raid_disk_indexes(rs); 2748 2688 2749 2689 /* raid0 -> raid10_far layout */ 2750 2690 mddev->layout = raid10_format_to_md_layout(rs, ALGORITHM_RAID10_FAR, 2751 2691 rs->raid10_copies); 2752 - } else if (mddev->level == 1) 2692 + } else if (rs_is_raid1(rs)) 2753 2693 /* raid1 -> raid10_near layout */ 2754 2694 mddev->layout = raid10_format_to_md_layout(rs, ALGORITHM_RAID10_NEAR, 2755 2695 rs->raid_disks); ··· 2837 2777 return 0; 2838 2778 } 2839 2779 2780 + /* Get reshape sectors from data_offsets or raid set */ 2781 + static sector_t _get_reshape_sectors(struct raid_set *rs) 2782 + { 2783 + struct md_rdev *rdev; 2784 + sector_t reshape_sectors = 0; 2785 + 2786 + rdev_for_each(rdev, &rs->md) 2787 + if (!test_bit(Journal, &rdev->flags)) { 2788 + reshape_sectors = (rdev->data_offset > rdev->new_data_offset) ? 
2789 + rdev->data_offset - rdev->new_data_offset : 2790 + rdev->new_data_offset - rdev->data_offset; 2791 + break; 2792 + } 2793 + 2794 + return max(reshape_sectors, (sector_t) rs->data_offset); 2795 + } 2796 + 2840 2797 /* 2841 2798 * 2842 2799 * - change raid layout ··· 2865 2788 { 2866 2789 int r = 0; 2867 2790 unsigned int cur_raid_devs, d; 2791 + sector_t reshape_sectors = _get_reshape_sectors(rs); 2868 2792 struct mddev *mddev = &rs->md; 2869 2793 struct md_rdev *rdev; 2870 2794 ··· 2882 2804 /* 2883 2805 * Adjust array size: 2884 2806 * 2885 - * - in case of adding disks, array size has 2807 + * - in case of adding disk(s), array size has 2886 2808 * to grow after the disk adding reshape, 2887 2809 * which'll hapen in the event handler; 2888 2810 * reshape will happen forward, so space has to 2889 2811 * be available at the beginning of each disk 2890 2812 * 2891 - * - in case of removing disks, array size 2813 + * - in case of removing disk(s), array size 2892 2814 * has to shrink before starting the reshape, 2893 2815 * which'll happen here; 2894 2816 * reshape will happen backward, so space has to ··· 2919 2841 rdev->recovery_offset = rs_is_raid1(rs) ? 0 : MaxSector; 2920 2842 } 2921 2843 2922 - mddev->reshape_backwards = 0; /* adding disks -> forward reshape */ 2844 + mddev->reshape_backwards = 0; /* adding disk(s) -> forward reshape */ 2923 2845 2924 2846 /* Remove disk(s) */ 2925 2847 } else if (rs->delta_disks < 0) { ··· 2952 2874 mddev->reshape_backwards = rs->dev[0].rdev.data_offset ? 0 : 1; 2953 2875 } 2954 2876 2877 + /* 2878 + * Adjust device size for forward reshape 2879 + * because md_finish_reshape() reduces it. 2880 + */ 2881 + if (!mddev->reshape_backwards) 2882 + rdev_for_each(rdev, &rs->md) 2883 + if (!test_bit(Journal, &rdev->flags)) 2884 + rdev->sectors += reshape_sectors; 2885 + 2955 2886 return r; 2956 2887 } 2957 2888 ··· 2977 2890 /* 2978 2891 * XXX: RAID level 4,5,6 require zeroing for safety. 
2979 2892 */ 2980 - raid456 = (rs->md.level == 4 || rs->md.level == 5 || rs->md.level == 6); 2893 + raid456 = rs_is_raid456(rs); 2981 2894 2982 2895 for (i = 0; i < rs->raid_disks; i++) { 2983 2896 struct request_queue *q; ··· 3002 2915 * RAID1 and RAID10 personalities require bio splitting, 3003 2916 * RAID0/4/5/6 don't and process large discard bios properly. 3004 2917 */ 3005 - ti->split_discard_bios = !!(rs->md.level == 1 || rs->md.level == 10); 2918 + ti->split_discard_bios = !!(rs_is_raid1(rs) || rs_is_raid10(rs)); 3006 2919 ti->num_discard_bios = 1; 3007 2920 } 3008 2921 ··· 3022 2935 static int raid_ctr(struct dm_target *ti, unsigned int argc, char **argv) 3023 2936 { 3024 2937 int r; 3025 - bool resize; 2938 + bool resize = false; 3026 2939 struct raid_type *rt; 3027 2940 unsigned int num_raid_params, num_raid_devs; 3028 - sector_t calculated_dev_sectors, rdev_sectors; 2941 + sector_t calculated_dev_sectors, rdev_sectors, reshape_sectors; 3029 2942 struct raid_set *rs = NULL; 3030 2943 const char *arg; 3031 2944 struct rs_layout rs_layout; ··· 3108 3021 goto bad; 3109 3022 } 3110 3023 3111 - resize = calculated_dev_sectors != rdev_sectors; 3024 + 3025 + reshape_sectors = _get_reshape_sectors(rs); 3026 + if (calculated_dev_sectors != rdev_sectors) 3027 + resize = calculated_dev_sectors != (reshape_sectors ? rdev_sectors - reshape_sectors : rdev_sectors); 3112 3028 3113 3029 INIT_WORK(&rs->md.event_work, do_table_event); 3114 3030 ti->private = rs; ··· 3195 3105 goto bad; 3196 3106 } 3197 3107 3198 - /* 3199 - * We can only prepare for a reshape here, because the 3200 - * raid set needs to run to provide the repective reshape 3201 - * check functions via its MD personality instance. 3202 - * 3203 - * So do the reshape check after md_run() succeeded. 3204 - */ 3205 - r = rs_prepare_reshape(rs); 3206 - if (r) 3207 - return r; 3108 + /* Out-of-place space has to be available to allow for a reshape unless raid1! 
*/ 3109 + if (reshape_sectors || rs_is_raid1(rs)) { 3110 + /* 3111 + * We can only prepare for a reshape here, because the 3112 + * raid set needs to run to provide the repective reshape 3113 + * check functions via its MD personality instance. 3114 + * 3115 + * So do the reshape check after md_run() succeeded. 3116 + */ 3117 + r = rs_prepare_reshape(rs); 3118 + if (r) 3119 + return r; 3208 3120 3209 - /* Reshaping ain't recovery, so disable recovery */ 3210 - rs_setup_recovery(rs, MaxSector); 3121 + /* Reshaping ain't recovery, so disable recovery */ 3122 + rs_setup_recovery(rs, MaxSector); 3123 + } 3211 3124 rs_set_cur(rs); 3212 3125 } else { 3213 3126 /* May not set recovery when a device rebuild is requested */ ··· 3237 3144 mddev_lock_nointr(&rs->md); 3238 3145 r = md_run(&rs->md); 3239 3146 rs->md.in_sync = 0; /* Assume already marked dirty */ 3240 - 3241 3147 if (r) { 3242 3148 ti->error = "Failed to run raid array"; 3243 3149 mddev_unlock(&rs->md); ··· 3340 3248 } 3341 3249 3342 3250 /* Return string describing the current sync action of @mddev */ 3343 - static const char *decipher_sync_action(struct mddev *mddev) 3251 + static const char *decipher_sync_action(struct mddev *mddev, unsigned long recovery) 3344 3252 { 3345 - if (test_bit(MD_RECOVERY_FROZEN, &mddev->recovery)) 3253 + if (test_bit(MD_RECOVERY_FROZEN, &recovery)) 3346 3254 return "frozen"; 3347 3255 3348 - if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) || 3349 - (!mddev->ro && test_bit(MD_RECOVERY_NEEDED, &mddev->recovery))) { 3350 - if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery)) 3256 + /* The MD sync thread can be done with io but still be running */ 3257 + if (!test_bit(MD_RECOVERY_DONE, &recovery) && 3258 + (test_bit(MD_RECOVERY_RUNNING, &recovery) || 3259 + (!mddev->ro && test_bit(MD_RECOVERY_NEEDED, &recovery)))) { 3260 + if (test_bit(MD_RECOVERY_RESHAPE, &recovery)) 3351 3261 return "reshape"; 3352 3262 3353 - if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) { 3354 - if 
(!test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery)) 3263 + if (test_bit(MD_RECOVERY_SYNC, &recovery)) { 3264 + if (!test_bit(MD_RECOVERY_REQUESTED, &recovery)) 3355 3265 return "resync"; 3356 - else if (test_bit(MD_RECOVERY_CHECK, &mddev->recovery)) 3266 + else if (test_bit(MD_RECOVERY_CHECK, &recovery)) 3357 3267 return "check"; 3358 3268 return "repair"; 3359 3269 } 3360 3270 3361 - if (test_bit(MD_RECOVERY_RECOVER, &mddev->recovery)) 3271 + if (test_bit(MD_RECOVERY_RECOVER, &recovery)) 3362 3272 return "recover"; 3363 3273 } 3364 3274 ··· 3377 3283 * 'A' = Alive and in-sync raid set component _or_ alive raid4/5/6 'write_through' journal device 3378 3284 * '-' = Non-existing device (i.e. uspace passed '- -' into the ctr) 3379 3285 */ 3380 - static const char *__raid_dev_status(struct raid_set *rs, struct md_rdev *rdev, bool array_in_sync) 3286 + static const char *__raid_dev_status(struct raid_set *rs, struct md_rdev *rdev) 3381 3287 { 3382 3288 if (!rdev->bdev) 3383 3289 return "-"; ··· 3385 3291 return "D"; 3386 3292 else if (test_bit(Journal, &rdev->flags)) 3387 3293 return (rs->journal_dev.mode == R5C_JOURNAL_MODE_WRITE_THROUGH) ? 
"A" : "a"; 3388 - else if (!array_in_sync || !test_bit(In_sync, &rdev->flags)) 3294 + else if (test_bit(RT_FLAG_RS_RESYNCING, &rs->runtime_flags) || 3295 + (!test_bit(RT_FLAG_RS_IN_SYNC, &rs->runtime_flags) && 3296 + !test_bit(In_sync, &rdev->flags))) 3389 3297 return "a"; 3390 3298 else 3391 3299 return "A"; 3392 3300 } 3393 3301 3394 - /* Helper to return resync/reshape progress for @rs and @array_in_sync */ 3395 - static sector_t rs_get_progress(struct raid_set *rs, 3396 - sector_t resync_max_sectors, bool *array_in_sync) 3302 + /* Helper to return resync/reshape progress for @rs and runtime flags for raid set in sync / resynching */ 3303 + static sector_t rs_get_progress(struct raid_set *rs, unsigned long recovery, 3304 + sector_t resync_max_sectors) 3397 3305 { 3398 - sector_t r, curr_resync_completed; 3306 + sector_t r; 3399 3307 struct mddev *mddev = &rs->md; 3400 3308 3401 - curr_resync_completed = mddev->curr_resync_completed ?: mddev->recovery_cp; 3402 - *array_in_sync = false; 3309 + clear_bit(RT_FLAG_RS_IN_SYNC, &rs->runtime_flags); 3310 + clear_bit(RT_FLAG_RS_RESYNCING, &rs->runtime_flags); 3403 3311 3404 3312 if (rs_is_raid0(rs)) { 3405 3313 r = resync_max_sectors; 3406 - *array_in_sync = true; 3314 + set_bit(RT_FLAG_RS_IN_SYNC, &rs->runtime_flags); 3407 3315 3408 3316 } else { 3409 - r = mddev->reshape_position; 3410 - 3411 - /* Reshape is relative to the array size */ 3412 - if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) || 3413 - r != MaxSector) { 3414 - if (r == MaxSector) { 3415 - *array_in_sync = true; 3416 - r = resync_max_sectors; 3417 - } else { 3418 - /* Got to reverse on backward reshape */ 3419 - if (mddev->reshape_backwards) 3420 - r = mddev->array_sectors - r; 3421 - 3422 - /* Devide by # of data stripes */ 3423 - sector_div(r, mddev_data_stripes(rs)); 3424 - } 3425 - 3426 - /* Sync is relative to the component device size */ 3427 - } else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) 3428 - r = curr_resync_completed; 3317 + 
if (test_bit(MD_RECOVERY_NEEDED, &recovery) || 3318 + test_bit(MD_RECOVERY_RESHAPE, &recovery) || 3319 + test_bit(MD_RECOVERY_RUNNING, &recovery)) 3320 + r = mddev->curr_resync_completed; 3429 3321 else 3430 3322 r = mddev->recovery_cp; 3431 3323 3432 - if ((r == MaxSector) || 3433 - (test_bit(MD_RECOVERY_DONE, &mddev->recovery) && 3434 - (mddev->curr_resync_completed == resync_max_sectors))) { 3324 + if (r >= resync_max_sectors && 3325 + (!test_bit(MD_RECOVERY_REQUESTED, &recovery) || 3326 + (!test_bit(MD_RECOVERY_FROZEN, &recovery) && 3327 + !test_bit(MD_RECOVERY_NEEDED, &recovery) && 3328 + !test_bit(MD_RECOVERY_RUNNING, &recovery)))) { 3435 3329 /* 3436 3330 * Sync complete. 3437 3331 */ 3438 - *array_in_sync = true; 3439 - r = resync_max_sectors; 3440 - } else if (test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery)) { 3332 + /* In case we have finished recovering, the array is in sync. */ 3333 + if (test_bit(MD_RECOVERY_RECOVER, &recovery)) 3334 + set_bit(RT_FLAG_RS_IN_SYNC, &rs->runtime_flags); 3335 + 3336 + } else if (test_bit(MD_RECOVERY_RECOVER, &recovery)) { 3337 + /* 3338 + * In case we are recovering, the array is not in sync 3339 + * and health chars should show the recovering legs. 3340 + */ 3341 + ; 3342 + 3343 + } else if (test_bit(MD_RECOVERY_SYNC, &recovery) && 3344 + !test_bit(MD_RECOVERY_REQUESTED, &recovery)) { 3345 + /* 3346 + * If "resync" is occurring, the raid set 3347 + * is or may be out of sync hence the health 3348 + * characters shall be 'a'. 3349 + */ 3350 + set_bit(RT_FLAG_RS_RESYNCING, &rs->runtime_flags); 3351 + 3352 + } else if (test_bit(MD_RECOVERY_RESHAPE, &recovery) && 3353 + !test_bit(MD_RECOVERY_REQUESTED, &recovery)) { 3354 + /* 3355 + * If "reshape" is occurring, the raid set 3356 + * is or may be out of sync hence the health 3357 + * characters shall be 'a'. 
3358 + */ 3359 + set_bit(RT_FLAG_RS_RESYNCING, &rs->runtime_flags); 3360 + 3361 + } else if (test_bit(MD_RECOVERY_REQUESTED, &recovery)) { 3441 3362 /* 3442 3363 * If "check" or "repair" is occurring, the raid set has 3443 3364 * undergone an initial sync and the health characters 3444 3365 * should not be 'a' anymore. 3445 3366 */ 3446 - *array_in_sync = true; 3367 + set_bit(RT_FLAG_RS_IN_SYNC, &rs->runtime_flags); 3368 + 3447 3369 } else { 3448 3370 struct md_rdev *rdev; 3371 + 3372 + /* 3373 + * We are idle and recovery is needed, prevent 'A' chars race 3374 + * caused by components still set to in-sync by constrcuctor. 3375 + */ 3376 + if (test_bit(MD_RECOVERY_NEEDED, &recovery)) 3377 + set_bit(RT_FLAG_RS_RESYNCING, &rs->runtime_flags); 3449 3378 3450 3379 /* 3451 3380 * The raid set may be doing an initial sync, or it may ··· 3476 3359 * devices are In_sync, then it is the raid set that is 3477 3360 * being initialized. 3478 3361 */ 3362 + set_bit(RT_FLAG_RS_IN_SYNC, &rs->runtime_flags); 3479 3363 rdev_for_each(rdev, mddev) 3480 3364 if (!test_bit(Journal, &rdev->flags) && 3481 - !test_bit(In_sync, &rdev->flags)) 3482 - *array_in_sync = true; 3483 - #if 0 3484 - r = 0; /* HM FIXME: TESTME: https://bugzilla.redhat.com/show_bug.cgi?id=1210637 ? */ 3485 - #endif 3365 + !test_bit(In_sync, &rdev->flags)) { 3366 + clear_bit(RT_FLAG_RS_IN_SYNC, &rs->runtime_flags); 3367 + break; 3368 + } 3486 3369 } 3487 3370 } 3488 3371 3489 - return r; 3372 + return min(r, resync_max_sectors); 3490 3373 } 3491 3374 3492 3375 /* Helper to return @dev name or "-" if !@dev */ ··· 3502 3385 struct mddev *mddev = &rs->md; 3503 3386 struct r5conf *conf = mddev->private; 3504 3387 int i, max_nr_stripes = conf ? 
conf->max_nr_stripes : 0; 3505 - bool array_in_sync; 3388 + unsigned long recovery; 3506 3389 unsigned int raid_param_cnt = 1; /* at least 1 for chunksize */ 3507 3390 unsigned int sz = 0; 3508 3391 unsigned int rebuild_disks; ··· 3522 3405 3523 3406 /* Access most recent mddev properties for status output */ 3524 3407 smp_rmb(); 3408 + recovery = rs->md.recovery; 3525 3409 /* Get sensible max sectors even if raid set not yet started */ 3526 3410 resync_max_sectors = test_bit(RT_FLAG_RS_PRERESUMED, &rs->runtime_flags) ? 3527 3411 mddev->resync_max_sectors : mddev->dev_sectors; 3528 - progress = rs_get_progress(rs, resync_max_sectors, &array_in_sync); 3412 + progress = rs_get_progress(rs, recovery, resync_max_sectors); 3529 3413 resync_mismatches = (mddev->last_sync_action && !strcasecmp(mddev->last_sync_action, "check")) ? 3530 3414 atomic64_read(&mddev->resync_mismatches) : 0; 3531 - sync_action = decipher_sync_action(&rs->md); 3415 + sync_action = decipher_sync_action(&rs->md, recovery); 3532 3416 3533 3417 /* HM FIXME: do we want another state char for raid0? It shows 'D'/'A'/'-' now */ 3534 3418 for (i = 0; i < rs->raid_disks; i++) 3535 - DMEMIT(__raid_dev_status(rs, &rs->dev[i].rdev, array_in_sync)); 3419 + DMEMIT(__raid_dev_status(rs, &rs->dev[i].rdev)); 3536 3420 3537 3421 /* 3538 3422 * In-sync/Reshape ratio: ··· 3584 3466 * v1.10.0+: 3585 3467 */ 3586 3468 DMEMIT(" %s", test_bit(__CTR_FLAG_JOURNAL_DEV, &rs->ctr_flags) ? 
3587 - __raid_dev_status(rs, &rs->journal_dev.rdev, 0) : "-"); 3469 + __raid_dev_status(rs, &rs->journal_dev.rdev) : "-"); 3588 3470 break; 3589 3471 3590 3472 case STATUSTYPE_TABLE: ··· 3740 3622 blk_limits_io_opt(limits, chunk_size * mddev_data_stripes(rs)); 3741 3623 } 3742 3624 3743 - static void raid_presuspend(struct dm_target *ti) 3744 - { 3745 - struct raid_set *rs = ti->private; 3746 - 3747 - md_stop_writes(&rs->md); 3748 - } 3749 - 3750 3625 static void raid_postsuspend(struct dm_target *ti) 3751 3626 { 3752 3627 struct raid_set *rs = ti->private; 3753 3628 3754 3629 if (!test_and_set_bit(RT_FLAG_RS_SUSPENDED, &rs->runtime_flags)) { 3630 + /* Writes have to be stopped before suspending to avoid deadlocks. */ 3631 + if (!test_bit(MD_RECOVERY_FROZEN, &rs->md.recovery)) 3632 + md_stop_writes(&rs->md); 3633 + 3755 3634 mddev_lock_nointr(&rs->md); 3756 3635 mddev_suspend(&rs->md); 3757 3636 mddev_unlock(&rs->md); 3758 3637 } 3759 - 3760 - rs->md.ro = 1; 3761 3638 } 3762 3639 3763 3640 static void attempt_restore_of_faulty_devices(struct raid_set *rs) ··· 3929 3816 struct raid_set *rs = ti->private; 3930 3817 struct mddev *mddev = &rs->md; 3931 3818 3932 - /* This is a resume after a suspend of the set -> it's already started */ 3819 + /* This is a resume after a suspend of the set -> it's already started. */ 3933 3820 if (test_and_set_bit(RT_FLAG_RS_PRERESUMED, &rs->runtime_flags)) 3934 3821 return 0; 3822 + 3823 + if (!test_bit(__CTR_FLAG_REBUILD, &rs->ctr_flags)) { 3824 + struct raid_set *rs_active = rs_find_active(rs); 3825 + 3826 + if (rs_active) { 3827 + /* 3828 + * In case no rebuilds have been requested 3829 + * and an active table slot exists, copy 3830 + * current resynchonization completed and 3831 + * reshape position pointers across from 3832 + * suspended raid set in the active slot. 
3833 + * 3834 + * This resumes the new mapping at current 3835 + * offsets to continue recover/reshape without 3836 + * necessarily redoing a raid set partially or 3837 + * causing data corruption in case of a reshape. 3838 + */ 3839 + if (rs_active->md.curr_resync_completed != MaxSector) 3840 + mddev->curr_resync_completed = rs_active->md.curr_resync_completed; 3841 + if (rs_active->md.reshape_position != MaxSector) 3842 + mddev->reshape_position = rs_active->md.reshape_position; 3843 + } 3844 + } 3935 3845 3936 3846 /* 3937 3847 * The superblocks need to be updated on disk if the ··· 3987 3851 mddev->resync_min = mddev->recovery_cp; 3988 3852 } 3989 3853 3990 - rs_set_capacity(rs); 3991 - 3992 3854 /* Check for any reshape request unless new raid set */ 3993 - if (test_and_clear_bit(RT_FLAG_RESHAPE_RS, &rs->runtime_flags)) { 3855 + if (test_bit(RT_FLAG_RESHAPE_RS, &rs->runtime_flags)) { 3994 3856 /* Initiate a reshape. */ 3857 + rs_set_rdev_sectors(rs); 3995 3858 mddev_lock_nointr(mddev); 3996 3859 r = rs_start_reshape(rs); 3997 3860 mddev_unlock(mddev); ··· 4016 3881 attempt_restore_of_faulty_devices(rs); 4017 3882 } 4018 3883 4019 - mddev->ro = 0; 4020 - mddev->in_sync = 0; 4021 - 4022 - /* 4023 - * Keep the RAID set frozen if reshape/rebuild flags are set. 4024 - * The RAID set is unfrozen once the next table load/resume, 4025 - * which clears the reshape/rebuild flags, occurs. 4026 - * This ensures that the constructor for the inactive table 4027 - * retrieves an up-to-date reshape_position. 4028 - */ 4029 - if (!(rs->ctr_flags & RESUME_STAY_FROZEN_FLAGS)) 4030 - clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); 4031 - 4032 3884 if (test_and_clear_bit(RT_FLAG_RS_SUSPENDED, &rs->runtime_flags)) { 3885 + /* Only reduce raid set size before running a disk removing reshape. 
*/ 3886 + if (mddev->delta_disks < 0) 3887 + rs_set_capacity(rs); 3888 + 4033 3889 mddev_lock_nointr(mddev); 3890 + clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); 3891 + mddev->ro = 0; 3892 + mddev->in_sync = 0; 4034 3893 mddev_resume(mddev); 4035 3894 mddev_unlock(mddev); 4036 3895 } ··· 4032 3903 4033 3904 static struct target_type raid_target = { 4034 3905 .name = "raid", 4035 - .version = {1, 13, 0}, 3906 + .version = {1, 13, 2}, 4036 3907 .module = THIS_MODULE, 4037 3908 .ctr = raid_ctr, 4038 3909 .dtr = raid_dtr, ··· 4041 3912 .message = raid_message, 4042 3913 .iterate_devices = raid_iterate_devices, 4043 3914 .io_hints = raid_io_hints, 4044 - .presuspend = raid_presuspend, 4045 3915 .postsuspend = raid_postsuspend, 4046 3916 .preresume = raid_preresume, 4047 3917 .resume = raid_resume,
+4 -2
drivers/md/dm-rq.c
··· 315 315 /* The target wants to requeue the I/O */ 316 316 dm_requeue_original_request(tio, false); 317 317 break; 318 + case DM_ENDIO_DELAY_REQUEUE: 319 + /* The target wants to requeue the I/O after a delay */ 320 + dm_requeue_original_request(tio, true); 321 + break; 318 322 default: 319 323 DMWARN("unimplemented target endio return value: %d", r); 320 324 BUG(); ··· 717 713 /* disable dm_old_request_fn's merge heuristic by default */ 718 714 md->seq_rq_merge_deadline_usecs = 0; 719 715 720 - dm_init_normal_md_queue(md); 721 716 blk_queue_softirq_done(md->queue, dm_softirq_done); 722 717 723 718 /* Initialize the request-based DM worker thread */ ··· 824 821 err = PTR_ERR(q); 825 822 goto out_tag_set; 826 823 } 827 - dm_init_md_queue(md); 828 824 829 825 return 0; 830 826
+3 -3
drivers/md/dm-service-time.c
··· 282 282 if (list_empty(&s->valid_paths)) 283 283 goto out; 284 284 285 - /* Change preferred (first in list) path to evenly balance. */ 286 - list_move_tail(s->valid_paths.next, &s->valid_paths); 287 - 288 285 list_for_each_entry(pi, &s->valid_paths, list) 289 286 if (!best || (st_compare_load(pi, best, nr_bytes) < 0)) 290 287 best = pi; 291 288 292 289 if (!best) 293 290 goto out; 291 + 292 + /* Move most recently used to least preferred to evenly balance. */ 293 + list_move_tail(&best->list, &s->valid_paths); 294 294 295 295 ret = best->path; 296 296 out:
+43 -41
drivers/md/dm-snap.c
··· 47 47 }; 48 48 49 49 struct dm_snapshot { 50 - struct rw_semaphore lock; 50 + struct mutex lock; 51 51 52 52 struct dm_dev *origin; 53 53 struct dm_dev *cow; ··· 439 439 if (!bdev_equal(s->cow->bdev, snap->cow->bdev)) 440 440 continue; 441 441 442 - down_read(&s->lock); 442 + mutex_lock(&s->lock); 443 443 active = s->active; 444 - up_read(&s->lock); 444 + mutex_unlock(&s->lock); 445 445 446 446 if (active) { 447 447 if (snap_src) ··· 909 909 int r; 910 910 chunk_t old_chunk = s->first_merging_chunk + s->num_merging_chunks - 1; 911 911 912 - down_write(&s->lock); 912 + mutex_lock(&s->lock); 913 913 914 914 /* 915 915 * Process chunks (and associated exceptions) in reverse order ··· 924 924 b = __release_queued_bios_after_merge(s); 925 925 926 926 out: 927 - up_write(&s->lock); 927 + mutex_unlock(&s->lock); 928 928 if (b) 929 929 flush_bios(b); 930 930 ··· 983 983 if (linear_chunks < 0) { 984 984 DMERR("Read error in exception store: " 985 985 "shutting down merge"); 986 - down_write(&s->lock); 986 + mutex_lock(&s->lock); 987 987 s->merge_failed = 1; 988 - up_write(&s->lock); 988 + mutex_unlock(&s->lock); 989 989 } 990 990 goto shut; 991 991 } ··· 1026 1026 previous_count = read_pending_exceptions_done_count(); 1027 1027 } 1028 1028 1029 - down_write(&s->lock); 1029 + mutex_lock(&s->lock); 1030 1030 s->first_merging_chunk = old_chunk; 1031 1031 s->num_merging_chunks = linear_chunks; 1032 - up_write(&s->lock); 1032 + mutex_unlock(&s->lock); 1033 1033 1034 1034 /* Wait until writes to all 'linear_chunks' drain */ 1035 1035 for (i = 0; i < linear_chunks; i++) ··· 1071 1071 return; 1072 1072 1073 1073 shut: 1074 - down_write(&s->lock); 1074 + mutex_lock(&s->lock); 1075 1075 s->merge_failed = 1; 1076 1076 b = __release_queued_bios_after_merge(s); 1077 - up_write(&s->lock); 1077 + mutex_unlock(&s->lock); 1078 1078 error_bios(b); 1079 1079 1080 1080 merge_shutdown(s); ··· 1173 1173 s->exception_start_sequence = 0; 1174 1174 s->exception_complete_sequence = 0; 1175 1175 
INIT_LIST_HEAD(&s->out_of_order_list); 1176 - init_rwsem(&s->lock); 1176 + mutex_init(&s->lock); 1177 1177 INIT_LIST_HEAD(&s->list); 1178 1178 spin_lock_init(&s->pe_lock); 1179 1179 s->state_bits = 0; ··· 1338 1338 /* Check whether exception handover must be cancelled */ 1339 1339 (void) __find_snapshots_sharing_cow(s, &snap_src, &snap_dest, NULL); 1340 1340 if (snap_src && snap_dest && (s == snap_src)) { 1341 - down_write(&snap_dest->lock); 1341 + mutex_lock(&snap_dest->lock); 1342 1342 snap_dest->valid = 0; 1343 - up_write(&snap_dest->lock); 1343 + mutex_unlock(&snap_dest->lock); 1344 1344 DMERR("Cancelling snapshot handover."); 1345 1345 } 1346 1346 up_read(&_origins_lock); ··· 1370 1370 mempool_destroy(s->pending_pool); 1371 1371 1372 1372 dm_exception_store_destroy(s->store); 1373 + 1374 + mutex_destroy(&s->lock); 1373 1375 1374 1376 dm_put_device(ti, s->cow); 1375 1377 ··· 1460 1458 1461 1459 if (!success) { 1462 1460 /* Read/write error - snapshot is unusable */ 1463 - down_write(&s->lock); 1461 + mutex_lock(&s->lock); 1464 1462 __invalidate_snapshot(s, -EIO); 1465 1463 error = 1; 1466 1464 goto out; ··· 1468 1466 1469 1467 e = alloc_completed_exception(GFP_NOIO); 1470 1468 if (!e) { 1471 - down_write(&s->lock); 1469 + mutex_lock(&s->lock); 1472 1470 __invalidate_snapshot(s, -ENOMEM); 1473 1471 error = 1; 1474 1472 goto out; 1475 1473 } 1476 1474 *e = pe->e; 1477 1475 1478 - down_write(&s->lock); 1476 + mutex_lock(&s->lock); 1479 1477 if (!s->valid) { 1480 1478 free_completed_exception(e); 1481 1479 error = 1; ··· 1500 1498 full_bio->bi_end_io = pe->full_bio_end_io; 1501 1499 increment_pending_exceptions_done_count(); 1502 1500 1503 - up_write(&s->lock); 1501 + mutex_unlock(&s->lock); 1504 1502 1505 1503 /* Submit any pending write bios */ 1506 1504 if (error) { ··· 1696 1694 1697 1695 /* FIXME: should only take write lock if we need 1698 1696 * to copy an exception */ 1699 - down_write(&s->lock); 1697 + mutex_lock(&s->lock); 1700 1698 1701 1699 if 
(!s->valid || (unlikely(s->snapshot_overflowed) && 1702 1700 bio_data_dir(bio) == WRITE)) { ··· 1719 1717 if (bio_data_dir(bio) == WRITE) { 1720 1718 pe = __lookup_pending_exception(s, chunk); 1721 1719 if (!pe) { 1722 - up_write(&s->lock); 1720 + mutex_unlock(&s->lock); 1723 1721 pe = alloc_pending_exception(s); 1724 - down_write(&s->lock); 1722 + mutex_lock(&s->lock); 1725 1723 1726 1724 if (!s->valid || s->snapshot_overflowed) { 1727 1725 free_pending_exception(pe); ··· 1756 1754 bio->bi_iter.bi_size == 1757 1755 (s->store->chunk_size << SECTOR_SHIFT)) { 1758 1756 pe->started = 1; 1759 - up_write(&s->lock); 1757 + mutex_unlock(&s->lock); 1760 1758 start_full_bio(pe, bio); 1761 1759 goto out; 1762 1760 } ··· 1766 1764 if (!pe->started) { 1767 1765 /* this is protected by snap->lock */ 1768 1766 pe->started = 1; 1769 - up_write(&s->lock); 1767 + mutex_unlock(&s->lock); 1770 1768 start_copy(pe); 1771 1769 goto out; 1772 1770 } ··· 1776 1774 } 1777 1775 1778 1776 out_unlock: 1779 - up_write(&s->lock); 1777 + mutex_unlock(&s->lock); 1780 1778 out: 1781 1779 return r; 1782 1780 } ··· 1812 1810 1813 1811 chunk = sector_to_chunk(s->store, bio->bi_iter.bi_sector); 1814 1812 1815 - down_write(&s->lock); 1813 + mutex_lock(&s->lock); 1816 1814 1817 1815 /* Full merging snapshots are redirected to the origin */ 1818 1816 if (!s->valid) ··· 1843 1841 bio_set_dev(bio, s->origin->bdev); 1844 1842 1845 1843 if (bio_data_dir(bio) == WRITE) { 1846 - up_write(&s->lock); 1844 + mutex_unlock(&s->lock); 1847 1845 return do_origin(s->origin, bio); 1848 1846 } 1849 1847 1850 1848 out_unlock: 1851 - up_write(&s->lock); 1849 + mutex_unlock(&s->lock); 1852 1850 1853 1851 return r; 1854 1852 } ··· 1880 1878 down_read(&_origins_lock); 1881 1879 (void) __find_snapshots_sharing_cow(s, &snap_src, &snap_dest, NULL); 1882 1880 if (snap_src && snap_dest) { 1883 - down_read(&snap_src->lock); 1881 + mutex_lock(&snap_src->lock); 1884 1882 if (s == snap_src) { 1885 1883 DMERR("Unable to resume 
snapshot source until " 1886 1884 "handover completes."); ··· 1890 1888 "source is suspended."); 1891 1889 r = -EINVAL; 1892 1890 } 1893 - up_read(&snap_src->lock); 1891 + mutex_unlock(&snap_src->lock); 1894 1892 } 1895 1893 up_read(&_origins_lock); 1896 1894 ··· 1936 1934 1937 1935 (void) __find_snapshots_sharing_cow(s, &snap_src, &snap_dest, NULL); 1938 1936 if (snap_src && snap_dest) { 1939 - down_write(&snap_src->lock); 1940 - down_write_nested(&snap_dest->lock, SINGLE_DEPTH_NESTING); 1937 + mutex_lock(&snap_src->lock); 1938 + mutex_lock_nested(&snap_dest->lock, SINGLE_DEPTH_NESTING); 1941 1939 __handover_exceptions(snap_src, snap_dest); 1942 - up_write(&snap_dest->lock); 1943 - up_write(&snap_src->lock); 1940 + mutex_unlock(&snap_dest->lock); 1941 + mutex_unlock(&snap_src->lock); 1944 1942 } 1945 1943 1946 1944 up_read(&_origins_lock); ··· 1955 1953 /* Now we have correct chunk size, reregister */ 1956 1954 reregister_snapshot(s); 1957 1955 1958 - down_write(&s->lock); 1956 + mutex_lock(&s->lock); 1959 1957 s->active = 1; 1960 - up_write(&s->lock); 1958 + mutex_unlock(&s->lock); 1961 1959 } 1962 1960 1963 1961 static uint32_t get_origin_minimum_chunksize(struct block_device *bdev) ··· 1997 1995 switch (type) { 1998 1996 case STATUSTYPE_INFO: 1999 1997 2000 - down_write(&snap->lock); 1998 + mutex_lock(&snap->lock); 2001 1999 2002 2000 if (!snap->valid) 2003 2001 DMEMIT("Invalid"); ··· 2022 2020 DMEMIT("Unknown"); 2023 2021 } 2024 2022 2025 - up_write(&snap->lock); 2023 + mutex_unlock(&snap->lock); 2026 2024 2027 2025 break; 2028 2026 ··· 2088 2086 if (dm_target_is_snapshot_merge(snap->ti)) 2089 2087 continue; 2090 2088 2091 - down_write(&snap->lock); 2089 + mutex_lock(&snap->lock); 2092 2090 2093 2091 /* Only deal with valid and active snapshots */ 2094 2092 if (!snap->valid || !snap->active) ··· 2115 2113 2116 2114 pe = __lookup_pending_exception(snap, chunk); 2117 2115 if (!pe) { 2118 - up_write(&snap->lock); 2116 + mutex_unlock(&snap->lock); 2119 2117 pe = 
alloc_pending_exception(snap); 2120 - down_write(&snap->lock); 2118 + mutex_lock(&snap->lock); 2121 2119 2122 2120 if (!snap->valid) { 2123 2121 free_pending_exception(pe); ··· 2160 2158 } 2161 2159 2162 2160 next_snapshot: 2163 - up_write(&snap->lock); 2161 + mutex_unlock(&snap->lock); 2164 2162 2165 2163 if (pe_to_start_now) { 2166 2164 start_copy(pe_to_start_now);
+1
drivers/md/dm-stats.c
··· 228 228 dm_stat_free(&s->rcu_head); 229 229 } 230 230 free_percpu(stats->last); 231 + mutex_destroy(&stats->mutex); 231 232 } 232 233 233 234 static int dm_stats_create(struct dm_stats *stats, sector_t start, sector_t end,
+88 -28
drivers/md/dm-table.c
··· 866 866 static bool __table_type_bio_based(enum dm_queue_mode table_type) 867 867 { 868 868 return (table_type == DM_TYPE_BIO_BASED || 869 - table_type == DM_TYPE_DAX_BIO_BASED); 869 + table_type == DM_TYPE_DAX_BIO_BASED || 870 + table_type == DM_TYPE_NVME_BIO_BASED); 870 871 } 871 872 872 873 static bool __table_type_request_based(enum dm_queue_mode table_type) ··· 910 909 return true; 911 910 } 912 911 912 + static bool dm_table_does_not_support_partial_completion(struct dm_table *t); 913 + 914 + struct verify_rq_based_data { 915 + unsigned sq_count; 916 + unsigned mq_count; 917 + }; 918 + 919 + static int device_is_rq_based(struct dm_target *ti, struct dm_dev *dev, 920 + sector_t start, sector_t len, void *data) 921 + { 922 + struct request_queue *q = bdev_get_queue(dev->bdev); 923 + struct verify_rq_based_data *v = data; 924 + 925 + if (q->mq_ops) 926 + v->mq_count++; 927 + else 928 + v->sq_count++; 929 + 930 + return queue_is_rq_based(q); 931 + } 932 + 913 933 static int dm_table_determine_type(struct dm_table *t) 914 934 { 915 935 unsigned i; 916 936 unsigned bio_based = 0, request_based = 0, hybrid = 0; 917 - unsigned sq_count = 0, mq_count = 0; 937 + struct verify_rq_based_data v = {.sq_count = 0, .mq_count = 0}; 918 938 struct dm_target *tgt; 919 - struct dm_dev_internal *dd; 920 939 struct list_head *devices = dm_table_get_devices(t); 921 940 enum dm_queue_mode live_md_type = dm_get_md_type(t->md); 922 941 ··· 944 923 /* target already set the table's type */ 945 924 if (t->type == DM_TYPE_BIO_BASED) 946 925 return 0; 926 + else if (t->type == DM_TYPE_NVME_BIO_BASED) { 927 + if (!dm_table_does_not_support_partial_completion(t)) { 928 + DMERR("nvme bio-based is only possible with devices" 929 + " that don't support partial completion"); 930 + return -EINVAL; 931 + } 932 + /* Fallthru, also verify all devices are blk-mq */ 933 + } 947 934 BUG_ON(t->type == DM_TYPE_DAX_BIO_BASED); 948 935 goto verify_rq_based; 949 936 } ··· 966 937 bio_based = 1; 967 938 
968 939 if (bio_based && request_based) { 969 - DMWARN("Inconsistent table: different target types" 970 - " can't be mixed up"); 940 + DMERR("Inconsistent table: different target types" 941 + " can't be mixed up"); 971 942 return -EINVAL; 972 943 } 973 944 } ··· 988 959 /* We must use this table as bio-based */ 989 960 t->type = DM_TYPE_BIO_BASED; 990 961 if (dm_table_supports_dax(t) || 991 - (list_empty(devices) && live_md_type == DM_TYPE_DAX_BIO_BASED)) 962 + (list_empty(devices) && live_md_type == DM_TYPE_DAX_BIO_BASED)) { 992 963 t->type = DM_TYPE_DAX_BIO_BASED; 964 + } else { 965 + /* Check if upgrading to NVMe bio-based is valid or required */ 966 + tgt = dm_table_get_immutable_target(t); 967 + if (tgt && !tgt->max_io_len && dm_table_does_not_support_partial_completion(t)) { 968 + t->type = DM_TYPE_NVME_BIO_BASED; 969 + goto verify_rq_based; /* must be stacked directly on NVMe (blk-mq) */ 970 + } else if (list_empty(devices) && live_md_type == DM_TYPE_NVME_BIO_BASED) { 971 + t->type = DM_TYPE_NVME_BIO_BASED; 972 + } 973 + } 993 974 return 0; 994 975 } 995 976 ··· 1019 980 * (e.g. request completion process for partial completion.) 1020 981 */ 1021 982 if (t->num_targets > 1) { 1022 - DMWARN("Request-based dm doesn't support multiple targets yet"); 983 + DMERR("%s DM doesn't support multiple targets", 984 + t->type == DM_TYPE_NVME_BIO_BASED ? 
"nvme bio-based" : "request-based"); 1023 985 return -EINVAL; 1024 986 } 1025 987 ··· 1037 997 return 0; 1038 998 } 1039 999 1040 - /* Non-request-stackable devices can't be used for request-based dm */ 1041 - list_for_each_entry(dd, devices, list) { 1042 - struct request_queue *q = bdev_get_queue(dd->dm_dev->bdev); 1043 - 1044 - if (!queue_is_rq_based(q)) { 1045 - DMERR("table load rejected: including" 1046 - " non-request-stackable devices"); 1047 - return -EINVAL; 1048 - } 1049 - 1050 - if (q->mq_ops) 1051 - mq_count++; 1052 - else 1053 - sq_count++; 1000 + tgt = dm_table_get_immutable_target(t); 1001 + if (!tgt) { 1002 + DMERR("table load rejected: immutable target is required"); 1003 + return -EINVAL; 1004 + } else if (tgt->max_io_len) { 1005 + DMERR("table load rejected: immutable target that splits IO is not supported"); 1006 + return -EINVAL; 1054 1007 } 1055 - if (sq_count && mq_count) { 1008 + 1009 + /* Non-request-stackable devices can't be used for request-based dm */ 1010 + if (!tgt->type->iterate_devices || 1011 + !tgt->type->iterate_devices(tgt, device_is_rq_based, &v)) { 1012 + DMERR("table load rejected: including non-request-stackable devices"); 1013 + return -EINVAL; 1014 + } 1015 + if (v.sq_count && v.mq_count) { 1056 1016 DMERR("table load rejected: not all devices are blk-mq request-stackable"); 1057 1017 return -EINVAL; 1058 1018 } 1059 - t->all_blk_mq = mq_count > 0; 1019 + t->all_blk_mq = v.mq_count > 0; 1060 1020 1061 - if (t->type == DM_TYPE_MQ_REQUEST_BASED && !t->all_blk_mq) { 1021 + if (!t->all_blk_mq && 1022 + (t->type == DM_TYPE_MQ_REQUEST_BASED || t->type == DM_TYPE_NVME_BIO_BASED)) { 1062 1023 DMERR("table load rejected: all devices are not blk-mq request-stackable"); 1063 1024 return -EINVAL; 1064 1025 } ··· 1120 1079 { 1121 1080 enum dm_queue_mode type = dm_table_get_type(t); 1122 1081 unsigned per_io_data_size = 0; 1123 - struct dm_target *tgt; 1082 + unsigned min_pool_size = 0; 1083 + struct dm_target *ti; 1124 1084 unsigned i; 
1125 1085 1126 1086 if (unlikely(type == DM_TYPE_NONE)) { ··· 1131 1089 1132 1090 if (__table_type_bio_based(type)) 1133 1091 for (i = 0; i < t->num_targets; i++) { 1134 - tgt = t->targets + i; 1135 - per_io_data_size = max(per_io_data_size, tgt->per_io_data_size); 1092 + ti = t->targets + i; 1093 + per_io_data_size = max(per_io_data_size, ti->per_io_data_size); 1094 + min_pool_size = max(min_pool_size, ti->num_flush_bios); 1136 1095 } 1137 1096 1138 - t->mempools = dm_alloc_md_mempools(md, type, t->integrity_supported, per_io_data_size); 1097 + t->mempools = dm_alloc_md_mempools(md, type, t->integrity_supported, 1098 + per_io_data_size, min_pool_size); 1139 1099 if (!t->mempools) 1140 1100 return -ENOMEM; 1141 1101 ··· 1749 1705 return true; 1750 1706 } 1751 1707 1708 + static int device_no_partial_completion(struct dm_target *ti, struct dm_dev *dev, 1709 + sector_t start, sector_t len, void *data) 1710 + { 1711 + char b[BDEVNAME_SIZE]; 1712 + 1713 + /* For now, NVMe devices are the only devices of this class */ 1714 + return (strncmp(bdevname(dev->bdev, b), "nvme", 4) == 0); 1715 + } 1716 + 1717 + static bool dm_table_does_not_support_partial_completion(struct dm_table *t) 1718 + { 1719 + return dm_table_all_devices_attribute(t, device_no_partial_completion); 1720 + } 1721 + 1752 1722 static int device_not_write_same_capable(struct dm_target *ti, struct dm_dev *dev, 1753 1723 sector_t start, sector_t len, void *data) 1754 1724 { ··· 1878 1820 } 1879 1821 blk_queue_write_cache(q, wc, fua); 1880 1822 1823 + if (dm_table_supports_dax(t)) 1824 + queue_flag_set_unlocked(QUEUE_FLAG_DAX, q); 1881 1825 if (dm_table_supports_dax_write_cache(t)) 1882 1826 dax_write_cache(t->md->dax_dev, true); 1883 1827
+8 -1
drivers/md/dm-thin.c
··· 492 492 INIT_LIST_HEAD(&dm_thin_pool_table.pools); 493 493 } 494 494 495 + static void pool_table_exit(void) 496 + { 497 + mutex_destroy(&dm_thin_pool_table.mutex); 498 + } 499 + 495 500 static void __pool_table_insert(struct pool *pool) 496 501 { 497 502 BUG_ON(!mutex_is_locked(&dm_thin_pool_table.mutex)); ··· 1722 1717 bio_op(bio) == REQ_OP_DISCARD) 1723 1718 bio_list_add(&info->defer_bios, bio); 1724 1719 else { 1725 - struct dm_thin_endio_hook *h = dm_per_bio_data(bio, sizeof(struct dm_thin_endio_hook));; 1720 + struct dm_thin_endio_hook *h = dm_per_bio_data(bio, sizeof(struct dm_thin_endio_hook)); 1726 1721 1727 1722 h->shared_read_entry = dm_deferred_entry_inc(info->tc->pool->shared_read_ds); 1728 1723 inc_all_io_entry(info->tc->pool, bio); ··· 4392 4387 dm_unregister_target(&pool_target); 4393 4388 4394 4389 kmem_cache_destroy(_new_mapping_cache); 4390 + 4391 + pool_table_exit(); 4395 4392 } 4396 4393 4397 4394 module_init(dm_thin_init);
+219
drivers/md/dm-unstripe.c
··· 1 + /* 2 + * Copyright (C) 2017 Intel Corporation. 3 + * 4 + * This file is released under the GPL. 5 + */ 6 + 7 + #include "dm.h" 8 + 9 + #include <linux/module.h> 10 + #include <linux/init.h> 11 + #include <linux/blkdev.h> 12 + #include <linux/bio.h> 13 + #include <linux/slab.h> 14 + #include <linux/bitops.h> 15 + #include <linux/device-mapper.h> 16 + 17 + struct unstripe_c { 18 + struct dm_dev *dev; 19 + sector_t physical_start; 20 + 21 + uint32_t stripes; 22 + 23 + uint32_t unstripe; 24 + sector_t unstripe_width; 25 + sector_t unstripe_offset; 26 + 27 + uint32_t chunk_size; 28 + u8 chunk_shift; 29 + }; 30 + 31 + #define DM_MSG_PREFIX "unstriped" 32 + 33 + static void cleanup_unstripe(struct unstripe_c *uc, struct dm_target *ti) 34 + { 35 + if (uc->dev) 36 + dm_put_device(ti, uc->dev); 37 + kfree(uc); 38 + } 39 + 40 + /* 41 + * Construct an unstriped mapping. 42 + * <number of stripes> <chunk size> <stripe #> <dev_path> <offset> 43 + */ 44 + static int unstripe_ctr(struct dm_target *ti, unsigned int argc, char **argv) 45 + { 46 + struct unstripe_c *uc; 47 + sector_t tmp_len; 48 + unsigned long long start; 49 + char dummy; 50 + 51 + if (argc != 5) { 52 + ti->error = "Invalid number of arguments"; 53 + return -EINVAL; 54 + } 55 + 56 + uc = kzalloc(sizeof(*uc), GFP_KERNEL); 57 + if (!uc) { 58 + ti->error = "Memory allocation for unstriped context failed"; 59 + return -ENOMEM; 60 + } 61 + 62 + if (kstrtouint(argv[0], 10, &uc->stripes) || !uc->stripes) { 63 + ti->error = "Invalid stripe count"; 64 + goto err; 65 + } 66 + 67 + if (kstrtouint(argv[1], 10, &uc->chunk_size) || !uc->chunk_size) { 68 + ti->error = "Invalid chunk_size"; 69 + goto err; 70 + } 71 + 72 + // FIXME: must support non power of 2 chunk_size, dm-stripe.c does 73 + if (!is_power_of_2(uc->chunk_size)) { 74 + ti->error = "Non power of 2 chunk_size is not supported yet"; 75 + goto err; 76 + } 77 + 78 + if (kstrtouint(argv[2], 10, &uc->unstripe)) { 79 + ti->error = "Invalid stripe number"; 80 + goto 
err; 81 + } 82 + 83 + if (uc->unstripe > uc->stripes && uc->stripes > 1) { 84 + ti->error = "Please provide stripe between [0, # of stripes]"; 85 + goto err; 86 + } 87 + 88 + if (dm_get_device(ti, argv[3], dm_table_get_mode(ti->table), &uc->dev)) { 89 + ti->error = "Couldn't get striped device"; 90 + goto err; 91 + } 92 + 93 + if (sscanf(argv[4], "%llu%c", &start, &dummy) != 1) { 94 + ti->error = "Invalid striped device offset"; 95 + goto err; 96 + } 97 + uc->physical_start = start; 98 + 99 + uc->unstripe_offset = uc->unstripe * uc->chunk_size; 100 + uc->unstripe_width = (uc->stripes - 1) * uc->chunk_size; 101 + uc->chunk_shift = fls(uc->chunk_size) - 1; 102 + 103 + tmp_len = ti->len; 104 + if (sector_div(tmp_len, uc->chunk_size)) { 105 + ti->error = "Target length not divisible by chunk size"; 106 + goto err; 107 + } 108 + 109 + if (dm_set_target_max_io_len(ti, uc->chunk_size)) { 110 + ti->error = "Failed to set max io len"; 111 + goto err; 112 + } 113 + 114 + ti->private = uc; 115 + return 0; 116 + err: 117 + cleanup_unstripe(uc, ti); 118 + return -EINVAL; 119 + } 120 + 121 + static void unstripe_dtr(struct dm_target *ti) 122 + { 123 + struct unstripe_c *uc = ti->private; 124 + 125 + cleanup_unstripe(uc, ti); 126 + } 127 + 128 + static sector_t map_to_core(struct dm_target *ti, struct bio *bio) 129 + { 130 + struct unstripe_c *uc = ti->private; 131 + sector_t sector = bio->bi_iter.bi_sector; 132 + 133 + /* Shift us up to the right "row" on the stripe */ 134 + sector += uc->unstripe_width * (sector >> uc->chunk_shift); 135 + 136 + /* Account for what stripe we're operating on */ 137 + sector += uc->unstripe_offset; 138 + 139 + return sector; 140 + } 141 + 142 + static int unstripe_map(struct dm_target *ti, struct bio *bio) 143 + { 144 + struct unstripe_c *uc = ti->private; 145 + 146 + bio_set_dev(bio, uc->dev->bdev); 147 + bio->bi_iter.bi_sector = map_to_core(ti, bio) + uc->physical_start; 148 + 149 + return DM_MAPIO_REMAPPED; 150 + } 151 + 152 + static void 
unstripe_status(struct dm_target *ti, status_type_t type, 153 + unsigned int status_flags, char *result, unsigned int maxlen) 154 + { 155 + struct unstripe_c *uc = ti->private; 156 + unsigned int sz = 0; 157 + 158 + switch (type) { 159 + case STATUSTYPE_INFO: 160 + break; 161 + 162 + case STATUSTYPE_TABLE: 163 + DMEMIT("%d %llu %d %s %llu", 164 + uc->stripes, (unsigned long long)uc->chunk_size, uc->unstripe, 165 + uc->dev->name, (unsigned long long)uc->physical_start); 166 + break; 167 + } 168 + } 169 + 170 + static int unstripe_iterate_devices(struct dm_target *ti, 171 + iterate_devices_callout_fn fn, void *data) 172 + { 173 + struct unstripe_c *uc = ti->private; 174 + 175 + return fn(ti, uc->dev, uc->physical_start, ti->len, data); 176 + } 177 + 178 + static void unstripe_io_hints(struct dm_target *ti, 179 + struct queue_limits *limits) 180 + { 181 + struct unstripe_c *uc = ti->private; 182 + 183 + limits->chunk_sectors = uc->chunk_size; 184 + } 185 + 186 + static struct target_type unstripe_target = { 187 + .name = "unstriped", 188 + .version = {1, 0, 0}, 189 + .module = THIS_MODULE, 190 + .ctr = unstripe_ctr, 191 + .dtr = unstripe_dtr, 192 + .map = unstripe_map, 193 + .status = unstripe_status, 194 + .iterate_devices = unstripe_iterate_devices, 195 + .io_hints = unstripe_io_hints, 196 + }; 197 + 198 + static int __init dm_unstripe_init(void) 199 + { 200 + int r; 201 + 202 + r = dm_register_target(&unstripe_target); 203 + if (r < 0) 204 + DMERR("target registration failed"); 205 + 206 + return r; 207 + } 208 + 209 + static void __exit dm_unstripe_exit(void) 210 + { 211 + dm_unregister_target(&unstripe_target); 212 + } 213 + 214 + module_init(dm_unstripe_init); 215 + module_exit(dm_unstripe_exit); 216 + 217 + MODULE_DESCRIPTION(DM_NAME " unstriped target"); 218 + MODULE_AUTHOR("Scott Bauer <scott.bauer@intel.com>"); 219 + MODULE_LICENSE("GPL");
+3
drivers/md/dm-zoned-metadata.c
··· 2333 2333 2334 2334 /* Free the zone descriptors */ 2335 2335 dmz_drop_zones(zmd); 2336 + 2337 + mutex_destroy(&zmd->mblk_flush_lock); 2338 + mutex_destroy(&zmd->map_lock); 2336 2339 } 2337 2340 2338 2341 /*
+3
drivers/md/dm-zoned-target.c
··· 827 827 err_cwq: 828 828 destroy_workqueue(dmz->chunk_wq); 829 829 err_bio: 830 + mutex_destroy(&dmz->chunk_lock); 830 831 bioset_free(dmz->bio_set); 831 832 err_meta: 832 833 dmz_dtr_metadata(dmz->metadata); ··· 861 860 bioset_free(dmz->bio_set); 862 861 863 862 dmz_put_zoned_device(ti); 863 + 864 + mutex_destroy(&dmz->chunk_lock); 864 865 865 866 kfree(dmz); 866 867 }
+390 -269
drivers/md/dm.c
··· 60 60 } 61 61 62 62 /* 63 - * One of these is allocated per bio. 63 + * One of these is allocated (on-stack) per original bio. 64 64 */ 65 + struct clone_info { 66 + struct dm_table *map; 67 + struct bio *bio; 68 + struct dm_io *io; 69 + sector_t sector; 70 + unsigned sector_count; 71 + }; 72 + 73 + /* 74 + * One of these is allocated per clone bio. 75 + */ 76 + #define DM_TIO_MAGIC 7282014 77 + struct dm_target_io { 78 + unsigned magic; 79 + struct dm_io *io; 80 + struct dm_target *ti; 81 + unsigned target_bio_nr; 82 + unsigned *len_ptr; 83 + bool inside_dm_io; 84 + struct bio clone; 85 + }; 86 + 87 + /* 88 + * One of these is allocated per original bio. 89 + * It contains the first clone used for that original. 90 + */ 91 + #define DM_IO_MAGIC 5191977 65 92 struct dm_io { 93 + unsigned magic; 66 94 struct mapped_device *md; 67 95 blk_status_t status; 68 96 atomic_t io_count; 69 - struct bio *bio; 97 + struct bio *orig_bio; 70 98 unsigned long start_time; 71 99 spinlock_t endio_lock; 72 100 struct dm_stats_aux stats_aux; 101 + /* last member of dm_target_io is 'struct bio' */ 102 + struct dm_target_io tio; 73 103 }; 104 + 105 + void *dm_per_bio_data(struct bio *bio, size_t data_size) 106 + { 107 + struct dm_target_io *tio = container_of(bio, struct dm_target_io, clone); 108 + if (!tio->inside_dm_io) 109 + return (char *)bio - offsetof(struct dm_target_io, clone) - data_size; 110 + return (char *)bio - offsetof(struct dm_target_io, clone) - offsetof(struct dm_io, tio) - data_size; 111 + } 112 + EXPORT_SYMBOL_GPL(dm_per_bio_data); 113 + 114 + struct bio *dm_bio_from_per_bio_data(void *data, size_t data_size) 115 + { 116 + struct dm_io *io = (struct dm_io *)((char *)data + data_size); 117 + if (io->magic == DM_IO_MAGIC) 118 + return (struct bio *)((char *)io + offsetof(struct dm_io, tio) + offsetof(struct dm_target_io, clone)); 119 + BUG_ON(io->magic != DM_TIO_MAGIC); 120 + return (struct bio *)((char *)io + offsetof(struct dm_target_io, clone)); 121 + } 122 + 
EXPORT_SYMBOL_GPL(dm_bio_from_per_bio_data); 123 + 124 + unsigned dm_bio_get_target_bio_nr(const struct bio *bio) 125 + { 126 + return container_of(bio, struct dm_target_io, clone)->target_bio_nr; 127 + } 128 + EXPORT_SYMBOL_GPL(dm_bio_get_target_bio_nr); 74 129 75 130 #define MINOR_ALLOCED ((void *)-1) 76 131 ··· 148 93 * For mempools pre-allocation at the table loading time. 149 94 */ 150 95 struct dm_md_mempools { 151 - mempool_t *io_pool; 152 96 struct bio_set *bs; 97 + struct bio_set *io_bs; 153 98 }; 154 99 155 100 struct table_device { ··· 158 103 struct dm_dev dm_dev; 159 104 }; 160 105 161 - static struct kmem_cache *_io_cache; 162 106 static struct kmem_cache *_rq_tio_cache; 163 107 static struct kmem_cache *_rq_cache; 164 108 ··· 224 170 { 225 171 int r = -ENOMEM; 226 172 227 - /* allocate a slab for the dm_ios */ 228 - _io_cache = KMEM_CACHE(dm_io, 0); 229 - if (!_io_cache) 230 - return r; 231 - 232 173 _rq_tio_cache = KMEM_CACHE(dm_rq_target_io, 0); 233 174 if (!_rq_tio_cache) 234 - goto out_free_io_cache; 175 + return r; 235 176 236 177 _rq_cache = kmem_cache_create("dm_old_clone_request", sizeof(struct request), 237 178 __alignof__(struct request), 0, NULL); ··· 261 212 kmem_cache_destroy(_rq_cache); 262 213 out_free_rq_tio_cache: 263 214 kmem_cache_destroy(_rq_tio_cache); 264 - out_free_io_cache: 265 - kmem_cache_destroy(_io_cache); 266 215 267 216 return r; 268 217 } ··· 272 225 273 226 kmem_cache_destroy(_rq_cache); 274 227 kmem_cache_destroy(_rq_tio_cache); 275 - kmem_cache_destroy(_io_cache); 276 228 unregister_blkdev(_major, _name); 277 229 dm_uevent_exit(); 278 230 ··· 532 486 return r; 533 487 } 534 488 535 - static struct dm_io *alloc_io(struct mapped_device *md) 489 + static void start_io_acct(struct dm_io *io); 490 + 491 + static struct dm_io *alloc_io(struct mapped_device *md, struct bio *bio) 536 492 { 537 - return mempool_alloc(md->io_pool, GFP_NOIO); 493 + struct dm_io *io; 494 + struct dm_target_io *tio; 495 + struct bio *clone; 496 + 
497 + clone = bio_alloc_bioset(GFP_NOIO, 0, md->io_bs); 498 + if (!clone) 499 + return NULL; 500 + 501 + tio = container_of(clone, struct dm_target_io, clone); 502 + tio->inside_dm_io = true; 503 + tio->io = NULL; 504 + 505 + io = container_of(tio, struct dm_io, tio); 506 + io->magic = DM_IO_MAGIC; 507 + io->status = 0; 508 + atomic_set(&io->io_count, 1); 509 + io->orig_bio = bio; 510 + io->md = md; 511 + spin_lock_init(&io->endio_lock); 512 + 513 + start_io_acct(io); 514 + 515 + return io; 538 516 } 539 517 540 518 static void free_io(struct mapped_device *md, struct dm_io *io) 541 519 { 542 - mempool_free(io, md->io_pool); 520 + bio_put(&io->tio.clone); 521 + } 522 + 523 + static struct dm_target_io *alloc_tio(struct clone_info *ci, struct dm_target *ti, 524 + unsigned target_bio_nr, gfp_t gfp_mask) 525 + { 526 + struct dm_target_io *tio; 527 + 528 + if (!ci->io->tio.io) { 529 + /* the dm_target_io embedded in ci->io is available */ 530 + tio = &ci->io->tio; 531 + } else { 532 + struct bio *clone = bio_alloc_bioset(gfp_mask, 0, ci->io->md->bs); 533 + if (!clone) 534 + return NULL; 535 + 536 + tio = container_of(clone, struct dm_target_io, clone); 537 + tio->inside_dm_io = false; 538 + } 539 + 540 + tio->magic = DM_TIO_MAGIC; 541 + tio->io = ci->io; 542 + tio->ti = ti; 543 + tio->target_bio_nr = target_bio_nr; 544 + 545 + return tio; 543 546 } 544 547 545 548 static void free_tio(struct dm_target_io *tio) 546 549 { 550 + if (tio->inside_dm_io) 551 + return; 547 552 bio_put(&tio->clone); 548 553 } 549 554 ··· 607 510 static void start_io_acct(struct dm_io *io) 608 511 { 609 512 struct mapped_device *md = io->md; 610 - struct bio *bio = io->bio; 611 - int cpu; 513 + struct bio *bio = io->orig_bio; 612 514 int rw = bio_data_dir(bio); 613 515 614 516 io->start_time = jiffies; 615 517 616 - cpu = part_stat_lock(); 617 - part_round_stats(md->queue, cpu, &dm_disk(md)->part0); 618 - part_stat_unlock(); 518 + generic_start_io_acct(md->queue, rw, bio_sectors(bio), 
&dm_disk(md)->part0); 519 + 619 520 atomic_set(&dm_disk(md)->part0.in_flight[rw], 620 - atomic_inc_return(&md->pending[rw])); 521 + atomic_inc_return(&md->pending[rw])); 621 522 622 523 if (unlikely(dm_stats_used(&md->stats))) 623 524 dm_stats_account_io(&md->stats, bio_data_dir(bio), ··· 626 531 static void end_io_acct(struct dm_io *io) 627 532 { 628 533 struct mapped_device *md = io->md; 629 - struct bio *bio = io->bio; 534 + struct bio *bio = io->orig_bio; 630 535 unsigned long duration = jiffies - io->start_time; 631 536 int pending; 632 537 int rw = bio_data_dir(bio); ··· 847 752 return 0; 848 753 } 849 754 850 - /*----------------------------------------------------------------- 851 - * CRUD START: 852 - * A more elegant soln is in the works that uses the queue 853 - * merge fn, unfortunately there are a couple of changes to 854 - * the block layer that I want to make for this. So in the 855 - * interests of getting something for people to use I give 856 - * you this clearly demarcated crap. 857 - *---------------------------------------------------------------*/ 858 - 859 755 static int __noflush_suspending(struct mapped_device *md) 860 756 { 861 757 return test_bit(DMF_NOFLUSH_SUSPENDING, &md->flags); ··· 866 780 /* Push-back supersedes any I/O errors */ 867 781 if (unlikely(error)) { 868 782 spin_lock_irqsave(&io->endio_lock, flags); 869 - if (!(io->status == BLK_STS_DM_REQUEUE && 870 - __noflush_suspending(md))) 783 + if (!(io->status == BLK_STS_DM_REQUEUE && __noflush_suspending(md))) 871 784 io->status = error; 872 785 spin_unlock_irqrestore(&io->endio_lock, flags); 873 786 } ··· 878 793 */ 879 794 spin_lock_irqsave(&md->deferred_lock, flags); 880 795 if (__noflush_suspending(md)) 881 - bio_list_add_head(&md->deferred, io->bio); 796 + /* NOTE early return due to BLK_STS_DM_REQUEUE below */ 797 + bio_list_add_head(&md->deferred, io->orig_bio); 882 798 else 883 799 /* noflush suspend was interrupted. 
*/ 884 800 io->status = BLK_STS_IOERR; ··· 887 801 } 888 802 889 803 io_error = io->status; 890 - bio = io->bio; 804 + bio = io->orig_bio; 891 805 end_io_acct(io); 892 806 free_io(md, io); 893 807 ··· 933 847 struct mapped_device *md = tio->io->md; 934 848 dm_endio_fn endio = tio->ti->type->end_io; 935 849 936 - if (unlikely(error == BLK_STS_TARGET)) { 850 + if (unlikely(error == BLK_STS_TARGET) && md->type != DM_TYPE_NVME_BIO_BASED) { 937 851 if (bio_op(bio) == REQ_OP_WRITE_SAME && 938 852 !bio->bi_disk->queue->limits.max_write_same_sectors) 939 853 disable_write_same(md); ··· 1091 1005 1092 1006 /* 1093 1007 * A target may call dm_accept_partial_bio only from the map routine. It is 1094 - * allowed for all bio types except REQ_PREFLUSH. 1008 + * allowed for all bio types except REQ_PREFLUSH and REQ_OP_ZONE_RESET. 1095 1009 * 1096 1010 * dm_accept_partial_bio informs the dm that the target only wants to process 1097 1011 * additional n_sectors sectors of the bio and the rest of the data should be ··· 1141 1055 { 1142 1056 #ifdef CONFIG_BLK_DEV_ZONED 1143 1057 struct dm_target_io *tio = container_of(bio, struct dm_target_io, clone); 1144 - struct bio *report_bio = tio->io->bio; 1058 + struct bio *report_bio = tio->io->orig_bio; 1145 1059 struct blk_zone_report_hdr *hdr = NULL; 1146 1060 struct blk_zone *zone; 1147 1061 unsigned int nr_rep = 0; ··· 1208 1122 } 1209 1123 EXPORT_SYMBOL_GPL(dm_remap_zone_report); 1210 1124 1211 - /* 1212 - * Flush current->bio_list when the target map method blocks. 1213 - * This fixes deadlocks in snapshot and possibly in other targets. 
1214 - */ 1215 - struct dm_offload { 1216 - struct blk_plug plug; 1217 - struct blk_plug_cb cb; 1218 - }; 1219 - 1220 - static void flush_current_bio_list(struct blk_plug_cb *cb, bool from_schedule) 1221 - { 1222 - struct dm_offload *o = container_of(cb, struct dm_offload, cb); 1223 - struct bio_list list; 1224 - struct bio *bio; 1225 - int i; 1226 - 1227 - INIT_LIST_HEAD(&o->cb.list); 1228 - 1229 - if (unlikely(!current->bio_list)) 1230 - return; 1231 - 1232 - for (i = 0; i < 2; i++) { 1233 - list = current->bio_list[i]; 1234 - bio_list_init(&current->bio_list[i]); 1235 - 1236 - while ((bio = bio_list_pop(&list))) { 1237 - struct bio_set *bs = bio->bi_pool; 1238 - if (unlikely(!bs) || bs == fs_bio_set || 1239 - !bs->rescue_workqueue) { 1240 - bio_list_add(&current->bio_list[i], bio); 1241 - continue; 1242 - } 1243 - 1244 - spin_lock(&bs->rescue_lock); 1245 - bio_list_add(&bs->rescue_list, bio); 1246 - queue_work(bs->rescue_workqueue, &bs->rescue_work); 1247 - spin_unlock(&bs->rescue_lock); 1248 - } 1249 - } 1250 - } 1251 - 1252 - static void dm_offload_start(struct dm_offload *o) 1253 - { 1254 - blk_start_plug(&o->plug); 1255 - o->cb.callback = flush_current_bio_list; 1256 - list_add(&o->cb.list, &current->plug->cb_list); 1257 - } 1258 - 1259 - static void dm_offload_end(struct dm_offload *o) 1260 - { 1261 - list_del(&o->cb.list); 1262 - blk_finish_plug(&o->plug); 1263 - } 1264 - 1265 - static void __map_bio(struct dm_target_io *tio) 1125 + static blk_qc_t __map_bio(struct dm_target_io *tio) 1266 1126 { 1267 1127 int r; 1268 1128 sector_t sector; 1269 - struct dm_offload o; 1270 1129 struct bio *clone = &tio->clone; 1130 + struct dm_io *io = tio->io; 1131 + struct mapped_device *md = io->md; 1271 1132 struct dm_target *ti = tio->ti; 1133 + blk_qc_t ret = BLK_QC_T_NONE; 1272 1134 1273 1135 clone->bi_end_io = clone_endio; 1274 1136 ··· 1225 1191 * anything, the target has assumed ownership of 1226 1192 * this io. 
1227 1193 */ 1228 - atomic_inc(&tio->io->io_count); 1194 + atomic_inc(&io->io_count); 1229 1195 sector = clone->bi_iter.bi_sector; 1230 1196 1231 - dm_offload_start(&o); 1232 1197 r = ti->type->map(ti, clone); 1233 - dm_offload_end(&o); 1234 - 1235 1198 switch (r) { 1236 1199 case DM_MAPIO_SUBMITTED: 1237 1200 break; 1238 1201 case DM_MAPIO_REMAPPED: 1239 1202 /* the bio has been remapped so dispatch it */ 1240 1203 trace_block_bio_remap(clone->bi_disk->queue, clone, 1241 - bio_dev(tio->io->bio), sector); 1242 - generic_make_request(clone); 1204 + bio_dev(io->orig_bio), sector); 1205 + if (md->type == DM_TYPE_NVME_BIO_BASED) 1206 + ret = direct_make_request(clone); 1207 + else 1208 + ret = generic_make_request(clone); 1243 1209 break; 1244 1210 case DM_MAPIO_KILL: 1245 - dec_pending(tio->io, BLK_STS_IOERR); 1246 1211 free_tio(tio); 1212 + dec_pending(io, BLK_STS_IOERR); 1247 1213 break; 1248 1214 case DM_MAPIO_REQUEUE: 1249 - dec_pending(tio->io, BLK_STS_DM_REQUEUE); 1250 1215 free_tio(tio); 1216 + dec_pending(io, BLK_STS_DM_REQUEUE); 1251 1217 break; 1252 1218 default: 1253 1219 DMWARN("unimplemented target map return value: %d", r); 1254 1220 BUG(); 1255 1221 } 1256 - } 1257 1222 1258 - struct clone_info { 1259 - struct mapped_device *md; 1260 - struct dm_table *map; 1261 - struct bio *bio; 1262 - struct dm_io *io; 1263 - sector_t sector; 1264 - unsigned sector_count; 1265 - }; 1223 + return ret; 1224 + } 1266 1225 1267 1226 static void bio_setup_sector(struct bio *bio, sector_t sector, unsigned len) 1268 1227 { ··· 1299 1272 return 0; 1300 1273 } 1301 1274 1302 - static struct dm_target_io *alloc_tio(struct clone_info *ci, 1303 - struct dm_target *ti, 1304 - unsigned target_bio_nr) 1275 + static void alloc_multiple_bios(struct bio_list *blist, struct clone_info *ci, 1276 + struct dm_target *ti, unsigned num_bios) 1305 1277 { 1306 1278 struct dm_target_io *tio; 1307 - struct bio *clone; 1279 + int try; 1308 1280 1309 - clone = bio_alloc_bioset(GFP_NOIO, 0, 
ci->md->bs); 1310 - tio = container_of(clone, struct dm_target_io, clone); 1281 + if (!num_bios) 1282 + return; 1311 1283 1312 - tio->io = ci->io; 1313 - tio->ti = ti; 1314 - tio->target_bio_nr = target_bio_nr; 1284 + if (num_bios == 1) { 1285 + tio = alloc_tio(ci, ti, 0, GFP_NOIO); 1286 + bio_list_add(blist, &tio->clone); 1287 + return; 1288 + } 1315 1289 1316 - return tio; 1290 + for (try = 0; try < 2; try++) { 1291 + int bio_nr; 1292 + struct bio *bio; 1293 + 1294 + if (try) 1295 + mutex_lock(&ci->io->md->table_devices_lock); 1296 + for (bio_nr = 0; bio_nr < num_bios; bio_nr++) { 1297 + tio = alloc_tio(ci, ti, bio_nr, try ? GFP_NOIO : GFP_NOWAIT); 1298 + if (!tio) 1299 + break; 1300 + 1301 + bio_list_add(blist, &tio->clone); 1302 + } 1303 + if (try) 1304 + mutex_unlock(&ci->io->md->table_devices_lock); 1305 + if (bio_nr == num_bios) 1306 + return; 1307 + 1308 + while ((bio = bio_list_pop(blist))) { 1309 + tio = container_of(bio, struct dm_target_io, clone); 1310 + free_tio(tio); 1311 + } 1312 + } 1317 1313 } 1318 1314 1319 - static void __clone_and_map_simple_bio(struct clone_info *ci, 1320 - struct dm_target *ti, 1321 - unsigned target_bio_nr, unsigned *len) 1315 + static blk_qc_t __clone_and_map_simple_bio(struct clone_info *ci, 1316 + struct dm_target_io *tio, unsigned *len) 1322 1317 { 1323 - struct dm_target_io *tio = alloc_tio(ci, ti, target_bio_nr); 1324 1318 struct bio *clone = &tio->clone; 1325 1319 1326 1320 tio->len_ptr = len; ··· 1350 1302 if (len) 1351 1303 bio_setup_sector(clone, ci->sector, *len); 1352 1304 1353 - __map_bio(tio); 1305 + return __map_bio(tio); 1354 1306 } 1355 1307 1356 1308 static void __send_duplicate_bios(struct clone_info *ci, struct dm_target *ti, 1357 1309 unsigned num_bios, unsigned *len) 1358 1310 { 1359 - unsigned target_bio_nr; 1311 + struct bio_list blist = BIO_EMPTY_LIST; 1312 + struct bio *bio; 1313 + struct dm_target_io *tio; 1360 1314 1361 - for (target_bio_nr = 0; target_bio_nr < num_bios; target_bio_nr++) 1362 - 
__clone_and_map_simple_bio(ci, ti, target_bio_nr, len); 1315 + alloc_multiple_bios(&blist, ci, ti, num_bios); 1316 + 1317 + while ((bio = bio_list_pop(&blist))) { 1318 + tio = container_of(bio, struct dm_target_io, clone); 1319 + (void) __clone_and_map_simple_bio(ci, tio, len); 1320 + } 1363 1321 } 1364 1322 1365 1323 static int __send_empty_flush(struct clone_info *ci) ··· 1381 1327 } 1382 1328 1383 1329 static int __clone_and_map_data_bio(struct clone_info *ci, struct dm_target *ti, 1384 - sector_t sector, unsigned *len) 1330 + sector_t sector, unsigned *len) 1385 1331 { 1386 1332 struct bio *bio = ci->bio; 1387 1333 struct dm_target_io *tio; 1388 - unsigned target_bio_nr; 1389 - unsigned num_target_bios = 1; 1390 - int r = 0; 1334 + int r; 1391 1335 1392 - /* 1393 - * Does the target want to receive duplicate copies of the bio? 1394 - */ 1395 - if (bio_data_dir(bio) == WRITE && ti->num_write_bios) 1396 - num_target_bios = ti->num_write_bios(ti, bio); 1397 - 1398 - for (target_bio_nr = 0; target_bio_nr < num_target_bios; target_bio_nr++) { 1399 - tio = alloc_tio(ci, ti, target_bio_nr); 1400 - tio->len_ptr = len; 1401 - r = clone_bio(tio, bio, sector, *len); 1402 - if (r < 0) { 1403 - free_tio(tio); 1404 - break; 1405 - } 1406 - __map_bio(tio); 1336 + tio = alloc_tio(ci, ti, 0, GFP_NOIO); 1337 + tio->len_ptr = len; 1338 + r = clone_bio(tio, bio, sector, *len); 1339 + if (r < 0) { 1340 + free_tio(tio); 1341 + return r; 1407 1342 } 1343 + (void) __map_bio(tio); 1408 1344 1409 - return r; 1345 + return 0; 1410 1346 } 1411 1347 1412 1348 typedef unsigned (*get_num_bios_fn)(struct dm_target *ti); ··· 1423 1379 return ti->split_discard_bios; 1424 1380 } 1425 1381 1426 - static int __send_changing_extent_only(struct clone_info *ci, 1382 + static int __send_changing_extent_only(struct clone_info *ci, struct dm_target *ti, 1427 1383 get_num_bios_fn get_num_bios, 1428 1384 is_split_required_fn is_split_required) 1429 1385 { 1430 - struct dm_target *ti; 1431 1386 unsigned 
len; 1432 1387 unsigned num_bios; 1433 1388 1434 - do { 1435 - ti = dm_table_find_target(ci->map, ci->sector); 1436 - if (!dm_target_is_valid(ti)) 1437 - return -EIO; 1389 + /* 1390 + * Even though the device advertised support for this type of 1391 + * request, that does not mean every target supports it, and 1392 + * reconfiguration might also have changed that since the 1393 + * check was performed. 1394 + */ 1395 + num_bios = get_num_bios ? get_num_bios(ti) : 0; 1396 + if (!num_bios) 1397 + return -EOPNOTSUPP; 1438 1398 1439 - /* 1440 - * Even though the device advertised support for this type of 1441 - * request, that does not mean every target supports it, and 1442 - * reconfiguration might also have changed that since the 1443 - * check was performed. 1444 - */ 1445 - num_bios = get_num_bios ? get_num_bios(ti) : 0; 1446 - if (!num_bios) 1447 - return -EOPNOTSUPP; 1399 + if (is_split_required && !is_split_required(ti)) 1400 + len = min((sector_t)ci->sector_count, max_io_len_target_boundary(ci->sector, ti)); 1401 + else 1402 + len = min((sector_t)ci->sector_count, max_io_len(ci->sector, ti)); 1448 1403 1449 - if (is_split_required && !is_split_required(ti)) 1450 - len = min((sector_t)ci->sector_count, max_io_len_target_boundary(ci->sector, ti)); 1451 - else 1452 - len = min((sector_t)ci->sector_count, max_io_len(ci->sector, ti)); 1404 + __send_duplicate_bios(ci, ti, num_bios, &len); 1453 1405 1454 - __send_duplicate_bios(ci, ti, num_bios, &len); 1455 - 1456 - ci->sector += len; 1457 - } while (ci->sector_count -= len); 1406 + ci->sector += len; 1407 + ci->sector_count -= len; 1458 1408 1459 1409 return 0; 1460 1410 } 1461 1411 1462 - static int __send_discard(struct clone_info *ci) 1412 + static int __send_discard(struct clone_info *ci, struct dm_target *ti) 1463 1413 { 1464 - return __send_changing_extent_only(ci, get_num_discard_bios, 1414 + return __send_changing_extent_only(ci, ti, get_num_discard_bios, 1465 1415 is_split_required_for_discard); 1466 1416 } 
1467 1417 1468 - static int __send_write_same(struct clone_info *ci) 1418 + static int __send_write_same(struct clone_info *ci, struct dm_target *ti) 1469 1419 { 1470 - return __send_changing_extent_only(ci, get_num_write_same_bios, NULL); 1420 + return __send_changing_extent_only(ci, ti, get_num_write_same_bios, NULL); 1471 1421 } 1472 1422 1473 - static int __send_write_zeroes(struct clone_info *ci) 1423 + static int __send_write_zeroes(struct clone_info *ci, struct dm_target *ti) 1474 1424 { 1475 - return __send_changing_extent_only(ci, get_num_write_zeroes_bios, NULL); 1425 + return __send_changing_extent_only(ci, ti, get_num_write_zeroes_bios, NULL); 1476 1426 } 1477 1427 1478 1428 /* ··· 1479 1441 unsigned len; 1480 1442 int r; 1481 1443 1482 - if (unlikely(bio_op(bio) == REQ_OP_DISCARD)) 1483 - return __send_discard(ci); 1484 - else if (unlikely(bio_op(bio) == REQ_OP_WRITE_SAME)) 1485 - return __send_write_same(ci); 1486 - else if (unlikely(bio_op(bio) == REQ_OP_WRITE_ZEROES)) 1487 - return __send_write_zeroes(ci); 1488 - 1489 1444 ti = dm_table_find_target(ci->map, ci->sector); 1490 1445 if (!dm_target_is_valid(ti)) 1491 1446 return -EIO; 1447 + 1448 + if (unlikely(bio_op(bio) == REQ_OP_DISCARD)) 1449 + return __send_discard(ci, ti); 1450 + else if (unlikely(bio_op(bio) == REQ_OP_WRITE_SAME)) 1451 + return __send_write_same(ci, ti); 1452 + else if (unlikely(bio_op(bio) == REQ_OP_WRITE_ZEROES)) 1453 + return __send_write_zeroes(ci, ti); 1492 1454 1493 1455 if (bio_op(bio) == REQ_OP_ZONE_REPORT) 1494 1456 len = ci->sector_count; ··· 1506 1468 return 0; 1507 1469 } 1508 1470 1471 + static void init_clone_info(struct clone_info *ci, struct mapped_device *md, 1472 + struct dm_table *map, struct bio *bio) 1473 + { 1474 + ci->map = map; 1475 + ci->io = alloc_io(md, bio); 1476 + ci->sector = bio->bi_iter.bi_sector; 1477 + } 1478 + 1509 1479 /* 1510 1480 * Entry point to split a bio into clones and submit them to the targets. 
1511 1481 */ 1512 - static void __split_and_process_bio(struct mapped_device *md, 1513 - struct dm_table *map, struct bio *bio) 1482 + static blk_qc_t __split_and_process_bio(struct mapped_device *md, 1483 + struct dm_table *map, struct bio *bio) 1514 1484 { 1515 1485 struct clone_info ci; 1486 + blk_qc_t ret = BLK_QC_T_NONE; 1516 1487 int error = 0; 1517 1488 1518 1489 if (unlikely(!map)) { 1519 1490 bio_io_error(bio); 1520 - return; 1491 + return ret; 1521 1492 } 1522 1493 1523 - ci.map = map; 1524 - ci.md = md; 1525 - ci.io = alloc_io(md); 1526 - ci.io->status = 0; 1527 - atomic_set(&ci.io->io_count, 1); 1528 - ci.io->bio = bio; 1529 - ci.io->md = md; 1530 - spin_lock_init(&ci.io->endio_lock); 1531 - ci.sector = bio->bi_iter.bi_sector; 1532 - 1533 - start_io_acct(ci.io); 1494 + init_clone_info(&ci, md, map, bio); 1534 1495 1535 1496 if (bio->bi_opf & REQ_PREFLUSH) { 1536 - ci.bio = &ci.md->flush_bio; 1497 + ci.bio = &ci.io->md->flush_bio; 1537 1498 ci.sector_count = 0; 1538 1499 error = __send_empty_flush(&ci); 1539 1500 /* dec_pending submits any data associated with flush */ ··· 1543 1506 } else { 1544 1507 ci.bio = bio; 1545 1508 ci.sector_count = bio_sectors(bio); 1546 - while (ci.sector_count && !error) 1509 + while (ci.sector_count && !error) { 1547 1510 error = __split_and_process_non_flush(&ci); 1511 + if (current->bio_list && ci.sector_count && !error) { 1512 + /* 1513 + * Remainder must be passed to generic_make_request() 1514 + * so that it gets handled *after* bios already submitted 1515 + * have been completely processed. 1516 + * We take a clone of the original to store in 1517 + * ci.io->orig_bio to be used by end_io_acct() and 1518 + * for dec_pending to use for completion handling. 1519 + * As this path is not used for REQ_OP_ZONE_REPORT, 1520 + * the usage of io->orig_bio in dm_remap_zone_report() 1521 + * won't be affected by this reassignment. 
1522 + */ 1523 + struct bio *b = bio_clone_bioset(bio, GFP_NOIO, 1524 + md->queue->bio_split); 1525 + ci.io->orig_bio = b; 1526 + bio_advance(bio, (bio_sectors(bio) - ci.sector_count) << 9); 1527 + bio_chain(b, bio); 1528 + ret = generic_make_request(bio); 1529 + break; 1530 + } 1531 + } 1548 1532 } 1549 1533 1550 1534 /* drop the extra reference count */ 1551 1535 dec_pending(ci.io, errno_to_blk_status(error)); 1536 + return ret; 1552 1537 } 1553 - /*----------------------------------------------------------------- 1554 - * CRUD END 1555 - *---------------------------------------------------------------*/ 1556 1538 1557 1539 /* 1558 - * The request function that just remaps the bio built up by 1559 - * dm_merge_bvec. 1540 + * Optimized variant of __split_and_process_bio that leverages the 1541 + * fact that targets that use it do _not_ have a need to split bios. 1560 1542 */ 1561 - static blk_qc_t dm_make_request(struct request_queue *q, struct bio *bio) 1543 + static blk_qc_t __process_bio(struct mapped_device *md, 1544 + struct dm_table *map, struct bio *bio) 1562 1545 { 1563 - int rw = bio_data_dir(bio); 1546 + struct clone_info ci; 1547 + blk_qc_t ret = BLK_QC_T_NONE; 1548 + int error = 0; 1549 + 1550 + if (unlikely(!map)) { 1551 + bio_io_error(bio); 1552 + return ret; 1553 + } 1554 + 1555 + init_clone_info(&ci, md, map, bio); 1556 + 1557 + if (bio->bi_opf & REQ_PREFLUSH) { 1558 + ci.bio = &ci.io->md->flush_bio; 1559 + ci.sector_count = 0; 1560 + error = __send_empty_flush(&ci); 1561 + /* dec_pending submits any data associated with flush */ 1562 + } else { 1563 + struct dm_target *ti = md->immutable_target; 1564 + struct dm_target_io *tio; 1565 + 1566 + /* 1567 + * Defend against IO still getting in during teardown 1568 + * - as was seen for a time with nvme-fcloop 1569 + */ 1570 + if (unlikely(WARN_ON_ONCE(!ti || !dm_target_is_valid(ti)))) { 1571 + error = -EIO; 1572 + goto out; 1573 + } 1574 + 1575 + tio = alloc_tio(&ci, ti, 0, GFP_NOIO); 1576 + ci.bio = 
bio; 1577 + ci.sector_count = bio_sectors(bio); 1578 + ret = __clone_and_map_simple_bio(&ci, tio, NULL); 1579 + } 1580 + out: 1581 + /* drop the extra reference count */ 1582 + dec_pending(ci.io, errno_to_blk_status(error)); 1583 + return ret; 1584 + } 1585 + 1586 + typedef blk_qc_t (process_bio_fn)(struct mapped_device *, struct dm_table *, struct bio *); 1587 + 1588 + static blk_qc_t __dm_make_request(struct request_queue *q, struct bio *bio, 1589 + process_bio_fn process_bio) 1590 + { 1564 1591 struct mapped_device *md = q->queuedata; 1592 + blk_qc_t ret = BLK_QC_T_NONE; 1565 1593 int srcu_idx; 1566 1594 struct dm_table *map; 1567 1595 1568 1596 map = dm_get_live_table(md, &srcu_idx); 1569 - 1570 - generic_start_io_acct(q, rw, bio_sectors(bio), &dm_disk(md)->part0); 1571 1597 1572 1598 /* if we're suspended, we have to queue this io for later */ 1573 1599 if (unlikely(test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags))) { ··· 1640 1540 queue_io(md, bio); 1641 1541 else 1642 1542 bio_io_error(bio); 1643 - return BLK_QC_T_NONE; 1543 + return ret; 1644 1544 } 1645 1545 1646 - __split_and_process_bio(md, map, bio); 1546 + ret = process_bio(md, map, bio); 1547 + 1647 1548 dm_put_live_table(md, srcu_idx); 1648 - return BLK_QC_T_NONE; 1549 + return ret; 1550 + } 1551 + 1552 + /* 1553 + * The request function that remaps the bio to one target and 1554 + * splits off any remainder. 
1555 + */ 1556 + static blk_qc_t dm_make_request(struct request_queue *q, struct bio *bio) 1557 + { 1558 + return __dm_make_request(q, bio, __split_and_process_bio); 1559 + } 1560 + 1561 + static blk_qc_t dm_make_request_nvme(struct request_queue *q, struct bio *bio) 1562 + { 1563 + return __dm_make_request(q, bio, __process_bio); 1649 1564 } 1650 1565 1651 1566 static int dm_any_congested(void *congested_data, int bdi_bits) ··· 1741 1626 1742 1627 static void dm_wq_work(struct work_struct *work); 1743 1628 1744 - void dm_init_md_queue(struct mapped_device *md) 1745 - { 1746 - /* 1747 - * Initialize data that will only be used by a non-blk-mq DM queue 1748 - * - must do so here (in alloc_dev callchain) before queue is used 1749 - */ 1750 - md->queue->queuedata = md; 1751 - md->queue->backing_dev_info->congested_data = md; 1752 - } 1753 - 1754 - void dm_init_normal_md_queue(struct mapped_device *md) 1629 + static void dm_init_normal_md_queue(struct mapped_device *md) 1755 1630 { 1756 1631 md->use_blk_mq = false; 1757 - dm_init_md_queue(md); 1758 1632 1759 1633 /* 1760 1634 * Initialize aspects of queue that aren't relevant for blk-mq ··· 1757 1653 destroy_workqueue(md->wq); 1758 1654 if (md->kworker_task) 1759 1655 kthread_stop(md->kworker_task); 1760 - mempool_destroy(md->io_pool); 1761 1656 if (md->bs) 1762 1657 bioset_free(md->bs); 1658 + if (md->io_bs) 1659 + bioset_free(md->io_bs); 1763 1660 1764 1661 if (md->dax_dev) { 1765 1662 kill_dax(md->dax_dev); ··· 1785 1680 bdput(md->bdev); 1786 1681 md->bdev = NULL; 1787 1682 } 1683 + 1684 + mutex_destroy(&md->suspend_lock); 1685 + mutex_destroy(&md->type_lock); 1686 + mutex_destroy(&md->table_devices_lock); 1788 1687 1789 1688 dm_mq_cleanup_mapped_device(md); 1790 1689 } ··· 1843 1734 md->queue = blk_alloc_queue_node(GFP_KERNEL, numa_node_id); 1844 1735 if (!md->queue) 1845 1736 goto bad; 1737 + md->queue->queuedata = md; 1738 + md->queue->backing_dev_info->congested_data = md; 1846 1739 1847 - dm_init_md_queue(md); 
1848 - 1849 - md->disk = alloc_disk_node(1, numa_node_id); 1740 + md->disk = alloc_disk_node(1, md->numa_node_id); 1850 1741 if (!md->disk) 1851 1742 goto bad; 1852 1743 ··· 1929 1820 { 1930 1821 struct dm_md_mempools *p = dm_table_get_md_mempools(t); 1931 1822 1932 - if (md->bs) { 1933 - /* The md already has necessary mempools. */ 1934 - if (dm_table_bio_based(t)) { 1935 - /* 1936 - * Reload bioset because front_pad may have changed 1937 - * because a different table was loaded. 1938 - */ 1823 + if (dm_table_bio_based(t)) { 1824 + /* 1825 + * The md may already have mempools that need changing. 1826 + * If so, reload bioset because front_pad may have changed 1827 + * because a different table was loaded. 1828 + */ 1829 + if (md->bs) { 1939 1830 bioset_free(md->bs); 1940 - md->bs = p->bs; 1941 - p->bs = NULL; 1831 + md->bs = NULL; 1942 1832 } 1833 + if (md->io_bs) { 1834 + bioset_free(md->io_bs); 1835 + md->io_bs = NULL; 1836 + } 1837 + 1838 + } else if (md->bs) { 1943 1839 /* 1944 1840 * There's no need to reload with request-based dm 1945 1841 * because the size of front_pad doesn't change. ··· 1956 1842 goto out; 1957 1843 } 1958 1844 1959 - BUG_ON(!p || md->io_pool || md->bs); 1845 + BUG_ON(!p || md->bs || md->io_bs); 1960 1846 1961 - md->io_pool = p->io_pool; 1962 - p->io_pool = NULL; 1963 1847 md->bs = p->bs; 1964 1848 p->bs = NULL; 1965 - 1849 + md->io_bs = p->io_bs; 1850 + p->io_bs = NULL; 1966 1851 out: 1967 1852 /* mempool bind completed, no longer need any mempools in the table */ 1968 1853 dm_table_free_md_mempools(t); ··· 2007 1894 { 2008 1895 struct dm_table *old_map; 2009 1896 struct request_queue *q = md->queue; 1897 + bool request_based = dm_table_request_based(t); 2010 1898 sector_t size; 2011 1899 2012 1900 lockdep_assert_held(&md->suspend_lock); ··· 2031 1917 * This must be done before setting the queue restrictions, 2032 1918 * because request-based dm may be run just after the setting. 
2033 1919 */ 2034 - if (dm_table_request_based(t)) { 1920 + if (request_based) 2035 1921 dm_stop_queue(q); 1922 + 1923 + if (request_based || md->type == DM_TYPE_NVME_BIO_BASED) { 2036 1924 /* 2037 - * Leverage the fact that request-based DM targets are 2038 - * immutable singletons and establish md->immutable_target 2039 - * - used to optimize both dm_request_fn and dm_mq_queue_rq 1925 + * Leverage the fact that request-based DM targets and 1926 + * NVMe bio based targets are immutable singletons 1927 + * - used to optimize both dm_request_fn and dm_mq_queue_rq; 1928 + * and __process_bio. 2040 1929 */ 2041 1930 md->immutable_target = dm_table_get_immutable_target(t); 2042 1931 } ··· 2079 1962 */ 2080 1963 int dm_create(int minor, struct mapped_device **result) 2081 1964 { 1965 + int r; 2082 1966 struct mapped_device *md; 2083 1967 2084 1968 md = alloc_dev(minor); 2085 1969 if (!md) 2086 1970 return -ENXIO; 2087 1971 2088 - dm_sysfs_init(md); 1972 + r = dm_sysfs_init(md); 1973 + if (r) { 1974 + free_dev(md); 1975 + return r; 1976 + } 2089 1977 2090 1978 *result = md; 2091 1979 return 0; ··· 2148 2026 2149 2027 switch (type) { 2150 2028 case DM_TYPE_REQUEST_BASED: 2029 + dm_init_normal_md_queue(md); 2151 2030 r = dm_old_init_request_queue(md, t); 2152 2031 if (r) { 2153 2032 DMERR("Cannot initialize queue for request-based mapped device"); ··· 2166 2043 case DM_TYPE_DAX_BIO_BASED: 2167 2044 dm_init_normal_md_queue(md); 2168 2045 blk_queue_make_request(md->queue, dm_make_request); 2169 - /* 2170 - * DM handles splitting bios as needed. Free the bio_split bioset 2171 - * since it won't be used (saves 1 process per bio-based DM device). 
2172 - */ 2173 - bioset_free(md->queue->bio_split); 2174 - md->queue->bio_split = NULL; 2175 - 2176 - if (type == DM_TYPE_DAX_BIO_BASED) 2177 - queue_flag_set_unlocked(QUEUE_FLAG_DAX, md->queue); 2046 + break; 2047 + case DM_TYPE_NVME_BIO_BASED: 2048 + dm_init_normal_md_queue(md); 2049 + blk_queue_make_request(md->queue, dm_make_request_nvme); 2178 2050 break; 2179 2051 case DM_TYPE_NONE: 2180 2052 WARN_ON_ONCE(true); ··· 2248 2130 2249 2131 static void __dm_destroy(struct mapped_device *md, bool wait) 2250 2132 { 2251 - struct request_queue *q = dm_get_md_queue(md); 2252 2133 struct dm_table *map; 2253 2134 int srcu_idx; 2254 2135 ··· 2258 2141 set_bit(DMF_FREEING, &md->flags); 2259 2142 spin_unlock(&_minor_lock); 2260 2143 2261 - blk_set_queue_dying(q); 2144 + blk_set_queue_dying(md->queue); 2262 2145 2263 2146 if (dm_request_based(md) && md->kworker_task) 2264 2147 kthread_flush_worker(&md->kworker); ··· 2869 2752 EXPORT_SYMBOL_GPL(dm_noflush_suspending); 2870 2753 2871 2754 struct dm_md_mempools *dm_alloc_md_mempools(struct mapped_device *md, enum dm_queue_mode type, 2872 - unsigned integrity, unsigned per_io_data_size) 2755 + unsigned integrity, unsigned per_io_data_size, 2756 + unsigned min_pool_size) 2873 2757 { 2874 2758 struct dm_md_mempools *pools = kzalloc_node(sizeof(*pools), GFP_KERNEL, md->numa_node_id); 2875 2759 unsigned int pool_size = 0; 2876 - unsigned int front_pad; 2760 + unsigned int front_pad, io_front_pad; 2877 2761 2878 2762 if (!pools) 2879 2763 return NULL; ··· 2882 2764 switch (type) { 2883 2765 case DM_TYPE_BIO_BASED: 2884 2766 case DM_TYPE_DAX_BIO_BASED: 2885 - pool_size = dm_get_reserved_bio_based_ios(); 2767 + case DM_TYPE_NVME_BIO_BASED: 2768 + pool_size = max(dm_get_reserved_bio_based_ios(), min_pool_size); 2886 2769 front_pad = roundup(per_io_data_size, __alignof__(struct dm_target_io)) + offsetof(struct dm_target_io, clone); 2887 - 2888 - pools->io_pool = mempool_create_slab_pool(pool_size, _io_cache); 2889 - if (!pools->io_pool) 
2770 + io_front_pad = roundup(front_pad, __alignof__(struct dm_io)) + offsetof(struct dm_io, tio); 2771 + pools->io_bs = bioset_create(pool_size, io_front_pad, 0); 2772 + if (!pools->io_bs) 2773 + goto out; 2774 + if (integrity && bioset_integrity_create(pools->io_bs, pool_size)) 2890 2775 goto out; 2891 2776 break; 2892 2777 case DM_TYPE_REQUEST_BASED: 2893 2778 case DM_TYPE_MQ_REQUEST_BASED: 2894 - pool_size = dm_get_reserved_rq_based_ios(); 2779 + pool_size = max(dm_get_reserved_rq_based_ios(), min_pool_size); 2895 2780 front_pad = offsetof(struct dm_rq_clone_bio_info, clone); 2896 2781 /* per_io_data_size is used for blk-mq pdu at queue allocation */ 2897 2782 break; ··· 2902 2781 BUG(); 2903 2782 } 2904 2783 2905 - pools->bs = bioset_create(pool_size, front_pad, BIOSET_NEED_RESCUER); 2784 + pools->bs = bioset_create(pool_size, front_pad, 0); 2906 2785 if (!pools->bs) 2907 2786 goto out; 2908 2787 ··· 2922 2801 if (!pools) 2923 2802 return; 2924 2803 2925 - mempool_destroy(pools->io_pool); 2926 - 2927 2804 if (pools->bs) 2928 2805 bioset_free(pools->bs); 2806 + if (pools->io_bs) 2807 + bioset_free(pools->io_bs); 2929 2808 2930 2809 kfree(pools); 2931 2810 }
drivers/md/dm.h (+2 -2)
··· 49 49 /*----------------------------------------------------------------- 50 50 * Internal table functions. 51 51 *---------------------------------------------------------------*/ 52 - void dm_table_destroy(struct dm_table *t); 53 52 void dm_table_event_callback(struct dm_table *t, 54 53 void (*fn)(void *), void *context); 55 54 struct dm_target *dm_table_get_target(struct dm_table *t, unsigned int index); ··· 205 206 * Mempool operations 206 207 */ 207 208 struct dm_md_mempools *dm_alloc_md_mempools(struct mapped_device *md, enum dm_queue_mode type, 208 - unsigned integrity, unsigned per_bio_data_size); 209 + unsigned integrity, unsigned per_bio_data_size, 210 + unsigned min_pool_size); 209 211 void dm_free_md_mempools(struct dm_md_mempools *pools); 210 212 211 213 /*
include/linux/device-mapper.h (+11 -45)
··· 28 28 DM_TYPE_REQUEST_BASED = 2, 29 29 DM_TYPE_MQ_REQUEST_BASED = 3, 30 30 DM_TYPE_DAX_BIO_BASED = 4, 31 + DM_TYPE_NVME_BIO_BASED = 5, 31 32 }; 32 33 33 34 typedef enum { STATUSTYPE_INFO, STATUSTYPE_TABLE } status_type_t; ··· 222 221 #define dm_target_is_wildcard(type) ((type)->features & DM_TARGET_WILDCARD) 223 222 224 223 /* 225 - * Some targets need to be sent the same WRITE bio severals times so 226 - * that they can send copies of it to different devices. This function 227 - * examines any supplied bio and returns the number of copies of it the 228 - * target requires. 229 - */ 230 - typedef unsigned (*dm_num_write_bios_fn) (struct dm_target *ti, struct bio *bio); 231 - 232 - /* 233 224 * A target implements own bio data integrity. 234 225 */ 235 226 #define DM_TARGET_INTEGRITY 0x00000010 ··· 284 291 */ 285 292 unsigned per_io_data_size; 286 293 287 - /* 288 - * If defined, this function is called to find out how many 289 - * duplicate bios should be sent to the target when writing 290 - * data. 291 - */ 292 - dm_num_write_bios_fn num_write_bios; 293 - 294 294 /* target specific data */ 295 295 void *private; 296 296 ··· 315 329 int (*congested_fn) (struct dm_target_callbacks *, int); 316 330 }; 317 331 318 - /* 319 - * For bio-based dm. 320 - * One of these is allocated for each bio. 321 - * This structure shouldn't be touched directly by target drivers. 
322 - * It is here so that we can inline dm_per_bio_data and 323 - * dm_bio_from_per_bio_data 324 - */ 325 - struct dm_target_io { 326 - struct dm_io *io; 327 - struct dm_target *ti; 328 - unsigned target_bio_nr; 329 - unsigned *len_ptr; 330 - struct bio clone; 331 - }; 332 - 333 - static inline void *dm_per_bio_data(struct bio *bio, size_t data_size) 334 - { 335 - return (char *)bio - offsetof(struct dm_target_io, clone) - data_size; 336 - } 337 - 338 - static inline struct bio *dm_bio_from_per_bio_data(void *data, size_t data_size) 339 - { 340 - return (struct bio *)((char *)data + data_size + offsetof(struct dm_target_io, clone)); 341 - } 342 - 343 - static inline unsigned dm_bio_get_target_bio_nr(const struct bio *bio) 344 - { 345 - return container_of(bio, struct dm_target_io, clone)->target_bio_nr; 346 - } 332 + void *dm_per_bio_data(struct bio *bio, size_t data_size); 333 + struct bio *dm_bio_from_per_bio_data(void *data, size_t data_size); 334 + unsigned dm_bio_get_target_bio_nr(const struct bio *bio); 347 335 348 336 int dm_register_target(struct target_type *t); 349 337 void dm_unregister_target(struct target_type *t); ··· 460 500 int dm_table_complete(struct dm_table *t); 461 501 462 502 /* 503 + * Destroy the table when finished. 504 + */ 505 + void dm_table_destroy(struct dm_table *t); 506 + 507 + /* 463 508 * Target may require that it is never sent I/O larger than len. 464 509 */ 465 510 int __must_check dm_set_target_max_io_len(struct dm_target *ti, sector_t len); ··· 550 585 #define DM_ENDIO_DONE 0 551 586 #define DM_ENDIO_INCOMPLETE 1 552 587 #define DM_ENDIO_REQUEUE 2 588 + #define DM_ENDIO_DELAY_REQUEUE 3 553 589 554 590 /* 555 591 * Definitions of return values from target map function. 
··· 558 592 #define DM_MAPIO_SUBMITTED 0 559 593 #define DM_MAPIO_REMAPPED 1 560 594 #define DM_MAPIO_REQUEUE DM_ENDIO_REQUEUE 561 - #define DM_MAPIO_DELAY_REQUEUE 3 595 + #define DM_MAPIO_DELAY_REQUEUE DM_ENDIO_DELAY_REQUEUE 562 596 #define DM_MAPIO_KILL 4 563 597 564 598 #define dm_sector_div64(x, y)( \