Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'for-5.8/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

- The largest change for this cycle is the DM zoned target's metadata
version 2 feature, which adds support for pairing regular block devices
with a zoned device to ease the performance impact associated with the
finite random zones of a zoned device.

The changes came in three batches: the first prepared for and then
added the ability to pair a single regular block device, the second
was a batch of fixes to improve zoned's reclaim heuristic, and the
third removed the limitation of only adding a single additional
regular block device to allow many devices.

Testing has shown linear scaling as more devices are added.
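
For illustration, the updated dm-zoned documentation (see the
dm-zoned.rst diff below) formats and starts such a pairing with the
regular block device listed first; the device names are placeholders:

    dmzadm --format /dev/sdxx /dev/sdyy
    dmzadm --start /dev/sdxx /dev/sdyy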

- Add a new emulated block size (ebs) target that emulates a smaller
logical_block_size than the underlying block device supports.

The primary use-case is to emulate "512e" devices that have a 512 byte
logical_block_size and a 4KB physical_block_size. This is useful for
some legacy applications that otherwise couldn't run on 4K devices
because they depend on issuing IO at 512 byte granularity.
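
A minimal table line for the new target, following the parameter order
documented in dm-ebs.rst below (<dev path> <offset> <emulated sectors>);
the device name, table length and offset here are only illustrative:

    # 512 byte logical blocks on top of /dev/sda, starting at sector 1024
    echo "0 409600 ebs /dev/sda 1024 1" | dmsetup create ebs-dev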

- Add discard interfaces to DM bufio. The first consumer of the
interface is the dm-ebs target, which makes heavy use of dm-bufio.
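
The two interfaces added by this series are quoted below from the
dm-bufio.c diff; the small helper is only a sketch of how a consumer
such as dm-ebs uses them (drop any cached buffers for the range, then
pass the discard down):

    int dm_bufio_issue_discard(struct dm_bufio_client *c,
                               sector_t block, sector_t count);
    void dm_bufio_forget_buffers(struct dm_bufio_client *c,
                                 sector_t block, sector_t n_blocks);

    /* hypothetical helper, modeled on __ebs_forget_bio()/__ebs_discard_bio() */
    static int discard_range(struct dm_bufio_client *c,
                             sector_t block, sector_t n_blocks)
    {
            dm_bufio_forget_buffers(c, block, n_blocks);
            return n_blocks ? dm_bufio_issue_discard(c, block, n_blocks) : 0;
    }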

- Fix DM crypt's block queue_limits stacking to not truncate
logical_block_size.

- Add Documentation for DM integrity's status line.
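
For illustration (device name and numbers are made up), the newly
documented status fields would read:

    # dmsetup status my-integrity
    0 1953792 integrity 0 1937408 -

i.e. zero integrity mismatches, 1937408 provided data sectors, and '-'
meaning no recalculation is in progress.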

- Switch DMDEBUG from a compile-time config option to dynamic debug via
pr_debug.
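
Assuming CONFIG_DYNAMIC_DEBUG is enabled and debugfs is mounted, the DM
debug messages can now be switched on at runtime, e.g.:

    # all pr_debug() messages from DM core
    echo 'module dm_mod +p' > /sys/kernel/debug/dynamic_debug/control
    # or only the multipath ones
    echo 'file dm-mpath.c +p' > /sys/kernel/debug/dynamic_debug/control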

- Fix the DM multipath target's heuristic for how it manages
"queue_if_no_path" state internally.

DM multipath now avoids disabling "queue_if_no_path" unless it is
actually needed (e.g. in response to a configured timeout or an
explicit "fail_if_no_path" message).

This fixes reports of spurious -EIO errors being returned to userspace
applications during fault tolerance testing with an NVMe backend.
Various dynamic DMDEBUG messages were added to assist with debugging
queue_if_no_path in the future.
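
Both states remain controllable from userspace via the existing
messages handled in drivers/md/dm-mpath.c (the device name is a
placeholder):

    dmsetup message mpath-dev 0 fail_if_no_path
    dmsetup message mpath-dev 0 queue_if_no_path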

- Add a new DM multipath "Historical Service Time" Path Selector.
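
The selector registers as "historical-service-time" with optional
<base_weight> and <threshold_multiplier> arguments (see
dm-historical-service-time.c below). A sketch of a two-path multipath
table using it, with hypothetical length and device numbers:

    echo "0 409600 multipath 0 0 1 1 historical-service-time 0 2 1 8:16 1 8:32 1" \
        | dmsetup create mpath-hst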

- Fix DM multipath's dm_blk_ioctl() to switch paths on IO error.

- Improve DM writecache target performance by using explicit cache
flushing for the target's single-threaded use case, plus a small
cleanup to remove an unnecessary test in persistent_memory_claim.

- Other small cleanups in DM core, dm-persistent-data, and DM
integrity.

* tag 'for-5.8/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (62 commits)
dm crypt: avoid truncating the logical block size
dm mpath: add DM device name to Failing/Reinstating path log messages
dm mpath: enhance queue_if_no_path debugging
dm mpath: restrict queue_if_no_path state machine
dm mpath: simplify __must_push_back
dm zoned: check superblock location
dm zoned: prefer full zones for reclaim
dm zoned: select reclaim zone based on device index
dm zoned: allocate zone by device index
dm zoned: support arbitrary number of devices
dm zoned: move random and sequential zones into struct dmz_dev
dm zoned: per-device reclaim
dm zoned: add metadata pointer to struct dmz_dev
dm zoned: add device pointer to struct dm_zone
dm zoned: allocate temporary superblock for tertiary devices
dm zoned: convert to xarray
dm zoned: add a 'reserved' zone flag
dm zoned: improve logging messages for reclaim
dm zoned: avoid unnecessary device recalulation for secondary superblock
dm zoned: add debugging message for reading superblocks
...

+2779 -649
+51
Documentation/admin-guide/device-mapper/dm-ebs.rst
··· 1 + ====== 2 + dm-ebs 3 + ====== 4 + 5 + 6 + This target is similar to the linear target except that it emulates 7 + a smaller logical block size on a device with a larger logical block 8 + size. Its main purpose is to provide emulation of 512 byte sectors on 9 + devices that do not provide this emulation (i.e. 4K native disks). 10 + 11 + Supported emulated logical block sizes 512, 1024, 2048 and 4096. 12 + 13 + Underlying block size can be set to > 4K to test buffering larger units. 14 + 15 + 16 + Table parameters 17 + ---------------- 18 + <dev path> <offset> <emulated sectors> [<underlying sectors>] 19 + 20 + Mandatory parameters: 21 + 22 + <dev path>: 23 + Full pathname to the underlying block-device, 24 + or a "major:minor" device-number. 25 + <offset>: 26 + Starting sector within the device; 27 + has to be a multiple of <emulated sectors>. 28 + <emulated sectors>: 29 + Number of sectors defining the logical block size to be emulated; 30 + 1, 2, 4, 8 sectors of 512 bytes supported. 31 + 32 + Optional parameter: 33 + 34 + <underyling sectors>: 35 + Number of sectors defining the logical block size of <dev path>. 36 + 2^N supported, e.g. 8 = emulate 8 sectors of 512 bytes = 4KiB. 37 + If not provided, the logical block size of <dev path> will be used. 38 + 39 + 40 + Examples: 41 + 42 + Emulate 1 sector = 512 bytes logical block size on /dev/sda starting at 43 + offset 1024 sectors with underlying devices block size automatically set: 44 + 45 + ebs /dev/sda 1024 1 46 + 47 + Emulate 2 sector = 1KiB logical block size on /dev/sda starting at 48 + offset 128 sectors, enforce 2KiB underlying device block size. 49 + This presumes 2KiB logical blocksize on /dev/sda or less to work: 50 + 51 + ebs /dev/sda 128 2 4
+8
Documentation/admin-guide/device-mapper/dm-integrity.rst
··· 193 193 data depend on them and the reloaded target would be non-functional. 194 194 195 195 196 + Status line: 197 + 198 + 1. the number of integrity mismatches 199 + 2. provided data sectors - that is the number of sectors that the user 200 + could use 201 + 3. the current recalculating position (or '-' if we didn't recalculate) 202 + 203 + 196 204 The layout of the formatted block device: 197 205 198 206 * reserved sectors
+55 -7
Documentation/admin-guide/device-mapper/dm-zoned.rst
··· 37 37 dm-zoned implements an on-disk buffering scheme to handle non-sequential 38 38 write accesses to the sequential zones of a zoned block device. 39 39 Conventional zones are used for caching as well as for storing internal 40 - metadata. 40 + metadata. It can also use a regular block device together with the zoned 41 + block device; in that case the regular block device will be split logically 42 + in zones with the same size as the zoned block device. These zones will be 43 + placed in front of the zones from the zoned block device and will be handled 44 + just like conventional zones. 41 45 42 - The zones of the device are separated into 2 types: 46 + The zones of the device(s) are separated into 2 types: 43 47 44 48 1) Metadata zones: these are conventional zones used to store metadata. 45 49 Metadata zones are not reported as useable capacity to the user. ··· 131 127 discard requests. Read requests can be processed concurrently while 132 128 metadata flush is being executed. 133 129 130 + If a regular device is used in conjunction with the zoned block device, 131 + a third set of metadata (without the zone bitmaps) is written to the 132 + start of the zoned block device. This metadata has a generation counter of 133 + '0' and will never be updated during normal operation; it just serves for 134 + identification purposes. The first and second copy of the metadata 135 + are located at the start of the regular block device. 136 + 134 137 Usage 135 138 ===== 136 139 ··· 149 138 150 139 dmzadm --format /dev/sdxx 151 140 152 - For a formatted device, the target can be created normally with the 153 - dmsetup utility. The only parameter that dm-zoned requires is the 154 - underlying zoned block device name. Ex:: 155 141 156 - echo "0 `blockdev --getsize ${dev}` zoned ${dev}" | \ 157 - dmsetup create dmz-`basename ${dev}` 142 + If two drives are to be used, both devices must be specified, with the 143 + regular block device as the first device. 144 + 145 + Ex:: 146 + 147 + dmzadm --format /dev/sdxx /dev/sdyy 148 + 149 + 150 + Fomatted device(s) can be started with the dmzadm utility, too.: 151 + 152 + Ex:: 153 + 154 + dmzadm --start /dev/sdxx /dev/sdyy 155 + 156 + 157 + Information about the internal layout and current usage of the zones can 158 + be obtained with the 'status' callback from dmsetup: 159 + 160 + Ex:: 161 + 162 + dmsetup status /dev/dm-X 163 + 164 + will return a line 165 + 166 + 0 <size> zoned <nr_zones> zones <nr_unmap_rnd>/<nr_rnd> random <nr_unmap_seq>/<nr_seq> sequential 167 + 168 + where <nr_zones> is the total number of zones, <nr_unmap_rnd> is the number 169 + of unmapped (ie free) random zones, <nr_rnd> the total number of zones, 170 + <nr_unmap_seq> the number of unmapped sequential zones, and <nr_seq> the 171 + total number of sequential zones. 172 + 173 + Normally the reclaim process will be started once there are less than 50 174 + percent free random zones. In order to start the reclaim process manually 175 + even before reaching this threshold the 'dmsetup message' function can be 176 + used: 177 + 178 + Ex:: 179 + 180 + dmsetup message /dev/dm-X 0 reclaim 181 + 182 + will start the reclaim process and random zones will be moved to sequential 183 + zones.
+20
drivers/md/Kconfig
··· 269 269 config DM_CRYPT 270 270 tristate "Crypt target support" 271 271 depends on BLK_DEV_DM 272 + depends on (ENCRYPTED_KEYS || ENCRYPTED_KEYS=n) 272 273 select CRYPTO 273 274 select CRYPTO_CBC 274 275 select CRYPTO_ESSIV ··· 336 335 337 336 The writecache target doesn't cache reads because reads are supposed 338 337 to be cached in standard RAM. 338 + 339 + config DM_EBS 340 + tristate "Emulated block size target (EXPERIMENTAL)" 341 + depends on BLK_DEV_DM 342 + select DM_BUFIO 343 + help 344 + dm-ebs emulates smaller logical block size on backing devices 345 + with larger ones (e.g. 512 byte sectors on 4K native disks). 339 346 340 347 config DM_ERA 341 348 tristate "Era target (EXPERIMENTAL)" ··· 449 440 This path selector is a dynamic load balancer which selects 450 441 the path expected to complete the incoming I/O in the shortest 451 442 time. 443 + 444 + If unsure, say N. 445 + 446 + config DM_MULTIPATH_HST 447 + tristate "I/O Path Selector based on historical service time" 448 + depends on DM_MULTIPATH 449 + help 450 + This path selector is a dynamic load balancer which selects 451 + the path expected to complete the incoming I/O in the shortest 452 + time by comparing estimated service time (based on historical 453 + service time). 452 454 453 455 If unsure, say N. 454 456
+3
drivers/md/Makefile
··· 17 17 dm-cache-y += dm-cache-target.o dm-cache-metadata.o dm-cache-policy.o \ 18 18 dm-cache-background-tracker.o 19 19 dm-cache-smq-y += dm-cache-policy-smq.o 20 + dm-ebs-y += dm-ebs-target.o 20 21 dm-era-y += dm-era-target.o 21 22 dm-clone-y += dm-clone-target.o dm-clone-metadata.o 22 23 dm-verity-y += dm-verity-target.o ··· 55 54 obj-$(CONFIG_DM_MULTIPATH) += dm-multipath.o dm-round-robin.o 56 55 obj-$(CONFIG_DM_MULTIPATH_QL) += dm-queue-length.o 57 56 obj-$(CONFIG_DM_MULTIPATH_ST) += dm-service-time.o 57 + obj-$(CONFIG_DM_MULTIPATH_HST) += dm-historical-service-time.o 58 58 obj-$(CONFIG_DM_SWITCH) += dm-switch.o 59 59 obj-$(CONFIG_DM_SNAPSHOT) += dm-snapshot.o 60 60 obj-$(CONFIG_DM_PERSISTENT_DATA) += persistent-data/ ··· 67 65 obj-$(CONFIG_DM_VERITY) += dm-verity.o 68 66 obj-$(CONFIG_DM_CACHE) += dm-cache.o 69 67 obj-$(CONFIG_DM_CACHE_SMQ) += dm-cache-smq.o 68 + obj-$(CONFIG_DM_EBS) += dm-ebs.o 70 69 obj-$(CONFIG_DM_ERA) += dm-era.o 71 70 obj-$(CONFIG_DM_CLONE) += dm-clone.o 72 71 obj-$(CONFIG_DM_LOG_WRITES) += dm-log-writes.o
+97 -12
drivers/md/dm-bufio.c
··· 256 256 if (b->block == block) 257 257 return b; 258 258 259 - n = (b->block < block) ? n->rb_left : n->rb_right; 259 + n = block < b->block ? n->rb_left : n->rb_right; 260 260 } 261 261 262 262 return NULL; 263 + } 264 + 265 + static struct dm_buffer *__find_next(struct dm_bufio_client *c, sector_t block) 266 + { 267 + struct rb_node *n = c->buffer_tree.rb_node; 268 + struct dm_buffer *b; 269 + struct dm_buffer *best = NULL; 270 + 271 + while (n) { 272 + b = container_of(n, struct dm_buffer, node); 273 + 274 + if (b->block == block) 275 + return b; 276 + 277 + if (block <= b->block) { 278 + n = n->rb_left; 279 + best = b; 280 + } else { 281 + n = n->rb_right; 282 + } 283 + } 284 + 285 + return best; 263 286 } 264 287 265 288 static void __insert(struct dm_bufio_client *c, struct dm_buffer *b) ··· 299 276 } 300 277 301 278 parent = *new; 302 - new = (found->block < b->block) ? 303 - &((*new)->rb_left) : &((*new)->rb_right); 279 + new = b->block < found->block ? 280 + &found->node.rb_left : &found->node.rb_right; 304 281 } 305 282 306 283 rb_link_node(&b->node, parent, new); ··· 654 631 submit_bio(bio); 655 632 } 656 633 634 + static inline sector_t block_to_sector(struct dm_bufio_client *c, sector_t block) 635 + { 636 + sector_t sector; 637 + 638 + if (likely(c->sectors_per_block_bits >= 0)) 639 + sector = block << c->sectors_per_block_bits; 640 + else 641 + sector = block * (c->block_size >> SECTOR_SHIFT); 642 + sector += c->start; 643 + 644 + return sector; 645 + } 646 + 657 647 static void submit_io(struct dm_buffer *b, int rw, void (*end_io)(struct dm_buffer *, blk_status_t)) 658 648 { 659 649 unsigned n_sectors; ··· 675 639 676 640 b->end_io = end_io; 677 641 678 - if (likely(b->c->sectors_per_block_bits >= 0)) 679 - sector = b->block << b->c->sectors_per_block_bits; 680 - else 681 - sector = b->block * (b->c->block_size >> SECTOR_SHIFT); 682 - sector += b->c->start; 642 + sector = block_to_sector(b->c, b->block); 683 643 684 644 if (rw != REQ_OP_WRITE) { 685 645 n_sectors = b->c->block_size >> SECTOR_SHIFT; ··· 1358 1326 EXPORT_SYMBOL_GPL(dm_bufio_issue_flush); 1359 1327 1360 1328 /* 1329 + * Use dm-io to send a discard request to flush the device. 1330 + */ 1331 + int dm_bufio_issue_discard(struct dm_bufio_client *c, sector_t block, sector_t count) 1332 + { 1333 + struct dm_io_request io_req = { 1334 + .bi_op = REQ_OP_DISCARD, 1335 + .bi_op_flags = REQ_SYNC, 1336 + .mem.type = DM_IO_KMEM, 1337 + .mem.ptr.addr = NULL, 1338 + .client = c->dm_io, 1339 + }; 1340 + struct dm_io_region io_reg = { 1341 + .bdev = c->bdev, 1342 + .sector = block_to_sector(c, block), 1343 + .count = block_to_sector(c, count), 1344 + }; 1345 + 1346 + BUG_ON(dm_bufio_in_request()); 1347 + 1348 + return dm_io(&io_req, 1, &io_reg, NULL); 1349 + } 1350 + EXPORT_SYMBOL_GPL(dm_bufio_issue_discard); 1351 + 1352 + /* 1361 1353 * We first delete any other buffer that may be at that new location. 1362 1354 * 1363 1355 * Then, we write the buffer to the original location if it was dirty. ··· 1457 1401 } 1458 1402 EXPORT_SYMBOL_GPL(dm_bufio_release_move); 1459 1403 1404 + static void forget_buffer_locked(struct dm_buffer *b) 1405 + { 1406 + if (likely(!b->hold_count) && likely(!b->state)) { 1407 + __unlink_buffer(b); 1408 + __free_buffer_wake(b); 1409 + } 1410 + } 1411 + 1460 1412 /* 1461 1413 * Free the given buffer. 
1462 1414 * ··· 1478 1414 dm_bufio_lock(c); 1479 1415 1480 1416 b = __find(c, block); 1481 - if (b && likely(!b->hold_count) && likely(!b->state)) { 1482 - __unlink_buffer(b); 1483 - __free_buffer_wake(b); 1484 - } 1417 + if (b) 1418 + forget_buffer_locked(b); 1485 1419 1486 1420 dm_bufio_unlock(c); 1487 1421 } 1488 1422 EXPORT_SYMBOL_GPL(dm_bufio_forget); 1423 + 1424 + void dm_bufio_forget_buffers(struct dm_bufio_client *c, sector_t block, sector_t n_blocks) 1425 + { 1426 + struct dm_buffer *b; 1427 + sector_t end_block = block + n_blocks; 1428 + 1429 + while (block < end_block) { 1430 + dm_bufio_lock(c); 1431 + 1432 + b = __find_next(c, block); 1433 + if (b) { 1434 + block = b->block + 1; 1435 + forget_buffer_locked(b); 1436 + } 1437 + 1438 + dm_bufio_unlock(c); 1439 + 1440 + if (!b) 1441 + break; 1442 + } 1443 + 1444 + } 1445 + EXPORT_SYMBOL_GPL(dm_bufio_forget_buffers); 1489 1446 1490 1447 void dm_bufio_set_minimum_buffers(struct dm_bufio_client *c, unsigned n) 1491 1448 {
+59 -21
drivers/md/dm-crypt.c
··· 34 34 #include <crypto/aead.h> 35 35 #include <crypto/authenc.h> 36 36 #include <linux/rtnetlink.h> /* for struct rtattr and RTA macros only */ 37 + #include <linux/key-type.h> 37 38 #include <keys/user-type.h> 39 + #include <keys/encrypted-type.h> 38 40 39 41 #include <linux/device-mapper.h> 40 42 ··· 214 212 struct mutex bio_alloc_lock; 215 213 216 214 u8 *authenc_key; /* space for keys in authenc() format (if used) */ 217 - u8 key[0]; 215 + u8 key[]; 218 216 }; 219 217 220 218 #define MIN_IOS 64 ··· 2217 2215 return false; 2218 2216 } 2219 2217 2218 + static int set_key_user(struct crypt_config *cc, struct key *key) 2219 + { 2220 + const struct user_key_payload *ukp; 2221 + 2222 + ukp = user_key_payload_locked(key); 2223 + if (!ukp) 2224 + return -EKEYREVOKED; 2225 + 2226 + if (cc->key_size != ukp->datalen) 2227 + return -EINVAL; 2228 + 2229 + memcpy(cc->key, ukp->data, cc->key_size); 2230 + 2231 + return 0; 2232 + } 2233 + 2234 + #if defined(CONFIG_ENCRYPTED_KEYS) || defined(CONFIG_ENCRYPTED_KEYS_MODULE) 2235 + static int set_key_encrypted(struct crypt_config *cc, struct key *key) 2236 + { 2237 + const struct encrypted_key_payload *ekp; 2238 + 2239 + ekp = key->payload.data[0]; 2240 + if (!ekp) 2241 + return -EKEYREVOKED; 2242 + 2243 + if (cc->key_size != ekp->decrypted_datalen) 2244 + return -EINVAL; 2245 + 2246 + memcpy(cc->key, ekp->decrypted_data, cc->key_size); 2247 + 2248 + return 0; 2249 + } 2250 + #endif /* CONFIG_ENCRYPTED_KEYS */ 2251 + 2220 2252 static int crypt_set_keyring_key(struct crypt_config *cc, const char *key_string) 2221 2253 { 2222 2254 char *new_key_string, *key_desc; 2223 2255 int ret; 2256 + struct key_type *type; 2224 2257 struct key *key; 2225 - const struct user_key_payload *ukp; 2258 + int (*set_key)(struct crypt_config *cc, struct key *key); 2226 2259 2227 2260 /* 2228 2261 * Reject key_string with whitespace. dm core currently lacks code for ··· 2273 2236 if (!key_desc || key_desc == key_string || !strlen(key_desc + 1)) 2274 2237 return -EINVAL; 2275 2238 2276 - if (strncmp(key_string, "logon:", key_desc - key_string + 1) && 2277 - strncmp(key_string, "user:", key_desc - key_string + 1)) 2239 + if (!strncmp(key_string, "logon:", key_desc - key_string + 1)) { 2240 + type = &key_type_logon; 2241 + set_key = set_key_user; 2242 + } else if (!strncmp(key_string, "user:", key_desc - key_string + 1)) { 2243 + type = &key_type_user; 2244 + set_key = set_key_user; 2245 + #if defined(CONFIG_ENCRYPTED_KEYS) || defined(CONFIG_ENCRYPTED_KEYS_MODULE) 2246 + } else if (!strncmp(key_string, "encrypted:", key_desc - key_string + 1)) { 2247 + type = &key_type_encrypted; 2248 + set_key = set_key_encrypted; 2249 + #endif 2250 + } else { 2278 2251 return -EINVAL; 2252 + } 2279 2253 2280 2254 new_key_string = kstrdup(key_string, GFP_KERNEL); 2281 2255 if (!new_key_string) 2282 2256 return -ENOMEM; 2283 2257 2284 - key = request_key(key_string[0] == 'l' ? 
&key_type_logon : &key_type_user, 2285 - key_desc + 1, NULL); 2258 + key = request_key(type, key_desc + 1, NULL); 2286 2259 if (IS_ERR(key)) { 2287 2260 kzfree(new_key_string); 2288 2261 return PTR_ERR(key); ··· 2300 2253 2301 2254 down_read(&key->sem); 2302 2255 2303 - ukp = user_key_payload_locked(key); 2304 - if (!ukp) { 2256 + ret = set_key(cc, key); 2257 + if (ret < 0) { 2305 2258 up_read(&key->sem); 2306 2259 key_put(key); 2307 2260 kzfree(new_key_string); 2308 - return -EKEYREVOKED; 2261 + return ret; 2309 2262 } 2310 - 2311 - if (cc->key_size != ukp->datalen) { 2312 - up_read(&key->sem); 2313 - key_put(key); 2314 - kzfree(new_key_string); 2315 - return -EINVAL; 2316 - } 2317 - 2318 - memcpy(cc->key, ukp->data, cc->key_size); 2319 2263 2320 2264 up_read(&key->sem); 2321 2265 key_put(key); ··· 2361 2323 return (*key_string[0] == ':') ? -EINVAL : strlen(*key_string) >> 1; 2362 2324 } 2363 2325 2364 - #endif 2326 + #endif /* CONFIG_KEYS */ 2365 2327 2366 2328 static int crypt_set_key(struct crypt_config *cc, char *key) 2367 2329 { ··· 3312 3274 limits->max_segment_size = PAGE_SIZE; 3313 3275 3314 3276 limits->logical_block_size = 3315 - max_t(unsigned short, limits->logical_block_size, cc->sector_size); 3277 + max_t(unsigned, limits->logical_block_size, cc->sector_size); 3316 3278 limits->physical_block_size = 3317 3279 max_t(unsigned, limits->physical_block_size, cc->sector_size); 3318 3280 limits->io_min = max_t(unsigned, limits->io_min, cc->sector_size); ··· 3320 3282 3321 3283 static struct target_type crypt_target = { 3322 3284 .name = "crypt", 3323 - .version = {1, 20, 0}, 3285 + .version = {1, 21, 0}, 3324 3286 .module = THIS_MODULE, 3325 3287 .ctr = crypt_ctr, 3326 3288 .dtr = crypt_dtr,
+471
drivers/md/dm-ebs-target.c
··· 1 + /* 2 + * Copyright (C) 2020 Red Hat GmbH 3 + * 4 + * This file is released under the GPL. 5 + * 6 + * Device-mapper target to emulate smaller logical block 7 + * size on backing devices exposing (natively) larger ones. 8 + * 9 + * E.g. 512 byte sector emulation on 4K native disks. 10 + */ 11 + 12 + #include "dm.h" 13 + #include <linux/module.h> 14 + #include <linux/workqueue.h> 15 + #include <linux/dm-bufio.h> 16 + 17 + #define DM_MSG_PREFIX "ebs" 18 + 19 + static void ebs_dtr(struct dm_target *ti); 20 + 21 + /* Emulated block size context. */ 22 + struct ebs_c { 23 + struct dm_dev *dev; /* Underlying device to emulate block size on. */ 24 + struct dm_bufio_client *bufio; /* Use dm-bufio for read and read-modify-write processing. */ 25 + struct workqueue_struct *wq; /* Workqueue for ^ processing of bios. */ 26 + struct work_struct ws; /* Work item used for ^. */ 27 + struct bio_list bios_in; /* Worker bios input list. */ 28 + spinlock_t lock; /* Guard bios input list above. */ 29 + sector_t start; /* <start> table line argument, see ebs_ctr below. */ 30 + unsigned int e_bs; /* Emulated block size in sectors exposed to upper layer. */ 31 + unsigned int u_bs; /* Underlying block size in sectors retrievd from/set on lower layer device. */ 32 + unsigned char block_shift; /* bitshift sectors -> blocks used in dm-bufio API. */ 33 + bool u_bs_set:1; /* Flag to indicate underlying block size is set on table line. */ 34 + }; 35 + 36 + static inline sector_t __sector_to_block(struct ebs_c *ec, sector_t sector) 37 + { 38 + return sector >> ec->block_shift; 39 + } 40 + 41 + static inline sector_t __block_mod(sector_t sector, unsigned int bs) 42 + { 43 + return sector & (bs - 1); 44 + } 45 + 46 + /* Return number of blocks for a bio, accounting for misalignement of start and end sectors. */ 47 + static inline unsigned int __nr_blocks(struct ebs_c *ec, struct bio *bio) 48 + { 49 + sector_t end_sector = __block_mod(bio->bi_iter.bi_sector, ec->u_bs) + bio_sectors(bio); 50 + 51 + return __sector_to_block(ec, end_sector) + (__block_mod(end_sector, ec->u_bs) ? 1 : 0); 52 + } 53 + 54 + static inline bool __ebs_check_bs(unsigned int bs) 55 + { 56 + return bs && is_power_of_2(bs); 57 + } 58 + 59 + /* 60 + * READ/WRITE: 61 + * 62 + * copy blocks between bufio blocks and bio vector's (partial/overlapping) pages. 63 + */ 64 + static int __ebs_rw_bvec(struct ebs_c *ec, int rw, struct bio_vec *bv, struct bvec_iter *iter) 65 + { 66 + int r = 0; 67 + unsigned char *ba, *pa; 68 + unsigned int cur_len; 69 + unsigned int bv_len = bv->bv_len; 70 + unsigned int buf_off = to_bytes(__block_mod(iter->bi_sector, ec->u_bs)); 71 + sector_t block = __sector_to_block(ec, iter->bi_sector); 72 + struct dm_buffer *b; 73 + 74 + if (unlikely(!bv->bv_page || !bv_len)) 75 + return -EIO; 76 + 77 + pa = page_address(bv->bv_page) + bv->bv_offset; 78 + 79 + /* Handle overlapping page <-> blocks */ 80 + while (bv_len) { 81 + cur_len = min(dm_bufio_get_block_size(ec->bufio) - buf_off, bv_len); 82 + 83 + /* Avoid reading for writes in case bio vector's page overwrites block completely. */ 84 + if (rw == READ || buf_off || bv_len < dm_bufio_get_block_size(ec->bufio)) 85 + ba = dm_bufio_read(ec->bufio, block, &b); 86 + else 87 + ba = dm_bufio_new(ec->bufio, block, &b); 88 + 89 + if (unlikely(IS_ERR(ba))) { 90 + /* 91 + * Carry on with next buffer, if any, to issue all possible 92 + * data but return error. 93 + */ 94 + r = PTR_ERR(ba); 95 + } else { 96 + /* Copy data to/from bio to buffer if read/new was successful above. 
*/ 97 + ba += buf_off; 98 + if (rw == READ) { 99 + memcpy(pa, ba, cur_len); 100 + flush_dcache_page(bv->bv_page); 101 + } else { 102 + flush_dcache_page(bv->bv_page); 103 + memcpy(ba, pa, cur_len); 104 + dm_bufio_mark_partial_buffer_dirty(b, buf_off, buf_off + cur_len); 105 + } 106 + 107 + dm_bufio_release(b); 108 + } 109 + 110 + pa += cur_len; 111 + bv_len -= cur_len; 112 + buf_off = 0; 113 + block++; 114 + } 115 + 116 + return r; 117 + } 118 + 119 + /* READ/WRITE: iterate bio vector's copying between (partial) pages and bufio blocks. */ 120 + static int __ebs_rw_bio(struct ebs_c *ec, int rw, struct bio *bio) 121 + { 122 + int r = 0, rr; 123 + struct bio_vec bv; 124 + struct bvec_iter iter; 125 + 126 + bio_for_each_bvec(bv, bio, iter) { 127 + rr = __ebs_rw_bvec(ec, rw, &bv, &iter); 128 + if (rr) 129 + r = rr; 130 + } 131 + 132 + return r; 133 + } 134 + 135 + /* 136 + * Discard bio's blocks, i.e. pass discards down. 137 + * 138 + * Avoid discarding partial blocks at beginning and end; 139 + * return 0 in case no blocks can be discarded as a result. 140 + */ 141 + static int __ebs_discard_bio(struct ebs_c *ec, struct bio *bio) 142 + { 143 + sector_t block, blocks, sector = bio->bi_iter.bi_sector; 144 + 145 + block = __sector_to_block(ec, sector); 146 + blocks = __nr_blocks(ec, bio); 147 + 148 + /* 149 + * Partial first underlying block (__nr_blocks() may have 150 + * resulted in one block). 151 + */ 152 + if (__block_mod(sector, ec->u_bs)) { 153 + block++; 154 + blocks--; 155 + } 156 + 157 + /* Partial last underlying block if any. */ 158 + if (blocks && __block_mod(bio_end_sector(bio), ec->u_bs)) 159 + blocks--; 160 + 161 + return blocks ? dm_bufio_issue_discard(ec->bufio, block, blocks) : 0; 162 + } 163 + 164 + /* Release blocks them from the bufio cache. */ 165 + static void __ebs_forget_bio(struct ebs_c *ec, struct bio *bio) 166 + { 167 + sector_t blocks, sector = bio->bi_iter.bi_sector; 168 + 169 + blocks = __nr_blocks(ec, bio); 170 + 171 + dm_bufio_forget_buffers(ec->bufio, __sector_to_block(ec, sector), blocks); 172 + } 173 + 174 + /* Worker funtion to process incoming bios. 
*/ 175 + static void __ebs_process_bios(struct work_struct *ws) 176 + { 177 + int r; 178 + bool write = false; 179 + sector_t block1, block2; 180 + struct ebs_c *ec = container_of(ws, struct ebs_c, ws); 181 + struct bio *bio; 182 + struct bio_list bios; 183 + 184 + bio_list_init(&bios); 185 + 186 + spin_lock_irq(&ec->lock); 187 + bios = ec->bios_in; 188 + bio_list_init(&ec->bios_in); 189 + spin_unlock_irq(&ec->lock); 190 + 191 + /* Prefetch all read and any mis-aligned write buffers */ 192 + bio_list_for_each(bio, &bios) { 193 + block1 = __sector_to_block(ec, bio->bi_iter.bi_sector); 194 + if (bio_op(bio) == REQ_OP_READ) 195 + dm_bufio_prefetch(ec->bufio, block1, __nr_blocks(ec, bio)); 196 + else if (bio_op(bio) == REQ_OP_WRITE && !(bio->bi_opf & REQ_PREFLUSH)) { 197 + block2 = __sector_to_block(ec, bio_end_sector(bio)); 198 + if (__block_mod(bio->bi_iter.bi_sector, ec->u_bs)) 199 + dm_bufio_prefetch(ec->bufio, block1, 1); 200 + if (__block_mod(bio_end_sector(bio), ec->u_bs) && block2 != block1) 201 + dm_bufio_prefetch(ec->bufio, block2, 1); 202 + } 203 + } 204 + 205 + bio_list_for_each(bio, &bios) { 206 + r = -EIO; 207 + if (bio_op(bio) == REQ_OP_READ) 208 + r = __ebs_rw_bio(ec, READ, bio); 209 + else if (bio_op(bio) == REQ_OP_WRITE) { 210 + write = true; 211 + r = __ebs_rw_bio(ec, WRITE, bio); 212 + } else if (bio_op(bio) == REQ_OP_DISCARD) { 213 + __ebs_forget_bio(ec, bio); 214 + r = __ebs_discard_bio(ec, bio); 215 + } 216 + 217 + if (r < 0) 218 + bio->bi_status = errno_to_blk_status(r); 219 + } 220 + 221 + /* 222 + * We write dirty buffers after processing I/O on them 223 + * but before we endio thus addressing REQ_FUA/REQ_SYNC. 224 + */ 225 + r = write ? dm_bufio_write_dirty_buffers(ec->bufio) : 0; 226 + 227 + while ((bio = bio_list_pop(&bios))) { 228 + /* Any other request is endioed. 
*/ 229 + if (unlikely(r && bio_op(bio) == REQ_OP_WRITE)) 230 + bio_io_error(bio); 231 + else 232 + bio_endio(bio); 233 + } 234 + } 235 + 236 + /* 237 + * Construct an emulated block size mapping: <dev_path> <offset> <ebs> [<ubs>] 238 + * 239 + * <dev_path>: path of the underlying device 240 + * <offset>: offset in 512 bytes sectors into <dev_path> 241 + * <ebs>: emulated block size in units of 512 bytes exposed to the upper layer 242 + * [<ubs>]: underlying block size in units of 512 bytes imposed on the lower layer; 243 + * optional, if not supplied, retrieve logical block size from underlying device 244 + */ 245 + static int ebs_ctr(struct dm_target *ti, unsigned int argc, char **argv) 246 + { 247 + int r; 248 + unsigned short tmp1; 249 + unsigned long long tmp; 250 + char dummy; 251 + struct ebs_c *ec; 252 + 253 + if (argc < 3 || argc > 4) { 254 + ti->error = "Invalid argument count"; 255 + return -EINVAL; 256 + } 257 + 258 + ec = ti->private = kzalloc(sizeof(*ec), GFP_KERNEL); 259 + if (!ec) { 260 + ti->error = "Cannot allocate ebs context"; 261 + return -ENOMEM; 262 + } 263 + 264 + r = -EINVAL; 265 + if (sscanf(argv[1], "%llu%c", &tmp, &dummy) != 1 || 266 + tmp != (sector_t)tmp || 267 + (sector_t)tmp >= ti->len) { 268 + ti->error = "Invalid device offset sector"; 269 + goto bad; 270 + } 271 + ec->start = tmp; 272 + 273 + if (sscanf(argv[2], "%hu%c", &tmp1, &dummy) != 1 || 274 + !__ebs_check_bs(tmp1) || 275 + to_bytes(tmp1) > PAGE_SIZE) { 276 + ti->error = "Invalid emulated block size"; 277 + goto bad; 278 + } 279 + ec->e_bs = tmp1; 280 + 281 + if (argc > 3) { 282 + if (sscanf(argv[3], "%hu%c", &tmp1, &dummy) != 1 || !__ebs_check_bs(tmp1)) { 283 + ti->error = "Invalid underlying block size"; 284 + goto bad; 285 + } 286 + ec->u_bs = tmp1; 287 + ec->u_bs_set = true; 288 + } else 289 + ec->u_bs_set = false; 290 + 291 + r = dm_get_device(ti, argv[0], dm_table_get_mode(ti->table), &ec->dev); 292 + if (r) { 293 + ti->error = "Device lookup failed"; 294 + ec->dev = NULL; 295 + goto bad; 296 + } 297 + 298 + r = -EINVAL; 299 + if (!ec->u_bs_set) { 300 + ec->u_bs = to_sector(bdev_logical_block_size(ec->dev->bdev)); 301 + if (!__ebs_check_bs(ec->u_bs)) { 302 + ti->error = "Invalid retrieved underlying block size"; 303 + goto bad; 304 + } 305 + } 306 + 307 + if (!ec->u_bs_set && ec->e_bs == ec->u_bs) 308 + DMINFO("Emulation superfluous: emulated equal to underlying block size"); 309 + 310 + if (__block_mod(ec->start, ec->u_bs)) { 311 + ti->error = "Device offset must be multiple of underlying block size"; 312 + goto bad; 313 + } 314 + 315 + ec->bufio = dm_bufio_client_create(ec->dev->bdev, to_bytes(ec->u_bs), 1, 0, NULL, NULL); 316 + if (IS_ERR(ec->bufio)) { 317 + ti->error = "Cannot create dm bufio client"; 318 + r = PTR_ERR(ec->bufio); 319 + ec->bufio = NULL; 320 + goto bad; 321 + } 322 + 323 + ec->wq = alloc_ordered_workqueue("dm-" DM_MSG_PREFIX, WQ_MEM_RECLAIM); 324 + if (!ec->wq) { 325 + ti->error = "Cannot create dm-" DM_MSG_PREFIX " workqueue"; 326 + r = -ENOMEM; 327 + goto bad; 328 + } 329 + 330 + ec->block_shift = __ffs(ec->u_bs); 331 + INIT_WORK(&ec->ws, &__ebs_process_bios); 332 + bio_list_init(&ec->bios_in); 333 + spin_lock_init(&ec->lock); 334 + 335 + ti->num_flush_bios = 1; 336 + ti->num_discard_bios = 1; 337 + ti->num_secure_erase_bios = 0; 338 + ti->num_write_same_bios = 0; 339 + ti->num_write_zeroes_bios = 0; 340 + return 0; 341 + bad: 342 + ebs_dtr(ti); 343 + return r; 344 + } 345 + 346 + static void ebs_dtr(struct dm_target *ti) 347 + { 348 + struct ebs_c *ec = ti->private; 349 
+ 350 + if (ec->wq) 351 + destroy_workqueue(ec->wq); 352 + if (ec->bufio) 353 + dm_bufio_client_destroy(ec->bufio); 354 + if (ec->dev) 355 + dm_put_device(ti, ec->dev); 356 + kfree(ec); 357 + } 358 + 359 + static int ebs_map(struct dm_target *ti, struct bio *bio) 360 + { 361 + struct ebs_c *ec = ti->private; 362 + 363 + bio_set_dev(bio, ec->dev->bdev); 364 + bio->bi_iter.bi_sector = ec->start + dm_target_offset(ti, bio->bi_iter.bi_sector); 365 + 366 + if (unlikely(bio->bi_opf & REQ_OP_FLUSH)) 367 + return DM_MAPIO_REMAPPED; 368 + /* 369 + * Only queue for bufio processing in case of partial or overlapping buffers 370 + * -or- 371 + * emulation with ebs == ubs aiming for tests of dm-bufio overhead. 372 + */ 373 + if (likely(__block_mod(bio->bi_iter.bi_sector, ec->u_bs) || 374 + __block_mod(bio_end_sector(bio), ec->u_bs) || 375 + ec->e_bs == ec->u_bs)) { 376 + spin_lock_irq(&ec->lock); 377 + bio_list_add(&ec->bios_in, bio); 378 + spin_unlock_irq(&ec->lock); 379 + 380 + queue_work(ec->wq, &ec->ws); 381 + 382 + return DM_MAPIO_SUBMITTED; 383 + } 384 + 385 + /* Forget any buffer content relative to this direct backing device I/O. */ 386 + __ebs_forget_bio(ec, bio); 387 + 388 + return DM_MAPIO_REMAPPED; 389 + } 390 + 391 + static void ebs_status(struct dm_target *ti, status_type_t type, 392 + unsigned status_flags, char *result, unsigned maxlen) 393 + { 394 + struct ebs_c *ec = ti->private; 395 + 396 + switch (type) { 397 + case STATUSTYPE_INFO: 398 + *result = '\0'; 399 + break; 400 + case STATUSTYPE_TABLE: 401 + snprintf(result, maxlen, ec->u_bs_set ? "%s %llu %u %u" : "%s %llu %u", 402 + ec->dev->name, (unsigned long long) ec->start, ec->e_bs, ec->u_bs); 403 + break; 404 + } 405 + } 406 + 407 + static int ebs_prepare_ioctl(struct dm_target *ti, struct block_device **bdev) 408 + { 409 + struct ebs_c *ec = ti->private; 410 + struct dm_dev *dev = ec->dev; 411 + 412 + /* 413 + * Only pass ioctls through if the device sizes match exactly. 
414 + */ 415 + *bdev = dev->bdev; 416 + return !!(ec->start || ti->len != i_size_read(dev->bdev->bd_inode) >> SECTOR_SHIFT); 417 + } 418 + 419 + static void ebs_io_hints(struct dm_target *ti, struct queue_limits *limits) 420 + { 421 + struct ebs_c *ec = ti->private; 422 + 423 + limits->logical_block_size = to_bytes(ec->e_bs); 424 + limits->physical_block_size = to_bytes(ec->u_bs); 425 + limits->alignment_offset = limits->physical_block_size; 426 + blk_limits_io_min(limits, limits->logical_block_size); 427 + } 428 + 429 + static int ebs_iterate_devices(struct dm_target *ti, 430 + iterate_devices_callout_fn fn, void *data) 431 + { 432 + struct ebs_c *ec = ti->private; 433 + 434 + return fn(ti, ec->dev, ec->start, ti->len, data); 435 + } 436 + 437 + static struct target_type ebs_target = { 438 + .name = "ebs", 439 + .version = {1, 0, 1}, 440 + .features = DM_TARGET_PASSES_INTEGRITY, 441 + .module = THIS_MODULE, 442 + .ctr = ebs_ctr, 443 + .dtr = ebs_dtr, 444 + .map = ebs_map, 445 + .status = ebs_status, 446 + .io_hints = ebs_io_hints, 447 + .prepare_ioctl = ebs_prepare_ioctl, 448 + .iterate_devices = ebs_iterate_devices, 449 + }; 450 + 451 + static int __init dm_ebs_init(void) 452 + { 453 + int r = dm_register_target(&ebs_target); 454 + 455 + if (r < 0) 456 + DMERR("register failed %d", r); 457 + 458 + return r; 459 + } 460 + 461 + static void dm_ebs_exit(void) 462 + { 463 + dm_unregister_target(&ebs_target); 464 + } 465 + 466 + module_init(dm_ebs_init); 467 + module_exit(dm_ebs_exit); 468 + 469 + MODULE_AUTHOR("Heinz Mauelshagen <dm-devel@redhat.com>"); 470 + MODULE_DESCRIPTION(DM_NAME " emulated block size target"); 471 + MODULE_LICENSE("GPL");
+561
drivers/md/dm-historical-service-time.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* 3 + * Historical Service Time 4 + * 5 + * Keeps a time-weighted exponential moving average of the historical 6 + * service time. Estimates future service time based on the historical 7 + * service time and the number of outstanding requests. 8 + * 9 + * Marks paths stale if they have not finished within hst * 10 + * num_paths. If a path is stale and unused, we will send a single 11 + * request to probe in case the path has improved. This situation 12 + * generally arises if the path is so much worse than others that it 13 + * will never have the best estimated service time, or if the entire 14 + * multipath device is unused. If a path is stale and in use, limit the 15 + * number of requests it can receive with the assumption that the path 16 + * has become degraded. 17 + * 18 + * To avoid repeatedly calculating exponents for time weighting, times 19 + * are split into HST_WEIGHT_COUNT buckets each (1 >> HST_BUCKET_SHIFT) 20 + * ns, and the weighting is pre-calculated. 21 + * 22 + */ 23 + 24 + #include "dm.h" 25 + #include "dm-path-selector.h" 26 + 27 + #include <linux/blkdev.h> 28 + #include <linux/slab.h> 29 + #include <linux/module.h> 30 + 31 + 32 + #define DM_MSG_PREFIX "multipath historical-service-time" 33 + #define HST_MIN_IO 1 34 + #define HST_VERSION "0.1.1" 35 + 36 + #define HST_FIXED_SHIFT 10 /* 10 bits of decimal precision */ 37 + #define HST_FIXED_MAX (ULLONG_MAX >> HST_FIXED_SHIFT) 38 + #define HST_FIXED_1 (1 << HST_FIXED_SHIFT) 39 + #define HST_FIXED_95 972 40 + #define HST_MAX_INFLIGHT HST_FIXED_1 41 + #define HST_BUCKET_SHIFT 24 /* Buckets are ~ 16ms */ 42 + #define HST_WEIGHT_COUNT 64ULL 43 + 44 + struct selector { 45 + struct list_head valid_paths; 46 + struct list_head failed_paths; 47 + int valid_count; 48 + spinlock_t lock; 49 + 50 + unsigned int weights[HST_WEIGHT_COUNT]; 51 + unsigned int threshold_multiplier; 52 + }; 53 + 54 + struct path_info { 55 + struct list_head list; 56 + struct dm_path *path; 57 + unsigned int repeat_count; 58 + 59 + spinlock_t lock; 60 + 61 + u64 historical_service_time; /* Fixed point */ 62 + 63 + u64 stale_after; 64 + u64 last_finish; 65 + 66 + u64 outstanding; 67 + }; 68 + 69 + /** 70 + * fixed_power - compute: x^n, in O(log n) time 71 + * 72 + * @x: base of the power 73 + * @frac_bits: fractional bits of @x 74 + * @n: power to raise @x to. 75 + * 76 + * By exploiting the relation between the definition of the natural power 77 + * function: x^n := x*x*...*x (x multiplied by itself for n times), and 78 + * the binary encoding of numbers used by computers: n := \Sum n_i * 2^i, 79 + * (where: n_i \elem {0, 1}, the binary vector representing n), 80 + * we find: x^n := x^(\Sum n_i * 2^i) := \Prod x^(n_i * 2^i), which is 81 + * of course trivially computable in O(log_2 n), the length of our binary 82 + * vector. 
83 + * 84 + * (see: kernel/sched/loadavg.c) 85 + */ 86 + static u64 fixed_power(u64 x, unsigned int frac_bits, unsigned int n) 87 + { 88 + unsigned long result = 1UL << frac_bits; 89 + 90 + if (n) { 91 + for (;;) { 92 + if (n & 1) { 93 + result *= x; 94 + result += 1UL << (frac_bits - 1); 95 + result >>= frac_bits; 96 + } 97 + n >>= 1; 98 + if (!n) 99 + break; 100 + x *= x; 101 + x += 1UL << (frac_bits - 1); 102 + x >>= frac_bits; 103 + } 104 + } 105 + 106 + return result; 107 + } 108 + 109 + /* 110 + * Calculate the next value of an exponential moving average 111 + * a_1 = a_0 * e + a * (1 - e) 112 + * 113 + * @last: [0, ULLONG_MAX >> HST_FIXED_SHIFT] 114 + * @next: [0, ULLONG_MAX >> HST_FIXED_SHIFT] 115 + * @weight: [0, HST_FIXED_1] 116 + * 117 + * Note: 118 + * To account for multiple periods in the same calculation, 119 + * a_n = a_0 * e^n + a * (1 - e^n), 120 + * so call fixed_ema(last, next, pow(weight, N)) 121 + */ 122 + static u64 fixed_ema(u64 last, u64 next, u64 weight) 123 + { 124 + last *= weight; 125 + last += next * (HST_FIXED_1 - weight); 126 + last += 1ULL << (HST_FIXED_SHIFT - 1); 127 + return last >> HST_FIXED_SHIFT; 128 + } 129 + 130 + static struct selector *alloc_selector(void) 131 + { 132 + struct selector *s = kmalloc(sizeof(*s), GFP_KERNEL); 133 + 134 + if (s) { 135 + INIT_LIST_HEAD(&s->valid_paths); 136 + INIT_LIST_HEAD(&s->failed_paths); 137 + spin_lock_init(&s->lock); 138 + s->valid_count = 0; 139 + } 140 + 141 + return s; 142 + } 143 + 144 + /* 145 + * Get the weight for a given time span. 146 + */ 147 + static u64 hst_weight(struct path_selector *ps, u64 delta) 148 + { 149 + struct selector *s = ps->context; 150 + int bucket = clamp(delta >> HST_BUCKET_SHIFT, 0ULL, 151 + HST_WEIGHT_COUNT - 1); 152 + 153 + return s->weights[bucket]; 154 + } 155 + 156 + /* 157 + * Set up the weights array. 158 + * 159 + * weights[len-1] = 0 160 + * weights[n] = base ^ (n + 1) 161 + */ 162 + static void hst_set_weights(struct path_selector *ps, unsigned int base) 163 + { 164 + struct selector *s = ps->context; 165 + int i; 166 + 167 + if (base >= HST_FIXED_1) 168 + return; 169 + 170 + for (i = 0; i < HST_WEIGHT_COUNT - 1; i++) 171 + s->weights[i] = fixed_power(base, HST_FIXED_SHIFT, i + 1); 172 + s->weights[HST_WEIGHT_COUNT - 1] = 0; 173 + } 174 + 175 + static int hst_create(struct path_selector *ps, unsigned int argc, char **argv) 176 + { 177 + struct selector *s; 178 + unsigned int base_weight = HST_FIXED_95; 179 + unsigned int threshold_multiplier = 0; 180 + char dummy; 181 + 182 + /* 183 + * Arguments: [<base_weight> [<threshold_multiplier>]] 184 + * <base_weight>: Base weight for ema [0, 1024) 10-bit fixed point. A 185 + * value of 0 will completely ignore any history. 186 + * If not given, default (HST_FIXED_95) is used. 187 + * <threshold_multiplier>: Minimum threshold multiplier for paths to 188 + * be considered different. That is, a path is 189 + * considered different iff (p1 > N * p2) where p1 190 + * is the path with higher service time. A threshold 191 + * of 1 or 0 has no effect. Defaults to 0. 
192 + */ 193 + if (argc > 2) 194 + return -EINVAL; 195 + 196 + if (argc && (sscanf(argv[0], "%u%c", &base_weight, &dummy) != 1 || 197 + base_weight >= HST_FIXED_1)) { 198 + return -EINVAL; 199 + } 200 + 201 + if (argc > 1 && (sscanf(argv[1], "%u%c", 202 + &threshold_multiplier, &dummy) != 1)) { 203 + return -EINVAL; 204 + } 205 + 206 + s = alloc_selector(); 207 + if (!s) 208 + return -ENOMEM; 209 + 210 + ps->context = s; 211 + 212 + hst_set_weights(ps, base_weight); 213 + s->threshold_multiplier = threshold_multiplier; 214 + return 0; 215 + } 216 + 217 + static void free_paths(struct list_head *paths) 218 + { 219 + struct path_info *pi, *next; 220 + 221 + list_for_each_entry_safe(pi, next, paths, list) { 222 + list_del(&pi->list); 223 + kfree(pi); 224 + } 225 + } 226 + 227 + static void hst_destroy(struct path_selector *ps) 228 + { 229 + struct selector *s = ps->context; 230 + 231 + free_paths(&s->valid_paths); 232 + free_paths(&s->failed_paths); 233 + kfree(s); 234 + ps->context = NULL; 235 + } 236 + 237 + static int hst_status(struct path_selector *ps, struct dm_path *path, 238 + status_type_t type, char *result, unsigned int maxlen) 239 + { 240 + unsigned int sz = 0; 241 + struct path_info *pi; 242 + 243 + if (!path) { 244 + struct selector *s = ps->context; 245 + 246 + DMEMIT("2 %u %u ", s->weights[0], s->threshold_multiplier); 247 + } else { 248 + pi = path->pscontext; 249 + 250 + switch (type) { 251 + case STATUSTYPE_INFO: 252 + DMEMIT("%llu %llu %llu ", pi->historical_service_time, 253 + pi->outstanding, pi->stale_after); 254 + break; 255 + case STATUSTYPE_TABLE: 256 + DMEMIT("0 "); 257 + break; 258 + } 259 + } 260 + 261 + return sz; 262 + } 263 + 264 + static int hst_add_path(struct path_selector *ps, struct dm_path *path, 265 + int argc, char **argv, char **error) 266 + { 267 + struct selector *s = ps->context; 268 + struct path_info *pi; 269 + unsigned int repeat_count = HST_MIN_IO; 270 + char dummy; 271 + unsigned long flags; 272 + 273 + /* 274 + * Arguments: [<repeat_count>] 275 + * <repeat_count>: The number of I/Os before switching path. 276 + * If not given, default (HST_MIN_IO) is used. 
277 + */ 278 + if (argc > 1) { 279 + *error = "historical-service-time ps: incorrect number of arguments"; 280 + return -EINVAL; 281 + } 282 + 283 + if (argc && (sscanf(argv[0], "%u%c", &repeat_count, &dummy) != 1)) { 284 + *error = "historical-service-time ps: invalid repeat count"; 285 + return -EINVAL; 286 + } 287 + 288 + /* allocate the path */ 289 + pi = kmalloc(sizeof(*pi), GFP_KERNEL); 290 + if (!pi) { 291 + *error = "historical-service-time ps: Error allocating path context"; 292 + return -ENOMEM; 293 + } 294 + 295 + pi->path = path; 296 + pi->repeat_count = repeat_count; 297 + 298 + pi->historical_service_time = HST_FIXED_1; 299 + 300 + spin_lock_init(&pi->lock); 301 + pi->outstanding = 0; 302 + 303 + pi->stale_after = 0; 304 + pi->last_finish = 0; 305 + 306 + path->pscontext = pi; 307 + 308 + spin_lock_irqsave(&s->lock, flags); 309 + list_add_tail(&pi->list, &s->valid_paths); 310 + s->valid_count++; 311 + spin_unlock_irqrestore(&s->lock, flags); 312 + 313 + return 0; 314 + } 315 + 316 + static void hst_fail_path(struct path_selector *ps, struct dm_path *path) 317 + { 318 + struct selector *s = ps->context; 319 + struct path_info *pi = path->pscontext; 320 + unsigned long flags; 321 + 322 + spin_lock_irqsave(&s->lock, flags); 323 + list_move(&pi->list, &s->failed_paths); 324 + s->valid_count--; 325 + spin_unlock_irqrestore(&s->lock, flags); 326 + } 327 + 328 + static int hst_reinstate_path(struct path_selector *ps, struct dm_path *path) 329 + { 330 + struct selector *s = ps->context; 331 + struct path_info *pi = path->pscontext; 332 + unsigned long flags; 333 + 334 + spin_lock_irqsave(&s->lock, flags); 335 + list_move_tail(&pi->list, &s->valid_paths); 336 + s->valid_count++; 337 + spin_unlock_irqrestore(&s->lock, flags); 338 + 339 + return 0; 340 + } 341 + 342 + static void hst_fill_compare(struct path_info *pi, u64 *hst, 343 + u64 *out, u64 *stale) 344 + { 345 + unsigned long flags; 346 + 347 + spin_lock_irqsave(&pi->lock, flags); 348 + *hst = pi->historical_service_time; 349 + *out = pi->outstanding; 350 + *stale = pi->stale_after; 351 + spin_unlock_irqrestore(&pi->lock, flags); 352 + } 353 + 354 + /* 355 + * Compare the estimated service time of 2 paths, pi1 and pi2, 356 + * for the incoming I/O. 357 + * 358 + * Returns: 359 + * < 0 : pi1 is better 360 + * 0 : no difference between pi1 and pi2 361 + * > 0 : pi2 is better 362 + * 363 + */ 364 + static long long hst_compare(struct path_info *pi1, struct path_info *pi2, 365 + u64 time_now, struct path_selector *ps) 366 + { 367 + struct selector *s = ps->context; 368 + u64 hst1, hst2; 369 + long long out1, out2, stale1, stale2; 370 + int pi2_better, over_threshold; 371 + 372 + hst_fill_compare(pi1, &hst1, &out1, &stale1); 373 + hst_fill_compare(pi2, &hst2, &out2, &stale2); 374 + 375 + /* Check here if estimated latency for two paths are too similar. 376 + * If this is the case, we skip extra calculation and just compare 377 + * outstanding requests. In this case, any unloaded paths will 378 + * be preferred. 379 + */ 380 + if (hst1 > hst2) 381 + over_threshold = hst1 > (s->threshold_multiplier * hst2); 382 + else 383 + over_threshold = hst2 > (s->threshold_multiplier * hst1); 384 + 385 + if (!over_threshold) 386 + return out1 - out2; 387 + 388 + /* 389 + * If an unloaded path is stale, choose it. If both paths are unloaded, 390 + * choose path that is the most stale. 
391 + * (If one path is loaded, choose the other) 392 + */ 393 + if ((!out1 && stale1 < time_now) || (!out2 && stale2 < time_now) || 394 + (!out1 && !out2)) 395 + return (!out2 * stale1) - (!out1 * stale2); 396 + 397 + /* Compare estimated service time. If outstanding is the same, we 398 + * don't need to multiply 399 + */ 400 + if (out1 == out2) { 401 + pi2_better = hst1 > hst2; 402 + } else { 403 + /* Potential overflow with out >= 1024 */ 404 + if (unlikely(out1 >= HST_MAX_INFLIGHT || 405 + out2 >= HST_MAX_INFLIGHT)) { 406 + /* If over 1023 in-flights, we may overflow if hst 407 + * is at max. (With this shift we still overflow at 408 + * 1048576 in-flights, which is high enough). 409 + */ 410 + hst1 >>= HST_FIXED_SHIFT; 411 + hst2 >>= HST_FIXED_SHIFT; 412 + } 413 + pi2_better = (1 + out1) * hst1 > (1 + out2) * hst2; 414 + } 415 + 416 + /* In the case that the 'winner' is stale, limit to equal usage. */ 417 + if (pi2_better) { 418 + if (stale2 < time_now) 419 + return out1 - out2; 420 + return 1; 421 + } 422 + if (stale1 < time_now) 423 + return out1 - out2; 424 + return -1; 425 + } 426 + 427 + static struct dm_path *hst_select_path(struct path_selector *ps, 428 + size_t nr_bytes) 429 + { 430 + struct selector *s = ps->context; 431 + struct path_info *pi = NULL, *best = NULL; 432 + u64 time_now = sched_clock(); 433 + struct dm_path *ret = NULL; 434 + unsigned long flags; 435 + 436 + spin_lock_irqsave(&s->lock, flags); 437 + if (list_empty(&s->valid_paths)) 438 + goto out; 439 + 440 + list_for_each_entry(pi, &s->valid_paths, list) { 441 + if (!best || (hst_compare(pi, best, time_now, ps) < 0)) 442 + best = pi; 443 + } 444 + 445 + if (!best) 446 + goto out; 447 + 448 + /* Move last used path to end (least preferred in case of ties) */ 449 + list_move_tail(&best->list, &s->valid_paths); 450 + 451 + ret = best->path; 452 + 453 + out: 454 + spin_unlock_irqrestore(&s->lock, flags); 455 + return ret; 456 + } 457 + 458 + static int hst_start_io(struct path_selector *ps, struct dm_path *path, 459 + size_t nr_bytes) 460 + { 461 + struct path_info *pi = path->pscontext; 462 + unsigned long flags; 463 + 464 + spin_lock_irqsave(&pi->lock, flags); 465 + pi->outstanding++; 466 + spin_unlock_irqrestore(&pi->lock, flags); 467 + 468 + return 0; 469 + } 470 + 471 + static u64 path_service_time(struct path_info *pi, u64 start_time) 472 + { 473 + u64 sched_now = ktime_get_ns(); 474 + 475 + /* if a previous disk request has finished after this IO was 476 + * sent to the hardware, pretend the submission happened 477 + * serially. 478 + */ 479 + if (time_after64(pi->last_finish, start_time)) 480 + start_time = pi->last_finish; 481 + 482 + pi->last_finish = sched_now; 483 + if (time_before64(sched_now, start_time)) 484 + return 0; 485 + 486 + return sched_now - start_time; 487 + } 488 + 489 + static int hst_end_io(struct path_selector *ps, struct dm_path *path, 490 + size_t nr_bytes, u64 start_time) 491 + { 492 + struct path_info *pi = path->pscontext; 493 + struct selector *s = ps->context; 494 + unsigned long flags; 495 + u64 st; 496 + 497 + spin_lock_irqsave(&pi->lock, flags); 498 + 499 + st = path_service_time(pi, start_time); 500 + pi->outstanding--; 501 + pi->historical_service_time = 502 + fixed_ema(pi->historical_service_time, 503 + min(st * HST_FIXED_1, HST_FIXED_MAX), 504 + hst_weight(ps, st)); 505 + 506 + /* 507 + * On request end, mark path as fresh. 
If a path hasn't 508 + * finished any requests within the fresh period, the estimated 509 + * service time is considered too optimistic and we limit the 510 + * maximum requests on that path. 511 + */ 512 + pi->stale_after = pi->last_finish + 513 + (s->valid_count * (pi->historical_service_time >> HST_FIXED_SHIFT)); 514 + 515 + spin_unlock_irqrestore(&pi->lock, flags); 516 + 517 + return 0; 518 + } 519 + 520 + static struct path_selector_type hst_ps = { 521 + .name = "historical-service-time", 522 + .module = THIS_MODULE, 523 + .table_args = 1, 524 + .info_args = 3, 525 + .create = hst_create, 526 + .destroy = hst_destroy, 527 + .status = hst_status, 528 + .add_path = hst_add_path, 529 + .fail_path = hst_fail_path, 530 + .reinstate_path = hst_reinstate_path, 531 + .select_path = hst_select_path, 532 + .start_io = hst_start_io, 533 + .end_io = hst_end_io, 534 + }; 535 + 536 + static int __init dm_hst_init(void) 537 + { 538 + int r = dm_register_path_selector(&hst_ps); 539 + 540 + if (r < 0) 541 + DMERR("register failed %d", r); 542 + 543 + DMINFO("version " HST_VERSION " loaded"); 544 + 545 + return r; 546 + } 547 + 548 + static void __exit dm_hst_exit(void) 549 + { 550 + int r = dm_unregister_path_selector(&hst_ps); 551 + 552 + if (r < 0) 553 + DMERR("unregister failed %d", r); 554 + } 555 + 556 + module_init(dm_hst_init); 557 + module_exit(dm_hst_exit); 558 + 559 + MODULE_DESCRIPTION(DM_NAME " measured service time oriented path selector"); 560 + MODULE_AUTHOR("Khazhismel Kumykov <khazhy@google.com>"); 561 + MODULE_LICENSE("GPL");
+1 -5
drivers/md/dm-integrity.c
··· 92 92 } s; 93 93 __u64 sector; 94 94 } u; 95 - commit_id_t last_bytes[0]; 95 + commit_id_t last_bytes[]; 96 96 /* __u8 tag[0]; */ 97 97 }; 98 98 ··· 1553 1553 char checksums_onstack[max((size_t)HASH_MAX_DIGESTSIZE, MAX_TAG_SIZE)]; 1554 1554 sector_t sector; 1555 1555 unsigned sectors_to_process; 1556 - sector_t save_metadata_block; 1557 - unsigned save_metadata_offset; 1558 1556 1559 1557 if (unlikely(ic->mode == 'R')) 1560 1558 goto skip_io; ··· 1603 1605 goto skip_io; 1604 1606 } 1605 1607 1606 - save_metadata_block = dio->metadata_block; 1607 - save_metadata_offset = dio->metadata_offset; 1608 1608 sector = dio->range.logical_sector; 1609 1609 sectors_to_process = dio->range.n_sectors; 1610 1610
+1 -1
drivers/md/dm-log-writes.c
··· 127 127 char *data; 128 128 u32 datalen; 129 129 struct list_head list; 130 - struct bio_vec vecs[0]; 130 + struct bio_vec vecs[]; 131 131 }; 132 132 133 133 struct per_bio_data {
+73 -50
drivers/md/dm-mpath.c
··· 439 439 } 440 440 441 441 /* 442 - * dm_report_EIO() is a macro instead of a function to make pr_debug() 442 + * dm_report_EIO() is a macro instead of a function to make pr_debug_ratelimited() 443 443 * report the function name and line number of the function from which 444 444 * it has been invoked. 445 445 */ ··· 447 447 do { \ 448 448 struct mapped_device *md = dm_table_get_md((m)->ti->table); \ 449 449 \ 450 - pr_debug("%s: returning EIO; QIFNP = %d; SQIFNP = %d; DNFS = %d\n", \ 451 - dm_device_name(md), \ 452 - test_bit(MPATHF_QUEUE_IF_NO_PATH, &(m)->flags), \ 453 - test_bit(MPATHF_SAVED_QUEUE_IF_NO_PATH, &(m)->flags), \ 454 - dm_noflush_suspending((m)->ti)); \ 450 + DMDEBUG_LIMIT("%s: returning EIO; QIFNP = %d; SQIFNP = %d; DNFS = %d", \ 451 + dm_device_name(md), \ 452 + test_bit(MPATHF_QUEUE_IF_NO_PATH, &(m)->flags), \ 453 + test_bit(MPATHF_SAVED_QUEUE_IF_NO_PATH, &(m)->flags), \ 454 + dm_noflush_suspending((m)->ti)); \ 455 455 } while (0) 456 456 457 457 /* 458 458 * Check whether bios must be queued in the device-mapper core rather 459 459 * than here in the target. 460 - * 461 - * If MPATHF_QUEUE_IF_NO_PATH and MPATHF_SAVED_QUEUE_IF_NO_PATH hold 462 - * the same value then we are not between multipath_presuspend() 463 - * and multipath_resume() calls and we have no need to check 464 - * for the DMF_NOFLUSH_SUSPENDING flag. 465 460 */ 466 - static bool __must_push_back(struct multipath *m, unsigned long flags) 461 + static bool __must_push_back(struct multipath *m) 467 462 { 468 - return ((test_bit(MPATHF_QUEUE_IF_NO_PATH, &flags) != 469 - test_bit(MPATHF_SAVED_QUEUE_IF_NO_PATH, &flags)) && 470 - dm_noflush_suspending(m->ti)); 463 + return dm_noflush_suspending(m->ti); 471 464 } 472 465 473 - /* 474 - * Following functions use READ_ONCE to get atomic access to 475 - * all m->flags to avoid taking spinlock 476 - */ 477 466 static bool must_push_back_rq(struct multipath *m) 478 467 { 479 - unsigned long flags = READ_ONCE(m->flags); 480 - return test_bit(MPATHF_QUEUE_IF_NO_PATH, &flags) || __must_push_back(m, flags); 481 - } 482 - 483 - static bool must_push_back_bio(struct multipath *m) 484 - { 485 - unsigned long flags = READ_ONCE(m->flags); 486 - return __must_push_back(m, flags); 468 + return test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags) || __must_push_back(m); 487 469 } 488 470 489 471 /* ··· 549 567 if (pgpath && pgpath->pg->ps.type->end_io) 550 568 pgpath->pg->ps.type->end_io(&pgpath->pg->ps, 551 569 &pgpath->path, 552 - mpio->nr_bytes); 570 + mpio->nr_bytes, 571 + clone->io_start_time_ns); 553 572 } 554 573 555 574 blk_put_request(clone); ··· 602 619 return DM_MAPIO_SUBMITTED; 603 620 604 621 if (!pgpath) { 605 - if (must_push_back_bio(m)) 622 + if (__must_push_back(m)) 606 623 return DM_MAPIO_REQUEUE; 607 624 dm_report_EIO(m); 608 625 return DM_MAPIO_KILL; ··· 692 709 * If we run out of usable paths, should we queue I/O or error it? 
693 710 */ 694 711 static int queue_if_no_path(struct multipath *m, bool queue_if_no_path, 695 - bool save_old_value) 712 + bool save_old_value, const char *caller) 696 713 { 697 714 unsigned long flags; 715 + bool queue_if_no_path_bit, saved_queue_if_no_path_bit; 716 + const char *dm_dev_name = dm_device_name(dm_table_get_md(m->ti->table)); 717 + 718 + DMDEBUG("%s: %s caller=%s queue_if_no_path=%d save_old_value=%d", 719 + dm_dev_name, __func__, caller, queue_if_no_path, save_old_value); 698 720 699 721 spin_lock_irqsave(&m->lock, flags); 700 - assign_bit(MPATHF_SAVED_QUEUE_IF_NO_PATH, &m->flags, 701 - (save_old_value && test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags)) || 702 - (!save_old_value && queue_if_no_path)); 722 + 723 + queue_if_no_path_bit = test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags); 724 + saved_queue_if_no_path_bit = test_bit(MPATHF_SAVED_QUEUE_IF_NO_PATH, &m->flags); 725 + 726 + if (save_old_value) { 727 + if (unlikely(!queue_if_no_path_bit && saved_queue_if_no_path_bit)) { 728 + DMERR("%s: QIFNP disabled but saved as enabled, saving again loses state, not saving!", 729 + dm_dev_name); 730 + } else 731 + assign_bit(MPATHF_SAVED_QUEUE_IF_NO_PATH, &m->flags, queue_if_no_path_bit); 732 + } else if (!queue_if_no_path && saved_queue_if_no_path_bit) { 733 + /* due to "fail_if_no_path" message, need to honor it. */ 734 + clear_bit(MPATHF_SAVED_QUEUE_IF_NO_PATH, &m->flags); 735 + } 703 736 assign_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags, queue_if_no_path); 737 + 738 + DMDEBUG("%s: after %s changes; QIFNP = %d; SQIFNP = %d; DNFS = %d", 739 + dm_dev_name, __func__, 740 + test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags), 741 + test_bit(MPATHF_SAVED_QUEUE_IF_NO_PATH, &m->flags), 742 + dm_noflush_suspending(m->ti)); 743 + 704 744 spin_unlock_irqrestore(&m->lock, flags); 705 745 706 746 if (!queue_if_no_path) { ··· 744 738 struct mapped_device *md = dm_table_get_md(m->ti->table); 745 739 746 740 DMWARN("queue_if_no_path timeout on %s, failing queued IO", dm_device_name(md)); 747 - queue_if_no_path(m, false, false); 741 + queue_if_no_path(m, false, false, __func__); 748 742 } 749 743 750 744 /* ··· 1084 1078 argc--; 1085 1079 1086 1080 if (!strcasecmp(arg_name, "queue_if_no_path")) { 1087 - r = queue_if_no_path(m, true, false); 1081 + r = queue_if_no_path(m, true, false, __func__); 1088 1082 continue; 1089 1083 } 1090 1084 ··· 1285 1279 if (!pgpath->is_active) 1286 1280 goto out; 1287 1281 1288 - DMWARN("Failing path %s.", pgpath->path.dev->name); 1282 + DMWARN("%s: Failing path %s.", 1283 + dm_device_name(dm_table_get_md(m->ti->table)), 1284 + pgpath->path.dev->name); 1289 1285 1290 1286 pgpath->pg->ps.type->fail_path(&pgpath->pg->ps, &pgpath->path); 1291 1287 pgpath->is_active = false; ··· 1326 1318 if (pgpath->is_active) 1327 1319 goto out; 1328 1320 1329 - DMWARN("Reinstating path %s.", pgpath->path.dev->name); 1321 + DMWARN("%s: Reinstating path %s.", 1322 + dm_device_name(dm_table_get_md(m->ti->table)), 1323 + pgpath->path.dev->name); 1330 1324 1331 1325 r = pgpath->pg->ps.type->reinstate_path(&pgpath->pg->ps, &pgpath->path); 1332 1326 if (r) ··· 1627 1617 struct path_selector *ps = &pgpath->pg->ps; 1628 1618 1629 1619 if (ps->type->end_io) 1630 - ps->type->end_io(ps, &pgpath->path, mpio->nr_bytes); 1620 + ps->type->end_io(ps, &pgpath->path, mpio->nr_bytes, 1621 + clone->io_start_time_ns); 1631 1622 } 1632 1623 1633 1624 return r; ··· 1651 1640 1652 1641 if (atomic_read(&m->nr_valid_paths) == 0 && 1653 1642 !test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags)) { 1654 - if (must_push_back_bio(m)) { 
1643 + if (__must_push_back(m)) { 1655 1644 r = DM_ENDIO_REQUEUE; 1656 1645 } else { 1657 1646 dm_report_EIO(m); ··· 1672 1661 struct path_selector *ps = &pgpath->pg->ps; 1673 1662 1674 1663 if (ps->type->end_io) 1675 - ps->type->end_io(ps, &pgpath->path, mpio->nr_bytes); 1664 + ps->type->end_io(ps, &pgpath->path, mpio->nr_bytes, 1665 + dm_start_time_ns_from_clone(clone)); 1676 1666 } 1677 1667 1678 1668 return r; 1679 1669 } 1680 1670 1681 1671 /* 1682 - * Suspend can't complete until all the I/O is processed so if 1683 - * the last path fails we must error any remaining I/O. 1684 - * Note that if the freeze_bdev fails while suspending, the 1685 - * queue_if_no_path state is lost - userspace should reset it. 1672 + * Suspend with flush can't complete until all the I/O is processed 1673 + * so if the last path fails we must error any remaining I/O. 1674 + * - Note that if the freeze_bdev fails while suspending, the 1675 + * queue_if_no_path state is lost - userspace should reset it. 1676 + * Otherwise, during noflush suspend, queue_if_no_path will not change. 1686 1677 */ 1687 1678 static void multipath_presuspend(struct dm_target *ti) 1688 1679 { 1689 1680 struct multipath *m = ti->private; 1690 1681 1691 - queue_if_no_path(m, false, true); 1682 + /* FIXME: bio-based shouldn't need to always disable queue_if_no_path */ 1683 + if (m->queue_mode == DM_TYPE_BIO_BASED || !dm_noflush_suspending(m->ti)) 1684 + queue_if_no_path(m, false, true, __func__); 1692 1685 } 1693 1686 1694 1687 static void multipath_postsuspend(struct dm_target *ti) ··· 1713 1698 unsigned long flags; 1714 1699 1715 1700 spin_lock_irqsave(&m->lock, flags); 1716 - assign_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags, 1717 - test_bit(MPATHF_SAVED_QUEUE_IF_NO_PATH, &m->flags)); 1701 + if (test_bit(MPATHF_SAVED_QUEUE_IF_NO_PATH, &m->flags)) { 1702 + set_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags); 1703 + clear_bit(MPATHF_SAVED_QUEUE_IF_NO_PATH, &m->flags); 1704 + } 1705 + 1706 + DMDEBUG("%s: %s finished; QIFNP = %d; SQIFNP = %d", 1707 + dm_device_name(dm_table_get_md(m->ti->table)), __func__, 1708 + test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags), 1709 + test_bit(MPATHF_SAVED_QUEUE_IF_NO_PATH, &m->flags)); 1710 + 1718 1711 spin_unlock_irqrestore(&m->lock, flags); 1719 1712 } 1720 1713 ··· 1882 1859 1883 1860 if (argc == 1) { 1884 1861 if (!strcasecmp(argv[0], "queue_if_no_path")) { 1885 - r = queue_if_no_path(m, true, false); 1862 + r = queue_if_no_path(m, true, false, __func__); 1886 1863 spin_lock_irqsave(&m->lock, flags); 1887 1864 enable_nopath_timeout(m); 1888 1865 spin_unlock_irqrestore(&m->lock, flags); 1889 1866 goto out; 1890 1867 } else if (!strcasecmp(argv[0], "fail_if_no_path")) { 1891 - r = queue_if_no_path(m, false, false); 1868 + r = queue_if_no_path(m, false, false, __func__); 1892 1869 disable_nopath_timeout(m); 1893 1870 goto out; 1894 1871 } ··· 1941 1918 int r; 1942 1919 1943 1920 current_pgpath = READ_ONCE(m->current_pgpath); 1944 - if (!current_pgpath) 1921 + if (!current_pgpath || !test_bit(MPATHF_QUEUE_IO, &m->flags)) 1945 1922 current_pgpath = choose_pgpath(m, 0); 1946 1923 1947 1924 if (current_pgpath) {
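The queue_if_no_path() rework above replaces the old single assign_bit() with an explicit state machine, so that a repeated "save" request can no longer clobber a previously saved "enabled" state, and an explicit "fail_if_no_path" message also drops the saved bit. A minimal userspace-style sketch of just those flag-update rules, using a hypothetical struct and helper (not the kernel API, only the decision logic from the hunk above):

#include <stdbool.h>
#include <stdio.h>

struct mpath_flags {
	bool qifnp;       /* stands in for MPATHF_QUEUE_IF_NO_PATH */
	bool saved_qifnp; /* stands in for MPATHF_SAVED_QUEUE_IF_NO_PATH */
};

static void set_queue_if_no_path(struct mpath_flags *f, bool queue_if_no_path,
				 bool save_old_value)
{
	if (save_old_value) {
		/* Saving "disabled" over a saved "enabled" would lose state. */
		if (!(!f->qifnp && f->saved_qifnp))
			f->saved_qifnp = f->qifnp;
	} else if (!queue_if_no_path && f->saved_qifnp) {
		/* An explicit "fail_if_no_path" also drops the saved state. */
		f->saved_qifnp = false;
	}
	f->qifnp = queue_if_no_path;
}

int main(void)
{
	struct mpath_flags f = { .qifnp = true, .saved_qifnp = false };

	set_queue_if_no_path(&f, false, true); /* e.g. presuspend: save, then disable */
	set_queue_if_no_path(&f, false, true); /* a second save must not lose "enabled" */
	printf("qifnp=%d saved=%d\n", f.qifnp, f.saved_qifnp); /* prints qifnp=0 saved=1 */
	return 0;
}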
drivers/md/dm-path-selector.h (+1 -1)
··· 74 74 int (*start_io) (struct path_selector *ps, struct dm_path *path, 75 75 size_t nr_bytes); 76 76 int (*end_io) (struct path_selector *ps, struct dm_path *path, 77 - size_t nr_bytes); 77 + size_t nr_bytes, u64 start_time); 78 78 }; 79 79 80 80 /* Register a path selector */
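With the start_time argument added to the end_io hook above (the multipath end_io paths pass clone->io_start_time_ns or dm_start_time_ns_from_clone(clone)), a path selector can derive per-path completion latency. A hedged sketch of how a hypothetical latency-tracking selector might use it; the per-path struct, the locking, and the 1/8 weighting are illustrative assumptions, not an in-tree selector:

#include <linux/ktime.h>
#include <linux/spinlock.h>
#include "dm-path-selector.h"

/* Illustrative per-path state kept in path->pscontext. */
struct example_path_info {
	spinlock_t lock;
	u64 ewma_ns;	/* smoothed per-I/O service time */
};

static int example_end_io(struct path_selector *ps, struct dm_path *path,
			  size_t nr_bytes, u64 start_time)
{
	struct example_path_info *pi = path->pscontext;
	u64 elapsed_ns = ktime_get_ns() - start_time;
	unsigned long flags;

	spin_lock_irqsave(&pi->lock, flags);
	/* 7/8 old + 1/8 new: a cheap exponential moving average. */
	pi->ewma_ns = pi->ewma_ns - (pi->ewma_ns >> 3) + (elapsed_ns >> 3);
	spin_unlock_irqrestore(&pi->lock, flags);

	return 0;
}

The choose-path hook of such a selector could then prefer the path with the smallest smoothed service time rather than only counting in-flight I/O.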
drivers/md/dm-queue-length.c (+1 -1)
··· 227 227 } 228 228 229 229 static int ql_end_io(struct path_selector *ps, struct dm_path *path, 230 - size_t nr_bytes) 230 + size_t nr_bytes, u64 start_time) 231 231 { 232 232 struct path_info *pi = path->pscontext; 233 233
drivers/md/dm-raid.c (+1 -1)
··· 254 254 int mode; 255 255 } journal_dev; 256 256 257 - struct raid_dev dev[0]; 257 + struct raid_dev dev[]; 258 258 }; 259 259 260 260 static void rs_config_backup(struct raid_set *rs, struct rs_layout *l)
drivers/md/dm-raid1.c (+1 -1)
··· 83 83 struct work_struct trigger_event; 84 84 85 85 unsigned nr_mirrors; 86 - struct mirror mirror[0]; 86 + struct mirror mirror[]; 87 87 }; 88 88 89 89 DECLARE_DM_KCOPYD_THROTTLE_WITH_MODULE_PARM(raid1_resync_throttle,
drivers/md/dm-service-time.c (+1 -1)
··· 309 309 } 310 310 311 311 static int st_end_io(struct path_selector *ps, struct dm_path *path, 312 - size_t nr_bytes) 312 + size_t nr_bytes, u64 start_time) 313 313 { 314 314 struct path_info *pi = path->pscontext; 315 315
drivers/md/dm-stats.c (+1 -1)
··· 56 56 size_t percpu_alloc_size; 57 57 size_t histogram_alloc_size; 58 58 struct dm_stat_percpu *stat_percpu[NR_CPUS]; 59 - struct dm_stat_shared stat_shared[0]; 59 + struct dm_stat_shared stat_shared[]; 60 60 }; 61 61 62 62 #define STAT_PRECISE_TIMESTAMPS 1
drivers/md/dm-stripe.c (+1 -1)
··· 41 41 /* Work struct used for triggering events*/ 42 42 struct work_struct trigger_event; 43 43 44 - struct stripe stripe[0]; 44 + struct stripe stripe[]; 45 45 }; 46 46 47 47 /*
drivers/md/dm-switch.c (+1 -1)
··· 53 53 /* 54 54 * Array of dm devices to switch between. 55 55 */ 56 - struct switch_path path_list[0]; 56 + struct switch_path path_list[]; 57 57 }; 58 58 59 59 static struct switch_ctx *alloc_switch_ctx(struct dm_target *ti, unsigned nr_paths,
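The dm-raid, dm-raid1, dm-stats, dm-stripe and dm-switch hunks above all make the same mechanical change: trailing zero-length arrays (member[0], a GNU extension) become C99 flexible array members (member[]), which makes out-of-bounds accesses visible to compiler array-bounds checking; the allocation pattern is unchanged. A minimal standalone sketch of that pattern with an illustrative struct (in-kernel callers typically compute the size with struct_size() to guard against overflow):

#include <stdio.h>
#include <stdlib.h>

struct example_set {
	unsigned int nr_items;
	int item[];	/* flexible array member; the old style was "int item[0]" */
};

int main(void)
{
	unsigned int n = 4;
	/* One allocation covers the fixed header plus n trailing elements. */
	struct example_set *s = malloc(sizeof(*s) + n * sizeof(s->item[0]));

	if (!s)
		return 1;
	s->nr_items = n;
	for (unsigned int i = 0; i < n; i++)
		s->item[i] = (int)i;
	printf("%u items, last = %d\n", s->nr_items, s->item[n - 1]);
	free(s);
	return 0;
}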
drivers/md/dm-writecache.c (+37 -5)
··· 234 234 235 235 wc->memory_vmapped = false; 236 236 237 - if (!wc->ssd_dev->dax_dev) { 238 - r = -EOPNOTSUPP; 239 - goto err1; 240 - } 241 237 s = wc->memory_map_size; 242 238 p = s >> PAGE_SHIFT; 243 239 if (!p) { ··· 1139 1143 return r; 1140 1144 } 1141 1145 1146 + static void memcpy_flushcache_optimized(void *dest, void *source, size_t size) 1147 + { 1148 + /* 1149 + * clflushopt performs better with block size 1024, 2048, 4096 1150 + * non-temporal stores perform better with block size 512 1151 + * 1152 + * block size 512 1024 2048 4096 1153 + * movnti 496 MB/s 642 MB/s 725 MB/s 744 MB/s 1154 + * clflushopt 373 MB/s 688 MB/s 1.1 GB/s 1.2 GB/s 1155 + * 1156 + * We see that movnti performs better for 512-byte blocks, and 1157 + * clflushopt performs better for 1024-byte and larger blocks. So, we 1158 + * prefer clflushopt for sizes >= 768. 1159 + * 1160 + * NOTE: this happens to be the case now (with dm-writecache's single 1161 + * threaded model) but re-evaluate this once memcpy_flushcache() is 1162 + * enabled to use movdir64b which might invalidate this performance 1163 + * advantage seen with cache-allocating-writes plus flushing. 1164 + */ 1165 + #ifdef CONFIG_X86 1166 + if (static_cpu_has(X86_FEATURE_CLFLUSHOPT) && 1167 + likely(boot_cpu_data.x86_clflush_size == 64) && 1168 + likely(size >= 768)) { 1169 + do { 1170 + memcpy((void *)dest, (void *)source, 64); 1171 + clflushopt((void *)dest); 1172 + dest += 64; 1173 + source += 64; 1174 + size -= 64; 1175 + } while (size >= 64); 1176 + return; 1177 + } 1178 + #endif 1179 + memcpy_flushcache(dest, source, size); 1180 + } 1181 + 1142 1182 static void bio_copy_block(struct dm_writecache *wc, struct bio *bio, void *data) 1143 1183 { 1144 1184 void *buf; ··· 1200 1168 } 1201 1169 } else { 1202 1170 flush_dcache_page(bio_page(bio)); 1203 - memcpy_flushcache(data, buf, size); 1171 + memcpy_flushcache_optimized(data, buf, size); 1204 1172 } 1205 1173 1206 1174 bvec_kunmap_irq(buf, &flags);
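A note on the cut-over in memcpy_flushcache_optimized() above: 768 is exactly halfway between 512 bytes, the largest measured size where movnti still wins (496 MB/s vs 373 MB/s), and 1024 bytes, the smallest where clflushopt wins (688 MB/s vs 642 MB/s), since (512 + 1024) / 2 = 768. Copies below that threshold, or on CPUs without clflushopt or with a cache line other than 64 bytes, fall through to the plain memcpy_flushcache() non-temporal path; at or above it the loop copies and flushes one 64-byte cache line per iteration, which is why the x86_clflush_size == 64 check is part of the condition.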
drivers/md/dm-zoned-metadata.c (+770 -274)
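One detail worth noting before the hunks below: the identification fields added to the version 2 on-disk superblock keep struct dmz_super at exactly one 512-byte sector. Following the running offsets in the field comments, 48 bytes through the CRC plus 32 for dmz_label gives 80, plus 16 for dmz_uuid gives 96, plus 16 for dev_uuid gives 112, and the reserved padding shrinks from 464 to 400 bytes so that 112 + 400 = 512, just as 48 + 464 = 512 did for the version 1 layout.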
··· 16 16 /* 17 17 * Metadata version. 18 18 */ 19 - #define DMZ_META_VER 1 19 + #define DMZ_META_VER 2 20 20 21 21 /* 22 22 * On-disk super block magic. ··· 69 69 /* Checksum */ 70 70 __le32 crc; /* 48 */ 71 71 72 + /* DM-Zoned label */ 73 + u8 dmz_label[32]; /* 80 */ 74 + 75 + /* DM-Zoned UUID */ 76 + u8 dmz_uuid[16]; /* 96 */ 77 + 78 + /* Device UUID */ 79 + u8 dev_uuid[16]; /* 112 */ 80 + 72 81 /* Padding to full 512B sector */ 73 - u8 reserved[464]; /* 512 */ 82 + u8 reserved[400]; /* 512 */ 74 83 }; 75 84 76 85 /* ··· 131 122 */ 132 123 struct dmz_sb { 133 124 sector_t block; 125 + struct dmz_dev *dev; 134 126 struct dmz_mblock *mblk; 135 127 struct dmz_super *sb; 128 + struct dm_zone *zone; 136 129 }; 137 130 138 131 /* ··· 142 131 */ 143 132 struct dmz_metadata { 144 133 struct dmz_dev *dev; 134 + unsigned int nr_devs; 135 + 136 + char devname[BDEVNAME_SIZE]; 137 + char label[BDEVNAME_SIZE]; 138 + uuid_t uuid; 145 139 146 140 sector_t zone_bitmap_size; 147 141 unsigned int zone_nr_bitmap_blocks; 148 142 unsigned int zone_bits_per_mblk; 149 143 144 + sector_t zone_nr_blocks; 145 + sector_t zone_nr_blocks_shift; 146 + 147 + sector_t zone_nr_sectors; 148 + sector_t zone_nr_sectors_shift; 149 + 150 150 unsigned int nr_bitmap_blocks; 151 151 unsigned int nr_map_blocks; 152 152 153 + unsigned int nr_zones; 153 154 unsigned int nr_useable_zones; 154 155 unsigned int nr_meta_blocks; 155 156 unsigned int nr_meta_zones; 156 157 unsigned int nr_data_zones; 158 + unsigned int nr_cache_zones; 157 159 unsigned int nr_rnd_zones; 158 160 unsigned int nr_reserved_seq; 159 161 unsigned int nr_chunks; 160 162 161 163 /* Zone information array */ 162 - struct dm_zone *zones; 164 + struct xarray zones; 163 165 164 - struct dm_zone *sb_zone; 165 166 struct dmz_sb sb[2]; 166 167 unsigned int mblk_primary; 168 + unsigned int sb_version; 167 169 u64 sb_gen; 168 170 unsigned int min_nr_mblks; 169 171 unsigned int max_nr_mblks; ··· 192 168 /* Zone allocation management */ 193 169 struct mutex map_lock; 194 170 struct dmz_mblock **map_mblk; 195 - unsigned int nr_rnd; 196 - atomic_t unmap_nr_rnd; 197 - struct list_head unmap_rnd_list; 198 - struct list_head map_rnd_list; 199 171 200 - unsigned int nr_seq; 201 - atomic_t unmap_nr_seq; 202 - struct list_head unmap_seq_list; 203 - struct list_head map_seq_list; 172 + unsigned int nr_cache; 173 + atomic_t unmap_nr_cache; 174 + struct list_head unmap_cache_list; 175 + struct list_head map_cache_list; 204 176 205 177 atomic_t nr_reserved_seq_zones; 206 178 struct list_head reserved_seq_zones_list; ··· 204 184 wait_queue_head_t free_wq; 205 185 }; 206 186 187 + #define dmz_zmd_info(zmd, format, args...) \ 188 + DMINFO("(%s): " format, (zmd)->label, ## args) 189 + 190 + #define dmz_zmd_err(zmd, format, args...) \ 191 + DMERR("(%s): " format, (zmd)->label, ## args) 192 + 193 + #define dmz_zmd_warn(zmd, format, args...) \ 194 + DMWARN("(%s): " format, (zmd)->label, ## args) 195 + 196 + #define dmz_zmd_debug(zmd, format, args...) 
\ 197 + DMDEBUG("(%s): " format, (zmd)->label, ## args) 207 198 /* 208 199 * Various accessors 209 200 */ 210 - unsigned int dmz_id(struct dmz_metadata *zmd, struct dm_zone *zone) 201 + static unsigned int dmz_dev_zone_id(struct dmz_metadata *zmd, struct dm_zone *zone) 211 202 { 212 - return ((unsigned int)(zone - zmd->zones)); 203 + if (WARN_ON(!zone)) 204 + return 0; 205 + 206 + return zone->id - zone->dev->zone_offset; 213 207 } 214 208 215 209 sector_t dmz_start_sect(struct dmz_metadata *zmd, struct dm_zone *zone) 216 210 { 217 - return (sector_t)dmz_id(zmd, zone) << zmd->dev->zone_nr_sectors_shift; 211 + unsigned int zone_id = dmz_dev_zone_id(zmd, zone); 212 + 213 + return (sector_t)zone_id << zmd->zone_nr_sectors_shift; 218 214 } 219 215 220 216 sector_t dmz_start_block(struct dmz_metadata *zmd, struct dm_zone *zone) 221 217 { 222 - return (sector_t)dmz_id(zmd, zone) << zmd->dev->zone_nr_blocks_shift; 218 + unsigned int zone_id = dmz_dev_zone_id(zmd, zone); 219 + 220 + return (sector_t)zone_id << zmd->zone_nr_blocks_shift; 221 + } 222 + 223 + unsigned int dmz_zone_nr_blocks(struct dmz_metadata *zmd) 224 + { 225 + return zmd->zone_nr_blocks; 226 + } 227 + 228 + unsigned int dmz_zone_nr_blocks_shift(struct dmz_metadata *zmd) 229 + { 230 + return zmd->zone_nr_blocks_shift; 231 + } 232 + 233 + unsigned int dmz_zone_nr_sectors(struct dmz_metadata *zmd) 234 + { 235 + return zmd->zone_nr_sectors; 236 + } 237 + 238 + unsigned int dmz_zone_nr_sectors_shift(struct dmz_metadata *zmd) 239 + { 240 + return zmd->zone_nr_sectors_shift; 241 + } 242 + 243 + unsigned int dmz_nr_zones(struct dmz_metadata *zmd) 244 + { 245 + return zmd->nr_zones; 223 246 } 224 247 225 248 unsigned int dmz_nr_chunks(struct dmz_metadata *zmd) ··· 270 207 return zmd->nr_chunks; 271 208 } 272 209 273 - unsigned int dmz_nr_rnd_zones(struct dmz_metadata *zmd) 210 + unsigned int dmz_nr_rnd_zones(struct dmz_metadata *zmd, int idx) 274 211 { 275 - return zmd->nr_rnd; 212 + return zmd->dev[idx].nr_rnd; 276 213 } 277 214 278 - unsigned int dmz_nr_unmap_rnd_zones(struct dmz_metadata *zmd) 215 + unsigned int dmz_nr_unmap_rnd_zones(struct dmz_metadata *zmd, int idx) 279 216 { 280 - return atomic_read(&zmd->unmap_nr_rnd); 217 + return atomic_read(&zmd->dev[idx].unmap_nr_rnd); 218 + } 219 + 220 + unsigned int dmz_nr_cache_zones(struct dmz_metadata *zmd) 221 + { 222 + return zmd->nr_cache; 223 + } 224 + 225 + unsigned int dmz_nr_unmap_cache_zones(struct dmz_metadata *zmd) 226 + { 227 + return atomic_read(&zmd->unmap_nr_cache); 228 + } 229 + 230 + unsigned int dmz_nr_seq_zones(struct dmz_metadata *zmd, int idx) 231 + { 232 + return zmd->dev[idx].nr_seq; 233 + } 234 + 235 + unsigned int dmz_nr_unmap_seq_zones(struct dmz_metadata *zmd, int idx) 236 + { 237 + return atomic_read(&zmd->dev[idx].unmap_nr_seq); 238 + } 239 + 240 + static struct dm_zone *dmz_get(struct dmz_metadata *zmd, unsigned int zone_id) 241 + { 242 + return xa_load(&zmd->zones, zone_id); 243 + } 244 + 245 + static struct dm_zone *dmz_insert(struct dmz_metadata *zmd, 246 + unsigned int zone_id, struct dmz_dev *dev) 247 + { 248 + struct dm_zone *zone = kzalloc(sizeof(struct dm_zone), GFP_KERNEL); 249 + 250 + if (!zone) 251 + return ERR_PTR(-ENOMEM); 252 + 253 + if (xa_insert(&zmd->zones, zone_id, zone, GFP_KERNEL)) { 254 + kfree(zone); 255 + return ERR_PTR(-EBUSY); 256 + } 257 + 258 + INIT_LIST_HEAD(&zone->link); 259 + atomic_set(&zone->refcount, 0); 260 + zone->id = zone_id; 261 + zone->chunk = DMZ_MAP_UNMAPPED; 262 + zone->dev = dev; 263 + 264 + return zone; 265 + } 266 + 
267 + const char *dmz_metadata_label(struct dmz_metadata *zmd) 268 + { 269 + return (const char *)zmd->label; 270 + } 271 + 272 + bool dmz_check_dev(struct dmz_metadata *zmd) 273 + { 274 + unsigned int i; 275 + 276 + for (i = 0; i < zmd->nr_devs; i++) { 277 + if (!dmz_check_bdev(&zmd->dev[i])) 278 + return false; 279 + } 280 + return true; 281 + } 282 + 283 + bool dmz_dev_is_dying(struct dmz_metadata *zmd) 284 + { 285 + unsigned int i; 286 + 287 + for (i = 0; i < zmd->nr_devs; i++) { 288 + if (dmz_bdev_is_dying(&zmd->dev[i])) 289 + return true; 290 + } 291 + return false; 281 292 } 282 293 283 294 /* ··· 539 402 { 540 403 struct dmz_mblock *mblk, *m; 541 404 sector_t block = zmd->sb[zmd->mblk_primary].block + mblk_no; 405 + struct dmz_dev *dev = zmd->sb[zmd->mblk_primary].dev; 542 406 struct bio *bio; 543 407 544 - if (dmz_bdev_is_dying(zmd->dev)) 408 + if (dmz_bdev_is_dying(dev)) 545 409 return ERR_PTR(-EIO); 546 410 547 411 /* Get a new block and a BIO to read it */ ··· 578 440 579 441 /* Submit read BIO */ 580 442 bio->bi_iter.bi_sector = dmz_blk2sect(block); 581 - bio_set_dev(bio, zmd->dev->bdev); 443 + bio_set_dev(bio, dev->bdev); 582 444 bio->bi_private = mblk; 583 445 bio->bi_end_io = dmz_mblock_bio_end_io; 584 446 bio_set_op_attrs(bio, REQ_OP_READ, REQ_META | REQ_PRIO); ··· 675 537 sector_t mblk_no) 676 538 { 677 539 struct dmz_mblock *mblk; 540 + struct dmz_dev *dev = zmd->sb[zmd->mblk_primary].dev; 678 541 679 542 /* Check rbtree */ 680 543 spin_lock(&zmd->mblk_lock); ··· 694 555 TASK_UNINTERRUPTIBLE); 695 556 if (test_bit(DMZ_META_ERROR, &mblk->state)) { 696 557 dmz_release_mblock(zmd, mblk); 697 - dmz_check_bdev(zmd->dev); 558 + dmz_check_bdev(dev); 698 559 return ERR_PTR(-EIO); 699 560 } 700 561 ··· 718 579 static int dmz_write_mblock(struct dmz_metadata *zmd, struct dmz_mblock *mblk, 719 580 unsigned int set) 720 581 { 582 + struct dmz_dev *dev = zmd->sb[set].dev; 721 583 sector_t block = zmd->sb[set].block + mblk->no; 722 584 struct bio *bio; 723 585 724 - if (dmz_bdev_is_dying(zmd->dev)) 586 + if (dmz_bdev_is_dying(dev)) 725 587 return -EIO; 726 588 727 589 bio = bio_alloc(GFP_NOIO, 1); ··· 734 594 set_bit(DMZ_META_WRITING, &mblk->state); 735 595 736 596 bio->bi_iter.bi_sector = dmz_blk2sect(block); 737 - bio_set_dev(bio, zmd->dev->bdev); 597 + bio_set_dev(bio, dev->bdev); 738 598 bio->bi_private = mblk; 739 599 bio->bi_end_io = dmz_mblock_bio_end_io; 740 600 bio_set_op_attrs(bio, REQ_OP_WRITE, REQ_META | REQ_PRIO); ··· 747 607 /* 748 608 * Read/write a metadata block. 
749 609 */ 750 - static int dmz_rdwr_block(struct dmz_metadata *zmd, int op, sector_t block, 751 - struct page *page) 610 + static int dmz_rdwr_block(struct dmz_dev *dev, int op, 611 + sector_t block, struct page *page) 752 612 { 753 613 struct bio *bio; 754 614 int ret; 755 615 756 - if (dmz_bdev_is_dying(zmd->dev)) 616 + if (WARN_ON(!dev)) 617 + return -EIO; 618 + 619 + if (dmz_bdev_is_dying(dev)) 757 620 return -EIO; 758 621 759 622 bio = bio_alloc(GFP_NOIO, 1); ··· 764 621 return -ENOMEM; 765 622 766 623 bio->bi_iter.bi_sector = dmz_blk2sect(block); 767 - bio_set_dev(bio, zmd->dev->bdev); 624 + bio_set_dev(bio, dev->bdev); 768 625 bio_set_op_attrs(bio, op, REQ_SYNC | REQ_META | REQ_PRIO); 769 626 bio_add_page(bio, page, DMZ_BLOCK_SIZE, 0); 770 627 ret = submit_bio_wait(bio); 771 628 bio_put(bio); 772 629 773 630 if (ret) 774 - dmz_check_bdev(zmd->dev); 631 + dmz_check_bdev(dev); 775 632 return ret; 776 633 } 777 634 ··· 780 637 */ 781 638 static int dmz_write_sb(struct dmz_metadata *zmd, unsigned int set) 782 639 { 783 - sector_t block = zmd->sb[set].block; 784 640 struct dmz_mblock *mblk = zmd->sb[set].mblk; 785 641 struct dmz_super *sb = zmd->sb[set].sb; 642 + struct dmz_dev *dev = zmd->sb[set].dev; 643 + sector_t sb_block; 786 644 u64 sb_gen = zmd->sb_gen + 1; 787 645 int ret; 788 646 789 647 sb->magic = cpu_to_le32(DMZ_MAGIC); 790 - sb->version = cpu_to_le32(DMZ_META_VER); 648 + 649 + sb->version = cpu_to_le32(zmd->sb_version); 650 + if (zmd->sb_version > 1) { 651 + BUILD_BUG_ON(UUID_SIZE != 16); 652 + export_uuid(sb->dmz_uuid, &zmd->uuid); 653 + memcpy(sb->dmz_label, zmd->label, BDEVNAME_SIZE); 654 + export_uuid(sb->dev_uuid, &dev->uuid); 655 + } 791 656 792 657 sb->gen = cpu_to_le64(sb_gen); 793 658 794 - sb->sb_block = cpu_to_le64(block); 659 + /* 660 + * The metadata always references the absolute block address, 661 + * ie relative to the entire block range, not the per-device 662 + * block address. 
663 + */ 664 + sb_block = zmd->sb[set].zone->id << zmd->zone_nr_blocks_shift; 665 + sb->sb_block = cpu_to_le64(sb_block); 795 666 sb->nr_meta_blocks = cpu_to_le32(zmd->nr_meta_blocks); 796 667 sb->nr_reserved_seq = cpu_to_le32(zmd->nr_reserved_seq); 797 668 sb->nr_chunks = cpu_to_le32(zmd->nr_chunks); ··· 816 659 sb->crc = 0; 817 660 sb->crc = cpu_to_le32(crc32_le(sb_gen, (unsigned char *)sb, DMZ_BLOCK_SIZE)); 818 661 819 - ret = dmz_rdwr_block(zmd, REQ_OP_WRITE, block, mblk->page); 662 + ret = dmz_rdwr_block(dev, REQ_OP_WRITE, zmd->sb[set].block, 663 + mblk->page); 820 664 if (ret == 0) 821 - ret = blkdev_issue_flush(zmd->dev->bdev, GFP_NOIO); 665 + ret = blkdev_issue_flush(dev->bdev, GFP_NOIO); 822 666 823 667 return ret; 824 668 } ··· 832 674 unsigned int set) 833 675 { 834 676 struct dmz_mblock *mblk; 677 + struct dmz_dev *dev = zmd->sb[set].dev; 835 678 struct blk_plug plug; 836 679 int ret = 0, nr_mblks_submitted = 0; 837 680 ··· 854 695 TASK_UNINTERRUPTIBLE); 855 696 if (test_bit(DMZ_META_ERROR, &mblk->state)) { 856 697 clear_bit(DMZ_META_ERROR, &mblk->state); 857 - dmz_check_bdev(zmd->dev); 698 + dmz_check_bdev(dev); 858 699 ret = -EIO; 859 700 } 860 701 nr_mblks_submitted--; ··· 862 703 863 704 /* Flush drive cache (this will also sync data) */ 864 705 if (ret == 0) 865 - ret = blkdev_issue_flush(zmd->dev->bdev, GFP_NOIO); 706 + ret = blkdev_issue_flush(dev->bdev, GFP_NOIO); 866 707 867 708 return ret; 868 709 } ··· 899 740 { 900 741 struct dmz_mblock *mblk; 901 742 struct list_head write_list; 743 + struct dmz_dev *dev; 902 744 int ret; 903 745 904 746 if (WARN_ON(!zmd)) ··· 913 753 * from modifying metadata. 914 754 */ 915 755 down_write(&zmd->mblk_sem); 756 + dev = zmd->sb[zmd->mblk_primary].dev; 916 757 917 758 /* 918 759 * This is called from the target flush work and reclaim work. ··· 921 760 */ 922 761 dmz_lock_flush(zmd); 923 762 924 - if (dmz_bdev_is_dying(zmd->dev)) { 763 + if (dmz_bdev_is_dying(dev)) { 925 764 ret = -EIO; 926 765 goto out; 927 766 } ··· 933 772 934 773 /* If there are no dirty metadata blocks, just flush the device cache */ 935 774 if (list_empty(&write_list)) { 936 - ret = blkdev_issue_flush(zmd->dev->bdev, GFP_NOIO); 775 + ret = blkdev_issue_flush(dev->bdev, GFP_NOIO); 937 776 goto err; 938 777 } 939 778 ··· 982 821 list_splice(&write_list, &zmd->mblk_dirty_list); 983 822 spin_unlock(&zmd->mblk_lock); 984 823 } 985 - if (!dmz_check_bdev(zmd->dev)) 824 + if (!dmz_check_bdev(dev)) 986 825 ret = -EIO; 987 826 goto out; 988 827 } ··· 990 829 /* 991 830 * Check super block. 
992 831 */ 993 - static int dmz_check_sb(struct dmz_metadata *zmd, struct dmz_super *sb) 832 + static int dmz_check_sb(struct dmz_metadata *zmd, struct dmz_sb *dsb, 833 + bool tertiary) 994 834 { 835 + struct dmz_super *sb = dsb->sb; 836 + struct dmz_dev *dev = dsb->dev; 995 837 unsigned int nr_meta_zones, nr_data_zones; 996 - struct dmz_dev *dev = zmd->dev; 997 838 u32 crc, stored_crc; 998 - u64 gen; 839 + u64 gen, sb_block; 840 + 841 + if (le32_to_cpu(sb->magic) != DMZ_MAGIC) { 842 + dmz_dev_err(dev, "Invalid meta magic (needed 0x%08x, got 0x%08x)", 843 + DMZ_MAGIC, le32_to_cpu(sb->magic)); 844 + return -ENXIO; 845 + } 846 + 847 + zmd->sb_version = le32_to_cpu(sb->version); 848 + if (zmd->sb_version > DMZ_META_VER) { 849 + dmz_dev_err(dev, "Invalid meta version (needed %d, got %d)", 850 + DMZ_META_VER, zmd->sb_version); 851 + return -EINVAL; 852 + } 853 + if (zmd->sb_version < 2 && tertiary) { 854 + dmz_dev_err(dev, "Tertiary superblocks are not supported"); 855 + return -EINVAL; 856 + } 999 857 1000 858 gen = le64_to_cpu(sb->gen); 1001 859 stored_crc = le32_to_cpu(sb->crc); ··· 1026 846 return -ENXIO; 1027 847 } 1028 848 1029 - if (le32_to_cpu(sb->magic) != DMZ_MAGIC) { 1030 - dmz_dev_err(dev, "Invalid meta magic (needed 0x%08x, got 0x%08x)", 1031 - DMZ_MAGIC, le32_to_cpu(sb->magic)); 1032 - return -ENXIO; 849 + sb_block = le64_to_cpu(sb->sb_block); 850 + if (sb_block != (u64)dsb->zone->id << zmd->zone_nr_blocks_shift ) { 851 + dmz_dev_err(dev, "Invalid superblock position " 852 + "(is %llu expected %llu)", 853 + sb_block, 854 + (u64)dsb->zone->id << zmd->zone_nr_blocks_shift); 855 + return -EINVAL; 856 + } 857 + if (zmd->sb_version > 1) { 858 + uuid_t sb_uuid; 859 + 860 + import_uuid(&sb_uuid, sb->dmz_uuid); 861 + if (uuid_is_null(&sb_uuid)) { 862 + dmz_dev_err(dev, "NULL DM-Zoned uuid"); 863 + return -ENXIO; 864 + } else if (uuid_is_null(&zmd->uuid)) { 865 + uuid_copy(&zmd->uuid, &sb_uuid); 866 + } else if (!uuid_equal(&zmd->uuid, &sb_uuid)) { 867 + dmz_dev_err(dev, "mismatching DM-Zoned uuid, " 868 + "is %pUl expected %pUl", 869 + &sb_uuid, &zmd->uuid); 870 + return -ENXIO; 871 + } 872 + if (!strlen(zmd->label)) 873 + memcpy(zmd->label, sb->dmz_label, BDEVNAME_SIZE); 874 + else if (memcmp(zmd->label, sb->dmz_label, BDEVNAME_SIZE)) { 875 + dmz_dev_err(dev, "mismatching DM-Zoned label, " 876 + "is %s expected %s", 877 + sb->dmz_label, zmd->label); 878 + return -ENXIO; 879 + } 880 + import_uuid(&dev->uuid, sb->dev_uuid); 881 + if (uuid_is_null(&dev->uuid)) { 882 + dmz_dev_err(dev, "NULL device uuid"); 883 + return -ENXIO; 884 + } 885 + 886 + if (tertiary) { 887 + /* 888 + * Generation number should be 0, but it doesn't 889 + * really matter if it isn't. 890 + */ 891 + if (gen != 0) 892 + dmz_dev_warn(dev, "Invalid generation %llu", 893 + gen); 894 + return 0; 895 + } 1033 896 } 1034 897 1035 - if (le32_to_cpu(sb->version) != DMZ_META_VER) { 1036 - dmz_dev_err(dev, "Invalid meta version (needed %d, got %d)", 1037 - DMZ_META_VER, le32_to_cpu(sb->version)); 1038 - return -ENXIO; 1039 - } 1040 - 1041 - nr_meta_zones = (le32_to_cpu(sb->nr_meta_blocks) + dev->zone_nr_blocks - 1) 1042 - >> dev->zone_nr_blocks_shift; 898 + nr_meta_zones = (le32_to_cpu(sb->nr_meta_blocks) + zmd->zone_nr_blocks - 1) 899 + >> zmd->zone_nr_blocks_shift; 1043 900 if (!nr_meta_zones || 1044 901 nr_meta_zones >= zmd->nr_rnd_zones) { 1045 902 dmz_dev_err(dev, "Invalid number of metadata blocks"); ··· 1112 895 /* 1113 896 * Read the first or second super block from disk. 
1114 897 */ 1115 - static int dmz_read_sb(struct dmz_metadata *zmd, unsigned int set) 898 + static int dmz_read_sb(struct dmz_metadata *zmd, struct dmz_sb *sb, int set) 1116 899 { 1117 - return dmz_rdwr_block(zmd, REQ_OP_READ, zmd->sb[set].block, 1118 - zmd->sb[set].mblk->page); 900 + dmz_zmd_debug(zmd, "read superblock set %d dev %s block %llu", 901 + set, sb->dev->name, sb->block); 902 + 903 + return dmz_rdwr_block(sb->dev, REQ_OP_READ, 904 + sb->block, sb->mblk->page); 1119 905 } 1120 906 1121 907 /* ··· 1128 908 */ 1129 909 static int dmz_lookup_secondary_sb(struct dmz_metadata *zmd) 1130 910 { 1131 - unsigned int zone_nr_blocks = zmd->dev->zone_nr_blocks; 911 + unsigned int zone_nr_blocks = zmd->zone_nr_blocks; 1132 912 struct dmz_mblock *mblk; 913 + unsigned int zone_id = zmd->sb[0].zone->id; 1133 914 int i; 1134 915 1135 916 /* Allocate a block */ ··· 1143 922 1144 923 /* Bad first super block: search for the second one */ 1145 924 zmd->sb[1].block = zmd->sb[0].block + zone_nr_blocks; 1146 - for (i = 0; i < zmd->nr_rnd_zones - 1; i++) { 1147 - if (dmz_read_sb(zmd, 1) != 0) 925 + zmd->sb[1].zone = dmz_get(zmd, zone_id + 1); 926 + zmd->sb[1].dev = zmd->sb[0].dev; 927 + for (i = 1; i < zmd->nr_rnd_zones; i++) { 928 + if (dmz_read_sb(zmd, &zmd->sb[1], 1) != 0) 1148 929 break; 1149 930 if (le32_to_cpu(zmd->sb[1].sb->magic) == DMZ_MAGIC) 1150 931 return 0; 1151 932 zmd->sb[1].block += zone_nr_blocks; 933 + zmd->sb[1].zone = dmz_get(zmd, zone_id + i); 1152 934 } 1153 935 1154 936 dmz_free_mblock(zmd, mblk); 1155 937 zmd->sb[1].mblk = NULL; 938 + zmd->sb[1].zone = NULL; 939 + zmd->sb[1].dev = NULL; 1156 940 1157 941 return -EIO; 1158 942 } 1159 943 1160 944 /* 1161 - * Read the first or second super block from disk. 945 + * Read a super block from disk. 
1162 946 */ 1163 - static int dmz_get_sb(struct dmz_metadata *zmd, unsigned int set) 947 + static int dmz_get_sb(struct dmz_metadata *zmd, struct dmz_sb *sb, int set) 1164 948 { 1165 949 struct dmz_mblock *mblk; 1166 950 int ret; ··· 1175 949 if (!mblk) 1176 950 return -ENOMEM; 1177 951 1178 - zmd->sb[set].mblk = mblk; 1179 - zmd->sb[set].sb = mblk->data; 952 + sb->mblk = mblk; 953 + sb->sb = mblk->data; 1180 954 1181 955 /* Read super block */ 1182 - ret = dmz_read_sb(zmd, set); 956 + ret = dmz_read_sb(zmd, sb, set); 1183 957 if (ret) { 1184 958 dmz_free_mblock(zmd, mblk); 1185 - zmd->sb[set].mblk = NULL; 959 + sb->mblk = NULL; 1186 960 return ret; 1187 961 } 1188 962 ··· 1198 972 struct page *page; 1199 973 int i, ret; 1200 974 1201 - dmz_dev_warn(zmd->dev, "Metadata set %u invalid: recovering", dst_set); 975 + dmz_dev_warn(zmd->sb[dst_set].dev, 976 + "Metadata set %u invalid: recovering", dst_set); 1202 977 1203 978 if (dst_set == 0) 1204 - zmd->sb[0].block = dmz_start_block(zmd, zmd->sb_zone); 1205 - else { 1206 - zmd->sb[1].block = zmd->sb[0].block + 1207 - (zmd->nr_meta_zones << zmd->dev->zone_nr_blocks_shift); 1208 - } 979 + zmd->sb[0].block = dmz_start_block(zmd, zmd->sb[0].zone); 980 + else 981 + zmd->sb[1].block = dmz_start_block(zmd, zmd->sb[1].zone); 1209 982 1210 983 page = alloc_page(GFP_NOIO); 1211 984 if (!page) ··· 1212 987 1213 988 /* Copy metadata blocks */ 1214 989 for (i = 1; i < zmd->nr_meta_blocks; i++) { 1215 - ret = dmz_rdwr_block(zmd, REQ_OP_READ, 990 + ret = dmz_rdwr_block(zmd->sb[src_set].dev, REQ_OP_READ, 1216 991 zmd->sb[src_set].block + i, page); 1217 992 if (ret) 1218 993 goto out; 1219 - ret = dmz_rdwr_block(zmd, REQ_OP_WRITE, 994 + ret = dmz_rdwr_block(zmd->sb[dst_set].dev, REQ_OP_WRITE, 1220 995 zmd->sb[dst_set].block + i, page); 1221 996 if (ret) 1222 997 goto out; ··· 1248 1023 u64 sb_gen[2] = {0, 0}; 1249 1024 int ret; 1250 1025 1026 + if (!zmd->sb[0].zone) { 1027 + dmz_zmd_err(zmd, "Primary super block zone not set"); 1028 + return -ENXIO; 1029 + } 1030 + 1251 1031 /* Read and check the primary super block */ 1252 - zmd->sb[0].block = dmz_start_block(zmd, zmd->sb_zone); 1253 - ret = dmz_get_sb(zmd, 0); 1032 + zmd->sb[0].block = dmz_start_block(zmd, zmd->sb[0].zone); 1033 + zmd->sb[0].dev = zmd->sb[0].zone->dev; 1034 + ret = dmz_get_sb(zmd, &zmd->sb[0], 0); 1254 1035 if (ret) { 1255 - dmz_dev_err(zmd->dev, "Read primary super block failed"); 1036 + dmz_dev_err(zmd->sb[0].dev, "Read primary super block failed"); 1256 1037 return ret; 1257 1038 } 1258 1039 1259 - ret = dmz_check_sb(zmd, zmd->sb[0].sb); 1040 + ret = dmz_check_sb(zmd, &zmd->sb[0], false); 1260 1041 1261 1042 /* Read and check secondary super block */ 1262 1043 if (ret == 0) { 1263 1044 sb_good[0] = true; 1264 - zmd->sb[1].block = zmd->sb[0].block + 1265 - (zmd->nr_meta_zones << zmd->dev->zone_nr_blocks_shift); 1266 - ret = dmz_get_sb(zmd, 1); 1045 + if (!zmd->sb[1].zone) { 1046 + unsigned int zone_id = 1047 + zmd->sb[0].zone->id + zmd->nr_meta_zones; 1048 + 1049 + zmd->sb[1].zone = dmz_get(zmd, zone_id); 1050 + } 1051 + zmd->sb[1].block = dmz_start_block(zmd, zmd->sb[1].zone); 1052 + zmd->sb[1].dev = zmd->sb[0].dev; 1053 + ret = dmz_get_sb(zmd, &zmd->sb[1], 1); 1267 1054 } else 1268 1055 ret = dmz_lookup_secondary_sb(zmd); 1269 1056 1270 1057 if (ret) { 1271 - dmz_dev_err(zmd->dev, "Read secondary super block failed"); 1058 + dmz_dev_err(zmd->sb[1].dev, "Read secondary super block failed"); 1272 1059 return ret; 1273 1060 } 1274 1061 1275 - ret = dmz_check_sb(zmd, zmd->sb[1].sb); 1062 + 
ret = dmz_check_sb(zmd, &zmd->sb[1], false); 1276 1063 if (ret == 0) 1277 1064 sb_good[1] = true; 1278 1065 1279 1066 /* Use highest generation sb first */ 1280 1067 if (!sb_good[0] && !sb_good[1]) { 1281 - dmz_dev_err(zmd->dev, "No valid super block found"); 1068 + dmz_zmd_err(zmd, "No valid super block found"); 1282 1069 return -EIO; 1283 1070 } 1284 1071 1285 1072 if (sb_good[0]) 1286 1073 sb_gen[0] = le64_to_cpu(zmd->sb[0].sb->gen); 1287 - else 1074 + else { 1288 1075 ret = dmz_recover_mblocks(zmd, 0); 1076 + if (ret) { 1077 + dmz_dev_err(zmd->sb[0].dev, 1078 + "Recovery of superblock 0 failed"); 1079 + return -EIO; 1080 + } 1081 + } 1289 1082 1290 1083 if (sb_good[1]) 1291 1084 sb_gen[1] = le64_to_cpu(zmd->sb[1].sb->gen); 1292 - else 1085 + else { 1293 1086 ret = dmz_recover_mblocks(zmd, 1); 1294 1087 1295 - if (ret) { 1296 - dmz_dev_err(zmd->dev, "Recovery failed"); 1297 - return -EIO; 1088 + if (ret) { 1089 + dmz_dev_err(zmd->sb[1].dev, 1090 + "Recovery of superblock 1 failed"); 1091 + return -EIO; 1092 + } 1298 1093 } 1299 1094 1300 1095 if (sb_gen[0] >= sb_gen[1]) { ··· 1325 1080 zmd->mblk_primary = 1; 1326 1081 } 1327 1082 1328 - dmz_dev_debug(zmd->dev, "Using super block %u (gen %llu)", 1083 + dmz_dev_debug(zmd->sb[zmd->mblk_primary].dev, 1084 + "Using super block %u (gen %llu)", 1329 1085 zmd->mblk_primary, zmd->sb_gen); 1330 1086 1331 - return 0; 1087 + if (zmd->sb_version > 1) { 1088 + int i; 1089 + struct dmz_sb *sb; 1090 + 1091 + sb = kzalloc(sizeof(struct dmz_sb), GFP_KERNEL); 1092 + if (!sb) 1093 + return -ENOMEM; 1094 + for (i = 1; i < zmd->nr_devs; i++) { 1095 + sb->block = 0; 1096 + sb->zone = dmz_get(zmd, zmd->dev[i].zone_offset); 1097 + sb->dev = &zmd->dev[i]; 1098 + if (!dmz_is_meta(sb->zone)) { 1099 + dmz_dev_err(sb->dev, 1100 + "Tertiary super block zone %u not marked as metadata zone", 1101 + sb->zone->id); 1102 + ret = -EINVAL; 1103 + goto out_kfree; 1104 + } 1105 + ret = dmz_get_sb(zmd, sb, i + 1); 1106 + if (ret) { 1107 + dmz_dev_err(sb->dev, 1108 + "Read tertiary super block failed"); 1109 + dmz_free_mblock(zmd, sb->mblk); 1110 + goto out_kfree; 1111 + } 1112 + ret = dmz_check_sb(zmd, sb, true); 1113 + dmz_free_mblock(zmd, sb->mblk); 1114 + if (ret == -EINVAL) 1115 + goto out_kfree; 1116 + } 1117 + out_kfree: 1118 + kfree(sb); 1119 + } 1120 + return ret; 1332 1121 } 1333 1122 1334 1123 /* 1335 1124 * Initialize a zone descriptor. 
1336 1125 */ 1337 - static int dmz_init_zone(struct blk_zone *blkz, unsigned int idx, void *data) 1126 + static int dmz_init_zone(struct blk_zone *blkz, unsigned int num, void *data) 1338 1127 { 1339 - struct dmz_metadata *zmd = data; 1340 - struct dm_zone *zone = &zmd->zones[idx]; 1341 - struct dmz_dev *dev = zmd->dev; 1128 + struct dmz_dev *dev = data; 1129 + struct dmz_metadata *zmd = dev->metadata; 1130 + int idx = num + dev->zone_offset; 1131 + struct dm_zone *zone; 1342 1132 1343 - /* Ignore the eventual last runt (smaller) zone */ 1344 - if (blkz->len != dev->zone_nr_sectors) { 1345 - if (blkz->start + blkz->len == dev->capacity) 1133 + zone = dmz_insert(zmd, idx, dev); 1134 + if (IS_ERR(zone)) 1135 + return PTR_ERR(zone); 1136 + 1137 + if (blkz->len != zmd->zone_nr_sectors) { 1138 + if (zmd->sb_version > 1) { 1139 + /* Ignore the eventual runt (smaller) zone */ 1140 + set_bit(DMZ_OFFLINE, &zone->flags); 1141 + return 0; 1142 + } else if (blkz->start + blkz->len == dev->capacity) 1346 1143 return 0; 1347 1144 return -ENXIO; 1348 1145 } 1349 - 1350 - INIT_LIST_HEAD(&zone->link); 1351 - atomic_set(&zone->refcount, 0); 1352 - zone->chunk = DMZ_MAP_UNMAPPED; 1353 1146 1354 1147 switch (blkz->type) { 1355 1148 case BLK_ZONE_TYPE_CONVENTIONAL: ··· 1414 1131 zmd->nr_useable_zones++; 1415 1132 if (dmz_is_rnd(zone)) { 1416 1133 zmd->nr_rnd_zones++; 1417 - if (!zmd->sb_zone) { 1418 - /* Super block zone */ 1419 - zmd->sb_zone = zone; 1134 + if (zmd->nr_devs == 1 && !zmd->sb[0].zone) { 1135 + /* Primary super block zone */ 1136 + zmd->sb[0].zone = zone; 1420 1137 } 1421 1138 } 1139 + if (zmd->nr_devs > 1 && num == 0) { 1140 + /* 1141 + * Tertiary superblock zones are always at the 1142 + * start of the zoned devices, so mark them 1143 + * as metadata zone. 
1144 + */ 1145 + set_bit(DMZ_META, &zone->flags); 1146 + } 1422 1147 } 1148 + return 0; 1149 + } 1423 1150 1151 + static int dmz_emulate_zones(struct dmz_metadata *zmd, struct dmz_dev *dev) 1152 + { 1153 + int idx; 1154 + sector_t zone_offset = 0; 1155 + 1156 + for(idx = 0; idx < dev->nr_zones; idx++) { 1157 + struct dm_zone *zone; 1158 + 1159 + zone = dmz_insert(zmd, idx, dev); 1160 + if (IS_ERR(zone)) 1161 + return PTR_ERR(zone); 1162 + set_bit(DMZ_CACHE, &zone->flags); 1163 + zone->wp_block = 0; 1164 + zmd->nr_cache_zones++; 1165 + zmd->nr_useable_zones++; 1166 + if (dev->capacity - zone_offset < zmd->zone_nr_sectors) { 1167 + /* Disable runt zone */ 1168 + set_bit(DMZ_OFFLINE, &zone->flags); 1169 + break; 1170 + } 1171 + zone_offset += zmd->zone_nr_sectors; 1172 + } 1424 1173 return 0; 1425 1174 } 1426 1175 ··· 1461 1146 */ 1462 1147 static void dmz_drop_zones(struct dmz_metadata *zmd) 1463 1148 { 1464 - kfree(zmd->zones); 1465 - zmd->zones = NULL; 1149 + int idx; 1150 + 1151 + for(idx = 0; idx < zmd->nr_zones; idx++) { 1152 + struct dm_zone *zone = xa_load(&zmd->zones, idx); 1153 + 1154 + kfree(zone); 1155 + xa_erase(&zmd->zones, idx); 1156 + } 1157 + xa_destroy(&zmd->zones); 1466 1158 } 1467 1159 1468 1160 /* ··· 1478 1156 */ 1479 1157 static int dmz_init_zones(struct dmz_metadata *zmd) 1480 1158 { 1481 - struct dmz_dev *dev = zmd->dev; 1482 - int ret; 1159 + int i, ret; 1160 + struct dmz_dev *zoned_dev = &zmd->dev[0]; 1483 1161 1484 1162 /* Init */ 1485 - zmd->zone_bitmap_size = dev->zone_nr_blocks >> 3; 1163 + zmd->zone_nr_sectors = zmd->dev[0].zone_nr_sectors; 1164 + zmd->zone_nr_sectors_shift = ilog2(zmd->zone_nr_sectors); 1165 + zmd->zone_nr_blocks = dmz_sect2blk(zmd->zone_nr_sectors); 1166 + zmd->zone_nr_blocks_shift = ilog2(zmd->zone_nr_blocks); 1167 + zmd->zone_bitmap_size = zmd->zone_nr_blocks >> 3; 1486 1168 zmd->zone_nr_bitmap_blocks = 1487 1169 max_t(sector_t, 1, zmd->zone_bitmap_size >> DMZ_BLOCK_SHIFT); 1488 - zmd->zone_bits_per_mblk = min_t(sector_t, dev->zone_nr_blocks, 1170 + zmd->zone_bits_per_mblk = min_t(sector_t, zmd->zone_nr_blocks, 1489 1171 DMZ_BLOCK_SIZE_BITS); 1490 1172 1491 1173 /* Allocate zone array */ 1492 - zmd->zones = kcalloc(dev->nr_zones, sizeof(struct dm_zone), GFP_KERNEL); 1493 - if (!zmd->zones) 1494 - return -ENOMEM; 1174 + zmd->nr_zones = 0; 1175 + for (i = 0; i < zmd->nr_devs; i++) { 1176 + struct dmz_dev *dev = &zmd->dev[i]; 1495 1177 1496 - dmz_dev_info(dev, "Using %zu B for zone information", 1497 - sizeof(struct dm_zone) * dev->nr_zones); 1178 + dev->metadata = zmd; 1179 + zmd->nr_zones += dev->nr_zones; 1180 + 1181 + atomic_set(&dev->unmap_nr_rnd, 0); 1182 + INIT_LIST_HEAD(&dev->unmap_rnd_list); 1183 + INIT_LIST_HEAD(&dev->map_rnd_list); 1184 + 1185 + atomic_set(&dev->unmap_nr_seq, 0); 1186 + INIT_LIST_HEAD(&dev->unmap_seq_list); 1187 + INIT_LIST_HEAD(&dev->map_seq_list); 1188 + } 1189 + 1190 + if (!zmd->nr_zones) { 1191 + DMERR("(%s): No zones found", zmd->devname); 1192 + return -ENXIO; 1193 + } 1194 + xa_init(&zmd->zones); 1195 + 1196 + DMDEBUG("(%s): Using %zu B for zone information", 1197 + zmd->devname, sizeof(struct dm_zone) * zmd->nr_zones); 1198 + 1199 + if (zmd->nr_devs > 1) { 1200 + ret = dmz_emulate_zones(zmd, &zmd->dev[0]); 1201 + if (ret < 0) { 1202 + DMDEBUG("(%s): Failed to emulate zones, error %d", 1203 + zmd->devname, ret); 1204 + dmz_drop_zones(zmd); 1205 + return ret; 1206 + } 1207 + 1208 + /* 1209 + * Primary superblock zone is always at zone 0 when multiple 1210 + * drives are present. 
1211 + */ 1212 + zmd->sb[0].zone = dmz_get(zmd, 0); 1213 + 1214 + for (i = 1; i < zmd->nr_devs; i++) { 1215 + zoned_dev = &zmd->dev[i]; 1216 + 1217 + ret = blkdev_report_zones(zoned_dev->bdev, 0, 1218 + BLK_ALL_ZONES, 1219 + dmz_init_zone, zoned_dev); 1220 + if (ret < 0) { 1221 + DMDEBUG("(%s): Failed to report zones, error %d", 1222 + zmd->devname, ret); 1223 + dmz_drop_zones(zmd); 1224 + return ret; 1225 + } 1226 + } 1227 + return 0; 1228 + } 1498 1229 1499 1230 /* 1500 1231 * Get zone information and initialize zone descriptors. At the same 1501 1232 * time, determine where the super block should be: first block of the 1502 1233 * first randomly writable zone. 1503 1234 */ 1504 - ret = blkdev_report_zones(dev->bdev, 0, BLK_ALL_ZONES, dmz_init_zone, 1505 - zmd); 1235 + ret = blkdev_report_zones(zoned_dev->bdev, 0, BLK_ALL_ZONES, 1236 + dmz_init_zone, zoned_dev); 1506 1237 if (ret < 0) { 1238 + DMDEBUG("(%s): Failed to report zones, error %d", 1239 + zmd->devname, ret); 1507 1240 dmz_drop_zones(zmd); 1508 1241 return ret; 1509 1242 } ··· 1590 1213 */ 1591 1214 static int dmz_update_zone(struct dmz_metadata *zmd, struct dm_zone *zone) 1592 1215 { 1216 + struct dmz_dev *dev = zone->dev; 1593 1217 unsigned int noio_flag; 1594 1218 int ret; 1219 + 1220 + if (dev->flags & DMZ_BDEV_REGULAR) 1221 + return 0; 1595 1222 1596 1223 /* 1597 1224 * Get zone information from disk. Since blkdev_report_zones() uses ··· 1604 1223 * GFP_NOIO was specified. 1605 1224 */ 1606 1225 noio_flag = memalloc_noio_save(); 1607 - ret = blkdev_report_zones(zmd->dev->bdev, dmz_start_sect(zmd, zone), 1, 1226 + ret = blkdev_report_zones(dev->bdev, dmz_start_sect(zmd, zone), 1, 1608 1227 dmz_update_zone_cb, zone); 1609 1228 memalloc_noio_restore(noio_flag); 1610 1229 1611 1230 if (ret == 0) 1612 1231 ret = -EIO; 1613 1232 if (ret < 0) { 1614 - dmz_dev_err(zmd->dev, "Get zone %u report failed", 1615 - dmz_id(zmd, zone)); 1616 - dmz_check_bdev(zmd->dev); 1233 + dmz_dev_err(dev, "Get zone %u report failed", 1234 + zone->id); 1235 + dmz_check_bdev(dev); 1617 1236 return ret; 1618 1237 } 1619 1238 ··· 1627 1246 static int dmz_handle_seq_write_err(struct dmz_metadata *zmd, 1628 1247 struct dm_zone *zone) 1629 1248 { 1249 + struct dmz_dev *dev = zone->dev; 1630 1250 unsigned int wp = 0; 1631 1251 int ret; 1632 1252 ··· 1636 1254 if (ret) 1637 1255 return ret; 1638 1256 1639 - dmz_dev_warn(zmd->dev, "Processing zone %u write error (zone wp %u/%u)", 1640 - dmz_id(zmd, zone), zone->wp_block, wp); 1257 + dmz_dev_warn(dev, "Processing zone %u write error (zone wp %u/%u)", 1258 + zone->id, zone->wp_block, wp); 1641 1259 1642 1260 if (zone->wp_block < wp) { 1643 1261 dmz_invalidate_blocks(zmd, zone, zone->wp_block, ··· 1645 1263 } 1646 1264 1647 1265 return 0; 1648 - } 1649 - 1650 - static struct dm_zone *dmz_get(struct dmz_metadata *zmd, unsigned int zone_id) 1651 - { 1652 - return &zmd->zones[zone_id]; 1653 1266 } 1654 1267 1655 1268 /* ··· 1664 1287 return 0; 1665 1288 1666 1289 if (!dmz_is_empty(zone) || dmz_seq_write_err(zone)) { 1667 - struct dmz_dev *dev = zmd->dev; 1290 + struct dmz_dev *dev = zone->dev; 1668 1291 1669 1292 ret = blkdev_zone_mgmt(dev->bdev, REQ_OP_ZONE_RESET, 1670 1293 dmz_start_sect(zmd, zone), 1671 - dev->zone_nr_sectors, GFP_NOIO); 1294 + zmd->zone_nr_sectors, GFP_NOIO); 1672 1295 if (ret) { 1673 1296 dmz_dev_err(dev, "Reset zone %u failed %d", 1674 - dmz_id(zmd, zone), ret); 1297 + zone->id, ret); 1675 1298 return ret; 1676 1299 } 1677 1300 } ··· 1690 1313 */ 1691 1314 static int dmz_load_mapping(struct 
dmz_metadata *zmd) 1692 1315 { 1693 - struct dmz_dev *dev = zmd->dev; 1694 1316 struct dm_zone *dzone, *bzone; 1695 1317 struct dmz_mblock *dmap_mblk = NULL; 1696 1318 struct dmz_map *dmap; ··· 1721 1345 if (dzone_id == DMZ_MAP_UNMAPPED) 1722 1346 goto next; 1723 1347 1724 - if (dzone_id >= dev->nr_zones) { 1725 - dmz_dev_err(dev, "Chunk %u mapping: invalid data zone ID %u", 1348 + if (dzone_id >= zmd->nr_zones) { 1349 + dmz_zmd_err(zmd, "Chunk %u mapping: invalid data zone ID %u", 1726 1350 chunk, dzone_id); 1727 1351 return -EIO; 1728 1352 } 1729 1353 1730 1354 dzone = dmz_get(zmd, dzone_id); 1355 + if (!dzone) { 1356 + dmz_zmd_err(zmd, "Chunk %u mapping: data zone %u not present", 1357 + chunk, dzone_id); 1358 + return -EIO; 1359 + } 1731 1360 set_bit(DMZ_DATA, &dzone->flags); 1732 1361 dzone->chunk = chunk; 1733 1362 dmz_get_zone_weight(zmd, dzone); 1734 1363 1735 - if (dmz_is_rnd(dzone)) 1736 - list_add_tail(&dzone->link, &zmd->map_rnd_list); 1364 + if (dmz_is_cache(dzone)) 1365 + list_add_tail(&dzone->link, &zmd->map_cache_list); 1366 + else if (dmz_is_rnd(dzone)) 1367 + list_add_tail(&dzone->link, &dzone->dev->map_rnd_list); 1737 1368 else 1738 - list_add_tail(&dzone->link, &zmd->map_seq_list); 1369 + list_add_tail(&dzone->link, &dzone->dev->map_seq_list); 1739 1370 1740 1371 /* Check buffer zone */ 1741 1372 bzone_id = le32_to_cpu(dmap[e].bzone_id); 1742 1373 if (bzone_id == DMZ_MAP_UNMAPPED) 1743 1374 goto next; 1744 1375 1745 - if (bzone_id >= dev->nr_zones) { 1746 - dmz_dev_err(dev, "Chunk %u mapping: invalid buffer zone ID %u", 1376 + if (bzone_id >= zmd->nr_zones) { 1377 + dmz_zmd_err(zmd, "Chunk %u mapping: invalid buffer zone ID %u", 1747 1378 chunk, bzone_id); 1748 1379 return -EIO; 1749 1380 } 1750 1381 1751 1382 bzone = dmz_get(zmd, bzone_id); 1752 - if (!dmz_is_rnd(bzone)) { 1753 - dmz_dev_err(dev, "Chunk %u mapping: invalid buffer zone %u", 1383 + if (!bzone) { 1384 + dmz_zmd_err(zmd, "Chunk %u mapping: buffer zone %u not present", 1385 + chunk, bzone_id); 1386 + return -EIO; 1387 + } 1388 + if (!dmz_is_rnd(bzone) && !dmz_is_cache(bzone)) { 1389 + dmz_zmd_err(zmd, "Chunk %u mapping: invalid buffer zone %u", 1754 1390 chunk, bzone_id); 1755 1391 return -EIO; 1756 1392 } ··· 1773 1385 bzone->bzone = dzone; 1774 1386 dzone->bzone = bzone; 1775 1387 dmz_get_zone_weight(zmd, bzone); 1776 - list_add_tail(&bzone->link, &zmd->map_rnd_list); 1388 + if (dmz_is_cache(bzone)) 1389 + list_add_tail(&bzone->link, &zmd->map_cache_list); 1390 + else 1391 + list_add_tail(&bzone->link, &bzone->dev->map_rnd_list); 1777 1392 next: 1778 1393 chunk++; 1779 1394 e++; ··· 1789 1398 * fully initialized. All remaining zones are unmapped data 1790 1399 * zones. Finish initializing those here. 
1791 1400 */ 1792 - for (i = 0; i < dev->nr_zones; i++) { 1401 + for (i = 0; i < zmd->nr_zones; i++) { 1793 1402 dzone = dmz_get(zmd, i); 1403 + if (!dzone) 1404 + continue; 1794 1405 if (dmz_is_meta(dzone)) 1795 1406 continue; 1407 + if (dmz_is_offline(dzone)) 1408 + continue; 1796 1409 1797 - if (dmz_is_rnd(dzone)) 1798 - zmd->nr_rnd++; 1410 + if (dmz_is_cache(dzone)) 1411 + zmd->nr_cache++; 1412 + else if (dmz_is_rnd(dzone)) 1413 + dzone->dev->nr_rnd++; 1799 1414 else 1800 - zmd->nr_seq++; 1415 + dzone->dev->nr_seq++; 1801 1416 1802 1417 if (dmz_is_data(dzone)) { 1803 1418 /* Already initialized */ ··· 1813 1416 /* Unmapped data zone */ 1814 1417 set_bit(DMZ_DATA, &dzone->flags); 1815 1418 dzone->chunk = DMZ_MAP_UNMAPPED; 1816 - if (dmz_is_rnd(dzone)) { 1817 - list_add_tail(&dzone->link, &zmd->unmap_rnd_list); 1818 - atomic_inc(&zmd->unmap_nr_rnd); 1419 + if (dmz_is_cache(dzone)) { 1420 + list_add_tail(&dzone->link, &zmd->unmap_cache_list); 1421 + atomic_inc(&zmd->unmap_nr_cache); 1422 + } else if (dmz_is_rnd(dzone)) { 1423 + list_add_tail(&dzone->link, 1424 + &dzone->dev->unmap_rnd_list); 1425 + atomic_inc(&dzone->dev->unmap_nr_rnd); 1819 1426 } else if (atomic_read(&zmd->nr_reserved_seq_zones) < zmd->nr_reserved_seq) { 1820 1427 list_add_tail(&dzone->link, &zmd->reserved_seq_zones_list); 1428 + set_bit(DMZ_RESERVED, &dzone->flags); 1821 1429 atomic_inc(&zmd->nr_reserved_seq_zones); 1822 - zmd->nr_seq--; 1430 + dzone->dev->nr_seq--; 1823 1431 } else { 1824 - list_add_tail(&dzone->link, &zmd->unmap_seq_list); 1825 - atomic_inc(&zmd->unmap_nr_seq); 1432 + list_add_tail(&dzone->link, 1433 + &dzone->dev->unmap_seq_list); 1434 + atomic_inc(&dzone->dev->unmap_nr_seq); 1826 1435 } 1827 1436 } 1828 1437 ··· 1862 1459 list_del_init(&zone->link); 1863 1460 if (dmz_is_seq(zone)) { 1864 1461 /* LRU rotate sequential zone */ 1865 - list_add_tail(&zone->link, &zmd->map_seq_list); 1462 + list_add_tail(&zone->link, &zone->dev->map_seq_list); 1463 + } else if (dmz_is_cache(zone)) { 1464 + /* LRU rotate cache zone */ 1465 + list_add_tail(&zone->link, &zmd->map_cache_list); 1866 1466 } else { 1867 1467 /* LRU rotate random zone */ 1868 - list_add_tail(&zone->link, &zmd->map_rnd_list); 1468 + list_add_tail(&zone->link, &zone->dev->map_rnd_list); 1869 1469 } 1870 1470 } 1871 1471 ··· 1935 1529 { 1936 1530 dmz_unlock_map(zmd); 1937 1531 dmz_unlock_metadata(zmd); 1532 + set_bit(DMZ_RECLAIM_TERMINATE, &zone->flags); 1938 1533 wait_on_bit_timeout(&zone->flags, DMZ_RECLAIM, TASK_UNINTERRUPTIBLE, HZ); 1534 + clear_bit(DMZ_RECLAIM_TERMINATE, &zone->flags); 1939 1535 dmz_lock_metadata(zmd); 1940 1536 dmz_lock_map(zmd); 1941 1537 } 1942 1538 1943 1539 /* 1944 - * Select a random write zone for reclaim. 1540 + * Select a cache or random write zone for reclaim. 
1945 1541 */ 1946 - static struct dm_zone *dmz_get_rnd_zone_for_reclaim(struct dmz_metadata *zmd) 1542 + static struct dm_zone *dmz_get_rnd_zone_for_reclaim(struct dmz_metadata *zmd, 1543 + unsigned int idx, bool idle) 1947 1544 { 1948 1545 struct dm_zone *dzone = NULL; 1949 - struct dm_zone *zone; 1546 + struct dm_zone *zone, *last = NULL; 1547 + struct list_head *zone_list; 1950 1548 1951 - if (list_empty(&zmd->map_rnd_list)) 1952 - return ERR_PTR(-EBUSY); 1549 + /* If we have cache zones select from the cache zone list */ 1550 + if (zmd->nr_cache) { 1551 + zone_list = &zmd->map_cache_list; 1552 + /* Try to relaim random zones, too, when idle */ 1553 + if (idle && list_empty(zone_list)) 1554 + zone_list = &zmd->dev[idx].map_rnd_list; 1555 + } else 1556 + zone_list = &zmd->dev[idx].map_rnd_list; 1953 1557 1954 - list_for_each_entry(zone, &zmd->map_rnd_list, link) { 1955 - if (dmz_is_buf(zone)) 1558 + list_for_each_entry(zone, zone_list, link) { 1559 + if (dmz_is_buf(zone)) { 1956 1560 dzone = zone->bzone; 1957 - else 1561 + if (dzone->dev->dev_idx != idx) 1562 + continue; 1563 + if (!last) { 1564 + last = dzone; 1565 + continue; 1566 + } 1567 + if (last->weight < dzone->weight) 1568 + continue; 1569 + dzone = last; 1570 + } else 1958 1571 dzone = zone; 1959 1572 if (dmz_lock_zone_reclaim(dzone)) 1960 1573 return dzone; 1961 1574 } 1962 1575 1963 - return ERR_PTR(-EBUSY); 1576 + return NULL; 1964 1577 } 1965 1578 1966 1579 /* 1967 1580 * Select a buffered sequential zone for reclaim. 1968 1581 */ 1969 - static struct dm_zone *dmz_get_seq_zone_for_reclaim(struct dmz_metadata *zmd) 1582 + static struct dm_zone *dmz_get_seq_zone_for_reclaim(struct dmz_metadata *zmd, 1583 + unsigned int idx) 1970 1584 { 1971 1585 struct dm_zone *zone; 1972 1586 1973 - if (list_empty(&zmd->map_seq_list)) 1974 - return ERR_PTR(-EBUSY); 1975 - 1976 - list_for_each_entry(zone, &zmd->map_seq_list, link) { 1587 + list_for_each_entry(zone, &zmd->dev[idx].map_seq_list, link) { 1977 1588 if (!zone->bzone) 1978 1589 continue; 1979 1590 if (dmz_lock_zone_reclaim(zone)) 1980 1591 return zone; 1981 1592 } 1982 1593 1983 - return ERR_PTR(-EBUSY); 1594 + return NULL; 1984 1595 } 1985 1596 1986 1597 /* 1987 1598 * Select a zone for reclaim. 1988 1599 */ 1989 - struct dm_zone *dmz_get_zone_for_reclaim(struct dmz_metadata *zmd) 1600 + struct dm_zone *dmz_get_zone_for_reclaim(struct dmz_metadata *zmd, 1601 + unsigned int dev_idx, bool idle) 1990 1602 { 1991 1603 struct dm_zone *zone; 1992 1604 ··· 2018 1594 */ 2019 1595 dmz_lock_map(zmd); 2020 1596 if (list_empty(&zmd->reserved_seq_zones_list)) 2021 - zone = dmz_get_seq_zone_for_reclaim(zmd); 1597 + zone = dmz_get_seq_zone_for_reclaim(zmd, dev_idx); 2022 1598 else 2023 - zone = dmz_get_rnd_zone_for_reclaim(zmd); 1599 + zone = dmz_get_rnd_zone_for_reclaim(zmd, dev_idx, idle); 2024 1600 dmz_unlock_map(zmd); 2025 1601 2026 1602 return zone; ··· 2040 1616 unsigned int dzone_id; 2041 1617 struct dm_zone *dzone = NULL; 2042 1618 int ret = 0; 1619 + int alloc_flags = zmd->nr_cache ? 
DMZ_ALLOC_CACHE : DMZ_ALLOC_RND; 2043 1620 2044 1621 dmz_lock_map(zmd); 2045 1622 again: ··· 2055 1630 goto out; 2056 1631 2057 1632 /* Allocate a random zone */ 2058 - dzone = dmz_alloc_zone(zmd, DMZ_ALLOC_RND); 1633 + dzone = dmz_alloc_zone(zmd, 0, alloc_flags); 2059 1634 if (!dzone) { 2060 - if (dmz_bdev_is_dying(zmd->dev)) { 1635 + if (dmz_dev_is_dying(zmd)) { 2061 1636 dzone = ERR_PTR(-EIO); 2062 1637 goto out; 2063 1638 } ··· 2070 1645 } else { 2071 1646 /* The chunk is already mapped: get the mapping zone */ 2072 1647 dzone = dmz_get(zmd, dzone_id); 1648 + if (!dzone) { 1649 + dzone = ERR_PTR(-EIO); 1650 + goto out; 1651 + } 2073 1652 if (dzone->chunk != chunk) { 2074 1653 dzone = ERR_PTR(-EIO); 2075 1654 goto out; ··· 2152 1723 struct dm_zone *dzone) 2153 1724 { 2154 1725 struct dm_zone *bzone; 1726 + int alloc_flags = zmd->nr_cache ? DMZ_ALLOC_CACHE : DMZ_ALLOC_RND; 2155 1727 2156 1728 dmz_lock_map(zmd); 2157 1729 again: ··· 2161 1731 goto out; 2162 1732 2163 1733 /* Allocate a random zone */ 2164 - bzone = dmz_alloc_zone(zmd, DMZ_ALLOC_RND); 1734 + bzone = dmz_alloc_zone(zmd, 0, alloc_flags); 2165 1735 if (!bzone) { 2166 - if (dmz_bdev_is_dying(zmd->dev)) { 1736 + if (dmz_dev_is_dying(zmd)) { 2167 1737 bzone = ERR_PTR(-EIO); 2168 1738 goto out; 2169 1739 } ··· 2172 1742 } 2173 1743 2174 1744 /* Update the chunk mapping */ 2175 - dmz_set_chunk_mapping(zmd, dzone->chunk, dmz_id(zmd, dzone), 2176 - dmz_id(zmd, bzone)); 1745 + dmz_set_chunk_mapping(zmd, dzone->chunk, dzone->id, bzone->id); 2177 1746 2178 1747 set_bit(DMZ_BUF, &bzone->flags); 2179 1748 bzone->chunk = dzone->chunk; 2180 1749 bzone->bzone = dzone; 2181 1750 dzone->bzone = bzone; 2182 - list_add_tail(&bzone->link, &zmd->map_rnd_list); 1751 + if (dmz_is_cache(bzone)) 1752 + list_add_tail(&bzone->link, &zmd->map_cache_list); 1753 + else 1754 + list_add_tail(&bzone->link, &bzone->dev->map_rnd_list); 2183 1755 out: 2184 1756 dmz_unlock_map(zmd); 2185 1757 ··· 2192 1760 * Get an unmapped (free) zone. 2193 1761 * This must be called with the mapping lock held. 2194 1762 */ 2195 - struct dm_zone *dmz_alloc_zone(struct dmz_metadata *zmd, unsigned long flags) 1763 + struct dm_zone *dmz_alloc_zone(struct dmz_metadata *zmd, unsigned int dev_idx, 1764 + unsigned long flags) 2196 1765 { 2197 1766 struct list_head *list; 2198 1767 struct dm_zone *zone; 1768 + int i = 0; 2199 1769 2200 - if (flags & DMZ_ALLOC_RND) 2201 - list = &zmd->unmap_rnd_list; 2202 - else 2203 - list = &zmd->unmap_seq_list; 2204 1770 again: 1771 + if (flags & DMZ_ALLOC_CACHE) 1772 + list = &zmd->unmap_cache_list; 1773 + else if (flags & DMZ_ALLOC_RND) 1774 + list = &zmd->dev[dev_idx].unmap_rnd_list; 1775 + else 1776 + list = &zmd->dev[dev_idx].unmap_seq_list; 1777 + 2205 1778 if (list_empty(list)) { 2206 1779 /* 2207 - * No free zone: if this is for reclaim, allow using the 2208 - * reserved sequential zones. 1780 + * No free zone: return NULL if this is for not reclaim. 
2209 1781 */ 2210 - if (!(flags & DMZ_ALLOC_RECLAIM) || 2211 - list_empty(&zmd->reserved_seq_zones_list)) 1782 + if (!(flags & DMZ_ALLOC_RECLAIM)) 2212 1783 return NULL; 1784 + /* 1785 + * Try to allocate from other devices 1786 + */ 1787 + if (i < zmd->nr_devs) { 1788 + dev_idx = (dev_idx + 1) % zmd->nr_devs; 1789 + i++; 1790 + goto again; 1791 + } 2213 1792 2214 - zone = list_first_entry(&zmd->reserved_seq_zones_list, 2215 - struct dm_zone, link); 2216 - list_del_init(&zone->link); 2217 - atomic_dec(&zmd->nr_reserved_seq_zones); 1793 + /* 1794 + * Fallback to the reserved sequential zones 1795 + */ 1796 + zone = list_first_entry_or_null(&zmd->reserved_seq_zones_list, 1797 + struct dm_zone, link); 1798 + if (zone) { 1799 + list_del_init(&zone->link); 1800 + atomic_dec(&zmd->nr_reserved_seq_zones); 1801 + } 2218 1802 return zone; 2219 1803 } 2220 1804 2221 1805 zone = list_first_entry(list, struct dm_zone, link); 2222 1806 list_del_init(&zone->link); 2223 1807 2224 - if (dmz_is_rnd(zone)) 2225 - atomic_dec(&zmd->unmap_nr_rnd); 1808 + if (dmz_is_cache(zone)) 1809 + atomic_dec(&zmd->unmap_nr_cache); 1810 + else if (dmz_is_rnd(zone)) 1811 + atomic_dec(&zone->dev->unmap_nr_rnd); 2226 1812 else 2227 - atomic_dec(&zmd->unmap_nr_seq); 1813 + atomic_dec(&zone->dev->unmap_nr_seq); 2228 1814 2229 1815 if (dmz_is_offline(zone)) { 2230 - dmz_dev_warn(zmd->dev, "Zone %u is offline", dmz_id(zmd, zone)); 1816 + dmz_zmd_warn(zmd, "Zone %u is offline", zone->id); 2231 1817 zone = NULL; 2232 1818 goto again; 2233 1819 } 2234 - 1820 + if (dmz_is_meta(zone)) { 1821 + dmz_zmd_warn(zmd, "Zone %u has metadata", zone->id); 1822 + zone = NULL; 1823 + goto again; 1824 + } 2235 1825 return zone; 2236 1826 } 2237 1827 ··· 2268 1814 dmz_reset_zone(zmd, zone); 2269 1815 2270 1816 /* Return the zone to its type unmap list */ 2271 - if (dmz_is_rnd(zone)) { 2272 - list_add_tail(&zone->link, &zmd->unmap_rnd_list); 2273 - atomic_inc(&zmd->unmap_nr_rnd); 2274 - } else if (atomic_read(&zmd->nr_reserved_seq_zones) < 2275 - zmd->nr_reserved_seq) { 1817 + if (dmz_is_cache(zone)) { 1818 + list_add_tail(&zone->link, &zmd->unmap_cache_list); 1819 + atomic_inc(&zmd->unmap_nr_cache); 1820 + } else if (dmz_is_rnd(zone)) { 1821 + list_add_tail(&zone->link, &zone->dev->unmap_rnd_list); 1822 + atomic_inc(&zone->dev->unmap_nr_rnd); 1823 + } else if (dmz_is_reserved(zone)) { 2276 1824 list_add_tail(&zone->link, &zmd->reserved_seq_zones_list); 2277 1825 atomic_inc(&zmd->nr_reserved_seq_zones); 2278 1826 } else { 2279 - list_add_tail(&zone->link, &zmd->unmap_seq_list); 2280 - atomic_inc(&zmd->unmap_nr_seq); 1827 + list_add_tail(&zone->link, &zone->dev->unmap_seq_list); 1828 + atomic_inc(&zone->dev->unmap_nr_seq); 2281 1829 } 2282 1830 2283 1831 wake_up_all(&zmd->free_wq); ··· 2293 1837 unsigned int chunk) 2294 1838 { 2295 1839 /* Set the chunk mapping */ 2296 - dmz_set_chunk_mapping(zmd, chunk, dmz_id(zmd, dzone), 1840 + dmz_set_chunk_mapping(zmd, chunk, dzone->id, 2297 1841 DMZ_MAP_UNMAPPED); 2298 1842 dzone->chunk = chunk; 2299 - if (dmz_is_rnd(dzone)) 2300 - list_add_tail(&dzone->link, &zmd->map_rnd_list); 1843 + if (dmz_is_cache(dzone)) 1844 + list_add_tail(&dzone->link, &zmd->map_cache_list); 1845 + else if (dmz_is_rnd(dzone)) 1846 + list_add_tail(&dzone->link, &dzone->dev->map_rnd_list); 2301 1847 else 2302 - list_add_tail(&dzone->link, &zmd->map_seq_list); 1848 + list_add_tail(&dzone->link, &dzone->dev->map_seq_list); 2303 1849 } 2304 1850 2305 1851 /* ··· 2323 1865 * Unmapping the chunk buffer zone: clear only 2324 1866 * the chunk 
buffer mapping 2325 1867 */ 2326 - dzone_id = dmz_id(zmd, zone->bzone); 1868 + dzone_id = zone->bzone->id; 2327 1869 zone->bzone->bzone = NULL; 2328 1870 zone->bzone = NULL; 2329 1871 ··· 2385 1927 sector_t chunk_block) 2386 1928 { 2387 1929 sector_t bitmap_block = 1 + zmd->nr_map_blocks + 2388 - (sector_t)(dmz_id(zmd, zone) * zmd->zone_nr_bitmap_blocks) + 1930 + (sector_t)(zone->id * zmd->zone_nr_bitmap_blocks) + 2389 1931 (chunk_block >> DMZ_BLOCK_SHIFT_BITS); 2390 1932 2391 1933 return dmz_get_mblock(zmd, bitmap_block); ··· 2401 1943 sector_t chunk_block = 0; 2402 1944 2403 1945 /* Get the zones bitmap blocks */ 2404 - while (chunk_block < zmd->dev->zone_nr_blocks) { 1946 + while (chunk_block < zmd->zone_nr_blocks) { 2405 1947 from_mblk = dmz_get_bitmap(zmd, from_zone, chunk_block); 2406 1948 if (IS_ERR(from_mblk)) 2407 1949 return PTR_ERR(from_mblk); ··· 2436 1978 int ret; 2437 1979 2438 1980 /* Get the zones bitmap blocks */ 2439 - while (chunk_block < zmd->dev->zone_nr_blocks) { 1981 + while (chunk_block < zmd->zone_nr_blocks) { 2440 1982 /* Get a valid region from the source zone */ 2441 1983 ret = dmz_first_valid_block(zmd, from_zone, &chunk_block); 2442 1984 if (ret <= 0) ··· 2460 2002 sector_t chunk_block, unsigned int nr_blocks) 2461 2003 { 2462 2004 unsigned int count, bit, nr_bits; 2463 - unsigned int zone_nr_blocks = zmd->dev->zone_nr_blocks; 2005 + unsigned int zone_nr_blocks = zmd->zone_nr_blocks; 2464 2006 struct dmz_mblock *mblk; 2465 2007 unsigned int n = 0; 2466 2008 2467 - dmz_dev_debug(zmd->dev, "=> VALIDATE zone %u, block %llu, %u blocks", 2468 - dmz_id(zmd, zone), (unsigned long long)chunk_block, 2009 + dmz_zmd_debug(zmd, "=> VALIDATE zone %u, block %llu, %u blocks", 2010 + zone->id, (unsigned long long)chunk_block, 2469 2011 nr_blocks); 2470 2012 2471 2013 WARN_ON(chunk_block + nr_blocks > zone_nr_blocks); ··· 2494 2036 if (likely(zone->weight + n <= zone_nr_blocks)) 2495 2037 zone->weight += n; 2496 2038 else { 2497 - dmz_dev_warn(zmd->dev, "Zone %u: weight %u should be <= %u", 2498 - dmz_id(zmd, zone), zone->weight, 2039 + dmz_zmd_warn(zmd, "Zone %u: weight %u should be <= %u", 2040 + zone->id, zone->weight, 2499 2041 zone_nr_blocks - n); 2500 2042 zone->weight = zone_nr_blocks; 2501 2043 } ··· 2544 2086 struct dmz_mblock *mblk; 2545 2087 unsigned int n = 0; 2546 2088 2547 - dmz_dev_debug(zmd->dev, "=> INVALIDATE zone %u, block %llu, %u blocks", 2548 - dmz_id(zmd, zone), (u64)chunk_block, nr_blocks); 2089 + dmz_zmd_debug(zmd, "=> INVALIDATE zone %u, block %llu, %u blocks", 2090 + zone->id, (u64)chunk_block, nr_blocks); 2549 2091 2550 - WARN_ON(chunk_block + nr_blocks > zmd->dev->zone_nr_blocks); 2092 + WARN_ON(chunk_block + nr_blocks > zmd->zone_nr_blocks); 2551 2093 2552 2094 while (nr_blocks) { 2553 2095 /* Get bitmap block */ ··· 2574 2116 if (zone->weight >= n) 2575 2117 zone->weight -= n; 2576 2118 else { 2577 - dmz_dev_warn(zmd->dev, "Zone %u: weight %u should be >= %u", 2578 - dmz_id(zmd, zone), zone->weight, n); 2119 + dmz_zmd_warn(zmd, "Zone %u: weight %u should be >= %u", 2120 + zone->id, zone->weight, n); 2579 2121 zone->weight = 0; 2580 2122 } 2581 2123 ··· 2591 2133 struct dmz_mblock *mblk; 2592 2134 int ret; 2593 2135 2594 - WARN_ON(chunk_block >= zmd->dev->zone_nr_blocks); 2136 + WARN_ON(chunk_block >= zmd->zone_nr_blocks); 2595 2137 2596 2138 /* Get bitmap block */ 2597 2139 mblk = dmz_get_bitmap(zmd, zone, chunk_block); ··· 2621 2163 unsigned long *bitmap; 2622 2164 int n = 0; 2623 2165 2624 - WARN_ON(chunk_block + nr_blocks > 
zmd->dev->zone_nr_blocks); 2166 + WARN_ON(chunk_block + nr_blocks > zmd->zone_nr_blocks); 2625 2167 2626 2168 while (nr_blocks) { 2627 2169 /* Get bitmap block */ ··· 2665 2207 2666 2208 /* The block is valid: get the number of valid blocks from block */ 2667 2209 return dmz_to_next_set_block(zmd, zone, chunk_block, 2668 - zmd->dev->zone_nr_blocks - chunk_block, 0); 2210 + zmd->zone_nr_blocks - chunk_block, 0); 2669 2211 } 2670 2212 2671 2213 /* ··· 2681 2223 int ret; 2682 2224 2683 2225 ret = dmz_to_next_set_block(zmd, zone, start_block, 2684 - zmd->dev->zone_nr_blocks - start_block, 1); 2226 + zmd->zone_nr_blocks - start_block, 1); 2685 2227 if (ret < 0) 2686 2228 return ret; 2687 2229 ··· 2689 2231 *chunk_block = start_block; 2690 2232 2691 2233 return dmz_to_next_set_block(zmd, zone, start_block, 2692 - zmd->dev->zone_nr_blocks - start_block, 0); 2234 + zmd->zone_nr_blocks - start_block, 0); 2693 2235 } 2694 2236 2695 2237 /* ··· 2728 2270 struct dmz_mblock *mblk; 2729 2271 sector_t chunk_block = 0; 2730 2272 unsigned int bit, nr_bits; 2731 - unsigned int nr_blocks = zmd->dev->zone_nr_blocks; 2273 + unsigned int nr_blocks = zmd->zone_nr_blocks; 2732 2274 void *bitmap; 2733 2275 int n = 0; 2734 2276 ··· 2784 2326 while (!list_empty(&zmd->mblk_dirty_list)) { 2785 2327 mblk = list_first_entry(&zmd->mblk_dirty_list, 2786 2328 struct dmz_mblock, link); 2787 - dmz_dev_warn(zmd->dev, "mblock %llu still in dirty list (ref %u)", 2329 + dmz_zmd_warn(zmd, "mblock %llu still in dirty list (ref %u)", 2788 2330 (u64)mblk->no, mblk->ref); 2789 2331 list_del_init(&mblk->link); 2790 2332 rb_erase(&mblk->node, &zmd->mblk_rbtree); ··· 2802 2344 /* Sanity checks: the mblock rbtree should now be empty */ 2803 2345 root = &zmd->mblk_rbtree; 2804 2346 rbtree_postorder_for_each_entry_safe(mblk, next, root, node) { 2805 - dmz_dev_warn(zmd->dev, "mblock %llu ref %u still in rbtree", 2347 + dmz_zmd_warn(zmd, "mblock %llu ref %u still in rbtree", 2806 2348 (u64)mblk->no, mblk->ref); 2807 2349 mblk->ref = 0; 2808 2350 dmz_free_mblock(zmd, mblk); ··· 2815 2357 mutex_destroy(&zmd->map_lock); 2816 2358 } 2817 2359 2360 + static void dmz_print_dev(struct dmz_metadata *zmd, int num) 2361 + { 2362 + struct dmz_dev *dev = &zmd->dev[num]; 2363 + 2364 + if (bdev_zoned_model(dev->bdev) == BLK_ZONED_NONE) 2365 + dmz_dev_info(dev, "Regular block device"); 2366 + else 2367 + dmz_dev_info(dev, "Host-%s zoned block device", 2368 + bdev_zoned_model(dev->bdev) == BLK_ZONED_HA ? 2369 + "aware" : "managed"); 2370 + if (zmd->sb_version > 1) { 2371 + sector_t sector_offset = 2372 + dev->zone_offset << zmd->zone_nr_sectors_shift; 2373 + 2374 + dmz_dev_info(dev, " %llu 512-byte logical sectors (offset %llu)", 2375 + (u64)dev->capacity, (u64)sector_offset); 2376 + dmz_dev_info(dev, " %u zones of %llu 512-byte logical sectors (offset %llu)", 2377 + dev->nr_zones, (u64)zmd->zone_nr_sectors, 2378 + (u64)dev->zone_offset); 2379 + } else { 2380 + dmz_dev_info(dev, " %llu 512-byte logical sectors", 2381 + (u64)dev->capacity); 2382 + dmz_dev_info(dev, " %u zones of %llu 512-byte logical sectors", 2383 + dev->nr_zones, (u64)zmd->zone_nr_sectors); 2384 + } 2385 + } 2386 + 2818 2387 /* 2819 2388 * Initialize the zoned metadata. 
2820 2389 */ 2821 - int dmz_ctr_metadata(struct dmz_dev *dev, struct dmz_metadata **metadata) 2390 + int dmz_ctr_metadata(struct dmz_dev *dev, int num_dev, 2391 + struct dmz_metadata **metadata, 2392 + const char *devname) 2822 2393 { 2823 2394 struct dmz_metadata *zmd; 2824 - unsigned int i, zid; 2395 + unsigned int i; 2825 2396 struct dm_zone *zone; 2826 2397 int ret; 2827 2398 ··· 2858 2371 if (!zmd) 2859 2372 return -ENOMEM; 2860 2373 2374 + strcpy(zmd->devname, devname); 2861 2375 zmd->dev = dev; 2376 + zmd->nr_devs = num_dev; 2862 2377 zmd->mblk_rbtree = RB_ROOT; 2863 2378 init_rwsem(&zmd->mblk_sem); 2864 2379 mutex_init(&zmd->mblk_flush_lock); ··· 2869 2380 INIT_LIST_HEAD(&zmd->mblk_dirty_list); 2870 2381 2871 2382 mutex_init(&zmd->map_lock); 2872 - atomic_set(&zmd->unmap_nr_rnd, 0); 2873 - INIT_LIST_HEAD(&zmd->unmap_rnd_list); 2874 - INIT_LIST_HEAD(&zmd->map_rnd_list); 2875 2383 2876 - atomic_set(&zmd->unmap_nr_seq, 0); 2877 - INIT_LIST_HEAD(&zmd->unmap_seq_list); 2878 - INIT_LIST_HEAD(&zmd->map_seq_list); 2384 + atomic_set(&zmd->unmap_nr_cache, 0); 2385 + INIT_LIST_HEAD(&zmd->unmap_cache_list); 2386 + INIT_LIST_HEAD(&zmd->map_cache_list); 2879 2387 2880 2388 atomic_set(&zmd->nr_reserved_seq_zones, 0); 2881 2389 INIT_LIST_HEAD(&zmd->reserved_seq_zones_list); ··· 2890 2404 goto err; 2891 2405 2892 2406 /* Set metadata zones starting from sb_zone */ 2893 - zid = dmz_id(zmd, zmd->sb_zone); 2894 2407 for (i = 0; i < zmd->nr_meta_zones << 1; i++) { 2895 - zone = dmz_get(zmd, zid + i); 2896 - if (!dmz_is_rnd(zone)) 2408 + zone = dmz_get(zmd, zmd->sb[0].zone->id + i); 2409 + if (!zone) { 2410 + dmz_zmd_err(zmd, 2411 + "metadata zone %u not present", i); 2412 + ret = -ENXIO; 2897 2413 goto err; 2414 + } 2415 + if (!dmz_is_rnd(zone) && !dmz_is_cache(zone)) { 2416 + dmz_zmd_err(zmd, 2417 + "metadata zone %d is not random", i); 2418 + ret = -ENXIO; 2419 + goto err; 2420 + } 2898 2421 set_bit(DMZ_META, &zone->flags); 2899 2422 } 2900 - 2901 2423 /* Load mapping table */ 2902 2424 ret = dmz_load_mapping(zmd); 2903 2425 if (ret) ··· 2926 2432 /* Metadata cache shrinker */ 2927 2433 ret = register_shrinker(&zmd->mblk_shrinker); 2928 2434 if (ret) { 2929 - dmz_dev_err(dev, "Register metadata cache shrinker failed"); 2435 + dmz_zmd_err(zmd, "Register metadata cache shrinker failed"); 2930 2436 goto err; 2931 2437 } 2932 2438 2933 - dmz_dev_info(dev, "Host-%s zoned block device", 2934 - bdev_zoned_model(dev->bdev) == BLK_ZONED_HA ? 
2935 - "aware" : "managed"); 2936 - dmz_dev_info(dev, " %llu 512-byte logical sectors", 2937 - (u64)dev->capacity); 2938 - dmz_dev_info(dev, " %u zones of %llu 512-byte logical sectors", 2939 - dev->nr_zones, (u64)dev->zone_nr_sectors); 2940 - dmz_dev_info(dev, " %u metadata zones", 2941 - zmd->nr_meta_zones * 2); 2942 - dmz_dev_info(dev, " %u data zones for %u chunks", 2943 - zmd->nr_data_zones, zmd->nr_chunks); 2944 - dmz_dev_info(dev, " %u random zones (%u unmapped)", 2945 - zmd->nr_rnd, atomic_read(&zmd->unmap_nr_rnd)); 2946 - dmz_dev_info(dev, " %u sequential zones (%u unmapped)", 2947 - zmd->nr_seq, atomic_read(&zmd->unmap_nr_seq)); 2948 - dmz_dev_info(dev, " %u reserved sequential data zones", 2949 - zmd->nr_reserved_seq); 2439 + dmz_zmd_info(zmd, "DM-Zoned metadata version %d", zmd->sb_version); 2440 + for (i = 0; i < zmd->nr_devs; i++) 2441 + dmz_print_dev(zmd, i); 2950 2442 2951 - dmz_dev_debug(dev, "Format:"); 2952 - dmz_dev_debug(dev, "%u metadata blocks per set (%u max cache)", 2443 + dmz_zmd_info(zmd, " %u zones of %llu 512-byte logical sectors", 2444 + zmd->nr_zones, (u64)zmd->zone_nr_sectors); 2445 + dmz_zmd_debug(zmd, " %u metadata zones", 2446 + zmd->nr_meta_zones * 2); 2447 + dmz_zmd_debug(zmd, " %u data zones for %u chunks", 2448 + zmd->nr_data_zones, zmd->nr_chunks); 2449 + dmz_zmd_debug(zmd, " %u cache zones (%u unmapped)", 2450 + zmd->nr_cache, atomic_read(&zmd->unmap_nr_cache)); 2451 + for (i = 0; i < zmd->nr_devs; i++) { 2452 + dmz_zmd_debug(zmd, " %u random zones (%u unmapped)", 2453 + dmz_nr_rnd_zones(zmd, i), 2454 + dmz_nr_unmap_rnd_zones(zmd, i)); 2455 + dmz_zmd_debug(zmd, " %u sequential zones (%u unmapped)", 2456 + dmz_nr_seq_zones(zmd, i), 2457 + dmz_nr_unmap_seq_zones(zmd, i)); 2458 + } 2459 + dmz_zmd_debug(zmd, " %u reserved sequential data zones", 2460 + zmd->nr_reserved_seq); 2461 + dmz_zmd_debug(zmd, "Format:"); 2462 + dmz_zmd_debug(zmd, "%u metadata blocks per set (%u max cache)", 2953 2463 zmd->nr_meta_blocks, zmd->max_nr_mblks); 2954 - dmz_dev_debug(dev, " %u data zone mapping blocks", 2464 + dmz_zmd_debug(zmd, " %u data zone mapping blocks", 2955 2465 zmd->nr_map_blocks); 2956 - dmz_dev_debug(dev, " %u bitmap blocks", 2466 + dmz_zmd_debug(zmd, " %u bitmap blocks", 2957 2467 zmd->nr_bitmap_blocks); 2958 2468 2959 2469 *metadata = zmd; ··· 2986 2488 */ 2987 2489 int dmz_resume_metadata(struct dmz_metadata *zmd) 2988 2490 { 2989 - struct dmz_dev *dev = zmd->dev; 2990 2491 struct dm_zone *zone; 2991 2492 sector_t wp_block; 2992 2493 unsigned int i; 2993 2494 int ret; 2994 2495 2995 2496 /* Check zones */ 2996 - for (i = 0; i < dev->nr_zones; i++) { 2497 + for (i = 0; i < zmd->nr_zones; i++) { 2997 2498 zone = dmz_get(zmd, i); 2998 2499 if (!zone) { 2999 - dmz_dev_err(dev, "Unable to get zone %u", i); 2500 + dmz_zmd_err(zmd, "Unable to get zone %u", i); 3000 2501 return -EIO; 3001 2502 } 3002 - 3003 2503 wp_block = zone->wp_block; 3004 2504 3005 2505 ret = dmz_update_zone(zmd, zone); 3006 2506 if (ret) { 3007 - dmz_dev_err(dev, "Broken zone %u", i); 2507 + dmz_zmd_err(zmd, "Broken zone %u", i); 3008 2508 return ret; 3009 2509 } 3010 2510 3011 2511 if (dmz_is_offline(zone)) { 3012 - dmz_dev_warn(dev, "Zone %u is offline", i); 2512 + dmz_zmd_warn(zmd, "Zone %u is offline", i); 3013 2513 continue; 3014 2514 } 3015 2515 ··· 3015 2519 if (!dmz_is_seq(zone)) 3016 2520 zone->wp_block = 0; 3017 2521 else if (zone->wp_block != wp_block) { 3018 - dmz_dev_err(dev, "Zone %u: Invalid wp (%llu / %llu)", 2522 + dmz_zmd_err(zmd, "Zone %u: Invalid wp (%llu / %llu)", 3019 
2523 i, (u64)zone->wp_block, (u64)wp_block); 3020 2524 zone->wp_block = wp_block; 3021 2525 dmz_invalidate_blocks(zmd, zone, zone->wp_block, 3022 - dev->zone_nr_blocks - zone->wp_block); 2526 + zmd->zone_nr_blocks - zone->wp_block); 3023 2527 } 3024 2528 } 3025 2529
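The metadata changes above make zone allocation device-aware: dmz_alloc_zone() now takes a device index, and when the chosen device's unmapped list is empty a reclaim caller round-robins across the other devices before falling back to the reserved sequential zones, while a regular caller simply fails. A minimal userspace sketch of that fallback order follows; the struct and helper names are illustrative only, not the kernel API.

```c
#include <stdio.h>

#define NR_DEVS 3

/*
 * Toy model of the fallback in dmz_alloc_zone(): free-zone counters per
 * device plus a global reserved-sequential pool (illustrative layout).
 */
struct model {
	int unmap[NR_DEVS];	/* unmapped zones per device */
	int reserved_seq;	/* reserved sequential zones  */
};

/*
 * Returns the device index the zone came from, NR_DEVS for the reserved
 * pool, or -1 if nothing could be allocated.  Only reclaim callers may
 * roam to other devices and dip into the reserved pool.
 */
static int alloc_zone(struct model *m, int dev_idx, int reclaim)
{
	int i;

	for (i = 0; i < NR_DEVS; i++) {
		int d = (dev_idx + i) % NR_DEVS;

		if (m->unmap[d] > 0) {
			m->unmap[d]--;
			return d;
		}
		if (!reclaim)
			return -1;
	}
	if (reclaim && m->reserved_seq > 0) {
		m->reserved_seq--;
		return NR_DEVS;
	}
	return -1;
}

int main(void)
{
	struct model m = { .unmap = { 0, 0, 2 }, .reserved_seq = 1 };

	printf("reclaim alloc -> %d\n", alloc_zone(&m, 0, 1));	/* 2  */
	printf("regular alloc -> %d\n", alloc_zone(&m, 0, 0));	/* -1 */
	return 0;
}
```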
+132 -78
drivers/md/dm-zoned-reclaim.c
··· 13 13 14 14 struct dmz_reclaim { 15 15 struct dmz_metadata *metadata; 16 - struct dmz_dev *dev; 17 16 18 17 struct delayed_work work; 19 18 struct workqueue_struct *wq; ··· 20 21 struct dm_kcopyd_client *kc; 21 22 struct dm_kcopyd_throttle kc_throttle; 22 23 int kc_err; 24 + 25 + int dev_idx; 23 26 24 27 unsigned long flags; 25 28 ··· 45 44 * Percentage of unmapped (free) random zones below which reclaim starts 46 45 * even if the target is busy. 47 46 */ 48 - #define DMZ_RECLAIM_LOW_UNMAP_RND 30 47 + #define DMZ_RECLAIM_LOW_UNMAP_ZONES 30 49 48 50 49 /* 51 50 * Percentage of unmapped (free) random zones above which reclaim will 52 51 * stop if the target is busy. 53 52 */ 54 - #define DMZ_RECLAIM_HIGH_UNMAP_RND 50 53 + #define DMZ_RECLAIM_HIGH_UNMAP_ZONES 50 55 54 56 55 /* 57 56 * Align a sequential zone write pointer to chunk_block. ··· 60 59 sector_t block) 61 60 { 62 61 struct dmz_metadata *zmd = zrc->metadata; 62 + struct dmz_dev *dev = zone->dev; 63 63 sector_t wp_block = zone->wp_block; 64 64 unsigned int nr_blocks; 65 65 int ret; ··· 76 74 * pointer and the requested position. 77 75 */ 78 76 nr_blocks = block - wp_block; 79 - ret = blkdev_issue_zeroout(zrc->dev->bdev, 77 + ret = blkdev_issue_zeroout(dev->bdev, 80 78 dmz_start_sect(zmd, zone) + dmz_blk2sect(wp_block), 81 79 dmz_blk2sect(nr_blocks), GFP_NOIO, 0); 82 80 if (ret) { 83 - dmz_dev_err(zrc->dev, 81 + dmz_dev_err(dev, 84 82 "Align zone %u wp %llu to %llu (wp+%u) blocks failed %d", 85 - dmz_id(zmd, zone), (unsigned long long)wp_block, 83 + zone->id, (unsigned long long)wp_block, 86 84 (unsigned long long)block, nr_blocks, ret); 87 - dmz_check_bdev(zrc->dev); 85 + dmz_check_bdev(dev); 88 86 return ret; 89 87 } 90 88 ··· 118 116 struct dm_zone *src_zone, struct dm_zone *dst_zone) 119 117 { 120 118 struct dmz_metadata *zmd = zrc->metadata; 121 - struct dmz_dev *dev = zrc->dev; 122 119 struct dm_io_region src, dst; 123 120 sector_t block = 0, end_block; 124 121 sector_t nr_blocks; ··· 129 128 if (dmz_is_seq(src_zone)) 130 129 end_block = src_zone->wp_block; 131 130 else 132 - end_block = dev->zone_nr_blocks; 131 + end_block = dmz_zone_nr_blocks(zmd); 133 132 src_zone_block = dmz_start_block(zmd, src_zone); 134 133 dst_zone_block = dmz_start_block(zmd, dst_zone); 135 134 ··· 137 136 set_bit(DM_KCOPYD_WRITE_SEQ, &flags); 138 137 139 138 while (block < end_block) { 140 - if (dev->flags & DMZ_BDEV_DYING) 139 + if (src_zone->dev->flags & DMZ_BDEV_DYING) 141 140 return -EIO; 141 + if (dst_zone->dev->flags & DMZ_BDEV_DYING) 142 + return -EIO; 143 + 144 + if (dmz_reclaim_should_terminate(src_zone)) 145 + return -EINTR; 142 146 143 147 /* Get a valid region from the source zone */ 144 148 ret = dmz_first_valid_block(zmd, src_zone, &block); ··· 162 156 return ret; 163 157 } 164 158 165 - src.bdev = dev->bdev; 159 + src.bdev = src_zone->dev->bdev; 166 160 src.sector = dmz_blk2sect(src_zone_block + block); 167 161 src.count = dmz_blk2sect(nr_blocks); 168 162 169 - dst.bdev = dev->bdev; 163 + dst.bdev = dst_zone->dev->bdev; 170 164 dst.sector = dmz_blk2sect(dst_zone_block + block); 171 165 dst.count = src.count; 172 166 ··· 200 194 struct dmz_metadata *zmd = zrc->metadata; 201 195 int ret; 202 196 203 - dmz_dev_debug(zrc->dev, 204 - "Chunk %u, move buf zone %u (weight %u) to data zone %u (weight %u)", 205 - dzone->chunk, dmz_id(zmd, bzone), dmz_weight(bzone), 206 - dmz_id(zmd, dzone), dmz_weight(dzone)); 197 + DMDEBUG("(%s/%u): Chunk %u, move buf zone %u (weight %u) to data zone %u (weight %u)", 198 + dmz_metadata_label(zmd), 
zrc->dev_idx, 199 + dzone->chunk, bzone->id, dmz_weight(bzone), 200 + dzone->id, dmz_weight(dzone)); 207 201 208 202 /* Flush data zone into the buffer zone */ 209 203 ret = dmz_reclaim_copy(zrc, bzone, dzone); ··· 216 210 ret = dmz_merge_valid_blocks(zmd, bzone, dzone, chunk_block); 217 211 if (ret == 0) { 218 212 /* Free the buffer zone */ 219 - dmz_invalidate_blocks(zmd, bzone, 0, zrc->dev->zone_nr_blocks); 213 + dmz_invalidate_blocks(zmd, bzone, 0, dmz_zone_nr_blocks(zmd)); 220 214 dmz_lock_map(zmd); 221 215 dmz_unmap_zone(zmd, bzone); 222 216 dmz_unlock_zone_reclaim(dzone); ··· 239 233 struct dmz_metadata *zmd = zrc->metadata; 240 234 int ret = 0; 241 235 242 - dmz_dev_debug(zrc->dev, 243 - "Chunk %u, move data zone %u (weight %u) to buf zone %u (weight %u)", 244 - chunk, dmz_id(zmd, dzone), dmz_weight(dzone), 245 - dmz_id(zmd, bzone), dmz_weight(bzone)); 236 + DMDEBUG("(%s/%u): Chunk %u, move data zone %u (weight %u) to buf zone %u (weight %u)", 237 + dmz_metadata_label(zmd), zrc->dev_idx, 238 + chunk, dzone->id, dmz_weight(dzone), 239 + bzone->id, dmz_weight(bzone)); 246 240 247 241 /* Flush data zone into the buffer zone */ 248 242 ret = dmz_reclaim_copy(zrc, dzone, bzone); ··· 258 252 * Free the data zone and remap the chunk to 259 253 * the buffer zone. 260 254 */ 261 - dmz_invalidate_blocks(zmd, dzone, 0, zrc->dev->zone_nr_blocks); 255 + dmz_invalidate_blocks(zmd, dzone, 0, dmz_zone_nr_blocks(zmd)); 262 256 dmz_lock_map(zmd); 263 257 dmz_unmap_zone(zmd, bzone); 264 258 dmz_unmap_zone(zmd, dzone); ··· 283 277 struct dm_zone *szone = NULL; 284 278 struct dmz_metadata *zmd = zrc->metadata; 285 279 int ret; 280 + int alloc_flags = DMZ_ALLOC_SEQ; 286 281 287 - /* Get a free sequential zone */ 282 + /* Get a free random or sequential zone */ 288 283 dmz_lock_map(zmd); 289 - szone = dmz_alloc_zone(zmd, DMZ_ALLOC_RECLAIM); 284 + again: 285 + szone = dmz_alloc_zone(zmd, zrc->dev_idx, 286 + alloc_flags | DMZ_ALLOC_RECLAIM); 287 + if (!szone && alloc_flags == DMZ_ALLOC_SEQ && dmz_nr_cache_zones(zmd)) { 288 + alloc_flags = DMZ_ALLOC_RND; 289 + goto again; 290 + } 290 291 dmz_unlock_map(zmd); 291 292 if (!szone) 292 293 return -ENOSPC; 293 294 294 - dmz_dev_debug(zrc->dev, 295 - "Chunk %u, move rnd zone %u (weight %u) to seq zone %u", 296 - chunk, dmz_id(zmd, dzone), dmz_weight(dzone), 297 - dmz_id(zmd, szone)); 295 + DMDEBUG("(%s/%u): Chunk %u, move %s zone %u (weight %u) to %s zone %u", 296 + dmz_metadata_label(zmd), zrc->dev_idx, chunk, 297 + dmz_is_cache(dzone) ? "cache" : "rnd", 298 + dzone->id, dmz_weight(dzone), 299 + dmz_is_rnd(szone) ? "rnd" : "seq", szone->id); 298 300 299 301 /* Flush the random data zone into the sequential zone */ 300 302 ret = dmz_reclaim_copy(zrc, dzone, szone); ··· 320 306 dmz_unlock_map(zmd); 321 307 } else { 322 308 /* Free the data zone and remap the chunk */ 323 - dmz_invalidate_blocks(zmd, dzone, 0, zrc->dev->zone_nr_blocks); 309 + dmz_invalidate_blocks(zmd, dzone, 0, dmz_zone_nr_blocks(zmd)); 324 310 dmz_lock_map(zmd); 325 311 dmz_unmap_zone(zmd, dzone); 326 312 dmz_unlock_zone_reclaim(dzone); ··· 351 337 } 352 338 353 339 /* 340 + * Test if the target device is idle. 341 + */ 342 + static inline int dmz_target_idle(struct dmz_reclaim *zrc) 343 + { 344 + return time_is_before_jiffies(zrc->atime + DMZ_IDLE_PERIOD); 345 + } 346 + 347 + /* 354 348 * Find a candidate zone for reclaim and process it. 
355 349 */ 356 350 static int dmz_do_reclaim(struct dmz_reclaim *zrc) ··· 370 348 int ret; 371 349 372 350 /* Get a data zone */ 373 - dzone = dmz_get_zone_for_reclaim(zmd); 374 - if (IS_ERR(dzone)) 375 - return PTR_ERR(dzone); 351 + dzone = dmz_get_zone_for_reclaim(zmd, zrc->dev_idx, 352 + dmz_target_idle(zrc)); 353 + if (!dzone) { 354 + DMDEBUG("(%s/%u): No zone found to reclaim", 355 + dmz_metadata_label(zmd), zrc->dev_idx); 356 + return -EBUSY; 357 + } 376 358 377 359 start = jiffies; 378 - 379 - if (dmz_is_rnd(dzone)) { 360 + if (dmz_is_cache(dzone) || dmz_is_rnd(dzone)) { 380 361 if (!dmz_weight(dzone)) { 381 362 /* Empty zone */ 382 363 dmz_reclaim_empty(zrc, dzone); ··· 420 395 } 421 396 out: 422 397 if (ret) { 398 + if (ret == -EINTR) 399 + DMDEBUG("(%s/%u): reclaim zone %u interrupted", 400 + dmz_metadata_label(zmd), zrc->dev_idx, 401 + rzone->id); 402 + else 403 + DMDEBUG("(%s/%u): Failed to reclaim zone %u, err %d", 404 + dmz_metadata_label(zmd), zrc->dev_idx, 405 + rzone->id, ret); 423 406 dmz_unlock_zone_reclaim(dzone); 424 407 return ret; 425 408 } 426 409 427 410 ret = dmz_flush_metadata(zrc->metadata); 428 411 if (ret) { 429 - dmz_dev_debug(zrc->dev, 430 - "Metadata flush for zone %u failed, err %d\n", 431 - dmz_id(zmd, rzone), ret); 412 + DMDEBUG("(%s/%u): Metadata flush for zone %u failed, err %d", 413 + dmz_metadata_label(zmd), zrc->dev_idx, rzone->id, ret); 432 414 return ret; 433 415 } 434 416 435 - dmz_dev_debug(zrc->dev, "Reclaimed zone %u in %u ms", 436 - dmz_id(zmd, rzone), jiffies_to_msecs(jiffies - start)); 417 + DMDEBUG("(%s/%u): Reclaimed zone %u in %u ms", 418 + dmz_metadata_label(zmd), zrc->dev_idx, 419 + rzone->id, jiffies_to_msecs(jiffies - start)); 437 420 return 0; 438 421 } 439 422 440 - /* 441 - * Test if the target device is idle. 442 - */ 443 - static inline int dmz_target_idle(struct dmz_reclaim *zrc) 423 + static unsigned int dmz_reclaim_percentage(struct dmz_reclaim *zrc) 444 424 { 445 - return time_is_before_jiffies(zrc->atime + DMZ_IDLE_PERIOD); 425 + struct dmz_metadata *zmd = zrc->metadata; 426 + unsigned int nr_cache = dmz_nr_cache_zones(zmd); 427 + unsigned int nr_unmap, nr_zones; 428 + 429 + if (nr_cache) { 430 + nr_zones = nr_cache; 431 + nr_unmap = dmz_nr_unmap_cache_zones(zmd); 432 + } else { 433 + nr_zones = dmz_nr_rnd_zones(zmd, zrc->dev_idx); 434 + nr_unmap = dmz_nr_unmap_rnd_zones(zmd, zrc->dev_idx); 435 + } 436 + return nr_unmap * 100 / nr_zones; 446 437 } 447 438 448 439 /* 449 440 * Test if reclaim is necessary. 450 441 */ 451 - static bool dmz_should_reclaim(struct dmz_reclaim *zrc) 442 + static bool dmz_should_reclaim(struct dmz_reclaim *zrc, unsigned int p_unmap) 452 443 { 453 - struct dmz_metadata *zmd = zrc->metadata; 454 - unsigned int nr_rnd = dmz_nr_rnd_zones(zmd); 455 - unsigned int nr_unmap_rnd = dmz_nr_unmap_rnd_zones(zmd); 456 - unsigned int p_unmap_rnd = nr_unmap_rnd * 100 / nr_rnd; 444 + unsigned int nr_reclaim; 445 + 446 + nr_reclaim = dmz_nr_rnd_zones(zrc->metadata, zrc->dev_idx); 447 + 448 + if (dmz_nr_cache_zones(zrc->metadata)) { 449 + /* 450 + * The first device in a multi-device 451 + * setup only contains cache zones, so 452 + * never start reclaim there. 
453 + */ 454 + if (zrc->dev_idx == 0) 455 + return false; 456 + nr_reclaim += dmz_nr_cache_zones(zrc->metadata); 457 + } 457 458 458 459 /* Reclaim when idle */ 459 - if (dmz_target_idle(zrc) && nr_unmap_rnd < nr_rnd) 460 + if (dmz_target_idle(zrc) && nr_reclaim) 460 461 return true; 461 462 462 - /* If there are still plenty of random zones, do not reclaim */ 463 - if (p_unmap_rnd >= DMZ_RECLAIM_HIGH_UNMAP_RND) 463 + /* If there are still plenty of cache zones, do not reclaim */ 464 + if (p_unmap >= DMZ_RECLAIM_HIGH_UNMAP_ZONES) 464 465 return false; 465 466 466 467 /* 467 - * If the percentage of unmapped random zones is low, 468 + * If the percentage of unmapped cache zones is low, 468 469 * reclaim even if the target is busy. 469 470 */ 470 - return p_unmap_rnd <= DMZ_RECLAIM_LOW_UNMAP_RND; 471 + return p_unmap <= DMZ_RECLAIM_LOW_UNMAP_ZONES; 471 472 } 472 473 473 474 /* ··· 503 452 { 504 453 struct dmz_reclaim *zrc = container_of(work, struct dmz_reclaim, work.work); 505 454 struct dmz_metadata *zmd = zrc->metadata; 506 - unsigned int nr_rnd, nr_unmap_rnd; 507 - unsigned int p_unmap_rnd; 455 + unsigned int p_unmap, nr_unmap_rnd = 0, nr_rnd = 0; 508 456 int ret; 509 457 510 - if (dmz_bdev_is_dying(zrc->dev)) 458 + if (dmz_dev_is_dying(zmd)) 511 459 return; 512 460 513 - if (!dmz_should_reclaim(zrc)) { 461 + p_unmap = dmz_reclaim_percentage(zrc); 462 + if (!dmz_should_reclaim(zrc, p_unmap)) { 514 463 mod_delayed_work(zrc->wq, &zrc->work, DMZ_IDLE_PERIOD); 515 464 return; 516 465 } ··· 521 470 * and slower if there are still some free random zones to avoid 522 471 * as much as possible to negatively impact the user workload. 523 472 */ 524 - nr_rnd = dmz_nr_rnd_zones(zmd); 525 - nr_unmap_rnd = dmz_nr_unmap_rnd_zones(zmd); 526 - p_unmap_rnd = nr_unmap_rnd * 100 / nr_rnd; 527 - if (dmz_target_idle(zrc) || p_unmap_rnd < DMZ_RECLAIM_LOW_UNMAP_RND / 2) { 473 + if (dmz_target_idle(zrc) || p_unmap < DMZ_RECLAIM_LOW_UNMAP_ZONES / 2) { 528 474 /* Idle or very low percentage: go fast */ 529 475 zrc->kc_throttle.throttle = 100; 530 476 } else { 531 477 /* Busy but we still have some random zone: throttle */ 532 - zrc->kc_throttle.throttle = min(75U, 100U - p_unmap_rnd / 2); 478 + zrc->kc_throttle.throttle = min(75U, 100U - p_unmap / 2); 533 479 } 534 480 535 - dmz_dev_debug(zrc->dev, 536 - "Reclaim (%u): %s, %u%% free rnd zones (%u/%u)", 537 - zrc->kc_throttle.throttle, 538 - (dmz_target_idle(zrc) ? "Idle" : "Busy"), 539 - p_unmap_rnd, nr_unmap_rnd, nr_rnd); 481 + nr_unmap_rnd = dmz_nr_unmap_rnd_zones(zmd, zrc->dev_idx); 482 + nr_rnd = dmz_nr_rnd_zones(zmd, zrc->dev_idx); 483 + 484 + DMDEBUG("(%s/%u): Reclaim (%u): %s, %u%% free zones (%u/%u cache %u/%u random)", 485 + dmz_metadata_label(zmd), zrc->dev_idx, 486 + zrc->kc_throttle.throttle, 487 + (dmz_target_idle(zrc) ? "Idle" : "Busy"), 488 + p_unmap, dmz_nr_unmap_cache_zones(zmd), 489 + dmz_nr_cache_zones(zmd), 490 + dmz_nr_unmap_rnd_zones(zmd, zrc->dev_idx), 491 + dmz_nr_rnd_zones(zmd, zrc->dev_idx)); 540 492 541 493 ret = dmz_do_reclaim(zrc); 542 - if (ret) { 543 - dmz_dev_debug(zrc->dev, "Reclaim error %d\n", ret); 544 - if (!dmz_check_bdev(zrc->dev)) 494 + if (ret && ret != -EINTR) { 495 + if (!dmz_check_dev(zmd)) 545 496 return; 546 497 } 547 498 ··· 553 500 /* 554 501 * Initialize reclaim. 
555 502 */ 556 - int dmz_ctr_reclaim(struct dmz_dev *dev, struct dmz_metadata *zmd, 557 - struct dmz_reclaim **reclaim) 503 + int dmz_ctr_reclaim(struct dmz_metadata *zmd, 504 + struct dmz_reclaim **reclaim, int idx) 558 505 { 559 506 struct dmz_reclaim *zrc; 560 507 int ret; ··· 563 510 if (!zrc) 564 511 return -ENOMEM; 565 512 566 - zrc->dev = dev; 567 513 zrc->metadata = zmd; 568 514 zrc->atime = jiffies; 515 + zrc->dev_idx = idx; 569 516 570 517 /* Reclaim kcopyd client */ 571 518 zrc->kc = dm_kcopyd_client_create(&zrc->kc_throttle); ··· 577 524 578 525 /* Reclaim work */ 579 526 INIT_DELAYED_WORK(&zrc->work, dmz_reclaim_work); 580 - zrc->wq = alloc_ordered_workqueue("dmz_rwq_%s", WQ_MEM_RECLAIM, 581 - dev->name); 527 + zrc->wq = alloc_ordered_workqueue("dmz_rwq_%s_%d", WQ_MEM_RECLAIM, 528 + dmz_metadata_label(zmd), idx); 582 529 if (!zrc->wq) { 583 530 ret = -ENOMEM; 584 531 goto err; ··· 636 583 */ 637 584 void dmz_schedule_reclaim(struct dmz_reclaim *zrc) 638 585 { 639 - if (dmz_should_reclaim(zrc)) 586 + unsigned int p_unmap = dmz_reclaim_percentage(zrc); 587 + 588 + if (dmz_should_reclaim(zrc, p_unmap)) 640 589 mod_delayed_work(zrc->wq, &zrc->work, 0); 641 590 } 642 -
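Reclaim now keys off the percentage of unmapped cache zones (or random zones when no cache device is present): below DMZ_RECLAIM_LOW_UNMAP_ZONES it runs even when the target is busy, above DMZ_RECLAIM_HIGH_UNMAP_ZONES a busy target is left alone, and the same percentage drives the kcopyd throttle. A small standalone sketch of that arithmetic; the thresholds are copied from the diff, everything else is simplified and illustrative.

```c
#include <stdio.h>

#define DMZ_RECLAIM_LOW_UNMAP_ZONES	30	/* reclaim even when busy below this  */
#define DMZ_RECLAIM_HIGH_UNMAP_ZONES	50	/* leave a busy target alone above it */

/* Same calculation as dmz_reclaim_percentage(). */
static unsigned int unmap_percentage(unsigned int nr_unmap, unsigned int nr_zones)
{
	return nr_unmap * 100 / nr_zones;
}

/* Simplified version of dmz_should_reclaim() (drops the cache-device check). */
static int should_reclaim(int idle, unsigned int nr_reclaimable, unsigned int p_unmap)
{
	if (idle && nr_reclaimable)
		return 1;
	if (p_unmap >= DMZ_RECLAIM_HIGH_UNMAP_ZONES)
		return 0;
	return p_unmap <= DMZ_RECLAIM_LOW_UNMAP_ZONES;
}

/*
 * Throttle choice mirroring dmz_reclaim_work(): full speed when idle or
 * nearly out of free zones, otherwise min(75, 100 - p/2).
 */
static unsigned int kcopyd_throttle(int idle, unsigned int p_unmap)
{
	unsigned int t;

	if (idle || p_unmap < DMZ_RECLAIM_LOW_UNMAP_ZONES / 2)
		return 100;
	t = 100 - p_unmap / 2;
	return t < 75 ? t : 75;
}

int main(void)
{
	unsigned int p = unmap_percentage(12, 64);	/* 18% unmapped */

	printf("p=%u%% reclaim=%d throttle=%u\n",
	       p, should_reclaim(0, 12, p), kcopyd_throttle(0, p));
	return 0;
}
```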
+324 -139
drivers/md/dm-zoned-target.c
··· 17 17 * Zone BIO context. 18 18 */ 19 19 struct dmz_bioctx { 20 - struct dmz_target *target; 20 + struct dmz_dev *dev; 21 21 struct dm_zone *zone; 22 22 struct bio *bio; 23 23 refcount_t ref; ··· 38 38 * Target descriptor. 39 39 */ 40 40 struct dmz_target { 41 - struct dm_dev *ddev; 41 + struct dm_dev **ddev; 42 + unsigned int nr_ddevs; 42 43 43 - unsigned long flags; 44 + unsigned int flags; 44 45 45 46 /* Zoned block device information */ 46 47 struct dmz_dev *dev; 47 48 48 49 /* For metadata handling */ 49 50 struct dmz_metadata *metadata; 50 - 51 - /* For reclaim */ 52 - struct dmz_reclaim *reclaim; 53 51 54 52 /* For chunk work */ 55 53 struct radix_tree_root chunk_rxtree; ··· 74 76 */ 75 77 static inline void dmz_bio_endio(struct bio *bio, blk_status_t status) 76 78 { 77 - struct dmz_bioctx *bioctx = dm_per_bio_data(bio, sizeof(struct dmz_bioctx)); 79 + struct dmz_bioctx *bioctx = 80 + dm_per_bio_data(bio, sizeof(struct dmz_bioctx)); 78 81 79 82 if (status != BLK_STS_OK && bio->bi_status == BLK_STS_OK) 80 83 bio->bi_status = status; 81 - if (bio->bi_status != BLK_STS_OK) 82 - bioctx->target->dev->flags |= DMZ_CHECK_BDEV; 84 + if (bioctx->dev && bio->bi_status != BLK_STS_OK) 85 + bioctx->dev->flags |= DMZ_CHECK_BDEV; 83 86 84 87 if (refcount_dec_and_test(&bioctx->ref)) { 85 88 struct dm_zone *zone = bioctx->zone; ··· 117 118 struct bio *bio, sector_t chunk_block, 118 119 unsigned int nr_blocks) 119 120 { 120 - struct dmz_bioctx *bioctx = dm_per_bio_data(bio, sizeof(struct dmz_bioctx)); 121 + struct dmz_bioctx *bioctx = 122 + dm_per_bio_data(bio, sizeof(struct dmz_bioctx)); 123 + struct dmz_dev *dev = zone->dev; 121 124 struct bio *clone; 125 + 126 + if (dev->flags & DMZ_BDEV_DYING) 127 + return -EIO; 122 128 123 129 clone = bio_clone_fast(bio, GFP_NOIO, &dmz->bio_set); 124 130 if (!clone) 125 131 return -ENOMEM; 126 132 127 - bio_set_dev(clone, dmz->dev->bdev); 133 + bio_set_dev(clone, dev->bdev); 134 + bioctx->dev = dev; 128 135 clone->bi_iter.bi_sector = 129 136 dmz_start_sect(dmz->metadata, zone) + dmz_blk2sect(chunk_block); 130 137 clone->bi_iter.bi_size = dmz_blk2sect(nr_blocks) << SECTOR_SHIFT; ··· 170 165 static int dmz_handle_read(struct dmz_target *dmz, struct dm_zone *zone, 171 166 struct bio *bio) 172 167 { 173 - sector_t chunk_block = dmz_chunk_block(dmz->dev, dmz_bio_block(bio)); 168 + struct dmz_metadata *zmd = dmz->metadata; 169 + sector_t chunk_block = dmz_chunk_block(zmd, dmz_bio_block(bio)); 174 170 unsigned int nr_blocks = dmz_bio_blocks(bio); 175 171 sector_t end_block = chunk_block + nr_blocks; 176 172 struct dm_zone *rzone, *bzone; ··· 183 177 return 0; 184 178 } 185 179 186 - dmz_dev_debug(dmz->dev, "READ chunk %llu -> %s zone %u, block %llu, %u blocks", 187 - (unsigned long long)dmz_bio_chunk(dmz->dev, bio), 188 - (dmz_is_rnd(zone) ? "RND" : "SEQ"), 189 - dmz_id(dmz->metadata, zone), 190 - (unsigned long long)chunk_block, nr_blocks); 180 + DMDEBUG("(%s): READ chunk %llu -> %s zone %u, block %llu, %u blocks", 181 + dmz_metadata_label(zmd), 182 + (unsigned long long)dmz_bio_chunk(zmd, bio), 183 + (dmz_is_rnd(zone) ? "RND" : 184 + (dmz_is_cache(zone) ? 
"CACHE" : "SEQ")), 185 + zone->id, 186 + (unsigned long long)chunk_block, nr_blocks); 191 187 192 188 /* Check block validity to determine the read location */ 193 189 bzone = zone->bzone; 194 190 while (chunk_block < end_block) { 195 191 nr_blocks = 0; 196 - if (dmz_is_rnd(zone) || chunk_block < zone->wp_block) { 192 + if (dmz_is_rnd(zone) || dmz_is_cache(zone) || 193 + chunk_block < zone->wp_block) { 197 194 /* Test block validity in the data zone */ 198 - ret = dmz_block_valid(dmz->metadata, zone, chunk_block); 195 + ret = dmz_block_valid(zmd, zone, chunk_block); 199 196 if (ret < 0) 200 197 return ret; 201 198 if (ret > 0) { ··· 213 204 * Check the buffer zone, if there is one. 214 205 */ 215 206 if (!nr_blocks && bzone) { 216 - ret = dmz_block_valid(dmz->metadata, bzone, chunk_block); 207 + ret = dmz_block_valid(zmd, bzone, chunk_block); 217 208 if (ret < 0) 218 209 return ret; 219 210 if (ret > 0) { ··· 225 216 226 217 if (nr_blocks) { 227 218 /* Valid blocks found: read them */ 228 - nr_blocks = min_t(unsigned int, nr_blocks, end_block - chunk_block); 229 - ret = dmz_submit_bio(dmz, rzone, bio, chunk_block, nr_blocks); 219 + nr_blocks = min_t(unsigned int, nr_blocks, 220 + end_block - chunk_block); 221 + ret = dmz_submit_bio(dmz, rzone, bio, 222 + chunk_block, nr_blocks); 230 223 if (ret) 231 224 return ret; 232 225 chunk_block += nr_blocks; ··· 319 308 static int dmz_handle_write(struct dmz_target *dmz, struct dm_zone *zone, 320 309 struct bio *bio) 321 310 { 322 - sector_t chunk_block = dmz_chunk_block(dmz->dev, dmz_bio_block(bio)); 311 + struct dmz_metadata *zmd = dmz->metadata; 312 + sector_t chunk_block = dmz_chunk_block(zmd, dmz_bio_block(bio)); 323 313 unsigned int nr_blocks = dmz_bio_blocks(bio); 324 314 325 315 if (!zone) 326 316 return -ENOSPC; 327 317 328 - dmz_dev_debug(dmz->dev, "WRITE chunk %llu -> %s zone %u, block %llu, %u blocks", 329 - (unsigned long long)dmz_bio_chunk(dmz->dev, bio), 330 - (dmz_is_rnd(zone) ? "RND" : "SEQ"), 331 - dmz_id(dmz->metadata, zone), 332 - (unsigned long long)chunk_block, nr_blocks); 318 + DMDEBUG("(%s): WRITE chunk %llu -> %s zone %u, block %llu, %u blocks", 319 + dmz_metadata_label(zmd), 320 + (unsigned long long)dmz_bio_chunk(zmd, bio), 321 + (dmz_is_rnd(zone) ? "RND" : 322 + (dmz_is_cache(zone) ? "CACHE" : "SEQ")), 323 + zone->id, 324 + (unsigned long long)chunk_block, nr_blocks); 333 325 334 - if (dmz_is_rnd(zone) || chunk_block == zone->wp_block) { 326 + if (dmz_is_rnd(zone) || dmz_is_cache(zone) || 327 + chunk_block == zone->wp_block) { 335 328 /* 336 329 * zone is a random zone or it is a sequential zone 337 330 * and the BIO is aligned to the zone write pointer: 338 331 * direct write the zone. 
339 332 */ 340 - return dmz_handle_direct_write(dmz, zone, bio, chunk_block, nr_blocks); 333 + return dmz_handle_direct_write(dmz, zone, bio, 334 + chunk_block, nr_blocks); 341 335 } 342 336 343 337 /* ··· 361 345 struct dmz_metadata *zmd = dmz->metadata; 362 346 sector_t block = dmz_bio_block(bio); 363 347 unsigned int nr_blocks = dmz_bio_blocks(bio); 364 - sector_t chunk_block = dmz_chunk_block(dmz->dev, block); 348 + sector_t chunk_block = dmz_chunk_block(zmd, block); 365 349 int ret = 0; 366 350 367 351 /* For unmapped chunks, there is nothing to do */ ··· 371 355 if (dmz_is_readonly(zone)) 372 356 return -EROFS; 373 357 374 - dmz_dev_debug(dmz->dev, "DISCARD chunk %llu -> zone %u, block %llu, %u blocks", 375 - (unsigned long long)dmz_bio_chunk(dmz->dev, bio), 376 - dmz_id(zmd, zone), 377 - (unsigned long long)chunk_block, nr_blocks); 358 + DMDEBUG("(%s): DISCARD chunk %llu -> zone %u, block %llu, %u blocks", 359 + dmz_metadata_label(dmz->metadata), 360 + (unsigned long long)dmz_bio_chunk(zmd, bio), 361 + zone->id, 362 + (unsigned long long)chunk_block, nr_blocks); 378 363 379 364 /* 380 365 * Invalidate blocks in the data zone and its 381 366 * buffer zone if one is mapped. 382 367 */ 383 - if (dmz_is_rnd(zone) || chunk_block < zone->wp_block) 368 + if (dmz_is_rnd(zone) || dmz_is_cache(zone) || 369 + chunk_block < zone->wp_block) 384 370 ret = dmz_invalidate_blocks(zmd, zone, chunk_block, nr_blocks); 385 371 if (ret == 0 && zone->bzone) 386 372 ret = dmz_invalidate_blocks(zmd, zone->bzone, ··· 396 378 static void dmz_handle_bio(struct dmz_target *dmz, struct dm_chunk_work *cw, 397 379 struct bio *bio) 398 380 { 399 - struct dmz_bioctx *bioctx = dm_per_bio_data(bio, sizeof(struct dmz_bioctx)); 381 + struct dmz_bioctx *bioctx = 382 + dm_per_bio_data(bio, sizeof(struct dmz_bioctx)); 400 383 struct dmz_metadata *zmd = dmz->metadata; 401 384 struct dm_zone *zone; 402 - int ret; 385 + int i, ret; 403 386 404 387 /* 405 388 * Write may trigger a zone allocation. So make sure the 406 389 * allocation can succeed. 407 390 */ 408 391 if (bio_op(bio) == REQ_OP_WRITE) 409 - dmz_schedule_reclaim(dmz->reclaim); 392 + for (i = 0; i < dmz->nr_ddevs; i++) 393 + dmz_schedule_reclaim(dmz->dev[i].reclaim); 410 394 411 395 dmz_lock_metadata(zmd); 412 - 413 - if (dmz->dev->flags & DMZ_BDEV_DYING) { 414 - ret = -EIO; 415 - goto out; 416 - } 417 396 418 397 /* 419 398 * Get the data zone mapping the chunk. There may be no 420 399 * mapping for read and discard. If a mapping is obtained, 421 400 + the zone returned will be set to active state. 
422 401 */ 423 - zone = dmz_get_chunk_mapping(zmd, dmz_bio_chunk(dmz->dev, bio), 402 + zone = dmz_get_chunk_mapping(zmd, dmz_bio_chunk(zmd, bio), 424 403 bio_op(bio)); 425 404 if (IS_ERR(zone)) { 426 405 ret = PTR_ERR(zone); ··· 428 413 if (zone) { 429 414 dmz_activate_zone(zone); 430 415 bioctx->zone = zone; 416 + dmz_reclaim_bio_acc(zone->dev->reclaim); 431 417 } 432 418 433 419 switch (bio_op(bio)) { ··· 443 427 ret = dmz_handle_discard(dmz, zone, bio); 444 428 break; 445 429 default: 446 - dmz_dev_err(dmz->dev, "Unsupported BIO operation 0x%x", 447 - bio_op(bio)); 430 + DMERR("(%s): Unsupported BIO operation 0x%x", 431 + dmz_metadata_label(dmz->metadata), bio_op(bio)); 448 432 ret = -EIO; 449 433 } 450 434 ··· 518 502 /* Flush dirty metadata blocks */ 519 503 ret = dmz_flush_metadata(dmz->metadata); 520 504 if (ret) 521 - dmz_dev_debug(dmz->dev, "Metadata flush failed, rc=%d\n", ret); 505 + DMDEBUG("(%s): Metadata flush failed, rc=%d", 506 + dmz_metadata_label(dmz->metadata), ret); 522 507 523 508 /* Process queued flush requests */ 524 509 while (1) { ··· 542 525 */ 543 526 static int dmz_queue_chunk_work(struct dmz_target *dmz, struct bio *bio) 544 527 { 545 - unsigned int chunk = dmz_bio_chunk(dmz->dev, bio); 528 + unsigned int chunk = dmz_bio_chunk(dmz->metadata, bio); 546 529 struct dm_chunk_work *cw; 547 530 int ret = 0; 548 531 ··· 575 558 576 559 bio_list_add(&cw->bio_list, bio); 577 560 578 - dmz_reclaim_bio_acc(dmz->reclaim); 579 561 if (queue_work(dmz->chunk_wq, &cw->work)) 580 562 dmz_get_chunk_work(cw); 581 563 out: ··· 634 618 static int dmz_map(struct dm_target *ti, struct bio *bio) 635 619 { 636 620 struct dmz_target *dmz = ti->private; 637 - struct dmz_dev *dev = dmz->dev; 621 + struct dmz_metadata *zmd = dmz->metadata; 638 622 struct dmz_bioctx *bioctx = dm_per_bio_data(bio, sizeof(struct dmz_bioctx)); 639 623 sector_t sector = bio->bi_iter.bi_sector; 640 624 unsigned int nr_sectors = bio_sectors(bio); 641 625 sector_t chunk_sector; 642 626 int ret; 643 627 644 - if (dmz_bdev_is_dying(dmz->dev)) 628 + if (dmz_dev_is_dying(zmd)) 645 629 return DM_MAPIO_KILL; 646 630 647 - dmz_dev_debug(dev, "BIO op %d sector %llu + %u => chunk %llu, block %llu, %u blocks", 648 - bio_op(bio), (unsigned long long)sector, nr_sectors, 649 - (unsigned long long)dmz_bio_chunk(dmz->dev, bio), 650 - (unsigned long long)dmz_chunk_block(dmz->dev, dmz_bio_block(bio)), 651 - (unsigned int)dmz_bio_blocks(bio)); 652 - 653 - bio_set_dev(bio, dev->bdev); 631 + DMDEBUG("(%s): BIO op %d sector %llu + %u => chunk %llu, block %llu, %u blocks", 632 + dmz_metadata_label(zmd), 633 + bio_op(bio), (unsigned long long)sector, nr_sectors, 634 + (unsigned long long)dmz_bio_chunk(zmd, bio), 635 + (unsigned long long)dmz_chunk_block(zmd, dmz_bio_block(bio)), 636 + (unsigned int)dmz_bio_blocks(bio)); 654 637 655 638 if (!nr_sectors && bio_op(bio) != REQ_OP_WRITE) 656 639 return DM_MAPIO_REMAPPED; ··· 659 644 return DM_MAPIO_KILL; 660 645 661 646 /* Initialize the BIO context */ 662 - bioctx->target = dmz; 647 + bioctx->dev = NULL; 663 648 bioctx->zone = NULL; 664 649 bioctx->bio = bio; 665 650 refcount_set(&bioctx->ref, 1); ··· 674 659 } 675 660 676 661 /* Split zone BIOs to fit entirely into a zone */ 677 - chunk_sector = sector & (dev->zone_nr_sectors - 1); 678 - if (chunk_sector + nr_sectors > dev->zone_nr_sectors) 679 - dm_accept_partial_bio(bio, dev->zone_nr_sectors - chunk_sector); 662 + chunk_sector = sector & (dmz_zone_nr_sectors(zmd) - 1); 663 + if (chunk_sector + nr_sectors > dmz_zone_nr_sectors(zmd)) 664 
+ dm_accept_partial_bio(bio, dmz_zone_nr_sectors(zmd) - chunk_sector); 680 665 681 666 /* Now ready to handle this BIO */ 682 667 ret = dmz_queue_chunk_work(dmz, bio); 683 668 if (ret) { 684 - dmz_dev_debug(dmz->dev, 685 - "BIO op %d, can't process chunk %llu, err %i\n", 686 - bio_op(bio), (u64)dmz_bio_chunk(dmz->dev, bio), 687 - ret); 669 + DMDEBUG("(%s): BIO op %d, can't process chunk %llu, err %i", 670 + dmz_metadata_label(zmd), 671 + bio_op(bio), (u64)dmz_bio_chunk(zmd, bio), 672 + ret); 688 673 return DM_MAPIO_REQUEUE; 689 674 } 690 675 ··· 694 679 /* 695 680 * Get zoned device information. 696 681 */ 697 - static int dmz_get_zoned_device(struct dm_target *ti, char *path) 682 + static int dmz_get_zoned_device(struct dm_target *ti, char *path, 683 + int idx, int nr_devs) 698 684 { 699 685 struct dmz_target *dmz = ti->private; 700 - struct request_queue *q; 686 + struct dm_dev *ddev; 701 687 struct dmz_dev *dev; 702 - sector_t aligned_capacity; 703 688 int ret; 689 + struct block_device *bdev; 704 690 705 691 /* Get the target device */ 706 - ret = dm_get_device(ti, path, dm_table_get_mode(ti->table), &dmz->ddev); 692 + ret = dm_get_device(ti, path, dm_table_get_mode(ti->table), &ddev); 707 693 if (ret) { 708 694 ti->error = "Get target device failed"; 709 - dmz->ddev = NULL; 710 695 return ret; 711 696 } 712 697 713 - dev = kzalloc(sizeof(struct dmz_dev), GFP_KERNEL); 714 - if (!dev) { 715 - ret = -ENOMEM; 716 - goto err; 698 + bdev = ddev->bdev; 699 + if (bdev_zoned_model(bdev) == BLK_ZONED_NONE) { 700 + if (nr_devs == 1) { 701 + ti->error = "Invalid regular device"; 702 + goto err; 703 + } 704 + if (idx != 0) { 705 + ti->error = "First device must be a regular device"; 706 + goto err; 707 + } 708 + if (dmz->ddev[0]) { 709 + ti->error = "Too many regular devices"; 710 + goto err; 711 + } 712 + dev = &dmz->dev[idx]; 713 + dev->flags = DMZ_BDEV_REGULAR; 714 + } else { 715 + if (dmz->ddev[idx]) { 716 + ti->error = "Too many zoned devices"; 717 + goto err; 718 + } 719 + if (nr_devs > 1 && idx == 0) { 720 + ti->error = "First device must be a regular device"; 721 + goto err; 722 + } 723 + dev = &dmz->dev[idx]; 717 724 } 718 - 719 - dev->bdev = dmz->ddev->bdev; 725 + dev->bdev = bdev; 726 + dev->dev_idx = idx; 720 727 (void)bdevname(dev->bdev, dev->name); 721 728 722 - if (bdev_zoned_model(dev->bdev) == BLK_ZONED_NONE) { 723 - ti->error = "Not a zoned block device"; 724 - ret = -EINVAL; 729 + dev->capacity = i_size_read(bdev->bd_inode) >> SECTOR_SHIFT; 730 + if (ti->begin) { 731 + ti->error = "Partial mapping is not supported"; 725 732 goto err; 726 733 } 727 734 728 - q = bdev_get_queue(dev->bdev); 729 - dev->capacity = i_size_read(dev->bdev->bd_inode) >> SECTOR_SHIFT; 730 - aligned_capacity = dev->capacity & 731 - ~((sector_t)blk_queue_zone_sectors(q) - 1); 732 - if (ti->begin || 733 - ((ti->len != dev->capacity) && (ti->len != aligned_capacity))) { 734 - ti->error = "Partial mapping not supported"; 735 - ret = -EINVAL; 736 - goto err; 737 - } 738 - 739 - dev->zone_nr_sectors = blk_queue_zone_sectors(q); 740 - dev->zone_nr_sectors_shift = ilog2(dev->zone_nr_sectors); 741 - 742 - dev->zone_nr_blocks = dmz_sect2blk(dev->zone_nr_sectors); 743 - dev->zone_nr_blocks_shift = ilog2(dev->zone_nr_blocks); 744 - 745 - dev->nr_zones = blkdev_nr_zones(dev->bdev->bd_disk); 746 - 747 - dmz->dev = dev; 735 + dmz->ddev[idx] = ddev; 748 736 749 737 return 0; 750 738 err: 751 - dm_put_device(ti, dmz->ddev); 752 - kfree(dev); 753 - 754 - return ret; 739 + dm_put_device(ti, ddev); 740 + return -EINVAL; 755 
741 } 756 742 757 743 /* ··· 761 745 static void dmz_put_zoned_device(struct dm_target *ti) 762 746 { 763 747 struct dmz_target *dmz = ti->private; 748 + int i; 764 749 765 - dm_put_device(ti, dmz->ddev); 766 - kfree(dmz->dev); 767 - dmz->dev = NULL; 750 + for (i = 0; i < dmz->nr_ddevs; i++) { 751 + if (dmz->ddev[i]) { 752 + dm_put_device(ti, dmz->ddev[i]); 753 + dmz->ddev[i] = NULL; 754 + } 755 + } 756 + } 757 + 758 + static int dmz_fixup_devices(struct dm_target *ti) 759 + { 760 + struct dmz_target *dmz = ti->private; 761 + struct dmz_dev *reg_dev, *zoned_dev; 762 + struct request_queue *q; 763 + sector_t zone_nr_sectors = 0; 764 + int i; 765 + 766 + /* 767 + * When we have more than on devices, the first one must be a 768 + * regular block device and the others zoned block devices. 769 + */ 770 + if (dmz->nr_ddevs > 1) { 771 + reg_dev = &dmz->dev[0]; 772 + if (!(reg_dev->flags & DMZ_BDEV_REGULAR)) { 773 + ti->error = "Primary disk is not a regular device"; 774 + return -EINVAL; 775 + } 776 + for (i = 1; i < dmz->nr_ddevs; i++) { 777 + zoned_dev = &dmz->dev[i]; 778 + if (zoned_dev->flags & DMZ_BDEV_REGULAR) { 779 + ti->error = "Secondary disk is not a zoned device"; 780 + return -EINVAL; 781 + } 782 + q = bdev_get_queue(zoned_dev->bdev); 783 + if (zone_nr_sectors && 784 + zone_nr_sectors != blk_queue_zone_sectors(q)) { 785 + ti->error = "Zone nr sectors mismatch"; 786 + return -EINVAL; 787 + } 788 + zone_nr_sectors = blk_queue_zone_sectors(q); 789 + zoned_dev->zone_nr_sectors = zone_nr_sectors; 790 + zoned_dev->nr_zones = 791 + blkdev_nr_zones(zoned_dev->bdev->bd_disk); 792 + } 793 + } else { 794 + reg_dev = NULL; 795 + zoned_dev = &dmz->dev[0]; 796 + if (zoned_dev->flags & DMZ_BDEV_REGULAR) { 797 + ti->error = "Disk is not a zoned device"; 798 + return -EINVAL; 799 + } 800 + q = bdev_get_queue(zoned_dev->bdev); 801 + zoned_dev->zone_nr_sectors = blk_queue_zone_sectors(q); 802 + zoned_dev->nr_zones = blkdev_nr_zones(zoned_dev->bdev->bd_disk); 803 + } 804 + 805 + if (reg_dev) { 806 + sector_t zone_offset; 807 + 808 + reg_dev->zone_nr_sectors = zone_nr_sectors; 809 + reg_dev->nr_zones = 810 + DIV_ROUND_UP_SECTOR_T(reg_dev->capacity, 811 + reg_dev->zone_nr_sectors); 812 + reg_dev->zone_offset = 0; 813 + zone_offset = reg_dev->nr_zones; 814 + for (i = 1; i < dmz->nr_ddevs; i++) { 815 + dmz->dev[i].zone_offset = zone_offset; 816 + zone_offset += dmz->dev[i].nr_zones; 817 + } 818 + } 819 + return 0; 768 820 } 769 821 770 822 /* ··· 841 757 static int dmz_ctr(struct dm_target *ti, unsigned int argc, char **argv) 842 758 { 843 759 struct dmz_target *dmz; 844 - struct dmz_dev *dev; 845 - int ret; 760 + int ret, i; 846 761 847 762 /* Check arguments */ 848 - if (argc != 1) { 763 + if (argc < 1) { 849 764 ti->error = "Invalid argument count"; 850 765 return -EINVAL; 851 766 } ··· 855 772 ti->error = "Unable to allocate the zoned target descriptor"; 856 773 return -ENOMEM; 857 774 } 775 + dmz->dev = kcalloc(argc, sizeof(struct dmz_dev), GFP_KERNEL); 776 + if (!dmz->dev) { 777 + ti->error = "Unable to allocate the zoned device descriptors"; 778 + kfree(dmz); 779 + return -ENOMEM; 780 + } 781 + dmz->ddev = kcalloc(argc, sizeof(struct dm_dev *), GFP_KERNEL); 782 + if (!dmz->ddev) { 783 + ti->error = "Unable to allocate the dm device descriptors"; 784 + ret = -ENOMEM; 785 + goto err; 786 + } 787 + dmz->nr_ddevs = argc; 788 + 858 789 ti->private = dmz; 859 790 860 791 /* Get the target zoned block device */ 861 - ret = dmz_get_zoned_device(ti, argv[0]); 862 - if (ret) { 863 - dmz->ddev = NULL; 864 - goto 
err; 792 + for (i = 0; i < argc; i++) { 793 + ret = dmz_get_zoned_device(ti, argv[i], i, argc); 794 + if (ret) 795 + goto err_dev; 865 796 } 797 + ret = dmz_fixup_devices(ti); 798 + if (ret) 799 + goto err_dev; 866 800 867 801 /* Initialize metadata */ 868 - dev = dmz->dev; 869 - ret = dmz_ctr_metadata(dev, &dmz->metadata); 802 + ret = dmz_ctr_metadata(dmz->dev, argc, &dmz->metadata, 803 + dm_table_device_name(ti->table)); 870 804 if (ret) { 871 805 ti->error = "Metadata initialization failed"; 872 806 goto err_dev; 873 807 } 874 808 875 809 /* Set target (no write same support) */ 876 - ti->max_io_len = dev->zone_nr_sectors << 9; 810 + ti->max_io_len = dmz_zone_nr_sectors(dmz->metadata) << 9; 877 811 ti->num_flush_bios = 1; 878 812 ti->num_discard_bios = 1; 879 813 ti->num_write_zeroes_bios = 1; ··· 899 799 ti->discards_supported = true; 900 800 901 801 /* The exposed capacity is the number of chunks that can be mapped */ 902 - ti->len = (sector_t)dmz_nr_chunks(dmz->metadata) << dev->zone_nr_sectors_shift; 802 + ti->len = (sector_t)dmz_nr_chunks(dmz->metadata) << 803 + dmz_zone_nr_sectors_shift(dmz->metadata); 903 804 904 805 /* Zone BIO */ 905 806 ret = bioset_init(&dmz->bio_set, DMZ_MIN_BIOS, 0, 0); ··· 912 811 /* Chunk BIO work */ 913 812 mutex_init(&dmz->chunk_lock); 914 813 INIT_RADIX_TREE(&dmz->chunk_rxtree, GFP_NOIO); 915 - dmz->chunk_wq = alloc_workqueue("dmz_cwq_%s", WQ_MEM_RECLAIM | WQ_UNBOUND, 916 - 0, dev->name); 814 + dmz->chunk_wq = alloc_workqueue("dmz_cwq_%s", 815 + WQ_MEM_RECLAIM | WQ_UNBOUND, 0, 816 + dmz_metadata_label(dmz->metadata)); 917 817 if (!dmz->chunk_wq) { 918 818 ti->error = "Create chunk workqueue failed"; 919 819 ret = -ENOMEM; ··· 926 824 bio_list_init(&dmz->flush_list); 927 825 INIT_DELAYED_WORK(&dmz->flush_work, dmz_flush_work); 928 826 dmz->flush_wq = alloc_ordered_workqueue("dmz_fwq_%s", WQ_MEM_RECLAIM, 929 - dev->name); 827 + dmz_metadata_label(dmz->metadata)); 930 828 if (!dmz->flush_wq) { 931 829 ti->error = "Create flush workqueue failed"; 932 830 ret = -ENOMEM; ··· 935 833 mod_delayed_work(dmz->flush_wq, &dmz->flush_work, DMZ_FLUSH_PERIOD); 936 834 937 835 /* Initialize reclaim */ 938 - ret = dmz_ctr_reclaim(dev, dmz->metadata, &dmz->reclaim); 939 - if (ret) { 940 - ti->error = "Zone reclaim initialization failed"; 941 - goto err_fwq; 836 + for (i = 0; i < dmz->nr_ddevs; i++) { 837 + ret = dmz_ctr_reclaim(dmz->metadata, &dmz->dev[i].reclaim, i); 838 + if (ret) { 839 + ti->error = "Zone reclaim initialization failed"; 840 + goto err_fwq; 841 + } 942 842 } 943 843 944 - dmz_dev_info(dev, "Target device: %llu 512-byte logical sectors (%llu blocks)", 945 - (unsigned long long)ti->len, 946 - (unsigned long long)dmz_sect2blk(ti->len)); 844 + DMINFO("(%s): Target device: %llu 512-byte logical sectors (%llu blocks)", 845 + dmz_metadata_label(dmz->metadata), 846 + (unsigned long long)ti->len, 847 + (unsigned long long)dmz_sect2blk(ti->len)); 947 848 948 849 return 0; 949 850 err_fwq: ··· 961 856 err_dev: 962 857 dmz_put_zoned_device(ti); 963 858 err: 859 + kfree(dmz->dev); 964 860 kfree(dmz); 965 861 966 862 return ret; ··· 973 867 static void dmz_dtr(struct dm_target *ti) 974 868 { 975 869 struct dmz_target *dmz = ti->private; 870 + int i; 976 871 977 872 flush_workqueue(dmz->chunk_wq); 978 873 destroy_workqueue(dmz->chunk_wq); 979 874 980 - dmz_dtr_reclaim(dmz->reclaim); 875 + for (i = 0; i < dmz->nr_ddevs; i++) 876 + dmz_dtr_reclaim(dmz->dev[i].reclaim); 981 877 982 878 cancel_delayed_work_sync(&dmz->flush_work); 983 879 destroy_workqueue(dmz->flush_wq); 
··· 994 886 995 887 mutex_destroy(&dmz->chunk_lock); 996 888 889 + kfree(dmz->dev); 997 890 kfree(dmz); 998 891 } 999 892 ··· 1004 895 static void dmz_io_hints(struct dm_target *ti, struct queue_limits *limits) 1005 896 { 1006 897 struct dmz_target *dmz = ti->private; 1007 - unsigned int chunk_sectors = dmz->dev->zone_nr_sectors; 898 + unsigned int chunk_sectors = dmz_zone_nr_sectors(dmz->metadata); 1008 899 1009 900 limits->logical_block_size = DMZ_BLOCK_SIZE; 1010 901 limits->physical_block_size = DMZ_BLOCK_SIZE; ··· 1032 923 static int dmz_prepare_ioctl(struct dm_target *ti, struct block_device **bdev) 1033 924 { 1034 925 struct dmz_target *dmz = ti->private; 926 + struct dmz_dev *dev = &dmz->dev[0]; 1035 927 1036 - if (!dmz_check_bdev(dmz->dev)) 928 + if (!dmz_check_bdev(dev)) 1037 929 return -EIO; 1038 930 1039 - *bdev = dmz->dev->bdev; 931 + *bdev = dev->bdev; 1040 932 1041 933 return 0; 1042 934 } ··· 1048 938 static void dmz_suspend(struct dm_target *ti) 1049 939 { 1050 940 struct dmz_target *dmz = ti->private; 941 + int i; 1051 942 1052 943 flush_workqueue(dmz->chunk_wq); 1053 - dmz_suspend_reclaim(dmz->reclaim); 944 + for (i = 0; i < dmz->nr_ddevs; i++) 945 + dmz_suspend_reclaim(dmz->dev[i].reclaim); 1054 946 cancel_delayed_work_sync(&dmz->flush_work); 1055 947 } 1056 948 ··· 1062 950 static void dmz_resume(struct dm_target *ti) 1063 951 { 1064 952 struct dmz_target *dmz = ti->private; 953 + int i; 1065 954 1066 955 queue_delayed_work(dmz->flush_wq, &dmz->flush_work, DMZ_FLUSH_PERIOD); 1067 - dmz_resume_reclaim(dmz->reclaim); 956 + for (i = 0; i < dmz->nr_ddevs; i++) 957 + dmz_resume_reclaim(dmz->dev[i].reclaim); 1068 958 } 1069 959 1070 960 static int dmz_iterate_devices(struct dm_target *ti, 1071 961 iterate_devices_callout_fn fn, void *data) 1072 962 { 1073 963 struct dmz_target *dmz = ti->private; 1074 - struct dmz_dev *dev = dmz->dev; 1075 - sector_t capacity = dev->capacity & ~(dev->zone_nr_sectors - 1); 964 + unsigned int zone_nr_sectors = dmz_zone_nr_sectors(dmz->metadata); 965 + sector_t capacity; 966 + int i, r; 1076 967 1077 - return fn(ti, dmz->ddev, 0, capacity, data); 968 + for (i = 0; i < dmz->nr_ddevs; i++) { 969 + capacity = dmz->dev[i].capacity & ~(zone_nr_sectors - 1); 970 + r = fn(ti, dmz->ddev[i], 0, capacity, data); 971 + if (r) 972 + break; 973 + } 974 + return r; 975 + } 976 + 977 + static void dmz_status(struct dm_target *ti, status_type_t type, 978 + unsigned int status_flags, char *result, 979 + unsigned int maxlen) 980 + { 981 + struct dmz_target *dmz = ti->private; 982 + ssize_t sz = 0; 983 + char buf[BDEVNAME_SIZE]; 984 + struct dmz_dev *dev; 985 + int i; 986 + 987 + switch (type) { 988 + case STATUSTYPE_INFO: 989 + DMEMIT("%u zones %u/%u cache", 990 + dmz_nr_zones(dmz->metadata), 991 + dmz_nr_unmap_cache_zones(dmz->metadata), 992 + dmz_nr_cache_zones(dmz->metadata)); 993 + for (i = 0; i < dmz->nr_ddevs; i++) { 994 + /* 995 + * For a multi-device setup the first device 996 + * contains only cache zones. 
997 + */ 998 + if ((i == 0) && 999 + (dmz_nr_cache_zones(dmz->metadata) > 0)) 1000 + continue; 1001 + DMEMIT(" %u/%u random %u/%u sequential", 1002 + dmz_nr_unmap_rnd_zones(dmz->metadata, i), 1003 + dmz_nr_rnd_zones(dmz->metadata, i), 1004 + dmz_nr_unmap_seq_zones(dmz->metadata, i), 1005 + dmz_nr_seq_zones(dmz->metadata, i)); 1006 + } 1007 + break; 1008 + case STATUSTYPE_TABLE: 1009 + dev = &dmz->dev[0]; 1010 + format_dev_t(buf, dev->bdev->bd_dev); 1011 + DMEMIT("%s", buf); 1012 + for (i = 1; i < dmz->nr_ddevs; i++) { 1013 + dev = &dmz->dev[i]; 1014 + format_dev_t(buf, dev->bdev->bd_dev); 1015 + DMEMIT(" %s", buf); 1016 + } 1017 + break; 1018 + } 1019 + return; 1020 + } 1021 + 1022 + static int dmz_message(struct dm_target *ti, unsigned int argc, char **argv, 1023 + char *result, unsigned int maxlen) 1024 + { 1025 + struct dmz_target *dmz = ti->private; 1026 + int r = -EINVAL; 1027 + 1028 + if (!strcasecmp(argv[0], "reclaim")) { 1029 + int i; 1030 + 1031 + for (i = 0; i < dmz->nr_ddevs; i++) 1032 + dmz_schedule_reclaim(dmz->dev[i].reclaim); 1033 + r = 0; 1034 + } else 1035 + DMERR("unrecognized message %s", argv[0]); 1036 + return r; 1078 1037 } 1079 1038 1080 1039 static struct target_type dmz_type = { 1081 1040 .name = "zoned", 1082 - .version = {1, 1, 0}, 1041 + .version = {2, 0, 0}, 1083 1042 .features = DM_TARGET_SINGLETON | DM_TARGET_ZONED_HM, 1084 1043 .module = THIS_MODULE, 1085 1044 .ctr = dmz_ctr, ··· 1161 978 .postsuspend = dmz_suspend, 1162 979 .resume = dmz_resume, 1163 980 .iterate_devices = dmz_iterate_devices, 981 + .status = dmz_status, 982 + .message = dmz_message, 1164 983 }; 1165 984 1166 985 static int __init dmz_init(void)
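With multi-device support the target's STATUSTYPE_INFO output grows from a single line of counters to a cache total plus per-device random/sequential counters, skipping the first device when it only carries cache zones. A userspace sketch of how that line is assembled; the numbers are made up, only the format shape follows dmz_status() above.

```c
#include <stdio.h>

struct dev_stat {
	unsigned int unmap_rnd, nr_rnd, unmap_seq, nr_seq;
};

int main(void)
{
	struct dev_stat dev[2] = { { 0, 0, 0, 0 }, { 10, 64, 200, 448 } };
	unsigned int nr_zones = 576, unmap_cache = 50, nr_cache = 64;
	char buf[256];
	int sz = 0, i;

	sz += snprintf(buf + sz, sizeof(buf) - sz, "%u zones %u/%u cache",
		       nr_zones, unmap_cache, nr_cache);
	for (i = 0; i < 2; i++) {
		/* the first device carries only cache zones: skip it */
		if (i == 0 && nr_cache)
			continue;
		sz += snprintf(buf + sz, sizeof(buf) - sz,
			       " %u/%u random %u/%u sequential",
			       dev[i].unmap_rnd, dev[i].nr_rnd,
			       dev[i].unmap_seq, dev[i].nr_seq);
	}
	printf("%s\n", buf);
	return 0;
}
```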
+79 -36
drivers/md/dm-zoned.h
···
#define dmz_bio_block(bio)	dmz_sect2blk((bio)->bi_iter.bi_sector)
#define dmz_bio_blocks(bio)	dmz_sect2blk(bio_sectors(bio))

+struct dmz_metadata;
+struct dmz_reclaim;
+
/*
 * Zoned block device information.
 */
struct dmz_dev {
        struct block_device *bdev;
+        struct dmz_metadata *metadata;
+        struct dmz_reclaim *reclaim;

        char name[BDEVNAME_SIZE];
+        uuid_t uuid;

        sector_t capacity;

+        unsigned int dev_idx;
+
        unsigned int nr_zones;
+        unsigned int zone_offset;

        unsigned int flags;

        sector_t zone_nr_sectors;
-        unsigned int zone_nr_sectors_shift;

-        sector_t zone_nr_blocks;
-        sector_t zone_nr_blocks_shift;
+        unsigned int nr_rnd;
+        atomic_t unmap_nr_rnd;
+        struct list_head unmap_rnd_list;
+        struct list_head map_rnd_list;
+
+        unsigned int nr_seq;
+        atomic_t unmap_nr_seq;
+        struct list_head unmap_seq_list;
+        struct list_head map_seq_list;
};

-#define dmz_bio_chunk(dev, bio)	((bio)->bi_iter.bi_sector >> \
-				 (dev)->zone_nr_sectors_shift)
-#define dmz_chunk_block(dev, b)	((b) & ((dev)->zone_nr_blocks - 1))
+#define dmz_bio_chunk(zmd, bio)	((bio)->bi_iter.bi_sector >> \
+				 dmz_zone_nr_sectors_shift(zmd))
+#define dmz_chunk_block(zmd, b)	((b) & (dmz_zone_nr_blocks(zmd) - 1))

/* Device flags. */
#define DMZ_BDEV_DYING		(1 << 0)
#define DMZ_CHECK_BDEV		(2 << 0)
+#define DMZ_BDEV_REGULAR	(4 << 0)

/*
 * Zone descriptor.
···
        /* For listing the zone depending on its state */
        struct list_head link;

+        /* Device containing this zone */
+        struct dmz_dev *dev;
+
        /* Zone type and state */
        unsigned long flags;

        /* Zone activation reference count */
        atomic_t refcount;
+
+        /* Zone id */
+        unsigned int id;

        /* Zone write pointer block (relative to the zone start block) */
        unsigned int wp_block;
···
 */
enum {
        /* Zone write type */
+        DMZ_CACHE,
        DMZ_RND,
        DMZ_SEQ,
···
        DMZ_META,
        DMZ_DATA,
        DMZ_BUF,
+        DMZ_RESERVED,

        /* Zone internal state */
        DMZ_RECLAIM,
        DMZ_SEQ_WRITE_ERR,
+        DMZ_RECLAIM_TERMINATE,
};

/*
 * Zone data accessors.
 */
+#define dmz_is_cache(z)		test_bit(DMZ_CACHE, &(z)->flags)
#define dmz_is_rnd(z)		test_bit(DMZ_RND, &(z)->flags)
#define dmz_is_seq(z)		test_bit(DMZ_SEQ, &(z)->flags)
#define dmz_is_empty(z)		((z)->wp_block == 0)
#define dmz_is_offline(z)	test_bit(DMZ_OFFLINE, &(z)->flags)
#define dmz_is_readonly(z)	test_bit(DMZ_READ_ONLY, &(z)->flags)
#define dmz_in_reclaim(z)	test_bit(DMZ_RECLAIM, &(z)->flags)
+#define dmz_is_reserved(z)	test_bit(DMZ_RESERVED, &(z)->flags)
#define dmz_seq_write_err(z)	test_bit(DMZ_SEQ_WRITE_ERR, &(z)->flags)
+#define dmz_reclaim_should_terminate(z) \
+	test_bit(DMZ_RECLAIM_TERMINATE, &(z)->flags)

#define dmz_is_meta(z)		test_bit(DMZ_META, &(z)->flags)
#define dmz_is_buf(z)		test_bit(DMZ_BUF, &(z)->flags)
···
#define dmz_dev_debug(dev, format, args...)	\
        DMDEBUG("(%s): " format, (dev)->name, ## args)

-struct dmz_metadata;
-struct dmz_reclaim;
-
/*
 * Functions defined in dm-zoned-metadata.c
 */
-int dmz_ctr_metadata(struct dmz_dev *dev, struct dmz_metadata **zmd);
+int dmz_ctr_metadata(struct dmz_dev *dev, int num_dev,
+                     struct dmz_metadata **zmd, const char *devname);
void dmz_dtr_metadata(struct dmz_metadata *zmd);
int dmz_resume_metadata(struct dmz_metadata *zmd);

···
void dmz_lock_flush(struct dmz_metadata *zmd);
void dmz_unlock_flush(struct dmz_metadata *zmd);
int dmz_flush_metadata(struct dmz_metadata *zmd);
+const char *dmz_metadata_label(struct dmz_metadata *zmd);

-unsigned int dmz_id(struct dmz_metadata *zmd, struct dm_zone *zone);
sector_t dmz_start_sect(struct dmz_metadata *zmd, struct dm_zone *zone);
sector_t dmz_start_block(struct dmz_metadata *zmd, struct dm_zone *zone);
unsigned int dmz_nr_chunks(struct dmz_metadata *zmd);

-#define DMZ_ALLOC_RND		0x01
-#define DMZ_ALLOC_RECLAIM	0x02
+bool dmz_check_dev(struct dmz_metadata *zmd);
+bool dmz_dev_is_dying(struct dmz_metadata *zmd);

-struct dm_zone *dmz_alloc_zone(struct dmz_metadata *zmd, unsigned long flags);
+#define DMZ_ALLOC_RND		0x01
+#define DMZ_ALLOC_CACHE		0x02
+#define DMZ_ALLOC_SEQ		0x04
+#define DMZ_ALLOC_RECLAIM	0x10
+
+struct dm_zone *dmz_alloc_zone(struct dmz_metadata *zmd,
+                               unsigned int dev_idx, unsigned long flags);
void dmz_free_zone(struct dmz_metadata *zmd, struct dm_zone *zone);

void dmz_map_zone(struct dmz_metadata *zmd, struct dm_zone *zone,
                  unsigned int chunk);
void dmz_unmap_zone(struct dmz_metadata *zmd, struct dm_zone *zone);
-unsigned int dmz_nr_rnd_zones(struct dmz_metadata *zmd);
-unsigned int dmz_nr_unmap_rnd_zones(struct dmz_metadata *zmd);
+unsigned int dmz_nr_zones(struct dmz_metadata *zmd);
+unsigned int dmz_nr_cache_zones(struct dmz_metadata *zmd);
+unsigned int dmz_nr_unmap_cache_zones(struct dmz_metadata *zmd);
+unsigned int dmz_nr_rnd_zones(struct dmz_metadata *zmd, int idx);
+unsigned int dmz_nr_unmap_rnd_zones(struct dmz_metadata *zmd, int idx);
+unsigned int dmz_nr_seq_zones(struct dmz_metadata *zmd, int idx);
+unsigned int dmz_nr_unmap_seq_zones(struct dmz_metadata *zmd, int idx);
+unsigned int dmz_zone_nr_blocks(struct dmz_metadata *zmd);
+unsigned int dmz_zone_nr_blocks_shift(struct dmz_metadata *zmd);
+unsigned int dmz_zone_nr_sectors(struct dmz_metadata *zmd);
+unsigned int dmz_zone_nr_sectors_shift(struct dmz_metadata *zmd);

/*
 * Activate a zone (increment its reference count).
···
        atomic_inc(&zone->refcount);
}

-/*
- * Deactivate a zone. This decrement the zone reference counter
- * indicating that all BIOs to the zone have completed when the count is 0.
- */
-static inline void dmz_deactivate_zone(struct dm_zone *zone)
-{
-        atomic_dec(&zone->refcount);
-}
-
-/*
- * Test if a zone is active, that is, has a refcount > 0.
- */
-static inline bool dmz_is_active(struct dm_zone *zone)
-{
-        return atomic_read(&zone->refcount);
-}
-
int dmz_lock_zone_reclaim(struct dm_zone *zone);
void dmz_unlock_zone_reclaim(struct dm_zone *zone);
-struct dm_zone *dmz_get_zone_for_reclaim(struct dmz_metadata *zmd);
+struct dm_zone *dmz_get_zone_for_reclaim(struct dmz_metadata *zmd,
+                                         unsigned int dev_idx, bool idle);

struct dm_zone *dmz_get_chunk_mapping(struct dmz_metadata *zmd,
                                      unsigned int chunk, int op);
···
/*
 * Functions defined in dm-zoned-reclaim.c
 */
-int dmz_ctr_reclaim(struct dmz_dev *dev, struct dmz_metadata *zmd,
-                    struct dmz_reclaim **zrc);
+int dmz_ctr_reclaim(struct dmz_metadata *zmd, struct dmz_reclaim **zrc, int idx);
void dmz_dtr_reclaim(struct dmz_reclaim *zrc);
void dmz_suspend_reclaim(struct dmz_reclaim *zrc);
void dmz_resume_reclaim(struct dmz_reclaim *zrc);
···
 */
bool dmz_bdev_is_dying(struct dmz_dev *dmz_dev);
bool dmz_check_bdev(struct dmz_dev *dmz_dev);
+
+/*
+ * Deactivate a zone. This decrement the zone reference counter
+ * indicating that all BIOs to the zone have completed when the count is 0.
+ */
+static inline void dmz_deactivate_zone(struct dm_zone *zone)
+{
+        dmz_reclaim_bio_acc(zone->dev->reclaim);
+        atomic_dec(&zone->refcount);
+}
+
+/*
+ * Test if a zone is active, that is, has a refcount > 0.
+ */
+static inline bool dmz_is_active(struct dm_zone *zone)
+{
+        return atomic_read(&zone->refcount);
+}

#endif /* DM_ZONED_H */
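The chunk and block arithmetic behind the reworked dmz_bio_chunk()/dmz_chunk_block() macros is easy to check in isolation. The sketch below is plain user-space C, not kernel code, and the 256 MiB zone size is only an assumed example; in the target the shift and block count come from dmz_zone_nr_sectors_shift() and dmz_zone_nr_blocks().

/* User-space sketch of the dmz_bio_chunk()/dmz_chunk_block() arithmetic.
 * Assumes a 256 MiB zone (2^19 sectors of 512 B, 2^16 blocks of 4 KiB). */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
        const unsigned int zone_nr_sectors_shift = 19;  /* 2^19 sectors = 256 MiB */
        const uint64_t zone_nr_blocks = 1ULL << 16;     /* 65536 4 KiB blocks */

        uint64_t bi_sector = 1572864 + 24;  /* example BIO start sector */
        uint64_t block = bi_sector >> 3;    /* 512 B sectors -> 4 KiB blocks */

        /* chunk index: which zone-sized chunk of the logical device */
        uint64_t chunk = bi_sector >> zone_nr_sectors_shift;
        /* block offset within that chunk (zone_nr_blocks is a power of two) */
        uint64_t chunk_block = block & (zone_nr_blocks - 1);

        printf("sector %llu -> chunk %llu, block %llu within the chunk\n",
               (unsigned long long)bi_sector,
               (unsigned long long)chunk,
               (unsigned long long)chunk_block);
        return 0;
}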
drivers/md/dm.c (+10 -1)
···
        return md_in_flight_bios(md);
}

+u64 dm_start_time_ns_from_clone(struct bio *bio)
+{
+        struct dm_target_io *tio = container_of(bio, struct dm_target_io, clone);
+        struct dm_io *io = tio->io;
+
+        return jiffies_to_nsecs(io->start_time);
+}
+EXPORT_SYMBOL_GPL(dm_start_time_ns_from_clone);
+
static void start_io_acct(struct dm_io *io)
{
        struct mapped_device *md = io->md;
···
        if (noflush)
                set_bit(DMF_NOFLUSH_SUSPENDING, &md->flags);
        else
-                pr_debug("%s: suspending with flush\n", dm_device_name(md));
+                DMDEBUG("%s: suspending with flush", dm_device_name(md));

        /*
         * This gets reverted if there's an error later and the targets
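dm_start_time_ns_from_clone() gives targets and path selectors a way to read the start time that DM core recorded for the original I/O from one of its clone bios. Below is a minimal sketch of how an end_io hook might turn that into a per-I/O latency; example_end_io and the surrounding target are hypothetical, and only dm_start_time_ns_from_clone() and jiffies_to_nsecs() are existing interfaces.

/* Hypothetical end_io hook: measure how long the original I/O took.
 * Uses the same jiffies-based clock that dm_start_time_ns_from_clone()
 * converts from, so the two values are directly comparable. */
#include <linux/device-mapper.h>
#include <linux/jiffies.h>

static int example_end_io(struct dm_target *ti, struct bio *clone,
                          blk_status_t *error)
{
        u64 start_ns = dm_start_time_ns_from_clone(clone);
        u64 now_ns = jiffies_to_nsecs(jiffies);
        u64 latency_ns = now_ns > start_ns ? now_ns - start_ns : 0;

        /* feed latency_ns into whatever statistics the target keeps */
        (void)latency_ns;
        return DM_ENDIO_DONE;
}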
drivers/md/persistent-data/dm-btree-internal.h (+2 -2)
···

struct btree_node {
        struct node_header header;
-        __le64 keys[0];
+        __le64 keys[];
} __packed;


···
};

void init_ro_spine(struct ro_spine *s, struct dm_btree_info *info);
-int exit_ro_spine(struct ro_spine *s);
+void exit_ro_spine(struct ro_spine *s);
int ro_step(struct ro_spine *s, dm_block_t new_child);
void ro_pop(struct ro_spine *s);
struct btree_node *ro_node(struct ro_spine *s);
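The keys[0] to keys[] change switches btree_node from a GCC zero-length array to a C99 flexible array member; the layout and allocation pattern stay the same, but the compiler can now diagnose misuse of the trailing array. The toy structure below (user-space C, names invented for the example, nothing here is dm-btree's real layout) shows how such a struct is sized and allocated.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* Toy struct with a flexible array member, mirroring the keys[] idiom. */
struct toy_node {
        uint32_t nr_entries;
        uint64_t keys[];        /* flexible array member: contributes no size */
};

int main(void)
{
        uint32_t n = 64;        /* arbitrary number of keys for the example */
        struct toy_node *node;

        /* allocate the header plus space for n keys in one block */
        node = malloc(sizeof(*node) + n * sizeof(node->keys[0]));
        if (!node)
                return 1;
        node->nr_entries = n;
        node->keys[0] = 42;

        printf("header is %zu bytes, whole node is %zu bytes\n",
               sizeof(*node), sizeof(*node) + n * sizeof(node->keys[0]));
        free(node);
        return 0;
}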
drivers/md/persistent-data/dm-btree-spine.c (+2 -4)
···
        s->nodes[1] = NULL;
}

-int exit_ro_spine(struct ro_spine *s)
+void exit_ro_spine(struct ro_spine *s)
{
-        int r = 0, i;
+        int i;

        for (i = 0; i < s->count; i++) {
                unlock_block(s->info, s->nodes[i]);
        }
-
-        return r;
}

int ro_step(struct ro_spine *s, dm_block_t new_child)
include/linux/device-mapper.h (+3 -6)
···
struct bio *dm_bio_from_per_bio_data(void *data, size_t data_size);
unsigned dm_bio_get_target_bio_nr(const struct bio *bio);

+u64 dm_start_time_ns_from_clone(struct bio *bio);
+
int dm_register_target(struct target_type *t);
void dm_unregister_target(struct target_type *t);

···
#define DMINFO(fmt, ...) pr_info(DM_FMT(fmt), ##__VA_ARGS__)
#define DMINFO_LIMIT(fmt, ...) pr_info_ratelimited(DM_FMT(fmt), ##__VA_ARGS__)

-#ifdef CONFIG_DM_DEBUG
-#define DMDEBUG(fmt, ...) printk(KERN_DEBUG DM_FMT(fmt), ##__VA_ARGS__)
+#define DMDEBUG(fmt, ...) pr_debug(DM_FMT(fmt), ##__VA_ARGS__)
#define DMDEBUG_LIMIT(fmt, ...) pr_debug_ratelimited(DM_FMT(fmt), ##__VA_ARGS__)
-#else
-#define DMDEBUG(fmt, ...) no_printk(fmt, ##__VA_ARGS__)
-#define DMDEBUG_LIMIT(fmt, ...) no_printk(fmt, ##__VA_ARGS__)
-#endif

#define DMEMIT(x...) sz += ((sz >= maxlen) ? \
                          0 : scnprintf(result + sz, maxlen - sz, x))
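With DMDEBUG defined on top of pr_debug(), DM debug messages no longer need CONFIG_DM_DEBUG at build time; on kernels with CONFIG_DYNAMIC_DEBUG they can be switched on per file, module or call site at runtime (typically through /sys/kernel/debug/dynamic_debug/control). A small hypothetical example of leaving such a message in a target's map path:

/* Illustrative only: this helper and its caller are not part of DM.
 * DMDEBUG() now expands to pr_debug(), so the message is compiled out,
 * or kept but disabled and runtime-switchable with dynamic debug. */
#include <linux/bio.h>
#include <linux/device-mapper.h>

static void example_trace_map(struct dm_target *ti, struct bio *bio)
{
        DMDEBUG("%s: mapping bio at sector %llu",
                dm_device_name(dm_table_get_md(ti->table)),
                (unsigned long long)bio->bi_iter.bi_sector);
}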
include/linux/dm-bufio.h (+12)
···
int dm_bufio_issue_flush(struct dm_bufio_client *c);

/*
+ * Send a discard request to the underlying device.
+ */
+int dm_bufio_issue_discard(struct dm_bufio_client *c, sector_t block, sector_t count);
+
+/*
 * Like dm_bufio_release but also move the buffer to the new
 * block. dm_bufio_write_dirty_buffers is needed to commit the new block.
 */
···
 * does nothing.
 */
void dm_bufio_forget(struct dm_bufio_client *c, sector_t block);
+
+/*
+ * Free the given range of buffers.
+ * This is just a hint, if the buffer is in use or dirty, this function
+ * does nothing.
+ */
+void dm_bufio_forget_buffers(struct dm_bufio_client *c, sector_t block, sector_t n_blocks);

/*
 * Set the minimum number of buffers before cleanup happens.
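The two new dm-bufio calls are what a client such as dm-ebs needs in order to pass discards down: punch the discard through to the underlying device and drop any buffers still cached for that range so stale data is not written back later. A hedged sketch of how a client might combine them follows; the helper below and its name are made up, and only the two dm-bufio functions are the real interface.

/* Hypothetical bufio client helper: discard a block range and drop the
 * buffers the client still caches for it. */
#include <linux/dm-bufio.h>

static int example_discard_range(struct dm_bufio_client *c,
                                 sector_t block, sector_t n_blocks)
{
        int r;

        /* issue the discard to the underlying device */
        r = dm_bufio_issue_discard(c, block, n_blocks);

        /* then forget cached buffers for the same range; per the comment
         * above this is only a hint and skips in-use or dirty buffers */
        dm_bufio_forget_buffers(c, block, n_blocks);

        return r;
}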