Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'for-4.13/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

- Add the ability to use select() or poll() on /dev/mapper/control to
wait for events from multiple DM devices.

- Convert DM's printk macros over to using pr_<level> macros.

- Add a big-endian variant of plain64 IV to dm-crypt.

- Add support for zoned (aka SMR) devices to DM core. DM kcopyd was
also improved to provide a sequential write feature needed by zoned
devices.

- Introduce a DM zoned target that provides support for host-managed
zoned devices; the resulting dm-zoned device acts as a drive-managed
interface to the underlying host-managed device.

- A DM raid fix to avoid using BUG() for error handling.

* tag 'for-4.13/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm zoned: fix overflow when converting zone ID to sectors
dm raid: stop using BUG() in __rdev_sectors()
dm zoned: drive-managed zoned block device target
dm kcopyd: add sequential write feature
dm linear: add support for zoned block devices
dm flakey: add support for zoned block devices
dm: introduce dm_remap_zone_report()
dm: fix REQ_OP_ZONE_REPORT bio handling
dm: fix REQ_OP_ZONE_RESET bio handling
dm table: add zoned block devices validation
dm: convert DM printk macros to pr_<level> macros
dm crypt: add big-endian variant of plain64 IV
dm bio prison: use rb_entry() rather than container_of()
dm ioctl: report event number in DM_LIST_DEVICES
dm ioctl: add a new DM_DEV_ARM_POLL ioctl
dm: add basic support for using the select or poll function

+4955 -75
+144
Documentation/device-mapper/dm-zoned.txt
dm-zoned
========

The dm-zoned device mapper target exposes a zoned block device (ZBC and
ZAC compliant devices) as a regular block device without any write
pattern constraints. In effect, it implements a drive-managed zoned
block device which hides from the user (a file system or an application
doing raw block device accesses) the sequential write constraints of
host-managed zoned block devices and can mitigate the potential
device-side performance degradation due to excessive random writes on
host-aware zoned block devices.

For a more detailed description of the zoned block device models and
their constraints see (for SCSI devices):

http://www.t10.org/drafts.htm#ZBC_Family

and (for ATA devices):

http://www.t13.org/Documents/UploadedDocuments/docs2015/di537r05-Zoned_Device_ATA_Command_Set_ZAC.pdf

The dm-zoned implementation is simple and minimizes system overhead (CPU
and memory usage as well as storage capacity loss). For a 10TB
host-managed disk with 256 MB zones, dm-zoned memory usage per disk
instance is at most 4.5 MB and as little as 5 zones will be used
internally for storing metadata and performing reclaim operations.

dm-zoned target devices are formatted and checked using the dmzadm
utility available at:

https://github.com/hgst/dm-zoned-tools

Algorithm
=========

dm-zoned implements an on-disk buffering scheme to handle non-sequential
write accesses to the sequential zones of a zoned block device.
Conventional zones are used for caching as well as for storing internal
metadata.

The zones of the device are separated into 2 types:

1) Metadata zones: these are conventional zones used to store metadata.
Metadata zones are not reported as usable capacity to the user.

2) Data zones: all remaining zones, the vast majority of which will be
sequential zones used exclusively to store user data. The conventional
zones of the device may also be used for buffering user random writes.
Data in these zones may be directly mapped to the conventional zone, but
later moved to a sequential zone so that the conventional zone can be
reused for buffering incoming random writes.

dm-zoned exposes a logical device with a sector size of 4096 bytes,
irrespective of the physical sector size of the backend zoned block
device being used. This allows reducing the amount of metadata needed to
manage valid blocks (blocks written).

The on-disk metadata format is as follows:

1) The first block of the first conventional zone found contains the
super block, which describes the on-disk amount and position of metadata
blocks.

2) Following the super block, a set of blocks is used to describe the
mapping of the logical device blocks. The mapping is done per chunk of
blocks, with the chunk size equal to the device zone size. The mapping
table is indexed by chunk number and each mapping entry indicates the
zone number of the device storing the chunk of data. Each mapping entry
may also indicate the zone number of a conventional zone used to buffer
random modifications to the data zone.

3) A set of blocks used to store bitmaps indicating the validity of
blocks in the data zones follows the mapping table. A valid block is
defined as a block that was written and not discarded. For a buffered
data chunk, a block is always valid only in the data zone mapping the
chunk or in the buffer zone of the chunk.

For a logical chunk mapped to a conventional zone, all write operations
are processed by directly writing to the zone. If the mapping zone is a
sequential zone, the write operation is processed directly only if the
write offset within the logical chunk is equal to the write pointer
offset within the sequential data zone (i.e. the write operation is
aligned on the zone write pointer). Otherwise, write operations are
processed indirectly using a buffer zone: an unused conventional zone is
allocated and assigned to the chunk being accessed. Writing a block to
the buffer zone of a chunk will automatically invalidate the same block
in the sequential zone mapping the chunk. If all blocks of the
sequential zone become invalid, the zone is freed and the chunk buffer
zone becomes the primary zone mapping the chunk, resulting in native
random write performance similar to a regular block device.

Read operations are processed according to the block validity
information provided by the bitmaps. Valid blocks are read either from
the sequential zone mapping a chunk, or, if the chunk is buffered, from
the assigned buffer zone. If the accessed chunk has no mapping, or the
accessed blocks are invalid, the read buffer is zeroed and the read
operation terminated.

After some time, the limited number of conventional zones available may
be exhausted (all used to map chunks or buffer sequential zones) and
unaligned writes to unbuffered chunks become impossible. To avoid this
situation, a reclaim process regularly scans used conventional zones and
tries to reclaim the least recently used zones by copying the valid
blocks of the buffer zone to a free sequential zone. Once the copy
completes, the chunk mapping is updated to point to the sequential zone
and the buffer zone is freed for reuse.

Metadata Protection
===================

To protect metadata against corruption in case of sudden power loss or
system crash, 2 sets of metadata zones are used. One set, the primary
set, is used as the main metadata region, while the secondary set is
used as a staging area. Modified metadata is first written to the
secondary set and validated by updating the super block in the secondary
set; a generation counter is used to indicate that this set contains the
newest metadata. Once this operation completes, in-place metadata block
updates can be done in the primary metadata set. This ensures that one
of the sets is always consistent (all modifications committed or none at
all). Flush operations are used as a commit point. Upon reception of a
flush request, metadata modification activity is temporarily blocked
(for both incoming BIO processing and the reclaim process) and all dirty
metadata blocks are staged and updated. Normal operation is then
resumed. Flushing metadata thus only temporarily delays write and
discard requests. Read requests can be processed concurrently while a
metadata flush is being executed.

Usage
=====

A zoned block device must first be formatted using the dmzadm tool. This
will analyze the device zone configuration, determine where to place the
metadata sets on the device, and initialize the metadata sets.

Ex:

dmzadm --format /dev/sdxx

For a formatted device, the target can be created normally with the
dmsetup utility. The only parameter that dm-zoned requires is the
underlying zoned block device name. Ex:

echo "0 `blockdev --getsize ${dev}` zoned ${dev}" | dmsetup create dmz-`basename ${dev}`
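The buffered write and read paths described above can be sketched as a toy model. This is a hypothetical simplification for illustration only (the `Chunk` class and its fields are invented names; the kernel tracks validity with on-disk bitmaps and real zone write pointers):

```python
# Toy model of dm-zoned chunk buffering (illustrative only).

class Chunk:
    def __init__(self):
        self.seq = {}   # block index -> data in the sequential data zone
        self.buf = {}   # block index -> data in the conventional buffer zone
        self.wp = 0     # write pointer of the sequential zone

    def write(self, block, data):
        if block == self.wp:
            # Aligned with the sequential zone's write pointer:
            # write directly and advance the pointer.
            self.seq[block] = data
            self.buf.pop(block, None)
            self.wp += 1
        else:
            # Unaligned: write to the conventional buffer zone; this
            # invalidates the same block in the sequential zone.
            self.buf[block] = data
            self.seq.pop(block, None)

    def read(self, block):
        # A block is valid in at most one of the two zones; unmapped or
        # invalid blocks read back as zeros.
        if block in self.buf:
            return self.buf[block]
        return self.seq.get(block, b"\0")

c = Chunk()
c.write(0, b"a")           # aligned, goes to the sequential zone
c.write(2, b"c")           # unaligned, goes to the buffer zone
assert c.read(0) == b"a"
assert c.read(2) == b"c"
assert c.read(1) == b"\0"  # never written: zero-filled
c.write(0, b"A")           # rewrite: invalidates block 0 in the seq zone
assert c.read(0) == b"A" and 0 not in c.seq
```

The model omits reclaim: in the real target, once every block of the sequential zone is invalidated, the zone is freed and the buffer zone becomes the chunk's primary mapping.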
+17
drivers/md/Kconfig
··· 
521 521 	  To compile this code as a module, choose M here: the module will
522 522 	  be called dm-integrity.
523 523 
524 + config DM_ZONED
525 + 	tristate "Drive-managed zoned block device target support"
526 + 	depends on BLK_DEV_DM
527 + 	depends on BLK_DEV_ZONED
528 + 	---help---
529 + 	  This device-mapper target takes a host-managed or host-aware zoned
530 + 	  block device and exposes most of its capacity as a regular block
531 + 	  device (drive-managed zoned block device) without any write
532 + 	  constraints. This is mainly intended for use with file systems that
533 + 	  do not natively support zoned block devices but still want to
534 + 	  benefit from the increased capacity offered by SMR disks. Other uses
535 + 	  by applications using raw block devices (for example object stores)
536 + 	  are also possible.
537 + 
538 + 	  To compile this code as a module, choose M here: the module will
539 + 	  be called dm-zoned.
540 + 
524 541 	  If unsure, say N.
525 542 
526 543 endif # MD
+2
drivers/md/Makefile
··· 
20 20 dm-verity-y	+= dm-verity-target.o
21 21 md-mod-y	+= md.o bitmap.o
22 22 raid456-y	+= raid5.o raid5-cache.o raid5-ppl.o
23 + dm-zoned-y	+= dm-zoned-target.o dm-zoned-metadata.o dm-zoned-reclaim.o
23 24 
24 25 # Note: link order is important. All raid personalities
25 26 # and must come before md.o, as they each initialise
··· 
61 60 obj-$(CONFIG_DM_ERA)		+= dm-era.o
62 61 obj-$(CONFIG_DM_LOG_WRITES)	+= dm-log-writes.o
63 62 obj-$(CONFIG_DM_INTEGRITY)	+= dm-integrity.o
63 + obj-$(CONFIG_DM_ZONED)		+= dm-zoned.o
64 64 
65 65 ifeq ($(CONFIG_DM_UEVENT),y)
66 66 dm-mod-objs			+= dm-uevent.o
+1 -1
drivers/md/dm-bio-prison-v1.c
··· 
116 116 
117 117 	while (*new) {
118 118 		struct dm_bio_prison_cell *cell =
119 - 			container_of(*new, struct dm_bio_prison_cell, node);
119 + 			rb_entry(*new, struct dm_bio_prison_cell, node);
120 120 
121 121 		r = cmp_keys(key, &cell->key);
122 122 
+1 -1
drivers/md/dm-bio-prison-v2.c
··· 
120 120 
121 121 	while (*new) {
122 122 		struct dm_bio_prison_cell_v2 *cell =
123 - 			container_of(*new, struct dm_bio_prison_cell_v2, node);
123 + 			rb_entry(*new, struct dm_bio_prison_cell_v2, node);
124 124 
125 125 		r = cmp_keys(key, &cell->key);
126 126 
+3
drivers/md/dm-core.h
··· 
147 147 	return !maxlen || strlen(result) + 1 >= maxlen;
148 148 }
149 149 
150 + extern atomic_t dm_global_event_nr;
151 + extern wait_queue_head_t dm_global_eventq;
152 + 
150 153 #endif
+20 -1
drivers/md/dm-crypt.c
··· 246 246 * plain64: the initial vector is the 64-bit little-endian version of the sector 247 247 * number, padded with zeros if necessary. 248 248 * 249 + * plain64be: the initial vector is the 64-bit big-endian version of the sector 250 + * number, padded with zeros if necessary. 251 + * 249 252 * essiv: "encrypted sector|salt initial vector", the sector number is 250 253 * encrypted with the bulk cipher using a salt as key. The salt 251 254 * should be derived from the bulk cipher's key via hashing. ··· 301 298 { 302 299 memset(iv, 0, cc->iv_size); 303 300 *(__le64 *)iv = cpu_to_le64(dmreq->iv_sector); 301 + 302 + return 0; 303 + } 304 + 305 + static int crypt_iv_plain64be_gen(struct crypt_config *cc, u8 *iv, 306 + struct dm_crypt_request *dmreq) 307 + { 308 + memset(iv, 0, cc->iv_size); 309 + /* iv_size is at least of size u64; usually it is 16 bytes */ 310 + *(__be64 *)&iv[cc->iv_size - sizeof(u64)] = cpu_to_be64(dmreq->iv_sector); 304 311 305 312 return 0; 306 313 } ··· 846 833 847 834 static const struct crypt_iv_operations crypt_iv_plain64_ops = { 848 835 .generator = crypt_iv_plain64_gen 836 + }; 837 + 838 + static const struct crypt_iv_operations crypt_iv_plain64be_ops = { 839 + .generator = crypt_iv_plain64be_gen 849 840 }; 850 841 851 842 static const struct crypt_iv_operations crypt_iv_essiv_ops = { ··· 2225 2208 cc->iv_gen_ops = &crypt_iv_plain_ops; 2226 2209 else if (strcmp(ivmode, "plain64") == 0) 2227 2210 cc->iv_gen_ops = &crypt_iv_plain64_ops; 2211 + else if (strcmp(ivmode, "plain64be") == 0) 2212 + cc->iv_gen_ops = &crypt_iv_plain64be_ops; 2228 2213 else if (strcmp(ivmode, "essiv") == 0) 2229 2214 cc->iv_gen_ops = &crypt_iv_essiv_ops; 2230 2215 else if (strcmp(ivmode, "benbi") == 0) ··· 3006 2987 3007 2988 static struct target_type crypt_target = { 3008 2989 .name = "crypt", 3009 - .version = {1, 17, 0}, 2990 + .version = {1, 18, 0}, 3010 2991 .module = THIS_MODULE, 3011 2992 .ctr = crypt_ctr, 3012 2993 .dtr = crypt_dtr,
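The difference between the two IV layouts added above can be shown in userspace. Per the kernel comment, `iv_size` is at least 8 bytes and is usually 16 (assumed here): plain64 packs the sector number little-endian at the start of the IV, while plain64be packs it big-endian at the end, mirroring `*(__be64 *)&iv[cc->iv_size - sizeof(u64)] = cpu_to_be64(...)`:

```python
import struct

IV_SIZE = 16  # assumed; at least sizeof(u64), usually 16 bytes

def plain64(sector, iv_size=IV_SIZE):
    # 64-bit little-endian sector number at the start, zero padded.
    iv = bytearray(iv_size)
    iv[:8] = struct.pack("<Q", sector)
    return bytes(iv)

def plain64be(sector, iv_size=IV_SIZE):
    # 64-bit big-endian sector number at the END of the IV, zero padded.
    iv = bytearray(iv_size)
    iv[iv_size - 8:] = struct.pack(">Q", sector)
    return bytes(iv)

assert plain64(1)   == b"\x01" + b"\x00" * 15
assert plain64be(1) == b"\x00" * 15 + b"\x01"
```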
+20 -3
drivers/md/dm-flakey.c
··· 275 275 struct flakey_c *fc = ti->private; 276 276 277 277 bio->bi_bdev = fc->dev->bdev; 278 - if (bio_sectors(bio)) 278 + if (bio_sectors(bio) || bio_op(bio) == REQ_OP_ZONE_RESET) 279 279 bio->bi_iter.bi_sector = 280 280 flakey_map_sector(ti, bio->bi_iter.bi_sector); 281 281 } ··· 305 305 unsigned elapsed; 306 306 struct per_bio_data *pb = dm_per_bio_data(bio, sizeof(struct per_bio_data)); 307 307 pb->bio_submitted = false; 308 + 309 + /* Do not fail reset zone */ 310 + if (bio_op(bio) == REQ_OP_ZONE_RESET) 311 + goto map_bio; 312 + 313 + /* We need to remap reported zones, so remember the BIO iter */ 314 + if (bio_op(bio) == REQ_OP_ZONE_REPORT) 315 + goto map_bio; 308 316 309 317 /* Are we alive ? */ 310 318 elapsed = (jiffies - fc->start_time) / HZ; ··· 367 359 } 368 360 369 361 static int flakey_end_io(struct dm_target *ti, struct bio *bio, 370 - blk_status_t *error) 362 + blk_status_t *error) 371 363 { 372 364 struct flakey_c *fc = ti->private; 373 365 struct per_bio_data *pb = dm_per_bio_data(bio, sizeof(struct per_bio_data)); 366 + 367 + if (bio_op(bio) == REQ_OP_ZONE_RESET) 368 + return DM_ENDIO_DONE; 369 + 370 + if (bio_op(bio) == REQ_OP_ZONE_REPORT) { 371 + dm_remap_zone_report(ti, bio, fc->start); 372 + return DM_ENDIO_DONE; 373 + } 374 374 375 375 if (!*error && pb->bio_submitted && (bio_data_dir(bio) == READ)) { 376 376 if (fc->corrupt_bio_byte && (fc->corrupt_bio_rw == READ) && ··· 462 446 463 447 static struct target_type flakey_target = { 464 448 .name = "flakey", 465 - .version = {1, 4, 0}, 449 + .version = {1, 5, 0}, 450 + .features = DM_TARGET_ZONED_HM, 466 451 .module = THIS_MODULE, 467 452 .ctr = flakey_ctr, 468 453 .dtr = flakey_dtr,
+87 -22
drivers/md/dm-ioctl.c
··· 23 23 #define DM_MSG_PREFIX "ioctl" 24 24 #define DM_DRIVER_EMAIL "dm-devel@redhat.com" 25 25 26 + struct dm_file { 27 + /* 28 + * poll will wait until the global event number is greater than 29 + * this value. 30 + */ 31 + volatile unsigned global_event_nr; 32 + }; 33 + 26 34 /*----------------------------------------------------------------- 27 35 * The ioctl interface needs to be able to look up devices by 28 36 * name or uuid. ··· 464 456 * All the ioctl commands get dispatched to functions with this 465 457 * prototype. 466 458 */ 467 - typedef int (*ioctl_fn)(struct dm_ioctl *param, size_t param_size); 459 + typedef int (*ioctl_fn)(struct file *filp, struct dm_ioctl *param, size_t param_size); 468 460 469 - static int remove_all(struct dm_ioctl *param, size_t param_size) 461 + static int remove_all(struct file *filp, struct dm_ioctl *param, size_t param_size) 470 462 { 471 463 dm_hash_remove_all(true, !!(param->flags & DM_DEFERRED_REMOVE), false); 472 464 param->data_size = 0; ··· 499 491 return ((void *) param) + param->data_start; 500 492 } 501 493 502 - static int list_devices(struct dm_ioctl *param, size_t param_size) 494 + static int list_devices(struct file *filp, struct dm_ioctl *param, size_t param_size) 503 495 { 504 496 unsigned int i; 505 497 struct hash_cell *hc; 506 498 size_t len, needed = 0; 507 499 struct gendisk *disk; 508 500 struct dm_name_list *nl, *old_nl = NULL; 501 + uint32_t *event_nr; 509 502 510 503 down_write(&_hash_lock); 511 504 ··· 519 510 needed += sizeof(struct dm_name_list); 520 511 needed += strlen(hc->name) + 1; 521 512 needed += ALIGN_MASK; 513 + needed += (sizeof(uint32_t) + ALIGN_MASK) & ~ALIGN_MASK; 522 514 } 523 515 } 524 516 ··· 549 539 strcpy(nl->name, hc->name); 550 540 551 541 old_nl = nl; 552 - nl = align_ptr(((void *) ++nl) + strlen(hc->name) + 1); 542 + event_nr = align_ptr(((void *) (nl + 1)) + strlen(hc->name) + 1); 543 + *event_nr = dm_get_event_nr(hc->md); 544 + nl = align_ptr(event_nr + 1); 553 545 } 554 
546 } 555 547 ··· 594 582 info->vers = align_ptr(((void *) ++info->vers) + strlen(tt->name) + 1); 595 583 } 596 584 597 - static int list_versions(struct dm_ioctl *param, size_t param_size) 585 + static int list_versions(struct file *filp, struct dm_ioctl *param, size_t param_size) 598 586 { 599 587 size_t len, needed = 0; 600 588 struct dm_target_versions *vers; ··· 736 724 } 737 725 } 738 726 739 - static int dev_create(struct dm_ioctl *param, size_t param_size) 727 + static int dev_create(struct file *filp, struct dm_ioctl *param, size_t param_size) 740 728 { 741 729 int r, m = DM_ANY_MINOR; 742 730 struct mapped_device *md; ··· 828 816 return md; 829 817 } 830 818 831 - static int dev_remove(struct dm_ioctl *param, size_t param_size) 819 + static int dev_remove(struct file *filp, struct dm_ioctl *param, size_t param_size) 832 820 { 833 821 struct hash_cell *hc; 834 822 struct mapped_device *md; ··· 893 881 return -EINVAL; 894 882 } 895 883 896 - static int dev_rename(struct dm_ioctl *param, size_t param_size) 884 + static int dev_rename(struct file *filp, struct dm_ioctl *param, size_t param_size) 897 885 { 898 886 int r; 899 887 char *new_data = (char *) param + param->data_start; ··· 923 911 return 0; 924 912 } 925 913 926 - static int dev_set_geometry(struct dm_ioctl *param, size_t param_size) 914 + static int dev_set_geometry(struct file *filp, struct dm_ioctl *param, size_t param_size) 927 915 { 928 916 int r = -EINVAL, x; 929 917 struct mapped_device *md; ··· 1072 1060 * Set or unset the suspension state of a device. 1073 1061 * If the device already is in the requested state we just return its status. 
1074 1062 */ 1075 - static int dev_suspend(struct dm_ioctl *param, size_t param_size) 1063 + static int dev_suspend(struct file *filp, struct dm_ioctl *param, size_t param_size) 1076 1064 { 1077 1065 if (param->flags & DM_SUSPEND_FLAG) 1078 1066 return do_suspend(param); ··· 1084 1072 * Copies device info back to user space, used by 1085 1073 * the create and info ioctls. 1086 1074 */ 1087 - static int dev_status(struct dm_ioctl *param, size_t param_size) 1075 + static int dev_status(struct file *filp, struct dm_ioctl *param, size_t param_size) 1088 1076 { 1089 1077 struct mapped_device *md; 1090 1078 ··· 1175 1163 /* 1176 1164 * Wait for a device to report an event 1177 1165 */ 1178 - static int dev_wait(struct dm_ioctl *param, size_t param_size) 1166 + static int dev_wait(struct file *filp, struct dm_ioctl *param, size_t param_size) 1179 1167 { 1180 1168 int r = 0; 1181 1169 struct mapped_device *md; ··· 1210 1198 dm_put(md); 1211 1199 1212 1200 return r; 1201 + } 1202 + 1203 + /* 1204 + * Remember the global event number and make it possible to poll 1205 + * for further events. 
1206 + */ 1207 + static int dev_arm_poll(struct file *filp, struct dm_ioctl *param, size_t param_size) 1208 + { 1209 + struct dm_file *priv = filp->private_data; 1210 + 1211 + priv->global_event_nr = atomic_read(&dm_global_event_nr); 1212 + 1213 + return 0; 1213 1214 } 1214 1215 1215 1216 static inline fmode_t get_mode(struct dm_ioctl *param) ··· 1294 1269 return false; 1295 1270 } 1296 1271 1297 - static int table_load(struct dm_ioctl *param, size_t param_size) 1272 + static int table_load(struct file *filp, struct dm_ioctl *param, size_t param_size) 1298 1273 { 1299 1274 int r; 1300 1275 struct hash_cell *hc; ··· 1381 1356 return r; 1382 1357 } 1383 1358 1384 - static int table_clear(struct dm_ioctl *param, size_t param_size) 1359 + static int table_clear(struct file *filp, struct dm_ioctl *param, size_t param_size) 1385 1360 { 1386 1361 struct hash_cell *hc; 1387 1362 struct mapped_device *md; ··· 1455 1430 param->data_size = param->data_start + needed; 1456 1431 } 1457 1432 1458 - static int table_deps(struct dm_ioctl *param, size_t param_size) 1433 + static int table_deps(struct file *filp, struct dm_ioctl *param, size_t param_size) 1459 1434 { 1460 1435 struct mapped_device *md; 1461 1436 struct dm_table *table; ··· 1481 1456 * Return the status of a device as a text string for each 1482 1457 * target. 1483 1458 */ 1484 - static int table_status(struct dm_ioctl *param, size_t param_size) 1459 + static int table_status(struct file *filp, struct dm_ioctl *param, size_t param_size) 1485 1460 { 1486 1461 struct mapped_device *md; 1487 1462 struct dm_table *table; ··· 1536 1511 /* 1537 1512 * Pass a message to the target that's at the supplied device offset. 
1538 1513 */ 1539 - static int target_message(struct dm_ioctl *param, size_t param_size) 1514 + static int target_message(struct file *filp, struct dm_ioctl *param, size_t param_size) 1540 1515 { 1541 1516 int r, argc; 1542 1517 char **argv; ··· 1653 1628 {DM_LIST_VERSIONS_CMD, 0, list_versions}, 1654 1629 1655 1630 {DM_TARGET_MSG_CMD, 0, target_message}, 1656 - {DM_DEV_SET_GEOMETRY_CMD, 0, dev_set_geometry} 1631 + {DM_DEV_SET_GEOMETRY_CMD, 0, dev_set_geometry}, 1632 + {DM_DEV_ARM_POLL, IOCTL_FLAGS_NO_PARAMS, dev_arm_poll}, 1657 1633 }; 1658 1634 1659 1635 if (unlikely(cmd >= ARRAY_SIZE(_ioctls))) ··· 1809 1783 return 0; 1810 1784 } 1811 1785 1812 - static int ctl_ioctl(uint command, struct dm_ioctl __user *user) 1786 + static int ctl_ioctl(struct file *file, uint command, struct dm_ioctl __user *user) 1813 1787 { 1814 1788 int r = 0; 1815 1789 int ioctl_flags; ··· 1863 1837 goto out; 1864 1838 1865 1839 param->data_size = offsetof(struct dm_ioctl, data); 1866 - r = fn(param, input_param_size); 1840 + r = fn(file, param, input_param_size); 1867 1841 1868 1842 if (unlikely(param->flags & DM_BUFFER_FULL_FLAG) && 1869 1843 unlikely(ioctl_flags & IOCTL_FLAGS_NO_PARAMS)) ··· 1882 1856 1883 1857 static long dm_ctl_ioctl(struct file *file, uint command, ulong u) 1884 1858 { 1885 - return (long)ctl_ioctl(command, (struct dm_ioctl __user *)u); 1859 + return (long)ctl_ioctl(file, command, (struct dm_ioctl __user *)u); 1886 1860 } 1887 1861 1888 1862 #ifdef CONFIG_COMPAT ··· 1894 1868 #define dm_compat_ctl_ioctl NULL 1895 1869 #endif 1896 1870 1871 + static int dm_open(struct inode *inode, struct file *filp) 1872 + { 1873 + int r; 1874 + struct dm_file *priv; 1875 + 1876 + r = nonseekable_open(inode, filp); 1877 + if (unlikely(r)) 1878 + return r; 1879 + 1880 + priv = filp->private_data = kmalloc(sizeof(struct dm_file), GFP_KERNEL); 1881 + if (!priv) 1882 + return -ENOMEM; 1883 + 1884 + priv->global_event_nr = atomic_read(&dm_global_event_nr); 1885 + 1886 + return 0; 1887 + } 
1888 + 1889 + static int dm_release(struct inode *inode, struct file *filp) 1890 + { 1891 + kfree(filp->private_data); 1892 + return 0; 1893 + } 1894 + 1895 + static unsigned dm_poll(struct file *filp, poll_table *wait) 1896 + { 1897 + struct dm_file *priv = filp->private_data; 1898 + unsigned mask = 0; 1899 + 1900 + poll_wait(filp, &dm_global_eventq, wait); 1901 + 1902 + if ((int)(atomic_read(&dm_global_event_nr) - priv->global_event_nr) > 0) 1903 + mask |= POLLIN; 1904 + 1905 + return mask; 1906 + } 1907 + 1897 1908 static const struct file_operations _ctl_fops = { 1898 - .open = nonseekable_open, 1909 + .open = dm_open, 1910 + .release = dm_release, 1911 + .poll = dm_poll, 1899 1912 .unlocked_ioctl = dm_ctl_ioctl, 1900 1913 .compat_ioctl = dm_compat_ctl_ioctl, 1901 1914 .owner = THIS_MODULE,
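The `(int)(atomic_read(&dm_global_event_nr) - priv->global_event_nr) > 0` test in `dm_poll()` above is written so that it stays correct when the 32-bit event counter wraps around. A sketch of the same comparison in Python (illustrative only, emulating 32-bit signed arithmetic):

```python
def event_pending(current, armed):
    # Emulate the kernel's (int)(current - armed) > 0: subtract modulo
    # 2**32 and reinterpret the result as a signed 32-bit value, so the
    # comparison survives counter wraparound.
    diff = (current - armed) & 0xFFFFFFFF
    if diff >= 0x80000000:
        diff -= 0x100000000
    return diff > 0

assert event_pending(5, 3)                    # events occurred since arming
assert not event_pending(3, 3)                # nothing new
assert event_pending(2, 0xFFFFFFFF)           # counter wrapped past the armed value
assert not event_pending(0xFFFFFFFF, 2)       # armed value is ahead
```

A naive `current > armed` would report no pending events as soon as the counter wrapped, which is exactly what the signed-difference idiom avoids.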
+63 -2
drivers/md/dm-kcopyd.c
··· 356 356 struct mutex lock; 357 357 atomic_t sub_jobs; 358 358 sector_t progress; 359 + sector_t write_offset; 359 360 360 361 struct kcopyd_job *master_job; 361 362 }; ··· 387 386 * Functions to push and pop a job onto the head of a given job 388 387 * list. 389 388 */ 389 + static struct kcopyd_job *pop_io_job(struct list_head *jobs, 390 + struct dm_kcopyd_client *kc) 391 + { 392 + struct kcopyd_job *job; 393 + 394 + /* 395 + * For I/O jobs, pop any read, any write without sequential write 396 + * constraint and sequential writes that are at the right position. 397 + */ 398 + list_for_each_entry(job, jobs, list) { 399 + if (job->rw == READ || !test_bit(DM_KCOPYD_WRITE_SEQ, &job->flags)) { 400 + list_del(&job->list); 401 + return job; 402 + } 403 + 404 + if (job->write_offset == job->master_job->write_offset) { 405 + job->master_job->write_offset += job->source.count; 406 + list_del(&job->list); 407 + return job; 408 + } 409 + } 410 + 411 + return NULL; 412 + } 413 + 390 414 static struct kcopyd_job *pop(struct list_head *jobs, 391 415 struct dm_kcopyd_client *kc) 392 416 { ··· 421 395 spin_lock_irqsave(&kc->job_lock, flags); 422 396 423 397 if (!list_empty(jobs)) { 424 - job = list_entry(jobs->next, struct kcopyd_job, list); 425 - list_del(&job->list); 398 + if (jobs == &kc->io_jobs) 399 + job = pop_io_job(jobs, kc); 400 + else { 401 + job = list_entry(jobs->next, struct kcopyd_job, list); 402 + list_del(&job->list); 403 + } 426 404 } 427 405 spin_unlock_irqrestore(&kc->job_lock, flags); 428 406 ··· 535 505 .notify.context = job, 536 506 .client = job->kc->io_client, 537 507 }; 508 + 509 + /* 510 + * If we need to write sequentially and some reads or writes failed, 511 + * no point in continuing. 
512 + */ 513 + if (test_bit(DM_KCOPYD_WRITE_SEQ, &job->flags) && 514 + job->master_job->write_err) 515 + return -EIO; 538 516 539 517 io_job_start(job->kc->throttle); 540 518 ··· 693 655 int i; 694 656 695 657 *sub_job = *job; 658 + sub_job->write_offset = progress; 696 659 sub_job->source.sector += progress; 697 660 sub_job->source.count = count; 698 661 ··· 762 723 job->num_dests = num_dests; 763 724 memcpy(&job->dests, dests, sizeof(*dests) * num_dests); 764 725 726 + /* 727 + * If one of the destination is a host-managed zoned block device, 728 + * we need to write sequentially. If one of the destination is a 729 + * host-aware device, then leave it to the caller to choose what to do. 730 + */ 731 + if (!test_bit(DM_KCOPYD_WRITE_SEQ, &job->flags)) { 732 + for (i = 0; i < job->num_dests; i++) { 733 + if (bdev_zoned_model(dests[i].bdev) == BLK_ZONED_HM) { 734 + set_bit(DM_KCOPYD_WRITE_SEQ, &job->flags); 735 + break; 736 + } 737 + } 738 + } 739 + 740 + /* 741 + * If we need to write sequentially, errors cannot be ignored. 742 + */ 743 + if (test_bit(DM_KCOPYD_WRITE_SEQ, &job->flags) && 744 + test_bit(DM_KCOPYD_IGNORE_ERROR, &job->flags)) 745 + clear_bit(DM_KCOPYD_IGNORE_ERROR, &job->flags); 746 + 765 747 if (from) { 766 748 job->source = *from; 767 749 job->pages = NULL; ··· 806 746 job->fn = fn; 807 747 job->context = context; 808 748 job->master_job = job; 749 + job->write_offset = 0; 809 750 810 751 if (job->source.count <= SUB_JOB_SIZE) 811 752 dispatch_job(job);
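The `pop_io_job()` selection rule above can be sketched in miniature. This is a hypothetical simplification (plain dicts and a list instead of kernel structs, no spinlock): reads and unconstrained writes pop in any order, while sequential writes pop only when their offset matches the master job's write pointer, which then advances:

```python
# Sketch of kcopyd's pop_io_job selection rule (illustrative only).

def pop_io_job(jobs, master):
    for i, job in enumerate(jobs):
        # Reads and writes without the sequential constraint can be
        # issued in any order.
        if job["rw"] == "read" or not job["write_seq"]:
            return jobs.pop(i)
        # Sequential writes are only issued at the current write offset.
        if job["offset"] == master["write_offset"]:
            master["write_offset"] += job["count"]
            return jobs.pop(i)
    return None  # only out-of-position sequential writes remain

master = {"write_offset": 0}
jobs = [
    {"rw": "write", "write_seq": True, "offset": 8, "count": 8},
    {"rw": "write", "write_seq": True, "offset": 0, "count": 8},
]
first = pop_io_job(jobs, master)
assert first["offset"] == 0    # the in-position write goes first
second = pop_io_job(jobs, master)
assert second["offset"] == 8   # offset 8 now matches the write pointer
assert pop_io_job(jobs, master) is None
```

This ordering is what lets kcopyd honor the write-pointer constraint of host-managed zones even though sub-jobs complete out of order.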
+15 -3
drivers/md/dm-linear.c
··· 89 89 struct linear_c *lc = ti->private; 90 90 91 91 bio->bi_bdev = lc->dev->bdev; 92 - if (bio_sectors(bio)) 92 + if (bio_sectors(bio) || bio_op(bio) == REQ_OP_ZONE_RESET) 93 93 bio->bi_iter.bi_sector = 94 94 linear_map_sector(ti, bio->bi_iter.bi_sector); 95 95 } ··· 99 99 linear_map_bio(ti, bio); 100 100 101 101 return DM_MAPIO_REMAPPED; 102 + } 103 + 104 + static int linear_end_io(struct dm_target *ti, struct bio *bio, 105 + blk_status_t *error) 106 + { 107 + struct linear_c *lc = ti->private; 108 + 109 + if (!*error && bio_op(bio) == REQ_OP_ZONE_REPORT) 110 + dm_remap_zone_report(ti, bio, lc->start); 111 + 112 + return DM_ENDIO_DONE; 102 113 } 103 114 104 115 static void linear_status(struct dm_target *ti, status_type_t type, ··· 172 161 173 162 static struct target_type linear_target = { 174 163 .name = "linear", 175 - .version = {1, 3, 0}, 176 - .features = DM_TARGET_PASSES_INTEGRITY, 164 + .version = {1, 4, 0}, 165 + .features = DM_TARGET_PASSES_INTEGRITY | DM_TARGET_ZONED_HM, 177 166 .module = THIS_MODULE, 178 167 .ctr = linear_ctr, 179 168 .dtr = linear_dtr, 180 169 .map = linear_map, 170 + .end_io = linear_end_io, 181 171 .status = linear_status, 182 172 .prepare_ioctl = linear_prepare_ioctl, 183 173 .iterate_devices = linear_iterate_devices,
+10 -3
drivers/md/dm-raid.c
··· 
1571 1571 		return rdev->sectors;
1572 1572 	}
1573 1573 
1574 - 	BUG(); /* Constructor ensures we got some. */
1574 + 	return 0;
1575 1575 }
1576 1576 
1577 1577 /* Calculate the sectors per device and per array used for @rs */
··· 
2941 2941 	bool resize;
2942 2942 	struct raid_type *rt;
2943 2943 	unsigned int num_raid_params, num_raid_devs;
2944 - 	sector_t calculated_dev_sectors;
2944 + 	sector_t calculated_dev_sectors, rdev_sectors;
2945 2945 	struct raid_set *rs = NULL;
2946 2946 	const char *arg;
2947 2947 	struct rs_layout rs_layout;
··· 
3017 3017 	if (r)
3018 3018 		goto bad;
3019 3019 
3020 - 	resize = calculated_dev_sectors != __rdev_sectors(rs);
3020 + 	rdev_sectors = __rdev_sectors(rs);
3021 + 	if (!rdev_sectors) {
3022 + 		ti->error = "Invalid rdev size";
3023 + 		r = -EINVAL;
3024 + 		goto bad;
3025 + 	}
3026 + 
3027 + 	resize = calculated_dev_sectors != rdev_sectors;
3021 3028 
3022 3029 	INIT_WORK(&rs->md.event_work, do_table_event);
3023 3030 	ti->private = rs;
+162
drivers/md/dm-table.c
··· 319 319 return 1; 320 320 } 321 321 322 + /* 323 + * If the target is mapped to zoned block device(s), check 324 + * that the zones are not partially mapped. 325 + */ 326 + if (bdev_zoned_model(bdev) != BLK_ZONED_NONE) { 327 + unsigned int zone_sectors = bdev_zone_sectors(bdev); 328 + 329 + if (start & (zone_sectors - 1)) { 330 + DMWARN("%s: start=%llu not aligned to h/w zone size %u of %s", 331 + dm_device_name(ti->table->md), 332 + (unsigned long long)start, 333 + zone_sectors, bdevname(bdev, b)); 334 + return 1; 335 + } 336 + 337 + /* 338 + * Note: The last zone of a zoned block device may be smaller 339 + * than other zones. So for a target mapping the end of a 340 + * zoned block device with such a zone, len would not be zone 341 + * aligned. We do not allow such last smaller zone to be part 342 + * of the mapping here to ensure that mappings with multiple 343 + * devices do not end up with a smaller zone in the middle of 344 + * the sector range. 345 + */ 346 + if (len & (zone_sectors - 1)) { 347 + DMWARN("%s: len=%llu not aligned to h/w zone size %u of %s", 348 + dm_device_name(ti->table->md), 349 + (unsigned long long)len, 350 + zone_sectors, bdevname(bdev, b)); 351 + return 1; 352 + } 353 + } 354 + 322 355 if (logical_block_size_sectors <= 1) 323 356 return 0; 324 357 ··· 488 455 q->limits.logical_block_size, 489 456 q->limits.alignment_offset, 490 457 (unsigned long long) start << SECTOR_SHIFT); 458 + 459 + limits->zoned = blk_queue_zoned_model(q); 491 460 492 461 return 0; 493 462 } ··· 1381 1346 return true; 1382 1347 } 1383 1348 1349 + static int device_is_zoned_model(struct dm_target *ti, struct dm_dev *dev, 1350 + sector_t start, sector_t len, void *data) 1351 + { 1352 + struct request_queue *q = bdev_get_queue(dev->bdev); 1353 + enum blk_zoned_model *zoned_model = data; 1354 + 1355 + return q && blk_queue_zoned_model(q) == *zoned_model; 1356 + } 1357 + 1358 + static bool dm_table_supports_zoned_model(struct dm_table *t, 1359 + enum 
blk_zoned_model zoned_model) 1360 + { 1361 + struct dm_target *ti; 1362 + unsigned i; 1363 + 1364 + for (i = 0; i < dm_table_get_num_targets(t); i++) { 1365 + ti = dm_table_get_target(t, i); 1366 + 1367 + if (zoned_model == BLK_ZONED_HM && 1368 + !dm_target_supports_zoned_hm(ti->type)) 1369 + return false; 1370 + 1371 + if (!ti->type->iterate_devices || 1372 + !ti->type->iterate_devices(ti, device_is_zoned_model, &zoned_model)) 1373 + return false; 1374 + } 1375 + 1376 + return true; 1377 + } 1378 + 1379 + static int device_matches_zone_sectors(struct dm_target *ti, struct dm_dev *dev, 1380 + sector_t start, sector_t len, void *data) 1381 + { 1382 + struct request_queue *q = bdev_get_queue(dev->bdev); 1383 + unsigned int *zone_sectors = data; 1384 + 1385 + return q && blk_queue_zone_sectors(q) == *zone_sectors; 1386 + } 1387 + 1388 + static bool dm_table_matches_zone_sectors(struct dm_table *t, 1389 + unsigned int zone_sectors) 1390 + { 1391 + struct dm_target *ti; 1392 + unsigned i; 1393 + 1394 + for (i = 0; i < dm_table_get_num_targets(t); i++) { 1395 + ti = dm_table_get_target(t, i); 1396 + 1397 + if (!ti->type->iterate_devices || 1398 + !ti->type->iterate_devices(ti, device_matches_zone_sectors, &zone_sectors)) 1399 + return false; 1400 + } 1401 + 1402 + return true; 1403 + } 1404 + 1405 + static int validate_hardware_zoned_model(struct dm_table *table, 1406 + enum blk_zoned_model zoned_model, 1407 + unsigned int zone_sectors) 1408 + { 1409 + if (zoned_model == BLK_ZONED_NONE) 1410 + return 0; 1411 + 1412 + if (!dm_table_supports_zoned_model(table, zoned_model)) { 1413 + DMERR("%s: zoned model is not consistent across all devices", 1414 + dm_device_name(table->md)); 1415 + return -EINVAL; 1416 + } 1417 + 1418 + /* Check zone size validity and compatibility */ 1419 + if (!zone_sectors || !is_power_of_2(zone_sectors)) 1420 + return -EINVAL; 1421 + 1422 + if (!dm_table_matches_zone_sectors(table, zone_sectors)) { 1423 + DMERR("%s: zone sectors is not consistent 
across all devices", 1424 + dm_device_name(table->md)); 1425 + return -EINVAL; 1426 + } 1427 + 1428 + return 0; 1429 + } 1430 + 1384 1431 /* 1385 1432 * Establish the new table's queue_limits and validate them. 1386 1433 */ ··· 1472 1355 struct dm_target *ti; 1473 1356 struct queue_limits ti_limits; 1474 1357 unsigned i; 1358 + enum blk_zoned_model zoned_model = BLK_ZONED_NONE; 1359 + unsigned int zone_sectors = 0; 1475 1360 1476 1361 blk_set_stacking_limits(limits); 1477 1362 ··· 1490 1371 */ 1491 1372 ti->type->iterate_devices(ti, dm_set_device_limits, 1492 1373 &ti_limits); 1374 + 1375 + if (zoned_model == BLK_ZONED_NONE && ti_limits.zoned != BLK_ZONED_NONE) { 1376 + /* 1377 + * After stacking all limits, validate all devices 1378 + * in table support this zoned model and zone sectors. 1379 + */ 1380 + zoned_model = ti_limits.zoned; 1381 + zone_sectors = ti_limits.chunk_sectors; 1382 + } 1493 1383 1494 1384 /* Set I/O hints portion of queue limits */ 1495 1385 if (ti->type->io_hints) ··· 1524 1396 dm_device_name(table->md), 1525 1397 (unsigned long long) ti->begin, 1526 1398 (unsigned long long) ti->len); 1399 + 1400 + /* 1401 + * FIXME: this should likely be moved to blk_stack_limits(), would 1402 + * also eliminate limits->zoned stacking hack in dm_set_device_limits() 1403 + */ 1404 + if (limits->zoned == BLK_ZONED_NONE && ti_limits.zoned != BLK_ZONED_NONE) { 1405 + /* 1406 + * By default, the stacked limits zoned model is set to 1407 + * BLK_ZONED_NONE in blk_set_stacking_limits(). Update 1408 + * this model using the first target model reported 1409 + * that is not BLK_ZONED_NONE. This will be either the 1410 + * first target device zoned model or the model reported 1411 + * by the target .io_hints. 1412 + */ 1413 + limits->zoned = ti_limits.zoned; 1414 + } 1527 1415 } 1416 + 1417 + /* 1418 + * Verify that the zoned model and zone sectors, as determined before 1419 + * any .io_hints override, are the same across all devices in the table. 
1420 + * - this is especially relevant if .io_hints is emulating a disk-managed 1421 + * zoned model (aka BLK_ZONED_NONE) on host-managed zoned block devices. 1422 + * BUT... 1423 + */ 1424 + if (limits->zoned != BLK_ZONED_NONE) { 1425 + /* 1426 + * ...IF the above limits stacking determined a zoned model 1427 + * validate that all of the table's devices conform to it. 1428 + */ 1429 + zoned_model = limits->zoned; 1430 + zone_sectors = limits->chunk_sectors; 1431 + } 1432 + if (validate_hardware_zoned_model(table, zoned_model, zone_sectors)) 1433 + return -EINVAL; 1528 1434 1529 1435 return validate_hardware_logical_block_alignment(table, limits); 1530 1436 }
+2509
drivers/md/dm-zoned-metadata.c
··· 1 + /* 2 + * Copyright (C) 2017 Western Digital Corporation or its affiliates. 3 + * 4 + * This file is released under the GPL. 5 + */ 6 + 7 + #include "dm-zoned.h" 8 + 9 + #include <linux/module.h> 10 + #include <linux/crc32.h> 11 + 12 + #define DM_MSG_PREFIX "zoned metadata" 13 + 14 + /* 15 + * Metadata version. 16 + */ 17 + #define DMZ_META_VER 1 18 + 19 + /* 20 + * On-disk super block magic. 21 + */ 22 + #define DMZ_MAGIC ((((unsigned int)('D')) << 24) | \ 23 + (((unsigned int)('Z')) << 16) | \ 24 + (((unsigned int)('B')) << 8) | \ 25 + ((unsigned int)('D'))) 26 + 27 + /* 28 + * On disk super block. 29 + * This uses only 512 B but uses on disk a full 4KB block. This block is 30 + * followed on disk by the mapping table of chunks to zones and the bitmap 31 + * blocks indicating zone block validity. 32 + * The overall resulting metadata format is: 33 + * (1) Super block (1 block) 34 + * (2) Chunk mapping table (nr_map_blocks) 35 + * (3) Bitmap blocks (nr_bitmap_blocks) 36 + * All metadata blocks are stored in conventional zones, starting from 37 + * the first conventional zone found on disk.
38 + */ 39 + struct dmz_super { 40 + /* Magic number */ 41 + __le32 magic; /* 4 */ 42 + 43 + /* Metadata version number */ 44 + __le32 version; /* 8 */ 45 + 46 + /* Generation number */ 47 + __le64 gen; /* 16 */ 48 + 49 + /* This block number */ 50 + __le64 sb_block; /* 24 */ 51 + 52 + /* The number of metadata blocks, including this super block */ 53 + __le32 nr_meta_blocks; /* 28 */ 54 + 55 + /* The number of sequential zones reserved for reclaim */ 56 + __le32 nr_reserved_seq; /* 32 */ 57 + 58 + /* The number of entries in the mapping table */ 59 + __le32 nr_chunks; /* 36 */ 60 + 61 + /* The number of blocks used for the chunk mapping table */ 62 + __le32 nr_map_blocks; /* 40 */ 63 + 64 + /* The number of blocks used for the block bitmaps */ 65 + __le32 nr_bitmap_blocks; /* 44 */ 66 + 67 + /* Checksum */ 68 + __le32 crc; /* 48 */ 69 + 70 + /* Padding to full 512B sector */ 71 + u8 reserved[464]; /* 512 */ 72 + }; 73 + 74 + /* 75 + * Chunk mapping entry: entries are indexed by chunk number 76 + * and give the zone ID (dzone_id) mapping the chunk on disk. 77 + * This zone may be sequential or random. If it is a sequential 78 + * zone, a second zone (bzone_id) used as a write buffer may 79 + * also be specified. This second zone will always be a randomly 80 + * writeable zone. 81 + */ 82 + struct dmz_map { 83 + __le32 dzone_id; 84 + __le32 bzone_id; 85 + }; 86 + 87 + /* 88 + * Chunk mapping table metadata: 512 8-bytes entries per 4KB block. 89 + */ 90 + #define DMZ_MAP_ENTRIES (DMZ_BLOCK_SIZE / sizeof(struct dmz_map)) 91 + #define DMZ_MAP_ENTRIES_SHIFT (ilog2(DMZ_MAP_ENTRIES)) 92 + #define DMZ_MAP_ENTRIES_MASK (DMZ_MAP_ENTRIES - 1) 93 + #define DMZ_MAP_UNMAPPED UINT_MAX 94 + 95 + /* 96 + * Meta data block descriptor (for cached metadata blocks). 
97 + */ 98 + struct dmz_mblock { 99 + struct rb_node node; 100 + struct list_head link; 101 + sector_t no; 102 + atomic_t ref; 103 + unsigned long state; 104 + struct page *page; 105 + void *data; 106 + }; 107 + 108 + /* 109 + * Metadata block state flags. 110 + */ 111 + enum { 112 + DMZ_META_DIRTY, 113 + DMZ_META_READING, 114 + DMZ_META_WRITING, 115 + DMZ_META_ERROR, 116 + }; 117 + 118 + /* 119 + * Super block information (one per metadata set). 120 + */ 121 + struct dmz_sb { 122 + sector_t block; 123 + struct dmz_mblock *mblk; 124 + struct dmz_super *sb; 125 + }; 126 + 127 + /* 128 + * In-memory metadata. 129 + */ 130 + struct dmz_metadata { 131 + struct dmz_dev *dev; 132 + 133 + sector_t zone_bitmap_size; 134 + unsigned int zone_nr_bitmap_blocks; 135 + 136 + unsigned int nr_bitmap_blocks; 137 + unsigned int nr_map_blocks; 138 + 139 + unsigned int nr_useable_zones; 140 + unsigned int nr_meta_blocks; 141 + unsigned int nr_meta_zones; 142 + unsigned int nr_data_zones; 143 + unsigned int nr_rnd_zones; 144 + unsigned int nr_reserved_seq; 145 + unsigned int nr_chunks; 146 + 147 + /* Zone information array */ 148 + struct dm_zone *zones; 149 + 150 + struct dm_zone *sb_zone; 151 + struct dmz_sb sb[2]; 152 + unsigned int mblk_primary; 153 + u64 sb_gen; 154 + unsigned int min_nr_mblks; 155 + unsigned int max_nr_mblks; 156 + atomic_t nr_mblks; 157 + struct rw_semaphore mblk_sem; 158 + struct mutex mblk_flush_lock; 159 + spinlock_t mblk_lock; 160 + struct rb_root mblk_rbtree; 161 + struct list_head mblk_lru_list; 162 + struct list_head mblk_dirty_list; 163 + struct shrinker mblk_shrinker; 164 + 165 + /* Zone allocation management */ 166 + struct mutex map_lock; 167 + struct dmz_mblock **map_mblk; 168 + unsigned int nr_rnd; 169 + atomic_t unmap_nr_rnd; 170 + struct list_head unmap_rnd_list; 171 + struct list_head map_rnd_list; 172 + 173 + unsigned int nr_seq; 174 + atomic_t unmap_nr_seq; 175 + struct list_head unmap_seq_list; 176 + struct list_head map_seq_list; 177 + 178 + 
atomic_t nr_reserved_seq_zones; 179 + struct list_head reserved_seq_zones_list; 180 + 181 + wait_queue_head_t free_wq; 182 + }; 183 + 184 + /* 185 + * Various accessors 186 + */ 187 + unsigned int dmz_id(struct dmz_metadata *zmd, struct dm_zone *zone) 188 + { 189 + return ((unsigned int)(zone - zmd->zones)); 190 + } 191 + 192 + sector_t dmz_start_sect(struct dmz_metadata *zmd, struct dm_zone *zone) 193 + { 194 + return (sector_t)dmz_id(zmd, zone) << zmd->dev->zone_nr_sectors_shift; 195 + } 196 + 197 + sector_t dmz_start_block(struct dmz_metadata *zmd, struct dm_zone *zone) 198 + { 199 + return (sector_t)dmz_id(zmd, zone) << zmd->dev->zone_nr_blocks_shift; 200 + } 201 + 202 + unsigned int dmz_nr_chunks(struct dmz_metadata *zmd) 203 + { 204 + return zmd->nr_chunks; 205 + } 206 + 207 + unsigned int dmz_nr_rnd_zones(struct dmz_metadata *zmd) 208 + { 209 + return zmd->nr_rnd; 210 + } 211 + 212 + unsigned int dmz_nr_unmap_rnd_zones(struct dmz_metadata *zmd) 213 + { 214 + return atomic_read(&zmd->unmap_nr_rnd); 215 + } 216 + 217 + /* 218 + * Lock/unlock mapping table. 219 + * The map lock also protects all the zone lists. 220 + */ 221 + void dmz_lock_map(struct dmz_metadata *zmd) 222 + { 223 + mutex_lock(&zmd->map_lock); 224 + } 225 + 226 + void dmz_unlock_map(struct dmz_metadata *zmd) 227 + { 228 + mutex_unlock(&zmd->map_lock); 229 + } 230 + 231 + /* 232 + * Lock/unlock metadata access. This is a "read" lock on a semaphore 233 + * that prevents metadata flush from running while metadata are being 234 + * modified. The actual metadata write mutual exclusion is achieved with 235 + * the map lock and zone state management (active and reclaim state are 236 + * mutually exclusive).
237 + */ 238 + void dmz_lock_metadata(struct dmz_metadata *zmd) 239 + { 240 + down_read(&zmd->mblk_sem); 241 + } 242 + 243 + void dmz_unlock_metadata(struct dmz_metadata *zmd) 244 + { 245 + up_read(&zmd->mblk_sem); 246 + } 247 + 248 + /* 249 + * Lock/unlock flush: prevent concurrent executions 250 + * of dmz_flush_metadata as well as metadata modification in reclaim 251 + * while flush is being executed. 252 + */ 253 + void dmz_lock_flush(struct dmz_metadata *zmd) 254 + { 255 + mutex_lock(&zmd->mblk_flush_lock); 256 + } 257 + 258 + void dmz_unlock_flush(struct dmz_metadata *zmd) 259 + { 260 + mutex_unlock(&zmd->mblk_flush_lock); 261 + } 262 + 263 + /* 264 + * Allocate a metadata block. 265 + */ 266 + static struct dmz_mblock *dmz_alloc_mblock(struct dmz_metadata *zmd, 267 + sector_t mblk_no) 268 + { 269 + struct dmz_mblock *mblk = NULL; 270 + 271 + /* See if we can reuse cached blocks */ 272 + if (zmd->max_nr_mblks && atomic_read(&zmd->nr_mblks) > zmd->max_nr_mblks) { 273 + spin_lock(&zmd->mblk_lock); 274 + mblk = list_first_entry_or_null(&zmd->mblk_lru_list, 275 + struct dmz_mblock, link); 276 + if (mblk) { 277 + list_del_init(&mblk->link); 278 + rb_erase(&mblk->node, &zmd->mblk_rbtree); 279 + mblk->no = mblk_no; 280 + } 281 + spin_unlock(&zmd->mblk_lock); 282 + if (mblk) 283 + return mblk; 284 + } 285 + 286 + /* Allocate a new block */ 287 + mblk = kmalloc(sizeof(struct dmz_mblock), GFP_NOIO); 288 + if (!mblk) 289 + return NULL; 290 + 291 + mblk->page = alloc_page(GFP_NOIO); 292 + if (!mblk->page) { 293 + kfree(mblk); 294 + return NULL; 295 + } 296 + 297 + RB_CLEAR_NODE(&mblk->node); 298 + INIT_LIST_HEAD(&mblk->link); 299 + atomic_set(&mblk->ref, 0); 300 + mblk->state = 0; 301 + mblk->no = mblk_no; 302 + mblk->data = page_address(mblk->page); 303 + 304 + atomic_inc(&zmd->nr_mblks); 305 + 306 + return mblk; 307 + } 308 + 309 + /* 310 + * Free a metadata block. 
311 + */ 312 + static void dmz_free_mblock(struct dmz_metadata *zmd, struct dmz_mblock *mblk) 313 + { 314 + __free_pages(mblk->page, 0); 315 + kfree(mblk); 316 + 317 + atomic_dec(&zmd->nr_mblks); 318 + } 319 + 320 + /* 321 + * Insert a metadata block in the rbtree. 322 + */ 323 + static void dmz_insert_mblock(struct dmz_metadata *zmd, struct dmz_mblock *mblk) 324 + { 325 + struct rb_root *root = &zmd->mblk_rbtree; 326 + struct rb_node **new = &(root->rb_node), *parent = NULL; 327 + struct dmz_mblock *b; 328 + 329 + /* Figure out where to put the new node */ 330 + while (*new) { 331 + b = container_of(*new, struct dmz_mblock, node); 332 + parent = *new; 333 + new = (b->no < mblk->no) ? &((*new)->rb_left) : &((*new)->rb_right); 334 + } 335 + 336 + /* Add new node and rebalance tree */ 337 + rb_link_node(&mblk->node, parent, new); 338 + rb_insert_color(&mblk->node, root); 339 + } 340 + 341 + /* 342 + * Lookup a metadata block in the rbtree. 343 + */ 344 + static struct dmz_mblock *dmz_lookup_mblock(struct dmz_metadata *zmd, 345 + sector_t mblk_no) 346 + { 347 + struct rb_root *root = &zmd->mblk_rbtree; 348 + struct rb_node *node = root->rb_node; 349 + struct dmz_mblock *mblk; 350 + 351 + while (node) { 352 + mblk = container_of(node, struct dmz_mblock, node); 353 + if (mblk->no == mblk_no) 354 + return mblk; 355 + node = (mblk->no < mblk_no) ? node->rb_left : node->rb_right; 356 + } 357 + 358 + return NULL; 359 + } 360 + 361 + /* 362 + * Metadata block BIO end callback. 
363 + */ 364 + static void dmz_mblock_bio_end_io(struct bio *bio) 365 + { 366 + struct dmz_mblock *mblk = bio->bi_private; 367 + int flag; 368 + 369 + if (bio->bi_status) 370 + set_bit(DMZ_META_ERROR, &mblk->state); 371 + 372 + if (bio_op(bio) == REQ_OP_WRITE) 373 + flag = DMZ_META_WRITING; 374 + else 375 + flag = DMZ_META_READING; 376 + 377 + clear_bit_unlock(flag, &mblk->state); 378 + smp_mb__after_atomic(); 379 + wake_up_bit(&mblk->state, flag); 380 + 381 + bio_put(bio); 382 + } 383 + 384 + /* 385 + * Read a metadata block from disk. 386 + */ 387 + static struct dmz_mblock *dmz_fetch_mblock(struct dmz_metadata *zmd, 388 + sector_t mblk_no) 389 + { 390 + struct dmz_mblock *mblk; 391 + sector_t block = zmd->sb[zmd->mblk_primary].block + mblk_no; 392 + struct bio *bio; 393 + 394 + /* Get block and insert it */ 395 + mblk = dmz_alloc_mblock(zmd, mblk_no); 396 + if (!mblk) 397 + return NULL; 398 + 399 + spin_lock(&zmd->mblk_lock); 400 + atomic_inc(&mblk->ref); 401 + set_bit(DMZ_META_READING, &mblk->state); 402 + dmz_insert_mblock(zmd, mblk); 403 + spin_unlock(&zmd->mblk_lock); 404 + 405 + bio = bio_alloc(GFP_NOIO, 1); 406 + if (!bio) { 407 + dmz_free_mblock(zmd, mblk); 408 + return NULL; 409 + } 410 + 411 + bio->bi_iter.bi_sector = dmz_blk2sect(block); 412 + bio->bi_bdev = zmd->dev->bdev; 413 + bio->bi_private = mblk; 414 + bio->bi_end_io = dmz_mblock_bio_end_io; 415 + bio_set_op_attrs(bio, REQ_OP_READ, REQ_META | REQ_PRIO); 416 + bio_add_page(bio, mblk->page, DMZ_BLOCK_SIZE, 0); 417 + submit_bio(bio); 418 + 419 + return mblk; 420 + } 421 + 422 + /* 423 + * Free metadata blocks. 
424 + */ 425 + static unsigned long dmz_shrink_mblock_cache(struct dmz_metadata *zmd, 426 + unsigned long limit) 427 + { 428 + struct dmz_mblock *mblk; 429 + unsigned long count = 0; 430 + 431 + if (!zmd->max_nr_mblks) 432 + return 0; 433 + 434 + while (!list_empty(&zmd->mblk_lru_list) && 435 + atomic_read(&zmd->nr_mblks) > zmd->min_nr_mblks && 436 + count < limit) { 437 + mblk = list_first_entry(&zmd->mblk_lru_list, 438 + struct dmz_mblock, link); 439 + list_del_init(&mblk->link); 440 + rb_erase(&mblk->node, &zmd->mblk_rbtree); 441 + dmz_free_mblock(zmd, mblk); 442 + count++; 443 + } 444 + 445 + return count; 446 + } 447 + 448 + /* 449 + * For mblock shrinker: get the number of unused metadata blocks in the cache. 450 + */ 451 + static unsigned long dmz_mblock_shrinker_count(struct shrinker *shrink, 452 + struct shrink_control *sc) 453 + { 454 + struct dmz_metadata *zmd = container_of(shrink, struct dmz_metadata, mblk_shrinker); 455 + 456 + return atomic_read(&zmd->nr_mblks); 457 + } 458 + 459 + /* 460 + * For mblock shrinker: scan unused metadata blocks and shrink the cache. 461 + */ 462 + static unsigned long dmz_mblock_shrinker_scan(struct shrinker *shrink, 463 + struct shrink_control *sc) 464 + { 465 + struct dmz_metadata *zmd = container_of(shrink, struct dmz_metadata, mblk_shrinker); 466 + unsigned long count; 467 + 468 + spin_lock(&zmd->mblk_lock); 469 + count = dmz_shrink_mblock_cache(zmd, sc->nr_to_scan); 470 + spin_unlock(&zmd->mblk_lock); 471 + 472 + return count ? count : SHRINK_STOP; 473 + } 474 + 475 + /* 476 + * Release a metadata block. 
477 + */ 478 + static void dmz_release_mblock(struct dmz_metadata *zmd, 479 + struct dmz_mblock *mblk) 480 + { 481 + 482 + if (!mblk) 483 + return; 484 + 485 + spin_lock(&zmd->mblk_lock); 486 + 487 + if (atomic_dec_and_test(&mblk->ref)) { 488 + if (test_bit(DMZ_META_ERROR, &mblk->state)) { 489 + rb_erase(&mblk->node, &zmd->mblk_rbtree); 490 + dmz_free_mblock(zmd, mblk); 491 + } else if (!test_bit(DMZ_META_DIRTY, &mblk->state)) { 492 + list_add_tail(&mblk->link, &zmd->mblk_lru_list); 493 + dmz_shrink_mblock_cache(zmd, 1); 494 + } 495 + } 496 + 497 + spin_unlock(&zmd->mblk_lock); 498 + } 499 + 500 + /* 501 + * Get a metadata block from the rbtree. If the block 502 + * is not present, read it from disk. 503 + */ 504 + static struct dmz_mblock *dmz_get_mblock(struct dmz_metadata *zmd, 505 + sector_t mblk_no) 506 + { 507 + struct dmz_mblock *mblk; 508 + 509 + /* Check rbtree */ 510 + spin_lock(&zmd->mblk_lock); 511 + mblk = dmz_lookup_mblock(zmd, mblk_no); 512 + if (mblk) { 513 + /* Cache hit: remove block from LRU list */ 514 + if (atomic_inc_return(&mblk->ref) == 1 && 515 + !test_bit(DMZ_META_DIRTY, &mblk->state)) 516 + list_del_init(&mblk->link); 517 + } 518 + spin_unlock(&zmd->mblk_lock); 519 + 520 + if (!mblk) { 521 + /* Cache miss: read the block from disk */ 522 + mblk = dmz_fetch_mblock(zmd, mblk_no); 523 + if (!mblk) 524 + return ERR_PTR(-ENOMEM); 525 + } 526 + 527 + /* Wait for on-going read I/O and check for error */ 528 + wait_on_bit_io(&mblk->state, DMZ_META_READING, 529 + TASK_UNINTERRUPTIBLE); 530 + if (test_bit(DMZ_META_ERROR, &mblk->state)) { 531 + dmz_release_mblock(zmd, mblk); 532 + return ERR_PTR(-EIO); 533 + } 534 + 535 + return mblk; 536 + } 537 + 538 + /* 539 + * Mark a metadata block dirty. 
540 + */ 541 + static void dmz_dirty_mblock(struct dmz_metadata *zmd, struct dmz_mblock *mblk) 542 + { 543 + spin_lock(&zmd->mblk_lock); 544 + if (!test_and_set_bit(DMZ_META_DIRTY, &mblk->state)) 545 + list_add_tail(&mblk->link, &zmd->mblk_dirty_list); 546 + spin_unlock(&zmd->mblk_lock); 547 + } 548 + 549 + /* 550 + * Issue a metadata block write BIO. 551 + */ 552 + static void dmz_write_mblock(struct dmz_metadata *zmd, struct dmz_mblock *mblk, 553 + unsigned int set) 554 + { 555 + sector_t block = zmd->sb[set].block + mblk->no; 556 + struct bio *bio; 557 + 558 + bio = bio_alloc(GFP_NOIO, 1); 559 + if (!bio) { 560 + set_bit(DMZ_META_ERROR, &mblk->state); 561 + return; 562 + } 563 + 564 + set_bit(DMZ_META_WRITING, &mblk->state); 565 + 566 + bio->bi_iter.bi_sector = dmz_blk2sect(block); 567 + bio->bi_bdev = zmd->dev->bdev; 568 + bio->bi_private = mblk; 569 + bio->bi_end_io = dmz_mblock_bio_end_io; 570 + bio_set_op_attrs(bio, REQ_OP_WRITE, REQ_META | REQ_PRIO); 571 + bio_add_page(bio, mblk->page, DMZ_BLOCK_SIZE, 0); 572 + submit_bio(bio); 573 + } 574 + 575 + /* 576 + * Read/write a metadata block. 577 + */ 578 + static int dmz_rdwr_block(struct dmz_metadata *zmd, int op, sector_t block, 579 + struct page *page) 580 + { 581 + struct bio *bio; 582 + int ret; 583 + 584 + bio = bio_alloc(GFP_NOIO, 1); 585 + if (!bio) 586 + return -ENOMEM; 587 + 588 + bio->bi_iter.bi_sector = dmz_blk2sect(block); 589 + bio->bi_bdev = zmd->dev->bdev; 590 + bio_set_op_attrs(bio, op, REQ_SYNC | REQ_META | REQ_PRIO); 591 + bio_add_page(bio, page, DMZ_BLOCK_SIZE, 0); 592 + ret = submit_bio_wait(bio); 593 + bio_put(bio); 594 + 595 + return ret; 596 + } 597 + 598 + /* 599 + * Write super block of the specified metadata set. 
600 + */ 601 + static int dmz_write_sb(struct dmz_metadata *zmd, unsigned int set) 602 + { 603 + sector_t block = zmd->sb[set].block; 604 + struct dmz_mblock *mblk = zmd->sb[set].mblk; 605 + struct dmz_super *sb = zmd->sb[set].sb; 606 + u64 sb_gen = zmd->sb_gen + 1; 607 + int ret; 608 + 609 + sb->magic = cpu_to_le32(DMZ_MAGIC); 610 + sb->version = cpu_to_le32(DMZ_META_VER); 611 + 612 + sb->gen = cpu_to_le64(sb_gen); 613 + 614 + sb->sb_block = cpu_to_le64(block); 615 + sb->nr_meta_blocks = cpu_to_le32(zmd->nr_meta_blocks); 616 + sb->nr_reserved_seq = cpu_to_le32(zmd->nr_reserved_seq); 617 + sb->nr_chunks = cpu_to_le32(zmd->nr_chunks); 618 + 619 + sb->nr_map_blocks = cpu_to_le32(zmd->nr_map_blocks); 620 + sb->nr_bitmap_blocks = cpu_to_le32(zmd->nr_bitmap_blocks); 621 + 622 + sb->crc = 0; 623 + sb->crc = cpu_to_le32(crc32_le(sb_gen, (unsigned char *)sb, DMZ_BLOCK_SIZE)); 624 + 625 + ret = dmz_rdwr_block(zmd, REQ_OP_WRITE, block, mblk->page); 626 + if (ret == 0) 627 + ret = blkdev_issue_flush(zmd->dev->bdev, GFP_KERNEL, NULL); 628 + 629 + return ret; 630 + } 631 + 632 + /* 633 + * Write dirty metadata blocks to the specified set. 
634 + */ 635 + static int dmz_write_dirty_mblocks(struct dmz_metadata *zmd, 636 + struct list_head *write_list, 637 + unsigned int set) 638 + { 639 + struct dmz_mblock *mblk; 640 + struct blk_plug plug; 641 + int ret = 0; 642 + 643 + /* Issue writes */ 644 + blk_start_plug(&plug); 645 + list_for_each_entry(mblk, write_list, link) 646 + dmz_write_mblock(zmd, mblk, set); 647 + blk_finish_plug(&plug); 648 + 649 + /* Wait for completion */ 650 + list_for_each_entry(mblk, write_list, link) { 651 + wait_on_bit_io(&mblk->state, DMZ_META_WRITING, 652 + TASK_UNINTERRUPTIBLE); 653 + if (test_bit(DMZ_META_ERROR, &mblk->state)) { 654 + clear_bit(DMZ_META_ERROR, &mblk->state); 655 + ret = -EIO; 656 + } 657 + } 658 + 659 + /* Flush drive cache (this will also sync data) */ 660 + if (ret == 0) 661 + ret = blkdev_issue_flush(zmd->dev->bdev, GFP_KERNEL, NULL); 662 + 663 + return ret; 664 + } 665 + 666 + /* 667 + * Log dirty metadata blocks. 668 + */ 669 + static int dmz_log_dirty_mblocks(struct dmz_metadata *zmd, 670 + struct list_head *write_list) 671 + { 672 + unsigned int log_set = zmd->mblk_primary ^ 0x1; 673 + int ret; 674 + 675 + /* Write dirty blocks to the log */ 676 + ret = dmz_write_dirty_mblocks(zmd, write_list, log_set); 677 + if (ret) 678 + return ret; 679 + 680 + /* 681 + * No error so far: now validate the log by updating the 682 + * log index super block generation. 683 + */ 684 + ret = dmz_write_sb(zmd, log_set); 685 + if (ret) 686 + return ret; 687 + 688 + return 0; 689 + } 690 + 691 + /* 692 + * Flush dirty metadata blocks. 693 + */ 694 + int dmz_flush_metadata(struct dmz_metadata *zmd) 695 + { 696 + struct dmz_mblock *mblk; 697 + struct list_head write_list; 698 + int ret; 699 + 700 + if (WARN_ON(!zmd)) 701 + return 0; 702 + 703 + INIT_LIST_HEAD(&write_list); 704 + 705 + /* 706 + * Make sure that metadata blocks are stable before logging: take 707 + * the write lock on the metadata semaphore to prevent target BIOs 708 + * from modifying metadata. 
709 + */ 710 + down_write(&zmd->mblk_sem); 711 + 712 + /* 713 + * This is called from the target flush work and reclaim work. 714 + * Concurrent execution is not allowed. 715 + */ 716 + dmz_lock_flush(zmd); 717 + 718 + /* Get dirty blocks */ 719 + spin_lock(&zmd->mblk_lock); 720 + list_splice_init(&zmd->mblk_dirty_list, &write_list); 721 + spin_unlock(&zmd->mblk_lock); 722 + 723 + /* If there are no dirty metadata blocks, just flush the device cache */ 724 + if (list_empty(&write_list)) { 725 + ret = blkdev_issue_flush(zmd->dev->bdev, GFP_KERNEL, NULL); 726 + goto out; 727 + } 728 + 729 + /* 730 + * The primary metadata set is still clean. Keep it this way until 731 + * all updates are successful in the secondary set. That is, use 732 + * the secondary set as a log. 733 + */ 734 + ret = dmz_log_dirty_mblocks(zmd, &write_list); 735 + if (ret) 736 + goto out; 737 + 738 + /* 739 + * The log is on disk. It is now safe to update in place 740 + * in the primary metadata set. 741 + */ 742 + ret = dmz_write_dirty_mblocks(zmd, &write_list, zmd->mblk_primary); 743 + if (ret) 744 + goto out; 745 + 746 + ret = dmz_write_sb(zmd, zmd->mblk_primary); 747 + if (ret) 748 + goto out; 749 + 750 + while (!list_empty(&write_list)) { 751 + mblk = list_first_entry(&write_list, struct dmz_mblock, link); 752 + list_del_init(&mblk->link); 753 + 754 + spin_lock(&zmd->mblk_lock); 755 + clear_bit(DMZ_META_DIRTY, &mblk->state); 756 + if (atomic_read(&mblk->ref) == 0) 757 + list_add_tail(&mblk->link, &zmd->mblk_lru_list); 758 + spin_unlock(&zmd->mblk_lock); 759 + } 760 + 761 + zmd->sb_gen++; 762 + out: 763 + if (ret && !list_empty(&write_list)) { 764 + spin_lock(&zmd->mblk_lock); 765 + list_splice(&write_list, &zmd->mblk_dirty_list); 766 + spin_unlock(&zmd->mblk_lock); 767 + } 768 + 769 + dmz_unlock_flush(zmd); 770 + up_write(&zmd->mblk_sem); 771 + 772 + return ret; 773 + } 774 + 775 + /* 776 + * Check super block. 
777 + */ 778 + static int dmz_check_sb(struct dmz_metadata *zmd, struct dmz_super *sb) 779 + { 780 + unsigned int nr_meta_zones, nr_data_zones; 781 + struct dmz_dev *dev = zmd->dev; 782 + u32 crc, stored_crc; 783 + u64 gen; 784 + 785 + gen = le64_to_cpu(sb->gen); 786 + stored_crc = le32_to_cpu(sb->crc); 787 + sb->crc = 0; 788 + crc = crc32_le(gen, (unsigned char *)sb, DMZ_BLOCK_SIZE); 789 + if (crc != stored_crc) { 790 + dmz_dev_err(dev, "Invalid checksum (needed 0x%08x, got 0x%08x)", 791 + crc, stored_crc); 792 + return -ENXIO; 793 + } 794 + 795 + if (le32_to_cpu(sb->magic) != DMZ_MAGIC) { 796 + dmz_dev_err(dev, "Invalid meta magic (needed 0x%08x, got 0x%08x)", 797 + DMZ_MAGIC, le32_to_cpu(sb->magic)); 798 + return -ENXIO; 799 + } 800 + 801 + if (le32_to_cpu(sb->version) != DMZ_META_VER) { 802 + dmz_dev_err(dev, "Invalid meta version (needed %d, got %d)", 803 + DMZ_META_VER, le32_to_cpu(sb->version)); 804 + return -ENXIO; 805 + } 806 + 807 + nr_meta_zones = (le32_to_cpu(sb->nr_meta_blocks) + dev->zone_nr_blocks - 1) 808 + >> dev->zone_nr_blocks_shift; 809 + if (!nr_meta_zones || 810 + nr_meta_zones >= zmd->nr_rnd_zones) { 811 + dmz_dev_err(dev, "Invalid number of metadata blocks"); 812 + return -ENXIO; 813 + } 814 + 815 + if (!le32_to_cpu(sb->nr_reserved_seq) || 816 + le32_to_cpu(sb->nr_reserved_seq) >= (zmd->nr_useable_zones - nr_meta_zones)) { 817 + dmz_dev_err(dev, "Invalid number of reserved sequential zones"); 818 + return -ENXIO; 819 + } 820 + 821 + nr_data_zones = zmd->nr_useable_zones - 822 + (nr_meta_zones * 2 + le32_to_cpu(sb->nr_reserved_seq)); 823 + if (le32_to_cpu(sb->nr_chunks) > nr_data_zones) { 824 + dmz_dev_err(dev, "Invalid number of chunks %u / %u", 825 + le32_to_cpu(sb->nr_chunks), nr_data_zones); 826 + return -ENXIO; 827 + } 828 + 829 + /* OK */ 830 + zmd->nr_meta_blocks = le32_to_cpu(sb->nr_meta_blocks); 831 + zmd->nr_reserved_seq = le32_to_cpu(sb->nr_reserved_seq); 832 + zmd->nr_chunks = le32_to_cpu(sb->nr_chunks); 833 + zmd->nr_map_blocks = 
le32_to_cpu(sb->nr_map_blocks); 834 + zmd->nr_bitmap_blocks = le32_to_cpu(sb->nr_bitmap_blocks); 835 + zmd->nr_meta_zones = nr_meta_zones; 836 + zmd->nr_data_zones = nr_data_zones; 837 + 838 + return 0; 839 + } 840 + 841 + /* 842 + * Read the first or second super block from disk. 843 + */ 844 + static int dmz_read_sb(struct dmz_metadata *zmd, unsigned int set) 845 + { 846 + return dmz_rdwr_block(zmd, REQ_OP_READ, zmd->sb[set].block, 847 + zmd->sb[set].mblk->page); 848 + } 849 + 850 + /* 851 + * Determine the position of the secondary super blocks on disk. 852 + * This is used only if a corruption of the primary super block 853 + * is detected. 854 + */ 855 + static int dmz_lookup_secondary_sb(struct dmz_metadata *zmd) 856 + { 857 + unsigned int zone_nr_blocks = zmd->dev->zone_nr_blocks; 858 + struct dmz_mblock *mblk; 859 + int i; 860 + 861 + /* Allocate a block */ 862 + mblk = dmz_alloc_mblock(zmd, 0); 863 + if (!mblk) 864 + return -ENOMEM; 865 + 866 + zmd->sb[1].mblk = mblk; 867 + zmd->sb[1].sb = mblk->data; 868 + 869 + /* Bad first super block: search for the second one */ 870 + zmd->sb[1].block = zmd->sb[0].block + zone_nr_blocks; 871 + for (i = 0; i < zmd->nr_rnd_zones - 1; i++) { 872 + if (dmz_read_sb(zmd, 1) != 0) 873 + break; 874 + if (le32_to_cpu(zmd->sb[1].sb->magic) == DMZ_MAGIC) 875 + return 0; 876 + zmd->sb[1].block += zone_nr_blocks; 877 + } 878 + 879 + dmz_free_mblock(zmd, mblk); 880 + zmd->sb[1].mblk = NULL; 881 + 882 + return -EIO; 883 + } 884 + 885 + /* 886 + * Read the first or second super block from disk. 
887 + */ 888 + static int dmz_get_sb(struct dmz_metadata *zmd, unsigned int set) 889 + { 890 + struct dmz_mblock *mblk; 891 + int ret; 892 + 893 + /* Allocate a block */ 894 + mblk = dmz_alloc_mblock(zmd, 0); 895 + if (!mblk) 896 + return -ENOMEM; 897 + 898 + zmd->sb[set].mblk = mblk; 899 + zmd->sb[set].sb = mblk->data; 900 + 901 + /* Read super block */ 902 + ret = dmz_read_sb(zmd, set); 903 + if (ret) { 904 + dmz_free_mblock(zmd, mblk); 905 + zmd->sb[set].mblk = NULL; 906 + return ret; 907 + } 908 + 909 + return 0; 910 + } 911 + 912 + /* 913 + * Recover a metadata set. 914 + */ 915 + static int dmz_recover_mblocks(struct dmz_metadata *zmd, unsigned int dst_set) 916 + { 917 + unsigned int src_set = dst_set ^ 0x1; 918 + struct page *page; 919 + int i, ret; 920 + 921 + dmz_dev_warn(zmd->dev, "Metadata set %u invalid: recovering", dst_set); 922 + 923 + if (dst_set == 0) 924 + zmd->sb[0].block = dmz_start_block(zmd, zmd->sb_zone); 925 + else { 926 + zmd->sb[1].block = zmd->sb[0].block + 927 + (zmd->nr_meta_zones << zmd->dev->zone_nr_blocks_shift); 928 + } 929 + 930 + page = alloc_page(GFP_KERNEL); 931 + if (!page) 932 + return -ENOMEM; 933 + 934 + /* Copy metadata blocks */ 935 + for (i = 1; i < zmd->nr_meta_blocks; i++) { 936 + ret = dmz_rdwr_block(zmd, REQ_OP_READ, 937 + zmd->sb[src_set].block + i, page); 938 + if (ret) 939 + goto out; 940 + ret = dmz_rdwr_block(zmd, REQ_OP_WRITE, 941 + zmd->sb[dst_set].block + i, page); 942 + if (ret) 943 + goto out; 944 + } 945 + 946 + /* Finalize with the super block */ 947 + if (!zmd->sb[dst_set].mblk) { 948 + zmd->sb[dst_set].mblk = dmz_alloc_mblock(zmd, 0); 949 + if (!zmd->sb[dst_set].mblk) { 950 + ret = -ENOMEM; 951 + goto out; 952 + } 953 + zmd->sb[dst_set].sb = zmd->sb[dst_set].mblk->data; 954 + } 955 + 956 + ret = dmz_write_sb(zmd, dst_set); 957 + out: 958 + __free_pages(page, 0); 959 + 960 + return ret; 961 + } 962 + 963 + /* 964 + * Get super block from disk. 
965 + */ 966 + static int dmz_load_sb(struct dmz_metadata *zmd) 967 + { 968 + bool sb_good[2] = {false, false}; 969 + u64 sb_gen[2] = {0, 0}; 970 + int ret; 971 + 972 + /* Read and check the primary super block */ 973 + zmd->sb[0].block = dmz_start_block(zmd, zmd->sb_zone); 974 + ret = dmz_get_sb(zmd, 0); 975 + if (ret) { 976 + dmz_dev_err(zmd->dev, "Read primary super block failed"); 977 + return ret; 978 + } 979 + 980 + ret = dmz_check_sb(zmd, zmd->sb[0].sb); 981 + 982 + /* Read and check secondary super block */ 983 + if (ret == 0) { 984 + sb_good[0] = true; 985 + zmd->sb[1].block = zmd->sb[0].block + 986 + (zmd->nr_meta_zones << zmd->dev->zone_nr_blocks_shift); 987 + ret = dmz_get_sb(zmd, 1); 988 + } else 989 + ret = dmz_lookup_secondary_sb(zmd); 990 + 991 + if (ret) { 992 + dmz_dev_err(zmd->dev, "Read secondary super block failed"); 993 + return ret; 994 + } 995 + 996 + ret = dmz_check_sb(zmd, zmd->sb[1].sb); 997 + if (ret == 0) 998 + sb_good[1] = true; 999 + 1000 + /* Use highest generation sb first */ 1001 + if (!sb_good[0] && !sb_good[1]) { 1002 + dmz_dev_err(zmd->dev, "No valid super block found"); 1003 + return -EIO; 1004 + } 1005 + 1006 + if (sb_good[0]) 1007 + sb_gen[0] = le64_to_cpu(zmd->sb[0].sb->gen); 1008 + else 1009 + ret = dmz_recover_mblocks(zmd, 0); 1010 + 1011 + if (sb_good[1]) 1012 + sb_gen[1] = le64_to_cpu(zmd->sb[1].sb->gen); 1013 + else 1014 + ret = dmz_recover_mblocks(zmd, 1); 1015 + 1016 + if (ret) { 1017 + dmz_dev_err(zmd->dev, "Recovery failed"); 1018 + return -EIO; 1019 + } 1020 + 1021 + if (sb_gen[0] >= sb_gen[1]) { 1022 + zmd->sb_gen = sb_gen[0]; 1023 + zmd->mblk_primary = 0; 1024 + } else { 1025 + zmd->sb_gen = sb_gen[1]; 1026 + zmd->mblk_primary = 1; 1027 + } 1028 + 1029 + dmz_dev_debug(zmd->dev, "Using super block %u (gen %llu)", 1030 + zmd->mblk_primary, zmd->sb_gen); 1031 + 1032 + return 0; 1033 + } 1034 + 1035 + /* 1036 + * Initialize a zone descriptor. 
1037 + */
1038 + static int dmz_init_zone(struct dmz_metadata *zmd, struct dm_zone *zone,
1039 + struct blk_zone *blkz)
1040 + {
1041 + struct dmz_dev *dev = zmd->dev;
1042 +
1043 + /* Ignore a possibly smaller last runt zone */
1044 + if (blkz->len != dev->zone_nr_sectors) {
1045 + if (blkz->start + blkz->len == dev->capacity)
1046 + return 0;
1047 + return -ENXIO;
1048 + }
1049 +
1050 + INIT_LIST_HEAD(&zone->link);
1051 + atomic_set(&zone->refcount, 0);
1052 + zone->chunk = DMZ_MAP_UNMAPPED;
1053 +
1054 + if (blkz->type == BLK_ZONE_TYPE_CONVENTIONAL) {
1055 + set_bit(DMZ_RND, &zone->flags);
1056 + zmd->nr_rnd_zones++;
1057 + } else if (blkz->type == BLK_ZONE_TYPE_SEQWRITE_REQ ||
1058 + blkz->type == BLK_ZONE_TYPE_SEQWRITE_PREF) {
1059 + set_bit(DMZ_SEQ, &zone->flags);
1060 + } else
1061 + return -ENXIO;
1062 +
1063 + if (blkz->cond == BLK_ZONE_COND_OFFLINE)
1064 + set_bit(DMZ_OFFLINE, &zone->flags);
1065 + else if (blkz->cond == BLK_ZONE_COND_READONLY)
1066 + set_bit(DMZ_READ_ONLY, &zone->flags);
1067 +
1068 + if (dmz_is_rnd(zone))
1069 + zone->wp_block = 0;
1070 + else
1071 + zone->wp_block = dmz_sect2blk(blkz->wp - blkz->start);
1072 +
1073 + if (!dmz_is_offline(zone) && !dmz_is_readonly(zone)) {
1074 + zmd->nr_useable_zones++;
1075 + if (dmz_is_rnd(zone)) {
1076 + zmd->nr_rnd_zones++;
1077 + if (!zmd->sb_zone) {
1078 + /* Super block zone */
1079 + zmd->sb_zone = zone;
1080 + }
1081 + }
1082 + }
1083 +
1084 + return 0;
1085 + }
1086 +
1087 + /*
1088 + * Free zone descriptors.
1089 + */
1090 + static void dmz_drop_zones(struct dmz_metadata *zmd)
1091 + {
1092 + kfree(zmd->zones);
1093 + zmd->zones = NULL;
1094 + }
1095 +
1096 + /*
1097 + * The size of a zone report in number of zones.
1098 + * This results in 4096 * 64 B = 256 KB report zones commands.
1099 + */
1100 + #define DMZ_REPORT_NR_ZONES 4096
1101 +
1102 + /*
1103 + * Allocate and initialize zone descriptors using the zone
1104 + * information from disk.
1105 + */ 1106 + static int dmz_init_zones(struct dmz_metadata *zmd) 1107 + { 1108 + struct dmz_dev *dev = zmd->dev; 1109 + struct dm_zone *zone; 1110 + struct blk_zone *blkz; 1111 + unsigned int nr_blkz; 1112 + sector_t sector = 0; 1113 + int i, ret = 0; 1114 + 1115 + /* Init */ 1116 + zmd->zone_bitmap_size = dev->zone_nr_blocks >> 3; 1117 + zmd->zone_nr_bitmap_blocks = zmd->zone_bitmap_size >> DMZ_BLOCK_SHIFT; 1118 + 1119 + /* Allocate zone array */ 1120 + zmd->zones = kcalloc(dev->nr_zones, sizeof(struct dm_zone), GFP_KERNEL); 1121 + if (!zmd->zones) 1122 + return -ENOMEM; 1123 + 1124 + dmz_dev_info(dev, "Using %zu B for zone information", 1125 + sizeof(struct dm_zone) * dev->nr_zones); 1126 + 1127 + /* Get zone information */ 1128 + nr_blkz = DMZ_REPORT_NR_ZONES; 1129 + blkz = kcalloc(nr_blkz, sizeof(struct blk_zone), GFP_KERNEL); 1130 + if (!blkz) { 1131 + ret = -ENOMEM; 1132 + goto out; 1133 + } 1134 + 1135 + /* 1136 + * Get zone information and initialize zone descriptors. 1137 + * At the same time, determine where the super block 1138 + * should be: first block of the first randomly writable 1139 + * zone. 
1140 + */
1141 + zone = zmd->zones;
1142 + while (sector < dev->capacity) {
1143 + /* Get zone information */
1144 + nr_blkz = DMZ_REPORT_NR_ZONES;
1145 + ret = blkdev_report_zones(dev->bdev, sector, blkz,
1146 + &nr_blkz, GFP_KERNEL);
1147 + if (ret) {
1148 + dmz_dev_err(dev, "Report zones failed %d", ret);
1149 + goto out;
1150 + }
1151 +
1152 + /* Process report */
1153 + for (i = 0; i < nr_blkz; i++) {
1154 + ret = dmz_init_zone(zmd, zone, &blkz[i]);
1155 + if (ret)
1156 + goto out;
1157 + sector += dev->zone_nr_sectors;
1158 + zone++;
1159 + }
1160 + }
1161 +
1162 + /* The entire zone configuration of the disk should now be known */
1163 + if (sector < dev->capacity) {
1164 + dmz_dev_err(dev, "Failed to get correct zone information");
1165 + ret = -ENXIO;
1166 + }
1167 + out:
1168 + kfree(blkz);
1169 + if (ret)
1170 + dmz_drop_zones(zmd);
1171 +
1172 + return ret;
1173 + }
1174 +
1175 + /*
1176 + * Update a zone's information from a fresh device zone report.
1177 + */
1178 + static int dmz_update_zone(struct dmz_metadata *zmd, struct dm_zone *zone)
1179 + {
1180 + unsigned int nr_blkz = 1;
1181 + struct blk_zone blkz;
1182 + int ret;
1183 +
1184 + /* Get zone information from disk */
1185 + ret = blkdev_report_zones(zmd->dev->bdev, dmz_start_sect(zmd, zone),
1186 + &blkz, &nr_blkz, GFP_KERNEL);
1187 + if (ret) {
1188 + dmz_dev_err(zmd->dev, "Get zone %u report failed",
1189 + dmz_id(zmd, zone));
1190 + return ret;
1191 + }
1192 +
1193 + clear_bit(DMZ_OFFLINE, &zone->flags);
1194 + clear_bit(DMZ_READ_ONLY, &zone->flags);
1195 + if (blkz.cond == BLK_ZONE_COND_OFFLINE)
1196 + set_bit(DMZ_OFFLINE, &zone->flags);
1197 + else if (blkz.cond == BLK_ZONE_COND_READONLY)
1198 + set_bit(DMZ_READ_ONLY, &zone->flags);
1199 +
1200 + if (dmz_is_seq(zone))
1201 + zone->wp_block = dmz_sect2blk(blkz.wp - blkz.start);
1202 + else
1203 + zone->wp_block = 0;
1204 +
1205 + return 0;
1206 + }
1207 +
1208 + /*
1209 + * Check a zone write pointer position when the zone is marked
1210 + * with the sequential write error
flag. 1211 + */ 1212 + static int dmz_handle_seq_write_err(struct dmz_metadata *zmd, 1213 + struct dm_zone *zone) 1214 + { 1215 + unsigned int wp = 0; 1216 + int ret; 1217 + 1218 + wp = zone->wp_block; 1219 + ret = dmz_update_zone(zmd, zone); 1220 + if (ret) 1221 + return ret; 1222 + 1223 + dmz_dev_warn(zmd->dev, "Processing zone %u write error (zone wp %u/%u)", 1224 + dmz_id(zmd, zone), zone->wp_block, wp); 1225 + 1226 + if (zone->wp_block < wp) { 1227 + dmz_invalidate_blocks(zmd, zone, zone->wp_block, 1228 + wp - zone->wp_block); 1229 + } 1230 + 1231 + return 0; 1232 + } 1233 + 1234 + static struct dm_zone *dmz_get(struct dmz_metadata *zmd, unsigned int zone_id) 1235 + { 1236 + return &zmd->zones[zone_id]; 1237 + } 1238 + 1239 + /* 1240 + * Reset a zone write pointer. 1241 + */ 1242 + static int dmz_reset_zone(struct dmz_metadata *zmd, struct dm_zone *zone) 1243 + { 1244 + int ret; 1245 + 1246 + /* 1247 + * Ignore offline zones, read only zones, 1248 + * and conventional zones. 1249 + */ 1250 + if (dmz_is_offline(zone) || 1251 + dmz_is_readonly(zone) || 1252 + dmz_is_rnd(zone)) 1253 + return 0; 1254 + 1255 + if (!dmz_is_empty(zone) || dmz_seq_write_err(zone)) { 1256 + struct dmz_dev *dev = zmd->dev; 1257 + 1258 + ret = blkdev_reset_zones(dev->bdev, 1259 + dmz_start_sect(zmd, zone), 1260 + dev->zone_nr_sectors, GFP_KERNEL); 1261 + if (ret) { 1262 + dmz_dev_err(dev, "Reset zone %u failed %d", 1263 + dmz_id(zmd, zone), ret); 1264 + return ret; 1265 + } 1266 + } 1267 + 1268 + /* Clear write error bit and rewind write pointer position */ 1269 + clear_bit(DMZ_SEQ_WRITE_ERR, &zone->flags); 1270 + zone->wp_block = 0; 1271 + 1272 + return 0; 1273 + } 1274 + 1275 + static void dmz_get_zone_weight(struct dmz_metadata *zmd, struct dm_zone *zone); 1276 + 1277 + /* 1278 + * Initialize chunk mapping. 
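The repair performed by dmz_handle_seq_write_err() boils down to comparing the write pointer position the target expected with the one the device reports; any blocks in between are stale and get invalidated. A minimal sketch of that arithmetic (the helper name is illustrative, not a driver function):

```c
/* Number of blocks that must be invalidated after a sequential write error:
 * the blocks between the device-reported write pointer (dev_wp) and the
 * write pointer the target believed it had reached (expected_wp). If the
 * device write pointer is not behind, nothing needs invalidating. */
unsigned int stale_blocks(unsigned int dev_wp, unsigned int expected_wp)
{
	return (dev_wp < expected_wp) ? expected_wp - dev_wp : 0;
}
```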
1279 + */
1280 + static int dmz_load_mapping(struct dmz_metadata *zmd)
1281 + {
1282 + struct dmz_dev *dev = zmd->dev;
1283 + struct dm_zone *dzone, *bzone;
1284 + struct dmz_mblock *dmap_mblk = NULL;
1285 + struct dmz_map *dmap;
1286 + unsigned int i = 0, e = 0, chunk = 0;
1287 + unsigned int dzone_id;
1288 + unsigned int bzone_id;
1289 +
1290 + /* Metadata block array for the chunk mapping table */
1291 + zmd->map_mblk = kcalloc(zmd->nr_map_blocks,
1292 + sizeof(struct dmz_mblock *), GFP_KERNEL);
1293 + if (!zmd->map_mblk)
1294 + return -ENOMEM;
1295 +
1296 + /* Get chunk mapping table blocks and initialize zone mapping */
1297 + while (chunk < zmd->nr_chunks) {
1298 + if (!dmap_mblk) {
1299 + /* Get mapping block */
1300 + dmap_mblk = dmz_get_mblock(zmd, i + 1);
1301 + if (IS_ERR(dmap_mblk))
1302 + return PTR_ERR(dmap_mblk);
1303 + zmd->map_mblk[i] = dmap_mblk;
1304 + dmap = (struct dmz_map *) dmap_mblk->data;
1305 + i++;
1306 + e = 0;
1307 + }
1308 +
1309 + /* Check data zone */
1310 + dzone_id = le32_to_cpu(dmap[e].dzone_id);
1311 + if (dzone_id == DMZ_MAP_UNMAPPED)
1312 + goto next;
1313 +
1314 + if (dzone_id >= dev->nr_zones) {
1315 + dmz_dev_err(dev, "Chunk %u mapping: invalid data zone ID %u",
1316 + chunk, dzone_id);
1317 + return -EIO;
1318 + }
1319 +
1320 + dzone = dmz_get(zmd, dzone_id);
1321 + set_bit(DMZ_DATA, &dzone->flags);
1322 + dzone->chunk = chunk;
1323 + dmz_get_zone_weight(zmd, dzone);
1324 +
1325 + if (dmz_is_rnd(dzone))
1326 + list_add_tail(&dzone->link, &zmd->map_rnd_list);
1327 + else
1328 + list_add_tail(&dzone->link, &zmd->map_seq_list);
1329 +
1330 + /* Check buffer zone */
1331 + bzone_id = le32_to_cpu(dmap[e].bzone_id);
1332 + if (bzone_id == DMZ_MAP_UNMAPPED)
1333 + goto next;
1334 +
1335 + if (bzone_id >= dev->nr_zones) {
1336 + dmz_dev_err(dev, "Chunk %u mapping: invalid buffer zone ID %u",
1337 + chunk, bzone_id);
1338 + return -EIO;
1339 + }
1340 +
1341 + bzone = dmz_get(zmd, bzone_id);
1342 + if (!dmz_is_rnd(bzone)) {
1343 +
dmz_dev_err(dev, "Chunk %u mapping: invalid buffer zone %u", 1344 + chunk, bzone_id); 1345 + return -EIO; 1346 + } 1347 + 1348 + set_bit(DMZ_DATA, &bzone->flags); 1349 + set_bit(DMZ_BUF, &bzone->flags); 1350 + bzone->chunk = chunk; 1351 + bzone->bzone = dzone; 1352 + dzone->bzone = bzone; 1353 + dmz_get_zone_weight(zmd, bzone); 1354 + list_add_tail(&bzone->link, &zmd->map_rnd_list); 1355 + next: 1356 + chunk++; 1357 + e++; 1358 + if (e >= DMZ_MAP_ENTRIES) 1359 + dmap_mblk = NULL; 1360 + } 1361 + 1362 + /* 1363 + * At this point, only meta zones and mapped data zones were 1364 + * fully initialized. All remaining zones are unmapped data 1365 + * zones. Finish initializing those here. 1366 + */ 1367 + for (i = 0; i < dev->nr_zones; i++) { 1368 + dzone = dmz_get(zmd, i); 1369 + if (dmz_is_meta(dzone)) 1370 + continue; 1371 + 1372 + if (dmz_is_rnd(dzone)) 1373 + zmd->nr_rnd++; 1374 + else 1375 + zmd->nr_seq++; 1376 + 1377 + if (dmz_is_data(dzone)) { 1378 + /* Already initialized */ 1379 + continue; 1380 + } 1381 + 1382 + /* Unmapped data zone */ 1383 + set_bit(DMZ_DATA, &dzone->flags); 1384 + dzone->chunk = DMZ_MAP_UNMAPPED; 1385 + if (dmz_is_rnd(dzone)) { 1386 + list_add_tail(&dzone->link, &zmd->unmap_rnd_list); 1387 + atomic_inc(&zmd->unmap_nr_rnd); 1388 + } else if (atomic_read(&zmd->nr_reserved_seq_zones) < zmd->nr_reserved_seq) { 1389 + list_add_tail(&dzone->link, &zmd->reserved_seq_zones_list); 1390 + atomic_inc(&zmd->nr_reserved_seq_zones); 1391 + zmd->nr_seq--; 1392 + } else { 1393 + list_add_tail(&dzone->link, &zmd->unmap_seq_list); 1394 + atomic_inc(&zmd->unmap_nr_seq); 1395 + } 1396 + } 1397 + 1398 + return 0; 1399 + } 1400 + 1401 + /* 1402 + * Set a data chunk mapping. 
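The chunk mapping table walked above is an array of 8-byte entries packed into metadata blocks, so a chunk's entry is located with a shift and a mask (as dmz_set_chunk_mapping() and dmz_get_chunk_mapping() do with DMZ_MAP_ENTRIES_SHIFT/_MASK). A standalone sketch of that indexing, assuming 4096-byte metadata blocks holding 512 entries each (the shift value 9 is that assumption, not quoted from the driver):

```c
/* Sketch of the chunk -> mapping-entry indexing, assuming 4096-byte
 * metadata blocks holding 512 eight-byte entries each. */
#define MAP_ENTRIES_SHIFT	9
#define MAP_ENTRIES		(1u << MAP_ENTRIES_SHIFT)
#define MAP_ENTRIES_MASK	(MAP_ENTRIES - 1)

/* Which mapping block holds the entry for this chunk. */
unsigned int map_block_index(unsigned int chunk)
{
	return chunk >> MAP_ENTRIES_SHIFT;
}

/* Entry offset within that mapping block. */
unsigned int map_entry_index(unsigned int chunk)
{
	return chunk & MAP_ENTRIES_MASK;
}
```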
1403 + */ 1404 + static void dmz_set_chunk_mapping(struct dmz_metadata *zmd, unsigned int chunk, 1405 + unsigned int dzone_id, unsigned int bzone_id) 1406 + { 1407 + struct dmz_mblock *dmap_mblk = zmd->map_mblk[chunk >> DMZ_MAP_ENTRIES_SHIFT]; 1408 + struct dmz_map *dmap = (struct dmz_map *) dmap_mblk->data; 1409 + int map_idx = chunk & DMZ_MAP_ENTRIES_MASK; 1410 + 1411 + dmap[map_idx].dzone_id = cpu_to_le32(dzone_id); 1412 + dmap[map_idx].bzone_id = cpu_to_le32(bzone_id); 1413 + dmz_dirty_mblock(zmd, dmap_mblk); 1414 + } 1415 + 1416 + /* 1417 + * The list of mapped zones is maintained in LRU order. 1418 + * This rotates a zone at the end of its map list. 1419 + */ 1420 + static void __dmz_lru_zone(struct dmz_metadata *zmd, struct dm_zone *zone) 1421 + { 1422 + if (list_empty(&zone->link)) 1423 + return; 1424 + 1425 + list_del_init(&zone->link); 1426 + if (dmz_is_seq(zone)) { 1427 + /* LRU rotate sequential zone */ 1428 + list_add_tail(&zone->link, &zmd->map_seq_list); 1429 + } else { 1430 + /* LRU rotate random zone */ 1431 + list_add_tail(&zone->link, &zmd->map_rnd_list); 1432 + } 1433 + } 1434 + 1435 + /* 1436 + * The list of mapped random zones is maintained 1437 + * in LRU order. This rotates a zone at the end of the list. 1438 + */ 1439 + static void dmz_lru_zone(struct dmz_metadata *zmd, struct dm_zone *zone) 1440 + { 1441 + __dmz_lru_zone(zmd, zone); 1442 + if (zone->bzone) 1443 + __dmz_lru_zone(zmd, zone->bzone); 1444 + } 1445 + 1446 + /* 1447 + * Wait for any zone to be freed. 1448 + */ 1449 + static void dmz_wait_for_free_zones(struct dmz_metadata *zmd) 1450 + { 1451 + DEFINE_WAIT(wait); 1452 + 1453 + prepare_to_wait(&zmd->free_wq, &wait, TASK_UNINTERRUPTIBLE); 1454 + dmz_unlock_map(zmd); 1455 + dmz_unlock_metadata(zmd); 1456 + 1457 + io_schedule_timeout(HZ); 1458 + 1459 + dmz_lock_metadata(zmd); 1460 + dmz_lock_map(zmd); 1461 + finish_wait(&zmd->free_wq, &wait); 1462 + } 1463 + 1464 + /* 1465 + * Lock a zone for reclaim (set the zone RECLAIM bit). 
1466 + * Returns 0 if the zone cannot be locked or if it is already locked,
1467 + * and 1 otherwise.
1468 + */
1469 + int dmz_lock_zone_reclaim(struct dm_zone *zone)
1470 + {
1471 + /* Active zones cannot be reclaimed */
1472 + if (dmz_is_active(zone))
1473 + return 0;
1474 +
1475 + return !test_and_set_bit(DMZ_RECLAIM, &zone->flags);
1476 + }
1477 +
1478 + /*
1479 + * Clear a zone reclaim flag.
1480 + */
1481 + void dmz_unlock_zone_reclaim(struct dm_zone *zone)
1482 + {
1483 + WARN_ON(dmz_is_active(zone));
1484 + WARN_ON(!dmz_in_reclaim(zone));
1485 +
1486 + clear_bit_unlock(DMZ_RECLAIM, &zone->flags);
1487 + smp_mb__after_atomic();
1488 + wake_up_bit(&zone->flags, DMZ_RECLAIM);
1489 + }
1490 +
1491 + /*
1492 + * Wait for a zone reclaim to complete.
1493 + */
1494 + static void dmz_wait_for_reclaim(struct dmz_metadata *zmd, struct dm_zone *zone)
1495 + {
1496 + dmz_unlock_map(zmd);
1497 + dmz_unlock_metadata(zmd);
1498 + wait_on_bit_timeout(&zone->flags, DMZ_RECLAIM, TASK_UNINTERRUPTIBLE, HZ);
1499 + dmz_lock_metadata(zmd);
1500 + dmz_lock_map(zmd);
1501 + }
1502 +
1503 + /*
1504 + * Select a random write zone for reclaim.
1505 + */
1506 + static struct dm_zone *dmz_get_rnd_zone_for_reclaim(struct dmz_metadata *zmd)
1507 + {
1508 + struct dm_zone *dzone = NULL;
1509 + struct dm_zone *zone;
1510 +
1511 + if (list_empty(&zmd->map_rnd_list))
1512 + return NULL;
1513 +
1514 + list_for_each_entry(zone, &zmd->map_rnd_list, link) {
1515 + if (dmz_is_buf(zone))
1516 + dzone = zone->bzone;
1517 + else
1518 + dzone = zone;
1519 + if (dmz_lock_zone_reclaim(dzone))
1520 + return dzone;
1521 + }
1522 +
1523 + return NULL;
1524 + }
1525 +
1526 + /*
1527 + * Select a buffered sequential zone for reclaim.
1528 + */
1529 + static struct dm_zone *dmz_get_seq_zone_for_reclaim(struct dmz_metadata *zmd)
1530 + {
1531 + struct dm_zone *zone;
1532 +
1533 + if (list_empty(&zmd->map_seq_list))
1534 + return NULL;
1535 +
1536 + list_for_each_entry(zone, &zmd->map_seq_list, link) {
1537 + if (!zone->bzone)
1538 + continue;
1539 + if (dmz_lock_zone_reclaim(zone))
1540 + return zone;
1541 + }
1542 +
1543 + return NULL;
1544 + }
1545 +
1546 + /*
1547 + * Select a zone for reclaim.
1548 + */
1549 + struct dm_zone *dmz_get_zone_for_reclaim(struct dmz_metadata *zmd)
1550 + {
1551 + struct dm_zone *zone;
1552 +
1553 + /*
1554 + * Search for a zone candidate to reclaim: 2 cases are possible.
1555 + * (1) There are no free sequential zones. Then a random data zone
1556 + * cannot be reclaimed. So choose a sequential zone to reclaim so
1557 + * that afterward a random zone can be reclaimed.
1558 + * (2) At least one free sequential zone is available. Then choose
1559 + * the oldest random zone (data or buffer) that can be locked.
1560 + */
1561 + dmz_lock_map(zmd);
1562 + if (list_empty(&zmd->reserved_seq_zones_list))
1563 + zone = dmz_get_seq_zone_for_reclaim(zmd);
1564 + else
1565 + zone = dmz_get_rnd_zone_for_reclaim(zmd);
1566 + dmz_unlock_map(zmd);
1567 +
1568 + return zone;
1569 + }
1570 +
1571 + /*
1572 + * Activate a zone (increment its reference count).
1573 + */
1574 + void dmz_activate_zone(struct dm_zone *zone)
1575 + {
1576 + set_bit(DMZ_ACTIVE, &zone->flags);
1577 + atomic_inc(&zone->refcount);
1578 + }
1579 +
1580 + /*
1581 + * Deactivate a zone. This decrements the zone reference count
1582 + * and clears the active state of the zone once the count
1583 + * reaches 0, indicating that all BIOs to the zone have
1584 + * completed.
1585 + */
1586 + void dmz_deactivate_zone(struct dm_zone *zone)
1587 + {
1588 + if (atomic_dec_and_test(&zone->refcount)) {
1589 + WARN_ON(!test_bit(DMZ_ACTIVE, &zone->flags));
1590 + clear_bit_unlock(DMZ_ACTIVE, &zone->flags);
1591 + smp_mb__after_atomic();
1592 + }
1593 + }
1594 +
1595 + /*
1596 + * Get the zone mapping a chunk, if the chunk is mapped already.
1597 + * If no mapping exists and the operation is WRITE, a zone is
1598 + * allocated and used to map the chunk.
1599 + * The zone returned will be set to the active state.
1600 + */
1601 + struct dm_zone *dmz_get_chunk_mapping(struct dmz_metadata *zmd, unsigned int chunk, int op)
1602 + {
1603 + struct dmz_mblock *dmap_mblk = zmd->map_mblk[chunk >> DMZ_MAP_ENTRIES_SHIFT];
1604 + struct dmz_map *dmap = (struct dmz_map *) dmap_mblk->data;
1605 + int dmap_idx = chunk & DMZ_MAP_ENTRIES_MASK;
1606 + unsigned int dzone_id;
1607 + struct dm_zone *dzone = NULL;
1608 + int ret = 0;
1609 +
1610 + dmz_lock_map(zmd);
1611 + again:
1612 + /* Get the chunk mapping */
1613 + dzone_id = le32_to_cpu(dmap[dmap_idx].dzone_id);
1614 + if (dzone_id == DMZ_MAP_UNMAPPED) {
1615 + /*
1616 + * Reads and discards of unmapped chunks are fine. But for
1617 + * writes, we need a mapping, so get one.
1618 + */
1619 + if (op != REQ_OP_WRITE)
1620 + goto out;
1621 +
1622 + /* Allocate a random zone */
1623 + dzone = dmz_alloc_zone(zmd, DMZ_ALLOC_RND);
1624 + if (!dzone) {
1625 + dmz_wait_for_free_zones(zmd);
1626 + goto again;
1627 + }
1628 +
1629 + dmz_map_zone(zmd, dzone, chunk);
1630 +
1631 + } else {
1632 + /* The chunk is already mapped: get the mapping zone */
1633 + dzone = dmz_get(zmd, dzone_id);
1634 + if (dzone->chunk != chunk) {
1635 + dzone = ERR_PTR(-EIO);
1636 + goto out;
1637 + }
1638 +
1639 + /* Repair the write pointer if the sequential dzone had a write error */
1640 + if (dmz_seq_write_err(dzone)) {
1641 + ret = dmz_handle_seq_write_err(zmd, dzone);
1642 + if (ret) {
1643 + dzone = ERR_PTR(-EIO);
1644 + goto out;
1645 + }
1646 + clear_bit(DMZ_SEQ_WRITE_ERR, &dzone->flags);
1647 + }
1648 + }
1649 +
1650 + /*
1651 + * If the zone is being reclaimed, the chunk mapping may change
1652 + * to a different zone. So wait for reclaim and retry. Otherwise,
1653 + * activate the zone (this will prevent reclaim from touching it).
1654 + */
1655 + if (dmz_in_reclaim(dzone)) {
1656 + dmz_wait_for_reclaim(zmd, dzone);
1657 + goto again;
1658 + }
1659 + dmz_activate_zone(dzone);
1660 + dmz_lru_zone(zmd, dzone);
1661 + out:
1662 + dmz_unlock_map(zmd);
1663 +
1664 + return dzone;
1665 + }
1666 +
1667 + /*
1668 + * Writes and discards change the block validity of data zones and their buffer
1669 + * zones. Check here that valid blocks are still present. If all blocks are
1670 + * invalid, the zones can be unmapped on the fly without waiting for reclaim
1671 + * to do it.
1672 + */
1673 + void dmz_put_chunk_mapping(struct dmz_metadata *zmd, struct dm_zone *dzone)
1674 + {
1675 + struct dm_zone *bzone;
1676 +
1677 + dmz_lock_map(zmd);
1678 +
1679 + bzone = dzone->bzone;
1680 + if (bzone) {
1681 + if (dmz_weight(bzone))
1682 + dmz_lru_zone(zmd, bzone);
1683 + else {
1684 + /* Empty buffer zone: reclaim it */
1685 + dmz_unmap_zone(zmd, bzone);
1686 + dmz_free_zone(zmd, bzone);
1687 + bzone = NULL;
1688 + }
1689 + }
1690 +
1691 + /* Deactivate the data zone */
1692 + dmz_deactivate_zone(dzone);
1693 + if (dmz_is_active(dzone) || bzone || dmz_weight(dzone))
1694 + dmz_lru_zone(zmd, dzone);
1695 + else {
1696 + /* Unbuffered inactive empty data zone: reclaim it */
1697 + dmz_unmap_zone(zmd, dzone);
1698 + dmz_free_zone(zmd, dzone);
1699 + }
1700 +
1701 + dmz_unlock_map(zmd);
1702 + }
1703 +
1704 + /*
1705 + * Allocate and map a random zone to buffer a chunk
1706 + * already mapped to a sequential zone.
1707 + */
1708 + struct dm_zone *dmz_get_chunk_buffer(struct dmz_metadata *zmd,
1709 + struct dm_zone *dzone)
1710 + {
1711 + struct dm_zone *bzone;
1712 +
1713 + dmz_lock_map(zmd);
1714 + again:
1715 + bzone = dzone->bzone;
1716 + if (bzone)
1717 + goto out;
1718 +
1719 + /* Allocate a random zone */
1720 + bzone = dmz_alloc_zone(zmd, DMZ_ALLOC_RND);
1721 + if (!bzone) {
1722 + dmz_wait_for_free_zones(zmd);
1723 + goto again;
1724 + }
1725 +
1726 + /* Update the chunk mapping */
1727 + dmz_set_chunk_mapping(zmd, dzone->chunk, dmz_id(zmd, dzone),
1728 + dmz_id(zmd, bzone));
1729 +
1730 + set_bit(DMZ_BUF, &bzone->flags);
1731 + bzone->chunk = dzone->chunk;
1732 + bzone->bzone = dzone;
1733 + dzone->bzone = bzone;
1734 + list_add_tail(&bzone->link, &zmd->map_rnd_list);
1735 + out:
1736 + dmz_unlock_map(zmd);
1737 +
1738 + return bzone;
1739 + }
1740 +
1741 + /*
1742 + * Get an unmapped (free) zone.
1743 + * This must be called with the mapping lock held.
1744 + */ 1745 + struct dm_zone *dmz_alloc_zone(struct dmz_metadata *zmd, unsigned long flags) 1746 + { 1747 + struct list_head *list; 1748 + struct dm_zone *zone; 1749 + 1750 + if (flags & DMZ_ALLOC_RND) 1751 + list = &zmd->unmap_rnd_list; 1752 + else 1753 + list = &zmd->unmap_seq_list; 1754 + again: 1755 + if (list_empty(list)) { 1756 + /* 1757 + * No free zone: if this is for reclaim, allow using the 1758 + * reserved sequential zones. 1759 + */ 1760 + if (!(flags & DMZ_ALLOC_RECLAIM) || 1761 + list_empty(&zmd->reserved_seq_zones_list)) 1762 + return NULL; 1763 + 1764 + zone = list_first_entry(&zmd->reserved_seq_zones_list, 1765 + struct dm_zone, link); 1766 + list_del_init(&zone->link); 1767 + atomic_dec(&zmd->nr_reserved_seq_zones); 1768 + return zone; 1769 + } 1770 + 1771 + zone = list_first_entry(list, struct dm_zone, link); 1772 + list_del_init(&zone->link); 1773 + 1774 + if (dmz_is_rnd(zone)) 1775 + atomic_dec(&zmd->unmap_nr_rnd); 1776 + else 1777 + atomic_dec(&zmd->unmap_nr_seq); 1778 + 1779 + if (dmz_is_offline(zone)) { 1780 + dmz_dev_warn(zmd->dev, "Zone %u is offline", dmz_id(zmd, zone)); 1781 + zone = NULL; 1782 + goto again; 1783 + } 1784 + 1785 + return zone; 1786 + } 1787 + 1788 + /* 1789 + * Free a zone. 1790 + * This must be called with the mapping lock held. 
1791 + */ 1792 + void dmz_free_zone(struct dmz_metadata *zmd, struct dm_zone *zone) 1793 + { 1794 + /* If this is a sequential zone, reset it */ 1795 + if (dmz_is_seq(zone)) 1796 + dmz_reset_zone(zmd, zone); 1797 + 1798 + /* Return the zone to its type unmap list */ 1799 + if (dmz_is_rnd(zone)) { 1800 + list_add_tail(&zone->link, &zmd->unmap_rnd_list); 1801 + atomic_inc(&zmd->unmap_nr_rnd); 1802 + } else if (atomic_read(&zmd->nr_reserved_seq_zones) < 1803 + zmd->nr_reserved_seq) { 1804 + list_add_tail(&zone->link, &zmd->reserved_seq_zones_list); 1805 + atomic_inc(&zmd->nr_reserved_seq_zones); 1806 + } else { 1807 + list_add_tail(&zone->link, &zmd->unmap_seq_list); 1808 + atomic_inc(&zmd->unmap_nr_seq); 1809 + } 1810 + 1811 + wake_up_all(&zmd->free_wq); 1812 + } 1813 + 1814 + /* 1815 + * Map a chunk to a zone. 1816 + * This must be called with the mapping lock held. 1817 + */ 1818 + void dmz_map_zone(struct dmz_metadata *zmd, struct dm_zone *dzone, 1819 + unsigned int chunk) 1820 + { 1821 + /* Set the chunk mapping */ 1822 + dmz_set_chunk_mapping(zmd, chunk, dmz_id(zmd, dzone), 1823 + DMZ_MAP_UNMAPPED); 1824 + dzone->chunk = chunk; 1825 + if (dmz_is_rnd(dzone)) 1826 + list_add_tail(&dzone->link, &zmd->map_rnd_list); 1827 + else 1828 + list_add_tail(&dzone->link, &zmd->map_seq_list); 1829 + } 1830 + 1831 + /* 1832 + * Unmap a zone. 1833 + * This must be called with the mapping lock held. 
1834 + */ 1835 + void dmz_unmap_zone(struct dmz_metadata *zmd, struct dm_zone *zone) 1836 + { 1837 + unsigned int chunk = zone->chunk; 1838 + unsigned int dzone_id; 1839 + 1840 + if (chunk == DMZ_MAP_UNMAPPED) { 1841 + /* Already unmapped */ 1842 + return; 1843 + } 1844 + 1845 + if (test_and_clear_bit(DMZ_BUF, &zone->flags)) { 1846 + /* 1847 + * Unmapping the chunk buffer zone: clear only 1848 + * the chunk buffer mapping 1849 + */ 1850 + dzone_id = dmz_id(zmd, zone->bzone); 1851 + zone->bzone->bzone = NULL; 1852 + zone->bzone = NULL; 1853 + 1854 + } else { 1855 + /* 1856 + * Unmapping the chunk data zone: the zone must 1857 + * not be buffered. 1858 + */ 1859 + if (WARN_ON(zone->bzone)) { 1860 + zone->bzone->bzone = NULL; 1861 + zone->bzone = NULL; 1862 + } 1863 + dzone_id = DMZ_MAP_UNMAPPED; 1864 + } 1865 + 1866 + dmz_set_chunk_mapping(zmd, chunk, dzone_id, DMZ_MAP_UNMAPPED); 1867 + 1868 + zone->chunk = DMZ_MAP_UNMAPPED; 1869 + list_del_init(&zone->link); 1870 + } 1871 + 1872 + /* 1873 + * Set @nr_bits bits in @bitmap starting from @bit. 1874 + * Return the number of bits changed from 0 to 1. 1875 + */ 1876 + static unsigned int dmz_set_bits(unsigned long *bitmap, 1877 + unsigned int bit, unsigned int nr_bits) 1878 + { 1879 + unsigned long *addr; 1880 + unsigned int end = bit + nr_bits; 1881 + unsigned int n = 0; 1882 + 1883 + while (bit < end) { 1884 + if (((bit & (BITS_PER_LONG - 1)) == 0) && 1885 + ((end - bit) >= BITS_PER_LONG)) { 1886 + /* Try to set the whole word at once */ 1887 + addr = bitmap + BIT_WORD(bit); 1888 + if (*addr == 0) { 1889 + *addr = ULONG_MAX; 1890 + n += BITS_PER_LONG; 1891 + bit += BITS_PER_LONG; 1892 + continue; 1893 + } 1894 + } 1895 + 1896 + if (!test_and_set_bit(bit, bitmap)) 1897 + n++; 1898 + bit++; 1899 + } 1900 + 1901 + return n; 1902 + } 1903 + 1904 + /* 1905 + * Get the bitmap block storing the bit for chunk_block in zone. 
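dmz_set_bits() above optimizes the common case: when the current bit is word aligned and at least a full word remains, an all-zero word is filled with a single store instead of bit by bit. The same idea can be sketched in plain userspace C (using ordinary bit arithmetic in place of the kernel's test_and_set_bit(); names here are illustrative):

```c
#include <limits.h>

#define WORD_BITS (sizeof(unsigned long) * CHAR_BIT)

/* Set nr_bits bits starting at bit; return how many changed from 0 to 1.
 * Word-aligned all-zero words are filled in one store, mirroring the
 * fast path of dmz_set_bits(). */
unsigned int set_bits(unsigned long *bitmap, unsigned int bit,
		      unsigned int nr_bits)
{
	unsigned int end = bit + nr_bits, n = 0;

	while (bit < end) {
		if (bit % WORD_BITS == 0 && end - bit >= WORD_BITS) {
			unsigned long *addr = bitmap + bit / WORD_BITS;

			if (*addr == 0) {
				/* Set the whole word at once */
				*addr = ~0UL;
				n += WORD_BITS;
				bit += WORD_BITS;
				continue;
			}
		}
		/* Slow path: one bit at a time */
		if (!(bitmap[bit / WORD_BITS] & (1UL << (bit % WORD_BITS)))) {
			bitmap[bit / WORD_BITS] |= 1UL << (bit % WORD_BITS);
			n++;
		}
		bit++;
	}
	return n;
}
```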
1906 + */ 1907 + static struct dmz_mblock *dmz_get_bitmap(struct dmz_metadata *zmd, 1908 + struct dm_zone *zone, 1909 + sector_t chunk_block) 1910 + { 1911 + sector_t bitmap_block = 1 + zmd->nr_map_blocks + 1912 + (sector_t)(dmz_id(zmd, zone) * zmd->zone_nr_bitmap_blocks) + 1913 + (chunk_block >> DMZ_BLOCK_SHIFT_BITS); 1914 + 1915 + return dmz_get_mblock(zmd, bitmap_block); 1916 + } 1917 + 1918 + /* 1919 + * Copy the valid blocks bitmap of from_zone to the bitmap of to_zone. 1920 + */ 1921 + int dmz_copy_valid_blocks(struct dmz_metadata *zmd, struct dm_zone *from_zone, 1922 + struct dm_zone *to_zone) 1923 + { 1924 + struct dmz_mblock *from_mblk, *to_mblk; 1925 + sector_t chunk_block = 0; 1926 + 1927 + /* Get the zones bitmap blocks */ 1928 + while (chunk_block < zmd->dev->zone_nr_blocks) { 1929 + from_mblk = dmz_get_bitmap(zmd, from_zone, chunk_block); 1930 + if (IS_ERR(from_mblk)) 1931 + return PTR_ERR(from_mblk); 1932 + to_mblk = dmz_get_bitmap(zmd, to_zone, chunk_block); 1933 + if (IS_ERR(to_mblk)) { 1934 + dmz_release_mblock(zmd, from_mblk); 1935 + return PTR_ERR(to_mblk); 1936 + } 1937 + 1938 + memcpy(to_mblk->data, from_mblk->data, DMZ_BLOCK_SIZE); 1939 + dmz_dirty_mblock(zmd, to_mblk); 1940 + 1941 + dmz_release_mblock(zmd, to_mblk); 1942 + dmz_release_mblock(zmd, from_mblk); 1943 + 1944 + chunk_block += DMZ_BLOCK_SIZE_BITS; 1945 + } 1946 + 1947 + to_zone->weight = from_zone->weight; 1948 + 1949 + return 0; 1950 + } 1951 + 1952 + /* 1953 + * Merge the valid blocks bitmap of from_zone into the bitmap of to_zone, 1954 + * starting from chunk_block. 
1955 + */ 1956 + int dmz_merge_valid_blocks(struct dmz_metadata *zmd, struct dm_zone *from_zone, 1957 + struct dm_zone *to_zone, sector_t chunk_block) 1958 + { 1959 + unsigned int nr_blocks; 1960 + int ret; 1961 + 1962 + /* Get the zones bitmap blocks */ 1963 + while (chunk_block < zmd->dev->zone_nr_blocks) { 1964 + /* Get a valid region from the source zone */ 1965 + ret = dmz_first_valid_block(zmd, from_zone, &chunk_block); 1966 + if (ret <= 0) 1967 + return ret; 1968 + 1969 + nr_blocks = ret; 1970 + ret = dmz_validate_blocks(zmd, to_zone, chunk_block, nr_blocks); 1971 + if (ret) 1972 + return ret; 1973 + 1974 + chunk_block += nr_blocks; 1975 + } 1976 + 1977 + return 0; 1978 + } 1979 + 1980 + /* 1981 + * Validate all the blocks in the range [block..block+nr_blocks-1]. 1982 + */ 1983 + int dmz_validate_blocks(struct dmz_metadata *zmd, struct dm_zone *zone, 1984 + sector_t chunk_block, unsigned int nr_blocks) 1985 + { 1986 + unsigned int count, bit, nr_bits; 1987 + unsigned int zone_nr_blocks = zmd->dev->zone_nr_blocks; 1988 + struct dmz_mblock *mblk; 1989 + unsigned int n = 0; 1990 + 1991 + dmz_dev_debug(zmd->dev, "=> VALIDATE zone %u, block %llu, %u blocks", 1992 + dmz_id(zmd, zone), (unsigned long long)chunk_block, 1993 + nr_blocks); 1994 + 1995 + WARN_ON(chunk_block + nr_blocks > zone_nr_blocks); 1996 + 1997 + while (nr_blocks) { 1998 + /* Get bitmap block */ 1999 + mblk = dmz_get_bitmap(zmd, zone, chunk_block); 2000 + if (IS_ERR(mblk)) 2001 + return PTR_ERR(mblk); 2002 + 2003 + /* Set bits */ 2004 + bit = chunk_block & DMZ_BLOCK_MASK_BITS; 2005 + nr_bits = min(nr_blocks, DMZ_BLOCK_SIZE_BITS - bit); 2006 + 2007 + count = dmz_set_bits((unsigned long *)mblk->data, bit, nr_bits); 2008 + if (count) { 2009 + dmz_dirty_mblock(zmd, mblk); 2010 + n += count; 2011 + } 2012 + dmz_release_mblock(zmd, mblk); 2013 + 2014 + nr_blocks -= nr_bits; 2015 + chunk_block += nr_bits; 2016 + } 2017 + 2018 + if (likely(zone->weight + n <= zone_nr_blocks)) 2019 + zone->weight += n; 
2020 + else { 2021 + dmz_dev_warn(zmd->dev, "Zone %u: weight %u should be <= %u", 2022 + dmz_id(zmd, zone), zone->weight, 2023 + zone_nr_blocks - n); 2024 + zone->weight = zone_nr_blocks; 2025 + } 2026 + 2027 + return 0; 2028 + } 2029 + 2030 + /* 2031 + * Clear nr_bits bits in bitmap starting from bit. 2032 + * Return the number of bits cleared. 2033 + */ 2034 + static int dmz_clear_bits(unsigned long *bitmap, int bit, int nr_bits) 2035 + { 2036 + unsigned long *addr; 2037 + int end = bit + nr_bits; 2038 + int n = 0; 2039 + 2040 + while (bit < end) { 2041 + if (((bit & (BITS_PER_LONG - 1)) == 0) && 2042 + ((end - bit) >= BITS_PER_LONG)) { 2043 + /* Try to clear whole word at once */ 2044 + addr = bitmap + BIT_WORD(bit); 2045 + if (*addr == ULONG_MAX) { 2046 + *addr = 0; 2047 + n += BITS_PER_LONG; 2048 + bit += BITS_PER_LONG; 2049 + continue; 2050 + } 2051 + } 2052 + 2053 + if (test_and_clear_bit(bit, bitmap)) 2054 + n++; 2055 + bit++; 2056 + } 2057 + 2058 + return n; 2059 + } 2060 + 2061 + /* 2062 + * Invalidate all the blocks in the range [block..block+nr_blocks-1]. 
2063 + */ 2064 + int dmz_invalidate_blocks(struct dmz_metadata *zmd, struct dm_zone *zone, 2065 + sector_t chunk_block, unsigned int nr_blocks) 2066 + { 2067 + unsigned int count, bit, nr_bits; 2068 + struct dmz_mblock *mblk; 2069 + unsigned int n = 0; 2070 + 2071 + dmz_dev_debug(zmd->dev, "=> INVALIDATE zone %u, block %llu, %u blocks", 2072 + dmz_id(zmd, zone), (u64)chunk_block, nr_blocks); 2073 + 2074 + WARN_ON(chunk_block + nr_blocks > zmd->dev->zone_nr_blocks); 2075 + 2076 + while (nr_blocks) { 2077 + /* Get bitmap block */ 2078 + mblk = dmz_get_bitmap(zmd, zone, chunk_block); 2079 + if (IS_ERR(mblk)) 2080 + return PTR_ERR(mblk); 2081 + 2082 + /* Clear bits */ 2083 + bit = chunk_block & DMZ_BLOCK_MASK_BITS; 2084 + nr_bits = min(nr_blocks, DMZ_BLOCK_SIZE_BITS - bit); 2085 + 2086 + count = dmz_clear_bits((unsigned long *)mblk->data, 2087 + bit, nr_bits); 2088 + if (count) { 2089 + dmz_dirty_mblock(zmd, mblk); 2090 + n += count; 2091 + } 2092 + dmz_release_mblock(zmd, mblk); 2093 + 2094 + nr_blocks -= nr_bits; 2095 + chunk_block += nr_bits; 2096 + } 2097 + 2098 + if (zone->weight >= n) 2099 + zone->weight -= n; 2100 + else { 2101 + dmz_dev_warn(zmd->dev, "Zone %u: weight %u should be >= %u", 2102 + dmz_id(zmd, zone), zone->weight, n); 2103 + zone->weight = 0; 2104 + } 2105 + 2106 + return 0; 2107 + } 2108 + 2109 + /* 2110 + * Get a block bit value. 
2111 + */ 2112 + static int dmz_test_block(struct dmz_metadata *zmd, struct dm_zone *zone, 2113 + sector_t chunk_block) 2114 + { 2115 + struct dmz_mblock *mblk; 2116 + int ret; 2117 + 2118 + WARN_ON(chunk_block >= zmd->dev->zone_nr_blocks); 2119 + 2120 + /* Get bitmap block */ 2121 + mblk = dmz_get_bitmap(zmd, zone, chunk_block); 2122 + if (IS_ERR(mblk)) 2123 + return PTR_ERR(mblk); 2124 + 2125 + /* Get offset */ 2126 + ret = test_bit(chunk_block & DMZ_BLOCK_MASK_BITS, 2127 + (unsigned long *) mblk->data) != 0; 2128 + 2129 + dmz_release_mblock(zmd, mblk); 2130 + 2131 + return ret; 2132 + } 2133 + 2134 + /* 2135 + * Return the number of blocks from chunk_block to the first block with a bit 2136 + * value specified by set. Search at most nr_blocks blocks from chunk_block. 2137 + */ 2138 + static int dmz_to_next_set_block(struct dmz_metadata *zmd, struct dm_zone *zone, 2139 + sector_t chunk_block, unsigned int nr_blocks, 2140 + int set) 2141 + { 2142 + struct dmz_mblock *mblk; 2143 + unsigned int bit, set_bit, nr_bits; 2144 + unsigned long *bitmap; 2145 + int n = 0; 2146 + 2147 + WARN_ON(chunk_block + nr_blocks > zmd->dev->zone_nr_blocks); 2148 + 2149 + while (nr_blocks) { 2150 + /* Get bitmap block */ 2151 + mblk = dmz_get_bitmap(zmd, zone, chunk_block); 2152 + if (IS_ERR(mblk)) 2153 + return PTR_ERR(mblk); 2154 + 2155 + /* Get offset */ 2156 + bitmap = (unsigned long *) mblk->data; 2157 + bit = chunk_block & DMZ_BLOCK_MASK_BITS; 2158 + nr_bits = min(nr_blocks, DMZ_BLOCK_SIZE_BITS - bit); 2159 + if (set) 2160 + set_bit = find_next_bit(bitmap, DMZ_BLOCK_SIZE_BITS, bit); 2161 + else 2162 + set_bit = find_next_zero_bit(bitmap, DMZ_BLOCK_SIZE_BITS, bit); 2163 + dmz_release_mblock(zmd, mblk); 2164 + 2165 + n += set_bit - bit; 2166 + if (set_bit < DMZ_BLOCK_SIZE_BITS) 2167 + break; 2168 + 2169 + nr_blocks -= nr_bits; 2170 + chunk_block += nr_bits; 2171 + } 2172 + 2173 + return n; 2174 + } 2175 + 2176 + /* 2177 + * Test if chunk_block is valid. 
If it is, the number of consecutive 2178 + * valid blocks from chunk_block will be returned. 2179 + */ 2180 + int dmz_block_valid(struct dmz_metadata *zmd, struct dm_zone *zone, 2181 + sector_t chunk_block) 2182 + { 2183 + int valid; 2184 + 2185 + valid = dmz_test_block(zmd, zone, chunk_block); 2186 + if (valid <= 0) 2187 + return valid; 2188 + 2189 + /* The block is valid: get the number of valid blocks from block */ 2190 + return dmz_to_next_set_block(zmd, zone, chunk_block, 2191 + zmd->dev->zone_nr_blocks - chunk_block, 0); 2192 + } 2193 + 2194 + /* 2195 + * Find the first valid block from @chunk_block in @zone. 2196 + * If such a block is found, its number is returned using 2197 + * @chunk_block and the total number of valid blocks from @chunk_block 2198 + * is returned. 2199 + */ 2200 + int dmz_first_valid_block(struct dmz_metadata *zmd, struct dm_zone *zone, 2201 + sector_t *chunk_block) 2202 + { 2203 + sector_t start_block = *chunk_block; 2204 + int ret; 2205 + 2206 + ret = dmz_to_next_set_block(zmd, zone, start_block, 2207 + zmd->dev->zone_nr_blocks - start_block, 1); 2208 + if (ret < 0) 2209 + return ret; 2210 + 2211 + start_block += ret; 2212 + *chunk_block = start_block; 2213 + 2214 + return dmz_to_next_set_block(zmd, zone, start_block, 2215 + zmd->dev->zone_nr_blocks - start_block, 0); 2216 + } 2217 + 2218 + /* 2219 + * Count the number of bits set starting from bit up to bit + nr_bits - 1. 
2220 + */ 2221 + static int dmz_count_bits(void *bitmap, int bit, int nr_bits) 2222 + { 2223 + unsigned long *addr; 2224 + int end = bit + nr_bits; 2225 + int n = 0; 2226 + 2227 + while (bit < end) { 2228 + if (((bit & (BITS_PER_LONG - 1)) == 0) && 2229 + ((end - bit) >= BITS_PER_LONG)) { 2230 + addr = (unsigned long *)bitmap + BIT_WORD(bit); 2231 + if (*addr == ULONG_MAX) { 2232 + n += BITS_PER_LONG; 2233 + bit += BITS_PER_LONG; 2234 + continue; 2235 + } 2236 + } 2237 + 2238 + if (test_bit(bit, bitmap)) 2239 + n++; 2240 + bit++; 2241 + } 2242 + 2243 + return n; 2244 + } 2245 + 2246 + /* 2247 + * Get a zone weight. 2248 + */ 2249 + static void dmz_get_zone_weight(struct dmz_metadata *zmd, struct dm_zone *zone) 2250 + { 2251 + struct dmz_mblock *mblk; 2252 + sector_t chunk_block = 0; 2253 + unsigned int bit, nr_bits; 2254 + unsigned int nr_blocks = zmd->dev->zone_nr_blocks; 2255 + void *bitmap; 2256 + int n = 0; 2257 + 2258 + while (nr_blocks) { 2259 + /* Get bitmap block */ 2260 + mblk = dmz_get_bitmap(zmd, zone, chunk_block); 2261 + if (IS_ERR(mblk)) { 2262 + n = 0; 2263 + break; 2264 + } 2265 + 2266 + /* Count bits in this block */ 2267 + bitmap = mblk->data; 2268 + bit = chunk_block & DMZ_BLOCK_MASK_BITS; 2269 + nr_bits = min(nr_blocks, DMZ_BLOCK_SIZE_BITS - bit); 2270 + n += dmz_count_bits(bitmap, bit, nr_bits); 2271 + 2272 + dmz_release_mblock(zmd, mblk); 2273 + 2274 + nr_blocks -= nr_bits; 2275 + chunk_block += nr_bits; 2276 + } 2277 + 2278 + zone->weight = n; 2279 + } 2280 + 2281 + /* 2282 + * Cleanup the zoned metadata resources. 
2283 + */ 2284 + static void dmz_cleanup_metadata(struct dmz_metadata *zmd) 2285 + { 2286 + struct rb_root *root; 2287 + struct dmz_mblock *mblk, *next; 2288 + int i; 2289 + 2290 + /* Release zone mapping resources */ 2291 + if (zmd->map_mblk) { 2292 + for (i = 0; i < zmd->nr_map_blocks; i++) 2293 + dmz_release_mblock(zmd, zmd->map_mblk[i]); 2294 + kfree(zmd->map_mblk); 2295 + zmd->map_mblk = NULL; 2296 + } 2297 + 2298 + /* Release super blocks */ 2299 + for (i = 0; i < 2; i++) { 2300 + if (zmd->sb[i].mblk) { 2301 + dmz_free_mblock(zmd, zmd->sb[i].mblk); 2302 + zmd->sb[i].mblk = NULL; 2303 + } 2304 + } 2305 + 2306 + /* Free cached blocks */ 2307 + while (!list_empty(&zmd->mblk_dirty_list)) { 2308 + mblk = list_first_entry(&zmd->mblk_dirty_list, 2309 + struct dmz_mblock, link); 2310 + dmz_dev_warn(zmd->dev, "mblock %llu still in dirty list (ref %u)", 2311 + (u64)mblk->no, atomic_read(&mblk->ref)); 2312 + list_del_init(&mblk->link); 2313 + rb_erase(&mblk->node, &zmd->mblk_rbtree); 2314 + dmz_free_mblock(zmd, mblk); 2315 + } 2316 + 2317 + while (!list_empty(&zmd->mblk_lru_list)) { 2318 + mblk = list_first_entry(&zmd->mblk_lru_list, 2319 + struct dmz_mblock, link); 2320 + list_del_init(&mblk->link); 2321 + rb_erase(&mblk->node, &zmd->mblk_rbtree); 2322 + dmz_free_mblock(zmd, mblk); 2323 + } 2324 + 2325 + /* Sanity checks: the mblock rbtree should now be empty */ 2326 + root = &zmd->mblk_rbtree; 2327 + rbtree_postorder_for_each_entry_safe(mblk, next, root, node) { 2328 + dmz_dev_warn(zmd->dev, "mblock %llu ref %u still in rbtree", 2329 + (u64)mblk->no, atomic_read(&mblk->ref)); 2330 + atomic_set(&mblk->ref, 0); 2331 + dmz_free_mblock(zmd, mblk); 2332 + } 2333 + 2334 + /* Free the zone descriptors */ 2335 + dmz_drop_zones(zmd); 2336 + } 2337 + 2338 + /* 2339 + * Initialize the zoned metadata. 
2340 + */ 2341 + int dmz_ctr_metadata(struct dmz_dev *dev, struct dmz_metadata **metadata) 2342 + { 2343 + struct dmz_metadata *zmd; 2344 + unsigned int i, zid; 2345 + struct dm_zone *zone; 2346 + int ret; 2347 + 2348 + zmd = kzalloc(sizeof(struct dmz_metadata), GFP_KERNEL); 2349 + if (!zmd) 2350 + return -ENOMEM; 2351 + 2352 + zmd->dev = dev; 2353 + zmd->mblk_rbtree = RB_ROOT; 2354 + init_rwsem(&zmd->mblk_sem); 2355 + mutex_init(&zmd->mblk_flush_lock); 2356 + spin_lock_init(&zmd->mblk_lock); 2357 + INIT_LIST_HEAD(&zmd->mblk_lru_list); 2358 + INIT_LIST_HEAD(&zmd->mblk_dirty_list); 2359 + 2360 + mutex_init(&zmd->map_lock); 2361 + atomic_set(&zmd->unmap_nr_rnd, 0); 2362 + INIT_LIST_HEAD(&zmd->unmap_rnd_list); 2363 + INIT_LIST_HEAD(&zmd->map_rnd_list); 2364 + 2365 + atomic_set(&zmd->unmap_nr_seq, 0); 2366 + INIT_LIST_HEAD(&zmd->unmap_seq_list); 2367 + INIT_LIST_HEAD(&zmd->map_seq_list); 2368 + 2369 + atomic_set(&zmd->nr_reserved_seq_zones, 0); 2370 + INIT_LIST_HEAD(&zmd->reserved_seq_zones_list); 2371 + 2372 + init_waitqueue_head(&zmd->free_wq); 2373 + 2374 + /* Initialize zone descriptors */ 2375 + ret = dmz_init_zones(zmd); 2376 + if (ret) 2377 + goto err; 2378 + 2379 + /* Get super block */ 2380 + ret = dmz_load_sb(zmd); 2381 + if (ret) 2382 + goto err; 2383 + 2384 + /* Set metadata zones starting from sb_zone */ 2385 + zid = dmz_id(zmd, zmd->sb_zone); 2386 + for (i = 0; i < zmd->nr_meta_zones << 1; i++) { 2387 + zone = dmz_get(zmd, zid + i); 2388 + if (!dmz_is_rnd(zone)) 2389 + goto err; 2390 + set_bit(DMZ_META, &zone->flags); 2391 + } 2392 + 2393 + /* Load mapping table */ 2394 + ret = dmz_load_mapping(zmd); 2395 + if (ret) 2396 + goto err; 2397 + 2398 + /* 2399 + * Cache size boundaries: allow at least 2 super blocks, the chunk map 2400 + * blocks and enough blocks to be able to cache the bitmap blocks of 2401 + * up to 16 zones when idle (min_nr_mblks). Otherwise, if busy, allow 2402 + * the cache to add 512 more metadata blocks. 
2403 + */ 2404 + zmd->min_nr_mblks = 2 + zmd->nr_map_blocks + zmd->zone_nr_bitmap_blocks * 16; 2405 + zmd->max_nr_mblks = zmd->min_nr_mblks + 512; 2406 + zmd->mblk_shrinker.count_objects = dmz_mblock_shrinker_count; 2407 + zmd->mblk_shrinker.scan_objects = dmz_mblock_shrinker_scan; 2408 + zmd->mblk_shrinker.seeks = DEFAULT_SEEKS; 2409 + 2410 + /* Metadata cache shrinker */ 2411 + ret = register_shrinker(&zmd->mblk_shrinker); 2412 + if (ret) { 2413 + dmz_dev_err(dev, "Register metadata cache shrinker failed"); 2414 + goto err; 2415 + } 2416 + 2417 + dmz_dev_info(dev, "Host-%s zoned block device", 2418 + bdev_zoned_model(dev->bdev) == BLK_ZONED_HA ? 2419 + "aware" : "managed"); 2420 + dmz_dev_info(dev, " %llu 512-byte logical sectors", 2421 + (u64)dev->capacity); 2422 + dmz_dev_info(dev, " %u zones of %llu 512-byte logical sectors", 2423 + dev->nr_zones, (u64)dev->zone_nr_sectors); 2424 + dmz_dev_info(dev, " %u metadata zones", 2425 + zmd->nr_meta_zones * 2); 2426 + dmz_dev_info(dev, " %u data zones for %u chunks", 2427 + zmd->nr_data_zones, zmd->nr_chunks); 2428 + dmz_dev_info(dev, " %u random zones (%u unmapped)", 2429 + zmd->nr_rnd, atomic_read(&zmd->unmap_nr_rnd)); 2430 + dmz_dev_info(dev, " %u sequential zones (%u unmapped)", 2431 + zmd->nr_seq, atomic_read(&zmd->unmap_nr_seq)); 2432 + dmz_dev_info(dev, " %u reserved sequential data zones", 2433 + zmd->nr_reserved_seq); 2434 + 2435 + dmz_dev_debug(dev, "Format:"); 2436 + dmz_dev_debug(dev, "%u metadata blocks per set (%u max cache)", 2437 + zmd->nr_meta_blocks, zmd->max_nr_mblks); 2438 + dmz_dev_debug(dev, " %u data zone mapping blocks", 2439 + zmd->nr_map_blocks); 2440 + dmz_dev_debug(dev, " %u bitmap blocks", 2441 + zmd->nr_bitmap_blocks); 2442 + 2443 + *metadata = zmd; 2444 + 2445 + return 0; 2446 + err: 2447 + dmz_cleanup_metadata(zmd); 2448 + kfree(zmd); 2449 + *metadata = NULL; 2450 + 2451 + return ret; 2452 + } 2453 + 2454 + /* 2455 + * Cleanup the zoned metadata resources. 
2456 + */ 2457 + void dmz_dtr_metadata(struct dmz_metadata *zmd) 2458 + { 2459 + unregister_shrinker(&zmd->mblk_shrinker); 2460 + dmz_cleanup_metadata(zmd); 2461 + kfree(zmd); 2462 + } 2463 + 2464 + /* 2465 + * Check zone information on resume. 2466 + */ 2467 + int dmz_resume_metadata(struct dmz_metadata *zmd) 2468 + { 2469 + struct dmz_dev *dev = zmd->dev; 2470 + struct dm_zone *zone; 2471 + sector_t wp_block; 2472 + unsigned int i; 2473 + int ret; 2474 + 2475 + /* Check zones */ 2476 + for (i = 0; i < dev->nr_zones; i++) { 2477 + zone = dmz_get(zmd, i); 2478 + if (!zone) { 2479 + dmz_dev_err(dev, "Unable to get zone %u", i); 2480 + return -EIO; 2481 + } 2482 + 2483 + wp_block = zone->wp_block; 2484 + 2485 + ret = dmz_update_zone(zmd, zone); 2486 + if (ret) { 2487 + dmz_dev_err(dev, "Broken zone %u", i); 2488 + return ret; 2489 + } 2490 + 2491 + if (dmz_is_offline(zone)) { 2492 + dmz_dev_warn(dev, "Zone %u is offline", i); 2493 + continue; 2494 + } 2495 + 2496 + /* Check write pointer */ 2497 + if (!dmz_is_seq(zone)) 2498 + zone->wp_block = 0; 2499 + else if (zone->wp_block != wp_block) { 2500 + dmz_dev_err(dev, "Zone %u: Invalid wp (%llu / %llu)", 2501 + i, (u64)zone->wp_block, (u64)wp_block); 2502 + zone->wp_block = wp_block; 2503 + dmz_invalidate_blocks(zmd, zone, zone->wp_block, 2504 + dev->zone_nr_blocks - zone->wp_block); 2505 + } 2506 + } 2507 + 2508 + return 0; 2509 + }
+570
drivers/md/dm-zoned-reclaim.c
··· 1 + /* 2 + * Copyright (C) 2017 Western Digital Corporation or its affiliates. 3 + * 4 + * This file is released under the GPL. 5 + */ 6 + 7 + #include "dm-zoned.h" 8 + 9 + #include <linux/module.h> 10 + 11 + #define DM_MSG_PREFIX "zoned reclaim" 12 + 13 + struct dmz_reclaim { 14 + struct dmz_metadata *metadata; 15 + struct dmz_dev *dev; 16 + 17 + struct delayed_work work; 18 + struct workqueue_struct *wq; 19 + 20 + struct dm_kcopyd_client *kc; 21 + struct dm_kcopyd_throttle kc_throttle; 22 + int kc_err; 23 + 24 + unsigned long flags; 25 + 26 + /* Last target access time */ 27 + unsigned long atime; 28 + }; 29 + 30 + /* 31 + * Reclaim state flags. 32 + */ 33 + enum { 34 + DMZ_RECLAIM_KCOPY, 35 + }; 36 + 37 + /* 38 + * Number of seconds of target BIO inactivity to consider the target idle. 39 + */ 40 + #define DMZ_IDLE_PERIOD (10UL * HZ) 41 + 42 + /* 43 + * Percentage of unmapped (free) random zones below which reclaim starts 44 + * even if the target is busy. 45 + */ 46 + #define DMZ_RECLAIM_LOW_UNMAP_RND 30 47 + 48 + /* 49 + * Percentage of unmapped (free) random zones above which reclaim will 50 + * stop if the target is busy. 51 + */ 52 + #define DMZ_RECLAIM_HIGH_UNMAP_RND 50 53 + 54 + /* 55 + * Align a sequential zone write pointer to chunk_block. 56 + */ 57 + static int dmz_reclaim_align_wp(struct dmz_reclaim *zrc, struct dm_zone *zone, 58 + sector_t block) 59 + { 60 + struct dmz_metadata *zmd = zrc->metadata; 61 + sector_t wp_block = zone->wp_block; 62 + unsigned int nr_blocks; 63 + int ret; 64 + 65 + if (wp_block == block) 66 + return 0; 67 + 68 + if (wp_block > block) 69 + return -EIO; 70 + 71 + /* 72 + * Zeroout the space between the write 73 + * pointer and the requested position. 
74 + */ 75 + nr_blocks = block - wp_block; 76 + ret = blkdev_issue_zeroout(zrc->dev->bdev, 77 + dmz_start_sect(zmd, zone) + dmz_blk2sect(wp_block), 78 + dmz_blk2sect(nr_blocks), GFP_NOFS, false); 79 + if (ret) { 80 + dmz_dev_err(zrc->dev, 81 + "Align zone %u wp %llu to %llu (wp+%u) blocks failed %d", 82 + dmz_id(zmd, zone), (unsigned long long)wp_block, 83 + (unsigned long long)block, nr_blocks, ret); 84 + return ret; 85 + } 86 + 87 + zone->wp_block = block; 88 + 89 + return 0; 90 + } 91 + 92 + /* 93 + * dm_kcopyd_copy end notification. 94 + */ 95 + static void dmz_reclaim_kcopy_end(int read_err, unsigned long write_err, 96 + void *context) 97 + { 98 + struct dmz_reclaim *zrc = context; 99 + 100 + if (read_err || write_err) 101 + zrc->kc_err = -EIO; 102 + else 103 + zrc->kc_err = 0; 104 + 105 + clear_bit_unlock(DMZ_RECLAIM_KCOPY, &zrc->flags); 106 + smp_mb__after_atomic(); 107 + wake_up_bit(&zrc->flags, DMZ_RECLAIM_KCOPY); 108 + } 109 + 110 + /* 111 + * Copy valid blocks of src_zone into dst_zone. 
112 + */
113 + static int dmz_reclaim_copy(struct dmz_reclaim *zrc,
114 + struct dm_zone *src_zone, struct dm_zone *dst_zone)
115 + {
116 + struct dmz_metadata *zmd = zrc->metadata;
117 + struct dmz_dev *dev = zrc->dev;
118 + struct dm_io_region src, dst;
119 + sector_t block = 0, end_block;
120 + sector_t nr_blocks;
121 + sector_t src_zone_block;
122 + sector_t dst_zone_block;
123 + unsigned long flags = 0;
124 + int ret;
125 +
126 + if (dmz_is_seq(src_zone))
127 + end_block = src_zone->wp_block;
128 + else
129 + end_block = dev->zone_nr_blocks;
130 + src_zone_block = dmz_start_block(zmd, src_zone);
131 + dst_zone_block = dmz_start_block(zmd, dst_zone);
132 +
133 + if (dmz_is_seq(dst_zone))
134 + set_bit(DM_KCOPYD_WRITE_SEQ, &flags);
135 +
136 + while (block < end_block) {
137 + /* Get a valid region from the source zone */
138 + ret = dmz_first_valid_block(zmd, src_zone, &block);
139 + if (ret <= 0)
140 + return ret;
141 + nr_blocks = ret;
142 +
143 + /*
144 + * If we are writing in a sequential zone, we must make sure
145 + * that the writes are sequential, so zero out any hole
146 + * between writes.
147 + */ 148 + if (dmz_is_seq(dst_zone)) { 149 + ret = dmz_reclaim_align_wp(zrc, dst_zone, block); 150 + if (ret) 151 + return ret; 152 + } 153 + 154 + src.bdev = dev->bdev; 155 + src.sector = dmz_blk2sect(src_zone_block + block); 156 + src.count = dmz_blk2sect(nr_blocks); 157 + 158 + dst.bdev = dev->bdev; 159 + dst.sector = dmz_blk2sect(dst_zone_block + block); 160 + dst.count = src.count; 161 + 162 + /* Copy the valid region */ 163 + set_bit(DMZ_RECLAIM_KCOPY, &zrc->flags); 164 + ret = dm_kcopyd_copy(zrc->kc, &src, 1, &dst, flags, 165 + dmz_reclaim_kcopy_end, zrc); 166 + if (ret) 167 + return ret; 168 + 169 + /* Wait for copy to complete */ 170 + wait_on_bit_io(&zrc->flags, DMZ_RECLAIM_KCOPY, 171 + TASK_UNINTERRUPTIBLE); 172 + if (zrc->kc_err) 173 + return zrc->kc_err; 174 + 175 + block += nr_blocks; 176 + if (dmz_is_seq(dst_zone)) 177 + dst_zone->wp_block = block; 178 + } 179 + 180 + return 0; 181 + } 182 + 183 + /* 184 + * Move valid blocks of dzone buffer zone into dzone (after its write pointer) 185 + * and free the buffer zone. 
186 + */
187 + static int dmz_reclaim_buf(struct dmz_reclaim *zrc, struct dm_zone *dzone)
188 + {
189 + struct dm_zone *bzone = dzone->bzone;
190 + sector_t chunk_block = dzone->wp_block;
191 + struct dmz_metadata *zmd = zrc->metadata;
192 + int ret;
193 +
194 + dmz_dev_debug(zrc->dev,
195 + "Chunk %u, move buf zone %u (weight %u) to data zone %u (weight %u)",
196 + dzone->chunk, dmz_id(zmd, bzone), dmz_weight(bzone),
197 + dmz_id(zmd, dzone), dmz_weight(dzone));
198 +
199 + /* Flush the buffer zone into the data zone */
200 + ret = dmz_reclaim_copy(zrc, bzone, dzone);
201 + if (ret < 0)
202 + return ret;
203 +
204 + dmz_lock_flush(zmd);
205 +
206 + /* Validate copied blocks */
207 + ret = dmz_merge_valid_blocks(zmd, bzone, dzone, chunk_block);
208 + if (ret == 0) {
209 + /* Free the buffer zone */
210 + dmz_invalidate_blocks(zmd, bzone, 0, zrc->dev->zone_nr_blocks);
211 + dmz_lock_map(zmd);
212 + dmz_unmap_zone(zmd, bzone);
213 + dmz_unlock_zone_reclaim(dzone);
214 + dmz_free_zone(zmd, bzone);
215 + dmz_unlock_map(zmd);
216 + }
217 +
218 + dmz_unlock_flush(zmd);
219 +
220 + return 0;
221 + }
222 +
223 + /*
224 + * Merge valid blocks of dzone into its buffer zone and free dzone.
225 + */ 226 + static int dmz_reclaim_seq_data(struct dmz_reclaim *zrc, struct dm_zone *dzone) 227 + { 228 + unsigned int chunk = dzone->chunk; 229 + struct dm_zone *bzone = dzone->bzone; 230 + struct dmz_metadata *zmd = zrc->metadata; 231 + int ret = 0; 232 + 233 + dmz_dev_debug(zrc->dev, 234 + "Chunk %u, move data zone %u (weight %u) to buf zone %u (weight %u)", 235 + chunk, dmz_id(zmd, dzone), dmz_weight(dzone), 236 + dmz_id(zmd, bzone), dmz_weight(bzone)); 237 + 238 + /* Flush data zone into the buffer zone */ 239 + ret = dmz_reclaim_copy(zrc, dzone, bzone); 240 + if (ret < 0) 241 + return ret; 242 + 243 + dmz_lock_flush(zmd); 244 + 245 + /* Validate copied blocks */ 246 + ret = dmz_merge_valid_blocks(zmd, dzone, bzone, 0); 247 + if (ret == 0) { 248 + /* 249 + * Free the data zone and remap the chunk to 250 + * the buffer zone. 251 + */ 252 + dmz_invalidate_blocks(zmd, dzone, 0, zrc->dev->zone_nr_blocks); 253 + dmz_lock_map(zmd); 254 + dmz_unmap_zone(zmd, bzone); 255 + dmz_unmap_zone(zmd, dzone); 256 + dmz_unlock_zone_reclaim(dzone); 257 + dmz_free_zone(zmd, dzone); 258 + dmz_map_zone(zmd, bzone, chunk); 259 + dmz_unlock_map(zmd); 260 + } 261 + 262 + dmz_unlock_flush(zmd); 263 + 264 + return 0; 265 + } 266 + 267 + /* 268 + * Move valid blocks of the random data zone dzone into a free sequential zone. 269 + * Once blocks are moved, remap the zone chunk to the sequential zone. 
270 + */ 271 + static int dmz_reclaim_rnd_data(struct dmz_reclaim *zrc, struct dm_zone *dzone) 272 + { 273 + unsigned int chunk = dzone->chunk; 274 + struct dm_zone *szone = NULL; 275 + struct dmz_metadata *zmd = zrc->metadata; 276 + int ret; 277 + 278 + /* Get a free sequential zone */ 279 + dmz_lock_map(zmd); 280 + szone = dmz_alloc_zone(zmd, DMZ_ALLOC_RECLAIM); 281 + dmz_unlock_map(zmd); 282 + if (!szone) 283 + return -ENOSPC; 284 + 285 + dmz_dev_debug(zrc->dev, 286 + "Chunk %u, move rnd zone %u (weight %u) to seq zone %u", 287 + chunk, dmz_id(zmd, dzone), dmz_weight(dzone), 288 + dmz_id(zmd, szone)); 289 + 290 + /* Flush the random data zone into the sequential zone */ 291 + ret = dmz_reclaim_copy(zrc, dzone, szone); 292 + 293 + dmz_lock_flush(zmd); 294 + 295 + if (ret == 0) { 296 + /* Validate copied blocks */ 297 + ret = dmz_copy_valid_blocks(zmd, dzone, szone); 298 + } 299 + if (ret) { 300 + /* Free the sequential zone */ 301 + dmz_lock_map(zmd); 302 + dmz_free_zone(zmd, szone); 303 + dmz_unlock_map(zmd); 304 + } else { 305 + /* Free the data zone and remap the chunk */ 306 + dmz_invalidate_blocks(zmd, dzone, 0, zrc->dev->zone_nr_blocks); 307 + dmz_lock_map(zmd); 308 + dmz_unmap_zone(zmd, dzone); 309 + dmz_unlock_zone_reclaim(dzone); 310 + dmz_free_zone(zmd, dzone); 311 + dmz_map_zone(zmd, szone, chunk); 312 + dmz_unlock_map(zmd); 313 + } 314 + 315 + dmz_unlock_flush(zmd); 316 + 317 + return 0; 318 + } 319 + 320 + /* 321 + * Reclaim an empty zone. 322 + */ 323 + static void dmz_reclaim_empty(struct dmz_reclaim *zrc, struct dm_zone *dzone) 324 + { 325 + struct dmz_metadata *zmd = zrc->metadata; 326 + 327 + dmz_lock_flush(zmd); 328 + dmz_lock_map(zmd); 329 + dmz_unmap_zone(zmd, dzone); 330 + dmz_unlock_zone_reclaim(dzone); 331 + dmz_free_zone(zmd, dzone); 332 + dmz_unlock_map(zmd); 333 + dmz_unlock_flush(zmd); 334 + } 335 + 336 + /* 337 + * Find a candidate zone for reclaim and process it. 
338 + */ 339 + static void dmz_reclaim(struct dmz_reclaim *zrc) 340 + { 341 + struct dmz_metadata *zmd = zrc->metadata; 342 + struct dm_zone *dzone; 343 + struct dm_zone *rzone; 344 + unsigned long start; 345 + int ret; 346 + 347 + /* Get a data zone */ 348 + dzone = dmz_get_zone_for_reclaim(zmd); 349 + if (!dzone) 350 + return; 351 + 352 + start = jiffies; 353 + 354 + if (dmz_is_rnd(dzone)) { 355 + if (!dmz_weight(dzone)) { 356 + /* Empty zone */ 357 + dmz_reclaim_empty(zrc, dzone); 358 + ret = 0; 359 + } else { 360 + /* 361 + * Reclaim the random data zone by moving its 362 + * valid data blocks to a free sequential zone. 363 + */ 364 + ret = dmz_reclaim_rnd_data(zrc, dzone); 365 + } 366 + rzone = dzone; 367 + 368 + } else { 369 + struct dm_zone *bzone = dzone->bzone; 370 + sector_t chunk_block = 0; 371 + 372 + ret = dmz_first_valid_block(zmd, bzone, &chunk_block); 373 + if (ret < 0) 374 + goto out; 375 + 376 + if (ret == 0 || chunk_block >= dzone->wp_block) { 377 + /* 378 + * The buffer zone is empty or its valid blocks are 379 + * after the data zone write pointer. 380 + */ 381 + ret = dmz_reclaim_buf(zrc, dzone); 382 + rzone = bzone; 383 + } else { 384 + /* 385 + * Reclaim the data zone by merging it into the 386 + * buffer zone so that the buffer zone itself can 387 + * be later reclaimed. 388 + */ 389 + ret = dmz_reclaim_seq_data(zrc, dzone); 390 + rzone = dzone; 391 + } 392 + } 393 + out: 394 + if (ret) { 395 + dmz_unlock_zone_reclaim(dzone); 396 + return; 397 + } 398 + 399 + (void) dmz_flush_metadata(zrc->metadata); 400 + 401 + dmz_dev_debug(zrc->dev, "Reclaimed zone %u in %u ms", 402 + dmz_id(zmd, rzone), jiffies_to_msecs(jiffies - start)); 403 + } 404 + 405 + /* 406 + * Test if the target device is idle. 407 + */ 408 + static inline int dmz_target_idle(struct dmz_reclaim *zrc) 409 + { 410 + return time_is_before_jiffies(zrc->atime + DMZ_IDLE_PERIOD); 411 + } 412 + 413 + /* 414 + * Test if reclaim is necessary. 
415 + */
416 + static bool dmz_should_reclaim(struct dmz_reclaim *zrc)
417 + {
418 + struct dmz_metadata *zmd = zrc->metadata;
419 + unsigned int nr_rnd = dmz_nr_rnd_zones(zmd);
420 + unsigned int nr_unmap_rnd = dmz_nr_unmap_rnd_zones(zmd);
421 + unsigned int p_unmap_rnd = nr_unmap_rnd * 100 / nr_rnd;
422 +
423 + /* Reclaim when idle */
424 + if (dmz_target_idle(zrc) && nr_unmap_rnd < nr_rnd)
425 + return true;
426 +
427 + /* If there are still plenty of random zones, do not reclaim */
428 + if (p_unmap_rnd >= DMZ_RECLAIM_HIGH_UNMAP_RND)
429 + return false;
430 +
431 + /*
432 + * If the percentage of unmapped random zones is low,
433 + * reclaim even if the target is busy.
434 + */
435 + return p_unmap_rnd <= DMZ_RECLAIM_LOW_UNMAP_RND;
436 + }
437 +
438 + /*
439 + * Reclaim work function.
440 + */
441 + static void dmz_reclaim_work(struct work_struct *work)
442 + {
443 + struct dmz_reclaim *zrc = container_of(work, struct dmz_reclaim, work.work);
444 + struct dmz_metadata *zmd = zrc->metadata;
445 + unsigned int nr_rnd, nr_unmap_rnd;
446 + unsigned int p_unmap_rnd;
447 +
448 + if (!dmz_should_reclaim(zrc)) {
449 + mod_delayed_work(zrc->wq, &zrc->work, DMZ_IDLE_PERIOD);
450 + return;
451 + }
452 +
453 + /*
454 + * We need to start reclaiming random zones: set up zone copy
455 + * throttling to run at full speed if we are very low on free random
456 + * zones, or more slowly if some free random zones remain, to limit
457 + * the impact on the user workload.
458 + */ 459 + nr_rnd = dmz_nr_rnd_zones(zmd); 460 + nr_unmap_rnd = dmz_nr_unmap_rnd_zones(zmd); 461 + p_unmap_rnd = nr_unmap_rnd * 100 / nr_rnd; 462 + if (dmz_target_idle(zrc) || p_unmap_rnd < DMZ_RECLAIM_LOW_UNMAP_RND / 2) { 463 + /* Idle or very low percentage: go fast */ 464 + zrc->kc_throttle.throttle = 100; 465 + } else { 466 + /* Busy but we still have some random zone: throttle */ 467 + zrc->kc_throttle.throttle = min(75U, 100U - p_unmap_rnd / 2); 468 + } 469 + 470 + dmz_dev_debug(zrc->dev, 471 + "Reclaim (%u): %s, %u%% free rnd zones (%u/%u)", 472 + zrc->kc_throttle.throttle, 473 + (dmz_target_idle(zrc) ? "Idle" : "Busy"), 474 + p_unmap_rnd, nr_unmap_rnd, nr_rnd); 475 + 476 + dmz_reclaim(zrc); 477 + 478 + dmz_schedule_reclaim(zrc); 479 + } 480 + 481 + /* 482 + * Initialize reclaim. 483 + */ 484 + int dmz_ctr_reclaim(struct dmz_dev *dev, struct dmz_metadata *zmd, 485 + struct dmz_reclaim **reclaim) 486 + { 487 + struct dmz_reclaim *zrc; 488 + int ret; 489 + 490 + zrc = kzalloc(sizeof(struct dmz_reclaim), GFP_KERNEL); 491 + if (!zrc) 492 + return -ENOMEM; 493 + 494 + zrc->dev = dev; 495 + zrc->metadata = zmd; 496 + zrc->atime = jiffies; 497 + 498 + /* Reclaim kcopyd client */ 499 + zrc->kc = dm_kcopyd_client_create(&zrc->kc_throttle); 500 + if (IS_ERR(zrc->kc)) { 501 + ret = PTR_ERR(zrc->kc); 502 + zrc->kc = NULL; 503 + goto err; 504 + } 505 + 506 + /* Reclaim work */ 507 + INIT_DELAYED_WORK(&zrc->work, dmz_reclaim_work); 508 + zrc->wq = alloc_ordered_workqueue("dmz_rwq_%s", WQ_MEM_RECLAIM, 509 + dev->name); 510 + if (!zrc->wq) { 511 + ret = -ENOMEM; 512 + goto err; 513 + } 514 + 515 + *reclaim = zrc; 516 + queue_delayed_work(zrc->wq, &zrc->work, 0); 517 + 518 + return 0; 519 + err: 520 + if (zrc->kc) 521 + dm_kcopyd_client_destroy(zrc->kc); 522 + kfree(zrc); 523 + 524 + return ret; 525 + } 526 + 527 + /* 528 + * Terminate reclaim. 
529 + */ 530 + void dmz_dtr_reclaim(struct dmz_reclaim *zrc) 531 + { 532 + cancel_delayed_work_sync(&zrc->work); 533 + destroy_workqueue(zrc->wq); 534 + dm_kcopyd_client_destroy(zrc->kc); 535 + kfree(zrc); 536 + } 537 + 538 + /* 539 + * Suspend reclaim. 540 + */ 541 + void dmz_suspend_reclaim(struct dmz_reclaim *zrc) 542 + { 543 + cancel_delayed_work_sync(&zrc->work); 544 + } 545 + 546 + /* 547 + * Resume reclaim. 548 + */ 549 + void dmz_resume_reclaim(struct dmz_reclaim *zrc) 550 + { 551 + queue_delayed_work(zrc->wq, &zrc->work, DMZ_IDLE_PERIOD); 552 + } 553 + 554 + /* 555 + * BIO accounting. 556 + */ 557 + void dmz_reclaim_bio_acc(struct dmz_reclaim *zrc) 558 + { 559 + zrc->atime = jiffies; 560 + } 561 + 562 + /* 563 + * Start reclaim if necessary. 564 + */ 565 + void dmz_schedule_reclaim(struct dmz_reclaim *zrc) 566 + { 567 + if (dmz_should_reclaim(zrc)) 568 + mod_delayed_work(zrc->wq, &zrc->work, 0); 569 + } 570 +
+967
drivers/md/dm-zoned-target.c
··· 1 + /* 2 + * Copyright (C) 2017 Western Digital Corporation or its affiliates. 3 + * 4 + * This file is released under the GPL. 5 + */ 6 + 7 + #include "dm-zoned.h" 8 + 9 + #include <linux/module.h> 10 + 11 + #define DM_MSG_PREFIX "zoned" 12 + 13 + #define DMZ_MIN_BIOS 8192 14 + 15 + /* 16 + * Zone BIO context. 17 + */ 18 + struct dmz_bioctx { 19 + struct dmz_target *target; 20 + struct dm_zone *zone; 21 + struct bio *bio; 22 + atomic_t ref; 23 + blk_status_t status; 24 + }; 25 + 26 + /* 27 + * Chunk work descriptor. 28 + */ 29 + struct dm_chunk_work { 30 + struct work_struct work; 31 + atomic_t refcount; 32 + struct dmz_target *target; 33 + unsigned int chunk; 34 + struct bio_list bio_list; 35 + }; 36 + 37 + /* 38 + * Target descriptor. 39 + */ 40 + struct dmz_target { 41 + struct dm_dev *ddev; 42 + 43 + unsigned long flags; 44 + 45 + /* Zoned block device information */ 46 + struct dmz_dev *dev; 47 + 48 + /* For metadata handling */ 49 + struct dmz_metadata *metadata; 50 + 51 + /* For reclaim */ 52 + struct dmz_reclaim *reclaim; 53 + 54 + /* For chunk work */ 55 + struct mutex chunk_lock; 56 + struct radix_tree_root chunk_rxtree; 57 + struct workqueue_struct *chunk_wq; 58 + 59 + /* For cloned BIOs to zones */ 60 + struct bio_set *bio_set; 61 + 62 + /* For flush */ 63 + spinlock_t flush_lock; 64 + struct bio_list flush_list; 65 + struct delayed_work flush_work; 66 + struct workqueue_struct *flush_wq; 67 + }; 68 + 69 + /* 70 + * Flush intervals (seconds). 71 + */ 72 + #define DMZ_FLUSH_PERIOD (10 * HZ) 73 + 74 + /* 75 + * Target BIO completion. 76 + */ 77 + static inline void dmz_bio_endio(struct bio *bio, blk_status_t status) 78 + { 79 + struct dmz_bioctx *bioctx = dm_per_bio_data(bio, sizeof(struct dmz_bioctx)); 80 + 81 + if (bioctx->status == BLK_STS_OK && status != BLK_STS_OK) 82 + bioctx->status = status; 83 + bio_endio(bio); 84 + } 85 + 86 + /* 87 + * Partial clone read BIO completion callback. 
This terminates the 88 + * target BIO when there are no more references to its context. 89 + */ 90 + static void dmz_read_bio_end_io(struct bio *bio) 91 + { 92 + struct dmz_bioctx *bioctx = bio->bi_private; 93 + blk_status_t status = bio->bi_status; 94 + 95 + bio_put(bio); 96 + dmz_bio_endio(bioctx->bio, status); 97 + } 98 + 99 + /* 100 + * Issue a BIO to a zone. The BIO may only partially process the 101 + * original target BIO. 102 + */ 103 + static int dmz_submit_read_bio(struct dmz_target *dmz, struct dm_zone *zone, 104 + struct bio *bio, sector_t chunk_block, 105 + unsigned int nr_blocks) 106 + { 107 + struct dmz_bioctx *bioctx = dm_per_bio_data(bio, sizeof(struct dmz_bioctx)); 108 + sector_t sector; 109 + struct bio *clone; 110 + 111 + /* BIO remap sector */ 112 + sector = dmz_start_sect(dmz->metadata, zone) + dmz_blk2sect(chunk_block); 113 + 114 + /* If the read is not partial, there is no need to clone the BIO */ 115 + if (nr_blocks == dmz_bio_blocks(bio)) { 116 + /* Setup and submit the BIO */ 117 + bio->bi_iter.bi_sector = sector; 118 + atomic_inc(&bioctx->ref); 119 + generic_make_request(bio); 120 + return 0; 121 + } 122 + 123 + /* Partial BIO: we need to clone the BIO */ 124 + clone = bio_clone_fast(bio, GFP_NOIO, dmz->bio_set); 125 + if (!clone) 126 + return -ENOMEM; 127 + 128 + /* Setup the clone */ 129 + clone->bi_iter.bi_sector = sector; 130 + clone->bi_iter.bi_size = dmz_blk2sect(nr_blocks) << SECTOR_SHIFT; 131 + clone->bi_end_io = dmz_read_bio_end_io; 132 + clone->bi_private = bioctx; 133 + 134 + bio_advance(bio, clone->bi_iter.bi_size); 135 + 136 + /* Submit the clone */ 137 + atomic_inc(&bioctx->ref); 138 + generic_make_request(clone); 139 + 140 + return 0; 141 + } 142 + 143 + /* 144 + * Zero out pages of discarded blocks accessed by a read BIO. 
145 + */
146 + static void dmz_handle_read_zero(struct dmz_target *dmz, struct bio *bio,
147 + sector_t chunk_block, unsigned int nr_blocks)
148 + {
149 + unsigned int size = nr_blocks << DMZ_BLOCK_SHIFT;
150 +
151 + /* Clear nr_blocks */
152 + swap(bio->bi_iter.bi_size, size);
153 + zero_fill_bio(bio);
154 + swap(bio->bi_iter.bi_size, size);
155 +
156 + bio_advance(bio, size);
157 + }
158 +
159 + /*
160 + * Process a read BIO.
161 + */
162 + static int dmz_handle_read(struct dmz_target *dmz, struct dm_zone *zone,
163 + struct bio *bio)
164 + {
165 + sector_t chunk_block = dmz_chunk_block(dmz->dev, dmz_bio_block(bio));
166 + unsigned int nr_blocks = dmz_bio_blocks(bio);
167 + sector_t end_block = chunk_block + nr_blocks;
168 + struct dm_zone *rzone, *bzone;
169 + int ret;
170 +
171 + /* Reads to unmapped chunks only need to zero the BIO buffer */
172 + if (!zone) {
173 + zero_fill_bio(bio);
174 + return 0;
175 + }
176 +
177 + dmz_dev_debug(dmz->dev, "READ chunk %llu -> %s zone %u, block %llu, %u blocks",
178 + (unsigned long long)dmz_bio_chunk(dmz->dev, bio),
179 + (dmz_is_rnd(zone) ? "RND" : "SEQ"),
180 + dmz_id(dmz->metadata, zone),
181 + (unsigned long long)chunk_block, nr_blocks);
182 +
183 + /* Check block validity to determine the read location */
184 + bzone = zone->bzone;
185 + while (chunk_block < end_block) {
186 + nr_blocks = 0;
187 + if (dmz_is_rnd(zone) || chunk_block < zone->wp_block) {
188 + /* Test block validity in the data zone */
189 + ret = dmz_block_valid(dmz->metadata, zone, chunk_block);
190 + if (ret < 0)
191 + return ret;
192 + if (ret > 0) {
193 + /* Read data zone blocks */
194 + nr_blocks = ret;
195 + rzone = zone;
196 + }
197 + }
198 +
199 + /*
200 + * No valid blocks found in the data zone.
201 + * Check the buffer zone, if there is one.
202 + */ 203 + if (!nr_blocks && bzone) { 204 + ret = dmz_block_valid(dmz->metadata, bzone, chunk_block); 205 + if (ret < 0) 206 + return ret; 207 + if (ret > 0) { 208 + /* Read buffer zone blocks */ 209 + nr_blocks = ret; 210 + rzone = bzone; 211 + } 212 + } 213 + 214 + if (nr_blocks) { 215 + /* Valid blocks found: read them */ 216 + nr_blocks = min_t(unsigned int, nr_blocks, end_block - chunk_block); 217 + ret = dmz_submit_read_bio(dmz, rzone, bio, chunk_block, nr_blocks); 218 + if (ret) 219 + return ret; 220 + chunk_block += nr_blocks; 221 + } else { 222 + /* No valid block: zeroout the current BIO block */ 223 + dmz_handle_read_zero(dmz, bio, chunk_block, 1); 224 + chunk_block++; 225 + } 226 + } 227 + 228 + return 0; 229 + } 230 + 231 + /* 232 + * Issue a write BIO to a zone. 233 + */ 234 + static void dmz_submit_write_bio(struct dmz_target *dmz, struct dm_zone *zone, 235 + struct bio *bio, sector_t chunk_block, 236 + unsigned int nr_blocks) 237 + { 238 + struct dmz_bioctx *bioctx = dm_per_bio_data(bio, sizeof(struct dmz_bioctx)); 239 + 240 + /* Setup and submit the BIO */ 241 + bio->bi_bdev = dmz->dev->bdev; 242 + bio->bi_iter.bi_sector = dmz_start_sect(dmz->metadata, zone) + dmz_blk2sect(chunk_block); 243 + atomic_inc(&bioctx->ref); 244 + generic_make_request(bio); 245 + 246 + if (dmz_is_seq(zone)) 247 + zone->wp_block += nr_blocks; 248 + } 249 + 250 + /* 251 + * Write blocks directly in a data zone, at the write pointer. 252 + * If a buffer zone is assigned, invalidate the blocks written 253 + * in place. 
254 + */ 255 + static int dmz_handle_direct_write(struct dmz_target *dmz, 256 + struct dm_zone *zone, struct bio *bio, 257 + sector_t chunk_block, 258 + unsigned int nr_blocks) 259 + { 260 + struct dmz_metadata *zmd = dmz->metadata; 261 + struct dm_zone *bzone = zone->bzone; 262 + int ret; 263 + 264 + if (dmz_is_readonly(zone)) 265 + return -EROFS; 266 + 267 + /* Submit write */ 268 + dmz_submit_write_bio(dmz, zone, bio, chunk_block, nr_blocks); 269 + 270 + /* 271 + * Validate the blocks in the data zone and invalidate 272 + * in the buffer zone, if there is one. 273 + */ 274 + ret = dmz_validate_blocks(zmd, zone, chunk_block, nr_blocks); 275 + if (ret == 0 && bzone) 276 + ret = dmz_invalidate_blocks(zmd, bzone, chunk_block, nr_blocks); 277 + 278 + return ret; 279 + } 280 + 281 + /* 282 + * Write blocks in the buffer zone of @zone. 283 + * If no buffer zone is assigned yet, get one. 284 + * Called with @zone write locked. 285 + */ 286 + static int dmz_handle_buffered_write(struct dmz_target *dmz, 287 + struct dm_zone *zone, struct bio *bio, 288 + sector_t chunk_block, 289 + unsigned int nr_blocks) 290 + { 291 + struct dmz_metadata *zmd = dmz->metadata; 292 + struct dm_zone *bzone; 293 + int ret; 294 + 295 + /* Get the buffer zone. One will be allocated if needed */ 296 + bzone = dmz_get_chunk_buffer(zmd, zone); 297 + if (!bzone) 298 + return -ENOSPC; 299 + 300 + if (dmz_is_readonly(bzone)) 301 + return -EROFS; 302 + 303 + /* Submit write */ 304 + dmz_submit_write_bio(dmz, bzone, bio, chunk_block, nr_blocks); 305 + 306 + /* 307 + * Validate the blocks in the buffer zone 308 + * and invalidate in the data zone. 309 + */ 310 + ret = dmz_validate_blocks(zmd, bzone, chunk_block, nr_blocks); 311 + if (ret == 0 && chunk_block < zone->wp_block) 312 + ret = dmz_invalidate_blocks(zmd, zone, chunk_block, nr_blocks); 313 + 314 + return ret; 315 + } 316 + 317 + /* 318 + * Process a write BIO. 
319 + */ 320 + static int dmz_handle_write(struct dmz_target *dmz, struct dm_zone *zone, 321 + struct bio *bio) 322 + { 323 + sector_t chunk_block = dmz_chunk_block(dmz->dev, dmz_bio_block(bio)); 324 + unsigned int nr_blocks = dmz_bio_blocks(bio); 325 + 326 + if (!zone) 327 + return -ENOSPC; 328 + 329 + dmz_dev_debug(dmz->dev, "WRITE chunk %llu -> %s zone %u, block %llu, %u blocks", 330 + (unsigned long long)dmz_bio_chunk(dmz->dev, bio), 331 + (dmz_is_rnd(zone) ? "RND" : "SEQ"), 332 + dmz_id(dmz->metadata, zone), 333 + (unsigned long long)chunk_block, nr_blocks); 334 + 335 + if (dmz_is_rnd(zone) || chunk_block == zone->wp_block) { 336 + /* 337 + * zone is a random zone or it is a sequential zone 338 + * and the BIO is aligned to the zone write pointer: 339 + * direct write the zone. 340 + */ 341 + return dmz_handle_direct_write(dmz, zone, bio, chunk_block, nr_blocks); 342 + } 343 + 344 + /* 345 + * This is an unaligned write in a sequential zone: 346 + * use buffered write. 347 + */ 348 + return dmz_handle_buffered_write(dmz, zone, bio, chunk_block, nr_blocks); 349 + } 350 + 351 + /* 352 + * Process a discard BIO. 353 + */ 354 + static int dmz_handle_discard(struct dmz_target *dmz, struct dm_zone *zone, 355 + struct bio *bio) 356 + { 357 + struct dmz_metadata *zmd = dmz->metadata; 358 + sector_t block = dmz_bio_block(bio); 359 + unsigned int nr_blocks = dmz_bio_blocks(bio); 360 + sector_t chunk_block = dmz_chunk_block(dmz->dev, block); 361 + int ret = 0; 362 + 363 + /* For unmapped chunks, there is nothing to do */ 364 + if (!zone) 365 + return 0; 366 + 367 + if (dmz_is_readonly(zone)) 368 + return -EROFS; 369 + 370 + dmz_dev_debug(dmz->dev, "DISCARD chunk %llu -> zone %u, block %llu, %u blocks", 371 + (unsigned long long)dmz_bio_chunk(dmz->dev, bio), 372 + dmz_id(zmd, zone), 373 + (unsigned long long)chunk_block, nr_blocks); 374 + 375 + /* 376 + * Invalidate blocks in the data zone and its 377 + * buffer zone if one is mapped. 
378 + */ 379 + if (dmz_is_rnd(zone) || chunk_block < zone->wp_block) 380 + ret = dmz_invalidate_blocks(zmd, zone, chunk_block, nr_blocks); 381 + if (ret == 0 && zone->bzone) 382 + ret = dmz_invalidate_blocks(zmd, zone->bzone, 383 + chunk_block, nr_blocks); 384 + return ret; 385 + } 386 + 387 + /* 388 + * Process a BIO. 389 + */ 390 + static void dmz_handle_bio(struct dmz_target *dmz, struct dm_chunk_work *cw, 391 + struct bio *bio) 392 + { 393 + struct dmz_bioctx *bioctx = dm_per_bio_data(bio, sizeof(struct dmz_bioctx)); 394 + struct dmz_metadata *zmd = dmz->metadata; 395 + struct dm_zone *zone; 396 + int ret; 397 + 398 + /* 399 + * Write may trigger a zone allocation. So make sure the 400 + * allocation can succeed. 401 + */ 402 + if (bio_op(bio) == REQ_OP_WRITE) 403 + dmz_schedule_reclaim(dmz->reclaim); 404 + 405 + dmz_lock_metadata(zmd); 406 + 407 + /* 408 + * Get the data zone mapping the chunk. There may be no 409 + * mapping for read and discard. If a mapping is obtained, 410 + * the zone returned will be set to active state. 411 + */ 412 + zone = dmz_get_chunk_mapping(zmd, dmz_bio_chunk(dmz->dev, bio), 413 + bio_op(bio)); 414 + if (IS_ERR(zone)) { 415 + ret = PTR_ERR(zone); 416 + goto out; 417 + } 418 + 419 + /* Process the BIO */ 420 + if (zone) { 421 + dmz_activate_zone(zone); 422 + bioctx->zone = zone; 423 + } 424 + 425 + switch (bio_op(bio)) { 426 + case REQ_OP_READ: 427 + ret = dmz_handle_read(dmz, zone, bio); 428 + break; 429 + case REQ_OP_WRITE: 430 + ret = dmz_handle_write(dmz, zone, bio); 431 + break; 432 + case REQ_OP_DISCARD: 433 + case REQ_OP_WRITE_ZEROES: 434 + ret = dmz_handle_discard(dmz, zone, bio); 435 + break; 436 + default: 437 + dmz_dev_err(dmz->dev, "Unsupported BIO operation 0x%x", 438 + bio_op(bio)); 439 + ret = -EIO; 440 + } 441 + 442 + /* 443 + * Release the chunk mapping. This will check that the mapping 444 + * is still valid, that is, that the zone used still has valid blocks.
445 + */ 446 + if (zone) 447 + dmz_put_chunk_mapping(zmd, zone); 448 + out: 449 + dmz_bio_endio(bio, errno_to_blk_status(ret)); 450 + 451 + dmz_unlock_metadata(zmd); 452 + } 453 + 454 + /* 455 + * Increment a chunk reference counter. 456 + */ 457 + static inline void dmz_get_chunk_work(struct dm_chunk_work *cw) 458 + { 459 + atomic_inc(&cw->refcount); 460 + } 461 + 462 + /* 463 + * Decrement a chunk work reference count and 464 + * free it if it becomes 0. 465 + */ 466 + static void dmz_put_chunk_work(struct dm_chunk_work *cw) 467 + { 468 + if (atomic_dec_and_test(&cw->refcount)) { 469 + WARN_ON(!bio_list_empty(&cw->bio_list)); 470 + radix_tree_delete(&cw->target->chunk_rxtree, cw->chunk); 471 + kfree(cw); 472 + } 473 + } 474 + 475 + /* 476 + * Chunk BIO work function. 477 + */ 478 + static void dmz_chunk_work(struct work_struct *work) 479 + { 480 + struct dm_chunk_work *cw = container_of(work, struct dm_chunk_work, work); 481 + struct dmz_target *dmz = cw->target; 482 + struct bio *bio; 483 + 484 + mutex_lock(&dmz->chunk_lock); 485 + 486 + /* Process the chunk BIOs */ 487 + while ((bio = bio_list_pop(&cw->bio_list))) { 488 + mutex_unlock(&dmz->chunk_lock); 489 + dmz_handle_bio(dmz, cw, bio); 490 + mutex_lock(&dmz->chunk_lock); 491 + dmz_put_chunk_work(cw); 492 + } 493 + 494 + /* Queueing the work incremented the work refcount */ 495 + dmz_put_chunk_work(cw); 496 + 497 + mutex_unlock(&dmz->chunk_lock); 498 + } 499 + 500 + /* 501 + * Flush work. 
502 + */ 503 + static void dmz_flush_work(struct work_struct *work) 504 + { 505 + struct dmz_target *dmz = container_of(work, struct dmz_target, flush_work.work); 506 + struct bio *bio; 507 + int ret; 508 + 509 + /* Flush dirty metadata blocks */ 510 + ret = dmz_flush_metadata(dmz->metadata); 511 + 512 + /* Process queued flush requests */ 513 + while (1) { 514 + spin_lock(&dmz->flush_lock); 515 + bio = bio_list_pop(&dmz->flush_list); 516 + spin_unlock(&dmz->flush_lock); 517 + 518 + if (!bio) 519 + break; 520 + 521 + dmz_bio_endio(bio, errno_to_blk_status(ret)); 522 + } 523 + 524 + queue_delayed_work(dmz->flush_wq, &dmz->flush_work, DMZ_FLUSH_PERIOD); 525 + } 526 + 527 + /* 528 + * Get a chunk work and start it to process a new BIO. 529 + * If the BIO chunk has no work yet, create one. 530 + */ 531 + static void dmz_queue_chunk_work(struct dmz_target *dmz, struct bio *bio) 532 + { 533 + unsigned int chunk = dmz_bio_chunk(dmz->dev, bio); 534 + struct dm_chunk_work *cw; 535 + 536 + mutex_lock(&dmz->chunk_lock); 537 + 538 + /* Get the BIO chunk work. If one is not active yet, create one */ 539 + cw = radix_tree_lookup(&dmz->chunk_rxtree, chunk); 540 + if (!cw) { 541 + int ret; 542 + 543 + /* Create a new chunk work */ 544 + cw = kmalloc(sizeof(struct dm_chunk_work), GFP_NOFS); 545 + if (!cw) 546 + goto out; 547 + 548 + INIT_WORK(&cw->work, dmz_chunk_work); 549 + atomic_set(&cw->refcount, 0); 550 + cw->target = dmz; 551 + cw->chunk = chunk; 552 + bio_list_init(&cw->bio_list); 553 + 554 + ret = radix_tree_insert(&dmz->chunk_rxtree, chunk, cw); 555 + if (unlikely(ret)) { 556 + kfree(cw); 557 + cw = NULL; 558 + goto out; 559 + } 560 + } 561 + 562 + bio_list_add(&cw->bio_list, bio); 563 + dmz_get_chunk_work(cw); 564 + 565 + if (queue_work(dmz->chunk_wq, &cw->work)) 566 + dmz_get_chunk_work(cw); 567 + out: 568 + mutex_unlock(&dmz->chunk_lock); 569 + } 570 + 571 + /* 572 + * Process a new BIO. 
573 + */ 574 + static int dmz_map(struct dm_target *ti, struct bio *bio) 575 + { 576 + struct dmz_target *dmz = ti->private; 577 + struct dmz_dev *dev = dmz->dev; 578 + struct dmz_bioctx *bioctx = dm_per_bio_data(bio, sizeof(struct dmz_bioctx)); 579 + sector_t sector = bio->bi_iter.bi_sector; 580 + unsigned int nr_sectors = bio_sectors(bio); 581 + sector_t chunk_sector; 582 + 583 + dmz_dev_debug(dev, "BIO op %d sector %llu + %u => chunk %llu, block %llu, %u blocks", 584 + bio_op(bio), (unsigned long long)sector, nr_sectors, 585 + (unsigned long long)dmz_bio_chunk(dmz->dev, bio), 586 + (unsigned long long)dmz_chunk_block(dmz->dev, dmz_bio_block(bio)), 587 + (unsigned int)dmz_bio_blocks(bio)); 588 + 589 + bio->bi_bdev = dev->bdev; 590 + 591 + if (!nr_sectors && (bio_op(bio) != REQ_OP_FLUSH) && (bio_op(bio) != REQ_OP_WRITE)) 592 + return DM_MAPIO_REMAPPED; 593 + 594 + /* The BIO should be block aligned */ 595 + if ((nr_sectors & DMZ_BLOCK_SECTORS_MASK) || (sector & DMZ_BLOCK_SECTORS_MASK)) 596 + return DM_MAPIO_KILL; 597 + 598 + /* Initialize the BIO context */ 599 + bioctx->target = dmz; 600 + bioctx->zone = NULL; 601 + bioctx->bio = bio; 602 + atomic_set(&bioctx->ref, 1); 603 + bioctx->status = BLK_STS_OK; 604 + 605 + /* Set the BIO pending in the flush list */ 606 + if (bio_op(bio) == REQ_OP_FLUSH || (!nr_sectors && bio_op(bio) == REQ_OP_WRITE)) { 607 + spin_lock(&dmz->flush_lock); 608 + bio_list_add(&dmz->flush_list, bio); 609 + spin_unlock(&dmz->flush_lock); 610 + mod_delayed_work(dmz->flush_wq, &dmz->flush_work, 0); 611 + return DM_MAPIO_SUBMITTED; 612 + } 613 + 614 + /* Split zone BIOs to fit entirely into a zone */ 615 + chunk_sector = sector & (dev->zone_nr_sectors - 1); 616 + if (chunk_sector + nr_sectors > dev->zone_nr_sectors) 617 + dm_accept_partial_bio(bio, dev->zone_nr_sectors - chunk_sector); 618 + 619 + /* Now ready to handle this BIO */ 620 + dmz_reclaim_bio_acc(dmz->reclaim); 621 + dmz_queue_chunk_work(dmz, bio); 622 + 623 + return 
DM_MAPIO_SUBMITTED; 624 + } 625 + 626 + /* 627 + * Completed target BIO processing. 628 + */ 629 + static int dmz_end_io(struct dm_target *ti, struct bio *bio, blk_status_t *error) 630 + { 631 + struct dmz_bioctx *bioctx = dm_per_bio_data(bio, sizeof(struct dmz_bioctx)); 632 + 633 + if (bioctx->status == BLK_STS_OK && *error) 634 + bioctx->status = *error; 635 + 636 + if (!atomic_dec_and_test(&bioctx->ref)) 637 + return DM_ENDIO_INCOMPLETE; 638 + 639 + /* Done */ 640 + bio->bi_status = bioctx->status; 641 + 642 + if (bioctx->zone) { 643 + struct dm_zone *zone = bioctx->zone; 644 + 645 + if (*error && bio_op(bio) == REQ_OP_WRITE) { 646 + if (dmz_is_seq(zone)) 647 + set_bit(DMZ_SEQ_WRITE_ERR, &zone->flags); 648 + } 649 + dmz_deactivate_zone(zone); 650 + } 651 + 652 + return DM_ENDIO_DONE; 653 + } 654 + 655 + /* 656 + * Get zoned device information. 657 + */ 658 + static int dmz_get_zoned_device(struct dm_target *ti, char *path) 659 + { 660 + struct dmz_target *dmz = ti->private; 661 + struct request_queue *q; 662 + struct dmz_dev *dev; 663 + int ret; 664 + 665 + /* Get the target device */ 666 + ret = dm_get_device(ti, path, dm_table_get_mode(ti->table), &dmz->ddev); 667 + if (ret) { 668 + ti->error = "Get target device failed"; 669 + dmz->ddev = NULL; 670 + return ret; 671 + } 672 + 673 + dev = kzalloc(sizeof(struct dmz_dev), GFP_KERNEL); 674 + if (!dev) { 675 + ret = -ENOMEM; 676 + goto err; 677 + } 678 + 679 + dev->bdev = dmz->ddev->bdev; 680 + (void)bdevname(dev->bdev, dev->name); 681 + 682 + if (bdev_zoned_model(dev->bdev) == BLK_ZONED_NONE) { 683 + ti->error = "Not a zoned block device"; 684 + ret = -EINVAL; 685 + goto err; 686 + } 687 + 688 + dev->capacity = i_size_read(dev->bdev->bd_inode) >> SECTOR_SHIFT; 689 + if (ti->begin || (ti->len != dev->capacity)) { 690 + ti->error = "Partial mapping not supported"; 691 + ret = -EINVAL; 692 + goto err; 693 + } 694 + 695 + q = bdev_get_queue(dev->bdev); 696 + dev->zone_nr_sectors = q->limits.chunk_sectors; 697 + 
dev->zone_nr_sectors_shift = ilog2(dev->zone_nr_sectors); 698 + 699 + dev->zone_nr_blocks = dmz_sect2blk(dev->zone_nr_sectors); 700 + dev->zone_nr_blocks_shift = ilog2(dev->zone_nr_blocks); 701 + 702 + dev->nr_zones = (dev->capacity + dev->zone_nr_sectors - 1) 703 + >> dev->zone_nr_sectors_shift; 704 + 705 + dmz->dev = dev; 706 + 707 + return 0; 708 + err: 709 + dm_put_device(ti, dmz->ddev); 710 + kfree(dev); 711 + 712 + return ret; 713 + } 714 + 715 + /* 716 + * Cleanup zoned device information. 717 + */ 718 + static void dmz_put_zoned_device(struct dm_target *ti) 719 + { 720 + struct dmz_target *dmz = ti->private; 721 + 722 + dm_put_device(ti, dmz->ddev); 723 + kfree(dmz->dev); 724 + dmz->dev = NULL; 725 + } 726 + 727 + /* 728 + * Setup target. 729 + */ 730 + static int dmz_ctr(struct dm_target *ti, unsigned int argc, char **argv) 731 + { 732 + struct dmz_target *dmz; 733 + struct dmz_dev *dev; 734 + int ret; 735 + 736 + /* Check arguments */ 737 + if (argc != 1) { 738 + ti->error = "Invalid argument count"; 739 + return -EINVAL; 740 + } 741 + 742 + /* Allocate and initialize the target descriptor */ 743 + dmz = kzalloc(sizeof(struct dmz_target), GFP_KERNEL); 744 + if (!dmz) { 745 + ti->error = "Unable to allocate the zoned target descriptor"; 746 + return -ENOMEM; 747 + } 748 + ti->private = dmz; 749 + 750 + /* Get the target zoned block device */ 751 + ret = dmz_get_zoned_device(ti, argv[0]); 752 + if (ret) { 753 + dmz->ddev = NULL; 754 + goto err; 755 + } 756 + 757 + /* Initialize metadata */ 758 + dev = dmz->dev; 759 + ret = dmz_ctr_metadata(dev, &dmz->metadata); 760 + if (ret) { 761 + ti->error = "Metadata initialization failed"; 762 + goto err_dev; 763 + } 764 + 765 + /* Set target (no write same support) */ 766 + ti->max_io_len = dev->zone_nr_sectors << 9; 767 + ti->num_flush_bios = 1; 768 + ti->num_discard_bios = 1; 769 + ti->num_write_zeroes_bios = 1; 770 + ti->per_io_data_size = sizeof(struct dmz_bioctx); 771 + ti->flush_supported = true; 772 + 
ti->discards_supported = true; 773 + ti->split_discard_bios = true; 774 + 775 + /* The exposed capacity is the number of chunks that can be mapped */ 776 + ti->len = (sector_t)dmz_nr_chunks(dmz->metadata) << dev->zone_nr_sectors_shift; 777 + 778 + /* Zone BIO */ 779 + dmz->bio_set = bioset_create(DMZ_MIN_BIOS, 0, 0); 780 + if (!dmz->bio_set) { 781 + ti->error = "Create BIO set failed"; 782 + ret = -ENOMEM; 783 + goto err_meta; 784 + } 785 + 786 + /* Chunk BIO work */ 787 + mutex_init(&dmz->chunk_lock); 788 + INIT_RADIX_TREE(&dmz->chunk_rxtree, GFP_NOFS); 789 + dmz->chunk_wq = alloc_workqueue("dmz_cwq_%s", WQ_MEM_RECLAIM | WQ_UNBOUND, 790 + 0, dev->name); 791 + if (!dmz->chunk_wq) { 792 + ti->error = "Create chunk workqueue failed"; 793 + ret = -ENOMEM; 794 + goto err_bio; 795 + } 796 + 797 + /* Flush work */ 798 + spin_lock_init(&dmz->flush_lock); 799 + bio_list_init(&dmz->flush_list); 800 + INIT_DELAYED_WORK(&dmz->flush_work, dmz_flush_work); 801 + dmz->flush_wq = alloc_ordered_workqueue("dmz_fwq_%s", WQ_MEM_RECLAIM, 802 + dev->name); 803 + if (!dmz->flush_wq) { 804 + ti->error = "Create flush workqueue failed"; 805 + ret = -ENOMEM; 806 + goto err_cwq; 807 + } 808 + mod_delayed_work(dmz->flush_wq, &dmz->flush_work, DMZ_FLUSH_PERIOD); 809 + 810 + /* Initialize reclaim */ 811 + ret = dmz_ctr_reclaim(dev, dmz->metadata, &dmz->reclaim); 812 + if (ret) { 813 + ti->error = "Zone reclaim initialization failed"; 814 + goto err_fwq; 815 + } 816 + 817 + dmz_dev_info(dev, "Target device: %llu 512-byte logical sectors (%llu blocks)", 818 + (unsigned long long)ti->len, 819 + (unsigned long long)dmz_sect2blk(ti->len)); 820 + 821 + return 0; 822 + err_fwq: 823 + destroy_workqueue(dmz->flush_wq); 824 + err_cwq: 825 + destroy_workqueue(dmz->chunk_wq); 826 + err_bio: 827 + bioset_free(dmz->bio_set); 828 + err_meta: 829 + dmz_dtr_metadata(dmz->metadata); 830 + err_dev: 831 + dmz_put_zoned_device(ti); 832 + err: 833 + kfree(dmz); 834 + 835 + return ret; 836 + } 837 + 838 + /* 839 + * 
Cleanup target. 840 + */ 841 + static void dmz_dtr(struct dm_target *ti) 842 + { 843 + struct dmz_target *dmz = ti->private; 844 + 845 + flush_workqueue(dmz->chunk_wq); 846 + destroy_workqueue(dmz->chunk_wq); 847 + 848 + dmz_dtr_reclaim(dmz->reclaim); 849 + 850 + cancel_delayed_work_sync(&dmz->flush_work); 851 + destroy_workqueue(dmz->flush_wq); 852 + 853 + (void) dmz_flush_metadata(dmz->metadata); 854 + 855 + dmz_dtr_metadata(dmz->metadata); 856 + 857 + bioset_free(dmz->bio_set); 858 + 859 + dmz_put_zoned_device(ti); 860 + 861 + kfree(dmz); 862 + } 863 + 864 + /* 865 + * Setup target request queue limits. 866 + */ 867 + static void dmz_io_hints(struct dm_target *ti, struct queue_limits *limits) 868 + { 869 + struct dmz_target *dmz = ti->private; 870 + unsigned int chunk_sectors = dmz->dev->zone_nr_sectors; 871 + 872 + limits->logical_block_size = DMZ_BLOCK_SIZE; 873 + limits->physical_block_size = DMZ_BLOCK_SIZE; 874 + 875 + blk_limits_io_min(limits, DMZ_BLOCK_SIZE); 876 + blk_limits_io_opt(limits, DMZ_BLOCK_SIZE); 877 + 878 + limits->discard_alignment = DMZ_BLOCK_SIZE; 879 + limits->discard_granularity = DMZ_BLOCK_SIZE; 880 + limits->max_discard_sectors = chunk_sectors; 881 + limits->max_hw_discard_sectors = chunk_sectors; 882 + limits->max_write_zeroes_sectors = chunk_sectors; 883 + 884 + /* FS hint to try to align to the device zone size */ 885 + limits->chunk_sectors = chunk_sectors; 886 + limits->max_sectors = chunk_sectors; 887 + 888 + /* We are exposing a drive-managed zoned block device */ 889 + limits->zoned = BLK_ZONED_NONE; 890 + } 891 + 892 + /* 893 + * Pass on ioctl to the backend device. 894 + */ 895 + static int dmz_prepare_ioctl(struct dm_target *ti, 896 + struct block_device **bdev, fmode_t *mode) 897 + { 898 + struct dmz_target *dmz = ti->private; 899 + 900 + *bdev = dmz->dev->bdev; 901 + 902 + return 0; 903 + } 904 + 905 + /* 906 + * Stop works on suspend. 
907 + */ 908 + static void dmz_suspend(struct dm_target *ti) 909 + { 910 + struct dmz_target *dmz = ti->private; 911 + 912 + flush_workqueue(dmz->chunk_wq); 913 + dmz_suspend_reclaim(dmz->reclaim); 914 + cancel_delayed_work_sync(&dmz->flush_work); 915 + } 916 + 917 + /* 918 + * Restart works on resume or if suspend failed. 919 + */ 920 + static void dmz_resume(struct dm_target *ti) 921 + { 922 + struct dmz_target *dmz = ti->private; 923 + 924 + queue_delayed_work(dmz->flush_wq, &dmz->flush_work, DMZ_FLUSH_PERIOD); 925 + dmz_resume_reclaim(dmz->reclaim); 926 + } 927 + 928 + static int dmz_iterate_devices(struct dm_target *ti, 929 + iterate_devices_callout_fn fn, void *data) 930 + { 931 + struct dmz_target *dmz = ti->private; 932 + 933 + return fn(ti, dmz->ddev, 0, dmz->dev->capacity, data); 934 + } 935 + 936 + static struct target_type dmz_type = { 937 + .name = "zoned", 938 + .version = {1, 0, 0}, 939 + .features = DM_TARGET_SINGLETON | DM_TARGET_ZONED_HM, 940 + .module = THIS_MODULE, 941 + .ctr = dmz_ctr, 942 + .dtr = dmz_dtr, 943 + .map = dmz_map, 944 + .end_io = dmz_end_io, 945 + .io_hints = dmz_io_hints, 946 + .prepare_ioctl = dmz_prepare_ioctl, 947 + .postsuspend = dmz_suspend, 948 + .resume = dmz_resume, 949 + .iterate_devices = dmz_iterate_devices, 950 + }; 951 + 952 + static int __init dmz_init(void) 953 + { 954 + return dm_register_target(&dmz_type); 955 + } 956 + 957 + static void __exit dmz_exit(void) 958 + { 959 + dm_unregister_target(&dmz_type); 960 + } 961 + 962 + module_init(dmz_init); 963 + module_exit(dmz_exit); 964 + 965 + MODULE_DESCRIPTION(DM_NAME " target for zoned block devices"); 966 + MODULE_AUTHOR("Damien Le Moal <damien.lemoal@wdc.com>"); 967 + MODULE_LICENSE("GPL");
+228
drivers/md/dm-zoned.h
··· 1 + /* 2 + * Copyright (C) 2017 Western Digital Corporation or its affiliates. 3 + * 4 + * This file is released under the GPL. 5 + */ 6 + 7 + #ifndef DM_ZONED_H 8 + #define DM_ZONED_H 9 + 10 + #include <linux/types.h> 11 + #include <linux/blkdev.h> 12 + #include <linux/device-mapper.h> 13 + #include <linux/dm-kcopyd.h> 14 + #include <linux/list.h> 15 + #include <linux/spinlock.h> 16 + #include <linux/mutex.h> 17 + #include <linux/workqueue.h> 18 + #include <linux/rwsem.h> 19 + #include <linux/rbtree.h> 20 + #include <linux/radix-tree.h> 21 + #include <linux/shrinker.h> 22 + 23 + /* 24 + * dm-zoned creates block devices with 4KB blocks, always. 25 + */ 26 + #define DMZ_BLOCK_SHIFT 12 27 + #define DMZ_BLOCK_SIZE (1 << DMZ_BLOCK_SHIFT) 28 + #define DMZ_BLOCK_MASK (DMZ_BLOCK_SIZE - 1) 29 + 30 + #define DMZ_BLOCK_SHIFT_BITS (DMZ_BLOCK_SHIFT + 3) 31 + #define DMZ_BLOCK_SIZE_BITS (1 << DMZ_BLOCK_SHIFT_BITS) 32 + #define DMZ_BLOCK_MASK_BITS (DMZ_BLOCK_SIZE_BITS - 1) 33 + 34 + #define DMZ_BLOCK_SECTORS_SHIFT (DMZ_BLOCK_SHIFT - SECTOR_SHIFT) 35 + #define DMZ_BLOCK_SECTORS (DMZ_BLOCK_SIZE >> SECTOR_SHIFT) 36 + #define DMZ_BLOCK_SECTORS_MASK (DMZ_BLOCK_SECTORS - 1) 37 + 38 + /* 39 + * 4KB block <-> 512B sector conversion. 40 + */ 41 + #define dmz_blk2sect(b) ((sector_t)(b) << DMZ_BLOCK_SECTORS_SHIFT) 42 + #define dmz_sect2blk(s) ((sector_t)(s) >> DMZ_BLOCK_SECTORS_SHIFT) 43 + 44 + #define dmz_bio_block(bio) dmz_sect2blk((bio)->bi_iter.bi_sector) 45 + #define dmz_bio_blocks(bio) dmz_sect2blk(bio_sectors(bio)) 46 + 47 + /* 48 + * Zoned block device information. 
49 + */ 50 + struct dmz_dev { 51 + struct block_device *bdev; 52 + 53 + char name[BDEVNAME_SIZE]; 54 + 55 + sector_t capacity; 56 + 57 + unsigned int nr_zones; 58 + 59 + sector_t zone_nr_sectors; 60 + unsigned int zone_nr_sectors_shift; 61 + 62 + sector_t zone_nr_blocks; 63 + sector_t zone_nr_blocks_shift; 64 + }; 65 + 66 + #define dmz_bio_chunk(dev, bio) ((bio)->bi_iter.bi_sector >> \ 67 + (dev)->zone_nr_sectors_shift) 68 + #define dmz_chunk_block(dev, b) ((b) & ((dev)->zone_nr_blocks - 1)) 69 + 70 + /* 71 + * Zone descriptor. 72 + */ 73 + struct dm_zone { 74 + /* For listing the zone depending on its state */ 75 + struct list_head link; 76 + 77 + /* Zone type and state */ 78 + unsigned long flags; 79 + 80 + /* Zone activation reference count */ 81 + atomic_t refcount; 82 + 83 + /* Zone write pointer block (relative to the zone start block) */ 84 + unsigned int wp_block; 85 + 86 + /* Zone weight (number of valid blocks in the zone) */ 87 + unsigned int weight; 88 + 89 + /* The chunk that the zone maps */ 90 + unsigned int chunk; 91 + 92 + /* 93 + * For a sequential data zone, pointer to the random zone 94 + * used as a buffer for processing unaligned writes. 95 + * For a buffer zone, this points back to the data zone. 96 + */ 97 + struct dm_zone *bzone; 98 + }; 99 + 100 + /* 101 + * Zone flags. 102 + */ 103 + enum { 104 + /* Zone write type */ 105 + DMZ_RND, 106 + DMZ_SEQ, 107 + 108 + /* Zone critical condition */ 109 + DMZ_OFFLINE, 110 + DMZ_READ_ONLY, 111 + 112 + /* How the zone is being used */ 113 + DMZ_META, 114 + DMZ_DATA, 115 + DMZ_BUF, 116 + 117 + /* Zone internal state */ 118 + DMZ_ACTIVE, 119 + DMZ_RECLAIM, 120 + DMZ_SEQ_WRITE_ERR, 121 + }; 122 + 123 + /* 124 + * Zone data accessors. 
125 + */ 126 + #define dmz_is_rnd(z) test_bit(DMZ_RND, &(z)->flags) 127 + #define dmz_is_seq(z) test_bit(DMZ_SEQ, &(z)->flags) 128 + #define dmz_is_empty(z) ((z)->wp_block == 0) 129 + #define dmz_is_offline(z) test_bit(DMZ_OFFLINE, &(z)->flags) 130 + #define dmz_is_readonly(z) test_bit(DMZ_READ_ONLY, &(z)->flags) 131 + #define dmz_is_active(z) test_bit(DMZ_ACTIVE, &(z)->flags) 132 + #define dmz_in_reclaim(z) test_bit(DMZ_RECLAIM, &(z)->flags) 133 + #define dmz_seq_write_err(z) test_bit(DMZ_SEQ_WRITE_ERR, &(z)->flags) 134 + 135 + #define dmz_is_meta(z) test_bit(DMZ_META, &(z)->flags) 136 + #define dmz_is_buf(z) test_bit(DMZ_BUF, &(z)->flags) 137 + #define dmz_is_data(z) test_bit(DMZ_DATA, &(z)->flags) 138 + 139 + #define dmz_weight(z) ((z)->weight) 140 + 141 + /* 142 + * Message functions. 143 + */ 144 + #define dmz_dev_info(dev, format, args...) \ 145 + DMINFO("(%s): " format, (dev)->name, ## args) 146 + 147 + #define dmz_dev_err(dev, format, args...) \ 148 + DMERR("(%s): " format, (dev)->name, ## args) 149 + 150 + #define dmz_dev_warn(dev, format, args...) \ 151 + DMWARN("(%s): " format, (dev)->name, ## args) 152 + 153 + #define dmz_dev_debug(dev, format, args...) 
\ 154 + DMDEBUG("(%s): " format, (dev)->name, ## args) 155 + 156 + struct dmz_metadata; 157 + struct dmz_reclaim; 158 + 159 + /* 160 + * Functions defined in dm-zoned-metadata.c 161 + */ 162 + int dmz_ctr_metadata(struct dmz_dev *dev, struct dmz_metadata **zmd); 163 + void dmz_dtr_metadata(struct dmz_metadata *zmd); 164 + int dmz_resume_metadata(struct dmz_metadata *zmd); 165 + 166 + void dmz_lock_map(struct dmz_metadata *zmd); 167 + void dmz_unlock_map(struct dmz_metadata *zmd); 168 + void dmz_lock_metadata(struct dmz_metadata *zmd); 169 + void dmz_unlock_metadata(struct dmz_metadata *zmd); 170 + void dmz_lock_flush(struct dmz_metadata *zmd); 171 + void dmz_unlock_flush(struct dmz_metadata *zmd); 172 + int dmz_flush_metadata(struct dmz_metadata *zmd); 173 + 174 + unsigned int dmz_id(struct dmz_metadata *zmd, struct dm_zone *zone); 175 + sector_t dmz_start_sect(struct dmz_metadata *zmd, struct dm_zone *zone); 176 + sector_t dmz_start_block(struct dmz_metadata *zmd, struct dm_zone *zone); 177 + unsigned int dmz_nr_chunks(struct dmz_metadata *zmd); 178 + 179 + #define DMZ_ALLOC_RND 0x01 180 + #define DMZ_ALLOC_RECLAIM 0x02 181 + 182 + struct dm_zone *dmz_alloc_zone(struct dmz_metadata *zmd, unsigned long flags); 183 + void dmz_free_zone(struct dmz_metadata *zmd, struct dm_zone *zone); 184 + 185 + void dmz_map_zone(struct dmz_metadata *zmd, struct dm_zone *zone, 186 + unsigned int chunk); 187 + void dmz_unmap_zone(struct dmz_metadata *zmd, struct dm_zone *zone); 188 + unsigned int dmz_nr_rnd_zones(struct dmz_metadata *zmd); 189 + unsigned int dmz_nr_unmap_rnd_zones(struct dmz_metadata *zmd); 190 + 191 + void dmz_activate_zone(struct dm_zone *zone); 192 + void dmz_deactivate_zone(struct dm_zone *zone); 193 + 194 + int dmz_lock_zone_reclaim(struct dm_zone *zone); 195 + void dmz_unlock_zone_reclaim(struct dm_zone *zone); 196 + struct dm_zone *dmz_get_zone_for_reclaim(struct dmz_metadata *zmd); 197 + 198 + struct dm_zone *dmz_get_chunk_mapping(struct dmz_metadata *zmd, 
199 + unsigned int chunk, int op); 200 + void dmz_put_chunk_mapping(struct dmz_metadata *zmd, struct dm_zone *zone); 201 + struct dm_zone *dmz_get_chunk_buffer(struct dmz_metadata *zmd, 202 + struct dm_zone *dzone); 203 + 204 + int dmz_validate_blocks(struct dmz_metadata *zmd, struct dm_zone *zone, 205 + sector_t chunk_block, unsigned int nr_blocks); 206 + int dmz_invalidate_blocks(struct dmz_metadata *zmd, struct dm_zone *zone, 207 + sector_t chunk_block, unsigned int nr_blocks); 208 + int dmz_block_valid(struct dmz_metadata *zmd, struct dm_zone *zone, 209 + sector_t chunk_block); 210 + int dmz_first_valid_block(struct dmz_metadata *zmd, struct dm_zone *zone, 211 + sector_t *chunk_block); 212 + int dmz_copy_valid_blocks(struct dmz_metadata *zmd, struct dm_zone *from_zone, 213 + struct dm_zone *to_zone); 214 + int dmz_merge_valid_blocks(struct dmz_metadata *zmd, struct dm_zone *from_zone, 215 + struct dm_zone *to_zone, sector_t chunk_block); 216 + 217 + /* 218 + * Functions defined in dm-zoned-reclaim.c 219 + */ 220 + int dmz_ctr_reclaim(struct dmz_dev *dev, struct dmz_metadata *zmd, 221 + struct dmz_reclaim **zrc); 222 + void dmz_dtr_reclaim(struct dmz_reclaim *zrc); 223 + void dmz_suspend_reclaim(struct dmz_reclaim *zrc); 224 + void dmz_resume_reclaim(struct dmz_reclaim *zrc); 225 + void dmz_reclaim_bio_acc(struct dmz_reclaim *zrc); 226 + void dmz_schedule_reclaim(struct dmz_reclaim *zrc); 227 + 228 + #endif /* DM_ZONED_H */
+95 -2
drivers/md/dm.c
··· 58 58 59 59 static struct workqueue_struct *deferred_remove_workqueue; 60 60 61 + atomic_t dm_global_event_nr = ATOMIC_INIT(0); 62 + DECLARE_WAIT_QUEUE_HEAD(dm_global_eventq); 63 + 61 64 /* 62 65 * One of these is allocated per bio. 63 66 */ ··· 1013 1010 EXPORT_SYMBOL_GPL(dm_accept_partial_bio); 1014 1011 1015 1012 /* 1013 + * The zone descriptors obtained with a zone report indicate 1014 + * zone positions within the target device. The zone descriptors 1015 + * must be remapped to match their position within the dm device. 1016 + * A target may call dm_remap_zone_report after completion of a 1017 + * REQ_OP_ZONE_REPORT bio to remap the zone descriptors obtained 1018 + * from the target device mapping to the dm device. 1019 + */ 1020 + void dm_remap_zone_report(struct dm_target *ti, struct bio *bio, sector_t start) 1021 + { 1022 + #ifdef CONFIG_BLK_DEV_ZONED 1023 + struct dm_target_io *tio = container_of(bio, struct dm_target_io, clone); 1024 + struct bio *report_bio = tio->io->bio; 1025 + struct blk_zone_report_hdr *hdr = NULL; 1026 + struct blk_zone *zone; 1027 + unsigned int nr_rep = 0; 1028 + unsigned int ofst; 1029 + struct bio_vec bvec; 1030 + struct bvec_iter iter; 1031 + void *addr; 1032 + 1033 + if (bio->bi_status) 1034 + return; 1035 + 1036 + /* 1037 + * Remap the start sector of the reported zones. For sequential zones, 1038 + * also remap the write pointer position. 
1039 + */ 1040 + bio_for_each_segment(bvec, report_bio, iter) { 1041 + addr = kmap_atomic(bvec.bv_page); 1042 + 1043 + /* Remember the report header in the first page */ 1044 + if (!hdr) { 1045 + hdr = addr; 1046 + ofst = sizeof(struct blk_zone_report_hdr); 1047 + } else 1048 + ofst = 0; 1049 + 1050 + /* Set zones start sector */ 1051 + while (hdr->nr_zones && ofst < bvec.bv_len) { 1052 + zone = addr + ofst; 1053 + if (zone->start >= start + ti->len) { 1054 + hdr->nr_zones = 0; 1055 + break; 1056 + } 1057 + zone->start = zone->start + ti->begin - start; 1058 + if (zone->type != BLK_ZONE_TYPE_CONVENTIONAL) { 1059 + if (zone->cond == BLK_ZONE_COND_FULL) 1060 + zone->wp = zone->start + zone->len; 1061 + else if (zone->cond == BLK_ZONE_COND_EMPTY) 1062 + zone->wp = zone->start; 1063 + else 1064 + zone->wp = zone->wp + ti->begin - start; 1065 + } 1066 + ofst += sizeof(struct blk_zone); 1067 + hdr->nr_zones--; 1068 + nr_rep++; 1069 + } 1070 + 1071 + if (addr != hdr) 1072 + kunmap_atomic(addr); 1073 + 1074 + if (!hdr->nr_zones) 1075 + break; 1076 + } 1077 + 1078 + if (hdr) { 1079 + hdr->nr_zones = nr_rep; 1080 + kunmap_atomic(hdr); 1081 + } 1082 + 1083 + bio_advance(report_bio, report_bio->bi_iter.bi_size); 1084 + 1085 + #else /* !CONFIG_BLK_DEV_ZONED */ 1086 + bio->bi_status = BLK_STS_NOTSUPP; 1087 + #endif 1088 + } 1089 + EXPORT_SYMBOL_GPL(dm_remap_zone_report); 1090 + 1091 + /* 1016 1092 * Flush current->bio_list when the target map method blocks. 1017 1093 * This fixes deadlocks in snapshot and possibly in other targets. 
1018 1094 */ ··· 1231 1149 return r; 1232 1150 } 1233 1151 1234 - bio_advance(clone, to_bytes(sector - clone->bi_iter.bi_sector)); 1152 + if (bio_op(bio) != REQ_OP_ZONE_REPORT) 1153 + bio_advance(clone, to_bytes(sector - clone->bi_iter.bi_sector)); 1235 1154 clone->bi_iter.bi_size = to_bytes(len); 1236 1155 1237 1156 if (unlikely(bio_integrity(bio) != NULL)) ··· 1421 1338 if (!dm_target_is_valid(ti)) 1422 1339 return -EIO; 1423 1340 1424 - len = min_t(sector_t, max_io_len(ci->sector, ti), ci->sector_count); 1341 + if (bio_op(bio) == REQ_OP_ZONE_REPORT) 1342 + len = ci->sector_count; 1343 + else 1344 + len = min_t(sector_t, max_io_len(ci->sector, ti), 1345 + ci->sector_count); 1425 1346 1426 1347 r = __clone_and_map_data_bio(ci, ti, ci->sector, &len); 1427 1348 if (r < 0) ··· 1468 1381 ci.sector_count = 0; 1469 1382 error = __send_empty_flush(&ci); 1470 1383 /* dec_pending submits any data associated with flush */ 1384 + } else if (bio_op(bio) == REQ_OP_ZONE_RESET) { 1385 + ci.bio = bio; 1386 + ci.sector_count = 0; 1387 + error = __split_and_process_non_flush(&ci); 1471 1388 } else { 1472 1389 ci.bio = bio; 1473 1390 ci.sector_count = bio_sectors(bio); ··· 1850 1759 dm_send_uevents(&uevents, &disk_to_dev(md->disk)->kobj); 1851 1760 1852 1761 atomic_inc(&md->event_nr); 1762 + atomic_inc(&dm_global_event_nr); 1853 1763 wake_up(&md->eventq); 1764 + wake_up(&dm_global_eventq); 1854 1765 } 1855 1766 1856 1767 /*
+37 -36
include/linux/device-mapper.h
··· 237 237 #define DM_TARGET_PASSES_INTEGRITY	0x00000020
238 238 #define dm_target_passes_integrity(type) ((type)->features & DM_TARGET_PASSES_INTEGRITY)
239 239
240 + /*
241 +  * Indicates that a target supports host-managed zoned block devices.
242 +  */
243 + #define DM_TARGET_ZONED_HM	0x00000040
244 + #define dm_target_supports_zoned_hm(type) ((type)->features & DM_TARGET_ZONED_HM)
245 +
240 246 struct dm_target {
241 247 	struct dm_table *table;
242 248 	struct target_type *type;
···
450 444 int dm_suspended(struct dm_target *ti);
451 445 int dm_noflush_suspending(struct dm_target *ti);
452 446 void dm_accept_partial_bio(struct bio *bio, unsigned n_sectors);
447 + void dm_remap_zone_report(struct dm_target *ti, struct bio *bio,
448 +			  sector_t start);
453 449 union map_info *dm_get_rq_mapinfo(struct request *rq);
454 450
455 451 struct queue_limits *dm_get_queue_limits(struct mapped_device *md);
···
551 543 #define dm_ratelimit()	0
552 544 #endif
553 545
554 - #define DMCRIT(f, arg...) \
555 -	printk(KERN_CRIT DM_NAME ": " DM_MSG_PREFIX ": " f "\n", ## arg)
546 + #define DM_FMT(fmt) DM_NAME ": " DM_MSG_PREFIX ": " fmt "\n"
556 547
557 - #define DMERR(f, arg...) \
558 -	printk(KERN_ERR DM_NAME ": " DM_MSG_PREFIX ": " f "\n", ## arg)
559 - #define DMERR_LIMIT(f, arg...) \
560 -	do { \
561 -		if (dm_ratelimit()) \
562 -			printk(KERN_ERR DM_NAME ": " DM_MSG_PREFIX ": " \
563 -			       f "\n", ## arg); \
564 -	} while (0)
548 + #define DMCRIT(fmt, ...) pr_crit(DM_FMT(fmt), ##__VA_ARGS__)
565 549
566 - #define DMWARN(f, arg...) \
567 -	printk(KERN_WARNING DM_NAME ": " DM_MSG_PREFIX ": " f "\n", ## arg)
568 - #define DMWARN_LIMIT(f, arg...) \
569 -	do { \
570 -		if (dm_ratelimit()) \
571 -			printk(KERN_WARNING DM_NAME ": " DM_MSG_PREFIX ": " \
572 -			       f "\n", ## arg); \
573 -	} while (0)
550 + #define DMERR(fmt, ...) pr_err(DM_FMT(fmt), ##__VA_ARGS__)
551 + #define DMERR_LIMIT(fmt, ...) \
552 + do { \
553 +	if (dm_ratelimit()) \
554 +		DMERR(fmt, ##__VA_ARGS__); \
555 + } while (0)
574 556
575 - #define DMINFO(f, arg...) \
576 -	printk(KERN_INFO DM_NAME ": " DM_MSG_PREFIX ": " f "\n", ## arg)
577 - #define DMINFO_LIMIT(f, arg...) \
578 -	do { \
579 -		if (dm_ratelimit()) \
580 -			printk(KERN_INFO DM_NAME ": " DM_MSG_PREFIX ": " f \
581 -			       "\n", ## arg); \
582 -	} while (0)
557 + #define DMWARN(fmt, ...) pr_warn(DM_FMT(fmt), ##__VA_ARGS__)
558 + #define DMWARN_LIMIT(fmt, ...) \
559 + do { \
560 +	if (dm_ratelimit()) \
561 +		DMWARN(fmt, ##__VA_ARGS__); \
562 + } while (0)
563 +
564 + #define DMINFO(fmt, ...) pr_info(DM_FMT(fmt), ##__VA_ARGS__)
565 + #define DMINFO_LIMIT(fmt, ...) \
566 + do { \
567 +	if (dm_ratelimit()) \
568 +		DMINFO(fmt, ##__VA_ARGS__); \
569 + } while (0)
583 570
584 571 #ifdef CONFIG_DM_DEBUG
585 - # define DMDEBUG(f, arg...) \
586 -	printk(KERN_DEBUG DM_NAME ": " DM_MSG_PREFIX " DEBUG: " f "\n", ## arg)
587 - # define DMDEBUG_LIMIT(f, arg...) \
588 -	do { \
589 -		if (dm_ratelimit()) \
590 -			printk(KERN_DEBUG DM_NAME ": " DM_MSG_PREFIX ": " f \
591 -			       "\n", ## arg); \
592 -	} while (0)
572 + #define DMDEBUG(fmt, ...) printk(KERN_DEBUG DM_FMT(fmt), ##__VA_ARGS__)
573 + #define DMDEBUG_LIMIT(fmt, ...) \
574 + do { \
575 +	if (dm_ratelimit()) \
576 +		DMDEBUG(fmt, ##__VA_ARGS__); \
577 + } while (0)
593 578 #else
594 - # define DMDEBUG(f, arg...) do {} while (0)
595 - # define DMDEBUG_LIMIT(f, arg...) do {} while (0)
579 + #define DMDEBUG(fmt, ...) no_printk(fmt, ##__VA_ARGS__)
580 + #define DMDEBUG_LIMIT(fmt, ...) no_printk(fmt, ##__VA_ARGS__)
596 581 #endif
597 582
598 583 #define DMEMIT(x...) sz += ((sz >= maxlen) ? \
+1
include/linux/dm-kcopyd.h
··· 20 20 #define DM_KCOPYD_MAX_REGIONS	8
21 21
22 22 #define DM_KCOPYD_IGNORE_ERROR 1
23 + #define DM_KCOPYD_WRITE_SEQ    2
23 24
24 25 struct dm_kcopyd_throttle {
25 26 	unsigned throttle;
+3 -1
include/uapi/linux/dm-ioctl.h
··· 240 240 	/* Added later */
241 241 	DM_LIST_VERSIONS_CMD,
242 242 	DM_TARGET_MSG_CMD,
243 -	DM_DEV_SET_GEOMETRY_CMD
243 +	DM_DEV_SET_GEOMETRY_CMD,
244 +	DM_DEV_ARM_POLL_CMD,
244 245 };
245 246
246 247 #define DM_IOCTL 0xfd
···
256 255 #define DM_DEV_SUSPEND   _IOWR(DM_IOCTL, DM_DEV_SUSPEND_CMD, struct dm_ioctl)
257 256 #define DM_DEV_STATUS    _IOWR(DM_IOCTL, DM_DEV_STATUS_CMD, struct dm_ioctl)
258 257 #define DM_DEV_WAIT      _IOWR(DM_IOCTL, DM_DEV_WAIT_CMD, struct dm_ioctl)
258 + #define DM_DEV_ARM_POLL  _IOWR(DM_IOCTL, DM_DEV_ARM_POLL_CMD, struct dm_ioctl)
259 259
260 260 #define DM_TABLE_LOAD    _IOWR(DM_IOCTL, DM_TABLE_LOAD_CMD, struct dm_ioctl)
261 261 #define DM_TABLE_CLEAR   _IOWR(DM_IOCTL, DM_TABLE_CLEAR_CMD, struct dm_ioctl)