Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'dm-3.15-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper changes from Mike Snitzer:

- Fix dm-cache corruption caused by discard_block_size > cache_block_size

- Fix a lock-inversion detected by LOCKDEP in dm-cache

- Fix a dangling bio bug in the dm-thinp target's process_deferred_bios
error path

- Fix corruption due to non-atomic transaction commit which allowed a
metadata superblock to be written before all other metadata was
successfully written -- this is common to all targets that use the
persistent-data library's transaction manager (dm-thinp, dm-cache and
dm-era).

- Various small cleanups in the DM core

- Add the dm-era target which is useful for keeping track of which
blocks were written within a user defined period of time called an
'era'. Use cases include tracking changed blocks for backup
software, and partially invalidating the contents of a cache to
restore cache coherency after rolling back a vendor snapshot.

- Improve the on-disk layout of multithreaded writes to the
dm-thin-pool by splitting the pool's deferred bio list into
per-thin-device lists and then sorting each list using an rb_tree.
The subsequent read throughput of the data written via multiple
threads improved by ~70%.

- Simplify the multipath target's handling of queuing IO by pushing
requests back to the request queue rather than queueing the IO
internally.

* tag 'dm-3.15-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (24 commits)
dm cache: fix a lock-inversion
dm thin: sort the per thin deferred bios using an rb_tree
dm thin: use per thin device deferred bio lists
dm thin: simplify pool_is_congested
dm thin: fix dangling bio in process_deferred_bios error path
dm mpath: print more useful warnings in multipath_message()
dm-mpath: do not activate failed paths
dm mpath: remove extra nesting in map function
dm mpath: remove map_io()
dm mpath: reduce memory pressure when requeuing
dm mpath: remove process_queued_ios()
dm mpath: push back requests instead of queueing
dm table: add dm_table_run_md_queue_async
dm mpath: do not call pg_init when it is already running
dm: use RCU_INIT_POINTER instead of rcu_assign_pointer in __unbind
dm: stop using bi_private
dm: remove dm_get_mapinfo
dm: make dm_table_alloc_md_mempools static
dm: take care to copy the space map roots before locking the superblock
dm transaction manager: fix corruption due to non-atomic transaction commit
...

+2351 -483
+108
Documentation/device-mapper/era.txt
Introduction
============

dm-era is a target that behaves similar to the linear target.  In
addition it keeps track of which blocks were written within a user
defined period of time called an 'era'.  Each era target instance
maintains the current era as a monotonically increasing 32-bit
counter.

Use cases include tracking changed blocks for backup software, and
partially invalidating the contents of a cache to restore cache
coherency after rolling back a vendor snapshot.

Constructor
===========

 era <metadata dev> <origin dev> <block size>

 metadata dev    : fast device holding the persistent metadata
 origin dev      : device holding data blocks that may change
 block size      : block size of origin data device, granularity that is
                   tracked by the target

Messages
========

None of the dm messages take any arguments.

checkpoint
----------

Possibly move to a new era.  You shouldn't assume the era has
incremented.  After sending this message, you should check the
current era via the status line.

take_metadata_snap
------------------

Create a clone of the metadata, to allow a userland process to read it.

drop_metadata_snap
------------------

Drop the metadata snapshot.

Status
======

<metadata block size> <#used metadata blocks>/<#total metadata blocks>
<current era> <held metadata root | '-'>

metadata block size    : Fixed block size for each metadata block in
                         sectors
#used metadata blocks  : Number of metadata blocks used
#total metadata blocks : Total number of metadata blocks
current era            : The current era
held metadata root     : The location, in blocks, of the metadata root
                         that has been 'held' for userspace read
                         access.  '-' indicates there is no held root

Detailed use case
=================

The scenario of invalidating a cache when rolling back a vendor
snapshot was the primary use case when developing this target:

Taking a vendor snapshot
------------------------

- Send a checkpoint message to the era target
- Make a note of the current era in its status line
- Take vendor snapshot (the era and snapshot should be forever
  associated now).

Rolling back to an vendor snapshot
----------------------------------

- Cache enters passthrough mode (see: dm-cache's docs in cache.txt)
- Rollback vendor storage
- Take metadata snapshot
- Ascertain which blocks have been written since the snapshot was taken
  by checking each block's era
- Invalidate those blocks in the caching software
- Cache returns to writeback/writethrough mode

Memory usage
============

The target uses a bitset to record writes in the current era.  It also
has a spare bitset ready for switching over to a new era.  Other than
that it uses a few 4k blocks for updating metadata.

   (4 * nr_blocks) bytes + buffers

Resilience
==========

Metadata is updated on disk before a write to a previously unwritten
block is performed.  As such dm-era should not be effected by a hard
crash such as power failure.

Userland tools
==============

Userland tools are found in the increasingly poorly named
thin-provisioning-tools project:

    https://github.com/jthornber/thin-provisioning-tools
+11
drivers/md/Kconfig
···
 285 285 	  A simple cache policy that writes back all data to the
 286 286 	  origin. Used when decommissioning a dm-cache.
 287 287 
 288 + config DM_ERA
 289 + 	tristate "Era target (EXPERIMENTAL)"
 290 + 	depends on BLK_DEV_DM
 291 + 	default n
 292 + 	select DM_PERSISTENT_DATA
 293 + 	select DM_BIO_PRISON
 294 + 	---help---
 295 + 	  dm-era tracks which parts of a block device are written to
 296 + 	  over time. Useful for maintaining cache coherency when using
 297 + 	  vendor snapshots.
 298 + 
 288 299 config DM_MIRROR
 289 300 	tristate "Mirror target"
 290 301 	depends on BLK_DEV_DM
+2
drivers/md/Makefile
···
 14 14 dm-cache-y	+= dm-cache-target.o dm-cache-metadata.o dm-cache-policy.o
 15 15 dm-cache-mq-y	+= dm-cache-policy-mq.o
 16 16 dm-cache-cleaner-y += dm-cache-policy-cleaner.o
 17 + dm-era-y	+= dm-era-target.o
 17 18 md-mod-y	+= md.o bitmap.o
 18 19 raid456-y	+= raid5.o
 19 20 
···
 54 53 obj-$(CONFIG_DM_CACHE)		+= dm-cache.o
 55 54 obj-$(CONFIG_DM_CACHE_MQ)	+= dm-cache-mq.o
 56 55 obj-$(CONFIG_DM_CACHE_CLEANER)	+= dm-cache-cleaner.o
 56 + obj-$(CONFIG_DM_ERA)		+= dm-era.o
 57 57 
 58 58 ifeq ($(CONFIG_DM_UEVENT),y)
 59 59 dm-mod-objs			+= dm-uevent.o
-11
drivers/md/dm-cache-block-types.h
···
 19 19 
 20 20 typedef dm_block_t __bitwise__ dm_oblock_t;
 21 21 typedef uint32_t __bitwise__ dm_cblock_t;
 22 - typedef dm_block_t __bitwise__ dm_dblock_t;
 23 22 
 24 23 static inline dm_oblock_t to_oblock(dm_block_t b)
 25 24 {
···
 38 39 static inline uint32_t from_cblock(dm_cblock_t b)
 39 40 {
 40 41 	return (__force uint32_t) b;
 41 - }
 42 - 
 43 - static inline dm_dblock_t to_dblock(dm_block_t b)
 44 - {
 45 - 	return (__force dm_dblock_t) b;
 46 - }
 47 - 
 48 - static inline dm_block_t from_dblock(dm_dblock_t b)
 49 - {
 50 - 	return (__force dm_block_t) b;
 51 42 }
 52 43 
 53 44 #endif /* DM_CACHE_BLOCK_TYPES_H */
+75 -59
drivers/md/dm-cache-metadata.c
···
 109 109 	dm_block_t discard_root;
 110 110 
 111 111 	sector_t discard_block_size;
 112 - 	dm_dblock_t discard_nr_blocks;
 112 + 	dm_oblock_t discard_nr_blocks;
 113 113 
 114 114 	sector_t data_block_size;
 115 115 	dm_cblock_t cache_blocks;
···
 120 120 	unsigned policy_version[CACHE_POLICY_VERSION_SIZE];
 121 121 	size_t policy_hint_size;
 122 122 	struct dm_cache_statistics stats;
 123 + 
 124 + 	/*
 125 + 	 * Reading the space map root can fail, so we read it into this
 126 + 	 * buffer before the superblock is locked and updated.
 127 + 	 */
 128 + 	__u8 metadata_space_map_root[SPACE_MAP_ROOT_SIZE];
 123 129 };
 124 130 
 125 131 /*-------------------------------------------------------------------
···
 266 260 	}
 267 261 }
 268 262 
 263 + static int __save_sm_root(struct dm_cache_metadata *cmd)
 264 + {
 265 + 	int r;
 266 + 	size_t metadata_len;
 267 + 
 268 + 	r = dm_sm_root_size(cmd->metadata_sm, &metadata_len);
 269 + 	if (r < 0)
 270 + 		return r;
 271 + 
 272 + 	return dm_sm_copy_root(cmd->metadata_sm, &cmd->metadata_space_map_root,
 273 + 			       metadata_len);
 274 + }
 275 + 
 276 + static void __copy_sm_root(struct dm_cache_metadata *cmd,
 277 + 			   struct cache_disk_superblock *disk_super)
 278 + {
 279 + 	memcpy(&disk_super->metadata_space_map_root,
 280 + 	       &cmd->metadata_space_map_root,
 281 + 	       sizeof(cmd->metadata_space_map_root));
 282 + }
 283 + 
 269 284 static int __write_initial_superblock(struct dm_cache_metadata *cmd)
 270 285 {
 271 286 	int r;
 272 287 	struct dm_block *sblock;
 273 - 	size_t metadata_len;
 274 288 	struct cache_disk_superblock *disk_super;
 275 289 	sector_t bdev_size = i_size_read(cmd->bdev->bd_inode) >> SECTOR_SHIFT;
 276 290 
···
 298 272 	if (bdev_size > DM_CACHE_METADATA_MAX_SECTORS)
 299 273 		bdev_size = DM_CACHE_METADATA_MAX_SECTORS;
 300 274 
 301 - 	r = dm_sm_root_size(cmd->metadata_sm, &metadata_len);
 275 + 	r = dm_tm_pre_commit(cmd->tm);
 302 276 	if (r < 0)
 303 277 		return r;
 304 278 
 305 - 	r = dm_tm_pre_commit(cmd->tm);
 306 - 	if (r < 0)
 279 + 	/*
 280 + 	 * dm_sm_copy_root() can fail.  So we need to do it before we start
 281 + 	 * updating the superblock.
 282 + 	 */
 283 + 	r = __save_sm_root(cmd);
 284 + 	if (r)
 307 285 		return r;
 308 286 
 309 287 	r = superblock_lock_zero(cmd, &sblock);
···
 323 293 	memset(disk_super->policy_version, 0, sizeof(disk_super->policy_version));
 324 294 	disk_super->policy_hint_size = 0;
 325 295 
 326 - 	r = dm_sm_copy_root(cmd->metadata_sm, &disk_super->metadata_space_map_root,
 327 - 			    metadata_len);
 328 - 	if (r < 0)
 329 - 		goto bad_locked;
 296 + 	__copy_sm_root(cmd, disk_super);
 330 297 
 331 298 	disk_super->mapping_root = cpu_to_le64(cmd->root);
 332 299 	disk_super->hint_root = cpu_to_le64(cmd->hint_root);
 333 300 	disk_super->discard_root = cpu_to_le64(cmd->discard_root);
 334 301 	disk_super->discard_block_size = cpu_to_le64(cmd->discard_block_size);
 335 - 	disk_super->discard_nr_blocks = cpu_to_le64(from_dblock(cmd->discard_nr_blocks));
 302 + 	disk_super->discard_nr_blocks = cpu_to_le64(from_oblock(cmd->discard_nr_blocks));
 336 303 	disk_super->metadata_block_size = cpu_to_le32(DM_CACHE_METADATA_BLOCK_SIZE >> SECTOR_SHIFT);
 337 304 	disk_super->data_block_size = cpu_to_le32(cmd->data_block_size);
 338 305 	disk_super->cache_blocks = cpu_to_le32(0);
···
 340 313 	disk_super->write_misses = cpu_to_le32(0);
 341 314 
 342 315 	return dm_tm_commit(cmd->tm, sblock);
 343 - 
 344 - bad_locked:
 345 - 	dm_bm_unlock(sblock);
 346 - 	return r;
 347 316 }
 348 317 
 349 318 static int __format_metadata(struct dm_cache_metadata *cmd)
···
 519 496 	cmd->hint_root = le64_to_cpu(disk_super->hint_root);
 520 497 	cmd->discard_root = le64_to_cpu(disk_super->discard_root);
 521 498 	cmd->discard_block_size = le64_to_cpu(disk_super->discard_block_size);
 522 - 	cmd->discard_nr_blocks = to_dblock(le64_to_cpu(disk_super->discard_nr_blocks));
 499 + 	cmd->discard_nr_blocks = to_oblock(le64_to_cpu(disk_super->discard_nr_blocks));
 523 500 	cmd->data_block_size = le32_to_cpu(disk_super->data_block_size);
 524 501 	cmd->cache_blocks = to_cblock(le32_to_cpu(disk_super->cache_blocks));
 525 502 	strncpy(cmd->policy_name, disk_super->policy_name, sizeof(cmd->policy_name));
···
 553 530 	disk_super = dm_block_data(sblock);
 554 531 	update_flags(disk_super, mutator);
 555 532 	read_superblock_fields(cmd, disk_super);
 533 + 	dm_bm_unlock(sblock);
 556 534 
 557 - 	return dm_bm_flush_and_unlock(cmd->bm, sblock);
 535 + 	return dm_bm_flush(cmd->bm);
 558 536 }
 559 537 
 560 538 static int __begin_transaction(struct dm_cache_metadata *cmd)
···
 583 559 				flags_mutator mutator)
 584 560 {
 585 561 	int r;
 586 - 	size_t metadata_len;
 587 562 	struct cache_disk_superblock *disk_super;
 588 563 	struct dm_block *sblock;
 589 564 
···
 600 577 	if (r < 0)
 601 578 		return r;
 602 579 
 603 - 	r = dm_sm_root_size(cmd->metadata_sm, &metadata_len);
 604 - 	if (r < 0)
 580 + 	r = __save_sm_root(cmd);
 581 + 	if (r)
 605 582 		return r;
 606 583 
 607 584 	r = superblock_lock(cmd, &sblock);
···
 617 594 	disk_super->hint_root = cpu_to_le64(cmd->hint_root);
 618 595 	disk_super->discard_root = cpu_to_le64(cmd->discard_root);
 619 596 	disk_super->discard_block_size = cpu_to_le64(cmd->discard_block_size);
 620 - 	disk_super->discard_nr_blocks = cpu_to_le64(from_dblock(cmd->discard_nr_blocks));
 597 + 	disk_super->discard_nr_blocks = cpu_to_le64(from_oblock(cmd->discard_nr_blocks));
 621 598 	disk_super->cache_blocks = cpu_to_le32(from_cblock(cmd->cache_blocks));
 622 599 	strncpy(disk_super->policy_name, cmd->policy_name, sizeof(disk_super->policy_name));
 623 600 	disk_super->policy_version[0] = cpu_to_le32(cmd->policy_version[0]);
···
 628 605 	disk_super->read_misses = cpu_to_le32(cmd->stats.read_misses);
 629 606 	disk_super->write_hits = cpu_to_le32(cmd->stats.write_hits);
 630 607 	disk_super->write_misses = cpu_to_le32(cmd->stats.write_misses);
 631 - 
 632 - 	r = dm_sm_copy_root(cmd->metadata_sm, &disk_super->metadata_space_map_root,
 633 - 			    metadata_len);
 634 - 	if (r < 0) {
 635 - 		dm_bm_unlock(sblock);
 636 - 		return r;
 637 - 	}
 608 + 	__copy_sm_root(cmd, disk_super);
 638 609 
 639 610 	return dm_tm_commit(cmd->tm, sblock);
 640 611 }
···
 788 771 
 789 772 int dm_cache_discard_bitset_resize(struct dm_cache_metadata *cmd,
 790 773 				   sector_t discard_block_size,
 791 - 				   dm_dblock_t new_nr_entries)
 774 + 				   dm_oblock_t new_nr_entries)
 792 775 {
 793 776 	int r;
 794 777 
 795 778 	down_write(&cmd->root_lock);
 796 779 	r = dm_bitset_resize(&cmd->discard_info,
 797 780 			     cmd->discard_root,
 798 - 			     from_dblock(cmd->discard_nr_blocks),
 799 - 			     from_dblock(new_nr_entries),
 781 + 			     from_oblock(cmd->discard_nr_blocks),
 782 + 			     from_oblock(new_nr_entries),
 800 783 			     false, &cmd->discard_root);
 801 784 	if (!r) {
 802 785 		cmd->discard_block_size = discard_block_size;
···
 809 792 	return r;
 810 793 }
 811 794 
 812 - static int __set_discard(struct dm_cache_metadata *cmd, dm_dblock_t b)
 795 + static int __set_discard(struct dm_cache_metadata *cmd, dm_oblock_t b)
 813 796 {
 814 797 	return dm_bitset_set_bit(&cmd->discard_info, cmd->discard_root,
 815 - 				 from_dblock(b), &cmd->discard_root);
 798 + 				 from_oblock(b), &cmd->discard_root);
 816 799 }
 817 800 
 818 - static int __clear_discard(struct dm_cache_metadata *cmd, dm_dblock_t b)
 801 + static int __clear_discard(struct dm_cache_metadata *cmd, dm_oblock_t b)
 819 802 {
 820 803 	return dm_bitset_clear_bit(&cmd->discard_info, cmd->discard_root,
 821 - 				   from_dblock(b), &cmd->discard_root);
 804 + 				   from_oblock(b), &cmd->discard_root);
 822 805 }
 823 806 
 824 - static int __is_discarded(struct dm_cache_metadata *cmd, dm_dblock_t b,
 807 + static int __is_discarded(struct dm_cache_metadata *cmd, dm_oblock_t b,
 825 808 			  bool *is_discarded)
 826 809 {
 827 810 	return dm_bitset_test_bit(&cmd->discard_info, cmd->discard_root,
 828 - 				  from_dblock(b), &cmd->discard_root,
 811 + 				  from_oblock(b), &cmd->discard_root,
 829 812 				  is_discarded);
 830 813 }
 831 814 
 832 815 static int __discard(struct dm_cache_metadata *cmd,
 833 - 		     dm_dblock_t dblock, bool discard)
 816 + 		     dm_oblock_t dblock, bool discard)
 834 817 {
 835 818 	int r;
 836 819 
···
 843 826 }
 844 827 
 845 828 int dm_cache_set_discard(struct dm_cache_metadata *cmd,
 846 - 			 dm_dblock_t dblock, bool discard)
 829 + 			 dm_oblock_t dblock, bool discard)
 847 830 {
 848 831 	int r;
 849 832 
···
 861 844 	dm_block_t b;
 862 845 	bool discard;
 863 846 
 864 - 	for (b = 0; b < from_dblock(cmd->discard_nr_blocks); b++) {
 865 - 		dm_dblock_t dblock = to_dblock(b);
 847 + 	for (b = 0; b < from_oblock(cmd->discard_nr_blocks); b++) {
 848 + 		dm_oblock_t dblock = to_oblock(b);
 866 849 
 867 850 		if (cmd->clean_when_opened) {
 868 851 			r = __is_discarded(cmd, dblock, &discard);
···
 1245 1228 	return 0;
 1246 1229 }
 1247 1230 
 1248 - int dm_cache_begin_hints(struct dm_cache_metadata *cmd, struct dm_cache_policy *policy)
 1231 + static int save_hint(void *context, dm_cblock_t cblock, dm_oblock_t oblock, uint32_t hint)
 1249 1232 {
 1250 - 	int r;
 1251 - 
 1252 - 	down_write(&cmd->root_lock);
 1253 - 	r = begin_hints(cmd, policy);
 1254 - 	up_write(&cmd->root_lock);
 1255 - 
 1256 - 	return r;
 1257 - }
 1258 - 
 1259 - static int save_hint(struct dm_cache_metadata *cmd, dm_cblock_t cblock,
 1260 - 		     uint32_t hint)
 1261 - {
 1262 - 	int r;
 1233 + 	struct dm_cache_metadata *cmd = context;
 1263 1234 	__le32 value = cpu_to_le32(hint);
 1235 + 	int r;
 1236 + 
 1264 1237 	__dm_bless_for_disk(&value);
 1265 1238 
 1266 1239 	r = dm_array_set_value(&cmd->hint_info, cmd->hint_root,
···
 1260 1253 	return r;
 1261 1254 }
 1262 1255 
 1263 - int dm_cache_save_hint(struct dm_cache_metadata *cmd, dm_cblock_t cblock,
 1264 - 		       uint32_t hint)
 1256 + static int write_hints(struct dm_cache_metadata *cmd, struct dm_cache_policy *policy)
 1265 1257 {
 1266 1258 	int r;
 1267 1259 
 1268 - 	if (!hints_array_initialized(cmd))
 1269 - 		return 0;
 1260 + 	r = begin_hints(cmd, policy);
 1261 + 	if (r) {
 1262 + 		DMERR("begin_hints failed");
 1263 + 		return r;
 1264 + 	}
 1265 + 
 1266 + 	return policy_walk_mappings(policy, save_hint, cmd);
 1267 + }
 1268 + 
 1269 + int dm_cache_write_hints(struct dm_cache_metadata *cmd, struct dm_cache_policy *policy)
 1270 + {
 1271 + 	int r;
 1270 1272 
 1271 1273 	down_write(&cmd->root_lock);
 1272 - 	r = save_hint(cmd, cblock, hint);
 1274 + 	r = write_hints(cmd, policy);
 1273 1275 	up_write(&cmd->root_lock);
 1274 1276 
 1275 1277 	return r;
+4 -11
drivers/md/dm-cache-metadata.h
···
 72 72 
 73 73 int dm_cache_discard_bitset_resize(struct dm_cache_metadata *cmd,
 74 74 				   sector_t discard_block_size,
 75 - 				   dm_dblock_t new_nr_entries);
 75 + 				   dm_oblock_t new_nr_entries);
 76 76 
 77 77 typedef int (*load_discard_fn)(void *context, sector_t discard_block_size,
 78 - 			       dm_dblock_t dblock, bool discarded);
 78 + 			       dm_oblock_t dblock, bool discarded);
 79 79 int dm_cache_load_discards(struct dm_cache_metadata *cmd,
 80 80 			   load_discard_fn fn, void *context);
 81 81 
 82 - int dm_cache_set_discard(struct dm_cache_metadata *cmd, dm_dblock_t dblock, bool discard);
 82 + int dm_cache_set_discard(struct dm_cache_metadata *cmd, dm_oblock_t dblock, bool discard);
 83 83 
 84 84 int dm_cache_remove_mapping(struct dm_cache_metadata *cmd, dm_cblock_t cblock);
 85 85 int dm_cache_insert_mapping(struct dm_cache_metadata *cmd, dm_cblock_t cblock, dm_oblock_t oblock);
···
 128 128  * rather than querying the policy for each cblock, we let it walk its data
 129 129  * structures and fill in the hints in whatever order it wishes.
 130 130  */
 131 - 
 132 - int dm_cache_begin_hints(struct dm_cache_metadata *cmd, struct dm_cache_policy *p);
 133 - 
 134 - /*
 135 -  * requests hints for every cblock and stores in the metadata device.
 136 -  */
 137 - int dm_cache_save_hint(struct dm_cache_metadata *cmd,
 138 - 		       dm_cblock_t cblock, uint32_t hint);
 131 + int dm_cache_write_hints(struct dm_cache_metadata *cmd, struct dm_cache_policy *p);
 139 132 
 140 133 /*
 141 134  * Query method. Are all the blocks in the cache clean?
+28 -103
drivers/md/dm-cache-target.c
···
 237 237 	/*
 238 238 	 * origin_blocks entries, discarded if set.
 239 239 	 */
 240 - 	dm_dblock_t discard_nr_blocks;
 240 + 	dm_oblock_t discard_nr_blocks;
 241 241 	unsigned long *discard_bitset;
 242 - 	uint32_t discard_block_size; /* a power of 2 times sectors per block */
 243 242 
 244 243 	/*
 245 244 	 * Rather than reconstructing the table line for the status we just
···
 525 526 	return b;
 526 527 }
 527 528 
 528 - static dm_dblock_t oblock_to_dblock(struct cache *cache, dm_oblock_t oblock)
 529 - {
 530 - 	uint32_t discard_blocks = cache->discard_block_size;
 531 - 	dm_block_t b = from_oblock(oblock);
 532 - 
 533 - 	if (!block_size_is_power_of_two(cache))
 534 - 		discard_blocks = discard_blocks / cache->sectors_per_block;
 535 - 	else
 536 - 		discard_blocks >>= cache->sectors_per_block_shift;
 537 - 
 538 - 	b = block_div(b, discard_blocks);
 539 - 
 540 - 	return to_dblock(b);
 541 - }
 542 - 
 543 - static void set_discard(struct cache *cache, dm_dblock_t b)
 529 + static void set_discard(struct cache *cache, dm_oblock_t b)
 544 530 {
 545 531 	unsigned long flags;
 546 532 
 547 533 	atomic_inc(&cache->stats.discard_count);
 548 534 
 549 535 	spin_lock_irqsave(&cache->lock, flags);
 550 - 	set_bit(from_dblock(b), cache->discard_bitset);
 536 + 	set_bit(from_oblock(b), cache->discard_bitset);
 551 537 	spin_unlock_irqrestore(&cache->lock, flags);
 552 538 }
 553 539 
 554 - static void clear_discard(struct cache *cache, dm_dblock_t b)
 540 + static void clear_discard(struct cache *cache, dm_oblock_t b)
 555 541 {
 556 542 	unsigned long flags;
 557 543 
 558 544 	spin_lock_irqsave(&cache->lock, flags);
 559 - 	clear_bit(from_dblock(b), cache->discard_bitset);
 545 + 	clear_bit(from_oblock(b), cache->discard_bitset);
 560 546 	spin_unlock_irqrestore(&cache->lock, flags);
 561 547 }
 562 548 
 563 - static bool is_discarded(struct cache *cache, dm_dblock_t b)
 549 + static bool is_discarded(struct cache *cache, dm_oblock_t b)
 564 550 {
 565 551 	int r;
 566 552 	unsigned long flags;
 567 553 
 568 554 	spin_lock_irqsave(&cache->lock, flags);
 569 - 	r = test_bit(from_dblock(b), cache->discard_bitset);
 555 + 	r = test_bit(from_oblock(b), cache->discard_bitset);
 570 556 	spin_unlock_irqrestore(&cache->lock, flags);
 571 557 
 572 558 	return r;
···
 563 579 	unsigned long flags;
 564 580 
 565 581 	spin_lock_irqsave(&cache->lock, flags);
 566 - 	r = test_bit(from_dblock(oblock_to_dblock(cache, b)),
 567 - 		     cache->discard_bitset);
 582 + 	r = test_bit(from_oblock(b), cache->discard_bitset);
 568 583 	spin_unlock_irqrestore(&cache->lock, flags);
 569 584 
 570 585 	return r;
···
 688 705 	check_if_tick_bio_needed(cache, bio);
 689 706 	remap_to_origin(cache, bio);
 690 707 	if (bio_data_dir(bio) == WRITE)
 691 - 		clear_discard(cache, oblock_to_dblock(cache, oblock));
 708 + 		clear_discard(cache, oblock);
 692 709 }
 693 710 
 694 711 static void remap_to_cache_dirty(struct cache *cache, struct bio *bio,
···
 698 715 	remap_to_cache(cache, bio, cblock);
 699 716 	if (bio_data_dir(bio) == WRITE) {
 700 717 		set_dirty(cache, oblock, cblock);
 701 - 		clear_discard(cache, oblock_to_dblock(cache, oblock));
 718 + 		clear_discard(cache, oblock);
 702 719 	}
 703 720 }
···
 1271 1288 static void process_discard_bio(struct cache *cache, struct bio *bio)
 1272 1289 {
 1273 1290 	dm_block_t start_block = dm_sector_div_up(bio->bi_iter.bi_sector,
 1274 - 						  cache->discard_block_size);
 1291 + 						  cache->sectors_per_block);
 1275 1292 	dm_block_t end_block = bio_end_sector(bio);
 1276 1293 	dm_block_t b;
 1277 1294 
 1278 - 	end_block = block_div(end_block, cache->discard_block_size);
 1295 + 	end_block = block_div(end_block, cache->sectors_per_block);
 1279 1296 
 1280 1297 	for (b = start_block; b < end_block; b++)
 1281 - 		set_discard(cache, to_dblock(b));
 1298 + 		set_discard(cache, to_oblock(b));
 1282 1299 
 1283 1300 	bio_endio(bio, 0);
 1284 1301 }
···
 2154 2171 	return 0;
 2155 2172 }
 2156 2173 
 2157 - /*
 2158 -  * We want the discard block size to be a power of two, at least the size
 2159 -  * of the cache block size, and have no more than 2^14 discard blocks
 2160 -  * across the origin.
 2161 -  */
 2162 - #define MAX_DISCARD_BLOCKS (1 << 14)
 2163 - 
 2164 - static bool too_many_discard_blocks(sector_t discard_block_size,
 2165 - 				    sector_t origin_size)
 2166 - {
 2167 - 	(void) sector_div(origin_size, discard_block_size);
 2168 - 
 2169 - 	return origin_size > MAX_DISCARD_BLOCKS;
 2170 - }
 2171 - 
 2172 - static sector_t calculate_discard_block_size(sector_t cache_block_size,
 2173 - 					     sector_t origin_size)
 2174 - {
 2175 - 	sector_t discard_block_size;
 2176 - 
 2177 - 	discard_block_size = roundup_pow_of_two(cache_block_size);
 2178 - 
 2179 - 	if (origin_size)
 2180 - 		while (too_many_discard_blocks(discard_block_size, origin_size))
 2181 - 			discard_block_size *= 2;
 2182 - 
 2183 - 	return discard_block_size;
 2184 - }
 2185 - 
 2186 2174 #define DEFAULT_MIGRATION_THRESHOLD 2048
 2187 2175 
 2188 2176 static int cache_create(struct cache_args *ca, struct cache **result)
···
 2275 2321 	}
 2276 2322 	clear_bitset(cache->dirty_bitset, from_cblock(cache->cache_size));
 2277 2323 
 2278 - 	cache->discard_block_size =
 2279 - 		calculate_discard_block_size(cache->sectors_per_block,
 2280 - 					     cache->origin_sectors);
 2281 - 	cache->discard_nr_blocks = oblock_to_dblock(cache, cache->origin_blocks);
 2282 - 	cache->discard_bitset = alloc_bitset(from_dblock(cache->discard_nr_blocks));
 2324 + 	cache->discard_nr_blocks = cache->origin_blocks;
 2325 + 	cache->discard_bitset = alloc_bitset(from_oblock(cache->discard_nr_blocks));
 2283 2326 	if (!cache->discard_bitset) {
 2284 2327 		*error = "could not allocate discard bitset";
 2285 2328 		goto bad;
 2286 2329 	}
 2287 - 	clear_bitset(cache->discard_bitset, from_dblock(cache->discard_nr_blocks));
 2330 + 	clear_bitset(cache->discard_bitset, from_oblock(cache->discard_nr_blocks));
 2288 2331 
 2289 2332 	cache->copier = dm_kcopyd_client_create(&dm_kcopyd_throttle);
 2290 2333 	if (IS_ERR(cache->copier)) {
···
 2565 2614 {
 2566 2615 	unsigned i, r;
 2567 2616 
 2568 - 	r = dm_cache_discard_bitset_resize(cache->cmd, cache->discard_block_size,
 2569 - 					   cache->discard_nr_blocks);
 2617 + 	r = dm_cache_discard_bitset_resize(cache->cmd, cache->sectors_per_block,
 2618 + 					   cache->origin_blocks);
 2570 2619 	if (r) {
 2571 2620 		DMERR("could not resize on-disk discard bitset");
 2572 2621 		return r;
 2573 2622 	}
 2574 2623 
 2575 - 	for (i = 0; i < from_dblock(cache->discard_nr_blocks); i++) {
 2576 - 		r = dm_cache_set_discard(cache->cmd, to_dblock(i),
 2577 - 					 is_discarded(cache, to_dblock(i)));
 2624 + 	for (i = 0; i < from_oblock(cache->discard_nr_blocks); i++) {
 2625 + 		r = dm_cache_set_discard(cache->cmd, to_oblock(i),
 2626 + 					 is_discarded(cache, to_oblock(i)));
 2578 2627 		if (r)
 2579 2628 			return r;
 2580 2629 	}
 2581 2630 
 2582 2631 	return 0;
 2583 - }
 2584 - 
 2585 - static int save_hint(void *context, dm_cblock_t cblock, dm_oblock_t oblock,
 2586 - 		     uint32_t hint)
 2587 - {
 2588 - 	struct cache *cache = context;
 2589 - 	return dm_cache_save_hint(cache->cmd, cblock, hint);
 2590 - }
 2591 - 
 2592 - static int write_hints(struct cache *cache)
 2593 - {
 2594 - 	int r;
 2595 - 
 2596 - 	r = dm_cache_begin_hints(cache->cmd, cache->policy);
 2597 - 	if (r) {
 2598 - 		DMERR("dm_cache_begin_hints failed");
 2599 - 		return r;
 2600 - 	}
 2601 - 
 2602 - 	r = policy_walk_mappings(cache->policy, save_hint, cache);
 2603 - 	if (r)
 2604 - 		DMERR("policy_walk_mappings failed");
 2605 - 
 2606 - 	return r;
 2607 2632 }
 2608 2633 
 2609 2634 /*
···
 2599 2672 
 2600 2673 	save_stats(cache);
 2601 2674 
 2602 - 	r3 = write_hints(cache);
 2675 + 	r3 = dm_cache_write_hints(cache->cmd, cache->policy);
 2603 2676 	if (r3)
 2604 2677 		DMERR("could not write hints");
···
 2647 2720 }
 2648 2721 
 2649 2722 static int load_discard(void *context, sector_t discard_block_size,
 2650 - 			dm_dblock_t dblock, bool discard)
 2723 + 			dm_oblock_t oblock, bool discard)
 2651 2724 {
 2652 2725 	struct cache *cache = context;
 2653 2726 
 2654 - 	/* FIXME: handle mis-matched block size */
 2655 - 
 2656 2727 	if (discard)
 2657 - 		set_discard(cache, dblock);
 2728 + 		set_discard(cache, oblock);
 2658 2729 	else
 2659 - 		clear_discard(cache, dblock);
 2730 + 		clear_discard(cache, oblock);
 2660 2731 
 2661 2732 	return 0;
 2662 2733 }
···
 3045 3120 	/*
 3046 3121 	 * FIXME: these limits may be incompatible with the cache device
 3047 3122 	 */
 3048 - 	limits->max_discard_sectors = cache->discard_block_size * 1024;
 3049 - 	limits->discard_granularity = cache->discard_block_size << SECTOR_SHIFT;
 3123 + 	limits->max_discard_sectors = cache->sectors_per_block;
 3124 + 	limits->discard_granularity = cache->sectors_per_block << SECTOR_SHIFT;
 3050 3125 }
 3051 3126 
 3052 3127 static void cache_io_hints(struct dm_target *ti, struct queue_limits *limits)
···
 3070 3145 
 3071 3146 static struct target_type cache_target = {
 3072 3147 	.name = "cache",
 3073 - 	.version = {1, 3, 0},
 3148 + 	.version = {1, 4, 0},
 3074 3149 	.module = THIS_MODULE,
 3075 3150 	.ctr = cache_ctr,
 3076 3151 	.dtr = cache_dtr,
+1746
drivers/md/dm-era-target.c
#include "dm.h"
#include "persistent-data/dm-transaction-manager.h"
#include "persistent-data/dm-bitset.h"
#include "persistent-data/dm-space-map.h"

#include <linux/dm-io.h>
#include <linux/dm-kcopyd.h>
#include <linux/init.h>
#include <linux/mempool.h>
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>

#define DM_MSG_PREFIX "era"

#define SUPERBLOCK_LOCATION 0
#define SUPERBLOCK_MAGIC 2126579579
#define SUPERBLOCK_CSUM_XOR 146538381
#define MIN_ERA_VERSION 1
#define MAX_ERA_VERSION 1
#define INVALID_WRITESET_ROOT SUPERBLOCK_LOCATION
#define MIN_BLOCK_SIZE 8

/*----------------------------------------------------------------
 * Writeset
 *--------------------------------------------------------------*/
struct writeset_metadata {
	uint32_t nr_bits;
	dm_block_t root;
};

struct writeset {
	struct writeset_metadata md;

	/*
	 * An in core copy of the bits to save constantly doing look ups on
	 * disk.
	 */
	unsigned long *bits;
};

/*
 * This does not free off the on disk bitset as this will normally be done
 * after digesting into the era array.
 */
static void writeset_free(struct writeset *ws)
{
	vfree(ws->bits);
}

static int setup_on_disk_bitset(struct dm_disk_bitset *info,
				unsigned nr_bits, dm_block_t *root)
{
	int r;

	r = dm_bitset_empty(info, root);
	if (r)
		return r;

	return dm_bitset_resize(info, *root, 0, nr_bits, false, root);
}

static size_t bitset_size(unsigned nr_bits)
{
	return sizeof(unsigned long) * dm_div_up(nr_bits, BITS_PER_LONG);
}

/*
 * Allocates memory for the in core bitset.
 */
static int writeset_alloc(struct writeset *ws, dm_block_t nr_blocks)
{
	ws->md.nr_bits = nr_blocks;
	ws->md.root = INVALID_WRITESET_ROOT;
	ws->bits = vzalloc(bitset_size(nr_blocks));
	if (!ws->bits) {
		DMERR("%s: couldn't allocate in memory bitset", __func__);
		return -ENOMEM;
	}

	return 0;
}

/*
 * Wipes the in-core bitset, and creates a new on disk bitset.
 */
static int writeset_init(struct dm_disk_bitset *info, struct writeset *ws)
{
	int r;

	memset(ws->bits, 0, bitset_size(ws->md.nr_bits));

	r = setup_on_disk_bitset(info, ws->md.nr_bits, &ws->md.root);
	if (r) {
		DMERR("%s: setup_on_disk_bitset failed", __func__);
		return r;
	}

	return 0;
}

static bool writeset_marked(struct writeset *ws, dm_block_t block)
{
	return test_bit(block, ws->bits);
}

static int writeset_marked_on_disk(struct dm_disk_bitset *info,
				   struct writeset_metadata *m, dm_block_t block,
				   bool *result)
{
	dm_block_t old = m->root;

	/*
	 * The bitset was flushed when it was archived, so we know there'll
	 * be no change to the root.
	 */
	int r = dm_bitset_test_bit(info, m->root, block, &m->root, result);
	if (r) {
		DMERR("%s: dm_bitset_test_bit failed", __func__);
		return r;
	}

	BUG_ON(m->root != old);

	return r;
}

/*
 * Returns < 0 on error, 0 if the bit wasn't previously set, 1 if it was.
 */
static int writeset_test_and_set(struct dm_disk_bitset *info,
				 struct writeset *ws, uint32_t block)
{
	int r;

	if (!test_and_set_bit(block, ws->bits)) {
		r = dm_bitset_set_bit(info, ws->md.root, block, &ws->md.root);
		if (r) {
			/* FIXME: fail mode */
			return r;
		}

		return 0;
	}

	return 1;
}

/*----------------------------------------------------------------
 * On disk metadata layout
 *--------------------------------------------------------------*/
#define SPACE_MAP_ROOT_SIZE 128
#define UUID_LEN 16

struct writeset_disk {
	__le32 nr_bits;
	__le64 root;
} __packed;

struct superblock_disk {
	__le32 csum;
	__le32 flags;
	__le64 blocknr;

	__u8 uuid[UUID_LEN];
	__le64 magic;
	__le32 version;

	__u8 metadata_space_map_root[SPACE_MAP_ROOT_SIZE];

	__le32 data_block_size;
	__le32 metadata_block_size;
	__le32 nr_blocks;

	__le32 current_era;
	struct writeset_disk current_writeset;

	/*
	 * Only these two fields are valid within the metadata snapshot.
	 */
	__le64 writeset_tree_root;
	__le64 era_array_root;

	__le64 metadata_snap;
} __packed;

/*----------------------------------------------------------------
 * Superblock validation
 *--------------------------------------------------------------*/
static void sb_prepare_for_write(struct dm_block_validator *v,
				 struct dm_block *b,
				 size_t sb_block_size)
{
	struct superblock_disk *disk = dm_block_data(b);

	disk->blocknr = cpu_to_le64(dm_block_location(b));
	disk->csum = cpu_to_le32(dm_bm_checksum(&disk->flags,
						sb_block_size - sizeof(__le32),
						SUPERBLOCK_CSUM_XOR));
}

static int check_metadata_version(struct superblock_disk *disk)
{
	uint32_t metadata_version = le32_to_cpu(disk->version);
	if (metadata_version < MIN_ERA_VERSION || metadata_version > MAX_ERA_VERSION) {
		DMERR("Era metadata version %u found, but only versions between %u and %u supported.",
		      metadata_version, MIN_ERA_VERSION, MAX_ERA_VERSION);
		return -EINVAL;
	}

	return 0;
}

static int sb_check(struct dm_block_validator *v,
		    struct dm_block *b,
		    size_t sb_block_size)
{
	struct superblock_disk *disk = dm_block_data(b);
	__le32 csum_le;

	if (dm_block_location(b) != le64_to_cpu(disk->blocknr)) {
		DMERR("sb_check failed: blocknr %llu: wanted %llu",
		      le64_to_cpu(disk->blocknr),
		      (unsigned long long)dm_block_location(b));
		return -ENOTBLK;
	}

	if (le64_to_cpu(disk->magic) != SUPERBLOCK_MAGIC) {
		DMERR("sb_check failed: magic %llu: wanted %llu",
		      le64_to_cpu(disk->magic),
		      (unsigned long long) SUPERBLOCK_MAGIC);
		return -EILSEQ;
	}

	csum_le = cpu_to_le32(dm_bm_checksum(&disk->flags,
					     sb_block_size - sizeof(__le32),
					     SUPERBLOCK_CSUM_XOR));
	if (csum_le != disk->csum) {
		DMERR("sb_check failed: csum %u: wanted %u",
		      le32_to_cpu(csum_le), le32_to_cpu(disk->csum));
		return -EILSEQ;
	}

	return check_metadata_version(disk);
}

static struct dm_block_validator sb_validator = {
	.name = "superblock",
	.prepare_for_write = sb_prepare_for_write,
	.check = sb_check
};

/*----------------------------------------------------------------
 * Low level metadata handling
 *--------------------------------------------------------------*/
#define DM_ERA_METADATA_BLOCK_SIZE 4096
#define DM_ERA_METADATA_CACHE_SIZE 64
#define ERA_MAX_CONCURRENT_LOCKS 5

struct era_metadata {
	struct block_device *bdev;
	struct dm_block_manager *bm;
	struct dm_space_map *sm;
	struct dm_transaction_manager *tm;

	dm_block_t block_size;
	uint32_t nr_blocks;

	uint32_t current_era;

	/*
	 * We preallocate 2 writesets.  When an era rolls over we
	 * switch between them. This means the allocation is done at
	 * preresume time, rather than on the io path.
	 */
	struct writeset writesets[2];
	struct writeset *current_writeset;

	dm_block_t writeset_tree_root;
	dm_block_t era_array_root;

	struct dm_disk_bitset bitset_info;
	struct dm_btree_info writeset_tree_info;
	struct dm_array_info era_array_info;

	dm_block_t metadata_snap;

	/*
	 * A flag that is set whenever a writeset has been archived.
	 */
	bool archived_writesets;

	/*
	 * Reading the space map root can fail, so we read it into this
	 * buffer before the superblock is locked and updated.
296 + */ 297 + __u8 metadata_space_map_root[SPACE_MAP_ROOT_SIZE]; 298 + }; 299 + 300 + static int superblock_read_lock(struct era_metadata *md, 301 + struct dm_block **sblock) 302 + { 303 + return dm_bm_read_lock(md->bm, SUPERBLOCK_LOCATION, 304 + &sb_validator, sblock); 305 + } 306 + 307 + static int superblock_lock_zero(struct era_metadata *md, 308 + struct dm_block **sblock) 309 + { 310 + return dm_bm_write_lock_zero(md->bm, SUPERBLOCK_LOCATION, 311 + &sb_validator, sblock); 312 + } 313 + 314 + static int superblock_lock(struct era_metadata *md, 315 + struct dm_block **sblock) 316 + { 317 + return dm_bm_write_lock(md->bm, SUPERBLOCK_LOCATION, 318 + &sb_validator, sblock); 319 + } 320 + 321 + /* FIXME: duplication with cache and thin */ 322 + static int superblock_all_zeroes(struct dm_block_manager *bm, bool *result) 323 + { 324 + int r; 325 + unsigned i; 326 + struct dm_block *b; 327 + __le64 *data_le, zero = cpu_to_le64(0); 328 + unsigned sb_block_size = dm_bm_block_size(bm) / sizeof(__le64); 329 + 330 + /* 331 + * We can't use a validator here - it may be all zeroes. 
332 + */ 333 + r = dm_bm_read_lock(bm, SUPERBLOCK_LOCATION, NULL, &b); 334 + if (r) 335 + return r; 336 + 337 + data_le = dm_block_data(b); 338 + *result = true; 339 + for (i = 0; i < sb_block_size; i++) { 340 + if (data_le[i] != zero) { 341 + *result = false; 342 + break; 343 + } 344 + } 345 + 346 + return dm_bm_unlock(b); 347 + } 348 + 349 + /*----------------------------------------------------------------*/ 350 + 351 + static void ws_pack(const struct writeset_metadata *core, struct writeset_disk *disk) 352 + { 353 + disk->nr_bits = cpu_to_le32(core->nr_bits); 354 + disk->root = cpu_to_le64(core->root); 355 + } 356 + 357 + static void ws_unpack(const struct writeset_disk *disk, struct writeset_metadata *core) 358 + { 359 + core->nr_bits = le32_to_cpu(disk->nr_bits); 360 + core->root = le64_to_cpu(disk->root); 361 + } 362 + 363 + static void ws_inc(void *context, const void *value) 364 + { 365 + struct era_metadata *md = context; 366 + struct writeset_disk ws_d; 367 + dm_block_t b; 368 + 369 + memcpy(&ws_d, value, sizeof(ws_d)); 370 + b = le64_to_cpu(ws_d.root); 371 + 372 + dm_tm_inc(md->tm, b); 373 + } 374 + 375 + static void ws_dec(void *context, const void *value) 376 + { 377 + struct era_metadata *md = context; 378 + struct writeset_disk ws_d; 379 + dm_block_t b; 380 + 381 + memcpy(&ws_d, value, sizeof(ws_d)); 382 + b = le64_to_cpu(ws_d.root); 383 + 384 + dm_bitset_del(&md->bitset_info, b); 385 + } 386 + 387 + static int ws_eq(void *context, const void *value1, const void *value2) 388 + { 389 + return !memcmp(value1, value2, sizeof(struct writeset_metadata)); 390 + } 391 + 392 + /*----------------------------------------------------------------*/ 393 + 394 + static void setup_writeset_tree_info(struct era_metadata *md) 395 + { 396 + struct dm_btree_value_type *vt = &md->writeset_tree_info.value_type; 397 + md->writeset_tree_info.tm = md->tm; 398 + md->writeset_tree_info.levels = 1; 399 + vt->context = md; 400 + vt->size = sizeof(struct writeset_disk); 401 + 
vt->inc = ws_inc; 402 + vt->dec = ws_dec; 403 + vt->equal = ws_eq; 404 + } 405 + 406 + static void setup_era_array_info(struct era_metadata *md) 407 + 408 + { 409 + struct dm_btree_value_type vt; 410 + vt.context = NULL; 411 + vt.size = sizeof(__le32); 412 + vt.inc = NULL; 413 + vt.dec = NULL; 414 + vt.equal = NULL; 415 + 416 + dm_array_info_init(&md->era_array_info, md->tm, &vt); 417 + } 418 + 419 + static void setup_infos(struct era_metadata *md) 420 + { 421 + dm_disk_bitset_init(md->tm, &md->bitset_info); 422 + setup_writeset_tree_info(md); 423 + setup_era_array_info(md); 424 + } 425 + 426 + /*----------------------------------------------------------------*/ 427 + 428 + static int create_fresh_metadata(struct era_metadata *md) 429 + { 430 + int r; 431 + 432 + r = dm_tm_create_with_sm(md->bm, SUPERBLOCK_LOCATION, 433 + &md->tm, &md->sm); 434 + if (r < 0) { 435 + DMERR("dm_tm_create_with_sm failed"); 436 + return r; 437 + } 438 + 439 + setup_infos(md); 440 + 441 + r = dm_btree_empty(&md->writeset_tree_info, &md->writeset_tree_root); 442 + if (r) { 443 + DMERR("couldn't create new writeset tree"); 444 + goto bad; 445 + } 446 + 447 + r = dm_array_empty(&md->era_array_info, &md->era_array_root); 448 + if (r) { 449 + DMERR("couldn't create era array"); 450 + goto bad; 451 + } 452 + 453 + return 0; 454 + 455 + bad: 456 + dm_sm_destroy(md->sm); 457 + dm_tm_destroy(md->tm); 458 + 459 + return r; 460 + } 461 + 462 + static int save_sm_root(struct era_metadata *md) 463 + { 464 + int r; 465 + size_t metadata_len; 466 + 467 + r = dm_sm_root_size(md->sm, &metadata_len); 468 + if (r < 0) 469 + return r; 470 + 471 + return dm_sm_copy_root(md->sm, &md->metadata_space_map_root, 472 + metadata_len); 473 + } 474 + 475 + static void copy_sm_root(struct era_metadata *md, struct superblock_disk *disk) 476 + { 477 + memcpy(&disk->metadata_space_map_root, 478 + &md->metadata_space_map_root, 479 + sizeof(md->metadata_space_map_root)); 480 + } 481 + 482 + /* 483 + * Writes a superblock, 
including the static fields that don't get updated 484 + * with every commit (possible optimisation here). 'md' should be fully 485 + * constructed when this is called. 486 + */ 487 + static void prepare_superblock(struct era_metadata *md, struct superblock_disk *disk) 488 + { 489 + disk->magic = cpu_to_le64(SUPERBLOCK_MAGIC); 490 + disk->flags = cpu_to_le32(0ul); 491 + 492 + /* FIXME: can't keep blanking the uuid (uuid is currently unused though) */ 493 + memset(disk->uuid, 0, sizeof(disk->uuid)); 494 + disk->version = cpu_to_le32(MAX_ERA_VERSION); 495 + 496 + copy_sm_root(md, disk); 497 + 498 + disk->data_block_size = cpu_to_le32(md->block_size); 499 + disk->metadata_block_size = cpu_to_le32(DM_ERA_METADATA_BLOCK_SIZE >> SECTOR_SHIFT); 500 + disk->nr_blocks = cpu_to_le32(md->nr_blocks); 501 + disk->current_era = cpu_to_le32(md->current_era); 502 + 503 + ws_pack(&md->current_writeset->md, &disk->current_writeset); 504 + disk->writeset_tree_root = cpu_to_le64(md->writeset_tree_root); 505 + disk->era_array_root = cpu_to_le64(md->era_array_root); 506 + disk->metadata_snap = cpu_to_le64(md->metadata_snap); 507 + } 508 + 509 + static int write_superblock(struct era_metadata *md) 510 + { 511 + int r; 512 + struct dm_block *sblock; 513 + struct superblock_disk *disk; 514 + 515 + r = save_sm_root(md); 516 + if (r) { 517 + DMERR("%s: save_sm_root failed", __func__); 518 + return r; 519 + } 520 + 521 + r = superblock_lock_zero(md, &sblock); 522 + if (r) 523 + return r; 524 + 525 + disk = dm_block_data(sblock); 526 + prepare_superblock(md, disk); 527 + 528 + return dm_tm_commit(md->tm, sblock); 529 + } 530 + 531 + /* 532 + * Assumes block_size and the infos are set. 
533 + */ 534 + static int format_metadata(struct era_metadata *md) 535 + { 536 + int r; 537 + 538 + r = create_fresh_metadata(md); 539 + if (r) 540 + return r; 541 + 542 + r = write_superblock(md); 543 + if (r) { 544 + dm_sm_destroy(md->sm); 545 + dm_tm_destroy(md->tm); 546 + return r; 547 + } 548 + 549 + return 0; 550 + } 551 + 552 + static int open_metadata(struct era_metadata *md) 553 + { 554 + int r; 555 + struct dm_block *sblock; 556 + struct superblock_disk *disk; 557 + 558 + r = superblock_read_lock(md, &sblock); 559 + if (r) { 560 + DMERR("couldn't read_lock superblock"); 561 + return r; 562 + } 563 + 564 + disk = dm_block_data(sblock); 565 + r = dm_tm_open_with_sm(md->bm, SUPERBLOCK_LOCATION, 566 + disk->metadata_space_map_root, 567 + sizeof(disk->metadata_space_map_root), 568 + &md->tm, &md->sm); 569 + if (r) { 570 + DMERR("dm_tm_open_with_sm failed"); 571 + goto bad; 572 + } 573 + 574 + setup_infos(md); 575 + 576 + md->block_size = le32_to_cpu(disk->data_block_size); 577 + md->nr_blocks = le32_to_cpu(disk->nr_blocks); 578 + md->current_era = le32_to_cpu(disk->current_era); 579 + 580 + md->writeset_tree_root = le64_to_cpu(disk->writeset_tree_root); 581 + md->era_array_root = le64_to_cpu(disk->era_array_root); 582 + md->metadata_snap = le64_to_cpu(disk->metadata_snap); 583 + md->archived_writesets = true; 584 + 585 + return dm_bm_unlock(sblock); 586 + 587 + bad: 588 + dm_bm_unlock(sblock); 589 + return r; 590 + } 591 + 592 + static int open_or_format_metadata(struct era_metadata *md, 593 + bool may_format) 594 + { 595 + int r; 596 + bool unformatted = false; 597 + 598 + r = superblock_all_zeroes(md->bm, &unformatted); 599 + if (r) 600 + return r; 601 + 602 + if (unformatted) 603 + return may_format ? 
format_metadata(md) : -EPERM; 604 + 605 + return open_metadata(md); 606 + } 607 + 608 + static int create_persistent_data_objects(struct era_metadata *md, 609 + bool may_format) 610 + { 611 + int r; 612 + 613 + md->bm = dm_block_manager_create(md->bdev, DM_ERA_METADATA_BLOCK_SIZE, 614 + DM_ERA_METADATA_CACHE_SIZE, 615 + ERA_MAX_CONCURRENT_LOCKS); 616 + if (IS_ERR(md->bm)) { 617 + DMERR("could not create block manager"); 618 + return PTR_ERR(md->bm); 619 + } 620 + 621 + r = open_or_format_metadata(md, may_format); 622 + if (r) 623 + dm_block_manager_destroy(md->bm); 624 + 625 + return r; 626 + } 627 + 628 + static void destroy_persistent_data_objects(struct era_metadata *md) 629 + { 630 + dm_sm_destroy(md->sm); 631 + dm_tm_destroy(md->tm); 632 + dm_block_manager_destroy(md->bm); 633 + } 634 + 635 + /* 636 + * This waits until all era_map threads have picked up the new filter. 637 + */ 638 + static void swap_writeset(struct era_metadata *md, struct writeset *new_writeset) 639 + { 640 + rcu_assign_pointer(md->current_writeset, new_writeset); 641 + synchronize_rcu(); 642 + } 643 + 644 + /*---------------------------------------------------------------- 645 + * Writesets get 'digested' into the main era array. 646 + * 647 + * We're using a coroutine here so the worker thread can do the digestion, 648 + * thus avoiding synchronisation of the metadata. Digesting a whole 649 + * writeset in one go would cause too much latency. 
650 + *--------------------------------------------------------------*/ 651 + struct digest { 652 + uint32_t era; 653 + unsigned nr_bits, current_bit; 654 + struct writeset_metadata writeset; 655 + __le32 value; 656 + struct dm_disk_bitset info; 657 + 658 + int (*step)(struct era_metadata *, struct digest *); 659 + }; 660 + 661 + static int metadata_digest_lookup_writeset(struct era_metadata *md, 662 + struct digest *d); 663 + 664 + static int metadata_digest_remove_writeset(struct era_metadata *md, 665 + struct digest *d) 666 + { 667 + int r; 668 + uint64_t key = d->era; 669 + 670 + r = dm_btree_remove(&md->writeset_tree_info, md->writeset_tree_root, 671 + &key, &md->writeset_tree_root); 672 + if (r) { 673 + DMERR("%s: dm_btree_remove failed", __func__); 674 + return r; 675 + } 676 + 677 + d->step = metadata_digest_lookup_writeset; 678 + return 0; 679 + } 680 + 681 + #define INSERTS_PER_STEP 100 682 + 683 + static int metadata_digest_transcribe_writeset(struct era_metadata *md, 684 + struct digest *d) 685 + { 686 + int r; 687 + bool marked; 688 + unsigned b, e = min(d->current_bit + INSERTS_PER_STEP, d->nr_bits); 689 + 690 + for (b = d->current_bit; b < e; b++) { 691 + r = writeset_marked_on_disk(&d->info, &d->writeset, b, &marked); 692 + if (r) { 693 + DMERR("%s: writeset_marked_on_disk failed", __func__); 694 + return r; 695 + } 696 + 697 + if (!marked) 698 + continue; 699 + 700 + __dm_bless_for_disk(&d->value); 701 + r = dm_array_set_value(&md->era_array_info, md->era_array_root, 702 + b, &d->value, &md->era_array_root); 703 + if (r) { 704 + DMERR("%s: dm_array_set_value failed", __func__); 705 + return r; 706 + } 707 + } 708 + 709 + if (b == d->nr_bits) 710 + d->step = metadata_digest_remove_writeset; 711 + else 712 + d->current_bit = b; 713 + 714 + return 0; 715 + } 716 + 717 + static int metadata_digest_lookup_writeset(struct era_metadata *md, 718 + struct digest *d) 719 + { 720 + int r; 721 + uint64_t key; 722 + struct writeset_disk disk; 723 + 724 + r = 
dm_btree_find_lowest_key(&md->writeset_tree_info, 725 + md->writeset_tree_root, &key); 726 + if (r < 0) 727 + return r; 728 + 729 + d->era = key; 730 + 731 + r = dm_btree_lookup(&md->writeset_tree_info, 732 + md->writeset_tree_root, &key, &disk); 733 + if (r) { 734 + if (r == -ENODATA) { 735 + d->step = NULL; 736 + return 0; 737 + } 738 + 739 + DMERR("%s: dm_btree_lookup failed", __func__); 740 + return r; 741 + } 742 + 743 + ws_unpack(&disk, &d->writeset); 744 + d->value = cpu_to_le32(key); 745 + 746 + d->nr_bits = min(d->writeset.nr_bits, md->nr_blocks); 747 + d->current_bit = 0; 748 + d->step = metadata_digest_transcribe_writeset; 749 + 750 + return 0; 751 + } 752 + 753 + static int metadata_digest_start(struct era_metadata *md, struct digest *d) 754 + { 755 + if (d->step) 756 + return 0; 757 + 758 + memset(d, 0, sizeof(*d)); 759 + 760 + /* 761 + * We initialise another bitset info to avoid any caching side 762 + * effects with the previous one. 763 + */ 764 + dm_disk_bitset_init(md->tm, &d->info); 765 + d->step = metadata_digest_lookup_writeset; 766 + 767 + return 0; 768 + } 769 + 770 + /*---------------------------------------------------------------- 771 + * High level metadata interface. Target methods should use these, and not 772 + * the lower level ones. 
773 + *--------------------------------------------------------------*/ 774 + static struct era_metadata *metadata_open(struct block_device *bdev, 775 + sector_t block_size, 776 + bool may_format) 777 + { 778 + int r; 779 + struct era_metadata *md = kzalloc(sizeof(*md), GFP_KERNEL); 780 + 781 + if (!md) 782 + return NULL; 783 + 784 + md->bdev = bdev; 785 + md->block_size = block_size; 786 + 787 + md->writesets[0].md.root = INVALID_WRITESET_ROOT; 788 + md->writesets[1].md.root = INVALID_WRITESET_ROOT; 789 + md->current_writeset = &md->writesets[0]; 790 + 791 + r = create_persistent_data_objects(md, may_format); 792 + if (r) { 793 + kfree(md); 794 + return ERR_PTR(r); 795 + } 796 + 797 + return md; 798 + } 799 + 800 + static void metadata_close(struct era_metadata *md) 801 + { 802 + destroy_persistent_data_objects(md); 803 + kfree(md); 804 + } 805 + 806 + static bool valid_nr_blocks(dm_block_t n) 807 + { 808 + /* 809 + * dm_bitset restricts us to 2^32. test_bit & co. restrict us 810 + * further to 2^31 - 1 811 + */ 812 + return n < (1ull << 31); 813 + } 814 + 815 + static int metadata_resize(struct era_metadata *md, void *arg) 816 + { 817 + int r; 818 + dm_block_t *new_size = arg; 819 + __le32 value; 820 + 821 + if (!valid_nr_blocks(*new_size)) { 822 + DMERR("Invalid number of origin blocks %llu", 823 + (unsigned long long) *new_size); 824 + return -EINVAL; 825 + } 826 + 827 + writeset_free(&md->writesets[0]); 828 + writeset_free(&md->writesets[1]); 829 + 830 + r = writeset_alloc(&md->writesets[0], *new_size); 831 + if (r) { 832 + DMERR("%s: writeset_alloc failed for writeset 0", __func__); 833 + return r; 834 + } 835 + 836 + r = writeset_alloc(&md->writesets[1], *new_size); 837 + if (r) { 838 + DMERR("%s: writeset_alloc failed for writeset 1", __func__); 839 + return r; 840 + } 841 + 842 + value = cpu_to_le32(0u); 843 + __dm_bless_for_disk(&value); 844 + r = dm_array_resize(&md->era_array_info, md->era_array_root, 845 + md->nr_blocks, *new_size, 846 + &value, 
&md->era_array_root); 847 + if (r) { 848 + DMERR("%s: dm_array_resize failed", __func__); 849 + return r; 850 + } 851 + 852 + md->nr_blocks = *new_size; 853 + return 0; 854 + } 855 + 856 + static int metadata_era_archive(struct era_metadata *md) 857 + { 858 + int r; 859 + uint64_t keys[1]; 860 + struct writeset_disk value; 861 + 862 + r = dm_bitset_flush(&md->bitset_info, md->current_writeset->md.root, 863 + &md->current_writeset->md.root); 864 + if (r) { 865 + DMERR("%s: dm_bitset_flush failed", __func__); 866 + return r; 867 + } 868 + 869 + ws_pack(&md->current_writeset->md, &value); 870 + md->current_writeset->md.root = INVALID_WRITESET_ROOT; 871 + 872 + keys[0] = md->current_era; 873 + __dm_bless_for_disk(&value); 874 + r = dm_btree_insert(&md->writeset_tree_info, md->writeset_tree_root, 875 + keys, &value, &md->writeset_tree_root); 876 + if (r) { 877 + DMERR("%s: couldn't insert writeset into btree", __func__); 878 + /* FIXME: fail mode */ 879 + return r; 880 + } 881 + 882 + md->archived_writesets = true; 883 + 884 + return 0; 885 + } 886 + 887 + static struct writeset *next_writeset(struct era_metadata *md) 888 + { 889 + return (md->current_writeset == &md->writesets[0]) ? 890 + &md->writesets[1] : &md->writesets[0]; 891 + } 892 + 893 + static int metadata_new_era(struct era_metadata *md) 894 + { 895 + int r; 896 + struct writeset *new_writeset = next_writeset(md); 897 + 898 + r = writeset_init(&md->bitset_info, new_writeset); 899 + if (r) { 900 + DMERR("%s: writeset_init failed", __func__); 901 + return r; 902 + } 903 + 904 + swap_writeset(md, new_writeset); 905 + md->current_era++; 906 + 907 + return 0; 908 + } 909 + 910 + static int metadata_era_rollover(struct era_metadata *md) 911 + { 912 + int r; 913 + 914 + if (md->current_writeset->md.root != INVALID_WRITESET_ROOT) { 915 + r = metadata_era_archive(md); 916 + if (r) { 917 + DMERR("%s: metadata_archive_era failed", __func__); 918 + /* FIXME: fail mode? 
*/ 919 + return r; 920 + } 921 + } 922 + 923 + r = metadata_new_era(md); 924 + if (r) { 925 + DMERR("%s: new era failed", __func__); 926 + /* FIXME: fail mode */ 927 + return r; 928 + } 929 + 930 + return 0; 931 + } 932 + 933 + static bool metadata_current_marked(struct era_metadata *md, dm_block_t block) 934 + { 935 + bool r; 936 + struct writeset *ws; 937 + 938 + rcu_read_lock(); 939 + ws = rcu_dereference(md->current_writeset); 940 + r = writeset_marked(ws, block); 941 + rcu_read_unlock(); 942 + 943 + return r; 944 + } 945 + 946 + static int metadata_commit(struct era_metadata *md) 947 + { 948 + int r; 949 + struct dm_block *sblock; 950 + 951 + if (md->current_writeset->md.root != SUPERBLOCK_LOCATION) { 952 + r = dm_bitset_flush(&md->bitset_info, md->current_writeset->md.root, 953 + &md->current_writeset->md.root); 954 + if (r) { 955 + DMERR("%s: bitset flush failed", __func__); 956 + return r; 957 + } 958 + } 959 + 960 + r = save_sm_root(md); 961 + if (r) { 962 + DMERR("%s: save_sm_root failed", __func__); 963 + return r; 964 + } 965 + 966 + r = dm_tm_pre_commit(md->tm); 967 + if (r) { 968 + DMERR("%s: pre commit failed", __func__); 969 + return r; 970 + } 971 + 972 + r = superblock_lock(md, &sblock); 973 + if (r) { 974 + DMERR("%s: superblock lock failed", __func__); 975 + return r; 976 + } 977 + 978 + prepare_superblock(md, dm_block_data(sblock)); 979 + 980 + return dm_tm_commit(md->tm, sblock); 981 + } 982 + 983 + static int metadata_checkpoint(struct era_metadata *md) 984 + { 985 + /* 986 + * For now we just rollover, but later I want to put a check in to 987 + * avoid this if the filter is still pretty fresh. 988 + */ 989 + return metadata_era_rollover(md); 990 + } 991 + 992 + /* 993 + * Metadata snapshots allow userland to access era data. 
994 + */ 995 + static int metadata_take_snap(struct era_metadata *md) 996 + { 997 + int r, inc; 998 + struct dm_block *clone; 999 + 1000 + if (md->metadata_snap != SUPERBLOCK_LOCATION) { 1001 + DMERR("%s: metadata snapshot already exists", __func__); 1002 + return -EINVAL; 1003 + } 1004 + 1005 + r = metadata_era_rollover(md); 1006 + if (r) { 1007 + DMERR("%s: era rollover failed", __func__); 1008 + return r; 1009 + } 1010 + 1011 + r = metadata_commit(md); 1012 + if (r) { 1013 + DMERR("%s: pre commit failed", __func__); 1014 + return r; 1015 + } 1016 + 1017 + r = dm_sm_inc_block(md->sm, SUPERBLOCK_LOCATION); 1018 + if (r) { 1019 + DMERR("%s: couldn't increment superblock", __func__); 1020 + return r; 1021 + } 1022 + 1023 + r = dm_tm_shadow_block(md->tm, SUPERBLOCK_LOCATION, 1024 + &sb_validator, &clone, &inc); 1025 + if (r) { 1026 + DMERR("%s: couldn't shadow superblock", __func__); 1027 + dm_sm_dec_block(md->sm, SUPERBLOCK_LOCATION); 1028 + return r; 1029 + } 1030 + BUG_ON(!inc); 1031 + 1032 + r = dm_sm_inc_block(md->sm, md->writeset_tree_root); 1033 + if (r) { 1034 + DMERR("%s: couldn't inc writeset tree root", __func__); 1035 + dm_tm_unlock(md->tm, clone); 1036 + return r; 1037 + } 1038 + 1039 + r = dm_sm_inc_block(md->sm, md->era_array_root); 1040 + if (r) { 1041 + DMERR("%s: couldn't inc era tree root", __func__); 1042 + dm_sm_dec_block(md->sm, md->writeset_tree_root); 1043 + dm_tm_unlock(md->tm, clone); 1044 + return r; 1045 + } 1046 + 1047 + md->metadata_snap = dm_block_location(clone); 1048 + 1049 + r = dm_tm_unlock(md->tm, clone); 1050 + if (r) { 1051 + DMERR("%s: couldn't unlock clone", __func__); 1052 + md->metadata_snap = SUPERBLOCK_LOCATION; 1053 + return r; 1054 + } 1055 + 1056 + return 0; 1057 + } 1058 + 1059 + static int metadata_drop_snap(struct era_metadata *md) 1060 + { 1061 + int r; 1062 + dm_block_t location; 1063 + struct dm_block *clone; 1064 + struct superblock_disk *disk; 1065 + 1066 + if (md->metadata_snap == SUPERBLOCK_LOCATION) { 1067 + 
DMERR("%s: no snap to drop", __func__); 1068 + return -EINVAL; 1069 + } 1070 + 1071 + r = dm_tm_read_lock(md->tm, md->metadata_snap, &sb_validator, &clone); 1072 + if (r) { 1073 + DMERR("%s: couldn't read lock superblock clone", __func__); 1074 + return r; 1075 + } 1076 + 1077 + /* 1078 + * Whatever happens now we'll commit with no record of the metadata 1079 + * snap. 1080 + */ 1081 + md->metadata_snap = SUPERBLOCK_LOCATION; 1082 + 1083 + disk = dm_block_data(clone); 1084 + r = dm_btree_del(&md->writeset_tree_info, 1085 + le64_to_cpu(disk->writeset_tree_root)); 1086 + if (r) { 1087 + DMERR("%s: error deleting writeset tree clone", __func__); 1088 + dm_tm_unlock(md->tm, clone); 1089 + return r; 1090 + } 1091 + 1092 + r = dm_array_del(&md->era_array_info, le64_to_cpu(disk->era_array_root)); 1093 + if (r) { 1094 + DMERR("%s: error deleting era array clone", __func__); 1095 + dm_tm_unlock(md->tm, clone); 1096 + return r; 1097 + } 1098 + 1099 + location = dm_block_location(clone); 1100 + dm_tm_unlock(md->tm, clone); 1101 + 1102 + return dm_sm_dec_block(md->sm, location); 1103 + } 1104 + 1105 + struct metadata_stats { 1106 + dm_block_t used; 1107 + dm_block_t total; 1108 + dm_block_t snap; 1109 + uint32_t era; 1110 + }; 1111 + 1112 + static int metadata_get_stats(struct era_metadata *md, void *ptr) 1113 + { 1114 + int r; 1115 + struct metadata_stats *s = ptr; 1116 + dm_block_t nr_free, nr_total; 1117 + 1118 + r = dm_sm_get_nr_free(md->sm, &nr_free); 1119 + if (r) { 1120 + DMERR("dm_sm_get_nr_free returned %d", r); 1121 + return r; 1122 + } 1123 + 1124 + r = dm_sm_get_nr_blocks(md->sm, &nr_total); 1125 + if (r) { 1126 + DMERR("dm_pool_get_metadata_dev_size returned %d", r); 1127 + return r; 1128 + } 1129 + 1130 + s->used = nr_total - nr_free; 1131 + s->total = nr_total; 1132 + s->snap = md->metadata_snap; 1133 + s->era = md->current_era; 1134 + 1135 + return 0; 1136 + } 1137 + 1138 + /*----------------------------------------------------------------*/ 1139 + 1140 + 
struct era { 1141 + struct dm_target *ti; 1142 + struct dm_target_callbacks callbacks; 1143 + 1144 + struct dm_dev *metadata_dev; 1145 + struct dm_dev *origin_dev; 1146 + 1147 + dm_block_t nr_blocks; 1148 + uint32_t sectors_per_block; 1149 + int sectors_per_block_shift; 1150 + struct era_metadata *md; 1151 + 1152 + struct workqueue_struct *wq; 1153 + struct work_struct worker; 1154 + 1155 + spinlock_t deferred_lock; 1156 + struct bio_list deferred_bios; 1157 + 1158 + spinlock_t rpc_lock; 1159 + struct list_head rpc_calls; 1160 + 1161 + struct digest digest; 1162 + atomic_t suspended; 1163 + }; 1164 + 1165 + struct rpc { 1166 + struct list_head list; 1167 + 1168 + int (*fn0)(struct era_metadata *); 1169 + int (*fn1)(struct era_metadata *, void *); 1170 + void *arg; 1171 + int result; 1172 + 1173 + struct completion complete; 1174 + }; 1175 + 1176 + /*---------------------------------------------------------------- 1177 + * Remapping. 1178 + *---------------------------------------------------------------*/ 1179 + static bool block_size_is_power_of_two(struct era *era) 1180 + { 1181 + return era->sectors_per_block_shift >= 0; 1182 + } 1183 + 1184 + static dm_block_t get_block(struct era *era, struct bio *bio) 1185 + { 1186 + sector_t block_nr = bio->bi_iter.bi_sector; 1187 + 1188 + if (!block_size_is_power_of_two(era)) 1189 + (void) sector_div(block_nr, era->sectors_per_block); 1190 + else 1191 + block_nr >>= era->sectors_per_block_shift; 1192 + 1193 + return block_nr; 1194 + } 1195 + 1196 + static void remap_to_origin(struct era *era, struct bio *bio) 1197 + { 1198 + bio->bi_bdev = era->origin_dev->bdev; 1199 + } 1200 + 1201 + /*---------------------------------------------------------------- 1202 + * Worker thread 1203 + *--------------------------------------------------------------*/ 1204 + static void wake_worker(struct era *era) 1205 + { 1206 + if (!atomic_read(&era->suspended)) 1207 + queue_work(era->wq, &era->worker); 1208 + } 1209 + 1210 + static void 
process_old_eras(struct era *era) 1211 + { 1212 + int r; 1213 + 1214 + if (!era->digest.step) 1215 + return; 1216 + 1217 + r = era->digest.step(era->md, &era->digest); 1218 + if (r < 0) { 1219 + DMERR("%s: digest step failed, stopping digestion", __func__); 1220 + era->digest.step = NULL; 1221 + 1222 + } else if (era->digest.step) 1223 + wake_worker(era); 1224 + } 1225 + 1226 + static void process_deferred_bios(struct era *era) 1227 + { 1228 + int r; 1229 + struct bio_list deferred_bios, marked_bios; 1230 + struct bio *bio; 1231 + bool commit_needed = false; 1232 + bool failed = false; 1233 + 1234 + bio_list_init(&deferred_bios); 1235 + bio_list_init(&marked_bios); 1236 + 1237 + spin_lock(&era->deferred_lock); 1238 + bio_list_merge(&deferred_bios, &era->deferred_bios); 1239 + bio_list_init(&era->deferred_bios); 1240 + spin_unlock(&era->deferred_lock); 1241 + 1242 + while ((bio = bio_list_pop(&deferred_bios))) { 1243 + r = writeset_test_and_set(&era->md->bitset_info, 1244 + era->md->current_writeset, 1245 + get_block(era, bio)); 1246 + if (r < 0) { 1247 + /* 1248 + * This is bad news, we need to rollback. 1249 + * FIXME: finish. 
```diff
+             */
+            failed = true;
+
+        } else if (r == 0)
+            commit_needed = true;
+
+        bio_list_add(&marked_bios, bio);
+    }
+
+    if (commit_needed) {
+        r = metadata_commit(era->md);
+        if (r)
+            failed = true;
+    }
+
+    if (failed)
+        while ((bio = bio_list_pop(&marked_bios)))
+            bio_io_error(bio);
+    else
+        while ((bio = bio_list_pop(&marked_bios)))
+            generic_make_request(bio);
+}
+
+static void process_rpc_calls(struct era *era)
+{
+    int r;
+    bool need_commit = false;
+    struct list_head calls;
+    struct rpc *rpc, *tmp;
+
+    INIT_LIST_HEAD(&calls);
+    spin_lock(&era->rpc_lock);
+    list_splice_init(&era->rpc_calls, &calls);
+    spin_unlock(&era->rpc_lock);
+
+    list_for_each_entry_safe(rpc, tmp, &calls, list) {
+        rpc->result = rpc->fn0 ? rpc->fn0(era->md) : rpc->fn1(era->md, rpc->arg);
+        need_commit = true;
+    }
+
+    if (need_commit) {
+        r = metadata_commit(era->md);
+        if (r)
+            list_for_each_entry_safe(rpc, tmp, &calls, list)
+                rpc->result = r;
+    }
+
+    list_for_each_entry_safe(rpc, tmp, &calls, list)
+        complete(&rpc->complete);
+}
+
+static void kick_off_digest(struct era *era)
+{
+    if (era->md->archived_writesets) {
+        era->md->archived_writesets = false;
+        metadata_digest_start(era->md, &era->digest);
+    }
+}
+
+static void do_work(struct work_struct *ws)
+{
+    struct era *era = container_of(ws, struct era, worker);
+
+    kick_off_digest(era);
+    process_old_eras(era);
+    process_deferred_bios(era);
+    process_rpc_calls(era);
+}
+
+static void defer_bio(struct era *era, struct bio *bio)
+{
+    spin_lock(&era->deferred_lock);
+    bio_list_add(&era->deferred_bios, bio);
+    spin_unlock(&era->deferred_lock);
+
+    wake_worker(era);
+}
+
+/*
+ * Make an rpc call to the worker to change the metadata.
+ */
+static int perform_rpc(struct era *era, struct rpc *rpc)
+{
+    rpc->result = 0;
+    init_completion(&rpc->complete);
+
+    spin_lock(&era->rpc_lock);
+    list_add(&rpc->list, &era->rpc_calls);
+    spin_unlock(&era->rpc_lock);
+
+    wake_worker(era);
+    wait_for_completion(&rpc->complete);
+
+    return rpc->result;
+}
+
+static int in_worker0(struct era *era, int (*fn)(struct era_metadata *))
+{
+    struct rpc rpc;
+    rpc.fn0 = fn;
+    rpc.fn1 = NULL;
+
+    return perform_rpc(era, &rpc);
+}
+
+static int in_worker1(struct era *era,
+                      int (*fn)(struct era_metadata *, void *), void *arg)
+{
+    struct rpc rpc;
+    rpc.fn0 = NULL;
+    rpc.fn1 = fn;
+    rpc.arg = arg;
+
+    return perform_rpc(era, &rpc);
+}
+
+static void start_worker(struct era *era)
+{
+    atomic_set(&era->suspended, 0);
+}
+
+static void stop_worker(struct era *era)
+{
+    atomic_set(&era->suspended, 1);
+    flush_workqueue(era->wq);
+}
+
+/*----------------------------------------------------------------
+ * Target methods
+ *--------------------------------------------------------------*/
+static int dev_is_congested(struct dm_dev *dev, int bdi_bits)
+{
+    struct request_queue *q = bdev_get_queue(dev->bdev);
+    return bdi_congested(&q->backing_dev_info, bdi_bits);
+}
+
+static int era_is_congested(struct dm_target_callbacks *cb, int bdi_bits)
+{
+    struct era *era = container_of(cb, struct era, callbacks);
+    return dev_is_congested(era->origin_dev, bdi_bits);
+}
+
+static void era_destroy(struct era *era)
+{
+    metadata_close(era->md);
+
+    if (era->wq)
+        destroy_workqueue(era->wq);
+
+    if (era->origin_dev)
+        dm_put_device(era->ti, era->origin_dev);
+
+    if (era->metadata_dev)
+        dm_put_device(era->ti, era->metadata_dev);
+
+    kfree(era);
+}
+
+static dm_block_t calc_nr_blocks(struct era *era)
+{
+    return dm_sector_div_up(era->ti->len, era->sectors_per_block);
+}
+
+static bool valid_block_size(dm_block_t block_size)
+{
+    bool greater_than_zero = block_size > 0;
+    bool multiple_of_min_block_size = (block_size & (MIN_BLOCK_SIZE - 1)) == 0;
+
+    return greater_than_zero && multiple_of_min_block_size;
+}
+
+/*
+ * <metadata dev> <data dev> <data block size (sectors)>
+ */
+static int era_ctr(struct dm_target *ti, unsigned argc, char **argv)
+{
+    int r;
+    char dummy;
+    struct era *era;
+    struct era_metadata *md;
+
+    if (argc != 3) {
+        ti->error = "Invalid argument count";
+        return -EINVAL;
+    }
+
+    era = kzalloc(sizeof(*era), GFP_KERNEL);
+    if (!era) {
+        ti->error = "Error allocating era structure";
+        return -ENOMEM;
+    }
+
+    era->ti = ti;
+
+    r = dm_get_device(ti, argv[0], FMODE_READ | FMODE_WRITE, &era->metadata_dev);
+    if (r) {
+        ti->error = "Error opening metadata device";
+        era_destroy(era);
+        return -EINVAL;
+    }
+
+    r = dm_get_device(ti, argv[1], FMODE_READ | FMODE_WRITE, &era->origin_dev);
+    if (r) {
+        ti->error = "Error opening data device";
+        era_destroy(era);
+        return -EINVAL;
+    }
+
+    r = sscanf(argv[2], "%u%c", &era->sectors_per_block, &dummy);
+    if (r != 1) {
+        ti->error = "Error parsing block size";
+        era_destroy(era);
+        return -EINVAL;
+    }
+
+    r = dm_set_target_max_io_len(ti, era->sectors_per_block);
+    if (r) {
+        ti->error = "could not set max io len";
+        era_destroy(era);
+        return -EINVAL;
+    }
+
+    if (!valid_block_size(era->sectors_per_block)) {
+        ti->error = "Invalid block size";
+        era_destroy(era);
+        return -EINVAL;
+    }
+    if (era->sectors_per_block & (era->sectors_per_block - 1))
+        era->sectors_per_block_shift = -1;
+    else
+        era->sectors_per_block_shift = __ffs(era->sectors_per_block);
+
+    md = metadata_open(era->metadata_dev->bdev, era->sectors_per_block, true);
+    if (IS_ERR(md)) {
+        ti->error = "Error reading metadata";
+        era_destroy(era);
+        return PTR_ERR(md);
+    }
+    era->md = md;
+
+    era->nr_blocks = calc_nr_blocks(era);
+
+    r = metadata_resize(era->md, &era->nr_blocks);
+    if (r) {
+        ti->error = "couldn't resize metadata";
+        era_destroy(era);
+        return -ENOMEM;
+    }
+
+    era->wq = alloc_ordered_workqueue("dm-" DM_MSG_PREFIX, WQ_MEM_RECLAIM);
+    if (!era->wq) {
+        ti->error = "could not create workqueue for metadata object";
+        era_destroy(era);
+        return -ENOMEM;
+    }
+    INIT_WORK(&era->worker, do_work);
+
+    spin_lock_init(&era->deferred_lock);
+    bio_list_init(&era->deferred_bios);
+
+    spin_lock_init(&era->rpc_lock);
+    INIT_LIST_HEAD(&era->rpc_calls);
+
+    ti->private = era;
+    ti->num_flush_bios = 1;
+    ti->flush_supported = true;
+
+    ti->num_discard_bios = 1;
+    ti->discards_supported = true;
+    era->callbacks.congested_fn = era_is_congested;
+    dm_table_add_target_callbacks(ti->table, &era->callbacks);
+
+    return 0;
+}
+
+static void era_dtr(struct dm_target *ti)
+{
+    era_destroy(ti->private);
+}
+
+static int era_map(struct dm_target *ti, struct bio *bio)
+{
+    struct era *era = ti->private;
+    dm_block_t block = get_block(era, bio);
+
+    /*
+     * All bios get remapped to the origin device.  We do this now, but
+     * it may not get issued until later.  Depending on whether the
+     * block is marked in this era.
+     */
+    remap_to_origin(era, bio);
+
+    /*
+     * REQ_FLUSH bios carry no data, so we're not interested in them.
+     */
+    if (!(bio->bi_rw & REQ_FLUSH) &&
+        (bio_data_dir(bio) == WRITE) &&
+        !metadata_current_marked(era->md, block)) {
+        defer_bio(era, bio);
+        return DM_MAPIO_SUBMITTED;
+    }
+
+    return DM_MAPIO_REMAPPED;
+}
+
+static void era_postsuspend(struct dm_target *ti)
+{
+    int r;
+    struct era *era = ti->private;
+
+    r = in_worker0(era, metadata_era_archive);
+    if (r) {
+        DMERR("%s: couldn't archive current era", __func__);
+        /* FIXME: fail mode */
+    }
+
+    stop_worker(era);
+}
+
+static int era_preresume(struct dm_target *ti)
+{
+    int r;
+    struct era *era = ti->private;
+    dm_block_t new_size = calc_nr_blocks(era);
+
+    if (era->nr_blocks != new_size) {
+        r = in_worker1(era, metadata_resize, &new_size);
+        if (r)
+            return r;
+
+        era->nr_blocks = new_size;
+    }
+
+    start_worker(era);
+
+    r = in_worker0(era, metadata_new_era);
+    if (r) {
+        DMERR("%s: metadata_era_rollover failed", __func__);
+        return r;
+    }
+
+    return 0;
+}
+
+/*
+ * Status format:
+ *
+ * <metadata block size> <#used metadata blocks>/<#total metadata blocks>
+ * <current era> <held metadata root | '-'>
+ */
+static void era_status(struct dm_target *ti, status_type_t type,
+                       unsigned status_flags, char *result, unsigned maxlen)
+{
+    int r;
+    struct era *era = ti->private;
+    ssize_t sz = 0;
+    struct metadata_stats stats;
+    char buf[BDEVNAME_SIZE];
+
+    switch (type) {
+    case STATUSTYPE_INFO:
+        r = in_worker1(era, metadata_get_stats, &stats);
+        if (r)
+            goto err;
+
+        DMEMIT("%u %llu/%llu %u",
+               (unsigned) (DM_ERA_METADATA_BLOCK_SIZE >> SECTOR_SHIFT),
+               (unsigned long long) stats.used,
+               (unsigned long long) stats.total,
+               (unsigned) stats.era);
+
+        if (stats.snap != SUPERBLOCK_LOCATION)
+            DMEMIT(" %llu", stats.snap);
+        else
+            DMEMIT(" -");
+        break;
+
+    case STATUSTYPE_TABLE:
+        format_dev_t(buf, era->metadata_dev->bdev->bd_dev);
+        DMEMIT("%s ", buf);
+        format_dev_t(buf, era->origin_dev->bdev->bd_dev);
+        DMEMIT("%s %u", buf, era->sectors_per_block);
+        break;
+    }
+
+    return;
+
+err:
+    DMEMIT("Error");
+}
+
+static int era_message(struct dm_target *ti, unsigned argc, char **argv)
+{
+    struct era *era = ti->private;
+
+    if (argc != 1) {
+        DMERR("incorrect number of message arguments");
+        return -EINVAL;
+    }
+
+    if (!strcasecmp(argv[0], "checkpoint"))
+        return in_worker0(era, metadata_checkpoint);
+
+    if (!strcasecmp(argv[0], "take_metadata_snap"))
+        return in_worker0(era, metadata_take_snap);
+
+    if (!strcasecmp(argv[0], "drop_metadata_snap"))
+        return in_worker0(era, metadata_drop_snap);
+
+    DMERR("unsupported message '%s'", argv[0]);
+    return -EINVAL;
+}
+
+static sector_t get_dev_size(struct dm_dev *dev)
+{
+    return i_size_read(dev->bdev->bd_inode) >> SECTOR_SHIFT;
+}
+
+static int era_iterate_devices(struct dm_target *ti,
+                               iterate_devices_callout_fn fn, void *data)
+{
+    struct era *era = ti->private;
+    return fn(ti, era->origin_dev, 0, get_dev_size(era->origin_dev), data);
+}
+
+static int era_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
+                     struct bio_vec *biovec, int max_size)
+{
+    struct era *era = ti->private;
+    struct request_queue *q = bdev_get_queue(era->origin_dev->bdev);
+
+    if (!q->merge_bvec_fn)
+        return max_size;
+
+    bvm->bi_bdev = era->origin_dev->bdev;
+
+    return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
+}
+
+static void era_io_hints(struct dm_target *ti, struct queue_limits *limits)
+{
+    struct era *era = ti->private;
+    uint64_t io_opt_sectors = limits->io_opt >> SECTOR_SHIFT;
+
+    /*
+     * If the system-determined stacked limits are compatible with the
+     * era device's blocksize (io_opt is a factor) do not override them.
+     */
+    if (io_opt_sectors < era->sectors_per_block ||
+        do_div(io_opt_sectors, era->sectors_per_block)) {
+        blk_limits_io_min(limits, 0);
+        blk_limits_io_opt(limits, era->sectors_per_block << SECTOR_SHIFT);
+    }
+}
+
+/*----------------------------------------------------------------*/
+
+static struct target_type era_target = {
+    .name = "era",
+    .version = {1, 0, 0},
+    .module = THIS_MODULE,
+    .ctr = era_ctr,
+    .dtr = era_dtr,
+    .map = era_map,
+    .postsuspend = era_postsuspend,
+    .preresume = era_preresume,
+    .status = era_status,
+    .message = era_message,
+    .iterate_devices = era_iterate_devices,
+    .merge = era_merge,
+    .io_hints = era_io_hints
+};
+
+static int __init dm_era_init(void)
+{
+    int r;
+
+    r = dm_register_target(&era_target);
+    if (r) {
+        DMERR("era target registration failed: %d", r);
+        return r;
+    }
+
+    return 0;
+}
+
+static void __exit dm_era_exit(void)
+{
+    dm_unregister_target(&era_target);
+}
+
+module_init(dm_era_init);
+module_exit(dm_era_exit);
+
+MODULE_DESCRIPTION(DM_NAME " era target");
+MODULE_AUTHOR("Joe Thornber <ejt@redhat.com>");
+MODULE_LICENSE("GPL");
```
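The behaviour the new target provides — one writeset per era, archived at each checkpoint and queryable afterwards to find which blocks changed — can be modelled outside the kernel. Below is a minimal Python sketch of that bookkeeping; all names are hypothetical illustrations, not the kernel or dmsetup interface:

```python
class EraTracker:
    """Toy model of dm-era bookkeeping: one writeset per era,
    archived on checkpoint (hypothetical names, not the kernel API)."""

    def __init__(self, nr_blocks):
        self.nr_blocks = nr_blocks
        self.current_era = 1
        self.current_writeset = set()   # blocks written in the current era
        self.archive = {}               # era number -> frozen writeset

    def write(self, block):
        # Analogue of marking the block in the current era's bitset
        self.current_writeset.add(block)

    def checkpoint(self):
        # Archive the current writeset and start a new era
        self.archive[self.current_era] = frozenset(self.current_writeset)
        self.current_era += 1
        self.current_writeset = set()

    def blocks_changed_since(self, era):
        """Blocks written in any era >= 'era', e.g. for incremental
        backup or partial cache invalidation after a snapshot rollback."""
        changed = set(self.current_writeset)
        for e, ws in self.archive.items():
            if e >= era:
                changed |= ws
        return changed
```

This is only the query model; the kernel additionally digests archived writesets into a per-block "era written" array and commits everything through the persistent-data transaction manager.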
drivers/md/dm-mpath.c (+77 -142)
```diff
...
     unsigned pg_init_count;        /* Number of times pg_init called */
     unsigned pg_init_delay_msecs;  /* Number of msecs before pg_init retry */
 
-    unsigned queue_size;
-    struct work_struct process_queued_ios;
-    struct list_head queued_ios;
-
     struct work_struct trigger_event;
 
...
 static struct kmem_cache *_mpio_cache;
 
 static struct workqueue_struct *kmultipathd, *kmpath_handlerd;
-static void process_queued_ios(struct work_struct *work);
 static void trigger_event(struct work_struct *work);
 static void activate_path(struct work_struct *work);
+static int __pgpath_busy(struct pgpath *pgpath);
 
 
 /*-----------------------------------------------
...
     m = kzalloc(sizeof(*m), GFP_KERNEL);
     if (m) {
         INIT_LIST_HEAD(&m->priority_groups);
-        INIT_LIST_HEAD(&m->queued_ios);
         spin_lock_init(&m->lock);
         m->queue_io = 1;
         m->pg_init_delay_msecs = DM_PG_INIT_DELAY_DEFAULT;
-        INIT_WORK(&m->process_queued_ios, process_queued_ios);
         INIT_WORK(&m->trigger_event, trigger_event);
         init_waitqueue_head(&m->pg_init_wait);
         mutex_init(&m->work_mutex);
...
  * Path selection
  *-----------------------------------------------*/
 
-static void __pg_init_all_paths(struct multipath *m)
+static int __pg_init_all_paths(struct multipath *m)
 {
     struct pgpath *pgpath;
     unsigned long pg_init_delay = 0;
 
+    if (m->pg_init_in_progress || m->pg_init_disabled)
+        return 0;
+
     m->pg_init_count++;
     m->pg_init_required = 0;
+
+    /* Check here to reset pg_init_required */
+    if (!m->current_pg)
+        return 0;
+
     if (m->pg_init_delay_retry)
         pg_init_delay = msecs_to_jiffies(m->pg_init_delay_msecs != DM_PG_INIT_DELAY_DEFAULT ?
                          m->pg_init_delay_msecs : DM_PG_INIT_DELAY_MSECS);
...
                    pg_init_delay))
             m->pg_init_in_progress++;
     }
+    return m->pg_init_in_progress;
 }
 
 static void __switch_pg(struct multipath *m, struct pgpath *pgpath)
...
  */
 static int __must_push_back(struct multipath *m)
 {
-    return (m->queue_if_no_path != m->saved_queue_if_no_path &&
-        dm_noflush_suspending(m->ti));
+    return (m->queue_if_no_path ||
+        (m->queue_if_no_path != m->saved_queue_if_no_path &&
+         dm_noflush_suspending(m->ti)));
 }
 
-static int map_io(struct multipath *m, struct request *clone,
-          union map_info *map_context, unsigned was_queued)
+#define pg_ready(m) (!(m)->queue_io && !(m)->pg_init_required)
+
+/*
+ * Map cloned requests
+ */
+static int multipath_map(struct dm_target *ti, struct request *clone,
+             union map_info *map_context)
 {
-    int r = DM_MAPIO_REMAPPED;
+    struct multipath *m = (struct multipath *) ti->private;
+    int r = DM_MAPIO_REQUEUE;
     size_t nr_bytes = blk_rq_bytes(clone);
     unsigned long flags;
     struct pgpath *pgpath;
     struct block_device *bdev;
-    struct dm_mpath_io *mpio = map_context->ptr;
+    struct dm_mpath_io *mpio;
 
     spin_lock_irqsave(&m->lock, flags);
 
...
 
     pgpath = m->current_pgpath;
 
-    if (was_queued)
-        m->queue_size--;
+    if (!pgpath) {
+        if (!__must_push_back(m))
+            r = -EIO;   /* Failed */
+        goto out_unlock;
+    }
+    if (!pg_ready(m)) {
+        __pg_init_all_paths(m);
+        goto out_unlock;
+    }
+    if (set_mapinfo(m, map_context) < 0)
+        /* ENOMEM, requeue */
+        goto out_unlock;
 
-    if (m->pg_init_required) {
-        if (!m->pg_init_in_progress)
-            queue_work(kmultipathd, &m->process_queued_ios);
-        r = DM_MAPIO_REQUEUE;
-    } else if ((pgpath && m->queue_io) ||
-           (!pgpath && m->queue_if_no_path)) {
-        /* Queue for the daemon to resubmit */
-        list_add_tail(&clone->queuelist, &m->queued_ios);
-        m->queue_size++;
-        if (!m->queue_io)
-            queue_work(kmultipathd, &m->process_queued_ios);
-        pgpath = NULL;
-        r = DM_MAPIO_SUBMITTED;
-    } else if (pgpath) {
-        bdev = pgpath->path.dev->bdev;
-        clone->q = bdev_get_queue(bdev);
-        clone->rq_disk = bdev->bd_disk;
-    } else if (__must_push_back(m))
-        r = DM_MAPIO_REQUEUE;
-    else
-        r = -EIO;   /* Failed */
-
+    bdev = pgpath->path.dev->bdev;
+    clone->q = bdev_get_queue(bdev);
+    clone->rq_disk = bdev->bd_disk;
+    clone->cmd_flags |= REQ_FAILFAST_TRANSPORT;
+    mpio = map_context->ptr;
     mpio->pgpath = pgpath;
     mpio->nr_bytes = nr_bytes;
-
-    if (r == DM_MAPIO_REMAPPED && pgpath->pg->ps.type->start_io)
-        pgpath->pg->ps.type->start_io(&pgpath->pg->ps, &pgpath->path,
+    if (pgpath->pg->ps.type->start_io)
+        pgpath->pg->ps.type->start_io(&pgpath->pg->ps,
+                          &pgpath->path,
                           nr_bytes);
+    r = DM_MAPIO_REMAPPED;
 
+out_unlock:
     spin_unlock_irqrestore(&m->lock, flags);
 
     return r;
...
     else
         m->saved_queue_if_no_path = queue_if_no_path;
     m->queue_if_no_path = queue_if_no_path;
-    if (!m->queue_if_no_path && m->queue_size)
-        queue_work(kmultipathd, &m->process_queued_ios);
+    if (!m->queue_if_no_path)
+        dm_table_run_md_queue_async(m->ti->table);
 
     spin_unlock_irqrestore(&m->lock, flags);
 
     return 0;
-}
-
-/*-----------------------------------------------------------------
- * The multipath daemon is responsible for resubmitting queued ios.
- *---------------------------------------------------------------*/
-
-static void dispatch_queued_ios(struct multipath *m)
-{
-    int r;
-    unsigned long flags;
-    union map_info *info;
-    struct request *clone, *n;
-    LIST_HEAD(cl);
-
-    spin_lock_irqsave(&m->lock, flags);
-    list_splice_init(&m->queued_ios, &cl);
-    spin_unlock_irqrestore(&m->lock, flags);
-
-    list_for_each_entry_safe(clone, n, &cl, queuelist) {
-        list_del_init(&clone->queuelist);
-
-        info = dm_get_rq_mapinfo(clone);
-
-        r = map_io(m, clone, info, 1);
-        if (r < 0) {
-            clear_mapinfo(m, info);
-            dm_kill_unmapped_request(clone, r);
-        } else if (r == DM_MAPIO_REMAPPED)
-            dm_dispatch_request(clone);
-        else if (r == DM_MAPIO_REQUEUE) {
-            clear_mapinfo(m, info);
-            dm_requeue_unmapped_request(clone);
-        }
-    }
-}
-
-static void process_queued_ios(struct work_struct *work)
-{
-    struct multipath *m =
-        container_of(work, struct multipath, process_queued_ios);
-    struct pgpath *pgpath = NULL;
-    unsigned must_queue = 1;
-    unsigned long flags;
-
-    spin_lock_irqsave(&m->lock, flags);
-
-    if (!m->current_pgpath)
-        __choose_pgpath(m, 0);
-
-    pgpath = m->current_pgpath;
-
-    if ((pgpath && !m->queue_io) ||
-        (!pgpath && !m->queue_if_no_path))
-        must_queue = 0;
-
-    if (m->pg_init_required && !m->pg_init_in_progress && pgpath &&
-        !m->pg_init_disabled)
-        __pg_init_all_paths(m);
-
-    spin_unlock_irqrestore(&m->lock, flags);
-    if (!must_queue)
-        dispatch_queued_ios(m);
 }
 
 /*
...
 }
 
 /*
- * Map cloned requests
- */
-static int multipath_map(struct dm_target *ti, struct request *clone,
-             union map_info *map_context)
-{
-    int r;
-    struct multipath *m = (struct multipath *) ti->private;
-
-    if (set_mapinfo(m, map_context) < 0)
-        /* ENOMEM, requeue */
-        return DM_MAPIO_REQUEUE;
-
-    clone->cmd_flags |= REQ_FAILFAST_TRANSPORT;
-    r = map_io(m, clone, map_context, 0);
-    if (r < 0 || r == DM_MAPIO_REQUEUE)
-        clear_mapinfo(m, map_context);
-
-    return r;
-}
-
-/*
  * Take a path out of use.
  */
 static int fail_path(struct pgpath *pgpath)
...
 
     pgpath->is_active = 1;
 
-    if (!m->nr_valid_paths++ && m->queue_size) {
+    if (!m->nr_valid_paths++) {
         m->current_pgpath = NULL;
-        queue_work(kmultipathd, &m->process_queued_ios);
+        dm_table_run_md_queue_async(m->ti->table);
     } else if (m->hw_handler_name && (m->current_pg == pgpath->pg)) {
         if (queue_work(kmpath_handlerd, &pgpath->activate_path.work))
             m->pg_init_in_progress++;
...
         /* Activations of other paths are still on going */
         goto out;
 
-    if (!m->pg_init_required)
-        m->queue_io = 0;
-
-    m->pg_init_delay_retry = delay_retry;
-    queue_work(kmultipathd, &m->process_queued_ios);
+    if (m->pg_init_required) {
+        m->pg_init_delay_retry = delay_retry;
+        if (__pg_init_all_paths(m))
+            goto out;
+    }
+    m->queue_io = 0;
 
     /*
      * Wake up any thread waiting to suspend.
...
     struct pgpath *pgpath =
         container_of(work, struct pgpath, activate_path.work);
 
-    scsi_dh_activate(bdev_get_queue(pgpath->path.dev->bdev),
-             pg_init_done, pgpath);
+    if (pgpath->is_active)
+        scsi_dh_activate(bdev_get_queue(pgpath->path.dev->bdev),
+                 pg_init_done, pgpath);
+    else
+        pg_init_done(pgpath, SCSI_DH_DEV_OFFLINED);
 }
 
 static int noretry_error(int error)
...
 
     /* Features */
     if (type == STATUSTYPE_INFO)
-        DMEMIT("2 %u %u ", m->queue_size, m->pg_init_count);
+        DMEMIT("2 %u %u ", m->queue_io, m->pg_init_count);
     else {
         DMEMIT("%u ", m->queue_if_no_path +
                   (m->pg_init_retries > 0) * 2 +
...
     }
 
     if (argc != 2) {
-        DMWARN("Unrecognised multipath message received.");
+        DMWARN("Invalid multipath message arguments. Expected 2 arguments, got %d.", argc);
         goto out;
     }
 
...
     else if (!strcasecmp(argv[0], "fail_path"))
         action = fail_path;
     else {
-        DMWARN("Unrecognised multipath message received.");
+        DMWARN("Unrecognised multipath message received: %s", argv[0]);
         goto out;
     }
 
...
         r = err;
     }
 
-    if (r == -ENOTCONN && !fatal_signal_pending(current))
-        queue_work(kmultipathd, &m->process_queued_ios);
+    if (r == -ENOTCONN && !fatal_signal_pending(current)) {
+        spin_lock_irqsave(&m->lock, flags);
+        if (!m->current_pg) {
+            /* Path status changed, redo selection */
+            __choose_pgpath(m, 0);
+        }
+        if (m->pg_init_required)
+            __pg_init_all_paths(m);
+        spin_unlock_irqrestore(&m->lock, flags);
+        dm_table_run_md_queue_async(m->ti->table);
+    }
 
     return r ? : __blkdev_driver_ioctl(bdev, mode, cmd, arg);
 }
...
     spin_lock_irqsave(&m->lock, flags);
 
     /* pg_init in progress, requeue until done */
-    if (m->pg_init_in_progress) {
+    if (!pg_ready(m)) {
         busy = 1;
         goto out;
     }
...
 *---------------------------------------------------------------*/
 static struct target_type multipath_target = {
     .name = "multipath",
-    .version = {1, 6, 0},
+    .version = {1, 7, 0},
     .module = THIS_MODULE,
     .ctr = multipath_ctr,
     .dtr = multipath_dtr,
```
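The net effect on the map path is easier to see as a decision table: rather than parking requests on an internal queued_ios list serviced by a daemon, multipath_map() now returns DM_MAPIO_REQUEUE and lets the block layer redrive the request. A toy Python model of that control flow (field names are illustrative stand-ins, not the kernel structures, and queue_if_no_path is simplified):

```python
REMAPPED, REQUEUE, EIO = "DM_MAPIO_REMAPPED", "DM_MAPIO_REQUEUE", "-EIO"

def toy_multipath_map(current_path, queue_if_no_path, queue_io, pg_init_required):
    """Toy mirror of the reworked multipath_map() decision flow."""
    if current_path is None:
        # No usable path: push the request back to the request queue if
        # configured to queue, otherwise fail the I/O outright.
        return REQUEUE if queue_if_no_path else EIO
    if queue_io or pg_init_required:
        # Path group not ready (pg_ready() false): kick off pg_init and
        # requeue; the queue is rerun asynchronously when pg_init finishes.
        return REQUEUE
    # Ready: remap the clone to the chosen path's block device.
    return REMAPPED
```

The key design change is that "requeue" is now the block layer's responsibility (dm_table_run_md_queue_async() reruns the queue), so the target no longer needs its own dispatch worker.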
drivers/md/dm-table.c (+20 -1)
```diff
...
     return dm_table_get_type(t) == DM_TYPE_REQUEST_BASED;
 }
 
-int dm_table_alloc_md_mempools(struct dm_table *t)
+static int dm_table_alloc_md_mempools(struct dm_table *t)
 {
     unsigned type = dm_table_get_type(t);
     unsigned per_bio_data_size = 0;
...
     return t->md;
 }
 EXPORT_SYMBOL(dm_table_get_md);
+
+void dm_table_run_md_queue_async(struct dm_table *t)
+{
+    struct mapped_device *md;
+    struct request_queue *queue;
+    unsigned long flags;
+
+    if (!dm_table_request_based(t))
+        return;
+
+    md = dm_table_get_md(t);
+    queue = dm_get_md_queue(md);
+    if (queue) {
+        spin_lock_irqsave(queue->queue_lock, flags);
+        blk_run_queue_async(queue);
+        spin_unlock_irqrestore(queue->queue_lock, flags);
+    }
+}
+EXPORT_SYMBOL(dm_table_run_md_queue_async);
 
 static int device_discard_capable(struct dm_target *ti, struct dm_dev *dev,
                   sector_t start, sector_t len, void *data)
```
drivers/md/dm-thin-metadata.c (+49 -35)
```diff
...
      * operation possible in this state is the closing of the device.
      */
     bool fail_io:1;
+
+    /*
+     * Reading the space map roots can fail, so we read it into these
+     * buffers before the superblock is locked and updated.
+     */
+    __u8 data_space_map_root[SPACE_MAP_ROOT_SIZE];
+    __u8 metadata_space_map_root[SPACE_MAP_ROOT_SIZE];
 };
 
 struct dm_thin_device {
...
     pmd->details_info.value_type.equal = NULL;
 }
 
+static int save_sm_roots(struct dm_pool_metadata *pmd)
+{
+    int r;
+    size_t len;
+
+    r = dm_sm_root_size(pmd->metadata_sm, &len);
+    if (r < 0)
+        return r;
+
+    r = dm_sm_copy_root(pmd->metadata_sm, &pmd->metadata_space_map_root, len);
+    if (r < 0)
+        return r;
+
+    r = dm_sm_root_size(pmd->data_sm, &len);
+    if (r < 0)
+        return r;
+
+    return dm_sm_copy_root(pmd->data_sm, &pmd->data_space_map_root, len);
+}
+
+static void copy_sm_roots(struct dm_pool_metadata *pmd,
+              struct thin_disk_superblock *disk)
+{
+    memcpy(&disk->metadata_space_map_root,
+           &pmd->metadata_space_map_root,
+           sizeof(pmd->metadata_space_map_root));
+
+    memcpy(&disk->data_space_map_root,
+           &pmd->data_space_map_root,
+           sizeof(pmd->data_space_map_root));
+}
+
 static int __write_initial_superblock(struct dm_pool_metadata *pmd)
 {
     int r;
     struct dm_block *sblock;
-    size_t metadata_len, data_len;
     struct thin_disk_superblock *disk_super;
     sector_t bdev_size = i_size_read(pmd->bdev->bd_inode) >> SECTOR_SHIFT;
 
     if (bdev_size > THIN_METADATA_MAX_SECTORS)
         bdev_size = THIN_METADATA_MAX_SECTORS;
 
-    r = dm_sm_root_size(pmd->metadata_sm, &metadata_len);
-    if (r < 0)
-        return r;
-
-    r = dm_sm_root_size(pmd->data_sm, &data_len);
-    if (r < 0)
-        return r;
-
     r = dm_sm_commit(pmd->data_sm);
+    if (r < 0)
+        return r;
+
+    r = save_sm_roots(pmd);
     if (r < 0)
         return r;
 
...
     disk_super->trans_id = 0;
     disk_super->held_root = 0;
 
-    r = dm_sm_copy_root(pmd->metadata_sm, &disk_super->metadata_space_map_root,
-                metadata_len);
-    if (r < 0)
-        goto bad_locked;
-
-    r = dm_sm_copy_root(pmd->data_sm, &disk_super->data_space_map_root,
-                data_len);
-    if (r < 0)
-        goto bad_locked;
+    copy_sm_roots(pmd, disk_super);
 
     disk_super->data_mapping_root = cpu_to_le64(pmd->root);
     disk_super->device_details_root = cpu_to_le64(pmd->details_root);
...
     disk_super->data_block_size = cpu_to_le32(pmd->data_block_size);
 
     return dm_tm_commit(pmd->tm, sblock);
-
-bad_locked:
-    dm_bm_unlock(sblock);
-    return r;
 }
 
 static int __format_metadata(struct dm_pool_metadata *pmd)
...
     if (r < 0)
         return r;
 
+    r = save_sm_roots(pmd);
+    if (r < 0)
+        return r;
+
     r = superblock_lock(pmd, &sblock);
     if (r)
         return r;
...
     disk_super->trans_id = cpu_to_le64(pmd->trans_id);
     disk_super->flags = cpu_to_le32(pmd->flags);
 
-    r = dm_sm_copy_root(pmd->metadata_sm, &disk_super->metadata_space_map_root,
-                metadata_len);
-    if (r < 0)
-        goto out_locked;
-
-    r = dm_sm_copy_root(pmd->data_sm, &disk_super->data_space_map_root,
-                data_len);
-    if (r < 0)
-        goto out_locked;
+    copy_sm_roots(pmd, disk_super);
 
     return dm_tm_commit(pmd->tm, sblock);
-
-out_locked:
-    dm_bm_unlock(sblock);
-    return r;
 }
 
 struct dm_pool_metadata *dm_pool_metadata_open(struct block_device *bdev,
```
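The point of the save_sm_roots()/copy_sm_roots() split is ordering: every step that can fail (reading the space-map roots) now happens before the superblock is locked, so the superblock is only rewritten once the commit can no longer fail halfway. A toy Python model of that ordering, with hypothetical names that only mimic the shape of the kernel code:

```python
class CommitError(Exception):
    pass

class ToyPoolMetadata:
    """Toy model (not the kernel API) of the reordered commit: the
    fallible root-copy happens into buffers first; the superblock is
    only touched afterwards, in a step modelled as infallible."""

    def __init__(self):
        self.on_disk_superblock = "old-roots"
        self.saved_roots = None

    def read_roots(self, fail=False):
        # Stands in for dm_sm_root_size() + dm_sm_copy_root(), which can fail
        if fail:
            raise CommitError("dm_sm_copy_root failed")
        return "new-roots"

    def commit(self, fail_reading_roots=False):
        # Fallible phase (save_sm_roots analogue): if this raises,
        # the on-disk superblock has not been touched at all.
        self.saved_roots = self.read_roots(fail=fail_reading_roots)
        # Infallible phase (copy_sm_roots + dm_tm_commit analogue):
        # lock, copy the buffered roots, write the superblock atomically.
        self.on_disk_superblock = self.saved_roots
```

In the old code the copy could fail *after* the superblock was locked, forcing an unlock on a half-updated commit path; moving the copy earlier removes that window.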
drivers/md/dm-thin.c (+192 -75)
```diff
...
 #include <linux/dm-io.h>
 #include <linux/dm-kcopyd.h>
 #include <linux/list.h>
+#include <linux/rculist.h>
 #include <linux/init.h>
 #include <linux/module.h>
 #include <linux/slab.h>
+#include <linux/rbtree.h>
 
 #define DM_MSG_PREFIX   "thin"
 
...
     unsigned ref_count;
 
     spinlock_t lock;
-    struct bio_list deferred_bios;
     struct bio_list deferred_flush_bios;
     struct list_head prepared_mappings;
     struct list_head prepared_discards;
-
-    struct bio_list retry_on_resume_list;
+    struct list_head active_thins;
 
     struct dm_deferred_set *shared_read_ds;
     struct dm_deferred_set *all_io_ds;
...
  * Target context for a thin.
  */
 struct thin_c {
+    struct list_head list;
     struct dm_dev *pool_dev;
     struct dm_dev *origin_dev;
     dm_thin_id dev_id;
...
     struct pool *pool;
     struct dm_thin_device *td;
     bool requeue_mode:1;
+    spinlock_t lock;
+    struct bio_list deferred_bio_list;
+    struct bio_list retry_on_resume_list;
+    struct rb_root sort_bio_list; /* sorted list of deferred bios */
 };
 
 /*----------------------------------------------------------------*/
...
     struct pool *pool = tc->pool;
     unsigned long flags;
 
-    spin_lock_irqsave(&pool->lock, flags);
-    dm_cell_release_no_holder(pool->prison, cell, &pool->deferred_bios);
-    spin_unlock_irqrestore(&pool->lock, flags);
+    spin_lock_irqsave(&tc->lock, flags);
+    dm_cell_release_no_holder(pool->prison, cell, &tc->deferred_bio_list);
+    spin_unlock_irqrestore(&tc->lock, flags);
 
     wake_worker(pool);
 }
...
     struct dm_deferred_entry *shared_read_entry;
     struct dm_deferred_entry *all_io_entry;
     struct dm_thin_new_mapping *overwrite_mapping;
+    struct rb_node rb_node;
 };
 
 static void requeue_bio_list(struct thin_c *tc, struct bio_list *master)
...
 
     bio_list_init(&bios);
 
-    spin_lock_irqsave(&tc->pool->lock, flags);
+    spin_lock_irqsave(&tc->lock, flags);
     bio_list_merge(&bios, master);
     bio_list_init(master);
-    spin_unlock_irqrestore(&tc->pool->lock, flags);
+    spin_unlock_irqrestore(&tc->lock, flags);
 
-    while ((bio = bio_list_pop(&bios))) {
-        struct dm_thin_endio_hook *h = dm_per_bio_data(bio, sizeof(struct dm_thin_endio_hook));
-
-        if (h->tc == tc)
-            bio_endio(bio, DM_ENDIO_REQUEUE);
-        else
-            bio_list_add(master, bio);
-    }
+    while ((bio = bio_list_pop(&bios)))
+        bio_endio(bio, DM_ENDIO_REQUEUE);
 }
 
 static void requeue_io(struct thin_c *tc)
 {
-    struct pool *pool = tc->pool;
-
-    requeue_bio_list(tc, &pool->deferred_bios);
-    requeue_bio_list(tc, &pool->retry_on_resume_list);
+    requeue_bio_list(tc, &tc->deferred_bio_list);
+    requeue_bio_list(tc, &tc->retry_on_resume_list);
 }
 
-static void error_retry_list(struct pool *pool)
+static void error_thin_retry_list(struct thin_c *tc)
 {
     struct bio *bio;
     unsigned long flags;
...
 
     bio_list_init(&bios);
 
-    spin_lock_irqsave(&pool->lock, flags);
-    bio_list_merge(&bios, &pool->retry_on_resume_list);
-    bio_list_init(&pool->retry_on_resume_list);
-    spin_unlock_irqrestore(&pool->lock, flags);
+    spin_lock_irqsave(&tc->lock, flags);
+    bio_list_merge(&bios, &tc->retry_on_resume_list);
+    bio_list_init(&tc->retry_on_resume_list);
+    spin_unlock_irqrestore(&tc->lock, flags);
 
     while ((bio = bio_list_pop(&bios)))
         bio_io_error(bio);
+}
+
+static void error_retry_list(struct pool *pool)
+{
+    struct thin_c *tc;
+
+    rcu_read_lock();
+    list_for_each_entry_rcu(tc, &pool->active_thins, list)
+        error_thin_retry_list(tc);
+    rcu_read_unlock();
 }
 
 /*
...
     struct pool *pool = tc->pool;
     unsigned long flags;
 
-    spin_lock_irqsave(&pool->lock, flags);
-    cell_release(pool, cell, &pool->deferred_bios);
-    spin_unlock_irqrestore(&tc->pool->lock, flags);
+    spin_lock_irqsave(&tc->lock, flags);
+    cell_release(pool, cell, &tc->deferred_bio_list);
+    spin_unlock_irqrestore(&tc->lock, flags);
 
     wake_worker(pool);
 }
...
     struct pool *pool = tc->pool;
     unsigned long flags;
 
-    spin_lock_irqsave(&pool->lock, flags);
-    cell_release_no_holder(pool, cell, &pool->deferred_bios);
-    spin_unlock_irqrestore(&pool->lock, flags);
+    spin_lock_irqsave(&tc->lock, flags);
+    cell_release_no_holder(pool, cell, &tc->deferred_bio_list);
+    spin_unlock_irqrestore(&tc->lock, flags);
 
     wake_worker(pool);
 }
...
 {
     struct dm_thin_endio_hook *h = dm_per_bio_data(bio, sizeof(struct dm_thin_endio_hook));
     struct thin_c *tc = h->tc;
-    struct pool *pool = tc->pool;
     unsigned long flags;
 
-    spin_lock_irqsave(&pool->lock, flags);
-    bio_list_add(&pool->retry_on_resume_list, bio);
-    spin_unlock_irqrestore(&pool->lock, flags);
+    spin_lock_irqsave(&tc->lock, flags);
+    bio_list_add(&tc->retry_on_resume_list, bio);
+    spin_unlock_irqrestore(&tc->lock, flags);
 }
 
 static bool should_error_unserviceable_bio(struct pool *pool)
...
            jiffies > pool->last_commit_jiffies + COMMIT_PERIOD;
 }
 
-static void process_deferred_bios(struct pool *pool)
+#define thin_pbd(node) rb_entry((node), struct dm_thin_endio_hook, rb_node)
+#define thin_bio(pbd) dm_bio_from_per_bio_data((pbd), sizeof(struct dm_thin_endio_hook))
+
+static void __thin_bio_rb_add(struct thin_c *tc, struct bio *bio)
 {
-    unsigned long flags;
+    struct rb_node **rbp, *parent;
+    struct dm_thin_endio_hook *pbd;
+    sector_t bi_sector = bio->bi_iter.bi_sector;
+
+    rbp = &tc->sort_bio_list.rb_node;
+    parent = NULL;
+    while (*rbp) {
+        parent = *rbp;
+        pbd = thin_pbd(parent);
+
+        if (bi_sector < thin_bio(pbd)->bi_iter.bi_sector)
+            rbp = &(*rbp)->rb_left;
+        else
+            rbp = &(*rbp)->rb_right;
+    }
+
+    pbd = dm_per_bio_data(bio, sizeof(struct dm_thin_endio_hook));
+    rb_link_node(&pbd->rb_node, parent, rbp);
+    rb_insert_color(&pbd->rb_node, &tc->sort_bio_list);
+}
+
+static void __extract_sorted_bios(struct thin_c *tc)
+{
+    struct rb_node *node;
+    struct dm_thin_endio_hook *pbd;
+    struct bio *bio;
+
+    for (node = rb_first(&tc->sort_bio_list); node; node = rb_next(node)) {
+        pbd = thin_pbd(node);
+        bio = thin_bio(pbd);
+
+        bio_list_add(&tc->deferred_bio_list, bio);
+        rb_erase(&pbd->rb_node, &tc->sort_bio_list);
+    }
+
+    WARN_ON(!RB_EMPTY_ROOT(&tc->sort_bio_list));
+}
+
+static void __sort_thin_deferred_bios(struct thin_c *tc)
+{
     struct bio *bio;
     struct bio_list bios;
 
     bio_list_init(&bios);
+    bio_list_merge(&bios, &tc->deferred_bio_list);
+    bio_list_init(&tc->deferred_bio_list);
 
-    spin_lock_irqsave(&pool->lock, flags);
-    bio_list_merge(&bios, &pool->deferred_bios);
-    bio_list_init(&pool->deferred_bios);
-    spin_unlock_irqrestore(&pool->lock, flags);
+    /* Sort deferred_bio_list using rb-tree */
+    while ((bio = bio_list_pop(&bios)))
+        __thin_bio_rb_add(tc, bio);
 
+    /*
+     * Transfer the sorted bios in sort_bio_list back to
+     * deferred_bio_list to allow lockless submission of
+     * all bios.
+     */
+    __extract_sorted_bios(tc);
+}
+
+static void process_thin_deferred_bios(struct thin_c *tc)
+{
+    struct pool *pool = tc->pool;
+    unsigned long flags;
+    struct bio *bio;
+    struct bio_list bios;
+    struct blk_plug plug;
+
+    if (tc->requeue_mode) {
+        requeue_bio_list(tc, &tc->deferred_bio_list);
+        return;
+    }
+
+    bio_list_init(&bios);
+
+    spin_lock_irqsave(&tc->lock, flags);
+
+    if (bio_list_empty(&tc->deferred_bio_list)) {
+        spin_unlock_irqrestore(&tc->lock, flags);
+        return;
+    }
+
+    __sort_thin_deferred_bios(tc);
+
+    bio_list_merge(&bios, &tc->deferred_bio_list);
+    bio_list_init(&tc->deferred_bio_list);
+
+    spin_unlock_irqrestore(&tc->lock, flags);
+
+    blk_start_plug(&plug);
     while ((bio = bio_list_pop(&bios))) {
-        struct dm_thin_endio_hook *h = dm_per_bio_data(bio, sizeof(struct dm_thin_endio_hook));
-        struct thin_c *tc = h->tc;
-
-        if (tc->requeue_mode) {
-            bio_endio(bio, DM_ENDIO_REQUEUE);
-            continue;
-        }
-
         /*
          * If we've got no free new_mapping structs, and processing
          * this bio might require one, we pause until there are some
          * prepared mappings to process.
```
1399 1465 */ 1400 1466 if (ensure_next_mapping(pool)) { 1401 - spin_lock_irqsave(&pool->lock, flags); 1402 - bio_list_merge(&pool->deferred_bios, &bios); 1403 - spin_unlock_irqrestore(&pool->lock, flags); 1404 - 1467 + spin_lock_irqsave(&tc->lock, flags); 1468 + bio_list_add(&tc->deferred_bio_list, bio); 1469 + bio_list_merge(&tc->deferred_bio_list, &bios); 1470 + spin_unlock_irqrestore(&tc->lock, flags); 1405 1471 break; 1406 1472 } 1407 1473 ··· 1483 1403 else 1484 1404 pool->process_bio(tc, bio); 1485 1405 } 1406 + blk_finish_plug(&plug); 1407 + } 1408 + 1409 + static void process_deferred_bios(struct pool *pool) 1410 + { 1411 + unsigned long flags; 1412 + struct bio *bio; 1413 + struct bio_list bios; 1414 + struct thin_c *tc; 1415 + 1416 + rcu_read_lock(); 1417 + list_for_each_entry_rcu(tc, &pool->active_thins, list) 1418 + process_thin_deferred_bios(tc); 1419 + rcu_read_unlock(); 1486 1420 1487 1421 /* 1488 1422 * If there are any deferred flush bios, we must commit ··· 1728 1634 unsigned long flags; 1729 1635 struct pool *pool = tc->pool; 1730 1636 1731 - spin_lock_irqsave(&pool->lock, flags); 1732 - bio_list_add(&pool->deferred_bios, bio); 1733 - spin_unlock_irqrestore(&pool->lock, flags); 1637 + spin_lock_irqsave(&tc->lock, flags); 1638 + bio_list_add(&tc->deferred_bio_list, bio); 1639 + spin_unlock_irqrestore(&tc->lock, flags); 1734 1640 1735 1641 wake_worker(pool); 1736 1642 } ··· 1851 1757 1852 1758 static int pool_is_congested(struct dm_target_callbacks *cb, int bdi_bits) 1853 1759 { 1854 - int r; 1855 - unsigned long flags; 1856 1760 struct pool_c *pt = container_of(cb, struct pool_c, callbacks); 1761 + struct request_queue *q; 1857 1762 1858 - spin_lock_irqsave(&pt->pool->lock, flags); 1859 - r = !bio_list_empty(&pt->pool->retry_on_resume_list); 1860 - spin_unlock_irqrestore(&pt->pool->lock, flags); 1763 + if (get_pool_mode(pt->pool) == PM_OUT_OF_DATA_SPACE) 1764 + return 1; 1861 1765 1862 - if (!r) { 1863 - struct request_queue *q = 
bdev_get_queue(pt->data_dev->bdev); 1864 - r = bdi_congested(&q->backing_dev_info, bdi_bits); 1865 - } 1866 - 1867 - return r; 1766 + q = bdev_get_queue(pt->data_dev->bdev); 1767 + return bdi_congested(&q->backing_dev_info, bdi_bits); 1868 1768 } 1869 1769 1870 - static void __requeue_bios(struct pool *pool) 1770 + static void requeue_bios(struct pool *pool) 1871 1771 { 1872 - bio_list_merge(&pool->deferred_bios, &pool->retry_on_resume_list); 1873 - bio_list_init(&pool->retry_on_resume_list); 1772 + unsigned long flags; 1773 + struct thin_c *tc; 1774 + 1775 + rcu_read_lock(); 1776 + list_for_each_entry_rcu(tc, &pool->active_thins, list) { 1777 + spin_lock_irqsave(&tc->lock, flags); 1778 + bio_list_merge(&tc->deferred_bio_list, &tc->retry_on_resume_list); 1779 + bio_list_init(&tc->retry_on_resume_list); 1780 + spin_unlock_irqrestore(&tc->lock, flags); 1781 + } 1782 + rcu_read_unlock(); 1874 1783 } 1875 1784 1876 1785 /*---------------------------------------------------------------- ··· 2054 1957 INIT_WORK(&pool->worker, do_worker); 2055 1958 INIT_DELAYED_WORK(&pool->waker, do_waker); 2056 1959 spin_lock_init(&pool->lock); 2057 - bio_list_init(&pool->deferred_bios); 2058 1960 bio_list_init(&pool->deferred_flush_bios); 2059 1961 INIT_LIST_HEAD(&pool->prepared_mappings); 2060 1962 INIT_LIST_HEAD(&pool->prepared_discards); 1963 + INIT_LIST_HEAD(&pool->active_thins); 2061 1964 pool->low_water_triggered = false; 2062 - bio_list_init(&pool->retry_on_resume_list); 2063 1965 2064 1966 pool->shared_read_ds = dm_deferred_set_create(); 2065 1967 if (!pool->shared_read_ds) { ··· 2603 2507 2604 2508 spin_lock_irqsave(&pool->lock, flags); 2605 2509 pool->low_water_triggered = false; 2606 - __requeue_bios(pool); 2607 2510 spin_unlock_irqrestore(&pool->lock, flags); 2511 + requeue_bios(pool); 2608 2512 2609 2513 do_waker(&pool->waker.work); 2610 2514 } ··· 3043 2947 .name = "thin-pool", 3044 2948 .features = DM_TARGET_SINGLETON | DM_TARGET_ALWAYS_WRITEABLE | 3045 2949 
DM_TARGET_IMMUTABLE, 3046 - .version = {1, 11, 0}, 2950 + .version = {1, 12, 0}, 3047 2951 .module = THIS_MODULE, 3048 2952 .ctr = pool_ctr, 3049 2953 .dtr = pool_dtr, ··· 3064 2968 static void thin_dtr(struct dm_target *ti) 3065 2969 { 3066 2970 struct thin_c *tc = ti->private; 2971 + unsigned long flags; 2972 + 2973 + spin_lock_irqsave(&tc->pool->lock, flags); 2974 + list_del_rcu(&tc->list); 2975 + spin_unlock_irqrestore(&tc->pool->lock, flags); 2976 + synchronize_rcu(); 3067 2977 3068 2978 mutex_lock(&dm_thin_pool_table.mutex); 3069 2979 ··· 3116 3014 r = -ENOMEM; 3117 3015 goto out_unlock; 3118 3016 } 3017 + spin_lock_init(&tc->lock); 3018 + bio_list_init(&tc->deferred_bio_list); 3019 + bio_list_init(&tc->retry_on_resume_list); 3020 + tc->sort_bio_list = RB_ROOT; 3119 3021 3120 3022 if (argc == 3) { 3121 3023 r = dm_get_device(ti, argv[2], FMODE_READ, &origin_dev); ··· 3190 3084 dm_put(pool_md); 3191 3085 3192 3086 mutex_unlock(&dm_thin_pool_table.mutex); 3087 + 3088 + spin_lock(&tc->pool->lock); 3089 + list_add_tail_rcu(&tc->list, &tc->pool->active_thins); 3090 + spin_unlock(&tc->pool->lock); 3091 + /* 3092 + * This synchronize_rcu() call is needed here otherwise we risk a 3093 + * wake_worker() call finding no bios to process (because the newly 3094 + * added tc isn't yet visible). So this reduces latency since we 3095 + * aren't then dependent on the periodic commit to wake_worker(). 3096 + */ 3097 + synchronize_rcu(); 3193 3098 3194 3099 return 0; 3195 3100 ··· 3367 3250 3368 3251 static struct target_type thin_target = { 3369 3252 .name = "thin", 3370 - .version = {1, 11, 0}, 3253 + .version = {1, 12, 0}, 3371 3254 .module = THIS_MODULE, 3372 3255 .ctr = thin_ctr, 3373 3256 .dtr = thin_dtr,
+9 -15
drivers/md/dm.c
···
 	struct bio clone;
 };
 
-union map_info *dm_get_mapinfo(struct bio *bio)
-{
-	if (bio && bio->bi_private)
-		return &((struct dm_target_io *)bio->bi_private)->info;
-	return NULL;
-}
-
 union map_info *dm_get_rq_mapinfo(struct request *rq)
 {
 	if (rq && rq->end_io_data)
···
 	return get_capacity(md->disk);
 }
 
+struct request_queue *dm_get_md_queue(struct mapped_device *md)
+{
+	return md->queue;
+}
+
 struct dm_stats *dm_get_stats(struct mapped_device *md)
 {
 	return &md->stats;
···
 static void clone_endio(struct bio *bio, int error)
 {
 	int r = 0;
-	struct dm_target_io *tio = bio->bi_private;
+	struct dm_target_io *tio = container_of(bio, struct dm_target_io, clone);
 	struct dm_io *io = tio->io;
 	struct mapped_device *md = tio->io->md;
 	dm_endio_fn endio = tio->ti->type->end_io;
···
  */
 static void end_clone_bio(struct bio *clone, int error)
 {
-	struct dm_rq_clone_bio_info *info = clone->bi_private;
+	struct dm_rq_clone_bio_info *info =
+		container_of(clone, struct dm_rq_clone_bio_info, clone);
 	struct dm_rq_target_io *tio = info->tio;
 	struct bio *bio = info->orig;
 	unsigned int nr_bytes = info->orig->bi_iter.bi_size;
···
 	struct dm_target *ti = tio->ti;
 
 	clone->bi_end_io = clone_endio;
-	clone->bi_private = tio;
 
 	/*
 	 * Map the clone.  If r == 0 we don't need to do
···
 
 	tio->io = ci->io;
 	tio->ti = ti;
-	memset(&tio->info, 0, sizeof(tio->info));
 	tio->target_bio_nr = target_bio_nr;
 
 	return tio;
···
 	info->orig = bio_orig;
 	info->tio = tio;
 	bio->bi_end_io = end_clone_bio;
-	bio->bi_private = info;
 
 	return 0;
 }
···
 		return NULL;
 
 	dm_table_event_callback(map, NULL, NULL);
-	rcu_assign_pointer(md->map, NULL);
+	RCU_INIT_POINTER(md->map, NULL);
 	dm_sync_table(md);
 
 	return map;
···
 	.getgeo = dm_blk_getgeo,
 	.owner = THIS_MODULE
 };
-
-EXPORT_SYMBOL(dm_get_mapinfo);
 
 /*
  * module hooks
+1 -1
drivers/md/dm.h
···
 struct target_type *dm_table_get_immutable_target_type(struct dm_table *t);
 bool dm_table_request_based(struct dm_table *t);
 bool dm_table_supports_discards(struct dm_table *t);
-int dm_table_alloc_md_mempools(struct dm_table *t);
 void dm_table_free_md_mempools(struct dm_table *t);
 struct dm_md_mempools *dm_table_get_md_mempools(struct dm_table *t);
 
···
 int dm_cancel_deferred_remove(struct mapped_device *md);
 int dm_request_based(struct mapped_device *md);
 sector_t dm_get_size(struct mapped_device *md);
+struct request_queue *dm_get_md_queue(struct mapped_device *md);
 struct dm_stats *dm_get_stats(struct mapped_device *md);
 
 int dm_kobject_uevent(struct mapped_device *md, enum kobject_action action,
+9 -1
drivers/md/persistent-data/dm-bitset.c
···
 	int r;
 	__le64 value;
 
-	if (!info->current_index_set)
+	if (!info->current_index_set || !info->dirty)
 		return 0;
 
 	value = cpu_to_le64(info->current_bits);
···
 		return r;
 
 	info->current_index_set = false;
+	info->dirty = false;
+
 	return 0;
 }
 EXPORT_SYMBOL_GPL(dm_bitset_flush);
···
 	info->current_bits = le64_to_cpu(value);
 	info->current_index_set = true;
 	info->current_index = array_index;
+	info->dirty = false;
+
 	return 0;
 }
 
···
 		return r;
 
 	set_bit(b, (unsigned long *) &info->current_bits);
+	info->dirty = true;
+
 	return 0;
 }
 EXPORT_SYMBOL_GPL(dm_bitset_set_bit);
···
 		return r;
 
 	clear_bit(b, (unsigned long *) &info->current_bits);
+	info->dirty = true;
+
 	return 0;
 }
 EXPORT_SYMBOL_GPL(dm_bitset_clear_bit);
+1
drivers/md/persistent-data/dm-bitset.h
···
 	uint64_t current_bits;
 
 	bool current_index_set:1;
+	bool dirty:1;
 };
 
 /*
+2 -13
drivers/md/persistent-data/dm-block-manager.c
···
 }
 EXPORT_SYMBOL_GPL(dm_bm_unlock);
 
-int dm_bm_flush_and_unlock(struct dm_block_manager *bm,
-			   struct dm_block *superblock)
+int dm_bm_flush(struct dm_block_manager *bm)
 {
-	int r;
-
 	if (bm->read_only)
 		return -EPERM;
 
-	r = dm_bufio_write_dirty_buffers(bm->bufio);
-	if (unlikely(r)) {
-		dm_bm_unlock(superblock);
-		return r;
-	}
-
-	dm_bm_unlock(superblock);
-
 	return dm_bufio_write_dirty_buffers(bm->bufio);
 }
-EXPORT_SYMBOL_GPL(dm_bm_flush_and_unlock);
+EXPORT_SYMBOL_GPL(dm_bm_flush);
 
 void dm_bm_prefetch(struct dm_block_manager *bm, dm_block_t b)
 {
+1 -2
drivers/md/persistent-data/dm-block-manager.h
···
  *
  * This method always blocks.
  */
-int dm_bm_flush_and_unlock(struct dm_block_manager *bm,
-			   struct dm_block *superblock);
+int dm_bm_flush(struct dm_block_manager *bm);
 
 /*
  * Request data is prefetched into the cache.
+3 -2
drivers/md/persistent-data/dm-transaction-manager.c
···
 	if (r < 0)
 		return r;
 
-	return 0;
+	return dm_bm_flush(tm->bm);
 }
 EXPORT_SYMBOL_GPL(dm_tm_pre_commit);
 
···
 		return -EWOULDBLOCK;
 
 	wipe_shadow_table(tm);
+	dm_bm_unlock(root);
 
-	return dm_bm_flush_and_unlock(tm->bm, root);
+	return dm_bm_flush(tm->bm);
 }
 EXPORT_SYMBOL_GPL(dm_tm_commit);
 
+8 -9
drivers/md/persistent-data/dm-transaction-manager.h
···
 /*
  * We use a 2-phase commit here.
  *
- * i) In the first phase the block manager is told to start flushing, and
- * the changes to the space map are written to disk.  You should interrogate
- * your particular space map to get detail of its root node etc. to be
- * included in your superblock.
+ * i) Make all changes for the transaction *except* for the superblock.
+ * Then call dm_tm_pre_commit() to flush them to disk.
  *
- * ii) @root will be committed last.  You shouldn't use more than the
- * first 512 bytes of @root if you wish the transaction to survive a power
- * failure.  You *must* have a write lock held on @root for both stage (i)
- * and (ii).  The commit will drop the write lock.
+ * ii) Lock your superblock.  Update.  Then call dm_tm_commit() which will
+ * unlock the superblock and flush it.  No other blocks should be updated
+ * during this period.  Care should be taken to never unlock a partially
+ * updated superblock; perform any operations that could fail *before* you
+ * take the superblock lock.
  */
 int dm_tm_pre_commit(struct dm_transaction_manager *tm);
-int dm_tm_commit(struct dm_transaction_manager *tm, struct dm_block *root);
+int dm_tm_commit(struct dm_transaction_manager *tm, struct dm_block *superblock);
 
 /*
  * These methods are the only way to get hold of a writeable block.
+5 -3
include/linux/device-mapper.h
···
 
 union map_info {
 	void *ptr;
-	unsigned long long ll;
 };
 
 /*
···
 struct dm_target_io {
 	struct dm_io *io;
 	struct dm_target *ti;
-	union map_info info;
 	unsigned target_bio_nr;
 	struct bio clone;
 };
···
 struct gendisk *dm_disk(struct mapped_device *md);
 int dm_suspended(struct dm_target *ti);
 int dm_noflush_suspending(struct dm_target *ti);
-union map_info *dm_get_mapinfo(struct bio *bio);
 union map_info *dm_get_rq_mapinfo(struct request *rq);
 
 struct queue_limits *dm_get_queue_limits(struct mapped_device *md);
···
  * Trigger an event.
  */
 void dm_table_event(struct dm_table *t);
+
+/*
+ * Run the queue for request-based targets.
+ */
+void dm_table_run_md_queue_async(struct dm_table *t);
 
 /*
  * The device must be suspended before calling this method.