Merge git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm

* git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm: (48 commits)
dm mpath: change to be request based
dm: disable interrupt when taking map_lock
dm: do not set QUEUE_ORDERED_DRAIN if request based
dm: enable request based option
dm: prepare for request based option
dm raid1: add userspace log
dm: calculate queue limits during resume not load
dm log: fix create_log_context to use logical_block_size of log device
dm targets: introduce iterate devices fn
dm table: establish queue limits by copying table limits
dm table: replace struct io_restrictions with struct queue_limits
dm table: validate device logical_block_size
dm table: ensure targets are aligned to logical_block_size
dm ioctl: support cookies for udev
dm: sysfs add suspended attribute
dm table: improve warning message when devices not freed before destruction
dm mpath: add service time load balancer
dm mpath: add queue length load balancer
dm mpath: add start_io and nr_bytes to path selectors
dm snapshot: use barrier when writing exception store
...

+3992 -434
+54
Documentation/device-mapper/dm-log.txt
Device-Mapper Logging
=====================
The device-mapper logging code is used by some of the device-mapper
RAID targets to track regions of the disk that are not consistent.
A region (or portion of the address space) of the disk may be
inconsistent because a RAID stripe is currently being operated on or
a machine died while the region was being altered.  In the case of
mirrors, a region would be considered dirty/inconsistent while you
are writing to it because the writes need to be replicated for all
the legs of the mirror and may not reach the legs at the same time.
Once all writes are complete, the region is considered clean again.

There is a generic logging interface that the device-mapper RAID
implementations use to perform logging operations (see
dm_dirty_log_type in include/linux/dm-dirty-log.h).  Various different
logging implementations are available and provide different
capabilities.  The list includes:

Type		Files
====		=====
disk		drivers/md/dm-log.c
core		drivers/md/dm-log.c
userspace	drivers/md/dm-log-userspace* include/linux/dm-log-userspace.h

The "disk" log type
-------------------
This log implementation commits the log state to disk.  This way, the
logging state survives reboots/crashes.

The "core" log type
-------------------
This log implementation keeps the log state in memory.  The log state
will not survive a reboot or crash, but there may be a small boost in
performance.  This method can also be used if no storage device is
available for storing log state.

The "userspace" log type
------------------------
This log type simply provides a way to export the log API to userspace,
so log implementations can be done there.  This is done by forwarding most
logging requests to userspace, where a daemon receives and processes the
requests.

The structures used for communication between kernel and userspace are
located in include/linux/dm-log-userspace.h.  Due to the frequency,
diversity, and 2-way communication nature of the exchanges between
kernel and userspace, 'connector' is used as the interface for
communication.

There are currently two userspace log implementations that leverage this
framework - "clustered_disk" and "clustered_core".  These implementations
provide a cluster-coherent log for shared storage.  Device-mapper mirroring
can be used in a shared-storage environment when the cluster log
implementations are employed.
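As a concrete illustration of the mark/clear/flush cycle described above, here is a tiny stand-alone model in ordinary user-space C. The bitmap, function names and flush semantics are invented for the example and are not the dm-dirty-log API: a region is marked dirty and the log flushed before writes go out to the mirror legs, and it is only cleared once every leg has completed.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define NR_REGIONS 8	/* illustrative; real logs derive this from device length / region_size */

/* One flag per region: set = dirty/inconsistent, clear = clean/in-sync. */
static bool dirty[NR_REGIONS];
static bool committed[NR_REGIONS];	/* what a "disk"-style log would have persisted */

static void mark_region(unsigned r)  { dirty[r] = true;  }	/* before writing to the legs */
static void clear_region(unsigned r) { dirty[r] = false; }	/* after all legs completed   */

/* flush(): commit the in-memory state, like a disk log writing its bitmap out. */
static void flush_log(void)
{
	memcpy(committed, dirty, sizeof(committed));
}

int main(void)
{
	unsigned region = 3;

	mark_region(region);	/* writes to this region are about to be issued */
	flush_log();		/* the dirty state must be durable before the data writes start */

	/* ... replicate the write to every mirror leg ... */

	clear_region(region);	/* all legs acknowledged: the region is clean again */
	flush_log();

	printf("region %u committed as %s\n", region,
	       committed[region] ? "dirty" : "clean");
	return 0;
}
```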
+39
Documentation/device-mapper/dm-queue-length.txt
dm-queue-length
===============

dm-queue-length is a path selector module for device-mapper targets,
which selects a path with the least number of in-flight I/Os.
The path selector name is 'queue-length'.

Table parameters for each path: [<repeat_count>]
	<repeat_count>: The number of I/Os to dispatch using the selected
			path before switching to the next path.
			If not given, an internal default is used.  To check
			the default value, see the activated table.

Status for each path: <status> <fail-count> <in-flight>
	<status>: 'A' if the path is active, 'F' if the path is failed.
	<fail-count>: The number of path failures.
	<in-flight>: The number of in-flight I/Os on the path.


Algorithm
=========

dm-queue-length increments/decrements 'in-flight' when an I/O is
dispatched/completed respectively.
dm-queue-length selects the path with the minimum 'in-flight'.


Examples
========
In the case where two paths (sda and sdb) are used with repeat_count == 128:

# echo "0 10 multipath 0 0 1 1 queue-length 0 2 1 8:0 128 8:16 128" | \
  dmsetup create test
#
# dmsetup table
test: 0 10 multipath 0 0 1 1 queue-length 0 2 1 8:0 128 8:16 128
#
# dmsetup status
test: 0 10 multipath 2 0 0 0 1 1 E 0 2 1 8:0 A 0 0 8:16 A 0 0
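As a rough sketch of the selection rule described above (plain user-space C, not the kernel module; the struct and helper names are made up for the example), the policy amounts to keeping the current path while its repeat_count budget lasts and otherwise picking the usable path with the fewest in-flight I/Os:

```c
#include <stddef.h>
#include <stdio.h>

struct path {
	const char *name;
	unsigned in_flight;	/* incremented on dispatch, decremented on completion */
	unsigned repeat_count;	/* e.g. 128, from the table line */
	unsigned budget;	/* dispatches left on the current selection */
	int failed;
};

/* Select the non-failed path with the minimum number of in-flight I/Os. */
static struct path *ql_select(struct path *paths, size_t n, struct path *current)
{
	struct path *best = NULL;
	size_t i;

	/* Keep the current path while its repeat budget lasts. */
	if (current && !current->failed && current->budget) {
		current->budget--;
		return current;
	}

	for (i = 0; i < n; i++) {
		if (paths[i].failed)
			continue;
		if (!best || paths[i].in_flight < best->in_flight)
			best = &paths[i];
	}
	if (best)
		best->budget = best->repeat_count - 1;	/* this call counts as one dispatch */
	return best;
}

int main(void)
{
	struct path paths[2] = {
		{ "8:0",  4, 128, 0, 0 },
		{ "8:16", 1, 128, 0, 0 },
	};
	struct path *p = ql_select(paths, 2, NULL);

	printf("dispatching via %s\n", p ? p->name : "(none)");	/* 8:16 - fewer in-flight I/Os */
	if (p)
		p->in_flight++;
	return 0;
}
```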
+91
Documentation/device-mapper/dm-service-time.txt
dm-service-time
===============

dm-service-time is a path selector module for device-mapper targets,
which selects the path with the shortest estimated service time for
the incoming I/O.

The service time for each path is estimated by dividing the total size
of in-flight I/Os on a path by the performance value of the path.
The performance value is a relative throughput value among all paths
in a path-group, and it can be specified as a table argument.

The path selector name is 'service-time'.

Table parameters for each path: [<repeat_count> [<relative_throughput>]]
	<repeat_count>: The number of I/Os to dispatch using the selected
			path before switching to the next path.
			If not given, an internal default is used.  To check
			the default value, see the activated table.
	<relative_throughput>: The relative throughput value of the path
			among all paths in the path-group.
			The valid range is 0-100.
			If not given, the minimum value '1' is used.
			If '0' is given, the path isn't selected while
			other paths having a positive value are available.

Status for each path: <status> <fail-count> <in-flight-size> \
		      <relative_throughput>
	<status>: 'A' if the path is active, 'F' if the path is failed.
	<fail-count>: The number of path failures.
	<in-flight-size>: The size of in-flight I/Os on the path.
	<relative_throughput>: The relative throughput value of the path
			among all paths in the path-group.


Algorithm
=========

dm-service-time adds the I/O size to 'in-flight-size' when the I/O is
dispatched and subtracts it when completed.
Basically, dm-service-time selects the path having the minimum service
time, which is calculated by:

	('in-flight-size' + 'size-of-incoming-io') / 'relative_throughput'

However, the optimizations below are used to reduce the calculation
as much as possible.

	1. If the paths have the same 'relative_throughput', skip
	   the division and just compare the 'in-flight-size'.

	2. If the paths have the same 'in-flight-size', skip the division
	   and just compare the 'relative_throughput'.

	3. If some paths have non-zero 'relative_throughput' and others
	   have zero 'relative_throughput', ignore those paths with zero
	   'relative_throughput'.

If such optimizations can't be applied, the service times are calculated
and compared.
If the calculated service times are equal, the path having the maximum
'relative_throughput' may be better, so 'relative_throughput' is compared
in that case.


Examples
========
In the case where two paths (sda and sdb) are used with repeat_count == 128,
and sda has an average throughput of 1GB/s while sdb has 4GB/s, the
'relative_throughput' value may be '1' for sda and '4' for sdb.

# echo "0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 1 8:16 128 4" | \
  dmsetup create test
#
# dmsetup table
test: 0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 1 8:16 128 4
#
# dmsetup status
test: 0 10 multipath 2 0 0 0 1 1 E 0 2 2 8:0 A 0 0 1 8:16 A 0 0 4


Equally, '2' for sda and '8' for sdb would also work:

# echo "0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 2 8:16 128 8" | \
  dmsetup create test
#
# dmsetup table
test: 0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 2 8:16 128 8
#
# dmsetup status
test: 0 10 multipath 2 0 0 0 1 1 E 0 2 2 8:0 A 0 0 2 8:16 A 0 0 8
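The comparison rules above can be condensed into one small function. The following stand-alone C sketch follows the documented formula and optimizations literally; it is not the dm-service-time module itself, and the struct and function names are invented for the example:

```c
#include <stdio.h>

struct st_path {
	const char *name;
	unsigned long long in_flight_size;	/* bytes currently in flight on this path */
	unsigned relative_throughput;		/* 0-100; 0 = skip while others are usable */
};

/*
 * Return <0 if a is the better path, >0 if b is, 0 on a tie, following the
 * documented rules: ignore zero-throughput paths, skip the division when
 * either the throughputs or the in-flight sizes match, and break service-time
 * ties by preferring the higher relative_throughput.
 */
static int st_compare(const struct st_path *a, const struct st_path *b,
		      unsigned long long incoming_bytes)
{
	unsigned long long sta, stb;

	/* Optimization 3: a zero-throughput path loses to a non-zero one. */
	if (!a->relative_throughput || !b->relative_throughput)
		return (int)b->relative_throughput - (int)a->relative_throughput;

	/* Optimization 1: same throughput - just compare in-flight sizes. */
	if (a->relative_throughput == b->relative_throughput) {
		if (a->in_flight_size < b->in_flight_size) return -1;
		if (a->in_flight_size > b->in_flight_size) return 1;
		return 0;
	}

	/* Optimization 2: same in-flight size - the higher throughput wins. */
	if (a->in_flight_size == b->in_flight_size)
		return (int)b->relative_throughput - (int)a->relative_throughput;

	/* General case: compare estimated service times. */
	sta = (a->in_flight_size + incoming_bytes) / a->relative_throughput;
	stb = (b->in_flight_size + incoming_bytes) / b->relative_throughput;
	if (sta < stb) return -1;
	if (sta > stb) return 1;
	/* Equal service time: prefer the higher relative_throughput. */
	return (int)b->relative_throughput - (int)a->relative_throughput;
}

int main(void)
{
	struct st_path sda = { "8:0",  1 << 20, 1 };	/* 1 MiB in flight, throughput 1 */
	struct st_path sdb = { "8:16", 3 << 20, 4 };	/* 3 MiB in flight, throughput 4 */
	int c = st_compare(&sda, &sdb, 4096);

	/* sdb wins: (3M + 4K) / 4 is smaller than (1M + 4K) / 1 */
	printf("better path: %s\n", c <= 0 ? sda.name : sdb.name);
	return 0;
}
```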
+30
drivers/md/Kconfig
··· 231 231 Allow volume managers to mirror logical volumes, also 232 232 needed for live data migration tools such as 'pvmove'. 233 233 234 + config DM_LOG_USERSPACE 235 + tristate "Mirror userspace logging (EXPERIMENTAL)" 236 + depends on DM_MIRROR && EXPERIMENTAL && NET 237 + select CONNECTOR 238 + ---help--- 239 + The userspace logging module provides a mechanism for 240 + relaying the dm-dirty-log API to userspace. Log designs 241 + which are more suited to userspace implementation (e.g. 242 + shared storage logs) or experimental logs can be implemented 243 + by leveraging this framework. 244 + 234 245 config DM_ZERO 235 246 tristate "Zero target" 236 247 depends on BLK_DEV_DM ··· 259 248 depends on SCSI_DH || !SCSI_DH 260 249 ---help--- 261 250 Allow volume managers to support multipath hardware. 251 + 252 + config DM_MULTIPATH_QL 253 + tristate "I/O Path Selector based on the number of in-flight I/Os" 254 + depends on DM_MULTIPATH 255 + ---help--- 256 + This path selector is a dynamic load balancer which selects 257 + the path with the least number of in-flight I/Os. 258 + 259 + If unsure, say N. 260 + 261 + config DM_MULTIPATH_ST 262 + tristate "I/O Path Selector based on the service time" 263 + depends on DM_MULTIPATH 264 + ---help--- 265 + This path selector is a dynamic load balancer which selects 266 + the path expected to complete the incoming I/O in the shortest 267 + time. 268 + 269 + If unsure, say N. 262 270 263 271 config DM_DELAY 264 272 tristate "I/O delaying target (EXPERIMENTAL)"
+5
drivers/md/Makefile
··· 8 8 dm-snapshot-y += dm-snap.o dm-exception-store.o dm-snap-transient.o \ 9 9 dm-snap-persistent.o 10 10 dm-mirror-y += dm-raid1.o 11 + dm-log-userspace-y \ 12 + += dm-log-userspace-base.o dm-log-userspace-transfer.o 11 13 md-mod-y += md.o bitmap.o 12 14 raid456-y += raid5.o 13 15 raid6_pq-y += raid6algos.o raid6recov.o raid6tables.o \ ··· 38 36 obj-$(CONFIG_DM_CRYPT) += dm-crypt.o 39 37 obj-$(CONFIG_DM_DELAY) += dm-delay.o 40 38 obj-$(CONFIG_DM_MULTIPATH) += dm-multipath.o dm-round-robin.o 39 + obj-$(CONFIG_DM_MULTIPATH_QL) += dm-queue-length.o 40 + obj-$(CONFIG_DM_MULTIPATH_ST) += dm-service-time.o 41 41 obj-$(CONFIG_DM_SNAPSHOT) += dm-snapshot.o 42 42 obj-$(CONFIG_DM_MIRROR) += dm-mirror.o dm-log.o dm-region-hash.o 43 + obj-$(CONFIG_DM_LOG_USERSPACE) += dm-log-userspace.o 43 44 obj-$(CONFIG_DM_ZERO) += dm-zero.o 44 45 45 46 quiet_cmd_unroll = UNROLL $@
+18 -1
drivers/md/dm-crypt.c
··· 1132 1132 goto bad_crypt_queue; 1133 1133 } 1134 1134 1135 + ti->num_flush_requests = 1; 1135 1136 ti->private = cc; 1136 1137 return 0; 1137 1138 ··· 1190 1189 union map_info *map_context) 1191 1190 { 1192 1191 struct dm_crypt_io *io; 1192 + struct crypt_config *cc; 1193 + 1194 + if (unlikely(bio_empty_barrier(bio))) { 1195 + cc = ti->private; 1196 + bio->bi_bdev = cc->dev->bdev; 1197 + return DM_MAPIO_REMAPPED; 1198 + } 1193 1199 1194 1200 io = crypt_io_alloc(ti, bio, bio->bi_sector - ti->begin); 1195 1201 ··· 1313 1305 return min(max_size, q->merge_bvec_fn(q, bvm, biovec)); 1314 1306 } 1315 1307 1308 + static int crypt_iterate_devices(struct dm_target *ti, 1309 + iterate_devices_callout_fn fn, void *data) 1310 + { 1311 + struct crypt_config *cc = ti->private; 1312 + 1313 + return fn(ti, cc->dev, cc->start, data); 1314 + } 1315 + 1316 1316 static struct target_type crypt_target = { 1317 1317 .name = "crypt", 1318 - .version= {1, 6, 0}, 1318 + .version = {1, 7, 0}, 1319 1319 .module = THIS_MODULE, 1320 1320 .ctr = crypt_ctr, 1321 1321 .dtr = crypt_dtr, ··· 1334 1318 .resume = crypt_resume, 1335 1319 .message = crypt_message, 1336 1320 .merge = crypt_merge, 1321 + .iterate_devices = crypt_iterate_devices, 1337 1322 }; 1338 1323 1339 1324 static int __init dm_crypt_init(void)
+23 -3
drivers/md/dm-delay.c
··· 197 197 mutex_init(&dc->timer_lock); 198 198 atomic_set(&dc->may_delay, 1); 199 199 200 + ti->num_flush_requests = 1; 200 201 ti->private = dc; 201 202 return 0; 202 203 ··· 279 278 280 279 if ((bio_data_dir(bio) == WRITE) && (dc->dev_write)) { 281 280 bio->bi_bdev = dc->dev_write->bdev; 282 - bio->bi_sector = dc->start_write + 283 - (bio->bi_sector - ti->begin); 281 + if (bio_sectors(bio)) 282 + bio->bi_sector = dc->start_write + 283 + (bio->bi_sector - ti->begin); 284 284 285 285 return delay_bio(dc, dc->write_delay, bio); 286 286 } ··· 318 316 return 0; 319 317 } 320 318 319 + static int delay_iterate_devices(struct dm_target *ti, 320 + iterate_devices_callout_fn fn, void *data) 321 + { 322 + struct delay_c *dc = ti->private; 323 + int ret = 0; 324 + 325 + ret = fn(ti, dc->dev_read, dc->start_read, data); 326 + if (ret) 327 + goto out; 328 + 329 + if (dc->dev_write) 330 + ret = fn(ti, dc->dev_write, dc->start_write, data); 331 + 332 + out: 333 + return ret; 334 + } 335 + 321 336 static struct target_type delay_target = { 322 337 .name = "delay", 323 - .version = {1, 0, 2}, 338 + .version = {1, 1, 0}, 324 339 .module = THIS_MODULE, 325 340 .ctr = delay_ctr, 326 341 .dtr = delay_dtr, ··· 345 326 .presuspend = delay_presuspend, 346 327 .resume = delay_resume, 347 328 .status = delay_status, 329 + .iterate_devices = delay_iterate_devices, 348 330 }; 349 331 350 332 static int __init dm_delay_init(void)
+1 -1
drivers/md/dm-exception-store.c
··· 216 216 return -EINVAL; 217 217 } 218 218 219 - type = get_type(argv[1]); 219 + type = get_type(&persistent); 220 220 if (!type) { 221 221 ti->error = "Exception store type not recognised"; 222 222 r = -EINVAL;
+1 -1
drivers/md/dm-exception-store.h
··· 156 156 */ 157 157 static inline sector_t get_dev_size(struct block_device *bdev) 158 158 { 159 - return bdev->bd_inode->i_size >> SECTOR_SHIFT; 159 + return i_size_read(bdev->bd_inode) >> SECTOR_SHIFT; 160 160 } 161 161 162 162 static inline chunk_t sector_to_chunk(struct dm_exception_store *store,
+13 -1
drivers/md/dm-io.c
··· 22 22 /* FIXME: can we shrink this ? */ 23 23 struct io { 24 24 unsigned long error_bits; 25 + unsigned long eopnotsupp_bits; 25 26 atomic_t count; 26 27 struct task_struct *sleeper; 27 28 struct dm_io_client *client; ··· 108 107 *---------------------------------------------------------------*/ 109 108 static void dec_count(struct io *io, unsigned int region, int error) 110 109 { 111 - if (error) 110 + if (error) { 112 111 set_bit(region, &io->error_bits); 112 + if (error == -EOPNOTSUPP) 113 + set_bit(region, &io->eopnotsupp_bits); 114 + } 113 115 114 116 if (atomic_dec_and_test(&io->count)) { 115 117 if (io->sleeper) ··· 364 360 return -EIO; 365 361 } 366 362 363 + retry: 367 364 io.error_bits = 0; 365 + io.eopnotsupp_bits = 0; 368 366 atomic_set(&io.count, 1); /* see dispatch_io() */ 369 367 io.sleeper = current; 370 368 io.client = client; ··· 382 376 io_schedule(); 383 377 } 384 378 set_current_state(TASK_RUNNING); 379 + 380 + if (io.eopnotsupp_bits && (rw & (1 << BIO_RW_BARRIER))) { 381 + rw &= ~(1 << BIO_RW_BARRIER); 382 + goto retry; 383 + } 385 384 386 385 if (error_bits) 387 386 *error_bits = io.error_bits; ··· 408 397 409 398 io = mempool_alloc(client->pool, GFP_NOIO); 410 399 io->error_bits = 0; 400 + io->eopnotsupp_bits = 0; 411 401 atomic_set(&io->count, 1); /* see dispatch_io() */ 412 402 io->sleeper = NULL; 413 403 io->client = client;
+23 -4
drivers/md/dm-ioctl.c
··· 276 276 up_write(&_hash_lock); 277 277 } 278 278 279 - static int dm_hash_rename(const char *old, const char *new) 279 + static int dm_hash_rename(uint32_t cookie, const char *old, const char *new) 280 280 { 281 281 char *new_name, *old_name; 282 282 struct hash_cell *hc; ··· 333 333 dm_table_put(table); 334 334 } 335 335 336 - dm_kobject_uevent(hc->md); 336 + dm_kobject_uevent(hc->md, KOBJ_CHANGE, cookie); 337 337 338 338 dm_put(hc->md); 339 339 up_write(&_hash_lock); ··· 680 680 681 681 __hash_remove(hc); 682 682 up_write(&_hash_lock); 683 + 684 + dm_kobject_uevent(md, KOBJ_REMOVE, param->event_nr); 685 + 683 686 dm_put(md); 684 687 param->data_size = 0; 685 688 return 0; ··· 718 715 return r; 719 716 720 717 param->data_size = 0; 721 - return dm_hash_rename(param->name, new_name); 718 + return dm_hash_rename(param->event_nr, param->name, new_name); 722 719 } 723 720 724 721 static int dev_set_geometry(struct dm_ioctl *param, size_t param_size) ··· 845 842 if (dm_suspended(md)) 846 843 r = dm_resume(md); 847 844 848 - if (!r) 845 + 846 + if (!r) { 847 + dm_kobject_uevent(md, KOBJ_CHANGE, param->event_nr); 849 848 r = __dev_status(md, param); 849 + } 850 850 851 851 dm_put(md); 852 852 return r; ··· 1050 1044 next = spec->next; 1051 1045 } 1052 1046 1047 + r = dm_table_set_type(table); 1048 + if (r) { 1049 + DMWARN("unable to set table type"); 1050 + return r; 1051 + } 1052 + 1053 1053 return dm_table_complete(table); 1054 1054 } 1055 1055 ··· 1097 1085 if (r) { 1098 1086 DMERR("%s: could not register integrity profile.", 1099 1087 dm_device_name(md)); 1088 + dm_table_destroy(t); 1089 + goto out; 1090 + } 1091 + 1092 + r = dm_table_alloc_md_mempools(t); 1093 + if (r) { 1094 + DMWARN("unable to allocate mempools for this table"); 1100 1095 dm_table_destroy(t); 1101 1096 goto out; 1102 1097 }
+13 -2
drivers/md/dm-linear.c
··· 53 53 goto bad; 54 54 } 55 55 56 + ti->num_flush_requests = 1; 56 57 ti->private = lc; 57 58 return 0; 58 59 ··· 82 81 struct linear_c *lc = ti->private; 83 82 84 83 bio->bi_bdev = lc->dev->bdev; 85 - bio->bi_sector = linear_map_sector(ti, bio->bi_sector); 84 + if (bio_sectors(bio)) 85 + bio->bi_sector = linear_map_sector(ti, bio->bi_sector); 86 86 } 87 87 88 88 static int linear_map(struct dm_target *ti, struct bio *bio, ··· 134 132 return min(max_size, q->merge_bvec_fn(q, bvm, biovec)); 135 133 } 136 134 135 + static int linear_iterate_devices(struct dm_target *ti, 136 + iterate_devices_callout_fn fn, void *data) 137 + { 138 + struct linear_c *lc = ti->private; 139 + 140 + return fn(ti, lc->dev, lc->start, data); 141 + } 142 + 137 143 static struct target_type linear_target = { 138 144 .name = "linear", 139 - .version= {1, 0, 3}, 145 + .version = {1, 1, 0}, 140 146 .module = THIS_MODULE, 141 147 .ctr = linear_ctr, 142 148 .dtr = linear_dtr, ··· 152 142 .status = linear_status, 153 143 .ioctl = linear_ioctl, 154 144 .merge = linear_merge, 145 + .iterate_devices = linear_iterate_devices, 155 146 }; 156 147 157 148 int __init dm_linear_init(void)
+696
drivers/md/dm-log-userspace-base.c
··· 1 + /* 2 + * Copyright (C) 2006-2009 Red Hat, Inc. 3 + * 4 + * This file is released under the LGPL. 5 + */ 6 + 7 + #include <linux/bio.h> 8 + #include <linux/dm-dirty-log.h> 9 + #include <linux/device-mapper.h> 10 + #include <linux/dm-log-userspace.h> 11 + 12 + #include "dm-log-userspace-transfer.h" 13 + 14 + struct flush_entry { 15 + int type; 16 + region_t region; 17 + struct list_head list; 18 + }; 19 + 20 + struct log_c { 21 + struct dm_target *ti; 22 + uint32_t region_size; 23 + region_t region_count; 24 + char uuid[DM_UUID_LEN]; 25 + 26 + char *usr_argv_str; 27 + uint32_t usr_argc; 28 + 29 + /* 30 + * in_sync_hint gets set when doing is_remote_recovering. It 31 + * represents the first region that needs recovery. IOW, the 32 + * first zero bit of sync_bits. This can be useful for to limit 33 + * traffic for calls like is_remote_recovering and get_resync_work, 34 + * but be take care in its use for anything else. 35 + */ 36 + uint64_t in_sync_hint; 37 + 38 + spinlock_t flush_lock; 39 + struct list_head flush_list; /* only for clear and mark requests */ 40 + }; 41 + 42 + static mempool_t *flush_entry_pool; 43 + 44 + static void *flush_entry_alloc(gfp_t gfp_mask, void *pool_data) 45 + { 46 + return kmalloc(sizeof(struct flush_entry), gfp_mask); 47 + } 48 + 49 + static void flush_entry_free(void *element, void *pool_data) 50 + { 51 + kfree(element); 52 + } 53 + 54 + static int userspace_do_request(struct log_c *lc, const char *uuid, 55 + int request_type, char *data, size_t data_size, 56 + char *rdata, size_t *rdata_size) 57 + { 58 + int r; 59 + 60 + /* 61 + * If the server isn't there, -ESRCH is returned, 62 + * and we must keep trying until the server is 63 + * restored. 64 + */ 65 + retry: 66 + r = dm_consult_userspace(uuid, request_type, data, 67 + data_size, rdata, rdata_size); 68 + 69 + if (r != -ESRCH) 70 + return r; 71 + 72 + DMERR(" Userspace log server not found."); 73 + while (1) { 74 + set_current_state(TASK_INTERRUPTIBLE); 75 + schedule_timeout(2*HZ); 76 + DMWARN("Attempting to contact userspace log server..."); 77 + r = dm_consult_userspace(uuid, DM_ULOG_CTR, lc->usr_argv_str, 78 + strlen(lc->usr_argv_str) + 1, 79 + NULL, NULL); 80 + if (!r) 81 + break; 82 + } 83 + DMINFO("Reconnected to userspace log server... DM_ULOG_CTR complete"); 84 + r = dm_consult_userspace(uuid, DM_ULOG_RESUME, NULL, 85 + 0, NULL, NULL); 86 + if (!r) 87 + goto retry; 88 + 89 + DMERR("Error trying to resume userspace log: %d", r); 90 + 91 + return -ESRCH; 92 + } 93 + 94 + static int build_constructor_string(struct dm_target *ti, 95 + unsigned argc, char **argv, 96 + char **ctr_str) 97 + { 98 + int i, str_size; 99 + char *str = NULL; 100 + 101 + *ctr_str = NULL; 102 + 103 + for (i = 0, str_size = 0; i < argc; i++) 104 + str_size += strlen(argv[i]) + 1; /* +1 for space between args */ 105 + 106 + str_size += 20; /* Max number of chars in a printed u64 number */ 107 + 108 + str = kzalloc(str_size, GFP_KERNEL); 109 + if (!str) { 110 + DMWARN("Unable to allocate memory for constructor string"); 111 + return -ENOMEM; 112 + } 113 + 114 + for (i = 0, str_size = 0; i < argc; i++) 115 + str_size += sprintf(str + str_size, "%s ", argv[i]); 116 + str_size += sprintf(str + str_size, "%llu", 117 + (unsigned long long)ti->len); 118 + 119 + *ctr_str = str; 120 + return str_size; 121 + } 122 + 123 + /* 124 + * userspace_ctr 125 + * 126 + * argv contains: 127 + * <UUID> <other args> 128 + * Where 'other args' is the userspace implementation specific log 129 + * arguments. 
An example might be: 130 + * <UUID> clustered_disk <arg count> <log dev> <region_size> [[no]sync] 131 + * 132 + * So, this module will strip off the <UUID> for identification purposes 133 + * when communicating with userspace about a log; but will pass on everything 134 + * else. 135 + */ 136 + static int userspace_ctr(struct dm_dirty_log *log, struct dm_target *ti, 137 + unsigned argc, char **argv) 138 + { 139 + int r = 0; 140 + int str_size; 141 + char *ctr_str = NULL; 142 + struct log_c *lc = NULL; 143 + uint64_t rdata; 144 + size_t rdata_size = sizeof(rdata); 145 + 146 + if (argc < 3) { 147 + DMWARN("Too few arguments to userspace dirty log"); 148 + return -EINVAL; 149 + } 150 + 151 + lc = kmalloc(sizeof(*lc), GFP_KERNEL); 152 + if (!lc) { 153 + DMWARN("Unable to allocate userspace log context."); 154 + return -ENOMEM; 155 + } 156 + 157 + lc->ti = ti; 158 + 159 + if (strlen(argv[0]) > (DM_UUID_LEN - 1)) { 160 + DMWARN("UUID argument too long."); 161 + kfree(lc); 162 + return -EINVAL; 163 + } 164 + 165 + strncpy(lc->uuid, argv[0], DM_UUID_LEN); 166 + spin_lock_init(&lc->flush_lock); 167 + INIT_LIST_HEAD(&lc->flush_list); 168 + 169 + str_size = build_constructor_string(ti, argc - 1, argv + 1, &ctr_str); 170 + if (str_size < 0) { 171 + kfree(lc); 172 + return str_size; 173 + } 174 + 175 + /* Send table string */ 176 + r = dm_consult_userspace(lc->uuid, DM_ULOG_CTR, 177 + ctr_str, str_size, NULL, NULL); 178 + 179 + if (r == -ESRCH) { 180 + DMERR("Userspace log server not found"); 181 + goto out; 182 + } 183 + 184 + /* Since the region size does not change, get it now */ 185 + rdata_size = sizeof(rdata); 186 + r = dm_consult_userspace(lc->uuid, DM_ULOG_GET_REGION_SIZE, 187 + NULL, 0, (char *)&rdata, &rdata_size); 188 + 189 + if (r) { 190 + DMERR("Failed to get region size of dirty log"); 191 + goto out; 192 + } 193 + 194 + lc->region_size = (uint32_t)rdata; 195 + lc->region_count = dm_sector_div_up(ti->len, lc->region_size); 196 + 197 + out: 198 + if (r) { 199 + kfree(lc); 200 + kfree(ctr_str); 201 + } else { 202 + lc->usr_argv_str = ctr_str; 203 + lc->usr_argc = argc; 204 + log->context = lc; 205 + } 206 + 207 + return r; 208 + } 209 + 210 + static void userspace_dtr(struct dm_dirty_log *log) 211 + { 212 + int r; 213 + struct log_c *lc = log->context; 214 + 215 + r = dm_consult_userspace(lc->uuid, DM_ULOG_DTR, 216 + NULL, 0, 217 + NULL, NULL); 218 + 219 + kfree(lc->usr_argv_str); 220 + kfree(lc); 221 + 222 + return; 223 + } 224 + 225 + static int userspace_presuspend(struct dm_dirty_log *log) 226 + { 227 + int r; 228 + struct log_c *lc = log->context; 229 + 230 + r = dm_consult_userspace(lc->uuid, DM_ULOG_PRESUSPEND, 231 + NULL, 0, 232 + NULL, NULL); 233 + 234 + return r; 235 + } 236 + 237 + static int userspace_postsuspend(struct dm_dirty_log *log) 238 + { 239 + int r; 240 + struct log_c *lc = log->context; 241 + 242 + r = dm_consult_userspace(lc->uuid, DM_ULOG_POSTSUSPEND, 243 + NULL, 0, 244 + NULL, NULL); 245 + 246 + return r; 247 + } 248 + 249 + static int userspace_resume(struct dm_dirty_log *log) 250 + { 251 + int r; 252 + struct log_c *lc = log->context; 253 + 254 + lc->in_sync_hint = 0; 255 + r = dm_consult_userspace(lc->uuid, DM_ULOG_RESUME, 256 + NULL, 0, 257 + NULL, NULL); 258 + 259 + return r; 260 + } 261 + 262 + static uint32_t userspace_get_region_size(struct dm_dirty_log *log) 263 + { 264 + struct log_c *lc = log->context; 265 + 266 + return lc->region_size; 267 + } 268 + 269 + /* 270 + * userspace_is_clean 271 + * 272 + * Check whether a region is clean. 
If there is any sort of 273 + * failure when consulting the server, we return not clean. 274 + * 275 + * Returns: 1 if clean, 0 otherwise 276 + */ 277 + static int userspace_is_clean(struct dm_dirty_log *log, region_t region) 278 + { 279 + int r; 280 + uint64_t region64 = (uint64_t)region; 281 + int64_t is_clean; 282 + size_t rdata_size; 283 + struct log_c *lc = log->context; 284 + 285 + rdata_size = sizeof(is_clean); 286 + r = userspace_do_request(lc, lc->uuid, DM_ULOG_IS_CLEAN, 287 + (char *)&region64, sizeof(region64), 288 + (char *)&is_clean, &rdata_size); 289 + 290 + return (r) ? 0 : (int)is_clean; 291 + } 292 + 293 + /* 294 + * userspace_in_sync 295 + * 296 + * Check if the region is in-sync. If there is any sort 297 + * of failure when consulting the server, we assume that 298 + * the region is not in sync. 299 + * 300 + * If 'can_block' is set, return immediately 301 + * 302 + * Returns: 1 if in-sync, 0 if not-in-sync, -EWOULDBLOCK 303 + */ 304 + static int userspace_in_sync(struct dm_dirty_log *log, region_t region, 305 + int can_block) 306 + { 307 + int r; 308 + uint64_t region64 = region; 309 + int64_t in_sync; 310 + size_t rdata_size; 311 + struct log_c *lc = log->context; 312 + 313 + /* 314 + * We can never respond directly - even if in_sync_hint is 315 + * set. This is because another machine could see a device 316 + * failure and mark the region out-of-sync. If we don't go 317 + * to userspace to ask, we might think the region is in-sync 318 + * and allow a read to pick up data that is stale. (This is 319 + * very unlikely if a device actually fails; but it is very 320 + * likely if a connection to one device from one machine fails.) 321 + * 322 + * There still might be a problem if the mirror caches the region 323 + * state as in-sync... but then this call would not be made. So, 324 + * that is a mirror problem. 325 + */ 326 + if (!can_block) 327 + return -EWOULDBLOCK; 328 + 329 + rdata_size = sizeof(in_sync); 330 + r = userspace_do_request(lc, lc->uuid, DM_ULOG_IN_SYNC, 331 + (char *)&region64, sizeof(region64), 332 + (char *)&in_sync, &rdata_size); 333 + return (r) ? 0 : (int)in_sync; 334 + } 335 + 336 + /* 337 + * userspace_flush 338 + * 339 + * This function is ok to block. 340 + * The flush happens in two stages. First, it sends all 341 + * clear/mark requests that are on the list. Then it 342 + * tells the server to commit them. This gives the 343 + * server a chance to optimise the commit, instead of 344 + * doing it for every request. 345 + * 346 + * Additionally, we could implement another thread that 347 + * sends the requests up to the server - reducing the 348 + * load on flush. Then the flush would have less in 349 + * the list and be responsible for the finishing commit. 350 + * 351 + * Returns: 0 on success, < 0 on failure 352 + */ 353 + static int userspace_flush(struct dm_dirty_log *log) 354 + { 355 + int r = 0; 356 + unsigned long flags; 357 + struct log_c *lc = log->context; 358 + LIST_HEAD(flush_list); 359 + struct flush_entry *fe, *tmp_fe; 360 + 361 + spin_lock_irqsave(&lc->flush_lock, flags); 362 + list_splice_init(&lc->flush_list, &flush_list); 363 + spin_unlock_irqrestore(&lc->flush_lock, flags); 364 + 365 + if (list_empty(&flush_list)) 366 + return 0; 367 + 368 + /* 369 + * FIXME: Count up requests, group request types, 370 + * allocate memory to stick all requests in and 371 + * send to server in one go. Failing the allocation, 372 + * do it one by one. 
373 + */ 374 + 375 + list_for_each_entry(fe, &flush_list, list) { 376 + r = userspace_do_request(lc, lc->uuid, fe->type, 377 + (char *)&fe->region, 378 + sizeof(fe->region), 379 + NULL, NULL); 380 + if (r) 381 + goto fail; 382 + } 383 + 384 + r = userspace_do_request(lc, lc->uuid, DM_ULOG_FLUSH, 385 + NULL, 0, NULL, NULL); 386 + 387 + fail: 388 + /* 389 + * We can safely remove these entries, even if failure. 390 + * Calling code will receive an error and will know that 391 + * the log facility has failed. 392 + */ 393 + list_for_each_entry_safe(fe, tmp_fe, &flush_list, list) { 394 + list_del(&fe->list); 395 + mempool_free(fe, flush_entry_pool); 396 + } 397 + 398 + if (r) 399 + dm_table_event(lc->ti->table); 400 + 401 + return r; 402 + } 403 + 404 + /* 405 + * userspace_mark_region 406 + * 407 + * This function should avoid blocking unless absolutely required. 408 + * (Memory allocation is valid for blocking.) 409 + */ 410 + static void userspace_mark_region(struct dm_dirty_log *log, region_t region) 411 + { 412 + unsigned long flags; 413 + struct log_c *lc = log->context; 414 + struct flush_entry *fe; 415 + 416 + /* Wait for an allocation, but _never_ fail */ 417 + fe = mempool_alloc(flush_entry_pool, GFP_NOIO); 418 + BUG_ON(!fe); 419 + 420 + spin_lock_irqsave(&lc->flush_lock, flags); 421 + fe->type = DM_ULOG_MARK_REGION; 422 + fe->region = region; 423 + list_add(&fe->list, &lc->flush_list); 424 + spin_unlock_irqrestore(&lc->flush_lock, flags); 425 + 426 + return; 427 + } 428 + 429 + /* 430 + * userspace_clear_region 431 + * 432 + * This function must not block. 433 + * So, the alloc can't block. In the worst case, it is ok to 434 + * fail. It would simply mean we can't clear the region. 435 + * Does nothing to current sync context, but does mean 436 + * the region will be re-sync'ed on a reload of the mirror 437 + * even though it is in-sync. 438 + */ 439 + static void userspace_clear_region(struct dm_dirty_log *log, region_t region) 440 + { 441 + unsigned long flags; 442 + struct log_c *lc = log->context; 443 + struct flush_entry *fe; 444 + 445 + /* 446 + * If we fail to allocate, we skip the clearing of 447 + * the region. This doesn't hurt us in any way, except 448 + * to cause the region to be resync'ed when the 449 + * device is activated next time. 450 + */ 451 + fe = mempool_alloc(flush_entry_pool, GFP_ATOMIC); 452 + if (!fe) { 453 + DMERR("Failed to allocate memory to clear region."); 454 + return; 455 + } 456 + 457 + spin_lock_irqsave(&lc->flush_lock, flags); 458 + fe->type = DM_ULOG_CLEAR_REGION; 459 + fe->region = region; 460 + list_add(&fe->list, &lc->flush_list); 461 + spin_unlock_irqrestore(&lc->flush_lock, flags); 462 + 463 + return; 464 + } 465 + 466 + /* 467 + * userspace_get_resync_work 468 + * 469 + * Get a region that needs recovery. It is valid to return 470 + * an error for this function. 471 + * 472 + * Returns: 1 if region filled, 0 if no work, <0 on error 473 + */ 474 + static int userspace_get_resync_work(struct dm_dirty_log *log, region_t *region) 475 + { 476 + int r; 477 + size_t rdata_size; 478 + struct log_c *lc = log->context; 479 + struct { 480 + int64_t i; /* 64-bit for mix arch compatibility */ 481 + region_t r; 482 + } pkg; 483 + 484 + if (lc->in_sync_hint >= lc->region_count) 485 + return 0; 486 + 487 + rdata_size = sizeof(pkg); 488 + r = userspace_do_request(lc, lc->uuid, DM_ULOG_GET_RESYNC_WORK, 489 + NULL, 0, 490 + (char *)&pkg, &rdata_size); 491 + 492 + *region = pkg.r; 493 + return (r) ? 
r : (int)pkg.i; 494 + } 495 + 496 + /* 497 + * userspace_set_region_sync 498 + * 499 + * Set the sync status of a given region. This function 500 + * must not fail. 501 + */ 502 + static void userspace_set_region_sync(struct dm_dirty_log *log, 503 + region_t region, int in_sync) 504 + { 505 + int r; 506 + struct log_c *lc = log->context; 507 + struct { 508 + region_t r; 509 + int64_t i; 510 + } pkg; 511 + 512 + pkg.r = region; 513 + pkg.i = (int64_t)in_sync; 514 + 515 + r = userspace_do_request(lc, lc->uuid, DM_ULOG_SET_REGION_SYNC, 516 + (char *)&pkg, sizeof(pkg), 517 + NULL, NULL); 518 + 519 + /* 520 + * It would be nice to be able to report failures. 521 + * However, it is easy emough to detect and resolve. 522 + */ 523 + return; 524 + } 525 + 526 + /* 527 + * userspace_get_sync_count 528 + * 529 + * If there is any sort of failure when consulting the server, 530 + * we assume that the sync count is zero. 531 + * 532 + * Returns: sync count on success, 0 on failure 533 + */ 534 + static region_t userspace_get_sync_count(struct dm_dirty_log *log) 535 + { 536 + int r; 537 + size_t rdata_size; 538 + uint64_t sync_count; 539 + struct log_c *lc = log->context; 540 + 541 + rdata_size = sizeof(sync_count); 542 + r = userspace_do_request(lc, lc->uuid, DM_ULOG_GET_SYNC_COUNT, 543 + NULL, 0, 544 + (char *)&sync_count, &rdata_size); 545 + 546 + if (r) 547 + return 0; 548 + 549 + if (sync_count >= lc->region_count) 550 + lc->in_sync_hint = lc->region_count; 551 + 552 + return (region_t)sync_count; 553 + } 554 + 555 + /* 556 + * userspace_status 557 + * 558 + * Returns: amount of space consumed 559 + */ 560 + static int userspace_status(struct dm_dirty_log *log, status_type_t status_type, 561 + char *result, unsigned maxlen) 562 + { 563 + int r = 0; 564 + size_t sz = (size_t)maxlen; 565 + struct log_c *lc = log->context; 566 + 567 + switch (status_type) { 568 + case STATUSTYPE_INFO: 569 + r = userspace_do_request(lc, lc->uuid, DM_ULOG_STATUS_INFO, 570 + NULL, 0, 571 + result, &sz); 572 + 573 + if (r) { 574 + sz = 0; 575 + DMEMIT("%s 1 COM_FAILURE", log->type->name); 576 + } 577 + break; 578 + case STATUSTYPE_TABLE: 579 + sz = 0; 580 + DMEMIT("%s %u %s %s", log->type->name, lc->usr_argc + 1, 581 + lc->uuid, lc->usr_argv_str); 582 + break; 583 + } 584 + return (r) ? 0 : (int)sz; 585 + } 586 + 587 + /* 588 + * userspace_is_remote_recovering 589 + * 590 + * Returns: 1 if region recovering, 0 otherwise 591 + */ 592 + static int userspace_is_remote_recovering(struct dm_dirty_log *log, 593 + region_t region) 594 + { 595 + int r; 596 + uint64_t region64 = region; 597 + struct log_c *lc = log->context; 598 + static unsigned long long limit; 599 + struct { 600 + int64_t is_recovering; 601 + uint64_t in_sync_hint; 602 + } pkg; 603 + size_t rdata_size = sizeof(pkg); 604 + 605 + /* 606 + * Once the mirror has been reported to be in-sync, 607 + * it will never again ask for recovery work. So, 608 + * we can safely say there is not a remote machine 609 + * recovering if the device is in-sync. (in_sync_hint 610 + * must be reset at resume time.) 
611 + */ 612 + if (region < lc->in_sync_hint) 613 + return 0; 614 + else if (jiffies < limit) 615 + return 1; 616 + 617 + limit = jiffies + (HZ / 4); 618 + r = userspace_do_request(lc, lc->uuid, DM_ULOG_IS_REMOTE_RECOVERING, 619 + (char *)&region64, sizeof(region64), 620 + (char *)&pkg, &rdata_size); 621 + if (r) 622 + return 1; 623 + 624 + lc->in_sync_hint = pkg.in_sync_hint; 625 + 626 + return (int)pkg.is_recovering; 627 + } 628 + 629 + static struct dm_dirty_log_type _userspace_type = { 630 + .name = "userspace", 631 + .module = THIS_MODULE, 632 + .ctr = userspace_ctr, 633 + .dtr = userspace_dtr, 634 + .presuspend = userspace_presuspend, 635 + .postsuspend = userspace_postsuspend, 636 + .resume = userspace_resume, 637 + .get_region_size = userspace_get_region_size, 638 + .is_clean = userspace_is_clean, 639 + .in_sync = userspace_in_sync, 640 + .flush = userspace_flush, 641 + .mark_region = userspace_mark_region, 642 + .clear_region = userspace_clear_region, 643 + .get_resync_work = userspace_get_resync_work, 644 + .set_region_sync = userspace_set_region_sync, 645 + .get_sync_count = userspace_get_sync_count, 646 + .status = userspace_status, 647 + .is_remote_recovering = userspace_is_remote_recovering, 648 + }; 649 + 650 + static int __init userspace_dirty_log_init(void) 651 + { 652 + int r = 0; 653 + 654 + flush_entry_pool = mempool_create(100, flush_entry_alloc, 655 + flush_entry_free, NULL); 656 + 657 + if (!flush_entry_pool) { 658 + DMWARN("Unable to create flush_entry_pool: No memory."); 659 + return -ENOMEM; 660 + } 661 + 662 + r = dm_ulog_tfr_init(); 663 + if (r) { 664 + DMWARN("Unable to initialize userspace log communications"); 665 + mempool_destroy(flush_entry_pool); 666 + return r; 667 + } 668 + 669 + r = dm_dirty_log_type_register(&_userspace_type); 670 + if (r) { 671 + DMWARN("Couldn't register userspace dirty log type"); 672 + dm_ulog_tfr_exit(); 673 + mempool_destroy(flush_entry_pool); 674 + return r; 675 + } 676 + 677 + DMINFO("version 1.0.0 loaded"); 678 + return 0; 679 + } 680 + 681 + static void __exit userspace_dirty_log_exit(void) 682 + { 683 + dm_dirty_log_type_unregister(&_userspace_type); 684 + dm_ulog_tfr_exit(); 685 + mempool_destroy(flush_entry_pool); 686 + 687 + DMINFO("version 1.0.0 unloaded"); 688 + return; 689 + } 690 + 691 + module_init(userspace_dirty_log_init); 692 + module_exit(userspace_dirty_log_exit); 693 + 694 + MODULE_DESCRIPTION(DM_NAME " userspace dirty log link"); 695 + MODULE_AUTHOR("Jonathan Brassow <dm-devel@redhat.com>"); 696 + MODULE_LICENSE("GPL");
+276
drivers/md/dm-log-userspace-transfer.c
··· 1 + /* 2 + * Copyright (C) 2006-2009 Red Hat, Inc. 3 + * 4 + * This file is released under the LGPL. 5 + */ 6 + 7 + #include <linux/kernel.h> 8 + #include <linux/module.h> 9 + #include <net/sock.h> 10 + #include <linux/workqueue.h> 11 + #include <linux/connector.h> 12 + #include <linux/device-mapper.h> 13 + #include <linux/dm-log-userspace.h> 14 + 15 + #include "dm-log-userspace-transfer.h" 16 + 17 + static uint32_t dm_ulog_seq; 18 + 19 + /* 20 + * Netlink/Connector is an unreliable protocol. How long should 21 + * we wait for a response before assuming it was lost and retrying? 22 + * (If we do receive a response after this time, it will be discarded 23 + * and the response to the resent request will be waited for. 24 + */ 25 + #define DM_ULOG_RETRY_TIMEOUT (15 * HZ) 26 + 27 + /* 28 + * Pre-allocated space for speed 29 + */ 30 + #define DM_ULOG_PREALLOCED_SIZE 512 31 + static struct cn_msg *prealloced_cn_msg; 32 + static struct dm_ulog_request *prealloced_ulog_tfr; 33 + 34 + static struct cb_id ulog_cn_id = { 35 + .idx = CN_IDX_DM, 36 + .val = CN_VAL_DM_USERSPACE_LOG 37 + }; 38 + 39 + static DEFINE_MUTEX(dm_ulog_lock); 40 + 41 + struct receiving_pkg { 42 + struct list_head list; 43 + struct completion complete; 44 + 45 + uint32_t seq; 46 + 47 + int error; 48 + size_t *data_size; 49 + char *data; 50 + }; 51 + 52 + static DEFINE_SPINLOCK(receiving_list_lock); 53 + static struct list_head receiving_list; 54 + 55 + static int dm_ulog_sendto_server(struct dm_ulog_request *tfr) 56 + { 57 + int r; 58 + struct cn_msg *msg = prealloced_cn_msg; 59 + 60 + memset(msg, 0, sizeof(struct cn_msg)); 61 + 62 + msg->id.idx = ulog_cn_id.idx; 63 + msg->id.val = ulog_cn_id.val; 64 + msg->ack = 0; 65 + msg->seq = tfr->seq; 66 + msg->len = sizeof(struct dm_ulog_request) + tfr->data_size; 67 + 68 + r = cn_netlink_send(msg, 0, gfp_any()); 69 + 70 + return r; 71 + } 72 + 73 + /* 74 + * Parameters for this function can be either msg or tfr, but not 75 + * both. This function fills in the reply for a waiting request. 76 + * If just msg is given, then the reply is simply an ACK from userspace 77 + * that the request was received. 78 + * 79 + * Returns: 0 on success, -ENOENT on failure 80 + */ 81 + static int fill_pkg(struct cn_msg *msg, struct dm_ulog_request *tfr) 82 + { 83 + uint32_t rtn_seq = (msg) ? msg->seq : (tfr) ? tfr->seq : 0; 84 + struct receiving_pkg *pkg; 85 + 86 + /* 87 + * The 'receiving_pkg' entries in this list are statically 88 + * allocated on the stack in 'dm_consult_userspace'. 89 + * Each process that is waiting for a reply from the user 90 + * space server will have an entry in this list. 91 + * 92 + * We are safe to do it this way because the stack space 93 + * is unique to each process, but still addressable by 94 + * other processes. 95 + */ 96 + list_for_each_entry(pkg, &receiving_list, list) { 97 + if (rtn_seq != pkg->seq) 98 + continue; 99 + 100 + if (msg) { 101 + pkg->error = -msg->ack; 102 + /* 103 + * If we are trying again, we will need to know our 104 + * storage capacity. Otherwise, along with the 105 + * error code, we make explicit that we have no data. 
106 + */ 107 + if (pkg->error != -EAGAIN) 108 + *(pkg->data_size) = 0; 109 + } else if (tfr->data_size > *(pkg->data_size)) { 110 + DMERR("Insufficient space to receive package [%u] " 111 + "(%u vs %lu)", tfr->request_type, 112 + tfr->data_size, *(pkg->data_size)); 113 + 114 + *(pkg->data_size) = 0; 115 + pkg->error = -ENOSPC; 116 + } else { 117 + pkg->error = tfr->error; 118 + memcpy(pkg->data, tfr->data, tfr->data_size); 119 + *(pkg->data_size) = tfr->data_size; 120 + } 121 + complete(&pkg->complete); 122 + return 0; 123 + } 124 + 125 + return -ENOENT; 126 + } 127 + 128 + /* 129 + * This is the connector callback that delivers data 130 + * that was sent from userspace. 131 + */ 132 + static void cn_ulog_callback(void *data) 133 + { 134 + struct cn_msg *msg = (struct cn_msg *)data; 135 + struct dm_ulog_request *tfr = (struct dm_ulog_request *)(msg + 1); 136 + 137 + spin_lock(&receiving_list_lock); 138 + if (msg->len == 0) 139 + fill_pkg(msg, NULL); 140 + else if (msg->len < sizeof(*tfr)) 141 + DMERR("Incomplete message received (expected %u, got %u): [%u]", 142 + (unsigned)sizeof(*tfr), msg->len, msg->seq); 143 + else 144 + fill_pkg(NULL, tfr); 145 + spin_unlock(&receiving_list_lock); 146 + } 147 + 148 + /** 149 + * dm_consult_userspace 150 + * @uuid: log's uuid (must be DM_UUID_LEN in size) 151 + * @request_type: found in include/linux/dm-log-userspace.h 152 + * @data: data to tx to the server 153 + * @data_size: size of data in bytes 154 + * @rdata: place to put return data from server 155 + * @rdata_size: value-result (amount of space given/amount of space used) 156 + * 157 + * rdata_size is undefined on failure. 158 + * 159 + * Memory used to communicate with userspace is zero'ed 160 + * before populating to ensure that no unwanted bits leak 161 + * from kernel space to user-space. All userspace log communications 162 + * between kernel and user space go through this function. 163 + * 164 + * Returns: 0 on success, -EXXX on failure 165 + **/ 166 + int dm_consult_userspace(const char *uuid, int request_type, 167 + char *data, size_t data_size, 168 + char *rdata, size_t *rdata_size) 169 + { 170 + int r = 0; 171 + size_t dummy = 0; 172 + int overhead_size = 173 + sizeof(struct dm_ulog_request *) + sizeof(struct cn_msg); 174 + struct dm_ulog_request *tfr = prealloced_ulog_tfr; 175 + struct receiving_pkg pkg; 176 + 177 + if (data_size > (DM_ULOG_PREALLOCED_SIZE - overhead_size)) { 178 + DMINFO("Size of tfr exceeds preallocated size"); 179 + return -EINVAL; 180 + } 181 + 182 + if (!rdata_size) 183 + rdata_size = &dummy; 184 + resend: 185 + /* 186 + * We serialize the sending of requests so we can 187 + * use the preallocated space. 188 + */ 189 + mutex_lock(&dm_ulog_lock); 190 + 191 + memset(tfr, 0, DM_ULOG_PREALLOCED_SIZE - overhead_size); 192 + memcpy(tfr->uuid, uuid, DM_UUID_LEN); 193 + tfr->seq = dm_ulog_seq++; 194 + 195 + /* 196 + * Must be valid request type (all other bits set to 197 + * zero). This reserves other bits for possible future 198 + * use. 
199 + */ 200 + tfr->request_type = request_type & DM_ULOG_REQUEST_MASK; 201 + 202 + tfr->data_size = data_size; 203 + if (data && data_size) 204 + memcpy(tfr->data, data, data_size); 205 + 206 + memset(&pkg, 0, sizeof(pkg)); 207 + init_completion(&pkg.complete); 208 + pkg.seq = tfr->seq; 209 + pkg.data_size = rdata_size; 210 + pkg.data = rdata; 211 + spin_lock(&receiving_list_lock); 212 + list_add(&(pkg.list), &receiving_list); 213 + spin_unlock(&receiving_list_lock); 214 + 215 + r = dm_ulog_sendto_server(tfr); 216 + 217 + mutex_unlock(&dm_ulog_lock); 218 + 219 + if (r) { 220 + DMERR("Unable to send log request [%u] to userspace: %d", 221 + request_type, r); 222 + spin_lock(&receiving_list_lock); 223 + list_del_init(&(pkg.list)); 224 + spin_unlock(&receiving_list_lock); 225 + 226 + goto out; 227 + } 228 + 229 + r = wait_for_completion_timeout(&(pkg.complete), DM_ULOG_RETRY_TIMEOUT); 230 + spin_lock(&receiving_list_lock); 231 + list_del_init(&(pkg.list)); 232 + spin_unlock(&receiving_list_lock); 233 + if (!r) { 234 + DMWARN("[%s] Request timed out: [%u/%u] - retrying", 235 + (strlen(uuid) > 8) ? 236 + (uuid + (strlen(uuid) - 8)) : (uuid), 237 + request_type, pkg.seq); 238 + goto resend; 239 + } 240 + 241 + r = pkg.error; 242 + if (r == -EAGAIN) 243 + goto resend; 244 + 245 + out: 246 + return r; 247 + } 248 + 249 + int dm_ulog_tfr_init(void) 250 + { 251 + int r; 252 + void *prealloced; 253 + 254 + INIT_LIST_HEAD(&receiving_list); 255 + 256 + prealloced = kmalloc(DM_ULOG_PREALLOCED_SIZE, GFP_KERNEL); 257 + if (!prealloced) 258 + return -ENOMEM; 259 + 260 + prealloced_cn_msg = prealloced; 261 + prealloced_ulog_tfr = prealloced + sizeof(struct cn_msg); 262 + 263 + r = cn_add_callback(&ulog_cn_id, "dmlogusr", cn_ulog_callback); 264 + if (r) { 265 + cn_del_callback(&ulog_cn_id); 266 + return r; 267 + } 268 + 269 + return 0; 270 + } 271 + 272 + void dm_ulog_tfr_exit(void) 273 + { 274 + cn_del_callback(&ulog_cn_id); 275 + kfree(prealloced_cn_msg); 276 + }
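The transfer code above is the kernel half of the protocol: each request is framed as a struct cn_msg followed by a struct dm_ulog_request and sent over connector, and replies are matched back to waiters by sequence number. For orientation only, here is a minimal user-space listener sketch. It is not part of this merge and not a real log daemon; it assumes the exported headers linux/connector.h and linux/dm-log-userspace.h are installed, and that subscribing to netlink group CN_IDX_DM via NETLINK_ADD_MEMBERSHIP is sufficient to receive the kernel's messages. It only prints what it receives; a real daemon would answer each request.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/connector.h>
#include <linux/dm-log-userspace.h>

#ifndef SOL_NETLINK
#define SOL_NETLINK 270
#endif

int main(void)
{
	struct sockaddr_nl addr;
	int group = CN_IDX_DM;	/* connector index registered by the kernel side */
	char buf[4096];
	int s;

	s = socket(PF_NETLINK, SOCK_DGRAM, NETLINK_CONNECTOR);
	if (s < 0) {
		perror("socket");
		return 1;
	}

	memset(&addr, 0, sizeof(addr));
	addr.nl_family = AF_NETLINK;
	addr.nl_pid = getpid();
	if (bind(s, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		perror("bind");
		return 1;
	}

	/* Assumption: the kernel multicasts on netlink group CN_IDX_DM. */
	if (setsockopt(s, SOL_NETLINK, NETLINK_ADD_MEMBERSHIP,
		       &group, sizeof(group)) < 0) {
		perror("setsockopt");
		return 1;
	}

	for (;;) {
		ssize_t len = recv(s, buf, sizeof(buf), 0);
		struct nlmsghdr *nlh = (struct nlmsghdr *)buf;
		struct cn_msg *cn;
		struct dm_ulog_request *tfr;

		if (len <= 0)
			break;

		/* Payload layout: nlmsghdr -> cn_msg -> dm_ulog_request. */
		cn = NLMSG_DATA(nlh);
		tfr = (struct dm_ulog_request *)cn->data;

		printf("request %u seq %u uuid %.8s (%u data bytes)\n",
		       (unsigned)tfr->request_type, (unsigned)tfr->seq,
		       tfr->uuid, (unsigned)tfr->data_size);

		/*
		 * A real daemon (such as the clustered_core/clustered_disk
		 * implementations mentioned in dm-log.txt) would service the
		 * request and send back a cn_msg whose payload is a
		 * dm_ulog_request carrying the same seq, the result in
		 * ->error and any return data in ->data/->data_size.
		 */
	}

	close(s);
	return 0;
}
```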
+18
drivers/md/dm-log-userspace-transfer.h
··· 1 + /* 2 + * Copyright (C) 2006-2009 Red Hat, Inc. 3 + * 4 + * This file is released under the LGPL. 5 + */ 6 + 7 + #ifndef __DM_LOG_USERSPACE_TRANSFER_H__ 8 + #define __DM_LOG_USERSPACE_TRANSFER_H__ 9 + 10 + #define DM_MSG_PREFIX "dm-log-userspace" 11 + 12 + int dm_ulog_tfr_init(void); 13 + void dm_ulog_tfr_exit(void); 14 + int dm_consult_userspace(const char *uuid, int request_type, 15 + char *data, size_t data_size, 16 + char *rdata, size_t *rdata_size); 17 + 18 + #endif /* __DM_LOG_USERSPACE_TRANSFER_H__ */
+5 -4
drivers/md/dm-log.c
··· 412 412 /* 413 413 * Buffer holds both header and bitset. 414 414 */ 415 - buf_size = dm_round_up((LOG_OFFSET << SECTOR_SHIFT) + 416 - bitset_size, 417 - ti->limits.logical_block_size); 415 + buf_size = 416 + dm_round_up((LOG_OFFSET << SECTOR_SHIFT) + bitset_size, 417 + bdev_logical_block_size(lc->header_location. 418 + bdev)); 418 419 419 - if (buf_size > dev->bdev->bd_inode->i_size) { 420 + if (buf_size > i_size_read(dev->bdev->bd_inode)) { 420 421 DMWARN("log device %s too small: need %llu bytes", 421 422 dev->name, (unsigned long long)buf_size); 422 423 kfree(lc);
+220 -114
drivers/md/dm-mpath.c
··· 8 8 #include <linux/device-mapper.h> 9 9 10 10 #include "dm-path-selector.h" 11 - #include "dm-bio-record.h" 12 11 #include "dm-uevent.h" 13 12 14 13 #include <linux/ctype.h> ··· 34 35 35 36 struct dm_path path; 36 37 struct work_struct deactivate_path; 38 + struct work_struct activate_path; 37 39 }; 38 40 39 41 #define path_to_pgpath(__pgp) container_of((__pgp), struct pgpath, path) ··· 64 64 spinlock_t lock; 65 65 66 66 const char *hw_handler_name; 67 - struct work_struct activate_path; 68 - struct pgpath *pgpath_to_activate; 69 67 unsigned nr_priority_groups; 70 68 struct list_head priority_groups; 71 69 unsigned pg_init_required; /* pg_init needs calling? */ ··· 82 84 unsigned pg_init_count; /* Number of times pg_init called */ 83 85 84 86 struct work_struct process_queued_ios; 85 - struct bio_list queued_ios; 87 + struct list_head queued_ios; 86 88 unsigned queue_size; 87 89 88 90 struct work_struct trigger_event; ··· 99 101 */ 100 102 struct dm_mpath_io { 101 103 struct pgpath *pgpath; 102 - struct dm_bio_details details; 104 + size_t nr_bytes; 103 105 }; 104 106 105 107 typedef int (*action_fn) (struct pgpath *pgpath); ··· 126 128 if (pgpath) { 127 129 pgpath->is_active = 1; 128 130 INIT_WORK(&pgpath->deactivate_path, deactivate_path); 131 + INIT_WORK(&pgpath->activate_path, activate_path); 129 132 } 130 133 131 134 return pgpath; ··· 159 160 160 161 static void free_pgpaths(struct list_head *pgpaths, struct dm_target *ti) 161 162 { 162 - unsigned long flags; 163 163 struct pgpath *pgpath, *tmp; 164 164 struct multipath *m = ti->private; 165 165 ··· 167 169 if (m->hw_handler_name) 168 170 scsi_dh_detach(bdev_get_queue(pgpath->path.dev->bdev)); 169 171 dm_put_device(ti, pgpath->path.dev); 170 - spin_lock_irqsave(&m->lock, flags); 171 - if (m->pgpath_to_activate == pgpath) 172 - m->pgpath_to_activate = NULL; 173 - spin_unlock_irqrestore(&m->lock, flags); 174 172 free_pgpath(pgpath); 175 173 } 176 174 } ··· 192 198 m = kzalloc(sizeof(*m), GFP_KERNEL); 193 199 if (m) { 194 200 INIT_LIST_HEAD(&m->priority_groups); 201 + INIT_LIST_HEAD(&m->queued_ios); 195 202 spin_lock_init(&m->lock); 196 203 m->queue_io = 1; 197 204 INIT_WORK(&m->process_queued_ios, process_queued_ios); 198 205 INIT_WORK(&m->trigger_event, trigger_event); 199 - INIT_WORK(&m->activate_path, activate_path); 200 206 m->mpio_pool = mempool_create_slab_pool(MIN_IOS, _mpio_cache); 201 207 if (!m->mpio_pool) { 202 208 kfree(m); ··· 244 250 m->pg_init_count = 0; 245 251 } 246 252 247 - static int __choose_path_in_pg(struct multipath *m, struct priority_group *pg) 253 + static int __choose_path_in_pg(struct multipath *m, struct priority_group *pg, 254 + size_t nr_bytes) 248 255 { 249 256 struct dm_path *path; 250 257 251 - path = pg->ps.type->select_path(&pg->ps, &m->repeat_count); 258 + path = pg->ps.type->select_path(&pg->ps, &m->repeat_count, nr_bytes); 252 259 if (!path) 253 260 return -ENXIO; 254 261 ··· 261 266 return 0; 262 267 } 263 268 264 - static void __choose_pgpath(struct multipath *m) 269 + static void __choose_pgpath(struct multipath *m, size_t nr_bytes) 265 270 { 266 271 struct priority_group *pg; 267 272 unsigned bypassed = 1; ··· 273 278 if (m->next_pg) { 274 279 pg = m->next_pg; 275 280 m->next_pg = NULL; 276 - if (!__choose_path_in_pg(m, pg)) 281 + if (!__choose_path_in_pg(m, pg, nr_bytes)) 277 282 return; 278 283 } 279 284 280 285 /* Don't change PG until it has no remaining paths */ 281 - if (m->current_pg && !__choose_path_in_pg(m, m->current_pg)) 286 + if (m->current_pg && !__choose_path_in_pg(m, 
m->current_pg, nr_bytes)) 282 287 return; 283 288 284 289 /* ··· 290 295 list_for_each_entry(pg, &m->priority_groups, list) { 291 296 if (pg->bypassed == bypassed) 292 297 continue; 293 - if (!__choose_path_in_pg(m, pg)) 298 + if (!__choose_path_in_pg(m, pg, nr_bytes)) 294 299 return; 295 300 } 296 301 } while (bypassed--); ··· 317 322 dm_noflush_suspending(m->ti)); 318 323 } 319 324 320 - static int map_io(struct multipath *m, struct bio *bio, 325 + static int map_io(struct multipath *m, struct request *clone, 321 326 struct dm_mpath_io *mpio, unsigned was_queued) 322 327 { 323 328 int r = DM_MAPIO_REMAPPED; 329 + size_t nr_bytes = blk_rq_bytes(clone); 324 330 unsigned long flags; 325 331 struct pgpath *pgpath; 332 + struct block_device *bdev; 326 333 327 334 spin_lock_irqsave(&m->lock, flags); 328 335 329 336 /* Do we need to select a new pgpath? */ 330 337 if (!m->current_pgpath || 331 338 (!m->queue_io && (m->repeat_count && --m->repeat_count == 0))) 332 - __choose_pgpath(m); 339 + __choose_pgpath(m, nr_bytes); 333 340 334 341 pgpath = m->current_pgpath; 335 342 ··· 341 344 if ((pgpath && m->queue_io) || 342 345 (!pgpath && m->queue_if_no_path)) { 343 346 /* Queue for the daemon to resubmit */ 344 - bio_list_add(&m->queued_ios, bio); 347 + list_add_tail(&clone->queuelist, &m->queued_ios); 345 348 m->queue_size++; 346 349 if ((m->pg_init_required && !m->pg_init_in_progress) || 347 350 !m->queue_io) 348 351 queue_work(kmultipathd, &m->process_queued_ios); 349 352 pgpath = NULL; 350 353 r = DM_MAPIO_SUBMITTED; 351 - } else if (pgpath) 352 - bio->bi_bdev = pgpath->path.dev->bdev; 353 - else if (__must_push_back(m)) 354 + } else if (pgpath) { 355 + bdev = pgpath->path.dev->bdev; 356 + clone->q = bdev_get_queue(bdev); 357 + clone->rq_disk = bdev->bd_disk; 358 + } else if (__must_push_back(m)) 354 359 r = DM_MAPIO_REQUEUE; 355 360 else 356 361 r = -EIO; /* Failed */ 357 362 358 363 mpio->pgpath = pgpath; 364 + mpio->nr_bytes = nr_bytes; 365 + 366 + if (r == DM_MAPIO_REMAPPED && pgpath->pg->ps.type->start_io) 367 + pgpath->pg->ps.type->start_io(&pgpath->pg->ps, &pgpath->path, 368 + nr_bytes); 359 369 360 370 spin_unlock_irqrestore(&m->lock, flags); 361 371 ··· 400 396 { 401 397 int r; 402 398 unsigned long flags; 403 - struct bio *bio = NULL, *next; 404 399 struct dm_mpath_io *mpio; 405 400 union map_info *info; 401 + struct request *clone, *n; 402 + LIST_HEAD(cl); 406 403 407 404 spin_lock_irqsave(&m->lock, flags); 408 - bio = bio_list_get(&m->queued_ios); 405 + list_splice_init(&m->queued_ios, &cl); 409 406 spin_unlock_irqrestore(&m->lock, flags); 410 407 411 - while (bio) { 412 - next = bio->bi_next; 413 - bio->bi_next = NULL; 408 + list_for_each_entry_safe(clone, n, &cl, queuelist) { 409 + list_del_init(&clone->queuelist); 414 410 415 - info = dm_get_mapinfo(bio); 411 + info = dm_get_rq_mapinfo(clone); 416 412 mpio = info->ptr; 417 413 418 - r = map_io(m, bio, mpio, 1); 419 - if (r < 0) 420 - bio_endio(bio, r); 421 - else if (r == DM_MAPIO_REMAPPED) 422 - generic_make_request(bio); 423 - else if (r == DM_MAPIO_REQUEUE) 424 - bio_endio(bio, -EIO); 425 - 426 - bio = next; 414 + r = map_io(m, clone, mpio, 1); 415 + if (r < 0) { 416 + mempool_free(mpio, m->mpio_pool); 417 + dm_kill_unmapped_request(clone, r); 418 + } else if (r == DM_MAPIO_REMAPPED) 419 + dm_dispatch_request(clone); 420 + else if (r == DM_MAPIO_REQUEUE) { 421 + mempool_free(mpio, m->mpio_pool); 422 + dm_requeue_unmapped_request(clone); 423 + } 427 424 } 428 425 } 429 426 ··· 432 427 { 433 428 struct multipath *m = 434 429 
container_of(work, struct multipath, process_queued_ios); 435 - struct pgpath *pgpath = NULL; 436 - unsigned init_required = 0, must_queue = 1; 430 + struct pgpath *pgpath = NULL, *tmp; 431 + unsigned must_queue = 1; 437 432 unsigned long flags; 438 433 439 434 spin_lock_irqsave(&m->lock, flags); ··· 442 437 goto out; 443 438 444 439 if (!m->current_pgpath) 445 - __choose_pgpath(m); 440 + __choose_pgpath(m, 0); 446 441 447 442 pgpath = m->current_pgpath; 448 443 ··· 451 446 must_queue = 0; 452 447 453 448 if (m->pg_init_required && !m->pg_init_in_progress && pgpath) { 454 - m->pgpath_to_activate = pgpath; 455 449 m->pg_init_count++; 456 450 m->pg_init_required = 0; 457 - m->pg_init_in_progress = 1; 458 - init_required = 1; 451 + list_for_each_entry(tmp, &pgpath->pg->pgpaths, list) { 452 + if (queue_work(kmpath_handlerd, &tmp->activate_path)) 453 + m->pg_init_in_progress++; 454 + } 459 455 } 460 - 461 456 out: 462 457 spin_unlock_irqrestore(&m->lock, flags); 463 - 464 - if (init_required) 465 - queue_work(kmpath_handlerd, &m->activate_path); 466 - 467 458 if (!must_queue) 468 459 dispatch_queued_ios(m); 469 460 } ··· 554 553 return -EINVAL; 555 554 } 556 555 556 + if (ps_argc > as->argc) { 557 + dm_put_path_selector(pst); 558 + ti->error = "not enough arguments for path selector"; 559 + return -EINVAL; 560 + } 561 + 557 562 r = pst->create(&pg->ps, ps_argc, as->argv); 558 563 if (r) { 559 564 dm_put_path_selector(pst); ··· 598 591 } 599 592 600 593 if (m->hw_handler_name) { 601 - r = scsi_dh_attach(bdev_get_queue(p->path.dev->bdev), 602 - m->hw_handler_name); 594 + struct request_queue *q = bdev_get_queue(p->path.dev->bdev); 595 + 596 + r = scsi_dh_attach(q, m->hw_handler_name); 597 + if (r == -EBUSY) { 598 + /* 599 + * Already attached to different hw_handler, 600 + * try to reattach with correct one. 
601 + */ 602 + scsi_dh_detach(q); 603 + r = scsi_dh_attach(q, m->hw_handler_name); 604 + } 605 + 603 606 if (r < 0) { 607 + ti->error = "error attaching hardware handler"; 604 608 dm_put_device(ti, p->path.dev); 605 609 goto bad; 606 610 } ··· 716 698 717 699 if (!hw_argc) 718 700 return 0; 701 + 702 + if (hw_argc > as->argc) { 703 + ti->error = "not enough arguments for hardware handler"; 704 + return -EINVAL; 705 + } 719 706 720 707 m->hw_handler_name = kstrdup(shift(as), GFP_KERNEL); 721 708 request_module("scsi_dh_%s", m->hw_handler_name); ··· 846 823 goto bad; 847 824 } 848 825 826 + ti->num_flush_requests = 1; 827 + 849 828 return 0; 850 829 851 830 bad: ··· 861 836 862 837 flush_workqueue(kmpath_handlerd); 863 838 flush_workqueue(kmultipathd); 839 + flush_scheduled_work(); 864 840 free_multipath(m); 865 841 } 866 842 867 843 /* 868 - * Map bios, recording original fields for later in case we have to resubmit 844 + * Map cloned requests 869 845 */ 870 - static int multipath_map(struct dm_target *ti, struct bio *bio, 846 + static int multipath_map(struct dm_target *ti, struct request *clone, 871 847 union map_info *map_context) 872 848 { 873 849 int r; 874 850 struct dm_mpath_io *mpio; 875 851 struct multipath *m = (struct multipath *) ti->private; 876 852 877 - mpio = mempool_alloc(m->mpio_pool, GFP_NOIO); 878 - dm_bio_record(&mpio->details, bio); 853 + mpio = mempool_alloc(m->mpio_pool, GFP_ATOMIC); 854 + if (!mpio) 855 + /* ENOMEM, requeue */ 856 + return DM_MAPIO_REQUEUE; 857 + memset(mpio, 0, sizeof(*mpio)); 879 858 880 859 map_context->ptr = mpio; 881 - bio->bi_rw |= (1 << BIO_RW_FAILFAST_TRANSPORT); 882 - r = map_io(m, bio, mpio, 0); 860 + clone->cmd_flags |= REQ_FAILFAST_TRANSPORT; 861 + r = map_io(m, clone, mpio, 0); 883 862 if (r < 0 || r == DM_MAPIO_REQUEUE) 884 863 mempool_free(mpio, m->mpio_pool); 885 864 ··· 953 924 954 925 pgpath->is_active = 1; 955 926 956 - m->current_pgpath = NULL; 957 - if (!m->nr_valid_paths++ && m->queue_size) 927 + if (!m->nr_valid_paths++ && m->queue_size) { 928 + m->current_pgpath = NULL; 958 929 queue_work(kmultipathd, &m->process_queued_ios); 930 + } else if (m->hw_handler_name && (m->current_pg == pgpath->pg)) { 931 + if (queue_work(kmpath_handlerd, &pgpath->activate_path)) 932 + m->pg_init_in_progress++; 933 + } 959 934 960 935 dm_path_uevent(DM_UEVENT_PATH_REINSTATED, m->ti, 961 936 pgpath->path.dev->name, m->nr_valid_paths); ··· 1135 1102 1136 1103 spin_lock_irqsave(&m->lock, flags); 1137 1104 if (errors) { 1138 - DMERR("Could not failover device. Error %d.", errors); 1139 - m->current_pgpath = NULL; 1140 - m->current_pg = NULL; 1105 + if (pgpath == m->current_pgpath) { 1106 + DMERR("Could not failover device. 
Error %d.", errors); 1107 + m->current_pgpath = NULL; 1108 + m->current_pg = NULL; 1109 + } 1141 1110 } else if (!m->pg_init_required) { 1142 1111 m->queue_io = 0; 1143 1112 pg->bypassed = 0; 1144 1113 } 1145 1114 1146 - m->pg_init_in_progress = 0; 1147 - queue_work(kmultipathd, &m->process_queued_ios); 1115 + m->pg_init_in_progress--; 1116 + if (!m->pg_init_in_progress) 1117 + queue_work(kmultipathd, &m->process_queued_ios); 1148 1118 spin_unlock_irqrestore(&m->lock, flags); 1149 1119 } 1150 1120 1151 1121 static void activate_path(struct work_struct *work) 1152 1122 { 1153 1123 int ret; 1154 - struct multipath *m = 1155 - container_of(work, struct multipath, activate_path); 1156 - struct dm_path *path; 1157 - unsigned long flags; 1124 + struct pgpath *pgpath = 1125 + container_of(work, struct pgpath, activate_path); 1158 1126 1159 - spin_lock_irqsave(&m->lock, flags); 1160 - path = &m->pgpath_to_activate->path; 1161 - m->pgpath_to_activate = NULL; 1162 - spin_unlock_irqrestore(&m->lock, flags); 1163 - if (!path) 1164 - return; 1165 - ret = scsi_dh_activate(bdev_get_queue(path->dev->bdev)); 1166 - pg_init_done(path, ret); 1127 + ret = scsi_dh_activate(bdev_get_queue(pgpath->path.dev->bdev)); 1128 + pg_init_done(&pgpath->path, ret); 1167 1129 } 1168 1130 1169 1131 /* 1170 1132 * end_io handling 1171 1133 */ 1172 - static int do_end_io(struct multipath *m, struct bio *bio, 1134 + static int do_end_io(struct multipath *m, struct request *clone, 1173 1135 int error, struct dm_mpath_io *mpio) 1174 1136 { 1137 + /* 1138 + * We don't queue any clone request inside the multipath target 1139 + * during end I/O handling, since those clone requests don't have 1140 + * bio clones. If we queue them inside the multipath target, 1141 + * we need to make bio clones, that requires memory allocation. 1142 + * (See drivers/md/dm.c:end_clone_bio() about why the clone requests 1143 + * don't have bio clones.) 1144 + * Instead of queueing the clone request here, we queue the original 1145 + * request into dm core, which will remake a clone request and 1146 + * clone bios for it and resubmit it later. 
1147 + */ 1148 + int r = DM_ENDIO_REQUEUE; 1175 1149 unsigned long flags; 1176 1150 1177 - if (!error) 1151 + if (!error && !clone->errors) 1178 1152 return 0; /* I/O complete */ 1179 - 1180 - if ((error == -EWOULDBLOCK) && bio_rw_ahead(bio)) 1181 - return error; 1182 1153 1183 1154 if (error == -EOPNOTSUPP) 1184 1155 return error; 1185 1156 1186 - spin_lock_irqsave(&m->lock, flags); 1187 - if (!m->nr_valid_paths) { 1188 - if (__must_push_back(m)) { 1189 - spin_unlock_irqrestore(&m->lock, flags); 1190 - return DM_ENDIO_REQUEUE; 1191 - } else if (!m->queue_if_no_path) { 1192 - spin_unlock_irqrestore(&m->lock, flags); 1193 - return -EIO; 1194 - } else { 1195 - spin_unlock_irqrestore(&m->lock, flags); 1196 - goto requeue; 1197 - } 1198 - } 1199 - spin_unlock_irqrestore(&m->lock, flags); 1200 - 1201 1157 if (mpio->pgpath) 1202 1158 fail_path(mpio->pgpath); 1203 1159 1204 - requeue: 1205 - dm_bio_restore(&mpio->details, bio); 1206 - 1207 - /* queue for the daemon to resubmit or fail */ 1208 1160 spin_lock_irqsave(&m->lock, flags); 1209 - bio_list_add(&m->queued_ios, bio); 1210 - m->queue_size++; 1211 - if (!m->queue_io) 1212 - queue_work(kmultipathd, &m->process_queued_ios); 1161 + if (!m->nr_valid_paths && !m->queue_if_no_path && !__must_push_back(m)) 1162 + r = -EIO; 1213 1163 spin_unlock_irqrestore(&m->lock, flags); 1214 1164 1215 - return DM_ENDIO_INCOMPLETE; /* io not complete */ 1165 + return r; 1216 1166 } 1217 1167 1218 - static int multipath_end_io(struct dm_target *ti, struct bio *bio, 1168 + static int multipath_end_io(struct dm_target *ti, struct request *clone, 1219 1169 int error, union map_info *map_context) 1220 1170 { 1221 1171 struct multipath *m = ti->private; ··· 1207 1191 struct path_selector *ps; 1208 1192 int r; 1209 1193 1210 - r = do_end_io(m, bio, error, mpio); 1194 + r = do_end_io(m, clone, error, mpio); 1211 1195 if (pgpath) { 1212 1196 ps = &pgpath->pg->ps; 1213 1197 if (ps->type->end_io) 1214 - ps->type->end_io(ps, &pgpath->path); 1198 + ps->type->end_io(ps, &pgpath->path, mpio->nr_bytes); 1215 1199 } 1216 - if (r != DM_ENDIO_INCOMPLETE) 1217 - mempool_free(mpio, m->mpio_pool); 1200 + mempool_free(mpio, m->mpio_pool); 1218 1201 1219 1202 return r; 1220 1203 } ··· 1426 1411 spin_lock_irqsave(&m->lock, flags); 1427 1412 1428 1413 if (!m->current_pgpath) 1429 - __choose_pgpath(m); 1414 + __choose_pgpath(m, 0); 1430 1415 1431 1416 if (m->current_pgpath) { 1432 1417 bdev = m->current_pgpath->path.dev->bdev; ··· 1443 1428 return r ? : __blkdev_driver_ioctl(bdev, mode, cmd, arg); 1444 1429 } 1445 1430 1431 + static int multipath_iterate_devices(struct dm_target *ti, 1432 + iterate_devices_callout_fn fn, void *data) 1433 + { 1434 + struct multipath *m = ti->private; 1435 + struct priority_group *pg; 1436 + struct pgpath *p; 1437 + int ret = 0; 1438 + 1439 + list_for_each_entry(pg, &m->priority_groups, list) { 1440 + list_for_each_entry(p, &pg->pgpaths, list) { 1441 + ret = fn(ti, p->path.dev, ti->begin, data); 1442 + if (ret) 1443 + goto out; 1444 + } 1445 + } 1446 + 1447 + out: 1448 + return ret; 1449 + } 1450 + 1451 + static int __pgpath_busy(struct pgpath *pgpath) 1452 + { 1453 + struct request_queue *q = bdev_get_queue(pgpath->path.dev->bdev); 1454 + 1455 + return dm_underlying_device_busy(q); 1456 + } 1457 + 1458 + /* 1459 + * We return "busy", only when we can map I/Os but underlying devices 1460 + * are busy (so even if we map I/Os now, the I/Os will wait on 1461 + * the underlying queue). 
1462 + * In other words, if we want to kill I/Os or queue them inside us 1463 + * due to map unavailability, we don't return "busy". Otherwise, 1464 + * dm core won't give us the I/Os and we can't do what we want. 1465 + */ 1466 + static int multipath_busy(struct dm_target *ti) 1467 + { 1468 + int busy = 0, has_active = 0; 1469 + struct multipath *m = ti->private; 1470 + struct priority_group *pg; 1471 + struct pgpath *pgpath; 1472 + unsigned long flags; 1473 + 1474 + spin_lock_irqsave(&m->lock, flags); 1475 + 1476 + /* Guess which priority_group will be used at next mapping time */ 1477 + if (unlikely(!m->current_pgpath && m->next_pg)) 1478 + pg = m->next_pg; 1479 + else if (likely(m->current_pg)) 1480 + pg = m->current_pg; 1481 + else 1482 + /* 1483 + * We don't know which pg will be used at next mapping time. 1484 + * We don't call __choose_pgpath() here to avoid to trigger 1485 + * pg_init just by busy checking. 1486 + * So we don't know whether underlying devices we will be using 1487 + * at next mapping time are busy or not. Just try mapping. 1488 + */ 1489 + goto out; 1490 + 1491 + /* 1492 + * If there is one non-busy active path at least, the path selector 1493 + * will be able to select it. So we consider such a pg as not busy. 1494 + */ 1495 + busy = 1; 1496 + list_for_each_entry(pgpath, &pg->pgpaths, list) 1497 + if (pgpath->is_active) { 1498 + has_active = 1; 1499 + 1500 + if (!__pgpath_busy(pgpath)) { 1501 + busy = 0; 1502 + break; 1503 + } 1504 + } 1505 + 1506 + if (!has_active) 1507 + /* 1508 + * No active path in this pg, so this pg won't be used and 1509 + * the current_pg will be changed at next mapping time. 1510 + * We need to try mapping to determine it. 1511 + */ 1512 + busy = 0; 1513 + 1514 + out: 1515 + spin_unlock_irqrestore(&m->lock, flags); 1516 + 1517 + return busy; 1518 + } 1519 + 1446 1520 /*----------------------------------------------------------------- 1447 1521 * Module setup 1448 1522 *---------------------------------------------------------------*/ 1449 1523 static struct target_type multipath_target = { 1450 1524 .name = "multipath", 1451 - .version = {1, 0, 5}, 1525 + .version = {1, 1, 0}, 1452 1526 .module = THIS_MODULE, 1453 1527 .ctr = multipath_ctr, 1454 1528 .dtr = multipath_dtr, 1455 - .map = multipath_map, 1456 - .end_io = multipath_end_io, 1529 + .map_rq = multipath_map, 1530 + .rq_end_io = multipath_end_io, 1457 1531 .presuspend = multipath_presuspend, 1458 1532 .resume = multipath_resume, 1459 1533 .status = multipath_status, 1460 1534 .message = multipath_message, 1461 1535 .ioctl = multipath_ioctl, 1536 + .iterate_devices = multipath_iterate_devices, 1537 + .busy = multipath_busy, 1462 1538 }; 1463 1539 1464 1540 static int __init dm_multipath_init(void)
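With this conversion the multipath target registers request-based hooks (.map_rq/.rq_end_io) and talks to dm core purely through the DM_MAPIO_*/DM_ENDIO_* return values used above. A minimal sketch of that contract for a hypothetical single-device pass-through target (the target name, the .private layout and the omitted .ctr/.dtr are assumptions; only the hook signatures and return codes come from this patch):

#include <linux/module.h>
#include <linux/blkdev.h>
#include <linux/device-mapper.h>

/* Sketch only: route every clone to one underlying device. */
static int passthrough_map_rq(struct dm_target *ti, struct request *clone,
			      union map_info *map_context)
{
	struct dm_dev *dev = ti->private;	/* set up by a .ctr, not shown */

	clone->q = bdev_get_queue(dev->bdev);
	clone->rq_disk = dev->bdev->bd_disk;

	return DM_MAPIO_REMAPPED;		/* dm core dispatches the clone */
}

static int passthrough_rq_end_io(struct dm_target *ti, struct request *clone,
				 int error, union map_info *map_context)
{
	/* <= 0 completes the original request; DM_ENDIO_REQUEUE resubmits it */
	return error;
}

static struct target_type passthrough_target = {
	.name      = "rq_passthrough",
	.version   = {0, 0, 1},
	.module    = THIS_MODULE,
	.map_rq    = passthrough_map_rq,
	.rq_end_io = passthrough_rq_end_io,
};
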
+6 -2
drivers/md/dm-path-selector.h
··· 56 56 * the path fails. 57 57 */ 58 58 struct dm_path *(*select_path) (struct path_selector *ps, 59 - unsigned *repeat_count); 59 + unsigned *repeat_count, 60 + size_t nr_bytes); 60 61 61 62 /* 62 63 * Notify the selector that a path has failed. ··· 76 75 int (*status) (struct path_selector *ps, struct dm_path *path, 77 76 status_type_t type, char *result, unsigned int maxlen); 78 77 79 - int (*end_io) (struct path_selector *ps, struct dm_path *path); 78 + int (*start_io) (struct path_selector *ps, struct dm_path *path, 79 + size_t nr_bytes); 80 + int (*end_io) (struct path_selector *ps, struct dm_path *path, 81 + size_t nr_bytes); 80 82 }; 81 83 82 84 /* Register a path selector */
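The selector interface now carries the size of each I/O: select_path() takes nr_bytes, and the new optional start_io()/end_io() hooks let a selector keep per-path in-flight statistics. A sketch of how a caller drives the widened hooks around one clone (the helper names are hypothetical; the call order mirrors map_io() and multipath_end_io() in dm-mpath.c above):

#include <linux/blkdev.h>
#include "dm-path-selector.h"

/* Sketch only: pick a path for a clone and account the bytes on it. */
static struct dm_path *start_one_io(struct path_selector *ps,
				    struct request *clone,
				    unsigned *repeat_count)
{
	size_t nr_bytes = blk_rq_bytes(clone);
	struct dm_path *path;

	path = ps->type->select_path(ps, repeat_count, nr_bytes);
	if (path && ps->type->start_io)
		ps->type->start_io(ps, path, nr_bytes);

	return path;
}

/* Sketch only: give the bytes back when the clone completes. */
static void end_one_io(struct path_selector *ps, struct dm_path *path,
		       size_t nr_bytes)
{
	if (ps->type->end_io)
		ps->type->end_io(ps, path, nr_bytes);
}
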
+263
drivers/md/dm-queue-length.c
··· 1 + /* 2 + * Copyright (C) 2004-2005 IBM Corp. All Rights Reserved. 3 + * Copyright (C) 2006-2009 NEC Corporation. 4 + * 5 + * dm-queue-length.c 6 + * 7 + * Module Author: Stefan Bader, IBM 8 + * Modified by: Kiyoshi Ueda, NEC 9 + * 10 + * This file is released under the GPL. 11 + * 12 + * queue-length path selector - choose a path with the least number of 13 + * in-flight I/Os. 14 + */ 15 + 16 + #include "dm.h" 17 + #include "dm-path-selector.h" 18 + 19 + #include <linux/slab.h> 20 + #include <linux/ctype.h> 21 + #include <linux/errno.h> 22 + #include <linux/module.h> 23 + #include <asm/atomic.h> 24 + 25 + #define DM_MSG_PREFIX "multipath queue-length" 26 + #define QL_MIN_IO 128 27 + #define QL_VERSION "0.1.0" 28 + 29 + struct selector { 30 + struct list_head valid_paths; 31 + struct list_head failed_paths; 32 + }; 33 + 34 + struct path_info { 35 + struct list_head list; 36 + struct dm_path *path; 37 + unsigned repeat_count; 38 + atomic_t qlen; /* the number of in-flight I/Os */ 39 + }; 40 + 41 + static struct selector *alloc_selector(void) 42 + { 43 + struct selector *s = kmalloc(sizeof(*s), GFP_KERNEL); 44 + 45 + if (s) { 46 + INIT_LIST_HEAD(&s->valid_paths); 47 + INIT_LIST_HEAD(&s->failed_paths); 48 + } 49 + 50 + return s; 51 + } 52 + 53 + static int ql_create(struct path_selector *ps, unsigned argc, char **argv) 54 + { 55 + struct selector *s = alloc_selector(); 56 + 57 + if (!s) 58 + return -ENOMEM; 59 + 60 + ps->context = s; 61 + return 0; 62 + } 63 + 64 + static void ql_free_paths(struct list_head *paths) 65 + { 66 + struct path_info *pi, *next; 67 + 68 + list_for_each_entry_safe(pi, next, paths, list) { 69 + list_del(&pi->list); 70 + kfree(pi); 71 + } 72 + } 73 + 74 + static void ql_destroy(struct path_selector *ps) 75 + { 76 + struct selector *s = ps->context; 77 + 78 + ql_free_paths(&s->valid_paths); 79 + ql_free_paths(&s->failed_paths); 80 + kfree(s); 81 + ps->context = NULL; 82 + } 83 + 84 + static int ql_status(struct path_selector *ps, struct dm_path *path, 85 + status_type_t type, char *result, unsigned maxlen) 86 + { 87 + unsigned sz = 0; 88 + struct path_info *pi; 89 + 90 + /* When called with NULL path, return selector status/args. */ 91 + if (!path) 92 + DMEMIT("0 "); 93 + else { 94 + pi = path->pscontext; 95 + 96 + switch (type) { 97 + case STATUSTYPE_INFO: 98 + DMEMIT("%d ", atomic_read(&pi->qlen)); 99 + break; 100 + case STATUSTYPE_TABLE: 101 + DMEMIT("%u ", pi->repeat_count); 102 + break; 103 + } 104 + } 105 + 106 + return sz; 107 + } 108 + 109 + static int ql_add_path(struct path_selector *ps, struct dm_path *path, 110 + int argc, char **argv, char **error) 111 + { 112 + struct selector *s = ps->context; 113 + struct path_info *pi; 114 + unsigned repeat_count = QL_MIN_IO; 115 + 116 + /* 117 + * Arguments: [<repeat_count>] 118 + * <repeat_count>: The number of I/Os before switching path. 119 + * If not given, default (QL_MIN_IO) is used. 
120 + */ 121 + if (argc > 1) { 122 + *error = "queue-length ps: incorrect number of arguments"; 123 + return -EINVAL; 124 + } 125 + 126 + if ((argc == 1) && (sscanf(argv[0], "%u", &repeat_count) != 1)) { 127 + *error = "queue-length ps: invalid repeat count"; 128 + return -EINVAL; 129 + } 130 + 131 + /* Allocate the path information structure */ 132 + pi = kmalloc(sizeof(*pi), GFP_KERNEL); 133 + if (!pi) { 134 + *error = "queue-length ps: Error allocating path information"; 135 + return -ENOMEM; 136 + } 137 + 138 + pi->path = path; 139 + pi->repeat_count = repeat_count; 140 + atomic_set(&pi->qlen, 0); 141 + 142 + path->pscontext = pi; 143 + 144 + list_add_tail(&pi->list, &s->valid_paths); 145 + 146 + return 0; 147 + } 148 + 149 + static void ql_fail_path(struct path_selector *ps, struct dm_path *path) 150 + { 151 + struct selector *s = ps->context; 152 + struct path_info *pi = path->pscontext; 153 + 154 + list_move(&pi->list, &s->failed_paths); 155 + } 156 + 157 + static int ql_reinstate_path(struct path_selector *ps, struct dm_path *path) 158 + { 159 + struct selector *s = ps->context; 160 + struct path_info *pi = path->pscontext; 161 + 162 + list_move_tail(&pi->list, &s->valid_paths); 163 + 164 + return 0; 165 + } 166 + 167 + /* 168 + * Select a path having the minimum number of in-flight I/Os 169 + */ 170 + static struct dm_path *ql_select_path(struct path_selector *ps, 171 + unsigned *repeat_count, size_t nr_bytes) 172 + { 173 + struct selector *s = ps->context; 174 + struct path_info *pi = NULL, *best = NULL; 175 + 176 + if (list_empty(&s->valid_paths)) 177 + return NULL; 178 + 179 + /* Change preferred (first in list) path to evenly balance. */ 180 + list_move_tail(s->valid_paths.next, &s->valid_paths); 181 + 182 + list_for_each_entry(pi, &s->valid_paths, list) { 183 + if (!best || 184 + (atomic_read(&pi->qlen) < atomic_read(&best->qlen))) 185 + best = pi; 186 + 187 + if (!atomic_read(&best->qlen)) 188 + break; 189 + } 190 + 191 + if (!best) 192 + return NULL; 193 + 194 + *repeat_count = best->repeat_count; 195 + 196 + return best->path; 197 + } 198 + 199 + static int ql_start_io(struct path_selector *ps, struct dm_path *path, 200 + size_t nr_bytes) 201 + { 202 + struct path_info *pi = path->pscontext; 203 + 204 + atomic_inc(&pi->qlen); 205 + 206 + return 0; 207 + } 208 + 209 + static int ql_end_io(struct path_selector *ps, struct dm_path *path, 210 + size_t nr_bytes) 211 + { 212 + struct path_info *pi = path->pscontext; 213 + 214 + atomic_dec(&pi->qlen); 215 + 216 + return 0; 217 + } 218 + 219 + static struct path_selector_type ql_ps = { 220 + .name = "queue-length", 221 + .module = THIS_MODULE, 222 + .table_args = 1, 223 + .info_args = 1, 224 + .create = ql_create, 225 + .destroy = ql_destroy, 226 + .status = ql_status, 227 + .add_path = ql_add_path, 228 + .fail_path = ql_fail_path, 229 + .reinstate_path = ql_reinstate_path, 230 + .select_path = ql_select_path, 231 + .start_io = ql_start_io, 232 + .end_io = ql_end_io, 233 + }; 234 + 235 + static int __init dm_ql_init(void) 236 + { 237 + int r = dm_register_path_selector(&ql_ps); 238 + 239 + if (r < 0) 240 + DMERR("register failed %d", r); 241 + 242 + DMINFO("version " QL_VERSION " loaded"); 243 + 244 + return r; 245 + } 246 + 247 + static void __exit dm_ql_exit(void) 248 + { 249 + int r = dm_unregister_path_selector(&ql_ps); 250 + 251 + if (r < 0) 252 + DMERR("unregister failed %d", r); 253 + } 254 + 255 + module_init(dm_ql_init); 256 + module_exit(dm_ql_exit); 257 + 258 + MODULE_AUTHOR("Stefan Bader <Stefan.Bader at de.ibm.com>"); 
259 + MODULE_DESCRIPTION( 260 + "(C) Copyright IBM Corp. 2004,2005 All Rights Reserved.\n" 261 + DM_NAME " path selector to balance the number of in-flight I/Os" 262 + ); 263 + MODULE_LICENSE("GPL");
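The queue-length selector keeps an atomic in-flight counter per path (ql_start_io/ql_end_io), and ql_select_path() simply picks the valid path with the smallest count, rotating the list head so ties are spread evenly. As a usage sketch (device numbers and sizes invented, not part of the patch), a two-path table selecting it might look like "0 10240 multipath 0 0 1 1 queue-length 0 2 1 8:16 128 8:32 128": no feature or hardware-handler arguments, one priority group, the queue-length selector with no global arguments, and two paths each carrying a repeat_count of 128.
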
+16 -1
drivers/md/dm-raid1.c
··· 1283 1283 return 0; 1284 1284 } 1285 1285 1286 + static int mirror_iterate_devices(struct dm_target *ti, 1287 + iterate_devices_callout_fn fn, void *data) 1288 + { 1289 + struct mirror_set *ms = ti->private; 1290 + int ret = 0; 1291 + unsigned i; 1292 + 1293 + for (i = 0; !ret && i < ms->nr_mirrors; i++) 1294 + ret = fn(ti, ms->mirror[i].dev, 1295 + ms->mirror[i].offset, data); 1296 + 1297 + return ret; 1298 + } 1299 + 1286 1300 static struct target_type mirror_target = { 1287 1301 .name = "mirror", 1288 - .version = {1, 0, 20}, 1302 + .version = {1, 12, 0}, 1289 1303 .module = THIS_MODULE, 1290 1304 .ctr = mirror_ctr, 1291 1305 .dtr = mirror_dtr, ··· 1309 1295 .postsuspend = mirror_postsuspend, 1310 1296 .resume = mirror_resume, 1311 1297 .status = mirror_status, 1298 + .iterate_devices = mirror_iterate_devices, 1312 1299 }; 1313 1300 1314 1301 static int __init dm_mirror_init(void)
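The mirror target now implements .iterate_devices, which dm-table.c (below) uses to visit every leg when stacking queue_limits and validating device areas. The contract is simply that the callout runs once per underlying device with that device's start offset, and a nonzero return stops the walk. A hypothetical callout, just to illustrate the calling convention (not part of the patch):

#include <linux/device-mapper.h>

/* Hypothetical iterate_devices callout: count a target's underlying devices.
 * Returning 0 continues the walk; nonzero stops it and is propagated back. */
static int count_device(struct dm_target *ti, struct dm_dev *dev,
			sector_t start, void *data)
{
	unsigned *count = data;

	(*count)++;
	return 0;
}

static unsigned count_target_devices(struct dm_target *ti)
{
	unsigned nr_devs = 0;

	if (ti->type->iterate_devices)
		ti->type->iterate_devices(ti, count_device, &nr_devs);

	return nr_devs;
}
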
+1 -1
drivers/md/dm-region-hash.c
··· 283 283 284 284 nreg = mempool_alloc(rh->region_pool, GFP_ATOMIC); 285 285 if (unlikely(!nreg)) 286 - nreg = kmalloc(sizeof(*nreg), GFP_NOIO); 286 + nreg = kmalloc(sizeof(*nreg), GFP_NOIO | __GFP_NOFAIL); 287 287 288 288 nreg->state = rh->log->type->in_sync(rh->log, region, 1) ? 289 289 DM_RH_CLEAN : DM_RH_NOSYNC;
+1 -1
drivers/md/dm-round-robin.c
··· 161 161 } 162 162 163 163 static struct dm_path *rr_select_path(struct path_selector *ps, 164 - unsigned *repeat_count) 164 + unsigned *repeat_count, size_t nr_bytes) 165 165 { 166 166 struct selector *s = (struct selector *) ps->context; 167 167 struct path_info *pi = NULL;
+339
drivers/md/dm-service-time.c
··· 1 + /* 2 + * Copyright (C) 2007-2009 NEC Corporation. All Rights Reserved. 3 + * 4 + * Module Author: Kiyoshi Ueda 5 + * 6 + * This file is released under the GPL. 7 + * 8 + * Throughput oriented path selector. 9 + */ 10 + 11 + #include "dm.h" 12 + #include "dm-path-selector.h" 13 + 14 + #define DM_MSG_PREFIX "multipath service-time" 15 + #define ST_MIN_IO 1 16 + #define ST_MAX_RELATIVE_THROUGHPUT 100 17 + #define ST_MAX_RELATIVE_THROUGHPUT_SHIFT 7 18 + #define ST_MAX_INFLIGHT_SIZE ((size_t)-1 >> ST_MAX_RELATIVE_THROUGHPUT_SHIFT) 19 + #define ST_VERSION "0.2.0" 20 + 21 + struct selector { 22 + struct list_head valid_paths; 23 + struct list_head failed_paths; 24 + }; 25 + 26 + struct path_info { 27 + struct list_head list; 28 + struct dm_path *path; 29 + unsigned repeat_count; 30 + unsigned relative_throughput; 31 + atomic_t in_flight_size; /* Total size of in-flight I/Os */ 32 + }; 33 + 34 + static struct selector *alloc_selector(void) 35 + { 36 + struct selector *s = kmalloc(sizeof(*s), GFP_KERNEL); 37 + 38 + if (s) { 39 + INIT_LIST_HEAD(&s->valid_paths); 40 + INIT_LIST_HEAD(&s->failed_paths); 41 + } 42 + 43 + return s; 44 + } 45 + 46 + static int st_create(struct path_selector *ps, unsigned argc, char **argv) 47 + { 48 + struct selector *s = alloc_selector(); 49 + 50 + if (!s) 51 + return -ENOMEM; 52 + 53 + ps->context = s; 54 + return 0; 55 + } 56 + 57 + static void free_paths(struct list_head *paths) 58 + { 59 + struct path_info *pi, *next; 60 + 61 + list_for_each_entry_safe(pi, next, paths, list) { 62 + list_del(&pi->list); 63 + kfree(pi); 64 + } 65 + } 66 + 67 + static void st_destroy(struct path_selector *ps) 68 + { 69 + struct selector *s = ps->context; 70 + 71 + free_paths(&s->valid_paths); 72 + free_paths(&s->failed_paths); 73 + kfree(s); 74 + ps->context = NULL; 75 + } 76 + 77 + static int st_status(struct path_selector *ps, struct dm_path *path, 78 + status_type_t type, char *result, unsigned maxlen) 79 + { 80 + unsigned sz = 0; 81 + struct path_info *pi; 82 + 83 + if (!path) 84 + DMEMIT("0 "); 85 + else { 86 + pi = path->pscontext; 87 + 88 + switch (type) { 89 + case STATUSTYPE_INFO: 90 + DMEMIT("%d %u ", atomic_read(&pi->in_flight_size), 91 + pi->relative_throughput); 92 + break; 93 + case STATUSTYPE_TABLE: 94 + DMEMIT("%u %u ", pi->repeat_count, 95 + pi->relative_throughput); 96 + break; 97 + } 98 + } 99 + 100 + return sz; 101 + } 102 + 103 + static int st_add_path(struct path_selector *ps, struct dm_path *path, 104 + int argc, char **argv, char **error) 105 + { 106 + struct selector *s = ps->context; 107 + struct path_info *pi; 108 + unsigned repeat_count = ST_MIN_IO; 109 + unsigned relative_throughput = 1; 110 + 111 + /* 112 + * Arguments: [<repeat_count> [<relative_throughput>]] 113 + * <repeat_count>: The number of I/Os before switching path. 114 + * If not given, default (ST_MIN_IO) is used. 115 + * <relative_throughput>: The relative throughput value of 116 + * the path among all paths in the path-group. 117 + * The valid range: 0-<ST_MAX_RELATIVE_THROUGHPUT> 118 + * If not given, minimum value '1' is used. 119 + * If '0' is given, the path isn't selected while 120 + * other paths having a positive value are 121 + * available. 
122 + */ 123 + if (argc > 2) { 124 + *error = "service-time ps: incorrect number of arguments"; 125 + return -EINVAL; 126 + } 127 + 128 + if (argc && (sscanf(argv[0], "%u", &repeat_count) != 1)) { 129 + *error = "service-time ps: invalid repeat count"; 130 + return -EINVAL; 131 + } 132 + 133 + if ((argc == 2) && 134 + (sscanf(argv[1], "%u", &relative_throughput) != 1 || 135 + relative_throughput > ST_MAX_RELATIVE_THROUGHPUT)) { 136 + *error = "service-time ps: invalid relative_throughput value"; 137 + return -EINVAL; 138 + } 139 + 140 + /* allocate the path */ 141 + pi = kmalloc(sizeof(*pi), GFP_KERNEL); 142 + if (!pi) { 143 + *error = "service-time ps: Error allocating path context"; 144 + return -ENOMEM; 145 + } 146 + 147 + pi->path = path; 148 + pi->repeat_count = repeat_count; 149 + pi->relative_throughput = relative_throughput; 150 + atomic_set(&pi->in_flight_size, 0); 151 + 152 + path->pscontext = pi; 153 + 154 + list_add_tail(&pi->list, &s->valid_paths); 155 + 156 + return 0; 157 + } 158 + 159 + static void st_fail_path(struct path_selector *ps, struct dm_path *path) 160 + { 161 + struct selector *s = ps->context; 162 + struct path_info *pi = path->pscontext; 163 + 164 + list_move(&pi->list, &s->failed_paths); 165 + } 166 + 167 + static int st_reinstate_path(struct path_selector *ps, struct dm_path *path) 168 + { 169 + struct selector *s = ps->context; 170 + struct path_info *pi = path->pscontext; 171 + 172 + list_move_tail(&pi->list, &s->valid_paths); 173 + 174 + return 0; 175 + } 176 + 177 + /* 178 + * Compare the estimated service time of 2 paths, pi1 and pi2, 179 + * for the incoming I/O. 180 + * 181 + * Returns: 182 + * < 0 : pi1 is better 183 + * 0 : no difference between pi1 and pi2 184 + * > 0 : pi2 is better 185 + * 186 + * Description: 187 + * Basically, the service time is estimated by: 188 + * ('pi->in-flight-size' + 'incoming') / 'pi->relative_throughput' 189 + * To reduce the calculation, some optimizations are made. 190 + * (See comments inline) 191 + */ 192 + static int st_compare_load(struct path_info *pi1, struct path_info *pi2, 193 + size_t incoming) 194 + { 195 + size_t sz1, sz2, st1, st2; 196 + 197 + sz1 = atomic_read(&pi1->in_flight_size); 198 + sz2 = atomic_read(&pi2->in_flight_size); 199 + 200 + /* 201 + * Case 1: Both have same throughput value. Choose less loaded path. 202 + */ 203 + if (pi1->relative_throughput == pi2->relative_throughput) 204 + return sz1 - sz2; 205 + 206 + /* 207 + * Case 2a: Both have same load. Choose higher throughput path. 208 + * Case 2b: One path has no throughput value. Choose the other one. 209 + */ 210 + if (sz1 == sz2 || 211 + !pi1->relative_throughput || !pi2->relative_throughput) 212 + return pi2->relative_throughput - pi1->relative_throughput; 213 + 214 + /* 215 + * Case 3: Calculate service time. Choose faster path. 216 + * Service time using pi1: 217 + * st1 = (sz1 + incoming) / pi1->relative_throughput 218 + * Service time using pi2: 219 + * st2 = (sz2 + incoming) / pi2->relative_throughput 220 + * 221 + * To avoid the division, transform the expression to use 222 + * multiplication. 223 + * Because ->relative_throughput > 0 here, if st1 < st2, 224 + * the expressions below are the same meaning: 225 + * (sz1 + incoming) / pi1->relative_throughput < 226 + * (sz2 + incoming) / pi2->relative_throughput 227 + * (sz1 + incoming) * pi2->relative_throughput < 228 + * (sz2 + incoming) * pi1->relative_throughput 229 + * So use the later one. 
230 + */ 231 + sz1 += incoming; 232 + sz2 += incoming; 233 + if (unlikely(sz1 >= ST_MAX_INFLIGHT_SIZE || 234 + sz2 >= ST_MAX_INFLIGHT_SIZE)) { 235 + /* 236 + * Size may be too big for multiplying pi->relative_throughput 237 + * and overflow. 238 + * To avoid the overflow and mis-selection, shift down both. 239 + */ 240 + sz1 >>= ST_MAX_RELATIVE_THROUGHPUT_SHIFT; 241 + sz2 >>= ST_MAX_RELATIVE_THROUGHPUT_SHIFT; 242 + } 243 + st1 = sz1 * pi2->relative_throughput; 244 + st2 = sz2 * pi1->relative_throughput; 245 + if (st1 != st2) 246 + return st1 - st2; 247 + 248 + /* 249 + * Case 4: Service time is equal. Choose higher throughput path. 250 + */ 251 + return pi2->relative_throughput - pi1->relative_throughput; 252 + } 253 + 254 + static struct dm_path *st_select_path(struct path_selector *ps, 255 + unsigned *repeat_count, size_t nr_bytes) 256 + { 257 + struct selector *s = ps->context; 258 + struct path_info *pi = NULL, *best = NULL; 259 + 260 + if (list_empty(&s->valid_paths)) 261 + return NULL; 262 + 263 + /* Change preferred (first in list) path to evenly balance. */ 264 + list_move_tail(s->valid_paths.next, &s->valid_paths); 265 + 266 + list_for_each_entry(pi, &s->valid_paths, list) 267 + if (!best || (st_compare_load(pi, best, nr_bytes) < 0)) 268 + best = pi; 269 + 270 + if (!best) 271 + return NULL; 272 + 273 + *repeat_count = best->repeat_count; 274 + 275 + return best->path; 276 + } 277 + 278 + static int st_start_io(struct path_selector *ps, struct dm_path *path, 279 + size_t nr_bytes) 280 + { 281 + struct path_info *pi = path->pscontext; 282 + 283 + atomic_add(nr_bytes, &pi->in_flight_size); 284 + 285 + return 0; 286 + } 287 + 288 + static int st_end_io(struct path_selector *ps, struct dm_path *path, 289 + size_t nr_bytes) 290 + { 291 + struct path_info *pi = path->pscontext; 292 + 293 + atomic_sub(nr_bytes, &pi->in_flight_size); 294 + 295 + return 0; 296 + } 297 + 298 + static struct path_selector_type st_ps = { 299 + .name = "service-time", 300 + .module = THIS_MODULE, 301 + .table_args = 2, 302 + .info_args = 2, 303 + .create = st_create, 304 + .destroy = st_destroy, 305 + .status = st_status, 306 + .add_path = st_add_path, 307 + .fail_path = st_fail_path, 308 + .reinstate_path = st_reinstate_path, 309 + .select_path = st_select_path, 310 + .start_io = st_start_io, 311 + .end_io = st_end_io, 312 + }; 313 + 314 + static int __init dm_st_init(void) 315 + { 316 + int r = dm_register_path_selector(&st_ps); 317 + 318 + if (r < 0) 319 + DMERR("register failed %d", r); 320 + 321 + DMINFO("version " ST_VERSION " loaded"); 322 + 323 + return r; 324 + } 325 + 326 + static void __exit dm_st_exit(void) 327 + { 328 + int r = dm_unregister_path_selector(&st_ps); 329 + 330 + if (r < 0) 331 + DMERR("unregister failed %d", r); 332 + } 333 + 334 + module_init(dm_st_init); 335 + module_exit(dm_st_exit); 336 + 337 + MODULE_DESCRIPTION(DM_NAME " throughput oriented path selector"); 338 + MODULE_AUTHOR("Kiyoshi Ueda <k-ueda@ct.jp.nec.com>"); 339 + MODULE_LICENSE("GPL");
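To make st_compare_load() concrete (numbers invented for illustration): suppose path A has relative_throughput 1 with 64 KiB in flight, path B has relative_throughput 3 with 192 KiB in flight, and the incoming request is 32 KiB. The true estimates are (64 + 32) / 1 = 96 for A and (192 + 32) / 3 ≈ 74.7 for B; the division-free comparison computes st1 = (64 + 32) * 3 = 288 and st2 = (192 + 32) * 1 = 224, returns the positive difference, and so selects B even though it has more data queued, because its higher throughput clears the backlog sooner.
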
+1 -1
drivers/md/dm-snap-persistent.c
··· 636 636 /* 637 637 * Commit exceptions to disk. 638 638 */ 639 - if (ps->valid && area_io(ps, WRITE)) 639 + if (ps->valid && area_io(ps, WRITE_BARRIER)) 640 640 ps->valid = 0; 641 641 642 642 /*
+11
drivers/md/dm-snap.c
··· 678 678 679 679 ti->private = s; 680 680 ti->split_io = s->store->chunk_size; 681 + ti->num_flush_requests = 1; 681 682 682 683 return 0; 683 684 ··· 1031 1030 chunk_t chunk; 1032 1031 struct dm_snap_pending_exception *pe = NULL; 1033 1032 1033 + if (unlikely(bio_empty_barrier(bio))) { 1034 + bio->bi_bdev = s->store->cow->bdev; 1035 + return DM_MAPIO_REMAPPED; 1036 + } 1037 + 1034 1038 chunk = sector_to_chunk(s->store, bio->bi_sector); 1035 1039 1036 1040 /* Full snapshots are not usable */ ··· 1344 1338 } 1345 1339 1346 1340 ti->private = dev; 1341 + ti->num_flush_requests = 1; 1342 + 1347 1343 return 0; 1348 1344 } 1349 1345 ··· 1360 1352 { 1361 1353 struct dm_dev *dev = ti->private; 1362 1354 bio->bi_bdev = dev->bdev; 1355 + 1356 + if (unlikely(bio_empty_barrier(bio))) 1357 + return DM_MAPIO_REMAPPED; 1363 1358 1364 1359 /* Only tell snapshots if this is a write */ 1365 1360 return (bio_rw(bio) == WRITE) ? do_origin(dev, bio) : DM_MAPIO_REMAPPED;
+29 -4
drivers/md/dm-stripe.c
··· 167 167 sc->stripes = stripes; 168 168 sc->stripe_width = width; 169 169 ti->split_io = chunk_size; 170 + ti->num_flush_requests = stripes; 170 171 171 172 sc->chunk_mask = ((sector_t) chunk_size) - 1; 172 173 for (sc->chunk_shift = 0; chunk_size; sc->chunk_shift++) ··· 212 211 union map_info *map_context) 213 212 { 214 213 struct stripe_c *sc = (struct stripe_c *) ti->private; 214 + sector_t offset, chunk; 215 + uint32_t stripe; 215 216 216 - sector_t offset = bio->bi_sector - ti->begin; 217 - sector_t chunk = offset >> sc->chunk_shift; 218 - uint32_t stripe = sector_div(chunk, sc->stripes); 217 + if (unlikely(bio_empty_barrier(bio))) { 218 + BUG_ON(map_context->flush_request >= sc->stripes); 219 + bio->bi_bdev = sc->stripe[map_context->flush_request].dev->bdev; 220 + return DM_MAPIO_REMAPPED; 221 + } 222 + 223 + offset = bio->bi_sector - ti->begin; 224 + chunk = offset >> sc->chunk_shift; 225 + stripe = sector_div(chunk, sc->stripes); 219 226 220 227 bio->bi_bdev = sc->stripe[stripe].dev->bdev; 221 228 bio->bi_sector = sc->stripe[stripe].physical_start + ··· 313 304 return error; 314 305 } 315 306 307 + static int stripe_iterate_devices(struct dm_target *ti, 308 + iterate_devices_callout_fn fn, void *data) 309 + { 310 + struct stripe_c *sc = ti->private; 311 + int ret = 0; 312 + unsigned i = 0; 313 + 314 + do 315 + ret = fn(ti, sc->stripe[i].dev, 316 + sc->stripe[i].physical_start, data); 317 + while (!ret && ++i < sc->stripes); 318 + 319 + return ret; 320 + } 321 + 316 322 static struct target_type stripe_target = { 317 323 .name = "striped", 318 - .version = {1, 1, 0}, 324 + .version = {1, 2, 0}, 319 325 .module = THIS_MODULE, 320 326 .ctr = stripe_ctr, 321 327 .dtr = stripe_dtr, 322 328 .map = stripe_map, 323 329 .end_io = stripe_end_io, 324 330 .status = stripe_status, 331 + .iterate_devices = stripe_iterate_devices, 325 332 }; 326 333 327 334 int __init dm_stripe_init(void)
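The flush handling here works together with the new num_flush_requests support: with 4 stripes, stripe_ctr() sets ti->num_flush_requests = 4, dm core (the dm.c changes in this same merge) submits one empty-barrier clone per flush request with map_context->flush_request running 0..3, and stripe_map() routes clone i to sc->stripe[i].dev, so the barrier reaches every member device rather than only the one a data bio would have mapped to.
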
+9
drivers/md/dm-sysfs.c
··· 57 57 return strlen(buf); 58 58 } 59 59 60 + static ssize_t dm_attr_suspended_show(struct mapped_device *md, char *buf) 61 + { 62 + sprintf(buf, "%d\n", dm_suspended(md)); 63 + 64 + return strlen(buf); 65 + } 66 + 60 67 static DM_ATTR_RO(name); 61 68 static DM_ATTR_RO(uuid); 69 + static DM_ATTR_RO(suspended); 62 70 63 71 static struct attribute *dm_attrs[] = { 64 72 &dm_attr_name.attr, 65 73 &dm_attr_uuid.attr, 74 + &dm_attr_suspended.attr, 66 75 NULL, 67 76 }; 68 77
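This exposes the suspend state next to the existing name and uuid attributes, so userspace (udev rules, monitoring scripts) can poll it without an ioctl. Assuming the same /sys/block/<dm-N>/dm/ directory those attributes already live in, "cat /sys/block/dm-0/dm/suspended" prints 1 while the device is suspended and 0 otherwise.
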
+325 -138
drivers/md/dm-table.c
··· 41 41 struct dm_table { 42 42 struct mapped_device *md; 43 43 atomic_t holders; 44 + unsigned type; 44 45 45 46 /* btree table */ 46 47 unsigned int depth; ··· 63 62 /* a list of devices used by this table */ 64 63 struct list_head devices; 65 64 66 - /* 67 - * These are optimistic limits taken from all the 68 - * targets, some targets will need smaller limits. 69 - */ 70 - struct io_restrictions limits; 71 - 72 65 /* events get handed up using this callback */ 73 66 void (*event_fn)(void *); 74 67 void *event_context; 68 + 69 + struct dm_md_mempools *mempools; 75 70 }; 76 71 77 72 /* ··· 83 86 } 84 87 85 88 return result; 86 - } 87 - 88 - /* 89 - * Returns the minimum that is _not_ zero, unless both are zero. 90 - */ 91 - #define min_not_zero(l, r) (l == 0) ? r : ((r == 0) ? l : min(l, r)) 92 - 93 - /* 94 - * Combine two io_restrictions, always taking the lower value. 95 - */ 96 - static void combine_restrictions_low(struct io_restrictions *lhs, 97 - struct io_restrictions *rhs) 98 - { 99 - lhs->max_sectors = 100 - min_not_zero(lhs->max_sectors, rhs->max_sectors); 101 - 102 - lhs->max_phys_segments = 103 - min_not_zero(lhs->max_phys_segments, rhs->max_phys_segments); 104 - 105 - lhs->max_hw_segments = 106 - min_not_zero(lhs->max_hw_segments, rhs->max_hw_segments); 107 - 108 - lhs->logical_block_size = max(lhs->logical_block_size, 109 - rhs->logical_block_size); 110 - 111 - lhs->max_segment_size = 112 - min_not_zero(lhs->max_segment_size, rhs->max_segment_size); 113 - 114 - lhs->max_hw_sectors = 115 - min_not_zero(lhs->max_hw_sectors, rhs->max_hw_sectors); 116 - 117 - lhs->seg_boundary_mask = 118 - min_not_zero(lhs->seg_boundary_mask, rhs->seg_boundary_mask); 119 - 120 - lhs->bounce_pfn = min_not_zero(lhs->bounce_pfn, rhs->bounce_pfn); 121 - 122 - lhs->no_cluster |= rhs->no_cluster; 123 89 } 124 90 125 91 /* ··· 227 267 list_for_each_safe(tmp, next, devices) { 228 268 struct dm_dev_internal *dd = 229 269 list_entry(tmp, struct dm_dev_internal, list); 270 + DMWARN("dm_table_destroy: dm_put_device call missing for %s", 271 + dd->dm_dev.name); 230 272 kfree(dd); 231 273 } 232 274 } ··· 258 296 vfree(t->highs); 259 297 260 298 /* free the device list */ 261 - if (t->devices.next != &t->devices) { 262 - DMWARN("devices still present during destroy: " 263 - "dm_table_remove_device calls missing"); 264 - 299 + if (t->devices.next != &t->devices) 265 300 free_devices(&t->devices); 266 - } 301 + 302 + dm_free_md_mempools(t->mempools); 267 303 268 304 kfree(t); 269 305 } ··· 345 385 /* 346 386 * If possible, this checks an area of a destination device is valid. 
347 387 */ 348 - static int check_device_area(struct dm_dev_internal *dd, sector_t start, 349 - sector_t len) 388 + static int device_area_is_valid(struct dm_target *ti, struct dm_dev *dev, 389 + sector_t start, void *data) 350 390 { 351 - sector_t dev_size = dd->dm_dev.bdev->bd_inode->i_size >> SECTOR_SHIFT; 391 + struct queue_limits *limits = data; 392 + struct block_device *bdev = dev->bdev; 393 + sector_t dev_size = 394 + i_size_read(bdev->bd_inode) >> SECTOR_SHIFT; 395 + unsigned short logical_block_size_sectors = 396 + limits->logical_block_size >> SECTOR_SHIFT; 397 + char b[BDEVNAME_SIZE]; 352 398 353 399 if (!dev_size) 354 400 return 1; 355 401 356 - return ((start < dev_size) && (len <= (dev_size - start))); 402 + if ((start >= dev_size) || (start + ti->len > dev_size)) { 403 + DMWARN("%s: %s too small for target", 404 + dm_device_name(ti->table->md), bdevname(bdev, b)); 405 + return 0; 406 + } 407 + 408 + if (logical_block_size_sectors <= 1) 409 + return 1; 410 + 411 + if (start & (logical_block_size_sectors - 1)) { 412 + DMWARN("%s: start=%llu not aligned to h/w " 413 + "logical block size %hu of %s", 414 + dm_device_name(ti->table->md), 415 + (unsigned long long)start, 416 + limits->logical_block_size, bdevname(bdev, b)); 417 + return 0; 418 + } 419 + 420 + if (ti->len & (logical_block_size_sectors - 1)) { 421 + DMWARN("%s: len=%llu not aligned to h/w " 422 + "logical block size %hu of %s", 423 + dm_device_name(ti->table->md), 424 + (unsigned long long)ti->len, 425 + limits->logical_block_size, bdevname(bdev, b)); 426 + return 0; 427 + } 428 + 429 + return 1; 357 430 } 358 431 359 432 /* ··· 472 479 } 473 480 atomic_inc(&dd->count); 474 481 475 - if (!check_device_area(dd, start, len)) { 476 - DMWARN("device %s too small for target", path); 477 - dm_put_device(ti, &dd->dm_dev); 478 - return -EINVAL; 479 - } 480 - 481 482 *result = &dd->dm_dev; 482 - 483 483 return 0; 484 484 } 485 485 486 - void dm_set_device_limits(struct dm_target *ti, struct block_device *bdev) 486 + /* 487 + * Returns the minimum that is _not_ zero, unless both are zero. 488 + */ 489 + #define min_not_zero(l, r) (l == 0) ? r : ((r == 0) ? l : min(l, r)) 490 + 491 + int dm_set_device_limits(struct dm_target *ti, struct dm_dev *dev, 492 + sector_t start, void *data) 487 493 { 494 + struct queue_limits *limits = data; 495 + struct block_device *bdev = dev->bdev; 488 496 struct request_queue *q = bdev_get_queue(bdev); 489 - struct io_restrictions *rs = &ti->limits; 490 497 char b[BDEVNAME_SIZE]; 491 498 492 499 if (unlikely(!q)) { 493 500 DMWARN("%s: Cannot set limits for nonexistent device %s", 494 501 dm_device_name(ti->table->md), bdevname(bdev, b)); 495 - return; 502 + return 0; 496 503 } 497 504 498 - /* 499 - * Combine the device limits low. 500 - * 501 - * FIXME: if we move an io_restriction struct 502 - * into q this would just be a call to 503 - * combine_restrictions_low() 504 - */ 505 - rs->max_sectors = 506 - min_not_zero(rs->max_sectors, queue_max_sectors(q)); 505 + if (blk_stack_limits(limits, &q->limits, start) < 0) 506 + DMWARN("%s: target device %s is misaligned", 507 + dm_device_name(ti->table->md), bdevname(bdev, b)); 507 508 508 509 /* 509 510 * Check if merge fn is supported. 
··· 506 519 */ 507 520 508 521 if (q->merge_bvec_fn && !ti->type->merge) 509 - rs->max_sectors = 510 - min_not_zero(rs->max_sectors, 522 + limits->max_sectors = 523 + min_not_zero(limits->max_sectors, 511 524 (unsigned int) (PAGE_SIZE >> 9)); 512 - 513 - rs->max_phys_segments = 514 - min_not_zero(rs->max_phys_segments, 515 - queue_max_phys_segments(q)); 516 - 517 - rs->max_hw_segments = 518 - min_not_zero(rs->max_hw_segments, queue_max_hw_segments(q)); 519 - 520 - rs->logical_block_size = max(rs->logical_block_size, 521 - queue_logical_block_size(q)); 522 - 523 - rs->max_segment_size = 524 - min_not_zero(rs->max_segment_size, queue_max_segment_size(q)); 525 - 526 - rs->max_hw_sectors = 527 - min_not_zero(rs->max_hw_sectors, queue_max_hw_sectors(q)); 528 - 529 - rs->seg_boundary_mask = 530 - min_not_zero(rs->seg_boundary_mask, 531 - queue_segment_boundary(q)); 532 - 533 - rs->bounce_pfn = min_not_zero(rs->bounce_pfn, queue_bounce_pfn(q)); 534 - 535 - rs->no_cluster |= !test_bit(QUEUE_FLAG_CLUSTER, &q->queue_flags); 525 + return 0; 536 526 } 537 527 EXPORT_SYMBOL_GPL(dm_set_device_limits); 538 528 539 529 int dm_get_device(struct dm_target *ti, const char *path, sector_t start, 540 530 sector_t len, fmode_t mode, struct dm_dev **result) 541 531 { 542 - int r = __table_get_device(ti->table, ti, path, 543 - start, len, mode, result); 544 - 545 - if (!r) 546 - dm_set_device_limits(ti, (*result)->bdev); 547 - 548 - return r; 532 + return __table_get_device(ti->table, ti, path, 533 + start, len, mode, result); 549 534 } 535 + 550 536 551 537 /* 552 538 * Decrement a devices use count and remove it if necessary. ··· 635 675 return 0; 636 676 } 637 677 638 - static void check_for_valid_limits(struct io_restrictions *rs) 678 + /* 679 + * Impose necessary and sufficient conditions on a devices's table such 680 + * that any incoming bio which respects its logical_block_size can be 681 + * processed successfully. If it falls across the boundary between 682 + * two or more targets, the size of each piece it gets split into must 683 + * be compatible with the logical_block_size of the target processing it. 684 + */ 685 + static int validate_hardware_logical_block_alignment(struct dm_table *table, 686 + struct queue_limits *limits) 639 687 { 640 - if (!rs->max_sectors) 641 - rs->max_sectors = SAFE_MAX_SECTORS; 642 - if (!rs->max_hw_sectors) 643 - rs->max_hw_sectors = SAFE_MAX_SECTORS; 644 - if (!rs->max_phys_segments) 645 - rs->max_phys_segments = MAX_PHYS_SEGMENTS; 646 - if (!rs->max_hw_segments) 647 - rs->max_hw_segments = MAX_HW_SEGMENTS; 648 - if (!rs->logical_block_size) 649 - rs->logical_block_size = 1 << SECTOR_SHIFT; 650 - if (!rs->max_segment_size) 651 - rs->max_segment_size = MAX_SEGMENT_SIZE; 652 - if (!rs->seg_boundary_mask) 653 - rs->seg_boundary_mask = BLK_SEG_BOUNDARY_MASK; 654 - if (!rs->bounce_pfn) 655 - rs->bounce_pfn = -1; 688 + /* 689 + * This function uses arithmetic modulo the logical_block_size 690 + * (in units of 512-byte sectors). 691 + */ 692 + unsigned short device_logical_block_size_sects = 693 + limits->logical_block_size >> SECTOR_SHIFT; 694 + 695 + /* 696 + * Offset of the start of the next table entry, mod logical_block_size. 697 + */ 698 + unsigned short next_target_start = 0; 699 + 700 + /* 701 + * Given an aligned bio that extends beyond the end of a 702 + * target, how many sectors must the next target handle? 
703 + */ 704 + unsigned short remaining = 0; 705 + 706 + struct dm_target *uninitialized_var(ti); 707 + struct queue_limits ti_limits; 708 + unsigned i = 0; 709 + 710 + /* 711 + * Check each entry in the table in turn. 712 + */ 713 + while (i < dm_table_get_num_targets(table)) { 714 + ti = dm_table_get_target(table, i++); 715 + 716 + blk_set_default_limits(&ti_limits); 717 + 718 + /* combine all target devices' limits */ 719 + if (ti->type->iterate_devices) 720 + ti->type->iterate_devices(ti, dm_set_device_limits, 721 + &ti_limits); 722 + 723 + /* 724 + * If the remaining sectors fall entirely within this 725 + * table entry are they compatible with its logical_block_size? 726 + */ 727 + if (remaining < ti->len && 728 + remaining & ((ti_limits.logical_block_size >> 729 + SECTOR_SHIFT) - 1)) 730 + break; /* Error */ 731 + 732 + next_target_start = 733 + (unsigned short) ((next_target_start + ti->len) & 734 + (device_logical_block_size_sects - 1)); 735 + remaining = next_target_start ? 736 + device_logical_block_size_sects - next_target_start : 0; 737 + } 738 + 739 + if (remaining) { 740 + DMWARN("%s: table line %u (start sect %llu len %llu) " 741 + "not aligned to h/w logical block size %hu", 742 + dm_device_name(table->md), i, 743 + (unsigned long long) ti->begin, 744 + (unsigned long long) ti->len, 745 + limits->logical_block_size); 746 + return -EINVAL; 747 + } 748 + 749 + return 0; 656 750 } 657 751 658 752 int dm_table_add_target(struct dm_table *t, const char *type, ··· 761 747 762 748 t->highs[t->num_targets++] = tgt->begin + tgt->len - 1; 763 749 764 - /* FIXME: the plan is to combine high here and then have 765 - * the merge fn apply the target level restrictions. */ 766 - combine_restrictions_low(&t->limits, &tgt->limits); 767 750 return 0; 768 751 769 752 bad: 770 753 DMERR("%s: %s: %s", dm_device_name(t->md), type, tgt->error); 771 754 dm_put_target_type(tgt->type); 772 755 return r; 756 + } 757 + 758 + int dm_table_set_type(struct dm_table *t) 759 + { 760 + unsigned i; 761 + unsigned bio_based = 0, request_based = 0; 762 + struct dm_target *tgt; 763 + struct dm_dev_internal *dd; 764 + struct list_head *devices; 765 + 766 + for (i = 0; i < t->num_targets; i++) { 767 + tgt = t->targets + i; 768 + if (dm_target_request_based(tgt)) 769 + request_based = 1; 770 + else 771 + bio_based = 1; 772 + 773 + if (bio_based && request_based) { 774 + DMWARN("Inconsistent table: different target types" 775 + " can't be mixed up"); 776 + return -EINVAL; 777 + } 778 + } 779 + 780 + if (bio_based) { 781 + /* We must use this table as bio-based */ 782 + t->type = DM_TYPE_BIO_BASED; 783 + return 0; 784 + } 785 + 786 + BUG_ON(!request_based); /* No targets in this table */ 787 + 788 + /* Non-request-stackable devices can't be used for request-based dm */ 789 + devices = dm_table_get_devices(t); 790 + list_for_each_entry(dd, devices, list) { 791 + if (!blk_queue_stackable(bdev_get_queue(dd->dm_dev.bdev))) { 792 + DMWARN("table load rejected: including" 793 + " non-request-stackable devices"); 794 + return -EINVAL; 795 + } 796 + } 797 + 798 + /* 799 + * Request-based dm supports only tables that have a single target now. 800 + * To support multiple targets, request splitting support is needed, 801 + * and that needs lots of changes in the block-layer. 802 + * (e.g. request completion process for partial completion.) 
803 + */ 804 + if (t->num_targets > 1) { 805 + DMWARN("Request-based dm doesn't support multiple targets yet"); 806 + return -EINVAL; 807 + } 808 + 809 + t->type = DM_TYPE_REQUEST_BASED; 810 + 811 + return 0; 812 + } 813 + 814 + unsigned dm_table_get_type(struct dm_table *t) 815 + { 816 + return t->type; 817 + } 818 + 819 + bool dm_table_bio_based(struct dm_table *t) 820 + { 821 + return dm_table_get_type(t) == DM_TYPE_BIO_BASED; 822 + } 823 + 824 + bool dm_table_request_based(struct dm_table *t) 825 + { 826 + return dm_table_get_type(t) == DM_TYPE_REQUEST_BASED; 827 + } 828 + 829 + int dm_table_alloc_md_mempools(struct dm_table *t) 830 + { 831 + unsigned type = dm_table_get_type(t); 832 + 833 + if (unlikely(type == DM_TYPE_NONE)) { 834 + DMWARN("no table type is set, can't allocate mempools"); 835 + return -EINVAL; 836 + } 837 + 838 + t->mempools = dm_alloc_md_mempools(type); 839 + if (!t->mempools) 840 + return -ENOMEM; 841 + 842 + return 0; 843 + } 844 + 845 + void dm_table_free_md_mempools(struct dm_table *t) 846 + { 847 + dm_free_md_mempools(t->mempools); 848 + t->mempools = NULL; 849 + } 850 + 851 + struct dm_md_mempools *dm_table_get_md_mempools(struct dm_table *t) 852 + { 853 + return t->mempools; 773 854 } 774 855 775 856 static int setup_indexes(struct dm_table *t) ··· 900 791 { 901 792 int r = 0; 902 793 unsigned int leaf_nodes; 903 - 904 - check_for_valid_limits(&t->limits); 905 794 906 795 /* how many indexes will the btree have ? */ 907 796 leaf_nodes = dm_div_up(t->num_targets, KEYS_PER_NODE); ··· 976 869 } 977 870 978 871 /* 872 + * Establish the new table's queue_limits and validate them. 873 + */ 874 + int dm_calculate_queue_limits(struct dm_table *table, 875 + struct queue_limits *limits) 876 + { 877 + struct dm_target *uninitialized_var(ti); 878 + struct queue_limits ti_limits; 879 + unsigned i = 0; 880 + 881 + blk_set_default_limits(limits); 882 + 883 + while (i < dm_table_get_num_targets(table)) { 884 + blk_set_default_limits(&ti_limits); 885 + 886 + ti = dm_table_get_target(table, i++); 887 + 888 + if (!ti->type->iterate_devices) 889 + goto combine_limits; 890 + 891 + /* 892 + * Combine queue limits of all the devices this target uses. 893 + */ 894 + ti->type->iterate_devices(ti, dm_set_device_limits, 895 + &ti_limits); 896 + 897 + /* 898 + * Check each device area is consistent with the target's 899 + * overall queue limits. 900 + */ 901 + if (!ti->type->iterate_devices(ti, device_area_is_valid, 902 + &ti_limits)) 903 + return -EINVAL; 904 + 905 + combine_limits: 906 + /* 907 + * Merge this target's queue limits into the overall limits 908 + * for the table. 909 + */ 910 + if (blk_stack_limits(limits, &ti_limits, 0) < 0) 911 + DMWARN("%s: target device " 912 + "(start sect %llu len %llu) " 913 + "is misaligned", 914 + dm_device_name(table->md), 915 + (unsigned long long) ti->begin, 916 + (unsigned long long) ti->len); 917 + } 918 + 919 + return validate_hardware_logical_block_alignment(table, limits); 920 + } 921 + 922 + /* 979 923 * Set the integrity profile for this device if all devices used have 980 924 * matching profiles. 981 925 */ ··· 1065 907 return; 1066 908 } 1067 909 1068 - void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q) 910 + void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q, 911 + struct queue_limits *limits) 1069 912 { 1070 913 /* 1071 - * Make sure we obey the optimistic sub devices 1072 - * restrictions. 
914 + * Each target device in the table has a data area that should normally 915 + * be aligned such that the DM device's alignment_offset is 0. 916 + * FIXME: Propagate alignment_offsets up the stack and warn of 917 + * sub-optimal or inconsistent settings. 1073 918 */ 1074 - blk_queue_max_sectors(q, t->limits.max_sectors); 1075 - blk_queue_max_phys_segments(q, t->limits.max_phys_segments); 1076 - blk_queue_max_hw_segments(q, t->limits.max_hw_segments); 1077 - blk_queue_logical_block_size(q, t->limits.logical_block_size); 1078 - blk_queue_max_segment_size(q, t->limits.max_segment_size); 1079 - blk_queue_max_hw_sectors(q, t->limits.max_hw_sectors); 1080 - blk_queue_segment_boundary(q, t->limits.seg_boundary_mask); 1081 - blk_queue_bounce_limit(q, t->limits.bounce_pfn); 919 + limits->alignment_offset = 0; 920 + limits->misaligned = 0; 1082 921 1083 - if (t->limits.no_cluster) 922 + /* 923 + * Copy table's limits to the DM device's request_queue 924 + */ 925 + q->limits = *limits; 926 + 927 + if (limits->no_cluster) 1084 928 queue_flag_clear_unlocked(QUEUE_FLAG_CLUSTER, q); 1085 929 else 1086 930 queue_flag_set_unlocked(QUEUE_FLAG_CLUSTER, q); 1087 931 1088 932 dm_table_set_integrity(t); 933 + 934 + /* 935 + * QUEUE_FLAG_STACKABLE must be set after all queue settings are 936 + * visible to other CPUs because, once the flag is set, incoming bios 937 + * are processed by request-based dm, which refers to the queue 938 + * settings. 939 + * Until the flag set, bios are passed to bio-based dm and queued to 940 + * md->deferred where queue settings are not needed yet. 941 + * Those bios are passed to request-based dm at the resume time. 942 + */ 943 + smp_mb(); 944 + if (dm_table_request_based(t)) 945 + queue_flag_set_unlocked(QUEUE_FLAG_STACKABLE, q); 1089 946 } 1090 947 1091 948 unsigned int dm_table_get_num_targets(struct dm_table *t) ··· 1194 1021 } 1195 1022 1196 1023 return r; 1024 + } 1025 + 1026 + int dm_table_any_busy_target(struct dm_table *t) 1027 + { 1028 + unsigned i; 1029 + struct dm_target *ti; 1030 + 1031 + for (i = 0; i < t->num_targets; i++) { 1032 + ti = t->targets + i; 1033 + if (ti->type->busy && ti->type->busy(ti)) 1034 + return 1; 1035 + } 1036 + 1037 + return 0; 1197 1038 } 1198 1039 1199 1040 void dm_table_unplug_all(struct dm_table *t)
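A concrete case of the new alignment validation (numbers invented): if the limits stacked from the underlying devices report a 4 KiB logical_block_size, device_logical_block_size_sects is 8, so every table boundary must land on an 8-sector multiple. A first target of length 1001 sectors leaves next_target_start = 1001 & 7 = 1 and remaining = 7; whether those 7 sectors fall to a following 4 KiB-block target or simply dangle at the end of the table, the load is rejected with the "not aligned to h/w logical block size" warning, whereas lengths that are multiples of 8 sectors pass.
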
+1000 -133
drivers/md/dm.c
··· 24 24 25 25 #define DM_MSG_PREFIX "core" 26 26 27 + /* 28 + * Cookies are numeric values sent with CHANGE and REMOVE 29 + * uevents while resuming, removing or renaming the device. 30 + */ 31 + #define DM_COOKIE_ENV_VAR_NAME "DM_COOKIE" 32 + #define DM_COOKIE_LENGTH 24 33 + 27 34 static const char *_name = DM_NAME; 28 35 29 36 static unsigned int major = 0; ··· 78 71 */ 79 72 struct dm_rq_clone_bio_info { 80 73 struct bio *orig; 81 - struct request *rq; 74 + struct dm_rq_target_io *tio; 82 75 }; 83 76 84 77 union map_info *dm_get_mapinfo(struct bio *bio) ··· 87 80 return &((struct dm_target_io *)bio->bi_private)->info; 88 81 return NULL; 89 82 } 83 + 84 + union map_info *dm_get_rq_mapinfo(struct request *rq) 85 + { 86 + if (rq && rq->end_io_data) 87 + return &((struct dm_rq_target_io *)rq->end_io_data)->info; 88 + return NULL; 89 + } 90 + EXPORT_SYMBOL_GPL(dm_get_rq_mapinfo); 90 91 91 92 #define MINOR_ALLOCED ((void *)-1) 92 93 ··· 172 157 * freeze/thaw support require holding onto a super block 173 158 */ 174 159 struct super_block *frozen_sb; 175 - struct block_device *suspended_bdev; 160 + struct block_device *bdev; 176 161 177 162 /* forced geometry settings */ 178 163 struct hd_geometry geometry; 179 164 165 + /* marker of flush suspend for request-based dm */ 166 + struct request suspend_rq; 167 + 168 + /* For saving the address of __make_request for request based dm */ 169 + make_request_fn *saved_make_request_fn; 170 + 180 171 /* sysfs handle */ 181 172 struct kobject kobj; 173 + 174 + /* zero-length barrier that will be cloned and submitted to targets */ 175 + struct bio barrier_bio; 176 + }; 177 + 178 + /* 179 + * For mempools pre-allocation at the table loading time. 180 + */ 181 + struct dm_md_mempools { 182 + mempool_t *io_pool; 183 + mempool_t *tio_pool; 184 + struct bio_set *bs; 182 185 }; 183 186 184 187 #define MIN_IOS 256 ··· 424 391 mempool_free(io, md->io_pool); 425 392 } 426 393 427 - static struct dm_target_io *alloc_tio(struct mapped_device *md) 428 - { 429 - return mempool_alloc(md->tio_pool, GFP_NOIO); 430 - } 431 - 432 394 static void free_tio(struct mapped_device *md, struct dm_target_io *tio) 433 395 { 434 396 mempool_free(tio, md->tio_pool); 397 + } 398 + 399 + static struct dm_rq_target_io *alloc_rq_tio(struct mapped_device *md) 400 + { 401 + return mempool_alloc(md->tio_pool, GFP_ATOMIC); 402 + } 403 + 404 + static void free_rq_tio(struct dm_rq_target_io *tio) 405 + { 406 + mempool_free(tio, tio->md->tio_pool); 407 + } 408 + 409 + static struct dm_rq_clone_bio_info *alloc_bio_info(struct mapped_device *md) 410 + { 411 + return mempool_alloc(md->io_pool, GFP_ATOMIC); 412 + } 413 + 414 + static void free_bio_info(struct dm_rq_clone_bio_info *info) 415 + { 416 + mempool_free(info, info->tio->md->io_pool); 435 417 } 436 418 437 419 static void start_io_acct(struct dm_io *io) ··· 512 464 struct dm_table *dm_get_table(struct mapped_device *md) 513 465 { 514 466 struct dm_table *t; 467 + unsigned long flags; 515 468 516 - read_lock(&md->map_lock); 469 + read_lock_irqsave(&md->map_lock, flags); 517 470 t = md->map; 518 471 if (t) 519 472 dm_table_get(t); 520 - read_unlock(&md->map_lock); 473 + read_unlock_irqrestore(&md->map_lock, flags); 521 474 522 475 return t; 523 476 } ··· 585 536 * Target requested pushing back the I/O. 
586 537 */ 587 538 spin_lock_irqsave(&md->deferred_lock, flags); 588 - if (__noflush_suspending(md)) 589 - bio_list_add_head(&md->deferred, io->bio); 590 - else 539 + if (__noflush_suspending(md)) { 540 + if (!bio_barrier(io->bio)) 541 + bio_list_add_head(&md->deferred, 542 + io->bio); 543 + } else 591 544 /* noflush suspend was interrupted. */ 592 545 io->error = -EIO; 593 546 spin_unlock_irqrestore(&md->deferred_lock, flags); ··· 604 553 * a per-device variable for error reporting. 605 554 * Note that you can't touch the bio after end_io_acct 606 555 */ 607 - md->barrier_error = io_error; 556 + if (!md->barrier_error && io_error != -EOPNOTSUPP) 557 + md->barrier_error = io_error; 608 558 end_io_acct(io); 609 559 } else { 610 560 end_io_acct(io); ··· 659 607 dec_pending(io, error); 660 608 } 661 609 610 + /* 611 + * Partial completion handling for request-based dm 612 + */ 613 + static void end_clone_bio(struct bio *clone, int error) 614 + { 615 + struct dm_rq_clone_bio_info *info = clone->bi_private; 616 + struct dm_rq_target_io *tio = info->tio; 617 + struct bio *bio = info->orig; 618 + unsigned int nr_bytes = info->orig->bi_size; 619 + 620 + bio_put(clone); 621 + 622 + if (tio->error) 623 + /* 624 + * An error has already been detected on the request. 625 + * Once error occurred, just let clone->end_io() handle 626 + * the remainder. 627 + */ 628 + return; 629 + else if (error) { 630 + /* 631 + * Don't notice the error to the upper layer yet. 632 + * The error handling decision is made by the target driver, 633 + * when the request is completed. 634 + */ 635 + tio->error = error; 636 + return; 637 + } 638 + 639 + /* 640 + * I/O for the bio successfully completed. 641 + * Notice the data completion to the upper layer. 642 + */ 643 + 644 + /* 645 + * bios are processed from the head of the list. 646 + * So the completing bio should always be rq->bio. 647 + * If it's not, something wrong is happening. 648 + */ 649 + if (tio->orig->bio != bio) 650 + DMERR("bio completion is going in the middle of the request"); 651 + 652 + /* 653 + * Update the original request. 654 + * Do not use blk_end_request() here, because it may complete 655 + * the original request before the clone, and break the ordering. 656 + */ 657 + blk_update_request(tio->orig, 0, nr_bytes); 658 + } 659 + 660 + /* 661 + * Don't touch any member of the md after calling this function because 662 + * the md may be freed in dm_put() at the end of this function. 663 + * Or do dm_get() before calling this function and dm_put() later. 664 + */ 665 + static void rq_completed(struct mapped_device *md, int run_queue) 666 + { 667 + int wakeup_waiters = 0; 668 + struct request_queue *q = md->queue; 669 + unsigned long flags; 670 + 671 + spin_lock_irqsave(q->queue_lock, flags); 672 + if (!queue_in_flight(q)) 673 + wakeup_waiters = 1; 674 + spin_unlock_irqrestore(q->queue_lock, flags); 675 + 676 + /* nudge anyone waiting on suspend queue */ 677 + if (wakeup_waiters) 678 + wake_up(&md->wait); 679 + 680 + if (run_queue) 681 + blk_run_queue(q); 682 + 683 + /* 684 + * dm_put() must be at the end of this function. See the comment above 685 + */ 686 + dm_put(md); 687 + } 688 + 689 + static void dm_unprep_request(struct request *rq) 690 + { 691 + struct request *clone = rq->special; 692 + struct dm_rq_target_io *tio = clone->end_io_data; 693 + 694 + rq->special = NULL; 695 + rq->cmd_flags &= ~REQ_DONTPREP; 696 + 697 + blk_rq_unprep_clone(clone); 698 + free_rq_tio(tio); 699 + } 700 + 701 + /* 702 + * Requeue the original request of a clone. 
703 + */ 704 + void dm_requeue_unmapped_request(struct request *clone) 705 + { 706 + struct dm_rq_target_io *tio = clone->end_io_data; 707 + struct mapped_device *md = tio->md; 708 + struct request *rq = tio->orig; 709 + struct request_queue *q = rq->q; 710 + unsigned long flags; 711 + 712 + dm_unprep_request(rq); 713 + 714 + spin_lock_irqsave(q->queue_lock, flags); 715 + if (elv_queue_empty(q)) 716 + blk_plug_device(q); 717 + blk_requeue_request(q, rq); 718 + spin_unlock_irqrestore(q->queue_lock, flags); 719 + 720 + rq_completed(md, 0); 721 + } 722 + EXPORT_SYMBOL_GPL(dm_requeue_unmapped_request); 723 + 724 + static void __stop_queue(struct request_queue *q) 725 + { 726 + blk_stop_queue(q); 727 + } 728 + 729 + static void stop_queue(struct request_queue *q) 730 + { 731 + unsigned long flags; 732 + 733 + spin_lock_irqsave(q->queue_lock, flags); 734 + __stop_queue(q); 735 + spin_unlock_irqrestore(q->queue_lock, flags); 736 + } 737 + 738 + static void __start_queue(struct request_queue *q) 739 + { 740 + if (blk_queue_stopped(q)) 741 + blk_start_queue(q); 742 + } 743 + 744 + static void start_queue(struct request_queue *q) 745 + { 746 + unsigned long flags; 747 + 748 + spin_lock_irqsave(q->queue_lock, flags); 749 + __start_queue(q); 750 + spin_unlock_irqrestore(q->queue_lock, flags); 751 + } 752 + 753 + /* 754 + * Complete the clone and the original request. 755 + * Must be called without queue lock. 756 + */ 757 + static void dm_end_request(struct request *clone, int error) 758 + { 759 + struct dm_rq_target_io *tio = clone->end_io_data; 760 + struct mapped_device *md = tio->md; 761 + struct request *rq = tio->orig; 762 + 763 + if (blk_pc_request(rq)) { 764 + rq->errors = clone->errors; 765 + rq->resid_len = clone->resid_len; 766 + 767 + if (rq->sense) 768 + /* 769 + * We are using the sense buffer of the original 770 + * request. 771 + * So setting the length of the sense data is enough. 772 + */ 773 + rq->sense_len = clone->sense_len; 774 + } 775 + 776 + BUG_ON(clone->bio); 777 + free_rq_tio(tio); 778 + 779 + blk_end_request_all(rq, error); 780 + 781 + rq_completed(md, 1); 782 + } 783 + 784 + /* 785 + * Request completion handler for request-based dm 786 + */ 787 + static void dm_softirq_done(struct request *rq) 788 + { 789 + struct request *clone = rq->completion_data; 790 + struct dm_rq_target_io *tio = clone->end_io_data; 791 + dm_request_endio_fn rq_end_io = tio->ti->type->rq_end_io; 792 + int error = tio->error; 793 + 794 + if (!(rq->cmd_flags & REQ_FAILED) && rq_end_io) 795 + error = rq_end_io(tio->ti, clone, error, &tio->info); 796 + 797 + if (error <= 0) 798 + /* The target wants to complete the I/O */ 799 + dm_end_request(clone, error); 800 + else if (error == DM_ENDIO_INCOMPLETE) 801 + /* The target will handle the I/O */ 802 + return; 803 + else if (error == DM_ENDIO_REQUEUE) 804 + /* The target wants to requeue the I/O */ 805 + dm_requeue_unmapped_request(clone); 806 + else { 807 + DMWARN("unimplemented target endio return value: %d", error); 808 + BUG(); 809 + } 810 + } 811 + 812 + /* 813 + * Complete the clone and the original request with the error status 814 + * through softirq context. 
815 + */ 816 + static void dm_complete_request(struct request *clone, int error) 817 + { 818 + struct dm_rq_target_io *tio = clone->end_io_data; 819 + struct request *rq = tio->orig; 820 + 821 + tio->error = error; 822 + rq->completion_data = clone; 823 + blk_complete_request(rq); 824 + } 825 + 826 + /* 827 + * Complete the not-mapped clone and the original request with the error status 828 + * through softirq context. 829 + * Target's rq_end_io() function isn't called. 830 + * This may be used when the target's map_rq() function fails. 831 + */ 832 + void dm_kill_unmapped_request(struct request *clone, int error) 833 + { 834 + struct dm_rq_target_io *tio = clone->end_io_data; 835 + struct request *rq = tio->orig; 836 + 837 + rq->cmd_flags |= REQ_FAILED; 838 + dm_complete_request(clone, error); 839 + } 840 + EXPORT_SYMBOL_GPL(dm_kill_unmapped_request); 841 + 842 + /* 843 + * Called with the queue lock held 844 + */ 845 + static void end_clone_request(struct request *clone, int error) 846 + { 847 + /* 848 + * For just cleaning up the information of the queue in which 849 + * the clone was dispatched. 850 + * The clone is *NOT* freed actually here because it is alloced from 851 + * dm own mempool and REQ_ALLOCED isn't set in clone->cmd_flags. 852 + */ 853 + __blk_put_request(clone->q, clone); 854 + 855 + /* 856 + * Actual request completion is done in a softirq context which doesn't 857 + * hold the queue lock. Otherwise, deadlock could occur because: 858 + * - another request may be submitted by the upper level driver 859 + * of the stacking during the completion 860 + * - the submission which requires queue lock may be done 861 + * against this queue 862 + */ 863 + dm_complete_request(clone, error); 864 + } 865 + 662 866 static sector_t max_io_len(struct mapped_device *md, 663 867 sector_t sector, struct dm_target *ti) 664 868 { ··· 941 633 int r; 942 634 sector_t sector; 943 635 struct mapped_device *md; 944 - 945 - /* 946 - * Sanity checks. 
947 - */ 948 - BUG_ON(!clone->bi_size); 949 636 950 637 clone->bi_end_io = clone_endio; 951 638 clone->bi_private = tio; ··· 1055 752 return clone; 1056 753 } 1057 754 755 + static struct dm_target_io *alloc_tio(struct clone_info *ci, 756 + struct dm_target *ti) 757 + { 758 + struct dm_target_io *tio = mempool_alloc(ci->md->tio_pool, GFP_NOIO); 759 + 760 + tio->io = ci->io; 761 + tio->ti = ti; 762 + memset(&tio->info, 0, sizeof(tio->info)); 763 + 764 + return tio; 765 + } 766 + 767 + static void __flush_target(struct clone_info *ci, struct dm_target *ti, 768 + unsigned flush_nr) 769 + { 770 + struct dm_target_io *tio = alloc_tio(ci, ti); 771 + struct bio *clone; 772 + 773 + tio->info.flush_request = flush_nr; 774 + 775 + clone = bio_alloc_bioset(GFP_NOIO, 0, ci->md->bs); 776 + __bio_clone(clone, ci->bio); 777 + clone->bi_destructor = dm_bio_destructor; 778 + 779 + __map_bio(ti, clone, tio); 780 + } 781 + 782 + static int __clone_and_map_empty_barrier(struct clone_info *ci) 783 + { 784 + unsigned target_nr = 0, flush_nr; 785 + struct dm_target *ti; 786 + 787 + while ((ti = dm_table_get_target(ci->map, target_nr++))) 788 + for (flush_nr = 0; flush_nr < ti->num_flush_requests; 789 + flush_nr++) 790 + __flush_target(ci, ti, flush_nr); 791 + 792 + ci->sector_count = 0; 793 + 794 + return 0; 795 + } 796 + 1058 797 static int __clone_and_map(struct clone_info *ci) 1059 798 { 1060 799 struct bio *clone, *bio = ci->bio; 1061 800 struct dm_target *ti; 1062 801 sector_t len = 0, max; 1063 802 struct dm_target_io *tio; 803 + 804 + if (unlikely(bio_empty_barrier(bio))) 805 + return __clone_and_map_empty_barrier(ci); 1064 806 1065 807 ti = dm_table_find_target(ci->map, ci->sector); 1066 808 if (!dm_target_is_valid(ti)) ··· 1116 768 /* 1117 769 * Allocate a target io object. 1118 770 */ 1119 - tio = alloc_tio(ci->md); 1120 - tio->io = ci->io; 1121 - tio->ti = ti; 1122 - memset(&tio->info, 0, sizeof(tio->info)); 771 + tio = alloc_tio(ci, ti); 1123 772 1124 773 if (ci->sector_count <= max) { 1125 774 /* ··· 1172 827 1173 828 max = max_io_len(ci->md, ci->sector, ti); 1174 829 1175 - tio = alloc_tio(ci->md); 1176 - tio->io = ci->io; 1177 - tio->ti = ti; 1178 - memset(&tio->info, 0, sizeof(tio->info)); 830 + tio = alloc_tio(ci, ti); 1179 831 } 1180 832 1181 833 len = min(remaining, max); ··· 1207 865 if (!bio_barrier(bio)) 1208 866 bio_io_error(bio); 1209 867 else 1210 - md->barrier_error = -EIO; 868 + if (!md->barrier_error) 869 + md->barrier_error = -EIO; 1211 870 return; 1212 871 } 1213 872 ··· 1221 878 ci.io->md = md; 1222 879 ci.sector = bio->bi_sector; 1223 880 ci.sector_count = bio_sectors(bio); 881 + if (unlikely(bio_empty_barrier(bio))) 882 + ci.sector_count = 1; 1224 883 ci.idx = bio->bi_idx; 1225 884 1226 885 start_io_acct(ci.io); ··· 1270 925 */ 1271 926 if (max_size && ti->type->merge) 1272 927 max_size = ti->type->merge(ti, bvm, biovec, max_size); 928 + /* 929 + * If the target doesn't support merge method and some of the devices 930 + * provided their merge_bvec method (we know this by looking at 931 + * queue_max_hw_sectors), then we can't allow bios with multiple vector 932 + * entries. So always set max_size to 0, and the code below allows 933 + * just one page. 934 + */ 935 + else if (queue_max_hw_sectors(q) <= PAGE_SIZE >> 9) 936 + 937 + max_size = 0; 1273 938 1274 939 out_table: 1275 940 dm_table_put(map); ··· 1298 943 * The request function that just remaps the bio built up by 1299 944 * dm_merge_bvec. 
1300 945 */ 1301 - static int dm_request(struct request_queue *q, struct bio *bio) 946 + static int _dm_request(struct request_queue *q, struct bio *bio) 1302 947 { 1303 948 int rw = bio_data_dir(bio); 1304 949 struct mapped_device *md = q->queuedata; ··· 1335 980 return 0; 1336 981 } 1337 982 983 + static int dm_make_request(struct request_queue *q, struct bio *bio) 984 + { 985 + struct mapped_device *md = q->queuedata; 986 + 987 + if (unlikely(bio_barrier(bio))) { 988 + bio_endio(bio, -EOPNOTSUPP); 989 + return 0; 990 + } 991 + 992 + return md->saved_make_request_fn(q, bio); /* call __make_request() */ 993 + } 994 + 995 + static int dm_request_based(struct mapped_device *md) 996 + { 997 + return blk_queue_stackable(md->queue); 998 + } 999 + 1000 + static int dm_request(struct request_queue *q, struct bio *bio) 1001 + { 1002 + struct mapped_device *md = q->queuedata; 1003 + 1004 + if (dm_request_based(md)) 1005 + return dm_make_request(q, bio); 1006 + 1007 + return _dm_request(q, bio); 1008 + } 1009 + 1010 + void dm_dispatch_request(struct request *rq) 1011 + { 1012 + int r; 1013 + 1014 + if (blk_queue_io_stat(rq->q)) 1015 + rq->cmd_flags |= REQ_IO_STAT; 1016 + 1017 + rq->start_time = jiffies; 1018 + r = blk_insert_cloned_request(rq->q, rq); 1019 + if (r) 1020 + dm_complete_request(rq, r); 1021 + } 1022 + EXPORT_SYMBOL_GPL(dm_dispatch_request); 1023 + 1024 + static void dm_rq_bio_destructor(struct bio *bio) 1025 + { 1026 + struct dm_rq_clone_bio_info *info = bio->bi_private; 1027 + struct mapped_device *md = info->tio->md; 1028 + 1029 + free_bio_info(info); 1030 + bio_free(bio, md->bs); 1031 + } 1032 + 1033 + static int dm_rq_bio_constructor(struct bio *bio, struct bio *bio_orig, 1034 + void *data) 1035 + { 1036 + struct dm_rq_target_io *tio = data; 1037 + struct mapped_device *md = tio->md; 1038 + struct dm_rq_clone_bio_info *info = alloc_bio_info(md); 1039 + 1040 + if (!info) 1041 + return -ENOMEM; 1042 + 1043 + info->orig = bio_orig; 1044 + info->tio = tio; 1045 + bio->bi_end_io = end_clone_bio; 1046 + bio->bi_private = info; 1047 + bio->bi_destructor = dm_rq_bio_destructor; 1048 + 1049 + return 0; 1050 + } 1051 + 1052 + static int setup_clone(struct request *clone, struct request *rq, 1053 + struct dm_rq_target_io *tio) 1054 + { 1055 + int r = blk_rq_prep_clone(clone, rq, tio->md->bs, GFP_ATOMIC, 1056 + dm_rq_bio_constructor, tio); 1057 + 1058 + if (r) 1059 + return r; 1060 + 1061 + clone->cmd = rq->cmd; 1062 + clone->cmd_len = rq->cmd_len; 1063 + clone->sense = rq->sense; 1064 + clone->buffer = rq->buffer; 1065 + clone->end_io = end_clone_request; 1066 + clone->end_io_data = tio; 1067 + 1068 + return 0; 1069 + } 1070 + 1071 + static int dm_rq_flush_suspending(struct mapped_device *md) 1072 + { 1073 + return !md->suspend_rq.special; 1074 + } 1075 + 1076 + /* 1077 + * Called with the queue lock held. 
1078 + */ 1079 + static int dm_prep_fn(struct request_queue *q, struct request *rq) 1080 + { 1081 + struct mapped_device *md = q->queuedata; 1082 + struct dm_rq_target_io *tio; 1083 + struct request *clone; 1084 + 1085 + if (unlikely(rq == &md->suspend_rq)) { 1086 + if (dm_rq_flush_suspending(md)) 1087 + return BLKPREP_OK; 1088 + else 1089 + /* The flush suspend was interrupted */ 1090 + return BLKPREP_KILL; 1091 + } 1092 + 1093 + if (unlikely(rq->special)) { 1094 + DMWARN("Already has something in rq->special."); 1095 + return BLKPREP_KILL; 1096 + } 1097 + 1098 + tio = alloc_rq_tio(md); /* Only one for each original request */ 1099 + if (!tio) 1100 + /* -ENOMEM */ 1101 + return BLKPREP_DEFER; 1102 + 1103 + tio->md = md; 1104 + tio->ti = NULL; 1105 + tio->orig = rq; 1106 + tio->error = 0; 1107 + memset(&tio->info, 0, sizeof(tio->info)); 1108 + 1109 + clone = &tio->clone; 1110 + if (setup_clone(clone, rq, tio)) { 1111 + /* -ENOMEM */ 1112 + free_rq_tio(tio); 1113 + return BLKPREP_DEFER; 1114 + } 1115 + 1116 + rq->special = clone; 1117 + rq->cmd_flags |= REQ_DONTPREP; 1118 + 1119 + return BLKPREP_OK; 1120 + } 1121 + 1122 + static void map_request(struct dm_target *ti, struct request *rq, 1123 + struct mapped_device *md) 1124 + { 1125 + int r; 1126 + struct request *clone = rq->special; 1127 + struct dm_rq_target_io *tio = clone->end_io_data; 1128 + 1129 + /* 1130 + * Hold the md reference here for the in-flight I/O. 1131 + * We can't rely on the reference count by device opener, 1132 + * because the device may be closed during the request completion 1133 + * when all bios are completed. 1134 + * See the comment in rq_completed() too. 1135 + */ 1136 + dm_get(md); 1137 + 1138 + tio->ti = ti; 1139 + r = ti->type->map_rq(ti, clone, &tio->info); 1140 + switch (r) { 1141 + case DM_MAPIO_SUBMITTED: 1142 + /* The target has taken the I/O to submit by itself later */ 1143 + break; 1144 + case DM_MAPIO_REMAPPED: 1145 + /* The target has remapped the I/O so dispatch it */ 1146 + dm_dispatch_request(clone); 1147 + break; 1148 + case DM_MAPIO_REQUEUE: 1149 + /* The target wants to requeue the I/O */ 1150 + dm_requeue_unmapped_request(clone); 1151 + break; 1152 + default: 1153 + if (r > 0) { 1154 + DMWARN("unimplemented target map return value: %d", r); 1155 + BUG(); 1156 + } 1157 + 1158 + /* The target wants to complete the I/O */ 1159 + dm_kill_unmapped_request(clone, r); 1160 + break; 1161 + } 1162 + } 1163 + 1164 + /* 1165 + * q->request_fn for request-based dm. 1166 + * Called with the queue lock held. 1167 + */ 1168 + static void dm_request_fn(struct request_queue *q) 1169 + { 1170 + struct mapped_device *md = q->queuedata; 1171 + struct dm_table *map = dm_get_table(md); 1172 + struct dm_target *ti; 1173 + struct request *rq; 1174 + 1175 + /* 1176 + * For noflush suspend, check blk_queue_stopped() to immediately 1177 + * quit I/O dispatching. 1178 + */ 1179 + while (!blk_queue_plugged(q) && !blk_queue_stopped(q)) { 1180 + rq = blk_peek_request(q); 1181 + if (!rq) 1182 + goto plug_and_out; 1183 + 1184 + if (unlikely(rq == &md->suspend_rq)) { /* Flush suspend maker */ 1185 + if (queue_in_flight(q)) 1186 + /* Not quiet yet. 
Wait more */ 1187 + goto plug_and_out; 1188 + 1189 + /* This device should be quiet now */ 1190 + __stop_queue(q); 1191 + blk_start_request(rq); 1192 + __blk_end_request_all(rq, 0); 1193 + wake_up(&md->wait); 1194 + goto out; 1195 + } 1196 + 1197 + ti = dm_table_find_target(map, blk_rq_pos(rq)); 1198 + if (ti->type->busy && ti->type->busy(ti)) 1199 + goto plug_and_out; 1200 + 1201 + blk_start_request(rq); 1202 + spin_unlock(q->queue_lock); 1203 + map_request(ti, rq, md); 1204 + spin_lock_irq(q->queue_lock); 1205 + } 1206 + 1207 + goto out; 1208 + 1209 + plug_and_out: 1210 + if (!elv_queue_empty(q)) 1211 + /* Some requests still remain, retry later */ 1212 + blk_plug_device(q); 1213 + 1214 + out: 1215 + dm_table_put(map); 1216 + 1217 + return; 1218 + } 1219 + 1220 + int dm_underlying_device_busy(struct request_queue *q) 1221 + { 1222 + return blk_lld_busy(q); 1223 + } 1224 + EXPORT_SYMBOL_GPL(dm_underlying_device_busy); 1225 + 1226 + static int dm_lld_busy(struct request_queue *q) 1227 + { 1228 + int r; 1229 + struct mapped_device *md = q->queuedata; 1230 + struct dm_table *map = dm_get_table(md); 1231 + 1232 + if (!map || test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) 1233 + r = 1; 1234 + else 1235 + r = dm_table_any_busy_target(map); 1236 + 1237 + dm_table_put(map); 1238 + 1239 + return r; 1240 + } 1241 + 1338 1242 static void dm_unplug_all(struct request_queue *q) 1339 1243 { 1340 1244 struct mapped_device *md = q->queuedata; 1341 1245 struct dm_table *map = dm_get_table(md); 1342 1246 1343 1247 if (map) { 1248 + if (dm_request_based(md)) 1249 + generic_unplug_device(q); 1250 + 1344 1251 dm_table_unplug_all(map); 1345 1252 dm_table_put(map); 1346 1253 } ··· 1617 1000 if (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) { 1618 1001 map = dm_get_table(md); 1619 1002 if (map) { 1620 - r = dm_table_any_congested(map, bdi_bits); 1003 + /* 1004 + * Request-based dm cares about only own queue for 1005 + * the query about congestion status of request_queue 1006 + */ 1007 + if (dm_request_based(md)) 1008 + r = md->queue->backing_dev_info.state & 1009 + bdi_bits; 1010 + else 1011 + r = dm_table_any_congested(map, bdi_bits); 1012 + 1621 1013 dm_table_put(map); 1622 1014 } 1623 1015 } ··· 1749 1123 INIT_LIST_HEAD(&md->uevent_list); 1750 1124 spin_lock_init(&md->uevent_lock); 1751 1125 1752 - md->queue = blk_alloc_queue(GFP_KERNEL); 1126 + md->queue = blk_init_queue(dm_request_fn, NULL); 1753 1127 if (!md->queue) 1754 1128 goto bad_queue; 1755 1129 1130 + /* 1131 + * Request-based dm devices cannot be stacked on top of bio-based dm 1132 + * devices. The type of this dm device has not been decided yet, 1133 + * although we initialized the queue using blk_init_queue(). 1134 + * The type is decided at the first table loading time. 1135 + * To prevent problematic device stacking, clear the queue flag 1136 + * for request stacking support until then. 1137 + * 1138 + * This queue is new, so no concurrency on the queue_flags. 
1139 + */ 1140 + queue_flag_clear_unlocked(QUEUE_FLAG_STACKABLE, md->queue); 1141 + md->saved_make_request_fn = md->queue->make_request_fn; 1756 1142 md->queue->queuedata = md; 1757 1143 md->queue->backing_dev_info.congested_fn = dm_any_congested; 1758 1144 md->queue->backing_dev_info.congested_data = md; 1759 1145 blk_queue_make_request(md->queue, dm_request); 1760 - blk_queue_ordered(md->queue, QUEUE_ORDERED_DRAIN, NULL); 1761 1146 blk_queue_bounce_limit(md->queue, BLK_BOUNCE_ANY); 1762 1147 md->queue->unplug_fn = dm_unplug_all; 1763 1148 blk_queue_merge_bvec(md->queue, dm_merge_bvec); 1764 - 1765 - md->io_pool = mempool_create_slab_pool(MIN_IOS, _io_cache); 1766 - if (!md->io_pool) 1767 - goto bad_io_pool; 1768 - 1769 - md->tio_pool = mempool_create_slab_pool(MIN_IOS, _tio_cache); 1770 - if (!md->tio_pool) 1771 - goto bad_tio_pool; 1772 - 1773 - md->bs = bioset_create(16, 0); 1774 - if (!md->bs) 1775 - goto bad_no_bioset; 1149 + blk_queue_softirq_done(md->queue, dm_softirq_done); 1150 + blk_queue_prep_rq(md->queue, dm_prep_fn); 1151 + blk_queue_lld_busy(md->queue, dm_lld_busy); 1776 1152 1777 1153 md->disk = alloc_disk(1); 1778 1154 if (!md->disk) ··· 1798 1170 if (!md->wq) 1799 1171 goto bad_thread; 1800 1172 1173 + md->bdev = bdget_disk(md->disk, 0); 1174 + if (!md->bdev) 1175 + goto bad_bdev; 1176 + 1801 1177 /* Populate the mapping, nobody knows we exist yet */ 1802 1178 spin_lock(&_minor_lock); 1803 1179 old_md = idr_replace(&_minor_idr, md, minor); ··· 1811 1179 1812 1180 return md; 1813 1181 1182 + bad_bdev: 1183 + destroy_workqueue(md->wq); 1814 1184 bad_thread: 1815 1185 put_disk(md->disk); 1816 1186 bad_disk: 1817 - bioset_free(md->bs); 1818 - bad_no_bioset: 1819 - mempool_destroy(md->tio_pool); 1820 - bad_tio_pool: 1821 - mempool_destroy(md->io_pool); 1822 - bad_io_pool: 1823 1187 blk_cleanup_queue(md->queue); 1824 1188 bad_queue: 1825 1189 free_minor(minor); ··· 1832 1204 { 1833 1205 int minor = MINOR(disk_devt(md->disk)); 1834 1206 1835 - if (md->suspended_bdev) { 1836 - unlock_fs(md); 1837 - bdput(md->suspended_bdev); 1838 - } 1207 + unlock_fs(md); 1208 + bdput(md->bdev); 1839 1209 destroy_workqueue(md->wq); 1840 - mempool_destroy(md->tio_pool); 1841 - mempool_destroy(md->io_pool); 1842 - bioset_free(md->bs); 1210 + if (md->tio_pool) 1211 + mempool_destroy(md->tio_pool); 1212 + if (md->io_pool) 1213 + mempool_destroy(md->io_pool); 1214 + if (md->bs) 1215 + bioset_free(md->bs); 1843 1216 blk_integrity_unregister(md->disk); 1844 1217 del_gendisk(md->disk); 1845 1218 free_minor(minor); ··· 1853 1224 blk_cleanup_queue(md->queue); 1854 1225 module_put(THIS_MODULE); 1855 1226 kfree(md); 1227 + } 1228 + 1229 + static void __bind_mempools(struct mapped_device *md, struct dm_table *t) 1230 + { 1231 + struct dm_md_mempools *p; 1232 + 1233 + if (md->io_pool && md->tio_pool && md->bs) 1234 + /* the md already has necessary mempools */ 1235 + goto out; 1236 + 1237 + p = dm_table_get_md_mempools(t); 1238 + BUG_ON(!p || md->io_pool || md->tio_pool || md->bs); 1239 + 1240 + md->io_pool = p->io_pool; 1241 + p->io_pool = NULL; 1242 + md->tio_pool = p->tio_pool; 1243 + p->tio_pool = NULL; 1244 + md->bs = p->bs; 1245 + p->bs = NULL; 1246 + 1247 + out: 1248 + /* mempool bind completed, now no need any mempools in the table */ 1249 + dm_table_free_md_mempools(t); 1856 1250 } 1857 1251 1858 1252 /* ··· 1901 1249 { 1902 1250 set_capacity(md->disk, size); 1903 1251 1904 - mutex_lock(&md->suspended_bdev->bd_inode->i_mutex); 1905 - i_size_write(md->suspended_bdev->bd_inode, (loff_t)size << 
SECTOR_SHIFT); 1906 - mutex_unlock(&md->suspended_bdev->bd_inode->i_mutex); 1252 + mutex_lock(&md->bdev->bd_inode->i_mutex); 1253 + i_size_write(md->bdev->bd_inode, (loff_t)size << SECTOR_SHIFT); 1254 + mutex_unlock(&md->bdev->bd_inode->i_mutex); 1907 1255 } 1908 1256 1909 - static int __bind(struct mapped_device *md, struct dm_table *t) 1257 + static int __bind(struct mapped_device *md, struct dm_table *t, 1258 + struct queue_limits *limits) 1910 1259 { 1911 1260 struct request_queue *q = md->queue; 1912 1261 sector_t size; 1262 + unsigned long flags; 1913 1263 1914 1264 size = dm_table_get_size(t); 1915 1265 ··· 1921 1267 if (size != get_capacity(md->disk)) 1922 1268 memset(&md->geometry, 0, sizeof(md->geometry)); 1923 1269 1924 - if (md->suspended_bdev) 1925 - __set_size(md, size); 1270 + __set_size(md, size); 1926 1271 1927 1272 if (!size) { 1928 1273 dm_table_destroy(t); ··· 1930 1277 1931 1278 dm_table_event_callback(t, event_callback, md); 1932 1279 1933 - write_lock(&md->map_lock); 1280 + /* 1281 + * The queue hasn't been stopped yet, if the old table type wasn't 1282 + * for request-based during suspension. So stop it to prevent 1283 + * I/O mapping before resume. 1284 + * This must be done before setting the queue restrictions, 1285 + * because request-based dm may be run just after the setting. 1286 + */ 1287 + if (dm_table_request_based(t) && !blk_queue_stopped(q)) 1288 + stop_queue(q); 1289 + 1290 + __bind_mempools(md, t); 1291 + 1292 + write_lock_irqsave(&md->map_lock, flags); 1934 1293 md->map = t; 1935 - dm_table_set_restrictions(t, q); 1936 - write_unlock(&md->map_lock); 1294 + dm_table_set_restrictions(t, q, limits); 1295 + write_unlock_irqrestore(&md->map_lock, flags); 1937 1296 1938 1297 return 0; 1939 1298 } ··· 1953 1288 static void __unbind(struct mapped_device *md) 1954 1289 { 1955 1290 struct dm_table *map = md->map; 1291 + unsigned long flags; 1956 1292 1957 1293 if (!map) 1958 1294 return; 1959 1295 1960 1296 dm_table_event_callback(map, NULL, NULL); 1961 - write_lock(&md->map_lock); 1297 + write_lock_irqsave(&md->map_lock, flags); 1962 1298 md->map = NULL; 1963 - write_unlock(&md->map_lock); 1299 + write_unlock_irqrestore(&md->map_lock, flags); 1964 1300 dm_table_destroy(map); 1965 1301 } 1966 1302 ··· 2065 1399 { 2066 1400 int r = 0; 2067 1401 DECLARE_WAITQUEUE(wait, current); 1402 + struct request_queue *q = md->queue; 1403 + unsigned long flags; 2068 1404 2069 1405 dm_unplug_all(md->queue); 2070 1406 ··· 2076 1408 set_current_state(interruptible); 2077 1409 2078 1410 smp_mb(); 2079 - if (!atomic_read(&md->pending)) 1411 + if (dm_request_based(md)) { 1412 + spin_lock_irqsave(q->queue_lock, flags); 1413 + if (!queue_in_flight(q) && blk_queue_stopped(q)) { 1414 + spin_unlock_irqrestore(q->queue_lock, flags); 1415 + break; 1416 + } 1417 + spin_unlock_irqrestore(q->queue_lock, flags); 1418 + } else if (!atomic_read(&md->pending)) 2080 1419 break; 2081 1420 2082 1421 if (interruptible == TASK_INTERRUPTIBLE && ··· 2101 1426 return r; 2102 1427 } 2103 1428 2104 - static int dm_flush(struct mapped_device *md) 1429 + static void dm_flush(struct mapped_device *md) 2105 1430 { 2106 1431 dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE); 2107 - return 0; 1432 + 1433 + bio_init(&md->barrier_bio); 1434 + md->barrier_bio.bi_bdev = md->bdev; 1435 + md->barrier_bio.bi_rw = WRITE_BARRIER; 1436 + __split_and_process_bio(md, &md->barrier_bio); 1437 + 1438 + dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE); 2108 1439 } 2109 1440 2110 1441 static void process_barrier(struct 
mapped_device *md, struct bio *bio) 2111 1442 { 2112 - int error = dm_flush(md); 1443 + md->barrier_error = 0; 2113 1444 2114 - if (unlikely(error)) { 2115 - bio_endio(bio, error); 2116 - return; 1445 + dm_flush(md); 1446 + 1447 + if (!bio_empty_barrier(bio)) { 1448 + __split_and_process_bio(md, bio); 1449 + dm_flush(md); 2117 1450 } 2118 - if (bio_empty_barrier(bio)) { 2119 - bio_endio(bio, 0); 2120 - return; 2121 - } 2122 - 2123 - __split_and_process_bio(md, bio); 2124 - 2125 - error = dm_flush(md); 2126 - 2127 - if (!error && md->barrier_error) 2128 - error = md->barrier_error; 2129 1451 2130 1452 if (md->barrier_error != DM_ENDIO_REQUEUE) 2131 - bio_endio(bio, error); 1453 + bio_endio(bio, md->barrier_error); 1454 + else { 1455 + spin_lock_irq(&md->deferred_lock); 1456 + bio_list_add_head(&md->deferred, bio); 1457 + spin_unlock_irq(&md->deferred_lock); 1458 + } 2132 1459 } 2133 1460 2134 1461 /* ··· 2156 1479 2157 1480 up_write(&md->io_lock); 2158 1481 2159 - if (bio_barrier(c)) 2160 - process_barrier(md, c); 2161 - else 2162 - __split_and_process_bio(md, c); 1482 + if (dm_request_based(md)) 1483 + generic_make_request(c); 1484 + else { 1485 + if (bio_barrier(c)) 1486 + process_barrier(md, c); 1487 + else 1488 + __split_and_process_bio(md, c); 1489 + } 2163 1490 2164 1491 down_write(&md->io_lock); 2165 1492 } ··· 2183 1502 */ 2184 1503 int dm_swap_table(struct mapped_device *md, struct dm_table *table) 2185 1504 { 1505 + struct queue_limits limits; 2186 1506 int r = -EINVAL; 2187 1507 2188 1508 mutex_lock(&md->suspend_lock); ··· 2192 1510 if (!dm_suspended(md)) 2193 1511 goto out; 2194 1512 2195 - /* without bdev, the device size cannot be changed */ 2196 - if (!md->suspended_bdev) 2197 - if (get_capacity(md->disk) != dm_table_get_size(table)) 2198 - goto out; 1513 + r = dm_calculate_queue_limits(table, &limits); 1514 + if (r) 1515 + goto out; 1516 + 1517 + /* cannot change the device type, once a table is bound */ 1518 + if (md->map && 1519 + (dm_table_get_type(md->map) != dm_table_get_type(table))) { 1520 + DMWARN("can't change the device type after a table is bound"); 1521 + goto out; 1522 + } 1523 + 1524 + /* 1525 + * It is enought that blk_queue_ordered() is called only once when 1526 + * the first bio-based table is bound. 1527 + * 1528 + * This setting should be moved to alloc_dev() when request-based dm 1529 + * supports barrier. 
1530 + */ 1531 + if (!md->map && dm_table_bio_based(table)) 1532 + blk_queue_ordered(md->queue, QUEUE_ORDERED_DRAIN, NULL); 2199 1533 2200 1534 __unbind(md); 2201 - r = __bind(md, table); 1535 + r = __bind(md, table, &limits); 2202 1536 2203 1537 out: 2204 1538 mutex_unlock(&md->suspend_lock); 1539 + return r; 1540 + } 1541 + 1542 + static void dm_rq_invalidate_suspend_marker(struct mapped_device *md) 1543 + { 1544 + md->suspend_rq.special = (void *)0x1; 1545 + } 1546 + 1547 + static void dm_rq_abort_suspend(struct mapped_device *md, int noflush) 1548 + { 1549 + struct request_queue *q = md->queue; 1550 + unsigned long flags; 1551 + 1552 + spin_lock_irqsave(q->queue_lock, flags); 1553 + if (!noflush) 1554 + dm_rq_invalidate_suspend_marker(md); 1555 + __start_queue(q); 1556 + spin_unlock_irqrestore(q->queue_lock, flags); 1557 + } 1558 + 1559 + static void dm_rq_start_suspend(struct mapped_device *md, int noflush) 1560 + { 1561 + struct request *rq = &md->suspend_rq; 1562 + struct request_queue *q = md->queue; 1563 + 1564 + if (noflush) 1565 + stop_queue(q); 1566 + else { 1567 + blk_rq_init(q, rq); 1568 + blk_insert_request(q, rq, 0, NULL); 1569 + } 1570 + } 1571 + 1572 + static int dm_rq_suspend_available(struct mapped_device *md, int noflush) 1573 + { 1574 + int r = 1; 1575 + struct request *rq = &md->suspend_rq; 1576 + struct request_queue *q = md->queue; 1577 + unsigned long flags; 1578 + 1579 + if (noflush) 1580 + return r; 1581 + 1582 + /* The marker must be protected by queue lock if it is in use */ 1583 + spin_lock_irqsave(q->queue_lock, flags); 1584 + if (unlikely(rq->ref_count)) { 1585 + /* 1586 + * This can happen, when the previous flush suspend was 1587 + * interrupted, the marker is still in the queue and 1588 + * this flush suspend has been invoked, because we don't 1589 + * remove the marker at the time of suspend interruption. 1590 + * We have only one marker per mapped_device, so we can't 1591 + * start another flush suspend while it is in use. 1592 + */ 1593 + BUG_ON(!rq->special); /* The marker should be invalidated */ 1594 + DMWARN("Invalidating the previous flush suspend is still in" 1595 + " progress. Please retry later."); 1596 + r = 0; 1597 + } 1598 + spin_unlock_irqrestore(q->queue_lock, flags); 1599 + 2205 1600 return r; 2206 1601 } 2207 1602 ··· 2292 1533 2293 1534 WARN_ON(md->frozen_sb); 2294 1535 2295 - md->frozen_sb = freeze_bdev(md->suspended_bdev); 1536 + md->frozen_sb = freeze_bdev(md->bdev); 2296 1537 if (IS_ERR(md->frozen_sb)) { 2297 1538 r = PTR_ERR(md->frozen_sb); 2298 1539 md->frozen_sb = NULL; ··· 2301 1542 2302 1543 set_bit(DMF_FROZEN, &md->flags); 2303 1544 2304 - /* don't bdput right now, we don't want the bdev 2305 - * to go away while it is locked. 2306 - */ 2307 1545 return 0; 2308 1546 } 2309 1547 ··· 2309 1553 if (!test_bit(DMF_FROZEN, &md->flags)) 2310 1554 return; 2311 1555 2312 - thaw_bdev(md->suspended_bdev, md->frozen_sb); 1556 + thaw_bdev(md->bdev, md->frozen_sb); 2313 1557 md->frozen_sb = NULL; 2314 1558 clear_bit(DMF_FROZEN, &md->flags); 2315 1559 } ··· 2320 1564 * the background. Before the table can be swapped with 2321 1565 * dm_bind_table, dm_suspend must be called to flush any in 2322 1566 * flight bios and ensure that any further io gets deferred. 1567 + */ 1568 + /* 1569 + * Suspend mechanism in request-based dm. 1570 + * 1571 + * After the suspend starts, further incoming requests are kept in 1572 + * the request_queue and deferred. 
1573 + * Remaining requests in the request_queue at the start of suspend are flushed 1574 + * if it is flush suspend. 1575 + * The suspend completes when the following conditions have been satisfied, 1576 + * so wait for it: 1577 + * 1. q->in_flight is 0 (which means no in_flight request) 1578 + * 2. queue has been stopped (which means no request dispatching) 1579 + * 1580 + * 1581 + * Noflush suspend 1582 + * --------------- 1583 + * Noflush suspend doesn't need to dispatch remaining requests. 1584 + * So stop the queue immediately. Then, wait for all in_flight requests 1585 + * to be completed or requeued. 1586 + * 1587 + * To abort noflush suspend, start the queue. 1588 + * 1589 + * 1590 + * Flush suspend 1591 + * ------------- 1592 + * Flush suspend needs to dispatch remaining requests. So stop the queue 1593 + * after the remaining requests are completed. (Requeued request must be also 1594 + * re-dispatched and completed. Until then, we can't stop the queue.) 1595 + * 1596 + * During flushing the remaining requests, further incoming requests are also 1597 + * inserted to the same queue. To distinguish which requests are to be 1598 + * flushed, we insert a marker request to the queue at the time of starting 1599 + * flush suspend, like a barrier. 1600 + * The dispatching is blocked when the marker is found on the top of the queue. 1601 + * And the queue is stopped when all in_flight requests are completed, since 1602 + * that means the remaining requests are completely flushed. 1603 + * Then, the marker is removed from the queue. 1604 + * 1605 + * To abort flush suspend, we also need to take care of the marker, not only 1606 + * starting the queue. 1607 + * We don't remove the marker forcibly from the queue since it's against 1608 + * the block-layer manner. Instead, we put a invalidated mark on the marker. 1609 + * When the invalidated marker is found on the top of the queue, it is 1610 + * immediately removed from the queue, so it doesn't block dispatching. 1611 + * Because we have only one marker per mapped_device, we can't start another 1612 + * flush suspend until the invalidated marker is removed from the queue. 1613 + * So fail and return with -EBUSY in such a case. 2323 1614 */ 2324 1615 int dm_suspend(struct mapped_device *md, unsigned suspend_flags) 2325 1616 { ··· 2382 1579 goto out_unlock; 2383 1580 } 2384 1581 1582 + if (dm_request_based(md) && !dm_rq_suspend_available(md, noflush)) { 1583 + r = -EBUSY; 1584 + goto out_unlock; 1585 + } 1586 + 2385 1587 map = dm_get_table(md); 2386 1588 2387 1589 /* ··· 2399 1591 /* This does not get reverted if there's an error later. */ 2400 1592 dm_table_presuspend_targets(map); 2401 1593 2402 - /* bdget() can stall if the pending I/Os are not flushed */ 2403 - if (!noflush) { 2404 - md->suspended_bdev = bdget_disk(md->disk, 0); 2405 - if (!md->suspended_bdev) { 2406 - DMWARN("bdget failed in dm_suspend"); 2407 - r = -ENOMEM; 1594 + /* 1595 + * Flush I/O to the device. noflush supersedes do_lockfs, 1596 + * because lock_fs() needs to flush I/Os. 1597 + */ 1598 + if (!noflush && do_lockfs) { 1599 + r = lock_fs(md); 1600 + if (r) 2408 1601 goto out; 2409 - } 2410 - 2411 - /* 2412 - * Flush I/O to the device. noflush supersedes do_lockfs, 2413 - * because lock_fs() needs to flush I/Os. 
2414 - */ 2415 - if (do_lockfs) { 2416 - r = lock_fs(md); 2417 - if (r) 2418 - goto out; 2419 - } 2420 1602 } 2421 1603 2422 1604 /* ··· 2432 1634 2433 1635 flush_workqueue(md->wq); 2434 1636 1637 + if (dm_request_based(md)) 1638 + dm_rq_start_suspend(md, noflush); 1639 + 2435 1640 /* 2436 1641 * At this point no more requests are entering target request routines. 2437 1642 * We call dm_wait_for_completion to wait for all existing requests ··· 2451 1650 if (r < 0) { 2452 1651 dm_queue_flush(md); 2453 1652 1653 + if (dm_request_based(md)) 1654 + dm_rq_abort_suspend(md, noflush); 1655 + 2454 1656 unlock_fs(md); 2455 1657 goto out; /* pushback list is already flushed, so skip flush */ 2456 1658 } ··· 2469 1665 set_bit(DMF_SUSPENDED, &md->flags); 2470 1666 2471 1667 out: 2472 - if (r && md->suspended_bdev) { 2473 - bdput(md->suspended_bdev); 2474 - md->suspended_bdev = NULL; 2475 - } 2476 - 2477 1668 dm_table_put(map); 2478 1669 2479 1670 out_unlock: ··· 2495 1696 2496 1697 dm_queue_flush(md); 2497 1698 2498 - unlock_fs(md); 1699 + /* 1700 + * Flushing deferred I/Os must be done after targets are resumed 1701 + * so that mapping of targets can work correctly. 1702 + * Request-based dm is queueing the deferred I/Os in its request_queue. 1703 + */ 1704 + if (dm_request_based(md)) 1705 + start_queue(md->queue); 2499 1706 2500 - if (md->suspended_bdev) { 2501 - bdput(md->suspended_bdev); 2502 - md->suspended_bdev = NULL; 2503 - } 1707 + unlock_fs(md); 2504 1708 2505 1709 clear_bit(DMF_SUSPENDED, &md->flags); 2506 1710 2507 1711 dm_table_unplug_all(map); 2508 - 2509 - dm_kobject_uevent(md); 2510 - 2511 1712 r = 0; 2512 - 2513 1713 out: 2514 1714 dm_table_put(map); 2515 1715 mutex_unlock(&md->suspend_lock); ··· 2519 1721 /*----------------------------------------------------------------- 2520 1722 * Event notification. 2521 1723 *---------------------------------------------------------------*/ 2522 - void dm_kobject_uevent(struct mapped_device *md) 1724 + void dm_kobject_uevent(struct mapped_device *md, enum kobject_action action, 1725 + unsigned cookie) 2523 1726 { 2524 - kobject_uevent(&disk_to_dev(md->disk)->kobj, KOBJ_CHANGE); 1727 + char udev_cookie[DM_COOKIE_LENGTH]; 1728 + char *envp[] = { udev_cookie, NULL }; 1729 + 1730 + if (!cookie) 1731 + kobject_uevent(&disk_to_dev(md->disk)->kobj, action); 1732 + else { 1733 + snprintf(udev_cookie, DM_COOKIE_LENGTH, "%s=%u", 1734 + DM_COOKIE_ENV_VAR_NAME, cookie); 1735 + kobject_uevent_env(&disk_to_dev(md->disk)->kobj, action, envp); 1736 + } 2525 1737 } 2526 1738 2527 1739 uint32_t dm_next_uevent_seq(struct mapped_device *md) ··· 2585 1777 if (&md->kobj != kobj) 2586 1778 return NULL; 2587 1779 1780 + if (test_bit(DMF_FREEING, &md->flags) || 1781 + test_bit(DMF_DELETING, &md->flags)) 1782 + return NULL; 1783 + 2588 1784 dm_get(md); 2589 1785 return md; 2590 1786 } ··· 2608 1796 return r; 2609 1797 } 2610 1798 EXPORT_SYMBOL_GPL(dm_noflush_suspending); 1799 + 1800 + struct dm_md_mempools *dm_alloc_md_mempools(unsigned type) 1801 + { 1802 + struct dm_md_mempools *pools = kmalloc(sizeof(*pools), GFP_KERNEL); 1803 + 1804 + if (!pools) 1805 + return NULL; 1806 + 1807 + pools->io_pool = (type == DM_TYPE_BIO_BASED) ? 1808 + mempool_create_slab_pool(MIN_IOS, _io_cache) : 1809 + mempool_create_slab_pool(MIN_IOS, _rq_bio_info_cache); 1810 + if (!pools->io_pool) 1811 + goto free_pools_and_out; 1812 + 1813 + pools->tio_pool = (type == DM_TYPE_BIO_BASED) ? 
1814 + mempool_create_slab_pool(MIN_IOS, _tio_cache) : 1815 + mempool_create_slab_pool(MIN_IOS, _rq_tio_cache); 1816 + if (!pools->tio_pool) 1817 + goto free_io_pool_and_out; 1818 + 1819 + pools->bs = (type == DM_TYPE_BIO_BASED) ? 1820 + bioset_create(16, 0) : bioset_create(MIN_IOS, 0); 1821 + if (!pools->bs) 1822 + goto free_tio_pool_and_out; 1823 + 1824 + return pools; 1825 + 1826 + free_tio_pool_and_out: 1827 + mempool_destroy(pools->tio_pool); 1828 + 1829 + free_io_pool_and_out: 1830 + mempool_destroy(pools->io_pool); 1831 + 1832 + free_pools_and_out: 1833 + kfree(pools); 1834 + 1835 + return NULL; 1836 + } 1837 + 1838 + void dm_free_md_mempools(struct dm_md_mempools *pools) 1839 + { 1840 + if (!pools) 1841 + return; 1842 + 1843 + if (pools->io_pool) 1844 + mempool_destroy(pools->io_pool); 1845 + 1846 + if (pools->tio_pool) 1847 + mempool_destroy(pools->tio_pool); 1848 + 1849 + if (pools->bs) 1850 + bioset_free(pools->bs); 1851 + 1852 + kfree(pools); 1853 + } 2611 1854 2612 1855 static struct block_device_operations dm_blk_dops = { 2613 1856 .open = dm_blk_open,
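
The request-based path added above defines a simple contract for target drivers: dm_prep_fn() builds the clone, map_request() hands it to the target's map_rq() hook, and the return value selects between dm_dispatch_request(), dm_requeue_unmapped_request() and dm_kill_unmapped_request(). The sketch below shows how a hypothetical single-device pass-through target might implement the two request-based hooks; the "ptex" name and its private context are illustrative assumptions, not part of this merge (dm-mpath is the real in-tree user of this interface).

#include <linux/module.h>
#include <linux/device-mapper.h>

/* Sketch only: 'struct ptex_c' and the "ptex" target are made up for
 * illustration; they are not part of this merge. */
struct ptex_c {
        struct dm_dev *dev;
};

static int ptex_map_rq(struct dm_target *ti, struct request *clone,
                       union map_info *map_context)
{
        struct ptex_c *pc = ti->private;
        struct block_device *bdev = pc->dev->bdev;

        /* Point the clone at the underlying queue; dm core then calls
         * dm_dispatch_request() -> blk_insert_cloned_request() for us. */
        clone->q = bdev_get_queue(bdev);
        clone->rq_disk = bdev->bd_disk;

        return DM_MAPIO_REMAPPED;
}

static int ptex_rq_end_io(struct dm_target *ti, struct request *clone,
                          int error, union map_info *map_context)
{
        /* Returning 0 or a negative errno lets dm_softirq_done() finish the
         * original request; DM_ENDIO_REQUEUE would requeue it instead. */
        return error;
}

static struct target_type ptex_target = {
        .name      = "ptex-example",            /* illustrative only */
        .version   = {0, 0, 1},
        .module    = THIS_MODULE,
        .map_rq    = ptex_map_rq,
        .rq_end_io = ptex_rq_end_io,
        /* .ctr/.dtr/.iterate_devices and dm_register_target() omitted */
};

Because map_rq is non-NULL, dm_target_request_based() in dm.h (next hunk) classifies a table built from such a target as request based.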
+33 -2
drivers/md/dm.h
··· 23 23 #define DM_SUSPEND_NOFLUSH_FLAG (1 << 1) 24 24 25 25 /* 26 + * Type of table and mapped_device's mempool 27 + */ 28 + #define DM_TYPE_NONE 0 29 + #define DM_TYPE_BIO_BASED 1 30 + #define DM_TYPE_REQUEST_BASED 2 31 + 32 + /* 26 33 * List of devices that a metadevice uses and should open/close. 27 34 */ 28 35 struct dm_dev_internal { ··· 39 32 }; 40 33 41 34 struct dm_table; 35 + struct dm_md_mempools; 42 36 43 37 /*----------------------------------------------------------------- 44 38 * Internal table functions. ··· 49 41 void (*fn)(void *), void *context); 50 42 struct dm_target *dm_table_get_target(struct dm_table *t, unsigned int index); 51 43 struct dm_target *dm_table_find_target(struct dm_table *t, sector_t sector); 52 - void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q); 44 + int dm_calculate_queue_limits(struct dm_table *table, 45 + struct queue_limits *limits); 46 + void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q, 47 + struct queue_limits *limits); 53 48 struct list_head *dm_table_get_devices(struct dm_table *t); 54 49 void dm_table_presuspend_targets(struct dm_table *t); 55 50 void dm_table_postsuspend_targets(struct dm_table *t); 56 51 int dm_table_resume_targets(struct dm_table *t); 57 52 int dm_table_any_congested(struct dm_table *t, int bdi_bits); 53 + int dm_table_any_busy_target(struct dm_table *t); 54 + int dm_table_set_type(struct dm_table *t); 55 + unsigned dm_table_get_type(struct dm_table *t); 56 + bool dm_table_bio_based(struct dm_table *t); 57 + bool dm_table_request_based(struct dm_table *t); 58 + int dm_table_alloc_md_mempools(struct dm_table *t); 59 + void dm_table_free_md_mempools(struct dm_table *t); 60 + struct dm_md_mempools *dm_table_get_md_mempools(struct dm_table *t); 58 61 59 62 /* 60 63 * To check the return value from dm_table_find_target(). 61 64 */ 62 65 #define dm_target_is_valid(t) ((t)->table) 66 + 67 + /* 68 + * To check whether the target type is request-based or not (bio-based). 69 + */ 70 + #define dm_target_request_based(t) ((t)->type->map_rq != NULL) 63 71 64 72 /*----------------------------------------------------------------- 65 73 * A registry of target types. ··· 116 92 int dm_open_count(struct mapped_device *md); 117 93 int dm_lock_for_deletion(struct mapped_device *md); 118 94 119 - void dm_kobject_uevent(struct mapped_device *md); 95 + void dm_kobject_uevent(struct mapped_device *md, enum kobject_action action, 96 + unsigned cookie); 120 97 121 98 int dm_kcopyd_init(void); 122 99 void dm_kcopyd_exit(void); 100 + 101 + /* 102 + * Mempool operations 103 + */ 104 + struct dm_md_mempools *dm_alloc_md_mempools(unsigned type); 105 + void dm_free_md_mempools(struct dm_md_mempools *pools); 123 106 124 107 #endif
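
dm_table_set_type()/dm_table_get_type() (implemented in drivers/md/dm-table.c, not shown in this hunk) enforce that a table is uniformly bio-based or uniformly request-based, since the two kinds need different queue setup and mempools. The following is only a simplified sketch of that rule using the helpers declared above; the real code additionally verifies that every underlying device queue is request-stackable.

#include "dm.h"         /* dm_table_get_target(), dm_target_request_based() */

static int example_check_type(struct dm_table *t)
{
        unsigned i, bio_based = 0, request_based = 0;

        for (i = 0; i < dm_table_get_num_targets(t); i++) {
                struct dm_target *tgt = dm_table_get_target(t, i);

                if (dm_target_request_based(tgt))
                        request_based = 1;
                else
                        bio_based = 1;

                if (bio_based && request_based) {
                        DMWARN("table mixes bio-based and request-based targets");
                        return -EINVAL;
                }
        }

        /* The caller would record DM_TYPE_BIO_BASED or DM_TYPE_REQUEST_BASED
         * and allocate the matching dm_md_mempools. */
        return 0;
}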
+1
include/linux/Kbuild
··· 57 57 header-y += dlm_device.h 58 58 header-y += dlm_netlink.h 59 59 header-y += dm-ioctl.h 60 + header-y += dm-log-userspace.h 60 61 header-y += dn.h 61 62 header-y += dqblk_xfs.h 62 63 header-y += efs_fs_sb.h
+3 -1
include/linux/connector.h
··· 41 41 #define CN_IDX_BB 0x5 /* BlackBoard, from the TSP GPL sampling framework */ 42 42 #define CN_DST_IDX 0x6 43 43 #define CN_DST_VAL 0x1 44 + #define CN_IDX_DM 0x7 /* Device Mapper */ 45 + #define CN_VAL_DM_USERSPACE_LOG 0x1 44 46 45 - #define CN_NETLINK_USERS 7 47 + #define CN_NETLINK_USERS 8 46 48 47 49 /* 48 50 * Maximum connector's message size.
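
CN_IDX_DM is both the connector callback index the kernel registers and the netlink multicast group a userspace log daemon joins (the bind/setsockopt sequence appears in dm-log-userspace.h below). A rough userspace sketch of receiving one connector message and checking that it carries a DM userspace-log request follows; the buffer handling and the assumption that connector frames its payloads as NLMSG_DONE messages are simplifications.

#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/connector.h>

/* Userspace sketch: 'fd' is the already-bound NETLINK_CONNECTOR socket. */
static struct cn_msg *recv_dm_cn_msg(int fd, char *buf, size_t len)
{
        struct nlmsghdr *nlh = (struct nlmsghdr *)buf;
        struct cn_msg *msg;
        ssize_t r = recv(fd, buf, len, 0);

        if (r < (ssize_t)NLMSG_LENGTH(sizeof(*msg)))
                return NULL;                    /* short read */

        if (nlh->nlmsg_type != NLMSG_DONE)
                return NULL;                    /* assumed connector framing */

        msg = (struct cn_msg *)NLMSG_DATA(nlh);
        if (msg->id.idx != CN_IDX_DM || msg->id.val != CN_VAL_DM_USERSPACE_LOG)
                return NULL;                    /* not a dm-log message */

        return msg;     /* msg->data holds a struct dm_ulog_request */
}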
+30 -17
include/linux/device-mapper.h
··· 11 11 #include <linux/bio.h> 12 12 #include <linux/blkdev.h> 13 13 14 + struct dm_dev; 14 15 struct dm_target; 15 16 struct dm_table; 16 17 struct mapped_device; ··· 22 21 union map_info { 23 22 void *ptr; 24 23 unsigned long long ll; 24 + unsigned flush_request; 25 25 }; 26 26 27 27 /* ··· 82 80 typedef int (*dm_merge_fn) (struct dm_target *ti, struct bvec_merge_data *bvm, 83 81 struct bio_vec *biovec, int max_size); 84 82 83 + typedef int (*iterate_devices_callout_fn) (struct dm_target *ti, 84 + struct dm_dev *dev, 85 + sector_t physical_start, 86 + void *data); 87 + 88 + typedef int (*dm_iterate_devices_fn) (struct dm_target *ti, 89 + iterate_devices_callout_fn fn, 90 + void *data); 91 + 85 92 /* 86 93 * Returns: 87 94 * 0: The target can handle the next I/O immediately. ··· 103 92 /* 104 93 * Combine device limits. 105 94 */ 106 - void dm_set_device_limits(struct dm_target *ti, struct block_device *bdev); 95 + int dm_set_device_limits(struct dm_target *ti, struct dm_dev *dev, 96 + sector_t start, void *data); 107 97 108 98 struct dm_dev { 109 99 struct block_device *bdev; ··· 150 138 dm_ioctl_fn ioctl; 151 139 dm_merge_fn merge; 152 140 dm_busy_fn busy; 141 + dm_iterate_devices_fn iterate_devices; 153 142 154 143 /* For internal device-mapper use. */ 155 144 struct list_head list; 156 - }; 157 - 158 - struct io_restrictions { 159 - unsigned long bounce_pfn; 160 - unsigned long seg_boundary_mask; 161 - unsigned max_hw_sectors; 162 - unsigned max_sectors; 163 - unsigned max_segment_size; 164 - unsigned short logical_block_size; 165 - unsigned short max_hw_segments; 166 - unsigned short max_phys_segments; 167 - unsigned char no_cluster; /* inverted so that 0 is default */ 168 145 }; 169 146 170 147 struct dm_target { ··· 164 163 sector_t begin; 165 164 sector_t len; 166 165 167 - /* FIXME: turn this into a mask, and merge with io_restrictions */ 168 166 /* Always a power of 2 */ 169 167 sector_t split_io; 170 168 171 169 /* 172 - * These are automatically filled in by 173 - * dm_table_get_device. 170 + * A number of zero-length barrier requests that will be submitted 171 + * to the target for the purpose of flushing cache. 172 + * 173 + * The request number will be placed in union map_info->flush_request. 174 + * It is a responsibility of the target driver to remap these requests 175 + * to the real underlying devices. 174 176 */ 175 - struct io_restrictions limits; 177 + unsigned num_flush_requests; 176 178 177 179 /* target specific data */ 178 180 void *private; ··· 234 230 int dm_suspended(struct mapped_device *md); 235 231 int dm_noflush_suspending(struct dm_target *ti); 236 232 union map_info *dm_get_mapinfo(struct bio *bio); 233 + union map_info *dm_get_rq_mapinfo(struct request *rq); 237 234 238 235 /* 239 236 * Geometry functions. ··· 396 391 { 397 392 return (n << SECTOR_SHIFT); 398 393 } 394 + 395 + /*----------------------------------------------------------------- 396 + * Helper for block layer and dm core operations 397 + *---------------------------------------------------------------*/ 398 + void dm_dispatch_request(struct request *rq); 399 + void dm_requeue_unmapped_request(struct request *rq); 400 + void dm_kill_unmapped_request(struct request *rq, int error); 401 + int dm_underlying_device_busy(struct request_queue *q); 399 402 400 403 #endif /* _LINUX_DEVICE_MAPPER_H */
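
For a bio-based target the two new per-target pieces introduced here work together: the constructor sets num_flush_requests to say how many empty barrier bios it wants per flush, and iterate_devices hands each underlying device to the callout so the core (dm_calculate_queue_limits(), with dm_set_device_limits() as the callout) can assemble the table's queue_limits. A sketch for a hypothetical single-device target; the context structure is an assumption and dm-linear is the closest real example.

struct onedev_c {                       /* illustrative context, as in dm-linear */
        struct dm_dev *dev;
        sector_t start;
};

static int onedev_iterate_devices(struct dm_target *ti,
                                  iterate_devices_callout_fn fn, void *data)
{
        struct onedev_c *oc = ti->private;

        /* Report the one underlying device; the callout accumulates limits. */
        return fn(ti, oc->dev, oc->start, data);
}

static int onedev_ctr(struct dm_target *ti, unsigned argc, char **argv)
{
        /* ... parse argv, dm_get_device(), set up ti->private ... */
        ti->num_flush_requests = 1;     /* remap one empty barrier per flush */
        return 0;
}

In .map, a single-device target simply remaps an empty barrier like any other bio; map_context->flush_request only becomes interesting for targets that advertise more than one flush destination, where it identifies which copy of the barrier is being handled.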
+12 -2
include/linux/dm-ioctl.h
··· 123 123 __u32 target_count; /* in/out */ 124 124 __s32 open_count; /* out */ 125 125 __u32 flags; /* in/out */ 126 + 127 + /* 128 + * event_nr holds either the event number (input and output) or the 129 + * udev cookie value (input only). 130 + * The DM_DEV_WAIT ioctl takes an event number as input. 131 + * The DM_SUSPEND, DM_DEV_REMOVE and DM_DEV_RENAME ioctls 132 + * use the field as a cookie to return in the DM_COOKIE 133 + * variable with the uevents they issue. 134 + * For output, the ioctls return the event number, not the cookie. 135 + */ 126 136 __u32 event_nr; /* in/out */ 127 137 __u32 padding; 128 138 ··· 266 256 #define DM_DEV_SET_GEOMETRY _IOWR(DM_IOCTL, DM_DEV_SET_GEOMETRY_CMD, struct dm_ioctl) 267 257 268 258 #define DM_VERSION_MAJOR 4 269 - #define DM_VERSION_MINOR 14 259 + #define DM_VERSION_MINOR 15 270 260 #define DM_VERSION_PATCHLEVEL 0 271 - #define DM_VERSION_EXTRA "-ioctl (2008-04-23)" 261 + #define DM_VERSION_EXTRA "-ioctl (2009-04-01)" 272 262 273 263 /* Status bits */ 274 264 #define DM_READONLY_FLAG (1 << 0) /* In/Out */
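
On the ioctl side the cookie is just a value userspace places in event_nr; the kernel hands it back as DM_COOKIE=<value> in the resulting uevent (see dm_kobject_uevent() in the dm.c hunk above), so a udev rule or a waiting process can pair the event with the operation that caused it. A rough userspace sketch for DM_DEV_REMOVE, with error handling omitted and the control-node path given as the conventional /dev/mapper/control:

#include <stdint.h>
#include <string.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/dm-ioctl.h>

/* Sketch: remove device 'name', tagging the uevent with 'cookie'. */
static int remove_with_cookie(const char *name, uint32_t cookie)
{
        struct dm_ioctl io;
        int fd = open("/dev/mapper/control", O_RDWR);

        if (fd < 0)
                return fd;

        memset(&io, 0, sizeof(io));
        io.version[0] = DM_VERSION_MAJOR;
        io.version[1] = DM_VERSION_MINOR;
        io.version[2] = DM_VERSION_PATCHLEVEL;
        io.data_size = sizeof(io);
        strncpy(io.name, name, sizeof(io.name) - 1);
        io.event_nr = cookie;   /* comes back as DM_COOKIE in the uevent */

        return ioctl(fd, DM_DEV_REMOVE, &io);
}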
+386
include/linux/dm-log-userspace.h
··· 1 + /* 2 + * Copyright (C) 2006-2009 Red Hat, Inc. 3 + * 4 + * This file is released under the LGPL. 5 + */ 6 + 7 + #ifndef __DM_LOG_USERSPACE_H__ 8 + #define __DM_LOG_USERSPACE_H__ 9 + 10 + #include <linux/dm-ioctl.h> /* For DM_UUID_LEN */ 11 + 12 + /* 13 + * The device-mapper userspace log module consists of a kernel component and 14 + * a user-space component. The kernel component implements the API defined 15 + * in dm-dirty-log.h. Its purpose is simply to pass the parameters and 16 + * return values of those API functions between kernel and user-space. 17 + * 18 + * Below are defined the 'request_types' - DM_ULOG_CTR, DM_ULOG_DTR, etc. 19 + * These request types represent the different functions in the device-mapper 20 + * dirty log API. Each of these is described in more detail below. 21 + * 22 + * The user-space program must listen for requests from the kernel (representing 23 + * the various API functions) and process them. 24 + * 25 + * User-space begins by setting up the communication link (error checking 26 + * removed for clarity): 27 + * fd = socket(PF_NETLINK, SOCK_DGRAM, NETLINK_CONNECTOR); 28 + * addr.nl_family = AF_NETLINK; 29 + * addr.nl_groups = CN_IDX_DM; 30 + * addr.nl_pid = 0; 31 + * r = bind(fd, (struct sockaddr *) &addr, sizeof(addr)); 32 + * opt = addr.nl_groups; 33 + * setsockopt(fd, SOL_NETLINK, NETLINK_ADD_MEMBERSHIP, &opt, sizeof(opt)); 34 + * 35 + * User-space will then wait to receive requests form the kernel, which it 36 + * will process as described below. The requests are received in the form, 37 + * ((struct dm_ulog_request) + (additional data)). Depending on the request 38 + * type, there may or may not be 'additional data'. In the descriptions below, 39 + * you will see 'Payload-to-userspace' and 'Payload-to-kernel'. The 40 + * 'Payload-to-userspace' is what the kernel sends in 'additional data' as 41 + * necessary parameters to complete the request. The 'Payload-to-kernel' is 42 + * the 'additional data' returned to the kernel that contains the necessary 43 + * results of the request. The 'data_size' field in the dm_ulog_request 44 + * structure denotes the availability and amount of payload data. 45 + */ 46 + 47 + /* 48 + * DM_ULOG_CTR corresponds to (found in dm-dirty-log.h): 49 + * int (*ctr)(struct dm_dirty_log *log, struct dm_target *ti, 50 + * unsigned argc, char **argv); 51 + * 52 + * Payload-to-userspace: 53 + * A single string containing all the argv arguments separated by ' 's 54 + * Payload-to-kernel: 55 + * None. ('data_size' in the dm_ulog_request struct should be 0.) 56 + * 57 + * The UUID contained in the dm_ulog_request structure is the reference that 58 + * will be used by all request types to a specific log. The constructor must 59 + * record this assotiation with instance created. 60 + * 61 + * When the request has been processed, user-space must return the 62 + * dm_ulog_request to the kernel - setting the 'error' field and 63 + * 'data_size' appropriately. 64 + */ 65 + #define DM_ULOG_CTR 1 66 + 67 + /* 68 + * DM_ULOG_DTR corresponds to (found in dm-dirty-log.h): 69 + * void (*dtr)(struct dm_dirty_log *log); 70 + * 71 + * Payload-to-userspace: 72 + * A single string containing all the argv arguments separated by ' 's 73 + * Payload-to-kernel: 74 + * None. ('data_size' in the dm_ulog_request struct should be 0.) 75 + * 76 + * The UUID contained in the dm_ulog_request structure is all that is 77 + * necessary to identify the log instance being destroyed. There is no 78 + * payload data. 
79 + * 80 + * When the request has been processed, user-space must return the 81 + * dm_ulog_request to the kernel - setting the 'error' field and clearing 82 + * 'data_size' appropriately. 83 + */ 84 + #define DM_ULOG_DTR 2 85 + 86 + /* 87 + * DM_ULOG_PRESUSPEND corresponds to (found in dm-dirty-log.h): 88 + * int (*presuspend)(struct dm_dirty_log *log); 89 + * 90 + * Payload-to-userspace: 91 + * None. 92 + * Payload-to-kernel: 93 + * None. 94 + * 95 + * The UUID contained in the dm_ulog_request structure is all that is 96 + * necessary to identify the log instance being presuspended. There is no 97 + * payload data. 98 + * 99 + * When the request has been processed, user-space must return the 100 + * dm_ulog_request to the kernel - setting the 'error' field and 101 + * 'data_size' appropriately. 102 + */ 103 + #define DM_ULOG_PRESUSPEND 3 104 + 105 + /* 106 + * DM_ULOG_POSTSUSPEND corresponds to (found in dm-dirty-log.h): 107 + * int (*postsuspend)(struct dm_dirty_log *log); 108 + * 109 + * Payload-to-userspace: 110 + * None. 111 + * Payload-to-kernel: 112 + * None. 113 + * 114 + * The UUID contained in the dm_ulog_request structure is all that is 115 + * necessary to identify the log instance being postsuspended. There is no 116 + * payload data. 117 + * 118 + * When the request has been processed, user-space must return the 119 + * dm_ulog_request to the kernel - setting the 'error' field and 120 + * 'data_size' appropriately. 121 + */ 122 + #define DM_ULOG_POSTSUSPEND 4 123 + 124 + /* 125 + * DM_ULOG_RESUME corresponds to (found in dm-dirty-log.h): 126 + * int (*resume)(struct dm_dirty_log *log); 127 + * 128 + * Payload-to-userspace: 129 + * None. 130 + * Payload-to-kernel: 131 + * None. 132 + * 133 + * The UUID contained in the dm_ulog_request structure is all that is 134 + * necessary to identify the log instance being resumed. There is no 135 + * payload data. 136 + * 137 + * When the request has been processed, user-space must return the 138 + * dm_ulog_request to the kernel - setting the 'error' field and 139 + * 'data_size' appropriately. 140 + */ 141 + #define DM_ULOG_RESUME 5 142 + 143 + /* 144 + * DM_ULOG_GET_REGION_SIZE corresponds to (found in dm-dirty-log.h): 145 + * uint32_t (*get_region_size)(struct dm_dirty_log *log); 146 + * 147 + * Payload-to-userspace: 148 + * None. 149 + * Payload-to-kernel: 150 + * uint64_t - contains the region size 151 + * 152 + * The region size is something that was determined at constructor time. 153 + * It is returned in the payload area and 'data_size' is set to 154 + * reflect this. 155 + * 156 + * When the request has been processed, user-space must return the 157 + * dm_ulog_request to the kernel - setting the 'error' field appropriately. 158 + */ 159 + #define DM_ULOG_GET_REGION_SIZE 6 160 + 161 + /* 162 + * DM_ULOG_IS_CLEAN corresponds to (found in dm-dirty-log.h): 163 + * int (*is_clean)(struct dm_dirty_log *log, region_t region); 164 + * 165 + * Payload-to-userspace: 166 + * uint64_t - the region to get clean status on 167 + * Payload-to-kernel: 168 + * int64_t - 1 if clean, 0 otherwise 169 + * 170 + * Payload is sizeof(uint64_t) and contains the region for which the clean 171 + * status is being made. 172 + * 173 + * When the request has been processed, user-space must return the 174 + * dm_ulog_request to the kernel - filling the payload with 0 (not clean) or 175 + * 1 (clean), setting 'data_size' and 'error' appropriately. 
176 + */ 177 + #define DM_ULOG_IS_CLEAN 7 178 + 179 + /* 180 + * DM_ULOG_IN_SYNC corresponds to (found in dm-dirty-log.h): 181 + * int (*in_sync)(struct dm_dirty_log *log, region_t region, 182 + * int can_block); 183 + * 184 + * Payload-to-userspace: 185 + * uint64_t - the region to get sync status on 186 + * Payload-to-kernel: 187 + * int64_t - 1 if in-sync, 0 otherwise 188 + * 189 + * Exactly the same as 'is_clean' above, except this time asking "has the 190 + * region been recovered?" vs. "is the region not being modified?" 191 + */ 192 + #define DM_ULOG_IN_SYNC 8 193 + 194 + /* 195 + * DM_ULOG_FLUSH corresponds to (found in dm-dirty-log.h): 196 + * int (*flush)(struct dm_dirty_log *log); 197 + * 198 + * Payload-to-userspace: 199 + * None. 200 + * Payload-to-kernel: 201 + * None. 202 + * 203 + * No incoming or outgoing payload. Simply flush log state to disk. 204 + * 205 + * When the request has been processed, user-space must return the 206 + * dm_ulog_request to the kernel - setting the 'error' field and clearing 207 + * 'data_size' appropriately. 208 + */ 209 + #define DM_ULOG_FLUSH 9 210 + 211 + /* 212 + * DM_ULOG_MARK_REGION corresponds to (found in dm-dirty-log.h): 213 + * void (*mark_region)(struct dm_dirty_log *log, region_t region); 214 + * 215 + * Payload-to-userspace: 216 + * uint64_t [] - region(s) to mark 217 + * Payload-to-kernel: 218 + * None. 219 + * 220 + * Incoming payload contains the one or more regions to mark dirty. 221 + * The number of regions contained in the payload can be determined from 222 + * 'data_size/sizeof(uint64_t)'. 223 + * 224 + * When the request has been processed, user-space must return the 225 + * dm_ulog_request to the kernel - setting the 'error' field and clearing 226 + * 'data_size' appropriately. 227 + */ 228 + #define DM_ULOG_MARK_REGION 10 229 + 230 + /* 231 + * DM_ULOG_CLEAR_REGION corresponds to (found in dm-dirty-log.h): 232 + * void (*clear_region)(struct dm_dirty_log *log, region_t region); 233 + * 234 + * Payload-to-userspace: 235 + * uint64_t [] - region(s) to clear 236 + * Payload-to-kernel: 237 + * None. 238 + * 239 + * Incoming payload contains the one or more regions to mark clean. 240 + * The number of regions contained in the payload can be determined from 241 + * 'data_size/sizeof(uint64_t)'. 242 + * 243 + * When the request has been processed, user-space must return the 244 + * dm_ulog_request to the kernel - setting the 'error' field and clearing 245 + * 'data_size' appropriately. 246 + */ 247 + #define DM_ULOG_CLEAR_REGION 11 248 + 249 + /* 250 + * DM_ULOG_GET_RESYNC_WORK corresponds to (found in dm-dirty-log.h): 251 + * int (*get_resync_work)(struct dm_dirty_log *log, region_t *region); 252 + * 253 + * Payload-to-userspace: 254 + * None. 255 + * Payload-to-kernel: 256 + * { 257 + * int64_t i; -- 1 if recovery necessary, 0 otherwise 258 + * uint64_t r; -- The region to recover if i=1 259 + * } 260 + * 'data_size' should be set appropriately. 261 + * 262 + * When the request has been processed, user-space must return the 263 + * dm_ulog_request to the kernel - setting the 'error' field appropriately. 264 + */ 265 + #define DM_ULOG_GET_RESYNC_WORK 12 266 + 267 + /* 268 + * DM_ULOG_SET_REGION_SYNC corresponds to (found in dm-dirty-log.h): 269 + * void (*set_region_sync)(struct dm_dirty_log *log, 270 + * region_t region, int in_sync); 271 + * 272 + * Payload-to-userspace: 273 + * { 274 + * uint64_t - region to set sync state on 275 + * int64_t - 0 if not-in-sync, 1 if in-sync 276 + * } 277 + * Payload-to-kernel: 278 + * None. 
279 + * 280 + * When the request has been processed, user-space must return the 281 + * dm_ulog_request to the kernel - setting the 'error' field and clearing 282 + * 'data_size' appropriately. 283 + */ 284 + #define DM_ULOG_SET_REGION_SYNC 13 285 + 286 + /* 287 + * DM_ULOG_GET_SYNC_COUNT corresponds to (found in dm-dirty-log.h): 288 + * region_t (*get_sync_count)(struct dm_dirty_log *log); 289 + * 290 + * Payload-to-userspace: 291 + * None. 292 + * Payload-to-kernel: 293 + * uint64_t - the number of in-sync regions 294 + * 295 + * No incoming payload. Kernel-bound payload contains the number of 296 + * regions that are in-sync (in a size_t). 297 + * 298 + * When the request has been processed, user-space must return the 299 + * dm_ulog_request to the kernel - setting the 'error' field and 300 + * 'data_size' appropriately. 301 + */ 302 + #define DM_ULOG_GET_SYNC_COUNT 14 303 + 304 + /* 305 + * DM_ULOG_STATUS_INFO corresponds to (found in dm-dirty-log.h): 306 + * int (*status)(struct dm_dirty_log *log, STATUSTYPE_INFO, 307 + * char *result, unsigned maxlen); 308 + * 309 + * Payload-to-userspace: 310 + * None. 311 + * Payload-to-kernel: 312 + * Character string containing STATUSTYPE_INFO 313 + * 314 + * When the request has been processed, user-space must return the 315 + * dm_ulog_request to the kernel - setting the 'error' field and 316 + * 'data_size' appropriately. 317 + */ 318 + #define DM_ULOG_STATUS_INFO 15 319 + 320 + /* 321 + * DM_ULOG_STATUS_TABLE corresponds to (found in dm-dirty-log.h): 322 + * int (*status)(struct dm_dirty_log *log, STATUSTYPE_TABLE, 323 + * char *result, unsigned maxlen); 324 + * 325 + * Payload-to-userspace: 326 + * None. 327 + * Payload-to-kernel: 328 + * Character string containing STATUSTYPE_TABLE 329 + * 330 + * When the request has been processed, user-space must return the 331 + * dm_ulog_request to the kernel - setting the 'error' field and 332 + * 'data_size' appropriately. 333 + */ 334 + #define DM_ULOG_STATUS_TABLE 16 335 + 336 + /* 337 + * DM_ULOG_IS_REMOTE_RECOVERING corresponds to (found in dm-dirty-log.h): 338 + * int (*is_remote_recovering)(struct dm_dirty_log *log, region_t region); 339 + * 340 + * Payload-to-userspace: 341 + * uint64_t - region to determine recovery status on 342 + * Payload-to-kernel: 343 + * { 344 + * int64_t is_recovering; -- 0 if no, 1 if yes 345 + * uint64_t in_sync_hint; -- lowest region still needing resync 346 + * } 347 + * 348 + * When the request has been processed, user-space must return the 349 + * dm_ulog_request to the kernel - setting the 'error' field and 350 + * 'data_size' appropriately. 351 + */ 352 + #define DM_ULOG_IS_REMOTE_RECOVERING 17 353 + 354 + /* 355 + * (DM_ULOG_REQUEST_MASK & request_type) to get the request type 356 + * 357 + * Payload-to-userspace: 358 + * A single string containing all the argv arguments separated by ' 's 359 + * Payload-to-kernel: 360 + * None. ('data_size' in the dm_ulog_request struct should be 0.) 361 + * 362 + * We are reserving 8 bits of the 32-bit 'request_type' field for the 363 + * various request types above. The remaining 24-bits are currently 364 + * set to zero and are reserved for future use and compatibility concerns. 365 + * 366 + * User-space should always use DM_ULOG_REQUEST_TYPE to aquire the 367 + * request type from the 'request_type' field to maintain forward compatibility. 
368 + */ 369 + #define DM_ULOG_REQUEST_MASK 0xFF 370 + #define DM_ULOG_REQUEST_TYPE(request_type) \ 371 + (DM_ULOG_REQUEST_MASK & (request_type)) 372 + 373 + struct dm_ulog_request { 374 + char uuid[DM_UUID_LEN]; /* Ties a request to a specific mirror log */ 375 + char padding[7]; /* Padding because DM_UUID_LEN = 129 */ 376 + 377 + int32_t error; /* Used to report back processing errors */ 378 + 379 + uint32_t seq; /* Sequence number for request */ 380 + uint32_t request_type; /* DM_ULOG_* defined above */ 381 + uint32_t data_size; /* How much data (not including this struct) */ 382 + 383 + char data[0]; 384 + }; 385 + 386 + #endif /* __DM_LOG_USERSPACE_H__ */
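
Putting the request types together, a userspace log daemon is essentially a loop that reads a struct dm_ulog_request from the connector socket, switches on DM_ULOG_REQUEST_TYPE(), fills in error/data/data_size, and sends the same structure back. The sketch below handles two of the simpler requests; send_reply(), region_size and region_is_clean() are hypothetical stand-ins for the daemon's own transport and log state, and a real daemon must also implement the remaining request types and keep persistent, coherent state.

#include <stdint.h>
#include <string.h>
#include <errno.h>
#include <linux/dm-log-userspace.h>

static uint64_t region_size;                    /* decided at DM_ULOG_CTR time */
extern int region_is_clean(uint64_t region);    /* daemon's own bookkeeping */
extern int send_reply(struct dm_ulog_request *rq);  /* wraps rq in cn_msg/nlmsghdr */

/* 'rq' sits in the receive buffer, which must leave room for the payload. */
static void handle_request(struct dm_ulog_request *rq)
{
        switch (DM_ULOG_REQUEST_TYPE(rq->request_type)) {
        case DM_ULOG_GET_REGION_SIZE:
                memcpy(rq->data, &region_size, sizeof(region_size));
                rq->data_size = sizeof(region_size);
                rq->error = 0;
                break;
        case DM_ULOG_IS_CLEAN: {
                uint64_t region;
                int64_t clean;

                memcpy(&region, rq->data, sizeof(region));
                clean = region_is_clean(region) ? 1 : 0;
                memcpy(rq->data, &clean, sizeof(clean));
                rq->data_size = sizeof(clean);
                rq->error = 0;
                break;
        }
        default:
                rq->error = -ENOSYS;    /* errno-style error for unhandled types */
                rq->data_size = 0;
                break;
        }

        send_reply(rq);         /* reply carries the same uuid and seq back */
}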