Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

dm: add statistics support

Support the collection of I/O statistics on user-defined regions of
a DM device. If no regions are defined no statistics are collected so
there isn't any performance impact. Only bio-based DM devices are
currently supported.

Each user-defined region specifies a starting sector, length and step.
Individual statistics will be collected for each step-sized area within
the range specified.
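
To make the region and area arithmetic concrete, here is a minimal userspace sketch of how a region might be divided into step-sized areas and how a sector maps to an area. The names `area_count` and `area_index` are hypothetical and do not appear in the patch; the kernel code performs the equivalent division with `dm_sector_div64()`.

```c
#include <stdint.h>

/*
 * Number of step-sized areas needed to cover the region [start, end).
 * The last area may be shorter than step. Returns 0 for invalid input.
 * (Hypothetical helper, not part of the patch.)
 */
uint64_t area_count(uint64_t start, uint64_t end, uint64_t step)
{
	uint64_t len;

	if (end <= start || step == 0)
		return 0;
	len = end - start;
	/* Round up without risking overflow in len + step - 1. */
	return len / step + (len % step != 0);
}

/*
 * Index of the area that contains sector, or -1 if the sector lies
 * outside the region. (Hypothetical helper, not part of the patch.)
 */
int64_t area_index(uint64_t start, uint64_t end, uint64_t step,
		   uint64_t sector)
{
	if (step == 0 || sector < start || sector >= end)
		return -1;
	return (int64_t)((sector - start) / step);
}
```

For a region of 1001 sectors with a step of 100, `area_count` yields 11 areas, the last of which covers a single sector.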

The I/O statistics counters for each step-sized area of a region are
in the same format as /sys/block/*/stat or /proc/diskstats but extra
counters (12 and 13) are provided: total time spent reading and
writing in milliseconds. All these counters may be accessed by sending
the @stats_print message to the appropriate DM device via dmsetup.

The creation of DM statistics will allocate memory via kmalloc or
fall back to using vmalloc space. At most, 1/4 of the overall system
memory may be allocated by DM statistics. The admin can see how much
memory is used by reading
/sys/module/dm_mod/parameters/stats_current_allocated_bytes
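
The 1/4 limit can be pictured with a small userspace model of the accounting check. This is a simplified sketch, not the kernel code: it works in bytes rather than pages, it leaves out the separate vmalloc-space limit that the patch also applies, and the name `stats_alloc_allowed` is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the "at most 1/4 of system memory" rule described above. */
#define STATS_MEMORY_FACTOR 4

/*
 * Decide whether a new allocation of request_bytes may be added to the
 * current_bytes already accounted, given total_ram_bytes of system memory.
 * (Hypothetical userspace model; all three parameters are assumptions.)
 */
bool stats_alloc_allowed(size_t current_bytes, size_t request_bytes,
			 size_t total_ram_bytes)
{
	size_t total = current_bytes + request_bytes;

	/* Reject if the sum wrapped around. */
	if (total < current_bytes)
		return false;

	return total <= total_ram_bytes / STATS_MEMORY_FACTOR;
}
```

The wrap-around test matters because a mistyped, very large region size could otherwise make the sum look small.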

See Documentation/device-mapper/statistics.txt for more details.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

Authored by Mikulas Patocka and committed by Mike Snitzer
fd2ed4d2 94563bad

+1299 -14
+186
Documentation/device-mapper/statistics.txt
@@ -0,0 +1,186 @@
+DM statistics
+=============
+
+Device Mapper supports the collection of I/O statistics on user-defined
+regions of a DM device. If no regions are defined no statistics are
+collected so there isn't any performance impact. Only bio-based DM
+devices are currently supported.
+
+Each user-defined region specifies a starting sector, length and step.
+Individual statistics will be collected for each step-sized area within
+the range specified.
+
+The I/O statistics counters for each step-sized area of a region are
+in the same format as /sys/block/*/stat or /proc/diskstats (see:
+Documentation/iostats.txt). But two extra counters (12 and 13) are
+provided: total time spent reading and writing in milliseconds. All
+these counters may be accessed by sending the @stats_print message to
+the appropriate DM device via dmsetup.
+
+Each region has a corresponding unique identifier, which we call a
+region_id, that is assigned when the region is created. The region_id
+must be supplied when querying statistics about the region, deleting the
+region, etc. Unique region_ids enable multiple userspace programs to
+request and process statistics for the same DM device without stepping
+on each other's data.
+
+The creation of DM statistics will allocate memory via kmalloc or
+fall back to using vmalloc space. At most, 1/4 of the overall system
+memory may be allocated by DM statistics. The admin can see how much
+memory is used by reading
+/sys/module/dm_mod/parameters/stats_current_allocated_bytes
+
+Messages
+========
+
+    @stats_create <range> <step> [<program_id> [<aux_data>]]
+
+        Create a new region and return the region_id.
+
+        <range>
+          "-" - whole device
+          "<start_sector>+<length>" - a range of <length> 512-byte sectors
+                                      starting with <start_sector>.
+
+        <step>
+          "<area_size>" - the range is subdivided into areas each containing
+                          <area_size> sectors.
+          "/<number_of_areas>" - the range is subdivided into the specified
+                                 number of areas.
+
+        <program_id>
+          An optional parameter. A name that uniquely identifies
+          the userspace owner of the range. This groups ranges together
+          so that userspace programs can identify the ranges they
+          created and ignore those created by others.
+          The kernel returns this string back in the output of
+          @stats_list message, but it doesn't use it for anything else.
+
+        <aux_data>
+          An optional parameter. A word that provides auxiliary data
+          that is useful to the client program that created the range.
+          The kernel returns this string back in the output of
+          @stats_list message, but it doesn't use this value for anything.
+
+    @stats_delete <region_id>
+
+        Delete the region with the specified id.
+
+        <region_id>
+          region_id returned from @stats_create
+
+    @stats_clear <region_id>
+
+        Clear all the counters except the in-flight i/o counters.
+
+        <region_id>
+          region_id returned from @stats_create
+
+    @stats_list [<program_id>]
+
+        List all regions registered with @stats_create.
+
+        <program_id>
+          An optional parameter.
+          If this parameter is specified, only matching regions
+          are returned.
+          If it is not specified, all regions are returned.
+
+        Output format:
+          <region_id>: <start_sector>+<length> <step> <program_id> <aux_data>
+
+    @stats_print <region_id> [<starting_line> <number_of_lines>]
+
+        Print counters for each step-sized area of a region.
+
+        <region_id>
+          region_id returned from @stats_create
+
+        <starting_line>
+          The index of the starting line in the output.
+          If omitted, all lines are returned.
+
+        <number_of_lines>
+          The number of lines to include in the output.
+          If omitted, all lines are returned.
+
+        Output format for each step-sized area of a region:
+
+          <start_sector>+<length> counters
+
+          The first 11 counters have the same meaning as
+          /sys/block/*/stat or /proc/diskstats.
+
+          Please refer to Documentation/iostats.txt for details.
+
+          1. the number of reads completed
+          2. the number of reads merged
+          3. the number of sectors read
+          4. the number of milliseconds spent reading
+          5. the number of writes completed
+          6. the number of writes merged
+          7. the number of sectors written
+          8. the number of milliseconds spent writing
+          9. the number of I/Os currently in progress
+          10. the number of milliseconds spent doing I/Os
+          11. the weighted number of milliseconds spent doing I/Os
+
+          Additional counters:
+          12. the total time spent reading in milliseconds
+          13. the total time spent writing in milliseconds
+
+    @stats_print_clear <region_id> [<starting_line> <number_of_lines>]
+
+        Atomically print and then clear all the counters except the
+        in-flight i/o counters. Useful when the client consuming the
+        statistics does not want to lose any statistics (those updated
+        between printing and clearing).
+
+        <region_id>
+          region_id returned from @stats_create
+
+        <starting_line>
+          The index of the starting line in the output.
+          If omitted, all lines are printed and then cleared.
+
+        <number_of_lines>
+          The number of lines to process.
+          If omitted, all lines are printed and then cleared.
+
+    @stats_set_aux <region_id> <aux_data>
+
+        Store auxiliary data aux_data for the specified region.
+
+        <region_id>
+          region_id returned from @stats_create
+
+        <aux_data>
+          The string that identifies data which is useful to the client
+          program that created the range. The kernel returns this
+          string back in the output of @stats_list message, but it
+          doesn't use this value for anything.
+
+Examples
+========
+
+Subdivide the DM device 'vol' into 100 pieces and start collecting
+statistics on them:
+
+  dmsetup message vol 0 @stats_create - /100
+
+Set the auxiliary data string to "foo bar baz" (the escape for each
+space must also be escaped, otherwise the shell will consume them):
+
+  dmsetup message vol 0 @stats_set_aux 0 foo\\ bar\\ baz
+
+List the statistics:
+
+  dmsetup message vol 0 @stats_list
+
+Print the statistics:
+
+  dmsetup message vol 0 @stats_print 0
+
+Delete the statistics:
+
+  dmsetup message vol 0 @stats_delete 0
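
The two forms of the `<step>` argument to @stats_create documented above ("<area_size>" and "/<number_of_areas>") can be sketched in userspace as follows. This is an illustrative model under stated assumptions rather than the kernel implementation: `parse_step` is a hypothetical name, and rejecting a divisor of zero is a choice made for this sketch.

```c
#include <stdio.h>

/*
 * Convert a <step> argument into an area size in sectors.
 *   "<area_size>"        -> that many sectors per area
 *   "/<number_of_areas>" -> len divided by the count, rounded up
 * len is the length of the region in sectors.
 * Returns 0 on success, -1 on a malformed or zero argument.
 * (Hypothetical helper, not part of the patch.)
 */
int parse_step(const char *arg, unsigned long long len,
	       unsigned long long *step)
{
	unsigned divisor;
	unsigned long long s;
	char dummy;	/* catches trailing garbage after the number */

	if (sscanf(arg, "/%u%c", &divisor, &dummy) == 1) {
		if (divisor == 0)	/* assumption: treat "/0" as invalid */
			return -1;
		s = len / divisor + (len % divisor != 0);
		if (s == 0)
			s = 1;
		*step = s;
		return 0;
	}

	if (sscanf(arg, "%llu%c", &s, &dummy) == 1 && s != 0) {
		*step = s;
		return 0;
	}

	return -1;
}
```

The trailing `%c` conversion is a common way to detect extra characters: if it matches, `sscanf` returns 2 and the argument is rejected.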
+1 -1
drivers/md/Makefile
@@ -3,7 +3,7 @@
 #
 
 dm-mod-y	+= dm.o dm-table.o dm-target.o dm-linear.o dm-stripe.o \
-		   dm-ioctl.o dm-io.o dm-kcopyd.o dm-sysfs.o
+		   dm-ioctl.o dm-io.o dm-kcopyd.o dm-sysfs.o dm-stats.o
 dm-multipath-y	+= dm-path-selector.o dm-mpath.o
 dm-snapshot-y	+= dm-snap.o dm-exception-store.o dm-snap-transient.o \
 		    dm-snap-persistent.o
+14 -8
drivers/md/dm-ioctl.c
@@ -1455,20 +1455,26 @@
 	return 0;
 }
 
-static bool buffer_test_overflow(char *result, unsigned maxlen)
-{
-	return !maxlen || strlen(result) + 1 >= maxlen;
-}
-
 /*
- * Process device-mapper dependent messages.
+ * Process device-mapper dependent messages. Messages prefixed with '@'
+ * are processed by the DM core. All others are delivered to the target.
  * Returns a number <= 1 if message was processed by device mapper.
  * Returns 2 if message should be delivered to the target.
  */
 static int message_for_md(struct mapped_device *md, unsigned argc, char **argv,
 			  char *result, unsigned maxlen)
 {
-	return 2;
+	int r;
+
+	if (**argv != '@')
+		return 2; /* no '@' prefix, deliver to target */
+
+	r = dm_stats_message(md, argc, argv, result, maxlen);
+	if (r < 2)
+		return r;
+
+	DMERR("Unsupported message sent to DM core: %s", argv[0]);
+	return -EINVAL;
 }
 
 /*
@@ -1548,7 +1542,7 @@
 
 	if (r == 1) {
 		param->flags |= DM_DATA_OUT_FLAG;
-		if (buffer_test_overflow(result, maxlen))
+		if (dm_message_test_buffer_overflow(result, maxlen))
 			param->flags |= DM_BUFFER_FULL_FLAG;
 		else
 			param->data_size = param->data_start + strlen(result) + 1;
+969
drivers/md/dm-stats.c
@@ -0,0 +1,969 @@
+#include <linux/errno.h>
+#include <linux/numa.h>
+#include <linux/slab.h>
+#include <linux/rculist.h>
+#include <linux/threads.h>
+#include <linux/preempt.h>
+#include <linux/irqflags.h>
+#include <linux/vmalloc.h>
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/device-mapper.h>
+
+#include "dm.h"
+#include "dm-stats.h"
+
+#define DM_MSG_PREFIX "stats"
+
+static int dm_stat_need_rcu_barrier;
+
+/*
+ * Using 64-bit values to avoid overflow (which is a
+ * problem that block/genhd.c's IO accounting has).
+ */
+struct dm_stat_percpu {
+	unsigned long long sectors[2];
+	unsigned long long ios[2];
+	unsigned long long merges[2];
+	unsigned long long ticks[2];
+	unsigned long long io_ticks[2];
+	unsigned long long io_ticks_total;
+	unsigned long long time_in_queue;
+};
+
+struct dm_stat_shared {
+	atomic_t in_flight[2];
+	unsigned long stamp;
+	struct dm_stat_percpu tmp;
+};
+
+struct dm_stat {
+	struct list_head list_entry;
+	int id;
+	size_t n_entries;
+	sector_t start;
+	sector_t end;
+	sector_t step;
+	const char *program_id;
+	const char *aux_data;
+	struct rcu_head rcu_head;
+	size_t shared_alloc_size;
+	size_t percpu_alloc_size;
+	struct dm_stat_percpu *stat_percpu[NR_CPUS];
+	struct dm_stat_shared stat_shared[0];
+};
+
+struct dm_stats_last_position {
+	sector_t last_sector;
+	unsigned last_rw;
+};
+
+/*
+ * A typo on the command line could possibly make the kernel run out of memory
+ * and crash. To prevent the crash we account all used memory. We fail if we
+ * exhaust 1/4 of all memory or 1/2 of vmalloc space.
+ */
+#define DM_STATS_MEMORY_FACTOR		4
+#define DM_STATS_VMALLOC_FACTOR		2
+
+static DEFINE_SPINLOCK(shared_memory_lock);
+
+static unsigned long shared_memory_amount;
+
+static bool __check_shared_memory(size_t alloc_size)
+{
+	size_t a;
+
+	a = shared_memory_amount + alloc_size;
+	if (a < shared_memory_amount)
+		return false;
+	if (a >> PAGE_SHIFT > totalram_pages / DM_STATS_MEMORY_FACTOR)
+		return false;
+#ifdef CONFIG_MMU
+	if (a > (VMALLOC_END - VMALLOC_START) / DM_STATS_VMALLOC_FACTOR)
+		return false;
+#endif
+	return true;
+}
+
+static bool check_shared_memory(size_t alloc_size)
+{
+	bool ret;
+
+	spin_lock_irq(&shared_memory_lock);
+
+	ret = __check_shared_memory(alloc_size);
+
+	spin_unlock_irq(&shared_memory_lock);
+
+	return ret;
+}
+
+static bool claim_shared_memory(size_t alloc_size)
+{
+	spin_lock_irq(&shared_memory_lock);
+
+	if (!__check_shared_memory(alloc_size)) {
+		spin_unlock_irq(&shared_memory_lock);
+		return false;
+	}
+
+	shared_memory_amount += alloc_size;
+
+	spin_unlock_irq(&shared_memory_lock);
+
+	return true;
+}
+
+static void free_shared_memory(size_t alloc_size)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&shared_memory_lock, flags);
+
+	if (WARN_ON_ONCE(shared_memory_amount < alloc_size)) {
+		spin_unlock_irqrestore(&shared_memory_lock, flags);
+		DMCRIT("Memory usage accounting bug.");
+		return;
+	}
+
+	shared_memory_amount -= alloc_size;
+
+	spin_unlock_irqrestore(&shared_memory_lock, flags);
+}
+
+static void *dm_kvzalloc(size_t alloc_size, int node)
+{
+	void *p;
+
+	if (!claim_shared_memory(alloc_size))
+		return NULL;
+
+	if (alloc_size <= KMALLOC_MAX_SIZE) {
+		p = kzalloc_node(alloc_size, GFP_KERNEL | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN, node);
+		if (p)
+			return p;
+	}
+	p = vzalloc_node(alloc_size, node);
+	if (p)
+		return p;
+
+	free_shared_memory(alloc_size);
+
+	return NULL;
+}
+
+static void dm_kvfree(void *ptr, size_t alloc_size)
+{
+	if (!ptr)
+		return;
+
+	free_shared_memory(alloc_size);
+
+	if (is_vmalloc_addr(ptr))
+		vfree(ptr);
+	else
+		kfree(ptr);
+}
+
+static void dm_stat_free(struct rcu_head *head)
+{
+	int cpu;
+	struct dm_stat *s = container_of(head, struct dm_stat, rcu_head);
+
+	kfree(s->program_id);
+	kfree(s->aux_data);
+	for_each_possible_cpu(cpu)
+		dm_kvfree(s->stat_percpu[cpu], s->percpu_alloc_size);
+	dm_kvfree(s, s->shared_alloc_size);
+}
+
+static int dm_stat_in_flight(struct dm_stat_shared *shared)
+{
+	return atomic_read(&shared->in_flight[READ]) +
+	       atomic_read(&shared->in_flight[WRITE]);
+}
+
+void dm_stats_init(struct dm_stats *stats)
+{
+	int cpu;
+	struct dm_stats_last_position *last;
+
+	mutex_init(&stats->mutex);
+	INIT_LIST_HEAD(&stats->list);
+	stats->last = alloc_percpu(struct dm_stats_last_position);
+	for_each_possible_cpu(cpu) {
+		last = per_cpu_ptr(stats->last, cpu);
+		last->last_sector = (sector_t)ULLONG_MAX;
+		last->last_rw = UINT_MAX;
+	}
+}
+
+void dm_stats_cleanup(struct dm_stats *stats)
+{
+	size_t ni;
+	struct dm_stat *s;
+	struct dm_stat_shared *shared;
+
+	while (!list_empty(&stats->list)) {
+		s = container_of(stats->list.next, struct dm_stat, list_entry);
+		list_del(&s->list_entry);
+		for (ni = 0; ni < s->n_entries; ni++) {
+			shared = &s->stat_shared[ni];
+			if (WARN_ON(dm_stat_in_flight(shared))) {
+				DMCRIT("leaked in-flight counter at index %lu "
+				       "(start %llu, end %llu, step %llu): reads %d, writes %d",
+				       (unsigned long)ni,
+				       (unsigned long long)s->start,
+				       (unsigned long long)s->end,
+				       (unsigned long long)s->step,
+				       atomic_read(&shared->in_flight[READ]),
+				       atomic_read(&shared->in_flight[WRITE]));
+			}
+		}
+		dm_stat_free(&s->rcu_head);
+	}
+	free_percpu(stats->last);
+}
+
+static int dm_stats_create(struct dm_stats *stats, sector_t start, sector_t end,
+			   sector_t step, const char *program_id, const char *aux_data,
+			   void (*suspend_callback)(struct mapped_device *),
+			   void (*resume_callback)(struct mapped_device *),
+			   struct mapped_device *md)
+{
+	struct list_head *l;
+	struct dm_stat *s, *tmp_s;
+	sector_t n_entries;
+	size_t ni;
+	size_t shared_alloc_size;
+	size_t percpu_alloc_size;
+	struct dm_stat_percpu *p;
+	int cpu;
+	int ret_id;
+	int r;
+
+	if (end < start || !step)
+		return -EINVAL;
+
+	n_entries = end - start;
+	if (dm_sector_div64(n_entries, step))
+		n_entries++;
+
+	if (n_entries != (size_t)n_entries || !(size_t)(n_entries + 1))
+		return -EOVERFLOW;
+
+	shared_alloc_size = sizeof(struct dm_stat) + (size_t)n_entries * sizeof(struct dm_stat_shared);
+	if ((shared_alloc_size - sizeof(struct dm_stat)) / sizeof(struct dm_stat_shared) != n_entries)
+		return -EOVERFLOW;
+
+	percpu_alloc_size = (size_t)n_entries * sizeof(struct dm_stat_percpu);
+	if (percpu_alloc_size / sizeof(struct dm_stat_percpu) != n_entries)
+		return -EOVERFLOW;
+
+	if (!check_shared_memory(shared_alloc_size + num_possible_cpus() * percpu_alloc_size))
+		return -ENOMEM;
+
+	s = dm_kvzalloc(shared_alloc_size, NUMA_NO_NODE);
+	if (!s)
+		return -ENOMEM;
+
+	s->n_entries = n_entries;
+	s->start = start;
+	s->end = end;
+	s->step = step;
+	s->shared_alloc_size = shared_alloc_size;
+	s->percpu_alloc_size = percpu_alloc_size;
+
+	s->program_id = kstrdup(program_id, GFP_KERNEL);
+	if (!s->program_id) {
+		r = -ENOMEM;
+		goto out;
+	}
+	s->aux_data = kstrdup(aux_data, GFP_KERNEL);
+	if (!s->aux_data) {
+		r = -ENOMEM;
+		goto out;
+	}
+
+	for (ni = 0; ni < n_entries; ni++) {
+		atomic_set(&s->stat_shared[ni].in_flight[READ], 0);
+		atomic_set(&s->stat_shared[ni].in_flight[WRITE], 0);
+	}
+
+	for_each_possible_cpu(cpu) {
+		p = dm_kvzalloc(percpu_alloc_size, cpu_to_node(cpu));
+		if (!p) {
+			r = -ENOMEM;
+			goto out;
+		}
+		s->stat_percpu[cpu] = p;
+	}
+
+	/*
+	 * Suspend/resume to make sure there is no i/o in flight,
+	 * so that newly created statistics will be exact.
+	 *
+	 * (note: we couldn't suspend earlier because we must not
+	 * allocate memory while suspended)
+	 */
+	suspend_callback(md);
+
+	mutex_lock(&stats->mutex);
+	s->id = 0;
+	list_for_each(l, &stats->list) {
+		tmp_s = container_of(l, struct dm_stat, list_entry);
+		if (WARN_ON(tmp_s->id < s->id)) {
+			r = -EINVAL;
+			goto out_unlock_resume;
+		}
+		if (tmp_s->id > s->id)
+			break;
+		if (unlikely(s->id == INT_MAX)) {
+			r = -ENFILE;
+			goto out_unlock_resume;
+		}
+		s->id++;
+	}
+	ret_id = s->id;
+	list_add_tail_rcu(&s->list_entry, l);
+	mutex_unlock(&stats->mutex);
+
+	resume_callback(md);
+
+	return ret_id;
+
+out_unlock_resume:
+	mutex_unlock(&stats->mutex);
+	resume_callback(md);
+out:
+	dm_stat_free(&s->rcu_head);
+	return r;
+}
+
+static struct dm_stat *__dm_stats_find(struct dm_stats *stats, int id)
+{
+	struct dm_stat *s;
+
+	list_for_each_entry(s, &stats->list, list_entry) {
+		if (s->id > id)
+			break;
+		if (s->id == id)
+			return s;
+	}
+
+	return NULL;
+}
+
+static int dm_stats_delete(struct dm_stats *stats, int id)
+{
+	struct dm_stat *s;
+	int cpu;
+
+	mutex_lock(&stats->mutex);
+
+	s = __dm_stats_find(stats, id);
+	if (!s) {
+		mutex_unlock(&stats->mutex);
+		return -ENOENT;
+	}
+
+	list_del_rcu(&s->list_entry);
+	mutex_unlock(&stats->mutex);
+
+	/*
+	 * vfree can't be called from RCU callback
+	 */
+	for_each_possible_cpu(cpu)
+		if (is_vmalloc_addr(s->stat_percpu))
+			goto do_sync_free;
+	if (is_vmalloc_addr(s)) {
+do_sync_free:
+		synchronize_rcu_expedited();
+		dm_stat_free(&s->rcu_head);
+	} else {
+		ACCESS_ONCE(dm_stat_need_rcu_barrier) = 1;
+		call_rcu(&s->rcu_head, dm_stat_free);
+	}
+	return 0;
+}
+
+static int dm_stats_list(struct dm_stats *stats, const char *program,
+			 char *result, unsigned maxlen)
+{
+	struct dm_stat *s;
+	sector_t len;
+	unsigned sz = 0;
+
+	/*
+	 * Output format:
+	 *   <region_id>: <start_sector>+<length> <step> <program_id> <aux_data>
+	 */
+
+	mutex_lock(&stats->mutex);
+	list_for_each_entry(s, &stats->list, list_entry) {
+		if (!program || !strcmp(program, s->program_id)) {
+			len = s->end - s->start;
+			DMEMIT("%d: %llu+%llu %llu %s %s\n", s->id,
+				(unsigned long long)s->start,
+				(unsigned long long)len,
+				(unsigned long long)s->step,
+				s->program_id,
+				s->aux_data);
+		}
+	}
+	mutex_unlock(&stats->mutex);
+
+	return 1;
+}
+
+static void dm_stat_round(struct dm_stat_shared *shared, struct dm_stat_percpu *p)
+{
+	/*
+	 * This is racy, but so is part_round_stats_single.
+	 */
+	unsigned long now = jiffies;
+	unsigned in_flight_read;
+	unsigned in_flight_write;
+	unsigned long difference = now - shared->stamp;
+
+	if (!difference)
+		return;
+	in_flight_read = (unsigned)atomic_read(&shared->in_flight[READ]);
+	in_flight_write = (unsigned)atomic_read(&shared->in_flight[WRITE]);
+	if (in_flight_read)
+		p->io_ticks[READ] += difference;
+	if (in_flight_write)
+		p->io_ticks[WRITE] += difference;
+	if (in_flight_read + in_flight_write) {
+		p->io_ticks_total += difference;
+		p->time_in_queue += (in_flight_read + in_flight_write) * difference;
+	}
+	shared->stamp = now;
+}
+
+static void dm_stat_for_entry(struct dm_stat *s, size_t entry,
+			      unsigned long bi_rw, sector_t len, bool merged,
+			      bool end, unsigned long duration)
+{
+	unsigned long idx = bi_rw & REQ_WRITE;
+	struct dm_stat_shared *shared = &s->stat_shared[entry];
+	struct dm_stat_percpu *p;
+
+	/*
+	 * For strict correctness we should use local_irq_disable/enable
+	 * instead of preempt_disable/enable.
+	 *
+	 * This is racy if the driver finishes bios from non-interrupt
+	 * context as well as from interrupt context or from more different
+	 * interrupts.
+	 *
+	 * However, the race only results in not counting some events,
+	 * so it is acceptable.
+	 *
+	 * part_stat_lock()/part_stat_unlock() have this race too.
+	 */
+	preempt_disable();
+	p = &s->stat_percpu[smp_processor_id()][entry];
+
+	if (!end) {
+		dm_stat_round(shared, p);
+		atomic_inc(&shared->in_flight[idx]);
+	} else {
+		dm_stat_round(shared, p);
+		atomic_dec(&shared->in_flight[idx]);
+		p->sectors[idx] += len;
+		p->ios[idx] += 1;
+		p->merges[idx] += merged;
+		p->ticks[idx] += duration;
+	}
+
+	preempt_enable();
+}
+
+static void __dm_stat_bio(struct dm_stat *s, unsigned long bi_rw,
+			  sector_t bi_sector, sector_t end_sector,
+			  bool end, unsigned long duration,
+			  struct dm_stats_aux *stats_aux)
+{
+	sector_t rel_sector, offset, todo, fragment_len;
+	size_t entry;
+
+	if (end_sector <= s->start || bi_sector >= s->end)
+		return;
+	if (unlikely(bi_sector < s->start)) {
+		rel_sector = 0;
+		todo = end_sector - s->start;
+	} else {
+		rel_sector = bi_sector - s->start;
+		todo = end_sector - bi_sector;
+	}
+	if (unlikely(end_sector > s->end))
+		todo -= (end_sector - s->end);
+
+	offset = dm_sector_div64(rel_sector, s->step);
+	entry = rel_sector;
+	do {
+		if (WARN_ON_ONCE(entry >= s->n_entries)) {
+			DMCRIT("Invalid area access in region id %d", s->id);
+			return;
+		}
+		fragment_len = todo;
+		if (fragment_len > s->step - offset)
+			fragment_len = s->step - offset;
+		dm_stat_for_entry(s, entry, bi_rw, fragment_len,
+				  stats_aux->merged, end, duration);
+		todo -= fragment_len;
+		entry++;
+		offset = 0;
+	} while (unlikely(todo != 0));
+}
+
+void dm_stats_account_io(struct dm_stats *stats, unsigned long bi_rw,
+			 sector_t bi_sector, unsigned bi_sectors, bool end,
+			 unsigned long duration, struct dm_stats_aux *stats_aux)
+{
+	struct dm_stat *s;
+	sector_t end_sector;
+	struct dm_stats_last_position *last;
+
+	if (unlikely(!bi_sectors))
+		return;
+
+	end_sector = bi_sector + bi_sectors;
+
+	if (!end) {
+		/*
+		 * A race condition can at worst result in the merged flag being
+		 * misrepresented, so we don't have to disable preemption here.
+		 */
+		last = __this_cpu_ptr(stats->last);
+		stats_aux->merged =
+			(bi_sector == (ACCESS_ONCE(last->last_sector) &&
+				       ((bi_rw & (REQ_WRITE | REQ_DISCARD)) ==
+					(ACCESS_ONCE(last->last_rw) & (REQ_WRITE | REQ_DISCARD)))
+				       ));
+		ACCESS_ONCE(last->last_sector) = end_sector;
+		ACCESS_ONCE(last->last_rw) = bi_rw;
+	}
+
+	rcu_read_lock();
+
+	list_for_each_entry_rcu(s, &stats->list, list_entry)
+		__dm_stat_bio(s, bi_rw, bi_sector, end_sector, end, duration, stats_aux);
+
+	rcu_read_unlock();
+}
+
+static void __dm_stat_init_temporary_percpu_totals(struct dm_stat_shared *shared,
+						   struct dm_stat *s, size_t x)
+{
+	int cpu;
+	struct dm_stat_percpu *p;
+
+	local_irq_disable();
+	p = &s->stat_percpu[smp_processor_id()][x];
+	dm_stat_round(shared, p);
+	local_irq_enable();
+
+	memset(&shared->tmp, 0, sizeof(shared->tmp));
+	for_each_possible_cpu(cpu) {
+		p = &s->stat_percpu[cpu][x];
+		shared->tmp.sectors[READ] += ACCESS_ONCE(p->sectors[READ]);
+		shared->tmp.sectors[WRITE] += ACCESS_ONCE(p->sectors[WRITE]);
+		shared->tmp.ios[READ] += ACCESS_ONCE(p->ios[READ]);
+		shared->tmp.ios[WRITE] += ACCESS_ONCE(p->ios[WRITE]);
+		shared->tmp.merges[READ] += ACCESS_ONCE(p->merges[READ]);
+		shared->tmp.merges[WRITE] += ACCESS_ONCE(p->merges[WRITE]);
+		shared->tmp.ticks[READ] += ACCESS_ONCE(p->ticks[READ]);
+		shared->tmp.ticks[WRITE] += ACCESS_ONCE(p->ticks[WRITE]);
+		shared->tmp.io_ticks[READ] += ACCESS_ONCE(p->io_ticks[READ]);
+		shared->tmp.io_ticks[WRITE] += ACCESS_ONCE(p->io_ticks[WRITE]);
+		shared->tmp.io_ticks_total += ACCESS_ONCE(p->io_ticks_total);
+		shared->tmp.time_in_queue += ACCESS_ONCE(p->time_in_queue);
+	}
+}
+
+static void __dm_stat_clear(struct dm_stat *s, size_t idx_start, size_t idx_end,
+			    bool init_tmp_percpu_totals)
+{
+	size_t x;
+	struct dm_stat_shared *shared;
+	struct dm_stat_percpu *p;
+
+	for (x = idx_start; x < idx_end; x++) {
+		shared = &s->stat_shared[x];
+		if (init_tmp_percpu_totals)
+			__dm_stat_init_temporary_percpu_totals(shared, s, x);
+		local_irq_disable();
+		p = &s->stat_percpu[smp_processor_id()][x];
+		p->sectors[READ] -= shared->tmp.sectors[READ];
+		p->sectors[WRITE] -= shared->tmp.sectors[WRITE];
+		p->ios[READ] -= shared->tmp.ios[READ];
+		p->ios[WRITE] -= shared->tmp.ios[WRITE];
+		p->merges[READ] -= shared->tmp.merges[READ];
+		p->merges[WRITE] -= shared->tmp.merges[WRITE];
+		p->ticks[READ] -= shared->tmp.ticks[READ];
+		p->ticks[WRITE] -= shared->tmp.ticks[WRITE];
+		p->io_ticks[READ] -= shared->tmp.io_ticks[READ];
+		p->io_ticks[WRITE] -= shared->tmp.io_ticks[WRITE];
+		p->io_ticks_total -= shared->tmp.io_ticks_total;
+		p->time_in_queue -= shared->tmp.time_in_queue;
+		local_irq_enable();
+	}
+}
+
+static int dm_stats_clear(struct dm_stats *stats, int id)
+{
+	struct dm_stat *s;
+
+	mutex_lock(&stats->mutex);
+
+	s = __dm_stats_find(stats, id);
+	if (!s) {
+		mutex_unlock(&stats->mutex);
+		return -ENOENT;
+	}
+
+	__dm_stat_clear(s, 0, s->n_entries, true);
+
+	mutex_unlock(&stats->mutex);
+
+	return 1;
+}
+
+/*
+ * This is like jiffies_to_msec, but works for 64-bit values.
+ */
+static unsigned long long dm_jiffies_to_msec64(unsigned long long j)
+{
+	unsigned long long result = 0;
+	unsigned mult;
+
+	if (j)
+		result = jiffies_to_msecs(j & 0x3fffff);
+	if (j >= 1 << 22) {
+		mult = jiffies_to_msecs(1 << 22);
+		result += (unsigned long long)mult * (unsigned long long)jiffies_to_msecs((j >> 22) & 0x3fffff);
+	}
+	if (j >= 1ULL << 44)
+		result += (unsigned long long)mult * (unsigned long long)mult * (unsigned long long)jiffies_to_msecs(j >> 44);
+
+	return result;
+}
+
+static int dm_stats_print(struct dm_stats *stats, int id,
+			  size_t idx_start, size_t idx_len,
+			  bool clear, char *result, unsigned maxlen)
+{
+	unsigned sz = 0;
+	struct dm_stat *s;
+	size_t x;
+	sector_t start, end, step;
+	size_t idx_end;
+	struct dm_stat_shared *shared;
+
+	/*
+	 * Output format:
+	 *   <start_sector>+<length> counters
+	 */
+
+	mutex_lock(&stats->mutex);
+
+	s = __dm_stats_find(stats, id);
+	if (!s) {
+		mutex_unlock(&stats->mutex);
+		return -ENOENT;
+	}
+
+	idx_end = idx_start + idx_len;
+	if (idx_end < idx_start ||
+	    idx_end > s->n_entries)
+		idx_end = s->n_entries;
+
+	if (idx_start > idx_end)
+		idx_start = idx_end;
+
+	step = s->step;
+	start = s->start + (step * idx_start);
+
+	for (x = idx_start; x < idx_end; x++, start = end) {
+		shared = &s->stat_shared[x];
+		end = start + step;
+		if (unlikely(end > s->end))
+			end = s->end;
+
+		__dm_stat_init_temporary_percpu_totals(shared, s, x);
+
+		DMEMIT("%llu+%llu %llu %llu %llu %llu %llu %llu %llu %llu %d %llu %llu %llu %llu\n",
+		       (unsigned long long)start,
+		       (unsigned long long)step,
+		       shared->tmp.ios[READ],
+		       shared->tmp.merges[READ],
+		       shared->tmp.sectors[READ],
+		       dm_jiffies_to_msec64(shared->tmp.ticks[READ]),
+		       shared->tmp.ios[WRITE],
+		       shared->tmp.merges[WRITE],
+		       shared->tmp.sectors[WRITE],
+		       dm_jiffies_to_msec64(shared->tmp.ticks[WRITE]),
+		       dm_stat_in_flight(shared),
+		       dm_jiffies_to_msec64(shared->tmp.io_ticks_total),
+		       dm_jiffies_to_msec64(shared->tmp.time_in_queue),
+		       dm_jiffies_to_msec64(shared->tmp.io_ticks[READ]),
+		       dm_jiffies_to_msec64(shared->tmp.io_ticks[WRITE]));
+
+		if (unlikely(sz + 1 >= maxlen))
+			goto buffer_overflow;
+	}
+
+	if (clear)
+		__dm_stat_clear(s, idx_start, idx_end, false);
+
+buffer_overflow:
+	mutex_unlock(&stats->mutex);
+
+	return 1;
+}
+
+static int dm_stats_set_aux(struct dm_stats *stats, int id, const char *aux_data)
+{
+	struct dm_stat *s;
+	const char *new_aux_data;
+
+	mutex_lock(&stats->mutex);
+
+	s = __dm_stats_find(stats, id);
+	if (!s) {
+		mutex_unlock(&stats->mutex);
+		return -ENOENT;
+	}
+
+	new_aux_data = kstrdup(aux_data, GFP_KERNEL);
+	if (!new_aux_data) {
+		mutex_unlock(&stats->mutex);
+		return -ENOMEM;
+	}
+
+	kfree(s->aux_data);
+	s->aux_data = new_aux_data;
+
+	mutex_unlock(&stats->mutex);
+
+	return 0;
+}
+
+static int message_stats_create(struct mapped_device *md,
+				unsigned argc, char **argv,
+				char *result, unsigned maxlen)
+{
+	int id;
+	char dummy;
+	unsigned long long start, end, len, step;
+	unsigned divisor;
+	const char *program_id, *aux_data;
+
+	/*
+	 * Input format:
+	 *   <range> <step> [<program_id> [<aux_data>]]
+	 */
+
+	if (argc < 3 || argc > 5)
+		return -EINVAL;
+
+	if (!strcmp(argv[1], "-")) {
+		start = 0;
+		len = dm_get_size(md);
+		if (!len)
+			len = 1;
+	} else if (sscanf(argv[1], "%llu+%llu%c", &start, &len, &dummy) != 2 ||
+		   start != (sector_t)start || len != (sector_t)len)
+		return -EINVAL;
+
+	end = start + len;
+	if (start >= end)
+		return -EINVAL;
+
+	if (sscanf(argv[2], "/%u%c", &divisor, &dummy) == 1) {
+		step = end - start;
+		if (do_div(step, divisor))
+			step++;
+		if (!step)
+			step = 1;
+	} else if (sscanf(argv[2], "%llu%c", &step, &dummy) != 1 ||
+		   step != (sector_t)step || !step)
+		return -EINVAL;
+
+	program_id = "-";
+	aux_data = "-";
+
+	if (argc > 3)
+		program_id = argv[3];
+
+	if (argc > 4)
+		aux_data = argv[4];
+
+	/*
+	 * If a buffer overflow happens after we created the region,
+	 * it's too late (the userspace would retry with a larger
+	 * buffer, but the region id that caused the overflow is already
+	 * leaked). So we must detect buffer overflow in advance.
+	 */
+	snprintf(result, maxlen, "%d", INT_MAX);
+	if (dm_message_test_buffer_overflow(result, maxlen))
+		return 1;
+
+	id = dm_stats_create(dm_get_stats(md), start, end, step, program_id, aux_data,
+			     dm_internal_suspend, dm_internal_resume, md);
+	if (id < 0)
+		return id;
+
+	snprintf(result, maxlen, "%d", id);
+
+	return 1;
+}
+
+static int message_stats_delete(struct mapped_device *md,
+				unsigned argc, char **argv)
+{
+	int id;
+	char dummy;
+
+	if (argc != 2)
+		return -EINVAL;
+
+	if (sscanf(argv[1], "%d%c", &id, &dummy) != 1 || id < 0)
+		return -EINVAL;
+
+	return dm_stats_delete(dm_get_stats(md), id);
+}
+
+static int message_stats_clear(struct mapped_device *md,
+			       unsigned argc, char **argv)
+{
+	int id;
+	char dummy;
+
+	if (argc != 2)
+		return -EINVAL;
+
+	if (sscanf(argv[1], "%d%c", &id, &dummy) != 1 || id < 0)
+		return -EINVAL;
+
+	return dm_stats_clear(dm_get_stats(md), id);
+}
+
+static int message_stats_list(struct mapped_device *md,
+			      unsigned argc, char **argv,
+			      char *result, unsigned maxlen)
+{
+	int r;
+	const char *program = NULL;
+
+	if (argc < 1 || argc > 2)
+		return -EINVAL;
+
+	if (argc > 1) {
+		program = kstrdup(argv[1], GFP_KERNEL);
+		if (!program)
+			return -ENOMEM;
+	}
+
+	r = dm_stats_list(dm_get_stats(md), program, result, maxlen);
+
+	kfree(program);
+
+	return r;
+}
+
+static int message_stats_print(struct mapped_device *md,
+			       unsigned argc, char **argv, bool clear,
+			       char *result, unsigned maxlen)
+{
+	int id;
+	char dummy;
+	unsigned long idx_start = 0, idx_len = ULONG_MAX;
+
+	if (argc != 2 && argc != 4)
+		return -EINVAL;
+
+	if (sscanf(argv[1], "%d%c", &id, &dummy) != 1 || id < 0)
+		return -EINVAL;
+
+	if (argc > 3) {
+		if (strcmp(argv[2], "-") &&
+		    sscanf(argv[2], "%lu%c", &idx_start, &dummy) != 1)
+			return -EINVAL;
+		if (strcmp(argv[3], "-") &&
+		    sscanf(argv[3], "%lu%c", &idx_len, &dummy) != 1)
+			return -EINVAL;
+	}
+
+	return dm_stats_print(dm_get_stats(md), id, idx_start, idx_len, clear,
+			      result, maxlen);
+}
+
+static int message_stats_set_aux(struct mapped_device *md,
+				 unsigned argc, char **argv)
+{
+	int id;
+	char dummy;
+
+	if (argc != 3)
+		return -EINVAL;
+
+	if (sscanf(argv[1], "%d%c", &id, &dummy) != 1 || id < 0)
+		return -EINVAL;
+
+	return dm_stats_set_aux(dm_get_stats(md), id, argv[2]);
+}
+
+int dm_stats_message(struct mapped_device *md, unsigned argc, char **argv,
+		     char *result, unsigned maxlen)
+{
+	int r;
+
+	if (dm_request_based(md)) {
+		DMWARN("Statistics are only supported for bio-based devices");
+		return -EOPNOTSUPP;
+	}
+
+	/* All messages here must start with '@' */
+	if (!strcasecmp(argv[0], "@stats_create"))
+		r = message_stats_create(md, argc, argv, result, maxlen);
else if (!strcasecmp(argv[0], "@stats_delete")) 934 + r = message_stats_delete(md, argc, argv); 935 + else if (!strcasecmp(argv[0], "@stats_clear")) 936 + r = message_stats_clear(md, argc, argv); 937 + else if (!strcasecmp(argv[0], "@stats_list")) 938 + r = message_stats_list(md, argc, argv, result, maxlen); 939 + else if (!strcasecmp(argv[0], "@stats_print")) 940 + r = message_stats_print(md, argc, argv, false, result, maxlen); 941 + else if (!strcasecmp(argv[0], "@stats_print_clear")) 942 + r = message_stats_print(md, argc, argv, true, result, maxlen); 943 + else if (!strcasecmp(argv[0], "@stats_set_aux")) 944 + r = message_stats_set_aux(md, argc, argv); 945 + else 946 + return 2; /* this wasn't a stats message */ 947 + 948 + if (r == -EINVAL) 949 + DMWARN("Invalid parameters for message %s", argv[0]); 950 + 951 + return r; 952 + } 953 + 954 + int __init dm_statistics_init(void) 955 + { 956 + dm_stat_need_rcu_barrier = 0; 957 + return 0; 958 + } 959 + 960 + void dm_statistics_exit(void) 961 + { 962 + if (dm_stat_need_rcu_barrier) 963 + rcu_barrier(); 964 + if (WARN_ON(shared_memory_amount)) 965 + DMCRIT("shared_memory_amount leaked: %lu", shared_memory_amount); 966 + } 967 + 968 + module_param_named(stats_current_allocated_bytes, shared_memory_amount, ulong, S_IRUGO); 969 + MODULE_PARM_DESC(stats_current_allocated_bytes, "Memory currently used by statistics");
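The argument handling in message_stats_create() can be tried out in userspace. What follows is a minimal sketch, not the kernel code: parse_range() and step_from_divisor() are hypothetical names, userspace sscanf() stands in for the kernel's, and the zero-divisor guard is an addition of this sketch (the patch as shown passes the divisor straight to do_div()). It illustrates two things: the trailing "%c" conversion that rejects input with extra characters, and the rounded-up step used for the "/<divisor>" form.

```c
#include <stdio.h>

/*
 * Hypothetical helper: parse "<start>+<len>".
 * Returns 0 on success, -1 on malformed input.
 */
static int parse_range(const char *arg, unsigned long long *start,
		       unsigned long long *len)
{
	char dummy;

	/*
	 * Exactly two conversions must succeed. If the trailing %c also
	 * matches, there were extra characters after the length.
	 */
	if (sscanf(arg, "%llu+%llu%c", start, len, &dummy) != 2)
		return -1;

	/* Reject empty ranges and wrap-around of start + len. */
	if (*start + *len <= *start)
		return -1;

	return 0;
}

/*
 * Hypothetical helper: step size for the "/<divisor>" form, i.e. the
 * range length divided by divisor, rounded up, and never less than 1.
 * Returns 0 for a zero divisor (assumption of this sketch).
 */
static unsigned long long step_from_divisor(unsigned long long len,
					    unsigned divisor)
{
	unsigned long long step;

	if (!divisor)
		return 0;

	step = len / divisor;
	if (len % divisor)
		step++;
	if (!step)
		step = 1;

	return step;
}
```

Rounding up means the last area of a region may be shorter than the others, but the number of areas never exceeds the requested divisor.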
+40
drivers/md/dm-stats.h
···
+#ifndef DM_STATS_H
+#define DM_STATS_H
+
+#include <linux/types.h>
+#include <linux/mutex.h>
+#include <linux/list.h>
+
+int dm_statistics_init(void);
+void dm_statistics_exit(void);
+
+struct dm_stats {
+	struct mutex mutex;
+	struct list_head list;	/* list of struct dm_stat */
+	struct dm_stats_last_position __percpu *last;
+	sector_t last_sector;
+	unsigned last_rw;
+};
+
+struct dm_stats_aux {
+	bool merged;
+};
+
+void dm_stats_init(struct dm_stats *st);
+void dm_stats_cleanup(struct dm_stats *st);
+
+struct mapped_device;
+
+int dm_stats_message(struct mapped_device *md, unsigned argc, char **argv,
+		     char *result, unsigned maxlen);
+
+void dm_stats_account_io(struct dm_stats *stats, unsigned long bi_rw,
+			 sector_t bi_sector, unsigned bi_sectors, bool end,
+			 unsigned long duration, struct dm_stats_aux *aux);
+
+static inline bool dm_stats_used(struct dm_stats *st)
+{
+	return !list_empty(&st->list);
+}
+
+#endif
+62 -3
drivers/md/dm.c
···
 	struct bio *bio;
 	unsigned long start_time;
 	spinlock_t endio_lock;
+	struct dm_stats_aux stats_aux;
 };

 /*
···

 	/* zero-length flush that will be cloned and submitted to targets */
 	struct bio flush_bio;
+
+	struct dm_stats stats;
 };

 /*
···
 	dm_io_init,
 	dm_kcopyd_init,
 	dm_interface_init,
+	dm_statistics_init,
 };

 static void (*_exits[])(void) = {
···
 	dm_io_exit,
 	dm_kcopyd_exit,
 	dm_interface_exit,
+	dm_statistics_exit,
 };

 static int __init dm_init(void)
···
 	return r;
 }

+sector_t dm_get_size(struct mapped_device *md)
+{
+	return get_capacity(md->disk);
+}
+
+struct dm_stats *dm_get_stats(struct mapped_device *md)
+{
+	return &md->stats;
+}
+
 static int dm_blk_getgeo(struct block_device *bdev, struct hd_geometry *geo)
 {
 	struct mapped_device *md = bdev->bd_disk->private_data;
···
 static void start_io_acct(struct dm_io *io)
 {
 	struct mapped_device *md = io->md;
+	struct bio *bio = io->bio;
 	int cpu;
-	int rw = bio_data_dir(io->bio);
+	int rw = bio_data_dir(bio);

 	io->start_time = jiffies;

···
 	part_stat_unlock();
 	atomic_set(&dm_disk(md)->part0.in_flight[rw],
 		atomic_inc_return(&md->pending[rw]));
+
+	if (unlikely(dm_stats_used(&md->stats)))
+		dm_stats_account_io(&md->stats, bio->bi_rw, bio->bi_sector,
+				    bio_sectors(bio), false, 0, &io->stats_aux);
 }

 static void end_io_acct(struct dm_io *io)
···
 	part_round_stats(cpu, &dm_disk(md)->part0);
 	part_stat_add(cpu, &dm_disk(md)->part0, ticks[rw], duration);
 	part_stat_unlock();
+
+	if (unlikely(dm_stats_used(&md->stats)))
+		dm_stats_account_io(&md->stats, bio->bi_rw, bio->bi_sector,
+				    bio_sectors(bio), true, duration, &io->stats_aux);

 	/*
 	 * After this is decremented the bio must not be touched if it is
···
 	return;
 }

-static int dm_request_based(struct mapped_device *md)
+int dm_request_based(struct mapped_device *md)
 {
 	return blk_queue_stackable(md->queue);
 }
···
 	md->flush_bio.bi_bdev = md->bdev;
 	md->flush_bio.bi_rw = WRITE_FLUSH;

+	dm_stats_init(&md->stats);
+
 	/* Populate the mapping, nobody knows we exist yet */
 	spin_lock(&_minor_lock);
 	old_md = idr_replace(&_minor_idr, md, minor);
···

 	put_disk(md->disk);
 	blk_cleanup_queue(md->queue);
+	dm_stats_cleanup(&md->stats);
 	module_put(THIS_MODULE);
 	kfree(md);
 }
···
 	/*
 	 * Wipe any geometry if the size of the table changed.
 	 */
-	if (size != get_capacity(md->disk))
+	if (size != dm_get_size(md))
 		memset(&md->geometry, 0, sizeof(md->geometry));

 	__set_size(md, size);
···
 	mutex_unlock(&md->suspend_lock);

 	return r;
+}
+
+/*
+ * Internal suspend/resume works like userspace-driven suspend. It waits
+ * until all bios finish and prevents issuing new bios to the target drivers.
+ * It may be used only from the kernel.
+ *
+ * Internal suspend holds md->suspend_lock, which prevents interaction with
+ * userspace-driven suspend.
+ */
+
+void dm_internal_suspend(struct mapped_device *md)
+{
+	mutex_lock(&md->suspend_lock);
+	if (dm_suspended_md(md))
+		return;
+
+	set_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags);
+	synchronize_srcu(&md->io_barrier);
+	flush_workqueue(md->wq);
+	dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
+}
+
+void dm_internal_resume(struct mapped_device *md)
+{
+	if (dm_suspended_md(md))
+		goto done;
+
+	dm_queue_flush(md);
+
+done:
+	mutex_unlock(&md->suspend_lock);
 }

 /*-----------------------------------------------------------------
+16
drivers/md/dm.h
···
 #include <linux/blkdev.h>
 #include <linux/hdreg.h>

+#include "dm-stats.h"
+
 /*
  * Suspend feature flags
  */
···
 void dm_destroy_immediate(struct mapped_device *md);
 int dm_open_count(struct mapped_device *md);
 int dm_lock_for_deletion(struct mapped_device *md);
+int dm_request_based(struct mapped_device *md);
+sector_t dm_get_size(struct mapped_device *md);
+struct dm_stats *dm_get_stats(struct mapped_device *md);

 int dm_kobject_uevent(struct mapped_device *md, enum kobject_action action,
 		      unsigned cookie);
+
+void dm_internal_suspend(struct mapped_device *md);
+void dm_internal_resume(struct mapped_device *md);

 int dm_io_init(void);
 void dm_io_exit(void);
···
  */
 struct dm_md_mempools *dm_alloc_md_mempools(unsigned type, unsigned integrity, unsigned per_bio_data_size);
 void dm_free_md_mempools(struct dm_md_mempools *pools);
+
+/*
+ * Helpers that are used by DM core
+ */
+static inline bool dm_message_test_buffer_overflow(char *result, unsigned maxlen)
+{
+	return !maxlen || strlen(result) + 1 >= maxlen;
+}

 #endif
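The overflow helper added to dm.h is easiest to understand together with the INT_MAX probe that message_stats_create() performs before creating a region. The following is a userspace sketch under stated assumptions: message_test_buffer_overflow() mirrors the helper's expression, while can_hold_any_id() is a hypothetical wrapper that does not exist in the patch. It assumes a platform where INT_MAX prints as 10 decimal digits.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Same expression as the dm.h helper: a result that fills the buffer
 * completely is treated as overflow, because after snprintf() it cannot
 * be told apart from a truncated one.
 */
static bool message_test_buffer_overflow(const char *result, unsigned maxlen)
{
	return !maxlen || strlen(result) + 1 >= maxlen;
}

/*
 * Hypothetical wrapper: format the largest possible id first. If that
 * fits, any id handed out later will fit as well, so no region can be
 * created whose id the caller never gets to see.
 */
static bool can_hold_any_id(char *result, unsigned maxlen)
{
	snprintf(result, maxlen, "%d", INT_MAX);
	return !message_test_buffer_overflow(result, maxlen);
}
```

Note that the check is conservative: with 10-digit ids, a buffer of 11 bytes would hold the string and its terminator, yet it is still reported as too small.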
+9
include/linux/device-mapper.h
···

 #include <linux/bio.h>
 #include <linux/blkdev.h>
+#include <linux/math64.h>
 #include <linux/ratelimit.h>

 struct dm_dev;
···
 #define DM_MAPIO_SUBMITTED	0
 #define DM_MAPIO_REMAPPED	1
 #define DM_MAPIO_REQUEUE	DM_ENDIO_REQUEUE
+
+#define dm_sector_div64(x, y)( \
+{ \
+	u64 _res; \
+	(x) = div64_u64_rem(x, y, &_res); \
+	_res; \
+} \
+)

 /*
  * Ceiling(n / sz)
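The dm_sector_div64() macro follows the do_div() convention: the first argument is overwritten with the quotient and the expression evaluates to the remainder. A portable userspace sketch of the same contract is shown here; sector_div64() is a hypothetical stand-in that takes a pointer instead of relying on the GCC statement-expression extension the macro uses.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for dm_sector_div64(): divide *x by y in place
 * and return the remainder. The caller must ensure y is non-zero.
 */
static uint64_t sector_div64(uint64_t *x, uint64_t y)
{
	uint64_t rem = *x % y;

	*x /= y;
	return rem;
}
```

For instance, dividing a sector offset by a step size this way yields the index of the step-sized area in the updated variable and the offset within that area as the return value.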
+2 -2
include/uapi/linux/dm-ioctl.h
···
 #define DM_DEV_SET_GEOMETRY	_IOWR(DM_IOCTL, DM_DEV_SET_GEOMETRY_CMD, struct dm_ioctl)

 #define DM_VERSION_MAJOR	4
-#define DM_VERSION_MINOR	25
+#define DM_VERSION_MINOR	26
 #define DM_VERSION_PATCHLEVEL	0
-#define DM_VERSION_EXTRA	"-ioctl (2013-06-26)"
+#define DM_VERSION_EXTRA	"-ioctl (2013-08-15)"

 /* Status bits */
 #define DM_READONLY_FLAG	(1 << 0) /* In/Out */