Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'dm-3.9-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm

Pull device-mapper update from Alasdair G Kergon:
"The main addition here is a long-desired target framework to allow an
SSD to be used as a cache in front of a slower device. Cache tuning
is delegated to interchangeable policy modules so these can be
developed independently of the mechanics needed to shuffle the data
around.

Other than that, kcopyd users acquire a throttling parameter, ioctl
buffer usage gets streamlined, reliance on mempools is further reduced,
and there are a few other bug fixes and tidy-ups."

* tag 'dm-3.9-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm: (30 commits)
dm cache: add cleaner policy
dm cache: add mq policy
dm: add cache target
dm persistent data: add bitset
dm persistent data: add transactional array
dm thin: remove cells from stack
dm bio prison: pass cell memory in
dm persistent data: add btree_walk
dm: add target num_write_bios fn
dm kcopyd: introduce configurable throttling
dm ioctl: allow message to return data
dm ioctl: optimize functions without variable params
dm ioctl: introduce ioctl_flags
dm: merge io_pool and tio_pool
dm: remove unused _rq_bio_info_cache
dm: fix limits initialization when there are no data devices
dm snapshot: add missing module aliases
dm persistent data: set some btree fn parms const
dm: refactor bio cloning
dm: rename bio cloning functions
...

+8792 -602
+77
Documentation/device-mapper/cache-policies.txt
Guidance for writing policies
=============================

Try to keep transactionality out of it.  The core is careful to
avoid asking about anything that is migrating.  This is a pain, but
makes it easier to write the policies.

Mappings are loaded into the policy at construction time.

Every bio that is mapped by the target is referred to the policy.
The policy can return a simple HIT or MISS or issue a migration.

Currently there's no way for the policy to issue background work,
e.g. to start writing back dirty blocks that are going to be evicted
soon.

Because we map bios, rather than requests, it's easy for the policy
to get fooled by many small bios.  For this reason the core target
issues periodic ticks to the policy.  It's suggested that the policy
doesn't update states (eg, hit counts) for a block more than once
for each tick.  The core ticks by watching bios complete, and so
trying to see when the io scheduler has let the ios run.


Overview of supplied cache replacement policies
===============================================

multiqueue
----------

This policy is the default.

The multiqueue policy has two sets of 16 queues: one set for entries
waiting for the cache and another one for those in the cache.
Cache entries in the queues are aged based on logical time.  Entry into
the cache is based on variable thresholds and queue selection is based
on hit count on entry.  The policy aims to take different cache miss
costs into account and to adjust to varying load patterns automatically.

Message and constructor argument pairs are:
	'sequential_threshold <#nr_sequential_ios>' and
	'random_threshold <#nr_random_ios>'.

The sequential threshold indicates the number of contiguous I/Os
required before a stream is treated as sequential.  The random threshold
is the number of intervening non-contiguous I/Os that must be seen
before the stream is treated as random again.

The sequential and random thresholds default to 512 and 4 respectively.

Large, sequential ios are probably better left on the origin device
since spindles tend to have good bandwidth.  The io_tracker counts
contiguous I/Os to try to spot when the io is in one of these sequential
modes.

cleaner
-------

The cleaner writes back all dirty blocks in a cache to decommission it.

Examples
========

The syntax for a table is:
	cache <metadata dev> <cache dev> <origin dev> <block size>
	<#feature_args> [<feature arg>]*
	<policy> <#policy_args> [<policy arg>]*

The syntax to send a message using the dmsetup command is:
	dmsetup message <mapped device> 0 sequential_threshold 1024
	dmsetup message <mapped device> 0 random_threshold 8

Using dmsetup:
	dmsetup create blah --table "0 268435456 cache /dev/sdb /dev/sdc \
	    /dev/sdd 512 0 mq 4 sequential_threshold 1024 random_threshold 8"
	creates a 128GB large mapped device named 'blah' with the
	sequential threshold set to 1024 and the random_threshold set to 8.
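The sequential/random classification described above can be sketched as a small userspace model: count contiguous I/Os until the sequential threshold is reached, and fall back to random after enough non-contiguous I/Os. This is an illustration of the idea only, not the kernel's io_tracker implementation; all names here are made up.

```python
class IoTracker:
    """Toy model of the io_tracker idea: classify a bio stream as
    sequential or random based on contiguity thresholds."""

    def __init__(self, seq_threshold=512, rand_threshold=4):
        self.seq_threshold = seq_threshold
        self.rand_threshold = rand_threshold
        self.next_sector = None   # sector expected if the stream is contiguous
        self.nr_seq = 0           # contiguous I/Os seen so far
        self.nr_rand = 0          # non-contiguous I/Os seen since last run
        self.mode = "random"

    def note_io(self, sector, nr_sectors):
        if sector == self.next_sector:
            self.nr_seq += 1
            if self.nr_seq >= self.seq_threshold:
                self.mode = "sequential"
                self.nr_rand = 0
        else:
            self.nr_rand += 1
            self.nr_seq = 0
            if self.nr_rand >= self.rand_threshold:
                self.mode = "random"
        self.next_sector = sector + nr_sectors
        return self.mode

# A contiguous run of 8 KiB bios trips the (lowered) sequential threshold.
tracker = IoTracker(seq_threshold=4, rand_threshold=2)
for i in range(8):
    mode = tracker.note_io(i * 8, 8)
print(mode)   # sequential
```

With the default thresholds (512 and 4), a stream needs 512 contiguous bios to be treated as sequential, which is why many small contiguous bios (rather than one large request) still get classified correctly.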
+243
Documentation/device-mapper/cache.txt
Introduction
============

dm-cache is a device mapper target written by Joe Thornber, Heinz
Mauelshagen, and Mike Snitzer.

It aims to improve performance of a block device (eg, a spindle) by
dynamically migrating some of its data to a faster, smaller device
(eg, an SSD).

This device-mapper solution allows us to insert this caching at
different levels of the dm stack, for instance above the data device for
a thin-provisioning pool.  Caching solutions that are integrated more
closely with the virtual memory system should give better performance.

The target reuses the metadata library used in the thin-provisioning
library.

The decision as to what data to migrate and when is left to a plug-in
policy module.  Several of these have been written as we experiment,
and we hope other people will contribute others for specific io
scenarios (eg. a vm image server).

Glossary
========

  Migration -  Movement of the primary copy of a logical block from one
	       device to the other.
  Promotion -  Migration from slow device to fast device.
  Demotion  -  Migration from fast device to slow device.

The origin device always contains a copy of the logical block, which
may be out of date or kept in sync with the copy on the cache device
(depending on policy).

Design
======

Sub-devices
-----------

The target is constructed by passing three devices to it (along with
other parameters detailed later):

1. An origin device - the big, slow one.

2. A cache device - the small, fast one.

3. A small metadata device - records which blocks are in the cache,
   which are dirty, and extra hints for use by the policy object.
   This information could be put on the cache device, but having it
   separate allows the volume manager to configure it differently,
   e.g. as a mirror for extra robustness.

Fixed block size
----------------

The origin is divided up into blocks of a fixed size.  This block size
is configurable when you first create the cache.  Typically we've been
using block sizes of 256k - 1024k.

Having a fixed block size simplifies the target a lot.  But it is
something of a compromise.  For instance, a small part of a block may be
getting hit a lot, yet the whole block will be promoted to the cache.
So large block sizes are bad because they waste cache space.  And small
block sizes are bad because they increase the amount of metadata (both
in core and on disk).

Writeback/writethrough
----------------------

The cache has two modes, writeback and writethrough.

If writeback, the default, is selected then a write to a block that is
cached will go only to the cache and the block will be marked dirty in
the metadata.

If writethrough is selected then a write to a cached block will not
complete until it has hit both the origin and cache devices.  Clean
blocks should remain clean.

A simple cleaner policy is provided, which will clean (write back) all
dirty blocks in a cache.  Useful for decommissioning a cache.

Migration throttling
--------------------

Migrating data between the origin and cache device uses bandwidth.
The user can set a throttle to prevent more than a certain amount of
migration occurring at any one time.  Currently we're not taking any
account of normal io traffic going to the devices.  More work needs
doing here to avoid migrating during those peak io moments.

For the time being, a message "migration_threshold <#sectors>"
can be used to set the maximum number of sectors being migrated,
the default being 204800 sectors (or 100MB).

Updating on-disk metadata
-------------------------

On-disk metadata is committed every time a REQ_SYNC or REQ_FUA bio is
written.  If no such requests are made then commits will occur every
second.  This means the cache behaves like a physical disk that has a
write cache (the same is true of the thin-provisioning target).  If
power is lost you may lose some recent writes.  The metadata should
always be consistent in spite of any crash.

The 'dirty' state for a cache block changes far too frequently for us
to keep updating it on the fly.  So we treat it as a hint.  In normal
operation it will be written when the dm device is suspended.  If the
system crashes all cache blocks will be assumed dirty when restarted.

Per-block policy hints
----------------------

Policy plug-ins can store a chunk of data per cache block.  It's up to
the policy how big this chunk is, but it should be kept small.  Like the
dirty flags this data is lost if there's a crash so a safe fallback
value should always be possible.

For instance, the 'mq' policy, which is currently the default policy,
uses this facility to store the hit count of the cache blocks.  If
there's a crash this information will be lost, which means the cache
may be less efficient until those hit counts are regenerated.

Policy hints affect performance, not correctness.

Policy messaging
----------------

Policies will have different tunables, specific to each one, so we
need a generic way of getting and setting these.  Device-mapper
messages are used.  Refer to cache-policies.txt.

Discard bitset resolution
-------------------------

We can avoid copying data during migration if we know the block has
been discarded.  A prime example of this is when mkfs discards the
whole block device.  We store a bitset tracking the discard state of
blocks.  However, we allow this bitset to have a different block size
from the cache blocks.  This is because we need to track the discard
state for all of the origin device (compare with the dirty bitset
which is just for the smaller cache device).

Target interface
================

Constructor
-----------

 cache <metadata dev> <cache dev> <origin dev> <block size>
       <#feature args> [<feature arg>]*
       <policy> <#policy args> [policy args]*

 metadata dev    : fast device holding the persistent metadata
 cache dev       : fast device holding cached data blocks
 origin dev      : slow device holding original data blocks
 block size      : cache unit size in sectors

 #feature args   : number of feature arguments passed
 feature args    : writethrough.  (The default is writeback.)

 policy          : the replacement policy to use
 #policy args    : an even number of arguments corresponding to
                   key/value pairs passed to the policy
 policy args     : key/value pairs passed to the policy
                   E.g. 'sequential_threshold 1024'
                   See cache-policies.txt for details.

Optional feature arguments are:
   writethrough  : write through caching that prohibits cache block
                   content from being different from origin block content.
                   Without this argument, the default behaviour is to write
                   back cache block contents later for performance reasons,
                   so they may differ from the corresponding origin blocks.

A policy called 'default' is always registered.  This is an alias for
the policy we currently think is giving best all round performance.

As the default policy could vary between kernels, if you are relying on
the characteristics of a specific policy, always request it by name.

Status
------

<#used metadata blocks>/<#total metadata blocks> <#read hits> <#read misses>
<#write hits> <#write misses> <#demotions> <#promotions> <#blocks in cache>
<#dirty> <#features> <features>* <#core args> <core args>* <#policy args>
<policy args>*

#used metadata blocks    : Number of metadata blocks used
#total metadata blocks   : Total number of metadata blocks
#read hits               : Number of times a READ bio has been mapped
                           to the cache
#read misses             : Number of times a READ bio has been mapped
                           to the origin
#write hits              : Number of times a WRITE bio has been mapped
                           to the cache
#write misses            : Number of times a WRITE bio has been
                           mapped to the origin
#demotions               : Number of times a block has been removed
                           from the cache
#promotions              : Number of times a block has been moved to
                           the cache
#blocks in cache         : Number of blocks resident in the cache
#dirty                   : Number of blocks in the cache that differ
                           from the origin
#feature args            : Number of feature args to follow
feature args             : 'writethrough' (optional)
#core args               : Number of core arguments (must be even)
core args                : Key/value pairs for tuning the core
                           e.g. migration_threshold
#policy args             : Number of policy arguments to follow (must be even)
policy args              : Key/value pairs
                           e.g. 'sequential_threshold 1024'

Messages
--------

Policies will have different tunables, specific to each one, so we
need a generic way of getting and setting these.  Device-mapper
messages are used.  (A sysfs interface would also be possible.)

The message format is:

   <key> <value>

E.g.
   dmsetup message my_cache 0 sequential_threshold 1024

Examples
========

The test suite can be found here:

https://github.com/jthornber/thinp-test-suite

dmsetup create my_cache --table '0 41943040 cache /dev/mapper/metadata \
	/dev/mapper/ssd /dev/mapper/origin 512 1 writeback default 0'
dmsetup create my_cache --table '0 41943040 cache /dev/mapper/metadata \
	/dev/mapper/ssd /dev/mapper/origin 1024 1 writeback \
	mq 4 sequential_threshold 1024 random_threshold 8'
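The status line above has a fixed field order followed by three counted, variable-length argument lists. A small parser makes the layout concrete; the sample line below is fabricated for illustration (real values come from `dmsetup status <mapped device>`), and the function name is ours, not part of any tool.

```python
def parse_cache_status(line):
    """Parse a dm-cache status line per the documented field order:
    fixed counters first, then counted feature/core/policy arg lists."""
    f = line.split()
    used, total = f[0].split("/")
    stat = {
        "metadata_used": int(used),
        "metadata_total": int(total),
        "read_hits": int(f[1]),
        "read_misses": int(f[2]),
        "write_hits": int(f[3]),
        "write_misses": int(f[4]),
        "demotions": int(f[5]),
        "promotions": int(f[6]),
        "blocks_in_cache": int(f[7]),
        "dirty": int(f[8]),
    }
    # Each list is prefixed by its element count, so parse count-then-slice.
    nr_features = int(f[9])
    stat["features"] = f[10:10 + nr_features]
    rest = f[10 + nr_features:]
    nr_core = int(rest[0])
    stat["core_args"] = rest[1:1 + nr_core]
    rest = rest[1 + nr_core:]
    nr_policy = int(rest[0])
    stat["policy_args"] = rest[1:1 + nr_policy]
    return stat

sample = ("23/4096 1000 80 500 20 4 12 512 3 "
          "1 writethrough 2 migration_threshold 204800 "
          "2 sequential_threshold 1024")
s = parse_cache_status(sample)
print(s["read_hits"], s["policy_args"])   # 1000 ['sequential_threshold', '1024']
```

The count-prefixed lists are why core args and policy args "must be even": they are key/value pairs, and the counts let a parser skip lists it does not understand.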
+43 -12
drivers/md/Kconfig
···
 config DM_BUFIO
 	tristate
-	depends on BLK_DEV_DM && EXPERIMENTAL
+	depends on BLK_DEV_DM
 	---help---
 	 This interface allows you to do buffered I/O on a device and acts
 	 as a cache, holding recently-read blocks in memory and performing
···
 config DM_BIO_PRISON
 	tristate
-	depends on BLK_DEV_DM && EXPERIMENTAL
+	depends on BLK_DEV_DM
 	---help---
 	 Some bio locking schemes used by other device-mapper targets
 	 including thin provisioning.
···
 	 Allow volume managers to take writable snapshots of a device.

 config DM_THIN_PROVISIONING
-	tristate "Thin provisioning target (EXPERIMENTAL)"
-	depends on BLK_DEV_DM && EXPERIMENTAL
+	tristate "Thin provisioning target"
+	depends on BLK_DEV_DM
 	select DM_PERSISTENT_DATA
 	select DM_BIO_PRISON
 	---help---
···
 	 block manager locking used by thin provisioning.

 	 If unsure, say N.
+
+config DM_CACHE
+       tristate "Cache target (EXPERIMENTAL)"
+       depends on BLK_DEV_DM
+       default n
+       select DM_PERSISTENT_DATA
+       select DM_BIO_PRISON
+       ---help---
+         dm-cache attempts to improve performance of a block device by
+         moving frequently used data to a smaller, higher performance
+         device.  Different 'policy' plugins can be used to change the
+         algorithms used to select which blocks are promoted, demoted,
+         cleaned etc.  It supports writeback and writethrough modes.
+
+config DM_CACHE_MQ
+       tristate "MQ Cache Policy (EXPERIMENTAL)"
+       depends on DM_CACHE
+       default y
+       ---help---
+         A cache policy that uses a multiqueue ordered by recent hit
+         count to select which blocks should be promoted and demoted.
+         This is meant to be a general purpose policy.  It prioritises
+         reads over writes.
+
+config DM_CACHE_CLEANER
+       tristate "Cleaner Cache Policy (EXPERIMENTAL)"
+       depends on DM_CACHE
+       default y
+       ---help---
+         A simple cache policy that writes back all data to the
+         origin.  Used when decommissioning a dm-cache.

 config DM_MIRROR
 	tristate "Mirror target"
···
 	 in one of the available parity distribution methods.

 config DM_LOG_USERSPACE
-	tristate "Mirror userspace logging (EXPERIMENTAL)"
-	depends on DM_MIRROR && EXPERIMENTAL && NET
+	tristate "Mirror userspace logging"
+	depends on DM_MIRROR && NET
 	select CONNECTOR
 	---help---
 	 The userspace logging module provides a mechanism for
···
 	 If unsure, say N.

 config DM_DELAY
-	tristate "I/O delaying target (EXPERIMENTAL)"
-	depends on BLK_DEV_DM && EXPERIMENTAL
+	tristate "I/O delaying target"
+	depends on BLK_DEV_DM
 	---help---
 	 A target that delays reads and/or writes and can send
 	 them to different devices.  Useful for testing.
···
 	 Generate udev events for DM events.

 config DM_FLAKEY
-	tristate "Flakey target (EXPERIMENTAL)"
-	depends on BLK_DEV_DM && EXPERIMENTAL
+	tristate "Flakey target"
+	depends on BLK_DEV_DM
 	---help---
 	 A target that intermittently fails I/O for debugging purposes.

 config DM_VERITY
-	tristate "Verity target support (EXPERIMENTAL)"
-	depends on BLK_DEV_DM && EXPERIMENTAL
+	tristate "Verity target support"
+	depends on BLK_DEV_DM
 	select CRYPTO
 	select CRYPTO_HASH
 	select DM_BUFIO
+6
drivers/md/Makefile
···
 dm-log-userspace-y \
 		+= dm-log-userspace-base.o dm-log-userspace-transfer.o
 dm-thin-pool-y	+= dm-thin.o dm-thin-metadata.o
+dm-cache-y	+= dm-cache-target.o dm-cache-metadata.o dm-cache-policy.o
+dm-cache-mq-y	+= dm-cache-policy-mq.o
+dm-cache-cleaner-y += dm-cache-policy-cleaner.o
 md-mod-y	+= md.o bitmap.o
 raid456-y	+= raid5.o
···
 obj-$(CONFIG_DM_RAID)			+= dm-raid.o
 obj-$(CONFIG_DM_THIN_PROVISIONING)	+= dm-thin-pool.o
 obj-$(CONFIG_DM_VERITY)			+= dm-verity.o
+obj-$(CONFIG_DM_CACHE)			+= dm-cache.o
+obj-$(CONFIG_DM_CACHE_MQ)		+= dm-cache-mq.o
+obj-$(CONFIG_DM_CACHE_CLEANER)		+= dm-cache-cleaner.o

 ifeq ($(CONFIG_DM_UEVENT),y)
 dm-mod-objs			+= dm-uevent.o
+84 -77
drivers/md/dm-bio-prison.c
···
 /*----------------------------------------------------------------*/

-struct dm_bio_prison_cell {
-	struct hlist_node list;
-	struct dm_bio_prison *prison;
-	struct dm_cell_key key;
-	struct bio *holder;
-	struct bio_list bios;
-};
-
 struct dm_bio_prison {
 	spinlock_t lock;
 	mempool_t *cell_pool;
···
 }
 EXPORT_SYMBOL_GPL(dm_bio_prison_destroy);

+struct dm_bio_prison_cell *dm_bio_prison_alloc_cell(struct dm_bio_prison *prison, gfp_t gfp)
+{
+	return mempool_alloc(prison->cell_pool, gfp);
+}
+EXPORT_SYMBOL_GPL(dm_bio_prison_alloc_cell);
+
+void dm_bio_prison_free_cell(struct dm_bio_prison *prison,
+			     struct dm_bio_prison_cell *cell)
+{
+	mempool_free(cell, prison->cell_pool);
+}
+EXPORT_SYMBOL_GPL(dm_bio_prison_free_cell);
+
 static uint32_t hash_key(struct dm_bio_prison *prison, struct dm_cell_key *key)
 {
 	const unsigned long BIG_PRIME = 4294967291UL;
···
 	return NULL;
 }

-/*
- * This may block if a new cell needs allocating.  You must ensure that
- * cells will be unlocked even if the calling thread is blocked.
- *
- * Returns 1 if the cell was already held, 0 if @inmate is the new holder.
- */
-int dm_bio_detain(struct dm_bio_prison *prison, struct dm_cell_key *key,
-		  struct bio *inmate, struct dm_bio_prison_cell **ref)
-{
-	int r = 1;
-	unsigned long flags;
-	uint32_t hash = hash_key(prison, key);
-	struct dm_bio_prison_cell *cell, *cell2;
-
-	BUG_ON(hash > prison->nr_buckets);
-
-	spin_lock_irqsave(&prison->lock, flags);
-
-	cell = __search_bucket(prison->cells + hash, key);
-	if (cell) {
-		bio_list_add(&cell->bios, inmate);
-		goto out;
-	}
-
-	/*
-	 * Allocate a new cell
-	 */
-	spin_unlock_irqrestore(&prison->lock, flags);
-	cell2 = mempool_alloc(prison->cell_pool, GFP_NOIO);
-	spin_lock_irqsave(&prison->lock, flags);
-
-	/*
-	 * We've been unlocked, so we have to double check that
-	 * nobody else has inserted this cell in the meantime.
-	 */
-	cell = __search_bucket(prison->cells + hash, key);
-	if (cell) {
-		mempool_free(cell2, prison->cell_pool);
-		bio_list_add(&cell->bios, inmate);
-		goto out;
-	}
-
-	/*
-	 * Use new cell.
-	 */
-	cell = cell2;
-
-	cell->prison = prison;
-	memcpy(&cell->key, key, sizeof(cell->key));
-	cell->holder = inmate;
-	bio_list_init(&cell->bios);
-	hlist_add_head(&cell->list, prison->cells + hash);
-
-	r = 0;
-
-out:
-	spin_unlock_irqrestore(&prison->lock, flags);
-
-	*ref = cell;
-
-	return r;
-}
+static void __setup_new_cell(struct dm_bio_prison *prison,
+			     struct dm_cell_key *key,
+			     struct bio *holder,
+			     uint32_t hash,
+			     struct dm_bio_prison_cell *cell)
+{
+	memcpy(&cell->key, key, sizeof(cell->key));
+	cell->holder = holder;
+	bio_list_init(&cell->bios);
+	hlist_add_head(&cell->list, prison->cells + hash);
+}
+
+static int __bio_detain(struct dm_bio_prison *prison,
+			struct dm_cell_key *key,
+			struct bio *inmate,
+			struct dm_bio_prison_cell *cell_prealloc,
+			struct dm_bio_prison_cell **cell_result)
+{
+	uint32_t hash = hash_key(prison, key);
+	struct dm_bio_prison_cell *cell;
+
+	cell = __search_bucket(prison->cells + hash, key);
+	if (cell) {
+		if (inmate)
+			bio_list_add(&cell->bios, inmate);
+		*cell_result = cell;
+		return 1;
+	}
+
+	__setup_new_cell(prison, key, inmate, hash, cell_prealloc);
+	*cell_result = cell_prealloc;
+	return 0;
+}
+
+static int bio_detain(struct dm_bio_prison *prison,
+		      struct dm_cell_key *key,
+		      struct bio *inmate,
+		      struct dm_bio_prison_cell *cell_prealloc,
+		      struct dm_bio_prison_cell **cell_result)
+{
+	int r;
+	unsigned long flags;
+
+	spin_lock_irqsave(&prison->lock, flags);
+	r = __bio_detain(prison, key, inmate, cell_prealloc, cell_result);
+	spin_unlock_irqrestore(&prison->lock, flags);
+
+	return r;
+}
+
+int dm_bio_detain(struct dm_bio_prison *prison,
+		  struct dm_cell_key *key,
+		  struct bio *inmate,
+		  struct dm_bio_prison_cell *cell_prealloc,
+		  struct dm_bio_prison_cell **cell_result)
+{
+	return bio_detain(prison, key, inmate, cell_prealloc, cell_result);
+}
 EXPORT_SYMBOL_GPL(dm_bio_detain);
+
+int dm_get_cell(struct dm_bio_prison *prison,
+		struct dm_cell_key *key,
+		struct dm_bio_prison_cell *cell_prealloc,
+		struct dm_bio_prison_cell **cell_result)
+{
+	return bio_detain(prison, key, NULL, cell_prealloc, cell_result);
+}
+EXPORT_SYMBOL_GPL(dm_get_cell);

 /*
  * @inmates must have been initialised prior to this call
  */
-static void __cell_release(struct dm_bio_prison_cell *cell, struct bio_list *inmates)
+static void __cell_release(struct dm_bio_prison_cell *cell,
+			   struct bio_list *inmates)
 {
-	struct dm_bio_prison *prison = cell->prison;
-
 	hlist_del(&cell->list);

 	if (inmates) {
-		bio_list_add(inmates, cell->holder);
+		if (cell->holder)
+			bio_list_add(inmates, cell->holder);
 		bio_list_merge(inmates, &cell->bios);
 	}
-
-	mempool_free(cell, prison->cell_pool);
 }

-void dm_cell_release(struct dm_bio_prison_cell *cell, struct bio_list *bios)
+void dm_cell_release(struct dm_bio_prison *prison,
+		     struct dm_bio_prison_cell *cell,
+		     struct bio_list *bios)
 {
 	unsigned long flags;
-	struct dm_bio_prison *prison = cell->prison;

 	spin_lock_irqsave(&prison->lock, flags);
 	__cell_release(cell, bios);
···
 /*
  * Sometimes we don't want the holder, just the additional bios.
  */
-static void __cell_release_no_holder(struct dm_bio_prison_cell *cell, struct bio_list *inmates)
+static void __cell_release_no_holder(struct dm_bio_prison_cell *cell,
+				     struct bio_list *inmates)
 {
-	struct dm_bio_prison *prison = cell->prison;
-
 	hlist_del(&cell->list);
 	bio_list_merge(inmates, &cell->bios);
-
-	mempool_free(cell, prison->cell_pool);
 }

-void dm_cell_release_no_holder(struct dm_bio_prison_cell *cell, struct bio_list *inmates)
+void dm_cell_release_no_holder(struct dm_bio_prison *prison,
+			       struct dm_bio_prison_cell *cell,
+			       struct bio_list *inmates)
 {
 	unsigned long flags;
-	struct dm_bio_prison *prison = cell->prison;

 	spin_lock_irqsave(&prison->lock, flags);
 	__cell_release_no_holder(cell, inmates);
···
 }
 EXPORT_SYMBOL_GPL(dm_cell_release_no_holder);

-void dm_cell_error(struct dm_bio_prison_cell *cell)
+void dm_cell_error(struct dm_bio_prison *prison,
+		   struct dm_bio_prison_cell *cell)
 {
-	struct dm_bio_prison *prison = cell->prison;
 	struct bio_list bios;
 	struct bio *bio;
 	unsigned long flags;
+48 -8
drivers/md/dm-bio-prison.h
···
  * subsequently unlocked the bios become available.
  */
 struct dm_bio_prison;
-struct dm_bio_prison_cell;

 /* FIXME: this needs to be more abstract */
 struct dm_cell_key {
···
 	dm_block_t block;
 };

+/*
+ * Treat this as opaque, only in header so callers can manage allocation
+ * themselves.
+ */
+struct dm_bio_prison_cell {
+	struct hlist_node list;
+	struct dm_cell_key key;
+	struct bio *holder;
+	struct bio_list bios;
+};
+
 struct dm_bio_prison *dm_bio_prison_create(unsigned nr_cells);
 void dm_bio_prison_destroy(struct dm_bio_prison *prison);

 /*
- * This may block if a new cell needs allocating.  You must ensure that
- * cells will be unlocked even if the calling thread is blocked.
+ * These two functions just wrap a mempool.  This is a transitory step:
+ * Eventually all bio prison clients should manage their own cell memory.
+ *
+ * Like mempool_alloc(), dm_bio_prison_alloc_cell() can only fail if called
+ * in interrupt context or passed GFP_NOWAIT.
+ */
+struct dm_bio_prison_cell *dm_bio_prison_alloc_cell(struct dm_bio_prison *prison,
+						    gfp_t gfp);
+void dm_bio_prison_free_cell(struct dm_bio_prison *prison,
+			     struct dm_bio_prison_cell *cell);
+
+/*
+ * Creates, or retrieves a cell for the given key.
+ *
+ * Returns 1 if pre-existing cell returned, zero if new cell created using
+ * @cell_prealloc.
+ */
+int dm_get_cell(struct dm_bio_prison *prison,
+		struct dm_cell_key *key,
+		struct dm_bio_prison_cell *cell_prealloc,
+		struct dm_bio_prison_cell **cell_result);
+
+/*
+ * An atomic op that combines retrieving a cell, and adding a bio to it.
  *
  * Returns 1 if the cell was already held, 0 if @inmate is the new holder.
  */
-int dm_bio_detain(struct dm_bio_prison *prison, struct dm_cell_key *key,
-		  struct bio *inmate, struct dm_bio_prison_cell **ref);
+int dm_bio_detain(struct dm_bio_prison *prison,
+		  struct dm_cell_key *key,
+		  struct bio *inmate,
+		  struct dm_bio_prison_cell *cell_prealloc,
+		  struct dm_bio_prison_cell **cell_result);

-void dm_cell_release(struct dm_bio_prison_cell *cell, struct bio_list *bios);
-void dm_cell_release_no_holder(struct dm_bio_prison_cell *cell, struct bio_list *inmates);
-void dm_cell_error(struct dm_bio_prison_cell *cell);
+void dm_cell_release(struct dm_bio_prison *prison,
+		     struct dm_bio_prison_cell *cell,
+		     struct bio_list *bios);
+void dm_cell_release_no_holder(struct dm_bio_prison *prison,
+			       struct dm_bio_prison_cell *cell,
+			       struct bio_list *inmates);
+void dm_cell_error(struct dm_bio_prison *prison,
+		   struct dm_bio_prison_cell *cell);

 /*----------------------------------------------------------------*/
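The revised calling convention in this header moves cell allocation to the caller: a cell is preallocated outside the lock, and detain either consumes the preallocation (new cell, returns 0) or queues the bio on the existing cell (returns 1). A simplified userspace model of that convention, with no locking and a plain dict in place of the kernel's hash buckets (all names and structure here are illustrative, not the kernel code):

```python
class Cell:
    """Models dm_bio_prison_cell: a key, a holder bio, and queued bios."""
    def __init__(self):
        self.key = None
        self.holder = None
        self.bios = []

class BioPrison:
    def __init__(self):
        self.cells = {}                    # key -> Cell (stand-in for buckets)

    def alloc_cell(self):
        return Cell()                      # models dm_bio_prison_alloc_cell()

    def detain(self, key, inmate, prealloc):
        """Returns (held, cell): held is 1 if a cell for key already existed
        (the preallocation is then unused and stays owned by the caller)."""
        cell = self.cells.get(key)
        if cell is not None:
            if inmate is not None:         # dm_get_cell() passes no inmate
                cell.bios.append(inmate)   # queue behind the current holder
            return 1, cell
        prealloc.key = key                 # models __setup_new_cell()
        prealloc.holder = inmate
        self.cells[key] = prealloc
        return 0, prealloc

prison = BioPrison()
held, cell = prison.detain(("dev", 42), "bio-A", prison.alloc_cell())
print(held)    # 0: bio-A is the new holder
held, same = prison.detain(("dev", 42), "bio-B", prison.alloc_cell())
print(held)    # 1: bio-B queued behind bio-A
```

The design point this models: by preallocating outside the critical section, the kernel code no longer has to drop and retake the spinlock around `mempool_alloc()` and re-check the bucket, which is exactly the dance the old `dm_bio_detain()` removed in this patch performed.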
+1 -1
drivers/md/dm-bufio.c
···
 int dm_bufio_issue_flush(struct dm_bufio_client *c)
 {
 	struct dm_io_request io_req = {
-		.bi_rw = REQ_FLUSH,
+		.bi_rw = WRITE_FLUSH,
 		.mem.type = DM_IO_KMEM,
 		.mem.ptr.addr = NULL,
 		.client = c->dm_io,
+54
drivers/md/dm-cache-block-types.h
/*
 * Copyright (C) 2012 Red Hat, Inc.
 *
 * This file is released under the GPL.
 */

#ifndef DM_CACHE_BLOCK_TYPES_H
#define DM_CACHE_BLOCK_TYPES_H

#include "persistent-data/dm-block-manager.h"

/*----------------------------------------------------------------*/

/*
 * It's helpful to get sparse to differentiate between indexes into the
 * origin device, indexes into the cache device, and indexes into the
 * discard bitset.
 */

typedef dm_block_t __bitwise__ dm_oblock_t;
typedef uint32_t __bitwise__ dm_cblock_t;
typedef dm_block_t __bitwise__ dm_dblock_t;

static inline dm_oblock_t to_oblock(dm_block_t b)
{
	return (__force dm_oblock_t) b;
}

static inline dm_block_t from_oblock(dm_oblock_t b)
{
	return (__force dm_block_t) b;
}

static inline dm_cblock_t to_cblock(uint32_t b)
{
	return (__force dm_cblock_t) b;
}

static inline uint32_t from_cblock(dm_cblock_t b)
{
	return (__force uint32_t) b;
}

static inline dm_dblock_t to_dblock(dm_block_t b)
{
	return (__force dm_dblock_t) b;
}

static inline dm_block_t from_dblock(dm_dblock_t b)
{
	return (__force dm_block_t) b;
}

#endif /* DM_CACHE_BLOCK_TYPES_H */
+1146
drivers/md/dm-cache-metadata.c
/*
 * Copyright (C) 2012 Red Hat, Inc.
 *
 * This file is released under the GPL.
 */

#include "dm-cache-metadata.h"

#include "persistent-data/dm-array.h"
#include "persistent-data/dm-bitset.h"
#include "persistent-data/dm-space-map.h"
#include "persistent-data/dm-space-map-disk.h"
#include "persistent-data/dm-transaction-manager.h"

#include <linux/device-mapper.h>

/*----------------------------------------------------------------*/

#define DM_MSG_PREFIX   "cache metadata"

#define CACHE_SUPERBLOCK_MAGIC 06142003
#define CACHE_SUPERBLOCK_LOCATION 0
#define CACHE_VERSION 1
#define CACHE_METADATA_CACHE_SIZE 64

/*
 *  3 for btree insert +
 *  2 for btree lookup used within space map
 */
#define CACHE_MAX_CONCURRENT_LOCKS 5
#define SPACE_MAP_ROOT_SIZE 128

enum superblock_flag_bits {
	/* for spotting crashes that would invalidate the dirty bitset */
	CLEAN_SHUTDOWN,
};

/*
 * Each mapping from cache block -> origin block carries a set of flags.
 */
enum mapping_bits {
	/*
	 * A valid mapping.  Because we're using an array we clear this
	 * flag for a non existant mapping.
	 */
	M_VALID = 1,

	/*
	 * The data on the cache is different from that on the origin.
	 */
	M_DIRTY = 2
};

struct cache_disk_superblock {
	__le32 csum;
	__le32 flags;
	__le64 blocknr;

	__u8 uuid[16];
	__le64 magic;
	__le32 version;

	__u8 policy_name[CACHE_POLICY_NAME_SIZE];
	__le32 policy_hint_size;

	__u8 metadata_space_map_root[SPACE_MAP_ROOT_SIZE];
	__le64 mapping_root;
	__le64 hint_root;

	__le64 discard_root;
	__le64 discard_block_size;
	__le64 discard_nr_blocks;

	__le32 data_block_size;
	__le32 metadata_block_size;
	__le32 cache_blocks;

	__le32 compat_flags;
	__le32 compat_ro_flags;
	__le32 incompat_flags;

	__le32 read_hits;
	__le32 read_misses;
	__le32 write_hits;
	__le32 write_misses;
} __packed;

struct dm_cache_metadata {
	struct block_device *bdev;
	struct dm_block_manager *bm;
	struct dm_space_map *metadata_sm;
	struct dm_transaction_manager *tm;

	struct dm_array_info info;
	struct dm_array_info hint_info;
	struct dm_disk_bitset discard_info;

	struct rw_semaphore root_lock;
	dm_block_t root;
	dm_block_t hint_root;
	dm_block_t discard_root;

	sector_t discard_block_size;
	dm_dblock_t discard_nr_blocks;

	sector_t data_block_size;
	dm_cblock_t cache_blocks;
	bool changed:1;
	bool clean_when_opened:1;

	char policy_name[CACHE_POLICY_NAME_SIZE];
	size_t policy_hint_size;
	struct dm_cache_statistics stats;
};

/*-------------------------------------------------------------------
 * superblock validator
 *-----------------------------------------------------------------*/

#define SUPERBLOCK_CSUM_XOR 9031977

static void sb_prepare_for_write(struct dm_block_validator *v,
				 struct dm_block *b,
				 size_t sb_block_size)
{
	struct cache_disk_superblock *disk_super = dm_block_data(b);

	disk_super->blocknr = cpu_to_le64(dm_block_location(b));
	disk_super->csum = cpu_to_le32(dm_bm_checksum(&disk_super->flags,
						      sb_block_size - sizeof(__le32),
						      SUPERBLOCK_CSUM_XOR));
}

static int sb_check(struct dm_block_validator *v,
		    struct dm_block *b,
		    size_t sb_block_size)
{
	struct cache_disk_superblock *disk_super = dm_block_data(b);
	__le32 csum_le;

	if (dm_block_location(b) != le64_to_cpu(disk_super->blocknr)) {
		DMERR("sb_check failed: blocknr %llu: wanted %llu",
		      le64_to_cpu(disk_super->blocknr),
		      (unsigned long long)dm_block_location(b));
		return -ENOTBLK;
	}

	if (le64_to_cpu(disk_super->magic) != CACHE_SUPERBLOCK_MAGIC) {
		DMERR("sb_check failed: magic %llu: wanted %llu",
		      le64_to_cpu(disk_super->magic),
		      (unsigned long long)CACHE_SUPERBLOCK_MAGIC);
		return -EILSEQ;
	}

	csum_le = cpu_to_le32(dm_bm_checksum(&disk_super->flags,
					     sb_block_size - sizeof(__le32),
					     SUPERBLOCK_CSUM_XOR));
	if (csum_le != disk_super->csum) {
		DMERR("sb_check failed: csum %u: wanted %u",
		      le32_to_cpu(csum_le), le32_to_cpu(disk_super->csum));
		return -EILSEQ;
	}

	return 0;
}

static struct dm_block_validator sb_validator = {
	.name = "superblock",
	.prepare_for_write = sb_prepare_for_write,
	.check = sb_check
};

/*----------------------------------------------------------------*/

static int superblock_read_lock(struct dm_cache_metadata *cmd,
				struct dm_block **sblock)
{
	return dm_bm_read_lock(cmd->bm, CACHE_SUPERBLOCK_LOCATION,
			       &sb_validator, sblock);
}

static int superblock_lock_zero(struct dm_cache_metadata *cmd,
				struct dm_block **sblock)
{
	return dm_bm_write_lock_zero(cmd->bm, CACHE_SUPERBLOCK_LOCATION,
				     &sb_validator, sblock);
}

static int
superblock_lock(struct dm_cache_metadata *cmd, 190 + struct dm_block **sblock) 191 + { 192 + return dm_bm_write_lock(cmd->bm, CACHE_SUPERBLOCK_LOCATION, 193 + &sb_validator, sblock); 194 + } 195 + 196 + /*----------------------------------------------------------------*/ 197 + 198 + static int __superblock_all_zeroes(struct dm_block_manager *bm, int *result) 199 + { 200 + int r; 201 + unsigned i; 202 + struct dm_block *b; 203 + __le64 *data_le, zero = cpu_to_le64(0); 204 + unsigned sb_block_size = dm_bm_block_size(bm) / sizeof(__le64); 205 + 206 + /* 207 + * We can't use a validator here - it may be all zeroes. 208 + */ 209 + r = dm_bm_read_lock(bm, CACHE_SUPERBLOCK_LOCATION, NULL, &b); 210 + if (r) 211 + return r; 212 + 213 + data_le = dm_block_data(b); 214 + *result = 1; 215 + for (i = 0; i < sb_block_size; i++) { 216 + if (data_le[i] != zero) { 217 + *result = 0; 218 + break; 219 + } 220 + } 221 + 222 + return dm_bm_unlock(b); 223 + } 224 + 225 + static void __setup_mapping_info(struct dm_cache_metadata *cmd) 226 + { 227 + struct dm_btree_value_type vt; 228 + 229 + vt.context = NULL; 230 + vt.size = sizeof(__le64); 231 + vt.inc = NULL; 232 + vt.dec = NULL; 233 + vt.equal = NULL; 234 + dm_array_info_init(&cmd->info, cmd->tm, &vt); 235 + 236 + if (cmd->policy_hint_size) { 237 + vt.size = sizeof(__le32); 238 + dm_array_info_init(&cmd->hint_info, cmd->tm, &vt); 239 + } 240 + } 241 + 242 + static int __write_initial_superblock(struct dm_cache_metadata *cmd) 243 + { 244 + int r; 245 + struct dm_block *sblock; 246 + size_t metadata_len; 247 + struct cache_disk_superblock *disk_super; 248 + sector_t bdev_size = i_size_read(cmd->bdev->bd_inode) >> SECTOR_SHIFT; 249 + 250 + /* FIXME: see if we can lose the max sectors limit */ 251 + if (bdev_size > DM_CACHE_METADATA_MAX_SECTORS) 252 + bdev_size = DM_CACHE_METADATA_MAX_SECTORS; 253 + 254 + r = dm_sm_root_size(cmd->metadata_sm, &metadata_len); 255 + if (r < 0) 256 + return r; 257 + 258 + r = dm_tm_pre_commit(cmd->tm); 259 + 
if (r < 0) 260 + return r; 261 + 262 + r = superblock_lock_zero(cmd, &sblock); 263 + if (r) 264 + return r; 265 + 266 + disk_super = dm_block_data(sblock); 267 + disk_super->flags = 0; 268 + memset(disk_super->uuid, 0, sizeof(disk_super->uuid)); 269 + disk_super->magic = cpu_to_le64(CACHE_SUPERBLOCK_MAGIC); 270 + disk_super->version = cpu_to_le32(CACHE_VERSION); 271 + memset(disk_super->policy_name, 0, CACHE_POLICY_NAME_SIZE); 272 + disk_super->policy_hint_size = 0; 273 + 274 + r = dm_sm_copy_root(cmd->metadata_sm, &disk_super->metadata_space_map_root, 275 + metadata_len); 276 + if (r < 0) 277 + goto bad_locked; 278 + 279 + disk_super->mapping_root = cpu_to_le64(cmd->root); 280 + disk_super->hint_root = cpu_to_le64(cmd->hint_root); 281 + disk_super->discard_root = cpu_to_le64(cmd->discard_root); 282 + disk_super->discard_block_size = cpu_to_le64(cmd->discard_block_size); 283 + disk_super->discard_nr_blocks = cpu_to_le64(from_dblock(cmd->discard_nr_blocks)); 284 + disk_super->metadata_block_size = cpu_to_le32(DM_CACHE_METADATA_BLOCK_SIZE >> SECTOR_SHIFT); 285 + disk_super->data_block_size = cpu_to_le32(cmd->data_block_size); 286 + disk_super->cache_blocks = cpu_to_le32(0); 287 + memset(disk_super->policy_name, 0, sizeof(disk_super->policy_name)); 288 + 289 + disk_super->read_hits = cpu_to_le32(0); 290 + disk_super->read_misses = cpu_to_le32(0); 291 + disk_super->write_hits = cpu_to_le32(0); 292 + disk_super->write_misses = cpu_to_le32(0); 293 + 294 + return dm_tm_commit(cmd->tm, sblock); 295 + 296 + bad_locked: 297 + dm_bm_unlock(sblock); 298 + return r; 299 + } 300 + 301 + static int __format_metadata(struct dm_cache_metadata *cmd) 302 + { 303 + int r; 304 + 305 + r = dm_tm_create_with_sm(cmd->bm, CACHE_SUPERBLOCK_LOCATION, 306 + &cmd->tm, &cmd->metadata_sm); 307 + if (r < 0) { 308 + DMERR("tm_create_with_sm failed"); 309 + return r; 310 + } 311 + 312 + __setup_mapping_info(cmd); 313 + 314 + r = dm_array_empty(&cmd->info, &cmd->root); 315 + if (r < 0) 316 + goto 
bad; 317 + 318 + dm_disk_bitset_init(cmd->tm, &cmd->discard_info); 319 + 320 + r = dm_bitset_empty(&cmd->discard_info, &cmd->discard_root); 321 + if (r < 0) 322 + goto bad; 323 + 324 + cmd->discard_block_size = 0; 325 + cmd->discard_nr_blocks = 0; 326 + 327 + r = __write_initial_superblock(cmd); 328 + if (r) 329 + goto bad; 330 + 331 + cmd->clean_when_opened = true; 332 + return 0; 333 + 334 + bad: 335 + dm_tm_destroy(cmd->tm); 336 + dm_sm_destroy(cmd->metadata_sm); 337 + 338 + return r; 339 + } 340 + 341 + static int __check_incompat_features(struct cache_disk_superblock *disk_super, 342 + struct dm_cache_metadata *cmd) 343 + { 344 + uint32_t features; 345 + 346 + features = le32_to_cpu(disk_super->incompat_flags) & ~DM_CACHE_FEATURE_INCOMPAT_SUPP; 347 + if (features) { 348 + DMERR("could not access metadata due to unsupported optional features (%lx).", 349 + (unsigned long)features); 350 + return -EINVAL; 351 + } 352 + 353 + /* 354 + * Check for read-only metadata to skip the following RDWR checks. 
355 + */ 356 + if (get_disk_ro(cmd->bdev->bd_disk)) 357 + return 0; 358 + 359 + features = le32_to_cpu(disk_super->compat_ro_flags) & ~DM_CACHE_FEATURE_COMPAT_RO_SUPP; 360 + if (features) { 361 + DMERR("could not access metadata RDWR due to unsupported optional features (%lx).", 362 + (unsigned long)features); 363 + return -EINVAL; 364 + } 365 + 366 + return 0; 367 + } 368 + 369 + static int __open_metadata(struct dm_cache_metadata *cmd) 370 + { 371 + int r; 372 + struct dm_block *sblock; 373 + struct cache_disk_superblock *disk_super; 374 + unsigned long sb_flags; 375 + 376 + r = superblock_read_lock(cmd, &sblock); 377 + if (r < 0) { 378 + DMERR("couldn't read lock superblock"); 379 + return r; 380 + } 381 + 382 + disk_super = dm_block_data(sblock); 383 + 384 + r = __check_incompat_features(disk_super, cmd); 385 + if (r < 0) 386 + goto bad; 387 + 388 + r = dm_tm_open_with_sm(cmd->bm, CACHE_SUPERBLOCK_LOCATION, 389 + disk_super->metadata_space_map_root, 390 + sizeof(disk_super->metadata_space_map_root), 391 + &cmd->tm, &cmd->metadata_sm); 392 + if (r < 0) { 393 + DMERR("tm_open_with_sm failed"); 394 + goto bad; 395 + } 396 + 397 + __setup_mapping_info(cmd); 398 + dm_disk_bitset_init(cmd->tm, &cmd->discard_info); 399 + sb_flags = le32_to_cpu(disk_super->flags); 400 + cmd->clean_when_opened = test_bit(CLEAN_SHUTDOWN, &sb_flags); 401 + return dm_bm_unlock(sblock); 402 + 403 + bad: 404 + dm_bm_unlock(sblock); 405 + return r; 406 + } 407 + 408 + static int __open_or_format_metadata(struct dm_cache_metadata *cmd, 409 + bool format_device) 410 + { 411 + int r, unformatted; 412 + 413 + r = __superblock_all_zeroes(cmd->bm, &unformatted); 414 + if (r) 415 + return r; 416 + 417 + if (unformatted) 418 + return format_device ? 
__format_metadata(cmd) : -EPERM; 419 + 420 + return __open_metadata(cmd); 421 + } 422 + 423 + static int __create_persistent_data_objects(struct dm_cache_metadata *cmd, 424 + bool may_format_device) 425 + { 426 + int r; 427 + cmd->bm = dm_block_manager_create(cmd->bdev, DM_CACHE_METADATA_BLOCK_SIZE, 428 + CACHE_METADATA_CACHE_SIZE, 429 + CACHE_MAX_CONCURRENT_LOCKS); 430 + if (IS_ERR(cmd->bm)) { 431 + DMERR("could not create block manager"); 432 + return PTR_ERR(cmd->bm); 433 + } 434 + 435 + r = __open_or_format_metadata(cmd, may_format_device); 436 + if (r) 437 + dm_block_manager_destroy(cmd->bm); 438 + 439 + return r; 440 + } 441 + 442 + static void __destroy_persistent_data_objects(struct dm_cache_metadata *cmd) 443 + { 444 + dm_sm_destroy(cmd->metadata_sm); 445 + dm_tm_destroy(cmd->tm); 446 + dm_block_manager_destroy(cmd->bm); 447 + } 448 + 449 + typedef unsigned long (*flags_mutator)(unsigned long); 450 + 451 + static void update_flags(struct cache_disk_superblock *disk_super, 452 + flags_mutator mutator) 453 + { 454 + uint32_t sb_flags = mutator(le32_to_cpu(disk_super->flags)); 455 + disk_super->flags = cpu_to_le32(sb_flags); 456 + } 457 + 458 + static unsigned long set_clean_shutdown(unsigned long flags) 459 + { 460 + set_bit(CLEAN_SHUTDOWN, &flags); 461 + return flags; 462 + } 463 + 464 + static unsigned long clear_clean_shutdown(unsigned long flags) 465 + { 466 + clear_bit(CLEAN_SHUTDOWN, &flags); 467 + return flags; 468 + } 469 + 470 + static void read_superblock_fields(struct dm_cache_metadata *cmd, 471 + struct cache_disk_superblock *disk_super) 472 + { 473 + cmd->root = le64_to_cpu(disk_super->mapping_root); 474 + cmd->hint_root = le64_to_cpu(disk_super->hint_root); 475 + cmd->discard_root = le64_to_cpu(disk_super->discard_root); 476 + cmd->discard_block_size = le64_to_cpu(disk_super->discard_block_size); 477 + cmd->discard_nr_blocks = to_dblock(le64_to_cpu(disk_super->discard_nr_blocks)); 478 + cmd->data_block_size = 
le32_to_cpu(disk_super->data_block_size); 479 + cmd->cache_blocks = to_cblock(le32_to_cpu(disk_super->cache_blocks)); 480 + strncpy(cmd->policy_name, disk_super->policy_name, sizeof(cmd->policy_name)); 481 + cmd->policy_hint_size = le32_to_cpu(disk_super->policy_hint_size); 482 + 483 + cmd->stats.read_hits = le32_to_cpu(disk_super->read_hits); 484 + cmd->stats.read_misses = le32_to_cpu(disk_super->read_misses); 485 + cmd->stats.write_hits = le32_to_cpu(disk_super->write_hits); 486 + cmd->stats.write_misses = le32_to_cpu(disk_super->write_misses); 487 + 488 + cmd->changed = false; 489 + } 490 + 491 + /* 492 + * The mutator updates the superblock flags. 493 + */ 494 + static int __begin_transaction_flags(struct dm_cache_metadata *cmd, 495 + flags_mutator mutator) 496 + { 497 + int r; 498 + struct cache_disk_superblock *disk_super; 499 + struct dm_block *sblock; 500 + 501 + r = superblock_lock(cmd, &sblock); 502 + if (r) 503 + return r; 504 + 505 + disk_super = dm_block_data(sblock); 506 + update_flags(disk_super, mutator); 507 + read_superblock_fields(cmd, disk_super); 508 + 509 + return dm_bm_flush_and_unlock(cmd->bm, sblock); 510 + } 511 + 512 + static int __begin_transaction(struct dm_cache_metadata *cmd) 513 + { 514 + int r; 515 + struct cache_disk_superblock *disk_super; 516 + struct dm_block *sblock; 517 + 518 + /* 519 + * We re-read the superblock every time. Shouldn't need to do this 520 + * really. 521 + */ 522 + r = superblock_read_lock(cmd, &sblock); 523 + if (r) 524 + return r; 525 + 526 + disk_super = dm_block_data(sblock); 527 + read_superblock_fields(cmd, disk_super); 528 + dm_bm_unlock(sblock); 529 + 530 + return 0; 531 + } 532 + 533 + static int __commit_transaction(struct dm_cache_metadata *cmd, 534 + flags_mutator mutator) 535 + { 536 + int r; 537 + size_t metadata_len; 538 + struct cache_disk_superblock *disk_super; 539 + struct dm_block *sblock; 540 + 541 + /* 542 + * We need to know if the cache_disk_superblock exceeds a 512-byte sector. 
543 + */ 544 + BUILD_BUG_ON(sizeof(struct cache_disk_superblock) > 512); 545 + 546 + r = dm_bitset_flush(&cmd->discard_info, cmd->discard_root, 547 + &cmd->discard_root); 548 + if (r) 549 + return r; 550 + 551 + r = dm_tm_pre_commit(cmd->tm); 552 + if (r < 0) 553 + return r; 554 + 555 + r = dm_sm_root_size(cmd->metadata_sm, &metadata_len); 556 + if (r < 0) 557 + return r; 558 + 559 + r = superblock_lock(cmd, &sblock); 560 + if (r) 561 + return r; 562 + 563 + disk_super = dm_block_data(sblock); 564 + 565 + if (mutator) 566 + update_flags(disk_super, mutator); 567 + 568 + disk_super->mapping_root = cpu_to_le64(cmd->root); 569 + disk_super->hint_root = cpu_to_le64(cmd->hint_root); 570 + disk_super->discard_root = cpu_to_le64(cmd->discard_root); 571 + disk_super->discard_block_size = cpu_to_le64(cmd->discard_block_size); 572 + disk_super->discard_nr_blocks = cpu_to_le64(from_dblock(cmd->discard_nr_blocks)); 573 + disk_super->cache_blocks = cpu_to_le32(from_cblock(cmd->cache_blocks)); 574 + strncpy(disk_super->policy_name, cmd->policy_name, sizeof(disk_super->policy_name)); 575 + 576 + disk_super->read_hits = cpu_to_le32(cmd->stats.read_hits); 577 + disk_super->read_misses = cpu_to_le32(cmd->stats.read_misses); 578 + disk_super->write_hits = cpu_to_le32(cmd->stats.write_hits); 579 + disk_super->write_misses = cpu_to_le32(cmd->stats.write_misses); 580 + 581 + r = dm_sm_copy_root(cmd->metadata_sm, &disk_super->metadata_space_map_root, 582 + metadata_len); 583 + if (r < 0) { 584 + dm_bm_unlock(sblock); 585 + return r; 586 + } 587 + 588 + return dm_tm_commit(cmd->tm, sblock); 589 + } 590 + 591 + /*----------------------------------------------------------------*/ 592 + 593 + /* 594 + * The mappings are held in a dm-array that has 64-bit values stored in 595 + * little-endian format. The index is the cblock, the high 48 bits of the 596 + * value are the oblock and the low 16 bits are the flags. 
597 + */ 598 + #define FLAGS_MASK ((1 << 16) - 1) 599 + 600 + static __le64 pack_value(dm_oblock_t block, unsigned flags) 601 + { 602 + uint64_t value = from_oblock(block); 603 + value <<= 16; 604 + value = value | (flags & FLAGS_MASK); 605 + return cpu_to_le64(value); 606 + } 607 + 608 + static void unpack_value(__le64 value_le, dm_oblock_t *block, unsigned *flags) 609 + { 610 + uint64_t value = le64_to_cpu(value_le); 611 + uint64_t b = value >> 16; 612 + *block = to_oblock(b); 613 + *flags = value & FLAGS_MASK; 614 + } 615 + 616 + /*----------------------------------------------------------------*/ 617 + 618 + struct dm_cache_metadata *dm_cache_metadata_open(struct block_device *bdev, 619 + sector_t data_block_size, 620 + bool may_format_device, 621 + size_t policy_hint_size) 622 + { 623 + int r; 624 + struct dm_cache_metadata *cmd; 625 + 626 + cmd = kzalloc(sizeof(*cmd), GFP_KERNEL); 627 + if (!cmd) { 628 + DMERR("could not allocate metadata struct"); 629 + return NULL; 630 + } 631 + 632 + init_rwsem(&cmd->root_lock); 633 + cmd->bdev = bdev; 634 + cmd->data_block_size = data_block_size; 635 + cmd->cache_blocks = 0; 636 + cmd->policy_hint_size = policy_hint_size; 637 + cmd->changed = true; 638 + 639 + r = __create_persistent_data_objects(cmd, may_format_device); 640 + if (r) { 641 + kfree(cmd); 642 + return ERR_PTR(r); 643 + } 644 + 645 + r = __begin_transaction_flags(cmd, clear_clean_shutdown); 646 + if (r < 0) { 647 + dm_cache_metadata_close(cmd); 648 + return ERR_PTR(r); 649 + } 650 + 651 + return cmd; 652 + } 653 + 654 + void dm_cache_metadata_close(struct dm_cache_metadata *cmd) 655 + { 656 + __destroy_persistent_data_objects(cmd); 657 + kfree(cmd); 658 + } 659 + 660 + int dm_cache_resize(struct dm_cache_metadata *cmd, dm_cblock_t new_cache_size) 661 + { 662 + int r; 663 + __le64 null_mapping = pack_value(0, 0); 664 + 665 + down_write(&cmd->root_lock); 666 + __dm_bless_for_disk(&null_mapping); 667 + r = dm_array_resize(&cmd->info, cmd->root, 
from_cblock(cmd->cache_blocks), 668 + from_cblock(new_cache_size), 669 + &null_mapping, &cmd->root); 670 + if (!r) 671 + cmd->cache_blocks = new_cache_size; 672 + cmd->changed = true; 673 + up_write(&cmd->root_lock); 674 + 675 + return r; 676 + } 677 + 678 + int dm_cache_discard_bitset_resize(struct dm_cache_metadata *cmd, 679 + sector_t discard_block_size, 680 + dm_dblock_t new_nr_entries) 681 + { 682 + int r; 683 + 684 + down_write(&cmd->root_lock); 685 + r = dm_bitset_resize(&cmd->discard_info, 686 + cmd->discard_root, 687 + from_dblock(cmd->discard_nr_blocks), 688 + from_dblock(new_nr_entries), 689 + false, &cmd->discard_root); 690 + if (!r) { 691 + cmd->discard_block_size = discard_block_size; 692 + cmd->discard_nr_blocks = new_nr_entries; 693 + } 694 + 695 + cmd->changed = true; 696 + up_write(&cmd->root_lock); 697 + 698 + return r; 699 + } 700 + 701 + static int __set_discard(struct dm_cache_metadata *cmd, dm_dblock_t b) 702 + { 703 + return dm_bitset_set_bit(&cmd->discard_info, cmd->discard_root, 704 + from_dblock(b), &cmd->discard_root); 705 + } 706 + 707 + static int __clear_discard(struct dm_cache_metadata *cmd, dm_dblock_t b) 708 + { 709 + return dm_bitset_clear_bit(&cmd->discard_info, cmd->discard_root, 710 + from_dblock(b), &cmd->discard_root); 711 + } 712 + 713 + static int __is_discarded(struct dm_cache_metadata *cmd, dm_dblock_t b, 714 + bool *is_discarded) 715 + { 716 + return dm_bitset_test_bit(&cmd->discard_info, cmd->discard_root, 717 + from_dblock(b), &cmd->discard_root, 718 + is_discarded); 719 + } 720 + 721 + static int __discard(struct dm_cache_metadata *cmd, 722 + dm_dblock_t dblock, bool discard) 723 + { 724 + int r; 725 + 726 + r = (discard ? 
__set_discard : __clear_discard)(cmd, dblock); 727 + if (r) 728 + return r; 729 + 730 + cmd->changed = true; 731 + return 0; 732 + } 733 + 734 + int dm_cache_set_discard(struct dm_cache_metadata *cmd, 735 + dm_dblock_t dblock, bool discard) 736 + { 737 + int r; 738 + 739 + down_write(&cmd->root_lock); 740 + r = __discard(cmd, dblock, discard); 741 + up_write(&cmd->root_lock); 742 + 743 + return r; 744 + } 745 + 746 + static int __load_discards(struct dm_cache_metadata *cmd, 747 + load_discard_fn fn, void *context) 748 + { 749 + int r = 0; 750 + dm_block_t b; 751 + bool discard; 752 + 753 + for (b = 0; b < from_dblock(cmd->discard_nr_blocks); b++) { 754 + dm_dblock_t dblock = to_dblock(b); 755 + 756 + if (cmd->clean_when_opened) { 757 + r = __is_discarded(cmd, dblock, &discard); 758 + if (r) 759 + return r; 760 + } else 761 + discard = false; 762 + 763 + r = fn(context, cmd->discard_block_size, dblock, discard); 764 + if (r) 765 + break; 766 + } 767 + 768 + return r; 769 + } 770 + 771 + int dm_cache_load_discards(struct dm_cache_metadata *cmd, 772 + load_discard_fn fn, void *context) 773 + { 774 + int r; 775 + 776 + down_read(&cmd->root_lock); 777 + r = __load_discards(cmd, fn, context); 778 + up_read(&cmd->root_lock); 779 + 780 + return r; 781 + } 782 + 783 + dm_cblock_t dm_cache_size(struct dm_cache_metadata *cmd) 784 + { 785 + dm_cblock_t r; 786 + 787 + down_read(&cmd->root_lock); 788 + r = cmd->cache_blocks; 789 + up_read(&cmd->root_lock); 790 + 791 + return r; 792 + } 793 + 794 + static int __remove(struct dm_cache_metadata *cmd, dm_cblock_t cblock) 795 + { 796 + int r; 797 + __le64 value = pack_value(0, 0); 798 + 799 + __dm_bless_for_disk(&value); 800 + r = dm_array_set_value(&cmd->info, cmd->root, from_cblock(cblock), 801 + &value, &cmd->root); 802 + if (r) 803 + return r; 804 + 805 + cmd->changed = true; 806 + return 0; 807 + } 808 + 809 + int dm_cache_remove_mapping(struct dm_cache_metadata *cmd, dm_cblock_t cblock) 810 + { 811 + int r; 812 + 813 + 
down_write(&cmd->root_lock); 814 + r = __remove(cmd, cblock); 815 + up_write(&cmd->root_lock); 816 + 817 + return r; 818 + } 819 + 820 + static int __insert(struct dm_cache_metadata *cmd, 821 + dm_cblock_t cblock, dm_oblock_t oblock) 822 + { 823 + int r; 824 + __le64 value = pack_value(oblock, M_VALID); 825 + __dm_bless_for_disk(&value); 826 + 827 + r = dm_array_set_value(&cmd->info, cmd->root, from_cblock(cblock), 828 + &value, &cmd->root); 829 + if (r) 830 + return r; 831 + 832 + cmd->changed = true; 833 + return 0; 834 + } 835 + 836 + int dm_cache_insert_mapping(struct dm_cache_metadata *cmd, 837 + dm_cblock_t cblock, dm_oblock_t oblock) 838 + { 839 + int r; 840 + 841 + down_write(&cmd->root_lock); 842 + r = __insert(cmd, cblock, oblock); 843 + up_write(&cmd->root_lock); 844 + 845 + return r; 846 + } 847 + 848 + struct thunk { 849 + load_mapping_fn fn; 850 + void *context; 851 + 852 + struct dm_cache_metadata *cmd; 853 + bool respect_dirty_flags; 854 + bool hints_valid; 855 + }; 856 + 857 + static bool hints_array_initialized(struct dm_cache_metadata *cmd) 858 + { 859 + return cmd->hint_root && cmd->policy_hint_size; 860 + } 861 + 862 + static bool hints_array_available(struct dm_cache_metadata *cmd, 863 + const char *policy_name) 864 + { 865 + bool policy_names_match = !strncmp(cmd->policy_name, policy_name, 866 + sizeof(cmd->policy_name)); 867 + 868 + return cmd->clean_when_opened && policy_names_match && 869 + hints_array_initialized(cmd); 870 + } 871 + 872 + static int __load_mapping(void *context, uint64_t cblock, void *leaf) 873 + { 874 + int r = 0; 875 + bool dirty; 876 + __le64 value; 877 + __le32 hint_value = 0; 878 + dm_oblock_t oblock; 879 + unsigned flags; 880 + struct thunk *thunk = context; 881 + struct dm_cache_metadata *cmd = thunk->cmd; 882 + 883 + memcpy(&value, leaf, sizeof(value)); 884 + unpack_value(value, &oblock, &flags); 885 + 886 + if (flags & M_VALID) { 887 + if (thunk->hints_valid) { 888 + r = dm_array_get_value(&cmd->hint_info, 
cmd->hint_root, 889 + cblock, &hint_value); 890 + if (r && r != -ENODATA) 891 + return r; 892 + } 893 + 894 + dirty = thunk->respect_dirty_flags ? (flags & M_DIRTY) : true; 895 + r = thunk->fn(thunk->context, oblock, to_cblock(cblock), 896 + dirty, le32_to_cpu(hint_value), thunk->hints_valid); 897 + } 898 + 899 + return r; 900 + } 901 + 902 + static int __load_mappings(struct dm_cache_metadata *cmd, const char *policy_name, 903 + load_mapping_fn fn, void *context) 904 + { 905 + struct thunk thunk; 906 + 907 + thunk.fn = fn; 908 + thunk.context = context; 909 + 910 + thunk.cmd = cmd; 911 + thunk.respect_dirty_flags = cmd->clean_when_opened; 912 + thunk.hints_valid = hints_array_available(cmd, policy_name); 913 + 914 + return dm_array_walk(&cmd->info, cmd->root, __load_mapping, &thunk); 915 + } 916 + 917 + int dm_cache_load_mappings(struct dm_cache_metadata *cmd, const char *policy_name, 918 + load_mapping_fn fn, void *context) 919 + { 920 + int r; 921 + 922 + down_read(&cmd->root_lock); 923 + r = __load_mappings(cmd, policy_name, fn, context); 924 + up_read(&cmd->root_lock); 925 + 926 + return r; 927 + } 928 + 929 + static int __dump_mapping(void *context, uint64_t cblock, void *leaf) 930 + { 931 + int r = 0; 932 + __le64 value; 933 + dm_oblock_t oblock; 934 + unsigned flags; 935 + 936 + memcpy(&value, leaf, sizeof(value)); 937 + unpack_value(value, &oblock, &flags); 938 + 939 + return r; 940 + } 941 + 942 + static int __dump_mappings(struct dm_cache_metadata *cmd) 943 + { 944 + return dm_array_walk(&cmd->info, cmd->root, __dump_mapping, NULL); 945 + } 946 + 947 + void dm_cache_dump(struct dm_cache_metadata *cmd) 948 + { 949 + down_read(&cmd->root_lock); 950 + __dump_mappings(cmd); 951 + up_read(&cmd->root_lock); 952 + } 953 + 954 + int dm_cache_changed_this_transaction(struct dm_cache_metadata *cmd) 955 + { 956 + int r; 957 + 958 + down_read(&cmd->root_lock); 959 + r = cmd->changed; 960 + up_read(&cmd->root_lock); 961 + 962 + return r; 963 + } 964 + 965 + static 
int __dirty(struct dm_cache_metadata *cmd, dm_cblock_t cblock, bool dirty) 966 + { 967 + int r; 968 + unsigned flags; 969 + dm_oblock_t oblock; 970 + __le64 value; 971 + 972 + r = dm_array_get_value(&cmd->info, cmd->root, from_cblock(cblock), &value); 973 + if (r) 974 + return r; 975 + 976 + unpack_value(value, &oblock, &flags); 977 + 978 + if (((flags & M_DIRTY) && dirty) || (!(flags & M_DIRTY) && !dirty)) 979 + /* nothing to be done */ 980 + return 0; 981 + 982 + value = pack_value(oblock, flags | (dirty ? M_DIRTY : 0)); 983 + __dm_bless_for_disk(&value); 984 + 985 + r = dm_array_set_value(&cmd->info, cmd->root, from_cblock(cblock), 986 + &value, &cmd->root); 987 + if (r) 988 + return r; 989 + 990 + cmd->changed = true; 991 + return 0; 992 + 993 + } 994 + 995 + int dm_cache_set_dirty(struct dm_cache_metadata *cmd, 996 + dm_cblock_t cblock, bool dirty) 997 + { 998 + int r; 999 + 1000 + down_write(&cmd->root_lock); 1001 + r = __dirty(cmd, cblock, dirty); 1002 + up_write(&cmd->root_lock); 1003 + 1004 + return r; 1005 + } 1006 + 1007 + void dm_cache_metadata_get_stats(struct dm_cache_metadata *cmd, 1008 + struct dm_cache_statistics *stats) 1009 + { 1010 + down_read(&cmd->root_lock); 1011 + memcpy(stats, &cmd->stats, sizeof(*stats)); 1012 + up_read(&cmd->root_lock); 1013 + } 1014 + 1015 + void dm_cache_metadata_set_stats(struct dm_cache_metadata *cmd, 1016 + struct dm_cache_statistics *stats) 1017 + { 1018 + down_write(&cmd->root_lock); 1019 + memcpy(&cmd->stats, stats, sizeof(*stats)); 1020 + up_write(&cmd->root_lock); 1021 + } 1022 + 1023 + int dm_cache_commit(struct dm_cache_metadata *cmd, bool clean_shutdown) 1024 + { 1025 + int r; 1026 + flags_mutator mutator = (clean_shutdown ? 
set_clean_shutdown : 1027 + clear_clean_shutdown); 1028 + 1029 + down_write(&cmd->root_lock); 1030 + r = __commit_transaction(cmd, mutator); 1031 + if (r) 1032 + goto out; 1033 + 1034 + r = __begin_transaction(cmd); 1035 + 1036 + out: 1037 + up_write(&cmd->root_lock); 1038 + return r; 1039 + } 1040 + 1041 + int dm_cache_get_free_metadata_block_count(struct dm_cache_metadata *cmd, 1042 + dm_block_t *result) 1043 + { 1044 + int r = -EINVAL; 1045 + 1046 + down_read(&cmd->root_lock); 1047 + r = dm_sm_get_nr_free(cmd->metadata_sm, result); 1048 + up_read(&cmd->root_lock); 1049 + 1050 + return r; 1051 + } 1052 + 1053 + int dm_cache_get_metadata_dev_size(struct dm_cache_metadata *cmd, 1054 + dm_block_t *result) 1055 + { 1056 + int r = -EINVAL; 1057 + 1058 + down_read(&cmd->root_lock); 1059 + r = dm_sm_get_nr_blocks(cmd->metadata_sm, result); 1060 + up_read(&cmd->root_lock); 1061 + 1062 + return r; 1063 + } 1064 + 1065 + /*----------------------------------------------------------------*/ 1066 + 1067 + static int begin_hints(struct dm_cache_metadata *cmd, struct dm_cache_policy *policy) 1068 + { 1069 + int r; 1070 + __le32 value; 1071 + size_t hint_size; 1072 + const char *policy_name = dm_cache_policy_get_name(policy); 1073 + 1074 + if (!policy_name[0] || 1075 + (strlen(policy_name) > sizeof(cmd->policy_name) - 1)) 1076 + return -EINVAL; 1077 + 1078 + if (strcmp(cmd->policy_name, policy_name)) { 1079 + strncpy(cmd->policy_name, policy_name, sizeof(cmd->policy_name)); 1080 + 1081 + hint_size = dm_cache_policy_get_hint_size(policy); 1082 + if (!hint_size) 1083 + return 0; /* short-circuit hints initialization */ 1084 + cmd->policy_hint_size = hint_size; 1085 + 1086 + if (cmd->hint_root) { 1087 + r = dm_array_del(&cmd->hint_info, cmd->hint_root); 1088 + if (r) 1089 + return r; 1090 + } 1091 + 1092 + r = dm_array_empty(&cmd->hint_info, &cmd->hint_root); 1093 + if (r) 1094 + return r; 1095 + 1096 + value = cpu_to_le32(0); 1097 + __dm_bless_for_disk(&value); 1098 + r = 
dm_array_resize(&cmd->hint_info, cmd->hint_root, 0, 1099 + from_cblock(cmd->cache_blocks), 1100 + &value, &cmd->hint_root); 1101 + if (r) 1102 + return r; 1103 + } 1104 + 1105 + return 0; 1106 + } 1107 + 1108 + int dm_cache_begin_hints(struct dm_cache_metadata *cmd, struct dm_cache_policy *policy) 1109 + { 1110 + int r; 1111 + 1112 + down_write(&cmd->root_lock); 1113 + r = begin_hints(cmd, policy); 1114 + up_write(&cmd->root_lock); 1115 + 1116 + return r; 1117 + } 1118 + 1119 + static int save_hint(struct dm_cache_metadata *cmd, dm_cblock_t cblock, 1120 + uint32_t hint) 1121 + { 1122 + int r; 1123 + __le32 value = cpu_to_le32(hint); 1124 + __dm_bless_for_disk(&value); 1125 + 1126 + r = dm_array_set_value(&cmd->hint_info, cmd->hint_root, 1127 + from_cblock(cblock), &value, &cmd->hint_root); 1128 + cmd->changed = true; 1129 + 1130 + return r; 1131 + } 1132 + 1133 + int dm_cache_save_hint(struct dm_cache_metadata *cmd, dm_cblock_t cblock, 1134 + uint32_t hint) 1135 + { 1136 + int r; 1137 + 1138 + if (!hints_array_initialized(cmd)) 1139 + return 0; 1140 + 1141 + down_write(&cmd->root_lock); 1142 + r = save_hint(cmd, cblock, hint); 1143 + up_write(&cmd->root_lock); 1144 + 1145 + return r; 1146 + }
+142
drivers/md/dm-cache-metadata.h
/*
 * Copyright (C) 2012 Red Hat, Inc.
 *
 * This file is released under the GPL.
 */

#ifndef DM_CACHE_METADATA_H
#define DM_CACHE_METADATA_H

#include "dm-cache-block-types.h"
#include "dm-cache-policy-internal.h"

/*----------------------------------------------------------------*/

#define DM_CACHE_METADATA_BLOCK_SIZE 4096

/* FIXME: remove this restriction */
/*
 * The metadata device is currently limited in size.
 *
 * We have one block of index, which can hold 255 index entries.  Each
 * index entry contains allocation info about 16k metadata blocks.
 */
#define DM_CACHE_METADATA_MAX_SECTORS (255 * (1 << 14) * (DM_CACHE_METADATA_BLOCK_SIZE / (1 << SECTOR_SHIFT)))

/*
 * A metadata device larger than 16GB triggers a warning.
 */
#define DM_CACHE_METADATA_MAX_SECTORS_WARNING (16 * (1024 * 1024 * 1024 >> SECTOR_SHIFT))

/*----------------------------------------------------------------*/

/*
 * Ext[234]-style compat feature flags.
 *
 * A new feature which old metadata will still be compatible with should
 * define a DM_CACHE_FEATURE_COMPAT_* flag (rarely useful).
 *
 * A new feature that is not compatible with old code should define a
 * DM_CACHE_FEATURE_INCOMPAT_* flag and guard the relevant code with
 * that flag.
 *
 * A new feature that is not compatible with old code accessing the
 * metadata RDWR should define a DM_CACHE_FEATURE_RO_COMPAT_* flag and
 * guard the relevant code with that flag.
 *
 * As these various flags are defined they should be added to the
 * following masks.
 */
#define DM_CACHE_FEATURE_COMPAT_SUPP	  0UL
#define DM_CACHE_FEATURE_COMPAT_RO_SUPP	  0UL
#define DM_CACHE_FEATURE_INCOMPAT_SUPP	  0UL

/*
 * Reopens or creates a new, empty metadata volume.
 * Returns an ERR_PTR on failure.
 */
struct dm_cache_metadata *dm_cache_metadata_open(struct block_device *bdev,
						 sector_t data_block_size,
						 bool may_format_device,
						 size_t policy_hint_size);

void dm_cache_metadata_close(struct dm_cache_metadata *cmd);

/*
 * The metadata needs to know how many cache blocks there are.  We don't
 * care about the origin, assuming the core target is giving us valid
 * origin blocks to map to.
 */
int dm_cache_resize(struct dm_cache_metadata *cmd, dm_cblock_t new_cache_size);
dm_cblock_t dm_cache_size(struct dm_cache_metadata *cmd);

int dm_cache_discard_bitset_resize(struct dm_cache_metadata *cmd,
				   sector_t discard_block_size,
				   dm_dblock_t new_nr_entries);

typedef int (*load_discard_fn)(void *context, sector_t discard_block_size,
			       dm_dblock_t dblock, bool discarded);
int dm_cache_load_discards(struct dm_cache_metadata *cmd,
			   load_discard_fn fn, void *context);

int dm_cache_set_discard(struct dm_cache_metadata *cmd, dm_dblock_t dblock, bool discard);

int dm_cache_remove_mapping(struct dm_cache_metadata *cmd, dm_cblock_t cblock);
int dm_cache_insert_mapping(struct dm_cache_metadata *cmd, dm_cblock_t cblock, dm_oblock_t oblock);
int dm_cache_changed_this_transaction(struct dm_cache_metadata *cmd);

typedef int (*load_mapping_fn)(void *context, dm_oblock_t oblock,
			       dm_cblock_t cblock, bool dirty,
			       uint32_t hint, bool hint_valid);
int dm_cache_load_mappings(struct dm_cache_metadata *cmd,
			   const char *policy_name,
			   load_mapping_fn fn,
			   void *context);

int dm_cache_set_dirty(struct dm_cache_metadata *cmd, dm_cblock_t cblock, bool dirty);

struct dm_cache_statistics {
	uint32_t read_hits;
	uint32_t read_misses;
	uint32_t write_hits;
	uint32_t write_misses;
};

void dm_cache_metadata_get_stats(struct dm_cache_metadata *cmd,
				 struct dm_cache_statistics *stats);
void dm_cache_metadata_set_stats(struct dm_cache_metadata *cmd,
				 struct dm_cache_statistics *stats);

int dm_cache_commit(struct dm_cache_metadata *cmd, bool clean_shutdown);

int dm_cache_get_free_metadata_block_count(struct dm_cache_metadata *cmd,
					   dm_block_t *result);

int dm_cache_get_metadata_dev_size(struct dm_cache_metadata *cmd,
				   dm_block_t *result);

void dm_cache_dump(struct dm_cache_metadata *cmd);

/*
 * The policy is invited to save a 32bit hint value for every cblock (eg,
 * for a hit count).  These are stored against the policy name.  If
 * policies are changed, then hints will be lost.  If the machine crashes,
 * hints will be lost.
 *
 * The hints are indexed by the cblock, but many policies will not
 * necessarily have a fast way of accessing efficiently via cblock.  So
 * rather than querying the policy for each cblock, we let it walk its data
 * structures and fill in the hints in whatever order it wishes.
 */
int dm_cache_begin_hints(struct dm_cache_metadata *cmd, struct dm_cache_policy *p);

/*
 * Requests hints for every cblock and stores them in the metadata device.
 */
int dm_cache_save_hint(struct dm_cache_metadata *cmd,
		       dm_cblock_t cblock, uint32_t hint);

/*----------------------------------------------------------------*/

#endif /* DM_CACHE_METADATA_H */
+464
drivers/md/dm-cache-policy-cleaner.c
/*
 * Copyright (C) 2012 Red Hat. All rights reserved.
 *
 * writeback cache policy supporting flushing out dirty cache blocks.
 *
 * This file is released under the GPL.
 */

#include "dm-cache-policy.h"
#include "dm.h"

#include <linux/hash.h>
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>

/*----------------------------------------------------------------*/

#define DM_MSG_PREFIX "cache cleaner"
#define CLEANER_VERSION "1.0.0"

/* Cache entry struct. */
struct wb_cache_entry {
	struct list_head list;
	struct hlist_node hlist;

	dm_oblock_t oblock;
	dm_cblock_t cblock;
	bool dirty:1;
	bool pending:1;
};

struct hash {
	struct hlist_head *table;
	dm_block_t hash_bits;
	unsigned nr_buckets;
};

struct policy {
	struct dm_cache_policy policy;
	spinlock_t lock;

	struct list_head free;
	struct list_head clean;
	struct list_head clean_pending;
	struct list_head dirty;

	/*
	 * We know exactly how many cblocks will be needed,
	 * so we can allocate them up front.
	 */
	dm_cblock_t cache_size, nr_cblocks_allocated;
	struct wb_cache_entry *cblocks;
	struct hash chash;
};

/*----------------------------------------------------------------------------*/

/*
 * Low-level functions.
 */
static unsigned next_power(unsigned n, unsigned min)
{
	return roundup_pow_of_two(max(n, min));
}

static struct policy *to_policy(struct dm_cache_policy *p)
{
	return container_of(p, struct policy, policy);
}

static struct list_head *list_pop(struct list_head *q)
{
	struct list_head *r = q->next;

	list_del(r);

	return r;
}

/*----------------------------------------------------------------------------*/

/* Allocate/free various resources. */
static int alloc_hash(struct hash *hash, unsigned elts)
{
	hash->nr_buckets = next_power(elts >> 4, 16);
	hash->hash_bits = ffs(hash->nr_buckets) - 1;
	hash->table = vzalloc(sizeof(*hash->table) * hash->nr_buckets);

	return hash->table ? 0 : -ENOMEM;
}

static void free_hash(struct hash *hash)
{
	vfree(hash->table);
}

static int alloc_cache_blocks_with_hash(struct policy *p, dm_cblock_t cache_size)
{
	int r = -ENOMEM;

	p->cblocks = vzalloc(sizeof(*p->cblocks) * from_cblock(cache_size));
	if (p->cblocks) {
		unsigned u = from_cblock(cache_size);

		while (u--)
			list_add(&p->cblocks[u].list, &p->free);

		p->nr_cblocks_allocated = 0;

		/* Cache entries hash. */
		r = alloc_hash(&p->chash, from_cblock(cache_size));
		if (r)
			vfree(p->cblocks);
	}

	return r;
}

static void free_cache_blocks_and_hash(struct policy *p)
{
	free_hash(&p->chash);
	vfree(p->cblocks);
}

static struct wb_cache_entry *alloc_cache_entry(struct policy *p)
{
	struct wb_cache_entry *e;

	BUG_ON(from_cblock(p->nr_cblocks_allocated) >= from_cblock(p->cache_size));

	e = list_entry(list_pop(&p->free), struct wb_cache_entry, list);
	p->nr_cblocks_allocated = to_cblock(from_cblock(p->nr_cblocks_allocated) + 1);

	return e;
}

/*----------------------------------------------------------------------------*/

/* Hash functions (lookup, insert, remove). */
static struct wb_cache_entry *lookup_cache_entry(struct policy *p, dm_oblock_t oblock)
{
	struct hash *hash = &p->chash;
	unsigned h = hash_64(from_oblock(oblock), hash->hash_bits);
	struct wb_cache_entry *cur;
	struct hlist_head *bucket = &hash->table[h];

	hlist_for_each_entry(cur, bucket, hlist) {
		if (cur->oblock == oblock) {
			/* Move to the front of the bucket for faster access. */
			hlist_del(&cur->hlist);
			hlist_add_head(&cur->hlist, bucket);
			return cur;
		}
	}

	return NULL;
}

static void insert_cache_hash_entry(struct policy *p, struct wb_cache_entry *e)
{
	unsigned h = hash_64(from_oblock(e->oblock), p->chash.hash_bits);

	hlist_add_head(&e->hlist, &p->chash.table[h]);
}

static void remove_cache_hash_entry(struct wb_cache_entry *e)
{
	hlist_del(&e->hlist);
}

/* Public interface (see dm-cache-policy.h). */
static int wb_map(struct dm_cache_policy *pe, dm_oblock_t oblock,
		  bool can_block, bool can_migrate, bool discarded_oblock,
		  struct bio *bio, struct policy_result *result)
{
	struct policy *p = to_policy(pe);
	struct wb_cache_entry *e;
	unsigned long flags;

	result->op = POLICY_MISS;

	if (can_block)
		spin_lock_irqsave(&p->lock, flags);
	else if (!spin_trylock_irqsave(&p->lock, flags))
		return -EWOULDBLOCK;

	e = lookup_cache_entry(p, oblock);
	if (e) {
		result->op = POLICY_HIT;
		result->cblock = e->cblock;
	}

	spin_unlock_irqrestore(&p->lock, flags);

	return 0;
}

static int wb_lookup(struct dm_cache_policy *pe, dm_oblock_t oblock, dm_cblock_t *cblock)
{
	int r;
	struct policy *p = to_policy(pe);
	struct wb_cache_entry *e;
	unsigned long flags;

	if (!spin_trylock_irqsave(&p->lock, flags))
		return -EWOULDBLOCK;

	e = lookup_cache_entry(p, oblock);
	if (e) {
		*cblock = e->cblock;
		r = 0;
	} else
		r = -ENOENT;

	spin_unlock_irqrestore(&p->lock, flags);

	return r;
}

static void __set_clear_dirty(struct dm_cache_policy *pe, dm_oblock_t oblock, bool set)
{
	struct policy *p = to_policy(pe);
	struct wb_cache_entry *e;

	e = lookup_cache_entry(p, oblock);
	BUG_ON(!e);

	if (set) {
		if (!e->dirty) {
			e->dirty = true;
			list_move(&e->list, &p->dirty);
		}
	} else {
		if (e->dirty) {
			e->pending = false;
			e->dirty = false;
			list_move(&e->list, &p->clean);
		}
	}
}

static void wb_set_dirty(struct dm_cache_policy *pe, dm_oblock_t oblock)
{
	struct policy *p = to_policy(pe);
	unsigned long flags;

	spin_lock_irqsave(&p->lock, flags);
	__set_clear_dirty(pe, oblock, true);
	spin_unlock_irqrestore(&p->lock, flags);
}

static void wb_clear_dirty(struct dm_cache_policy *pe, dm_oblock_t oblock)
{
	struct policy *p = to_policy(pe);
	unsigned long flags;

	spin_lock_irqsave(&p->lock, flags);
	__set_clear_dirty(pe, oblock, false);
	spin_unlock_irqrestore(&p->lock, flags);
}

static void add_cache_entry(struct policy *p, struct wb_cache_entry *e)
{
	insert_cache_hash_entry(p, e);
	if (e->dirty)
		list_add(&e->list, &p->dirty);
	else
		list_add(&e->list, &p->clean);
}

static int wb_load_mapping(struct dm_cache_policy *pe,
			   dm_oblock_t oblock, dm_cblock_t cblock,
			   uint32_t hint, bool hint_valid)
{
	int r;
	struct policy *p = to_policy(pe);
	struct wb_cache_entry *e = alloc_cache_entry(p);

	if (e) {
		e->cblock = cblock;
		e->oblock = oblock;
		e->dirty = false; /* blocks default to clean */
		add_cache_entry(p, e);
		r = 0;
	} else
		r = -ENOMEM;

	return r;
}

static void wb_destroy(struct dm_cache_policy *pe)
{
	struct policy *p = to_policy(pe);

	free_cache_blocks_and_hash(p);
	kfree(p);
}

static struct wb_cache_entry *__wb_force_remove_mapping(struct policy *p, dm_oblock_t oblock)
{
	struct wb_cache_entry *r = lookup_cache_entry(p, oblock);

	BUG_ON(!r);

	remove_cache_hash_entry(r);
	list_del(&r->list);

	return r;
}

static void wb_remove_mapping(struct dm_cache_policy *pe, dm_oblock_t oblock)
{
	struct policy *p = to_policy(pe);
	struct wb_cache_entry *e;
	unsigned long flags;

	spin_lock_irqsave(&p->lock, flags);
	e = __wb_force_remove_mapping(p, oblock);
	list_add_tail(&e->list, &p->free);
	BUG_ON(!from_cblock(p->nr_cblocks_allocated));
	p->nr_cblocks_allocated = to_cblock(from_cblock(p->nr_cblocks_allocated) - 1);
	spin_unlock_irqrestore(&p->lock, flags);
}

static void wb_force_mapping(struct dm_cache_policy *pe,
			     dm_oblock_t current_oblock, dm_oblock_t oblock)
{
	struct policy *p = to_policy(pe);
	struct wb_cache_entry *e;
	unsigned long flags;

	spin_lock_irqsave(&p->lock, flags);
	e = __wb_force_remove_mapping(p, current_oblock);
	e->oblock = oblock;
	add_cache_entry(p, e);
	spin_unlock_irqrestore(&p->lock, flags);
}

static struct wb_cache_entry *get_next_dirty_entry(struct policy *p)
{
	struct list_head *l;
	struct wb_cache_entry *r;

	if (list_empty(&p->dirty))
		return NULL;

	l = list_pop(&p->dirty);
	r = container_of(l, struct wb_cache_entry, list);
	list_add(l, &p->clean_pending);

	return r;
}

static int wb_writeback_work(struct dm_cache_policy *pe,
			     dm_oblock_t *oblock,
			     dm_cblock_t *cblock)
{
	int r = -ENOENT;
	struct policy *p = to_policy(pe);
	struct wb_cache_entry *e;
	unsigned long flags;

	spin_lock_irqsave(&p->lock, flags);

	e = get_next_dirty_entry(p);
	if (e) {
		*oblock = e->oblock;
		*cblock = e->cblock;
		r = 0;
	}

	spin_unlock_irqrestore(&p->lock, flags);

	return r;
}

static dm_cblock_t wb_residency(struct dm_cache_policy *pe)
{
	return to_policy(pe)->nr_cblocks_allocated;
}

/* Init the policy plugin interface function pointers. */
static void init_policy_functions(struct policy *p)
{
	p->policy.destroy = wb_destroy;
	p->policy.map = wb_map;
	p->policy.lookup = wb_lookup;
	p->policy.set_dirty = wb_set_dirty;
	p->policy.clear_dirty = wb_clear_dirty;
	p->policy.load_mapping = wb_load_mapping;
	p->policy.walk_mappings = NULL;
	p->policy.remove_mapping = wb_remove_mapping;
	p->policy.writeback_work = wb_writeback_work;
	p->policy.force_mapping = wb_force_mapping;
	p->policy.residency = wb_residency;
	p->policy.tick = NULL;
}

static struct dm_cache_policy *wb_create(dm_cblock_t cache_size,
					 sector_t origin_size,
					 sector_t cache_block_size)
{
	int r;
	struct policy *p = kzalloc(sizeof(*p), GFP_KERNEL);

	if (!p)
		return NULL;

	init_policy_functions(p);
	INIT_LIST_HEAD(&p->free);
	INIT_LIST_HEAD(&p->clean);
	INIT_LIST_HEAD(&p->clean_pending);
	INIT_LIST_HEAD(&p->dirty);

	p->cache_size = cache_size;
	spin_lock_init(&p->lock);

	/* Allocate cache entry structs and add them to the free list. */
	r = alloc_cache_blocks_with_hash(p, cache_size);
	if (!r)
		return &p->policy;

	kfree(p);

	return NULL;
}

/*----------------------------------------------------------------------------*/

static struct dm_cache_policy_type wb_policy_type = {
	.name = "cleaner",
	.hint_size = 0,
	.owner = THIS_MODULE,
	.create = wb_create
};

static int __init wb_init(void)
{
	int r = dm_cache_policy_register(&wb_policy_type);

	if (r < 0)
		DMERR("register failed %d", r);
	else
		DMINFO("version " CLEANER_VERSION " loaded");

	return r;
}

static void __exit wb_exit(void)
{
	dm_cache_policy_unregister(&wb_policy_type);
}

module_init(wb_init);
module_exit(wb_exit);

MODULE_AUTHOR("Heinz Mauelshagen <dm-devel@redhat.com>");
MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("cleaner cache policy");
+124
drivers/md/dm-cache-policy-internal.h
/*
 * Copyright (C) 2012 Red Hat. All rights reserved.
 *
 * This file is released under the GPL.
 */

#ifndef DM_CACHE_POLICY_INTERNAL_H
#define DM_CACHE_POLICY_INTERNAL_H

#include "dm-cache-policy.h"

/*----------------------------------------------------------------*/

/*
 * Little inline functions that simplify calling the policy methods.
 */
static inline int policy_map(struct dm_cache_policy *p, dm_oblock_t oblock,
			     bool can_block, bool can_migrate, bool discarded_oblock,
			     struct bio *bio, struct policy_result *result)
{
	return p->map(p, oblock, can_block, can_migrate, discarded_oblock, bio, result);
}

static inline int policy_lookup(struct dm_cache_policy *p, dm_oblock_t oblock, dm_cblock_t *cblock)
{
	BUG_ON(!p->lookup);
	return p->lookup(p, oblock, cblock);
}

static inline void policy_set_dirty(struct dm_cache_policy *p, dm_oblock_t oblock)
{
	if (p->set_dirty)
		p->set_dirty(p, oblock);
}

static inline void policy_clear_dirty(struct dm_cache_policy *p, dm_oblock_t oblock)
{
	if (p->clear_dirty)
		p->clear_dirty(p, oblock);
}

static inline int policy_load_mapping(struct dm_cache_policy *p,
				      dm_oblock_t oblock, dm_cblock_t cblock,
				      uint32_t hint, bool hint_valid)
{
	return p->load_mapping(p, oblock, cblock, hint, hint_valid);
}

static inline int policy_walk_mappings(struct dm_cache_policy *p,
				       policy_walk_fn fn, void *context)
{
	return p->walk_mappings ? p->walk_mappings(p, fn, context) : 0;
}

static inline int policy_writeback_work(struct dm_cache_policy *p,
					dm_oblock_t *oblock,
					dm_cblock_t *cblock)
{
	return p->writeback_work ? p->writeback_work(p, oblock, cblock) : -ENOENT;
}

static inline void policy_remove_mapping(struct dm_cache_policy *p, dm_oblock_t oblock)
{
	return p->remove_mapping(p, oblock);
}

static inline void policy_force_mapping(struct dm_cache_policy *p,
					dm_oblock_t current_oblock, dm_oblock_t new_oblock)
{
	return p->force_mapping(p, current_oblock, new_oblock);
}

static inline dm_cblock_t policy_residency(struct dm_cache_policy *p)
{
	return p->residency(p);
}

static inline void policy_tick(struct dm_cache_policy *p)
{
	if (p->tick)
		return p->tick(p);
}

static inline int policy_emit_config_values(struct dm_cache_policy *p, char *result, unsigned maxlen)
{
	ssize_t sz = 0;

	if (p->emit_config_values)
		return p->emit_config_values(p, result, maxlen);

	DMEMIT("0");
	return 0;
}

static inline int policy_set_config_value(struct dm_cache_policy *p,
					  const char *key, const char *value)
{
	return p->set_config_value ? p->set_config_value(p, key, value) : -EINVAL;
}

/*----------------------------------------------------------------*/

/*
 * Creates a new cache policy given a policy name, a cache size, an origin
 * size and the block size.
 */
struct dm_cache_policy *dm_cache_policy_create(const char *name, dm_cblock_t cache_size,
					       sector_t origin_size, sector_t block_size);

/*
 * Destroys the policy.  This drops references to the policy module as well
 * as calling its destroy method.  So always use this rather than calling
 * the policy->destroy method directly.
 */
void dm_cache_policy_destroy(struct dm_cache_policy *p);

/*
 * In case we've forgotten.
 */
const char *dm_cache_policy_get_name(struct dm_cache_policy *p);

size_t dm_cache_policy_get_hint_size(struct dm_cache_policy *p);

/*----------------------------------------------------------------*/

#endif /* DM_CACHE_POLICY_INTERNAL_H */
+1195
drivers/md/dm-cache-policy-mq.c
··· 1 + /* 2 + * Copyright (C) 2012 Red Hat. All rights reserved. 3 + * 4 + * This file is released under the GPL. 5 + */ 6 + 7 + #include "dm-cache-policy.h" 8 + #include "dm.h" 9 + 10 + #include <linux/hash.h> 11 + #include <linux/module.h> 12 + #include <linux/mutex.h> 13 + #include <linux/slab.h> 14 + #include <linux/vmalloc.h> 15 + 16 + #define DM_MSG_PREFIX "cache-policy-mq" 17 + #define MQ_VERSION "1.0.0" 18 + 19 + static struct kmem_cache *mq_entry_cache; 20 + 21 + /*----------------------------------------------------------------*/ 22 + 23 + static unsigned next_power(unsigned n, unsigned min) 24 + { 25 + return roundup_pow_of_two(max(n, min)); 26 + } 27 + 28 + /*----------------------------------------------------------------*/ 29 + 30 + static unsigned long *alloc_bitset(unsigned nr_entries) 31 + { 32 + size_t s = sizeof(unsigned long) * dm_div_up(nr_entries, BITS_PER_LONG); 33 + return vzalloc(s); 34 + } 35 + 36 + static void free_bitset(unsigned long *bits) 37 + { 38 + vfree(bits); 39 + } 40 + 41 + /*----------------------------------------------------------------*/ 42 + 43 + /* 44 + * Large, sequential ios are probably better left on the origin device since 45 + * spindles tend to have good bandwidth. 46 + * 47 + * The io_tracker tries to spot when the io is in one of these sequential 48 + * modes. 49 + * 50 + * Two thresholds to switch between random and sequential io mode are defaulting 51 + * as follows and can be adjusted via the constructor and message interfaces. 
52 + */ 53 + #define RANDOM_THRESHOLD_DEFAULT 4 54 + #define SEQUENTIAL_THRESHOLD_DEFAULT 512 55 + 56 + enum io_pattern { 57 + PATTERN_SEQUENTIAL, 58 + PATTERN_RANDOM 59 + }; 60 + 61 + struct io_tracker { 62 + enum io_pattern pattern; 63 + 64 + unsigned nr_seq_samples; 65 + unsigned nr_rand_samples; 66 + unsigned thresholds[2]; 67 + 68 + dm_oblock_t last_end_oblock; 69 + }; 70 + 71 + static void iot_init(struct io_tracker *t, 72 + int sequential_threshold, int random_threshold) 73 + { 74 + t->pattern = PATTERN_RANDOM; 75 + t->nr_seq_samples = 0; 76 + t->nr_rand_samples = 0; 77 + t->last_end_oblock = 0; 78 + t->thresholds[PATTERN_RANDOM] = random_threshold; 79 + t->thresholds[PATTERN_SEQUENTIAL] = sequential_threshold; 80 + } 81 + 82 + static enum io_pattern iot_pattern(struct io_tracker *t) 83 + { 84 + return t->pattern; 85 + } 86 + 87 + static void iot_update_stats(struct io_tracker *t, struct bio *bio) 88 + { 89 + if (bio->bi_sector == from_oblock(t->last_end_oblock) + 1) 90 + t->nr_seq_samples++; 91 + else { 92 + /* 93 + * Just one non-sequential IO is enough to reset the 94 + * counters. 
95 + */ 96 + if (t->nr_seq_samples) { 97 + t->nr_seq_samples = 0; 98 + t->nr_rand_samples = 0; 99 + } 100 + 101 + t->nr_rand_samples++; 102 + } 103 + 104 + t->last_end_oblock = to_oblock(bio->bi_sector + bio_sectors(bio) - 1); 105 + } 106 + 107 + static void iot_check_for_pattern_switch(struct io_tracker *t) 108 + { 109 + switch (t->pattern) { 110 + case PATTERN_SEQUENTIAL: 111 + if (t->nr_rand_samples >= t->thresholds[PATTERN_RANDOM]) { 112 + t->pattern = PATTERN_RANDOM; 113 + t->nr_seq_samples = t->nr_rand_samples = 0; 114 + } 115 + break; 116 + 117 + case PATTERN_RANDOM: 118 + if (t->nr_seq_samples >= t->thresholds[PATTERN_SEQUENTIAL]) { 119 + t->pattern = PATTERN_SEQUENTIAL; 120 + t->nr_seq_samples = t->nr_rand_samples = 0; 121 + } 122 + break; 123 + } 124 + } 125 + 126 + static void iot_examine_bio(struct io_tracker *t, struct bio *bio) 127 + { 128 + iot_update_stats(t, bio); 129 + iot_check_for_pattern_switch(t); 130 + } 131 + 132 + /*----------------------------------------------------------------*/ 133 + 134 + 135 + /* 136 + * This queue is divided up into different levels. Allowing us to push 137 + * entries to the back of any of the levels. Think of it as a partially 138 + * sorted queue. 139 + */ 140 + #define NR_QUEUE_LEVELS 16u 141 + 142 + struct queue { 143 + struct list_head qs[NR_QUEUE_LEVELS]; 144 + }; 145 + 146 + static void queue_init(struct queue *q) 147 + { 148 + unsigned i; 149 + 150 + for (i = 0; i < NR_QUEUE_LEVELS; i++) 151 + INIT_LIST_HEAD(q->qs + i); 152 + } 153 + 154 + /* 155 + * Insert an entry to the back of the given level. 156 + */ 157 + static void queue_push(struct queue *q, unsigned level, struct list_head *elt) 158 + { 159 + list_add_tail(elt, q->qs + level); 160 + } 161 + 162 + static void queue_remove(struct list_head *elt) 163 + { 164 + list_del(elt); 165 + } 166 + 167 + /* 168 + * Shifts all regions down one level. This has no effect on the order of 169 + * the queue. 
170 + */ 171 + static void queue_shift_down(struct queue *q) 172 + { 173 + unsigned level; 174 + 175 + for (level = 1; level < NR_QUEUE_LEVELS; level++) 176 + list_splice_init(q->qs + level, q->qs + level - 1); 177 + } 178 + 179 + /* 180 + * Gives us the oldest entry of the lowest popoulated level. If the first 181 + * level is emptied then we shift down one level. 182 + */ 183 + static struct list_head *queue_pop(struct queue *q) 184 + { 185 + unsigned level; 186 + struct list_head *r; 187 + 188 + for (level = 0; level < NR_QUEUE_LEVELS; level++) 189 + if (!list_empty(q->qs + level)) { 190 + r = q->qs[level].next; 191 + list_del(r); 192 + 193 + /* have we just emptied the bottom level? */ 194 + if (level == 0 && list_empty(q->qs)) 195 + queue_shift_down(q); 196 + 197 + return r; 198 + } 199 + 200 + return NULL; 201 + } 202 + 203 + static struct list_head *list_pop(struct list_head *lh) 204 + { 205 + struct list_head *r = lh->next; 206 + 207 + BUG_ON(!r); 208 + list_del_init(r); 209 + 210 + return r; 211 + } 212 + 213 + /*----------------------------------------------------------------*/ 214 + 215 + /* 216 + * Describes a cache entry. Used in both the cache and the pre_cache. 217 + */ 218 + struct entry { 219 + struct hlist_node hlist; 220 + struct list_head list; 221 + dm_oblock_t oblock; 222 + dm_cblock_t cblock; /* valid iff in_cache */ 223 + 224 + /* 225 + * FIXME: pack these better 226 + */ 227 + bool in_cache:1; 228 + unsigned hit_count; 229 + unsigned generation; 230 + unsigned tick; 231 + }; 232 + 233 + struct mq_policy { 234 + struct dm_cache_policy policy; 235 + 236 + /* protects everything */ 237 + struct mutex lock; 238 + dm_cblock_t cache_size; 239 + struct io_tracker tracker; 240 + 241 + /* 242 + * We maintain two queues of entries. The cache proper contains 243 + * the currently active mappings. Whereas the pre_cache tracks 244 + * blocks that are being hit frequently and potential candidates 245 + * for promotion to the cache. 
246 + */ 247 + struct queue pre_cache; 248 + struct queue cache; 249 + 250 + /* 251 + * Keeps track of time, incremented by the core. We use this to 252 + * avoid attributing multiple hits within the same tick. 253 + * 254 + * Access to tick_protected should be done with the spin lock held. 255 + * It's copied to tick at the start of the map function (within the 256 + * mutex). 257 + */ 258 + spinlock_t tick_lock; 259 + unsigned tick_protected; 260 + unsigned tick; 261 + 262 + /* 263 + * A count of the number of times the map function has been called 264 + * and found an entry in the pre_cache or cache. Currently used to 265 + * calculate the generation. 266 + */ 267 + unsigned hit_count; 268 + 269 + /* 270 + * A generation is a longish period that is used to trigger some 271 + * book keeping effects. eg, decrementing hit counts on entries. 272 + * This is needed to allow the cache to evolve as io patterns 273 + * change. 274 + */ 275 + unsigned generation; 276 + unsigned generation_period; /* in lookups (will probably change) */ 277 + 278 + /* 279 + * Entries in the pre_cache whose hit count passes the promotion 280 + * threshold move to the cache proper. Working out the correct 281 + * value for the promotion_threshold is crucial to this policy. 282 + */ 283 + unsigned promote_threshold; 284 + 285 + /* 286 + * We need cache_size entries for the cache, and choose to have 287 + * cache_size entries for the pre_cache too. One motivation for 288 + * using the same size is to make the hit counts directly 289 + * comparable between pre_cache and cache. 290 + */ 291 + unsigned nr_entries; 292 + unsigned nr_entries_allocated; 293 + struct list_head free; 294 + 295 + /* 296 + * Cache blocks may be unallocated. We store this info in a 297 + * bitset. 
298 + */ 299 + unsigned long *allocation_bitset; 300 + unsigned nr_cblocks_allocated; 301 + unsigned find_free_nr_words; 302 + unsigned find_free_last_word; 303 + 304 + /* 305 + * The hash table allows us to quickly find an entry by origin 306 + * block. Both pre_cache and cache entries are in here. 307 + */ 308 + unsigned nr_buckets; 309 + dm_block_t hash_bits; 310 + struct hlist_head *table; 311 + }; 312 + 313 + /*----------------------------------------------------------------*/ 314 + /* Free/alloc mq cache entry structures. */ 315 + static void takeout_queue(struct list_head *lh, struct queue *q) 316 + { 317 + unsigned level; 318 + 319 + for (level = 0; level < NR_QUEUE_LEVELS; level++) 320 + list_splice(q->qs + level, lh); 321 + } 322 + 323 + static void free_entries(struct mq_policy *mq) 324 + { 325 + struct entry *e, *tmp; 326 + 327 + takeout_queue(&mq->free, &mq->pre_cache); 328 + takeout_queue(&mq->free, &mq->cache); 329 + 330 + list_for_each_entry_safe(e, tmp, &mq->free, list) 331 + kmem_cache_free(mq_entry_cache, e); 332 + } 333 + 334 + static int alloc_entries(struct mq_policy *mq, unsigned elts) 335 + { 336 + unsigned u = mq->nr_entries; 337 + 338 + INIT_LIST_HEAD(&mq->free); 339 + mq->nr_entries_allocated = 0; 340 + 341 + while (u--) { 342 + struct entry *e = kmem_cache_zalloc(mq_entry_cache, GFP_KERNEL); 343 + 344 + if (!e) { 345 + free_entries(mq); 346 + return -ENOMEM; 347 + } 348 + 349 + 350 + list_add(&e->list, &mq->free); 351 + } 352 + 353 + return 0; 354 + } 355 + 356 + /*----------------------------------------------------------------*/ 357 + 358 + /* 359 + * Simple hash table implementation. Should replace with the standard hash 360 + * table that's making its way upstream. 
361 + */ 362 + static void hash_insert(struct mq_policy *mq, struct entry *e) 363 + { 364 + unsigned h = hash_64(from_oblock(e->oblock), mq->hash_bits); 365 + 366 + hlist_add_head(&e->hlist, mq->table + h); 367 + } 368 + 369 + static struct entry *hash_lookup(struct mq_policy *mq, dm_oblock_t oblock) 370 + { 371 + unsigned h = hash_64(from_oblock(oblock), mq->hash_bits); 372 + struct hlist_head *bucket = mq->table + h; 373 + struct entry *e; 374 + 375 + hlist_for_each_entry(e, bucket, hlist) 376 + if (e->oblock == oblock) { 377 + hlist_del(&e->hlist); 378 + hlist_add_head(&e->hlist, bucket); 379 + return e; 380 + } 381 + 382 + return NULL; 383 + } 384 + 385 + static void hash_remove(struct entry *e) 386 + { 387 + hlist_del(&e->hlist); 388 + } 389 + 390 + /*----------------------------------------------------------------*/ 391 + 392 + /* 393 + * Allocates a new entry structure. The memory is allocated in one lump, 394 + * so we just handing it out here. Returns NULL if all entries have 395 + * already been allocated. Cannot fail otherwise. 396 + */ 397 + static struct entry *alloc_entry(struct mq_policy *mq) 398 + { 399 + struct entry *e; 400 + 401 + if (mq->nr_entries_allocated >= mq->nr_entries) { 402 + BUG_ON(!list_empty(&mq->free)); 403 + return NULL; 404 + } 405 + 406 + e = list_entry(list_pop(&mq->free), struct entry, list); 407 + INIT_LIST_HEAD(&e->list); 408 + INIT_HLIST_NODE(&e->hlist); 409 + 410 + mq->nr_entries_allocated++; 411 + return e; 412 + } 413 + 414 + /*----------------------------------------------------------------*/ 415 + 416 + /* 417 + * Mark cache blocks allocated or not in the bitset. 
418 + */ 419 + static void alloc_cblock(struct mq_policy *mq, dm_cblock_t cblock) 420 + { 421 + BUG_ON(from_cblock(cblock) > from_cblock(mq->cache_size)); 422 + BUG_ON(test_bit(from_cblock(cblock), mq->allocation_bitset)); 423 + 424 + set_bit(from_cblock(cblock), mq->allocation_bitset); 425 + mq->nr_cblocks_allocated++; 426 + } 427 + 428 + static void free_cblock(struct mq_policy *mq, dm_cblock_t cblock) 429 + { 430 + BUG_ON(from_cblock(cblock) > from_cblock(mq->cache_size)); 431 + BUG_ON(!test_bit(from_cblock(cblock), mq->allocation_bitset)); 432 + 433 + clear_bit(from_cblock(cblock), mq->allocation_bitset); 434 + mq->nr_cblocks_allocated--; 435 + } 436 + 437 + static bool any_free_cblocks(struct mq_policy *mq) 438 + { 439 + return mq->nr_cblocks_allocated < from_cblock(mq->cache_size); 440 + } 441 + 442 + /* 443 + * Fills result out with a cache block that isn't in use, or return 444 + * -ENOSPC. This does _not_ mark the cblock as allocated, the caller is 445 + * reponsible for that. 446 + */ 447 + static int __find_free_cblock(struct mq_policy *mq, unsigned begin, unsigned end, 448 + dm_cblock_t *result, unsigned *last_word) 449 + { 450 + int r = -ENOSPC; 451 + unsigned w; 452 + 453 + for (w = begin; w < end; w++) { 454 + /* 455 + * ffz is undefined if no zero exists 456 + */ 457 + if (mq->allocation_bitset[w] != ~0UL) { 458 + *last_word = w; 459 + *result = to_cblock((w * BITS_PER_LONG) + ffz(mq->allocation_bitset[w])); 460 + if (from_cblock(*result) < from_cblock(mq->cache_size)) 461 + r = 0; 462 + 463 + break; 464 + } 465 + } 466 + 467 + return r; 468 + } 469 + 470 + static int find_free_cblock(struct mq_policy *mq, dm_cblock_t *result) 471 + { 472 + int r; 473 + 474 + if (!any_free_cblocks(mq)) 475 + return -ENOSPC; 476 + 477 + r = __find_free_cblock(mq, mq->find_free_last_word, mq->find_free_nr_words, result, &mq->find_free_last_word); 478 + if (r == -ENOSPC && mq->find_free_last_word) 479 + r = __find_free_cblock(mq, 0, mq->find_free_last_word, result, 
&mq->find_free_last_word); 480 + 481 + return r; 482 + } 483 + 484 + /*----------------------------------------------------------------*/ 485 + 486 + /* 487 + * Now we get to the meat of the policy. This section deals with deciding 488 + * when to add entries to the pre_cache and cache, and when to move entries 489 + * between them. 490 + */ 491 + 492 + /* 493 + * The queue level is based on the log2 of the hit count. 494 + */ 495 + static unsigned queue_level(struct entry *e) 496 + { 497 + return min((unsigned) ilog2(e->hit_count), NR_QUEUE_LEVELS - 1u); 498 + } 499 + 500 + /* 501 + * Inserts the entry into the pre_cache or the cache. Ensures the cache 502 + * block is marked as allocated if necessary. Inserts into the hash table. Sets the 503 + * tick which records when the entry was last moved about. 504 + */ 505 + static void push(struct mq_policy *mq, struct entry *e) 506 + { 507 + e->tick = mq->tick; 508 + hash_insert(mq, e); 509 + 510 + if (e->in_cache) { 511 + alloc_cblock(mq, e->cblock); 512 + queue_push(&mq->cache, queue_level(e), &e->list); 513 + } else 514 + queue_push(&mq->pre_cache, queue_level(e), &e->list); 515 + } 516 + 517 + /* 518 + * Removes an entry from pre_cache or cache. Removes from the hash table. 519 + * Frees off the cache block if necessary. 520 + */ 521 + static void del(struct mq_policy *mq, struct entry *e) 522 + { 523 + queue_remove(&e->list); 524 + hash_remove(e); 525 + if (e->in_cache) 526 + free_cblock(mq, e->cblock); 527 + } 528 + 529 + /* 530 + * Like del, except it removes the first entry in the queue (ie. the least 531 + * recently used). 532 + */ 533 + static struct entry *pop(struct mq_policy *mq, struct queue *q) 534 + { 535 + struct entry *e = container_of(queue_pop(q), struct entry, list); 536 + 537 + if (e) { 538 + hash_remove(e); 539 + 540 + if (e->in_cache) 541 + free_cblock(mq, e->cblock); 542 + } 543 + 544 + return e; 545 + } 546 + 547 + /* 548 + * Has this entry already been updated? 
549 + */ 550 + static bool updated_this_tick(struct mq_policy *mq, struct entry *e) 551 + { 552 + return mq->tick == e->tick; 553 + } 554 + 555 + /* 556 + * The promotion threshold is adjusted every generation. As are the counts 557 + * of the entries. 558 + * 559 + * At the moment the threshold is taken by averaging the hit counts of some 560 + * of the entries in the cache (the first 20 entries of the first level). 561 + * 562 + * We can be much cleverer than this though. For example, each promotion 563 + * could bump up the threshold helping to prevent churn. Much more to do 564 + * here. 565 + */ 566 + 567 + #define MAX_TO_AVERAGE 20 568 + 569 + static void check_generation(struct mq_policy *mq) 570 + { 571 + unsigned total = 0, nr = 0, count = 0, level; 572 + struct list_head *head; 573 + struct entry *e; 574 + 575 + if ((mq->hit_count >= mq->generation_period) && 576 + (mq->nr_cblocks_allocated == from_cblock(mq->cache_size))) { 577 + 578 + mq->hit_count = 0; 579 + mq->generation++; 580 + 581 + for (level = 0; level < NR_QUEUE_LEVELS && count < MAX_TO_AVERAGE; level++) { 582 + head = mq->cache.qs + level; 583 + list_for_each_entry(e, head, list) { 584 + nr++; 585 + total += e->hit_count; 586 + 587 + if (++count >= MAX_TO_AVERAGE) 588 + break; 589 + } 590 + } 591 + 592 + mq->promote_threshold = nr ? total / nr : 1; 593 + if (mq->promote_threshold * nr < total) 594 + mq->promote_threshold++; 595 + } 596 + } 597 + 598 + /* 599 + * Whenever we use an entry we bump up its hit counter, and push it to the 600 + * back of its current level. 601 + */ 602 + static void requeue_and_update_tick(struct mq_policy *mq, struct entry *e) 603 + { 604 + if (updated_this_tick(mq, e)) 605 + return; 606 + 607 + e->hit_count++; 608 + mq->hit_count++; 609 + check_generation(mq); 610 + 611 + /* generation adjustment, to stop the counts increasing forever. */ 612 + /* FIXME: divide? 
*/ 613 + /* e->hit_count -= min(e->hit_count - 1, mq->generation - e->generation); */ 614 + e->generation = mq->generation; 615 + 616 + del(mq, e); 617 + push(mq, e); 618 + } 619 + 620 + /* 621 + * Demote the least recently used entry from the cache to the pre_cache. 622 + * Returns the new cache entry to use, and the old origin block it was 623 + * mapped to. 624 + * 625 + * We drop the hit count on the demoted entry back to 1 to stop it bouncing 626 + * straight back into the cache if it's subsequently hit. There are 627 + * various options here, and more experimentation would be good: 628 + * 629 + * - just forget about the demoted entry completely (ie. don't insert it 630 + into the pre_cache). 631 + * - divide the hit count rather than setting to some hard coded value. 632 + * - set the hit count to a hard coded value other than 1, eg, is it better 633 + * if it goes in at level 2? 634 + */ 635 + static dm_cblock_t demote_cblock(struct mq_policy *mq, dm_oblock_t *oblock) 636 + { 637 + dm_cblock_t result; 638 + struct entry *demoted = pop(mq, &mq->cache); 639 + 640 + BUG_ON(!demoted); 641 + result = demoted->cblock; 642 + *oblock = demoted->oblock; 643 + demoted->in_cache = false; 644 + demoted->hit_count = 1; 645 + push(mq, demoted); 646 + 647 + return result; 648 + } 649 + 650 + /* 651 + * We modify the basic promotion_threshold depending on the specific io. 652 + * 653 + * If the origin block has been discarded then there's no cost to copy it 654 + * to the cache. 655 + * 656 + * We bias towards reads, since they can be demoted at no cost if they 657 + * haven't been dirtied. 
658 + */ 659 + #define DISCARDED_PROMOTE_THRESHOLD 1 660 + #define READ_PROMOTE_THRESHOLD 4 661 + #define WRITE_PROMOTE_THRESHOLD 8 662 + 663 + static unsigned adjusted_promote_threshold(struct mq_policy *mq, 664 + bool discarded_oblock, int data_dir) 665 + { 666 + if (discarded_oblock && any_free_cblocks(mq) && data_dir == WRITE) 667 + /* 668 + * We don't need to do any copying at all, so give this a 669 + * very low threshold. In practice this only triggers 670 + * during initial population after a format. 671 + */ 672 + return DISCARDED_PROMOTE_THRESHOLD; 673 + 674 + return data_dir == READ ? 675 + (mq->promote_threshold + READ_PROMOTE_THRESHOLD) : 676 + (mq->promote_threshold + WRITE_PROMOTE_THRESHOLD); 677 + } 678 + 679 + static bool should_promote(struct mq_policy *mq, struct entry *e, 680 + bool discarded_oblock, int data_dir) 681 + { 682 + return e->hit_count >= 683 + adjusted_promote_threshold(mq, discarded_oblock, data_dir); 684 + } 685 + 686 + static int cache_entry_found(struct mq_policy *mq, 687 + struct entry *e, 688 + struct policy_result *result) 689 + { 690 + requeue_and_update_tick(mq, e); 691 + 692 + if (e->in_cache) { 693 + result->op = POLICY_HIT; 694 + result->cblock = e->cblock; 695 + } 696 + 697 + return 0; 698 + } 699 + 700 + /* 701 + * Moves an entry from the pre_cache to the cache. The main work is 702 + * finding which cache block to use. 
703 + */ 704 + static int pre_cache_to_cache(struct mq_policy *mq, struct entry *e, 705 + struct policy_result *result) 706 + { 707 + dm_cblock_t cblock; 708 + 709 + if (find_free_cblock(mq, &cblock) == -ENOSPC) { 710 + result->op = POLICY_REPLACE; 711 + cblock = demote_cblock(mq, &result->old_oblock); 712 + } else 713 + result->op = POLICY_NEW; 714 + 715 + result->cblock = e->cblock = cblock; 716 + 717 + del(mq, e); 718 + e->in_cache = true; 719 + push(mq, e); 720 + 721 + return 0; 722 + } 723 + 724 + static int pre_cache_entry_found(struct mq_policy *mq, struct entry *e, 725 + bool can_migrate, bool discarded_oblock, 726 + int data_dir, struct policy_result *result) 727 + { 728 + int r = 0; 729 + bool updated = updated_this_tick(mq, e); 730 + 731 + requeue_and_update_tick(mq, e); 732 + 733 + if ((!discarded_oblock && updated) || 734 + !should_promote(mq, e, discarded_oblock, data_dir)) 735 + result->op = POLICY_MISS; 736 + else if (!can_migrate) 737 + r = -EWOULDBLOCK; 738 + else 739 + r = pre_cache_to_cache(mq, e, result); 740 + 741 + return r; 742 + } 743 + 744 + static void insert_in_pre_cache(struct mq_policy *mq, 745 + dm_oblock_t oblock) 746 + { 747 + struct entry *e = alloc_entry(mq); 748 + 749 + if (!e) 750 + /* 751 + * There's no spare entry structure, so we grab the least 752 + * used one from the pre_cache. 
753 + */ 754 + e = pop(mq, &mq->pre_cache); 755 + 756 + if (unlikely(!e)) { 757 + DMWARN("couldn't pop from pre cache"); 758 + return; 759 + } 760 + 761 + e->in_cache = false; 762 + e->oblock = oblock; 763 + e->hit_count = 1; 764 + e->generation = mq->generation; 765 + push(mq, e); 766 + } 767 + 768 + static void insert_in_cache(struct mq_policy *mq, dm_oblock_t oblock, 769 + struct policy_result *result) 770 + { 771 + struct entry *e; 772 + dm_cblock_t cblock; 773 + 774 + if (find_free_cblock(mq, &cblock) == -ENOSPC) { 775 + result->op = POLICY_MISS; 776 + insert_in_pre_cache(mq, oblock); 777 + return; 778 + } 779 + 780 + e = alloc_entry(mq); 781 + if (unlikely(!e)) { 782 + result->op = POLICY_MISS; 783 + return; 784 + } 785 + 786 + e->oblock = oblock; 787 + e->cblock = cblock; 788 + e->in_cache = true; 789 + e->hit_count = 1; 790 + e->generation = mq->generation; 791 + push(mq, e); 792 + 793 + result->op = POLICY_NEW; 794 + result->cblock = e->cblock; 795 + } 796 + 797 + static int no_entry_found(struct mq_policy *mq, dm_oblock_t oblock, 798 + bool can_migrate, bool discarded_oblock, 799 + int data_dir, struct policy_result *result) 800 + { 801 + if (adjusted_promote_threshold(mq, discarded_oblock, data_dir) == 1) { 802 + if (can_migrate) 803 + insert_in_cache(mq, oblock, result); 804 + else 805 + return -EWOULDBLOCK; 806 + } else { 807 + insert_in_pre_cache(mq, oblock); 808 + result->op = POLICY_MISS; 809 + } 810 + 811 + return 0; 812 + } 813 + 814 + /* 815 + * Looks the oblock up in the hash table, then decides whether to put in 816 + * pre_cache, or cache etc. 
817 + */ 818 + static int map(struct mq_policy *mq, dm_oblock_t oblock, 819 + bool can_migrate, bool discarded_oblock, 820 + int data_dir, struct policy_result *result) 821 + { 822 + int r = 0; 823 + struct entry *e = hash_lookup(mq, oblock); 824 + 825 + if (e && e->in_cache) 826 + r = cache_entry_found(mq, e, result); 827 + else if (iot_pattern(&mq->tracker) == PATTERN_SEQUENTIAL) 828 + result->op = POLICY_MISS; 829 + else if (e) 830 + r = pre_cache_entry_found(mq, e, can_migrate, discarded_oblock, 831 + data_dir, result); 832 + else 833 + r = no_entry_found(mq, oblock, can_migrate, discarded_oblock, 834 + data_dir, result); 835 + 836 + if (r == -EWOULDBLOCK) 837 + result->op = POLICY_MISS; 838 + 839 + return r; 840 + } 841 + 842 + /*----------------------------------------------------------------*/ 843 + 844 + /* 845 + * Public interface, via the policy struct. See dm-cache-policy.h for a 846 + * description of these. 847 + */ 848 + 849 + static struct mq_policy *to_mq_policy(struct dm_cache_policy *p) 850 + { 851 + return container_of(p, struct mq_policy, policy); 852 + } 853 + 854 + static void mq_destroy(struct dm_cache_policy *p) 855 + { 856 + struct mq_policy *mq = to_mq_policy(p); 857 + 858 + free_bitset(mq->allocation_bitset); 859 + kfree(mq->table); 860 + free_entries(mq); 861 + kfree(mq); 862 + } 863 + 864 + static void copy_tick(struct mq_policy *mq) 865 + { 866 + unsigned long flags; 867 + 868 + spin_lock_irqsave(&mq->tick_lock, flags); 869 + mq->tick = mq->tick_protected; 870 + spin_unlock_irqrestore(&mq->tick_lock, flags); 871 + } 872 + 873 + static int mq_map(struct dm_cache_policy *p, dm_oblock_t oblock, 874 + bool can_block, bool can_migrate, bool discarded_oblock, 875 + struct bio *bio, struct policy_result *result) 876 + { 877 + int r; 878 + struct mq_policy *mq = to_mq_policy(p); 879 + 880 + result->op = POLICY_MISS; 881 + 882 + if (can_block) 883 + mutex_lock(&mq->lock); 884 + else if (!mutex_trylock(&mq->lock)) 885 + return -EWOULDBLOCK; 886 
+ 887 + copy_tick(mq); 888 + 889 + iot_examine_bio(&mq->tracker, bio); 890 + r = map(mq, oblock, can_migrate, discarded_oblock, 891 + bio_data_dir(bio), result); 892 + 893 + mutex_unlock(&mq->lock); 894 + 895 + return r; 896 + } 897 + 898 + static int mq_lookup(struct dm_cache_policy *p, dm_oblock_t oblock, dm_cblock_t *cblock) 899 + { 900 + int r; 901 + struct mq_policy *mq = to_mq_policy(p); 902 + struct entry *e; 903 + 904 + if (!mutex_trylock(&mq->lock)) 905 + return -EWOULDBLOCK; 906 + 907 + e = hash_lookup(mq, oblock); 908 + if (e && e->in_cache) { 909 + *cblock = e->cblock; 910 + r = 0; 911 + } else 912 + r = -ENOENT; 913 + 914 + mutex_unlock(&mq->lock); 915 + 916 + return r; 917 + } 918 + 919 + static int mq_load_mapping(struct dm_cache_policy *p, 920 + dm_oblock_t oblock, dm_cblock_t cblock, 921 + uint32_t hint, bool hint_valid) 922 + { 923 + struct mq_policy *mq = to_mq_policy(p); 924 + struct entry *e; 925 + 926 + e = alloc_entry(mq); 927 + if (!e) 928 + return -ENOMEM; 929 + 930 + e->cblock = cblock; 931 + e->oblock = oblock; 932 + e->in_cache = true; 933 + e->hit_count = hint_valid ? 
hint : 1; 934 + e->generation = mq->generation; 935 + push(mq, e); 936 + 937 + return 0; 938 + } 939 + 940 + static int mq_walk_mappings(struct dm_cache_policy *p, policy_walk_fn fn, 941 + void *context) 942 + { 943 + struct mq_policy *mq = to_mq_policy(p); 944 + int r = 0; 945 + struct entry *e; 946 + unsigned level; 947 + 948 + mutex_lock(&mq->lock); 949 + 950 + for (level = 0; level < NR_QUEUE_LEVELS; level++) 951 + list_for_each_entry(e, &mq->cache.qs[level], list) { 952 + r = fn(context, e->cblock, e->oblock, e->hit_count); 953 + if (r) 954 + goto out; 955 + } 956 + 957 + out: 958 + mutex_unlock(&mq->lock); 959 + 960 + return r; 961 + } 962 + 963 + static void remove_mapping(struct mq_policy *mq, dm_oblock_t oblock) 964 + { 965 + struct entry *e = hash_lookup(mq, oblock); 966 + 967 + BUG_ON(!e || !e->in_cache); 968 + 969 + del(mq, e); 970 + e->in_cache = false; 971 + push(mq, e); 972 + } 973 + 974 + static void mq_remove_mapping(struct dm_cache_policy *p, dm_oblock_t oblock) 975 + { 976 + struct mq_policy *mq = to_mq_policy(p); 977 + 978 + mutex_lock(&mq->lock); 979 + remove_mapping(mq, oblock); 980 + mutex_unlock(&mq->lock); 981 + } 982 + 983 + static void force_mapping(struct mq_policy *mq, 984 + dm_oblock_t current_oblock, dm_oblock_t new_oblock) 985 + { 986 + struct entry *e = hash_lookup(mq, current_oblock); 987 + 988 + BUG_ON(!e || !e->in_cache); 989 + 990 + del(mq, e); 991 + e->oblock = new_oblock; 992 + push(mq, e); 993 + } 994 + 995 + static void mq_force_mapping(struct dm_cache_policy *p, 996 + dm_oblock_t current_oblock, dm_oblock_t new_oblock) 997 + { 998 + struct mq_policy *mq = to_mq_policy(p); 999 + 1000 + mutex_lock(&mq->lock); 1001 + force_mapping(mq, current_oblock, new_oblock); 1002 + mutex_unlock(&mq->lock); 1003 + } 1004 + 1005 + static dm_cblock_t mq_residency(struct dm_cache_policy *p) 1006 + { 1007 + struct mq_policy *mq = to_mq_policy(p); 1008 + 1009 + /* FIXME: lock mutex, not sure we can block here */ 1010 + return 
to_cblock(mq->nr_cblocks_allocated); 1011 + } 1012 + 1013 + static void mq_tick(struct dm_cache_policy *p) 1014 + { 1015 + struct mq_policy *mq = to_mq_policy(p); 1016 + unsigned long flags; 1017 + 1018 + spin_lock_irqsave(&mq->tick_lock, flags); 1019 + mq->tick_protected++; 1020 + spin_unlock_irqrestore(&mq->tick_lock, flags); 1021 + } 1022 + 1023 + static int mq_set_config_value(struct dm_cache_policy *p, 1024 + const char *key, const char *value) 1025 + { 1026 + struct mq_policy *mq = to_mq_policy(p); 1027 + enum io_pattern pattern; 1028 + unsigned long tmp; 1029 + 1030 + if (!strcasecmp(key, "random_threshold")) 1031 + pattern = PATTERN_RANDOM; 1032 + else if (!strcasecmp(key, "sequential_threshold")) 1033 + pattern = PATTERN_SEQUENTIAL; 1034 + else 1035 + return -EINVAL; 1036 + 1037 + if (kstrtoul(value, 10, &tmp)) 1038 + return -EINVAL; 1039 + 1040 + mq->tracker.thresholds[pattern] = tmp; 1041 + 1042 + return 0; 1043 + } 1044 + 1045 + static int mq_emit_config_values(struct dm_cache_policy *p, char *result, unsigned maxlen) 1046 + { 1047 + ssize_t sz = 0; 1048 + struct mq_policy *mq = to_mq_policy(p); 1049 + 1050 + DMEMIT("4 random_threshold %u sequential_threshold %u", 1051 + mq->tracker.thresholds[PATTERN_RANDOM], 1052 + mq->tracker.thresholds[PATTERN_SEQUENTIAL]); 1053 + 1054 + return 0; 1055 + } 1056 + 1057 + /* Init the policy plugin interface function pointers. 
*/ 1058 + static void init_policy_functions(struct mq_policy *mq) 1059 + { 1060 + mq->policy.destroy = mq_destroy; 1061 + mq->policy.map = mq_map; 1062 + mq->policy.lookup = mq_lookup; 1063 + mq->policy.load_mapping = mq_load_mapping; 1064 + mq->policy.walk_mappings = mq_walk_mappings; 1065 + mq->policy.remove_mapping = mq_remove_mapping; 1066 + mq->policy.writeback_work = NULL; 1067 + mq->policy.force_mapping = mq_force_mapping; 1068 + mq->policy.residency = mq_residency; 1069 + mq->policy.tick = mq_tick; 1070 + mq->policy.emit_config_values = mq_emit_config_values; 1071 + mq->policy.set_config_value = mq_set_config_value; 1072 + } 1073 + 1074 + static struct dm_cache_policy *mq_create(dm_cblock_t cache_size, 1075 + sector_t origin_size, 1076 + sector_t cache_block_size) 1077 + { 1078 + int r; 1079 + struct mq_policy *mq = kzalloc(sizeof(*mq), GFP_KERNEL); 1080 + 1081 + if (!mq) 1082 + return NULL; 1083 + 1084 + init_policy_functions(mq); 1085 + iot_init(&mq->tracker, SEQUENTIAL_THRESHOLD_DEFAULT, RANDOM_THRESHOLD_DEFAULT); 1086 + 1087 + mq->cache_size = cache_size; 1088 + mq->tick_protected = 0; 1089 + mq->tick = 0; 1090 + mq->hit_count = 0; 1091 + mq->generation = 0; 1092 + mq->promote_threshold = 0; 1093 + mutex_init(&mq->lock); 1094 + spin_lock_init(&mq->tick_lock); 1095 + mq->find_free_nr_words = dm_div_up(from_cblock(mq->cache_size), BITS_PER_LONG); 1096 + mq->find_free_last_word = 0; 1097 + 1098 + queue_init(&mq->pre_cache); 1099 + queue_init(&mq->cache); 1100 + mq->generation_period = max((unsigned) from_cblock(cache_size), 1024U); 1101 + 1102 + mq->nr_entries = 2 * from_cblock(cache_size); 1103 + r = alloc_entries(mq, mq->nr_entries); 1104 + if (r) 1105 + goto bad_cache_alloc; 1106 + 1107 + mq->nr_entries_allocated = 0; 1108 + mq->nr_cblocks_allocated = 0; 1109 + 1110 + mq->nr_buckets = next_power(from_cblock(cache_size) / 2, 16); 1111 + mq->hash_bits = ffs(mq->nr_buckets) - 1; 1112 + mq->table = kzalloc(sizeof(*mq->table) * mq->nr_buckets, GFP_KERNEL); 
1113 + if (!mq->table) 1114 + goto bad_alloc_table; 1115 + 1116 + mq->allocation_bitset = alloc_bitset(from_cblock(cache_size)); 1117 + if (!mq->allocation_bitset) 1118 + goto bad_alloc_bitset; 1119 + 1120 + return &mq->policy; 1121 + 1122 + bad_alloc_bitset: 1123 + kfree(mq->table); 1124 + bad_alloc_table: 1125 + free_entries(mq); 1126 + bad_cache_alloc: 1127 + kfree(mq); 1128 + 1129 + return NULL; 1130 + } 1131 + 1132 + /*----------------------------------------------------------------*/ 1133 + 1134 + static struct dm_cache_policy_type mq_policy_type = { 1135 + .name = "mq", 1136 + .hint_size = 4, 1137 + .owner = THIS_MODULE, 1138 + .create = mq_create 1139 + }; 1140 + 1141 + static struct dm_cache_policy_type default_policy_type = { 1142 + .name = "default", 1143 + .hint_size = 4, 1144 + .owner = THIS_MODULE, 1145 + .create = mq_create 1146 + }; 1147 + 1148 + static int __init mq_init(void) 1149 + { 1150 + int r; 1151 + 1152 + mq_entry_cache = kmem_cache_create("dm_mq_policy_cache_entry", 1153 + sizeof(struct entry), 1154 + __alignof__(struct entry), 1155 + 0, NULL); 1156 + if (!mq_entry_cache) 1157 + goto bad; 1158 + 1159 + r = dm_cache_policy_register(&mq_policy_type); 1160 + if (r) { 1161 + DMERR("register failed %d", r); 1162 + goto bad_register_mq; 1163 + } 1164 + 1165 + r = dm_cache_policy_register(&default_policy_type); 1166 + if (!r) { 1167 + DMINFO("version " MQ_VERSION " loaded"); 1168 + return 0; 1169 + } 1170 + 1171 + DMERR("register failed (as default) %d", r); 1172 + 1173 + dm_cache_policy_unregister(&mq_policy_type); 1174 + bad_register_mq: 1175 + kmem_cache_destroy(mq_entry_cache); 1176 + bad: 1177 + return -ENOMEM; 1178 + } 1179 + 1180 + static void __exit mq_exit(void) 1181 + { 1182 + dm_cache_policy_unregister(&mq_policy_type); 1183 + dm_cache_policy_unregister(&default_policy_type); 1184 + 1185 + kmem_cache_destroy(mq_entry_cache); 1186 + } 1187 + 1188 + module_init(mq_init); 1189 + module_exit(mq_exit); 1190 + 1191 + MODULE_AUTHOR("Joe 
Thornber <dm-devel@redhat.com>"); 1192 + MODULE_LICENSE("GPL"); 1193 + MODULE_DESCRIPTION("mq cache policy"); 1194 + 1195 + MODULE_ALIAS("dm-cache-default");
+161
drivers/md/dm-cache-policy.c
··· 1 + /* 2 + * Copyright (C) 2012 Red Hat. All rights reserved. 3 + * 4 + * This file is released under the GPL. 5 + */ 6 + 7 + #include "dm-cache-policy-internal.h" 8 + #include "dm.h" 9 + 10 + #include <linux/module.h> 11 + #include <linux/slab.h> 12 + 13 + /*----------------------------------------------------------------*/ 14 + 15 + #define DM_MSG_PREFIX "cache-policy" 16 + 17 + static DEFINE_SPINLOCK(register_lock); 18 + static LIST_HEAD(register_list); 19 + 20 + static struct dm_cache_policy_type *__find_policy(const char *name) 21 + { 22 + struct dm_cache_policy_type *t; 23 + 24 + list_for_each_entry(t, &register_list, list) 25 + if (!strcmp(t->name, name)) 26 + return t; 27 + 28 + return NULL; 29 + } 30 + 31 + static struct dm_cache_policy_type *__get_policy_once(const char *name) 32 + { 33 + struct dm_cache_policy_type *t = __find_policy(name); 34 + 35 + if (t && !try_module_get(t->owner)) { 36 + DMWARN("couldn't get module %s", name); 37 + t = ERR_PTR(-EINVAL); 38 + } 39 + 40 + return t; 41 + } 42 + 43 + static struct dm_cache_policy_type *get_policy_once(const char *name) 44 + { 45 + struct dm_cache_policy_type *t; 46 + 47 + spin_lock(&register_lock); 48 + t = __get_policy_once(name); 49 + spin_unlock(&register_lock); 50 + 51 + return t; 52 + } 53 + 54 + static struct dm_cache_policy_type *get_policy(const char *name) 55 + { 56 + struct dm_cache_policy_type *t; 57 + 58 + t = get_policy_once(name); 59 + if (IS_ERR(t)) 60 + return NULL; 61 + 62 + if (t) 63 + return t; 64 + 65 + request_module("dm-cache-%s", name); 66 + 67 + t = get_policy_once(name); 68 + if (IS_ERR(t)) 69 + return NULL; 70 + 71 + return t; 72 + } 73 + 74 + static void put_policy(struct dm_cache_policy_type *t) 75 + { 76 + module_put(t->owner); 77 + } 78 + 79 + int dm_cache_policy_register(struct dm_cache_policy_type *type) 80 + { 81 + int r; 82 + 83 + /* One size fits all for now */ 84 + if (type->hint_size != 0 && type->hint_size != 4) { 85 + DMWARN("hint size must be 0 or 4 but %llu 
supplied.", (unsigned long long) type->hint_size); 86 + return -EINVAL; 87 + } 88 + 89 + spin_lock(&register_lock); 90 + if (__find_policy(type->name)) { 91 + DMWARN("attempt to register policy under duplicate name %s", type->name); 92 + r = -EINVAL; 93 + } else { 94 + list_add(&type->list, &register_list); 95 + r = 0; 96 + } 97 + spin_unlock(&register_lock); 98 + 99 + return r; 100 + } 101 + EXPORT_SYMBOL_GPL(dm_cache_policy_register); 102 + 103 + void dm_cache_policy_unregister(struct dm_cache_policy_type *type) 104 + { 105 + spin_lock(&register_lock); 106 + list_del_init(&type->list); 107 + spin_unlock(&register_lock); 108 + } 109 + EXPORT_SYMBOL_GPL(dm_cache_policy_unregister); 110 + 111 + struct dm_cache_policy *dm_cache_policy_create(const char *name, 112 + dm_cblock_t cache_size, 113 + sector_t origin_size, 114 + sector_t cache_block_size) 115 + { 116 + struct dm_cache_policy *p = NULL; 117 + struct dm_cache_policy_type *type; 118 + 119 + type = get_policy(name); 120 + if (!type) { 121 + DMWARN("unknown policy type"); 122 + return NULL; 123 + } 124 + 125 + p = type->create(cache_size, origin_size, cache_block_size); 126 + if (!p) { 127 + put_policy(type); 128 + return NULL; 129 + } 130 + p->private = type; 131 + 132 + return p; 133 + } 134 + EXPORT_SYMBOL_GPL(dm_cache_policy_create); 135 + 136 + void dm_cache_policy_destroy(struct dm_cache_policy *p) 137 + { 138 + struct dm_cache_policy_type *t = p->private; 139 + 140 + p->destroy(p); 141 + put_policy(t); 142 + } 143 + EXPORT_SYMBOL_GPL(dm_cache_policy_destroy); 144 + 145 + const char *dm_cache_policy_get_name(struct dm_cache_policy *p) 146 + { 147 + struct dm_cache_policy_type *t = p->private; 148 + 149 + return t->name; 150 + } 151 + EXPORT_SYMBOL_GPL(dm_cache_policy_get_name); 152 + 153 + size_t dm_cache_policy_get_hint_size(struct dm_cache_policy *p) 154 + { 155 + struct dm_cache_policy_type *t = p->private; 156 + 157 + return t->hint_size; 158 + } 159 + EXPORT_SYMBOL_GPL(dm_cache_policy_get_hint_size); 
160 + 161 + /*----------------------------------------------------------------*/
+228
drivers/md/dm-cache-policy.h
··· 1 + /* 2 + * Copyright (C) 2012 Red Hat. All rights reserved. 3 + * 4 + * This file is released under the GPL. 5 + */ 6 + 7 + #ifndef DM_CACHE_POLICY_H 8 + #define DM_CACHE_POLICY_H 9 + 10 + #include "dm-cache-block-types.h" 11 + 12 + #include <linux/device-mapper.h> 13 + 14 + /*----------------------------------------------------------------*/ 15 + 16 + /* FIXME: make it clear which methods are optional. Get debug policy to 17 + * double check this at start. 18 + */ 19 + 20 + /* 21 + * The cache policy makes the important decisions about which blocks get to 22 + * live on the faster cache device. 23 + * 24 + * When the core target has to remap a bio it calls the 'map' method of the 25 + * policy. This returns an instruction telling the core target what to do. 26 + * 27 + * POLICY_HIT: 28 + * That block is in the cache. Remap to the cache and carry on. 29 + * 30 + * POLICY_MISS: 31 + * This block is on the origin device. Remap and carry on. 32 + * 33 + * POLICY_NEW: 34 + * This block is currently on the origin device, but the policy wants to 35 + * move it. The core should: 36 + * 37 + * - hold any further io to this origin block 38 + * - copy the origin to the given cache block 39 + * - release all the held blocks 40 + * - remap the original block to the cache 41 + * 42 + * POLICY_REPLACE: 43 + * This block is currently on the origin device. The policy wants to 44 + * move it to the cache, with the added complication that the destination 45 + * cache block needs a writeback first. The core should: 46 + * 47 + * - hold any further io to this origin block 48 + * - hold any further io to the origin block that's being written back 49 + * - writeback 50 + * - copy new block to cache 51 + * - release held blocks 52 + * - remap bio to cache and reissue. 53 + * 54 + * Should the core run into trouble while processing a POLICY_NEW or 55 + * POLICY_REPLACE instruction it will roll back the policy's mapping using 56 + * remove_mapping() or force_mapping(). 
These methods must not fail. This 57 + * approach avoids having transactional semantics in the policy (ie, the 58 + * core informing the policy when a migration is complete), and hence makes 59 + * it easier to write new policies. 60 + * 61 + * In general policy methods should never block, except in the case of the 62 + * map function when can_migrate is set. So be careful to implement using 63 + * bounded, preallocated memory. 64 + */ 65 + enum policy_operation { 66 + POLICY_HIT, 67 + POLICY_MISS, 68 + POLICY_NEW, 69 + POLICY_REPLACE 70 + }; 71 + 72 + /* 73 + * This is the instruction passed back to the core target. 74 + */ 75 + struct policy_result { 76 + enum policy_operation op; 77 + dm_oblock_t old_oblock; /* POLICY_REPLACE */ 78 + dm_cblock_t cblock; /* POLICY_HIT, POLICY_NEW, POLICY_REPLACE */ 79 + }; 80 + 81 + typedef int (*policy_walk_fn)(void *context, dm_cblock_t cblock, 82 + dm_oblock_t oblock, uint32_t hint); 83 + 84 + /* 85 + * The cache policy object. Just a bunch of methods. It is envisaged that 86 + * this structure will be embedded in a bigger, policy specific structure 87 + * (ie. use container_of()). 88 + */ 89 + struct dm_cache_policy { 90 + 91 + /* 92 + * FIXME: make it clear which methods are optional, and which may 93 + * block. 94 + */ 95 + 96 + /* 97 + * Destroys this object. 98 + */ 99 + void (*destroy)(struct dm_cache_policy *p); 100 + 101 + /* 102 + * See large comment above. 103 + * 104 + * oblock - the origin block we're interested in. 105 + * 106 + * can_block - indicates whether the current thread is allowed to 107 + * block. -EWOULDBLOCK returned if it can't and would. 108 + * 109 + * can_migrate - gives permission for POLICY_NEW or POLICY_REPLACE 110 + * instructions. If denied and the policy would have 111 + * returned one of these instructions it should 112 + * return -EWOULDBLOCK. 
113 + * 114 + * discarded_oblock - indicates whether the whole origin block is 115 + * in a discarded state (FIXME: better to tell the 116 + * policy about this sooner, so it can recycle that 117 + * cache block if it wants.) 118 + * bio - the bio that triggered this call. 119 + * result - gets filled in with the instruction. 120 + * 121 + * May only return 0, or -EWOULDBLOCK (if !can_migrate) 122 + */ 123 + int (*map)(struct dm_cache_policy *p, dm_oblock_t oblock, 124 + bool can_block, bool can_migrate, bool discarded_oblock, 125 + struct bio *bio, struct policy_result *result); 126 + 127 + /* 128 + * Sometimes we want to see if a block is in the cache, without 129 + * triggering any update of stats. (ie. it's not a real hit). 130 + * 131 + * Must not block. 132 + * 133 + * Returns 1 iff in cache, 0 iff not, < 0 on error (-EWOULDBLOCK 134 + * would be typical). 135 + */ 136 + int (*lookup)(struct dm_cache_policy *p, dm_oblock_t oblock, dm_cblock_t *cblock); 137 + 138 + /* 139 + * oblock must be a mapped block. Must not block. 140 + */ 141 + void (*set_dirty)(struct dm_cache_policy *p, dm_oblock_t oblock); 142 + void (*clear_dirty)(struct dm_cache_policy *p, dm_oblock_t oblock); 143 + 144 + /* 145 + * Called when a cache target is first created. Used to load a 146 + * mapping from the metadata device into the policy. 147 + */ 148 + int (*load_mapping)(struct dm_cache_policy *p, dm_oblock_t oblock, 149 + dm_cblock_t cblock, uint32_t hint, bool hint_valid); 150 + 151 + int (*walk_mappings)(struct dm_cache_policy *p, policy_walk_fn fn, 152 + void *context); 153 + 154 + /* 155 + * Override functions used on the error paths of the core target. 156 + * They must succeed. 
157 + */ 158 + void (*remove_mapping)(struct dm_cache_policy *p, dm_oblock_t oblock); 159 + void (*force_mapping)(struct dm_cache_policy *p, dm_oblock_t current_oblock, 160 + dm_oblock_t new_oblock); 161 + 162 + int (*writeback_work)(struct dm_cache_policy *p, dm_oblock_t *oblock, dm_cblock_t *cblock); 163 + 164 + 165 + /* 166 + * How full is the cache? 167 + */ 168 + dm_cblock_t (*residency)(struct dm_cache_policy *p); 169 + 170 + /* 171 + * Because of where we sit in the block layer, we can be asked to 172 + * map a lot of little bios that are all in the same block (no 173 + * queue merging has occurred). To stop the policy being fooled by 174 + * these the core target sends regular tick() calls to the policy. 175 + * The policy should only count an entry as hit once per tick. 176 + */ 177 + void (*tick)(struct dm_cache_policy *p); 178 + 179 + /* 180 + * Configuration. 181 + */ 182 + int (*emit_config_values)(struct dm_cache_policy *p, 183 + char *result, unsigned maxlen); 184 + int (*set_config_value)(struct dm_cache_policy *p, 185 + const char *key, const char *value); 186 + 187 + /* 188 + * Book keeping ptr for the policy register, not for general use. 189 + */ 190 + void *private; 191 + }; 192 + 193 + /*----------------------------------------------------------------*/ 194 + 195 + /* 196 + * We maintain a little register of the different policy types. 197 + */ 198 + #define CACHE_POLICY_NAME_SIZE 16 199 + 200 + struct dm_cache_policy_type { 201 + /* For use by the register code only. */ 202 + struct list_head list; 203 + 204 + /* 205 + * Policy writers should fill in these fields. The name field is 206 + * what gets passed on the target line to select your policy. 207 + */ 208 + char name[CACHE_POLICY_NAME_SIZE]; 209 + 210 + /* 211 + * Policies may store a hint for each cache block. 212 + * Currently the size of this hint must be 0 or 4 bytes but we 213 + * expect to relax this in future. 
214 + */ 215 + size_t hint_size; 216 + 217 + struct module *owner; 218 + struct dm_cache_policy *(*create)(dm_cblock_t cache_size, 219 + sector_t origin_size, 220 + sector_t block_size); 221 + }; 222 + 223 + int dm_cache_policy_register(struct dm_cache_policy_type *type); 224 + void dm_cache_policy_unregister(struct dm_cache_policy_type *type); 225 + 226 + /*----------------------------------------------------------------*/ 227 + 228 + #endif /* DM_CACHE_POLICY_H */
+2584
drivers/md/dm-cache-target.c
··· 1 + /* 2 + * Copyright (C) 2012 Red Hat. All rights reserved. 3 + * 4 + * This file is released under the GPL. 5 + */ 6 + 7 + #include "dm.h" 8 + #include "dm-bio-prison.h" 9 + #include "dm-cache-metadata.h" 10 + 11 + #include <linux/dm-io.h> 12 + #include <linux/dm-kcopyd.h> 13 + #include <linux/init.h> 14 + #include <linux/mempool.h> 15 + #include <linux/module.h> 16 + #include <linux/slab.h> 17 + #include <linux/vmalloc.h> 18 + 19 + #define DM_MSG_PREFIX "cache" 20 + 21 + DECLARE_DM_KCOPYD_THROTTLE_WITH_MODULE_PARM(cache_copy_throttle, 22 + "A percentage of time allocated for copying to and/or from cache"); 23 + 24 + /*----------------------------------------------------------------*/ 25 + 26 + /* 27 + * Glossary: 28 + * 29 + * oblock: index of an origin block 30 + * cblock: index of a cache block 31 + * promotion: movement of a block from origin to cache 32 + * demotion: movement of a block from cache to origin 33 + * migration: movement of a block between the origin and cache device, 34 + * either direction 35 + */ 36 + 37 + /*----------------------------------------------------------------*/ 38 + 39 + static size_t bitset_size_in_bytes(unsigned nr_entries) 40 + { 41 + return sizeof(unsigned long) * dm_div_up(nr_entries, BITS_PER_LONG); 42 + } 43 + 44 + static unsigned long *alloc_bitset(unsigned nr_entries) 45 + { 46 + size_t s = bitset_size_in_bytes(nr_entries); 47 + return vzalloc(s); 48 + } 49 + 50 + static void clear_bitset(void *bitset, unsigned nr_entries) 51 + { 52 + size_t s = bitset_size_in_bytes(nr_entries); 53 + memset(bitset, 0, s); 54 + } 55 + 56 + static void free_bitset(unsigned long *bits) 57 + { 58 + vfree(bits); 59 + } 60 + 61 + /*----------------------------------------------------------------*/ 62 + 63 + #define PRISON_CELLS 1024 64 + #define MIGRATION_POOL_SIZE 128 65 + #define COMMIT_PERIOD HZ 66 + #define MIGRATION_COUNT_WINDOW 10 67 + 68 + /* 69 + * The block size of the device holding cache data must be >= 32KB 70 + */ 71 + 
#define DATA_DEV_BLOCK_SIZE_MIN_SECTORS (32 * 1024 >> SECTOR_SHIFT) 72 + 73 + /* 74 + * FIXME: the cache is read/write for the time being. 75 + */ 76 + enum cache_mode { 77 + CM_WRITE, /* metadata may be changed */ 78 + CM_READ_ONLY, /* metadata may not be changed */ 79 + }; 80 + 81 + struct cache_features { 82 + enum cache_mode mode; 83 + bool write_through:1; 84 + }; 85 + 86 + struct cache_stats { 87 + atomic_t read_hit; 88 + atomic_t read_miss; 89 + atomic_t write_hit; 90 + atomic_t write_miss; 91 + atomic_t demotion; 92 + atomic_t promotion; 93 + atomic_t copies_avoided; 94 + atomic_t cache_cell_clash; 95 + atomic_t commit_count; 96 + atomic_t discard_count; 97 + }; 98 + 99 + struct cache { 100 + struct dm_target *ti; 101 + struct dm_target_callbacks callbacks; 102 + 103 + /* 104 + * Metadata is written to this device. 105 + */ 106 + struct dm_dev *metadata_dev; 107 + 108 + /* 109 + * The slower of the two data devices. Typically a spindle. 110 + */ 111 + struct dm_dev *origin_dev; 112 + 113 + /* 114 + * The faster of the two data devices. Typically an SSD. 115 + */ 116 + struct dm_dev *cache_dev; 117 + 118 + /* 119 + * Cache features such as write-through. 120 + */ 121 + struct cache_features features; 122 + 123 + /* 124 + * Size of the origin device in _complete_ blocks and native sectors. 125 + */ 126 + dm_oblock_t origin_blocks; 127 + sector_t origin_sectors; 128 + 129 + /* 130 + * Size of the cache device in blocks. 131 + */ 132 + dm_cblock_t cache_size; 133 + 134 + /* 135 + * Fields for converting from sectors to blocks. 
136 + */ 137 + uint32_t sectors_per_block; 138 + int sectors_per_block_shift; 139 + 140 + struct dm_cache_metadata *cmd; 141 + 142 + spinlock_t lock; 143 + struct bio_list deferred_bios; 144 + struct bio_list deferred_flush_bios; 145 + struct list_head quiesced_migrations; 146 + struct list_head completed_migrations; 147 + struct list_head need_commit_migrations; 148 + sector_t migration_threshold; 149 + atomic_t nr_migrations; 150 + wait_queue_head_t migration_wait; 151 + 152 + /* 153 + * cache_size entries, dirty if set 154 + */ 155 + dm_cblock_t nr_dirty; 156 + unsigned long *dirty_bitset; 157 + 158 + /* 159 + * origin_blocks entries, discarded if set. 160 + */ 161 + sector_t discard_block_size; /* a power of 2 times sectors per block */ 162 + dm_dblock_t discard_nr_blocks; 163 + unsigned long *discard_bitset; 164 + 165 + struct dm_kcopyd_client *copier; 166 + struct workqueue_struct *wq; 167 + struct work_struct worker; 168 + 169 + struct delayed_work waker; 170 + unsigned long last_commit_jiffies; 171 + 172 + struct dm_bio_prison *prison; 173 + struct dm_deferred_set *all_io_ds; 174 + 175 + mempool_t *migration_pool; 176 + struct dm_cache_migration *next_migration; 177 + 178 + struct dm_cache_policy *policy; 179 + unsigned policy_nr_args; 180 + 181 + bool need_tick_bio:1; 182 + bool sized:1; 183 + bool quiescing:1; 184 + bool commit_requested:1; 185 + bool loaded_mappings:1; 186 + bool loaded_discards:1; 187 + 188 + struct cache_stats stats; 189 + 190 + /* 191 + * Rather than reconstructing the table line for the status we just 192 + * save it and regurgitate. 
193 + */ 194 + unsigned nr_ctr_args; 195 + const char **ctr_args; 196 + }; 197 + 198 + struct per_bio_data { 199 + bool tick:1; 200 + unsigned req_nr:2; 201 + struct dm_deferred_entry *all_io_entry; 202 + }; 203 + 204 + struct dm_cache_migration { 205 + struct list_head list; 206 + struct cache *cache; 207 + 208 + unsigned long start_jiffies; 209 + dm_oblock_t old_oblock; 210 + dm_oblock_t new_oblock; 211 + dm_cblock_t cblock; 212 + 213 + bool err:1; 214 + bool writeback:1; 215 + bool demote:1; 216 + bool promote:1; 217 + 218 + struct dm_bio_prison_cell *old_ocell; 219 + struct dm_bio_prison_cell *new_ocell; 220 + }; 221 + 222 + /* 223 + * Processing a bio in the worker thread may require these memory 224 + * allocations. We prealloc to avoid deadlocks (the same worker thread 225 + * frees them back to the mempool). 226 + */ 227 + struct prealloc { 228 + struct dm_cache_migration *mg; 229 + struct dm_bio_prison_cell *cell1; 230 + struct dm_bio_prison_cell *cell2; 231 + }; 232 + 233 + static void wake_worker(struct cache *cache) 234 + { 235 + queue_work(cache->wq, &cache->worker); 236 + } 237 + 238 + /*----------------------------------------------------------------*/ 239 + 240 + static struct dm_bio_prison_cell *alloc_prison_cell(struct cache *cache) 241 + { 242 + /* FIXME: change to use a local slab. 
*/ 243 + return dm_bio_prison_alloc_cell(cache->prison, GFP_NOWAIT); 244 + } 245 + 246 + static void free_prison_cell(struct cache *cache, struct dm_bio_prison_cell *cell) 247 + { 248 + dm_bio_prison_free_cell(cache->prison, cell); 249 + } 250 + 251 + static int prealloc_data_structs(struct cache *cache, struct prealloc *p) 252 + { 253 + if (!p->mg) { 254 + p->mg = mempool_alloc(cache->migration_pool, GFP_NOWAIT); 255 + if (!p->mg) 256 + return -ENOMEM; 257 + } 258 + 259 + if (!p->cell1) { 260 + p->cell1 = alloc_prison_cell(cache); 261 + if (!p->cell1) 262 + return -ENOMEM; 263 + } 264 + 265 + if (!p->cell2) { 266 + p->cell2 = alloc_prison_cell(cache); 267 + if (!p->cell2) 268 + return -ENOMEM; 269 + } 270 + 271 + return 0; 272 + } 273 + 274 + static void prealloc_free_structs(struct cache *cache, struct prealloc *p) 275 + { 276 + if (p->cell2) 277 + free_prison_cell(cache, p->cell2); 278 + 279 + if (p->cell1) 280 + free_prison_cell(cache, p->cell1); 281 + 282 + if (p->mg) 283 + mempool_free(p->mg, cache->migration_pool); 284 + } 285 + 286 + static struct dm_cache_migration *prealloc_get_migration(struct prealloc *p) 287 + { 288 + struct dm_cache_migration *mg = p->mg; 289 + 290 + BUG_ON(!mg); 291 + p->mg = NULL; 292 + 293 + return mg; 294 + } 295 + 296 + /* 297 + * You must have a cell within the prealloc struct to return. If not this 298 + * function will BUG() rather than returning NULL. 299 + */ 300 + static struct dm_bio_prison_cell *prealloc_get_cell(struct prealloc *p) 301 + { 302 + struct dm_bio_prison_cell *r = NULL; 303 + 304 + if (p->cell1) { 305 + r = p->cell1; 306 + p->cell1 = NULL; 307 + 308 + } else if (p->cell2) { 309 + r = p->cell2; 310 + p->cell2 = NULL; 311 + } else 312 + BUG(); 313 + 314 + return r; 315 + } 316 + 317 + /* 318 + * You can't have more than two cells in a prealloc struct. BUG() will be 319 + * called if you try and overfill. 
320 + */ 321 + static void prealloc_put_cell(struct prealloc *p, struct dm_bio_prison_cell *cell) 322 + { 323 + if (!p->cell2) 324 + p->cell2 = cell; 325 + 326 + else if (!p->cell1) 327 + p->cell1 = cell; 328 + 329 + else 330 + BUG(); 331 + } 332 + 333 + /*----------------------------------------------------------------*/ 334 + 335 + static void build_key(dm_oblock_t oblock, struct dm_cell_key *key) 336 + { 337 + key->virtual = 0; 338 + key->dev = 0; 339 + key->block = from_oblock(oblock); 340 + } 341 + 342 + /* 343 + * The caller hands in a preallocated cell, and a free function for it. 344 + * The cell will be freed if there's an error, or if it wasn't used because 345 + * a cell with that key already exists. 346 + */ 347 + typedef void (*cell_free_fn)(void *context, struct dm_bio_prison_cell *cell); 348 + 349 + static int bio_detain(struct cache *cache, dm_oblock_t oblock, 350 + struct bio *bio, struct dm_bio_prison_cell *cell_prealloc, 351 + cell_free_fn free_fn, void *free_context, 352 + struct dm_bio_prison_cell **cell_result) 353 + { 354 + int r; 355 + struct dm_cell_key key; 356 + 357 + build_key(oblock, &key); 358 + r = dm_bio_detain(cache->prison, &key, bio, cell_prealloc, cell_result); 359 + if (r) 360 + free_fn(free_context, cell_prealloc); 361 + 362 + return r; 363 + } 364 + 365 + static int get_cell(struct cache *cache, 366 + dm_oblock_t oblock, 367 + struct prealloc *structs, 368 + struct dm_bio_prison_cell **cell_result) 369 + { 370 + int r; 371 + struct dm_cell_key key; 372 + struct dm_bio_prison_cell *cell_prealloc; 373 + 374 + cell_prealloc = prealloc_get_cell(structs); 375 + 376 + build_key(oblock, &key); 377 + r = dm_get_cell(cache->prison, &key, cell_prealloc, cell_result); 378 + if (r) 379 + prealloc_put_cell(structs, cell_prealloc); 380 + 381 + return r; 382 + } 383 + 384 + /*----------------------------------------------------------------*/ 385 + 386 + static bool is_dirty(struct cache *cache, dm_cblock_t b) 387 + { 388 + return 
test_bit(from_cblock(b), cache->dirty_bitset); 389 + } 390 + 391 + static void set_dirty(struct cache *cache, dm_oblock_t oblock, dm_cblock_t cblock) 392 + { 393 + if (!test_and_set_bit(from_cblock(cblock), cache->dirty_bitset)) { 394 + cache->nr_dirty = to_cblock(from_cblock(cache->nr_dirty) + 1); 395 + policy_set_dirty(cache->policy, oblock); 396 + } 397 + } 398 + 399 + static void clear_dirty(struct cache *cache, dm_oblock_t oblock, dm_cblock_t cblock) 400 + { 401 + if (test_and_clear_bit(from_cblock(cblock), cache->dirty_bitset)) { 402 + policy_clear_dirty(cache->policy, oblock); 403 + cache->nr_dirty = to_cblock(from_cblock(cache->nr_dirty) - 1); 404 + if (!from_cblock(cache->nr_dirty)) 405 + dm_table_event(cache->ti->table); 406 + } 407 + } 408 + 409 + /*----------------------------------------------------------------*/ 410 + static bool block_size_is_power_of_two(struct cache *cache) 411 + { 412 + return cache->sectors_per_block_shift >= 0; 413 + } 414 + 415 + static dm_dblock_t oblock_to_dblock(struct cache *cache, dm_oblock_t oblock) 416 + { 417 + sector_t discard_blocks = cache->discard_block_size; 418 + dm_block_t b = from_oblock(oblock); 419 + 420 + if (!block_size_is_power_of_two(cache)) 421 + (void) sector_div(discard_blocks, cache->sectors_per_block); 422 + else 423 + discard_blocks >>= cache->sectors_per_block_shift; 424 + 425 + (void) sector_div(b, discard_blocks); 426 + 427 + return to_dblock(b); 428 + } 429 + 430 + static void set_discard(struct cache *cache, dm_dblock_t b) 431 + { 432 + unsigned long flags; 433 + 434 + atomic_inc(&cache->stats.discard_count); 435 + 436 + spin_lock_irqsave(&cache->lock, flags); 437 + set_bit(from_dblock(b), cache->discard_bitset); 438 + spin_unlock_irqrestore(&cache->lock, flags); 439 + } 440 + 441 + static void clear_discard(struct cache *cache, dm_dblock_t b) 442 + { 443 + unsigned long flags; 444 + 445 + spin_lock_irqsave(&cache->lock, flags); 446 + clear_bit(from_dblock(b), cache->discard_bitset); 447 + 
spin_unlock_irqrestore(&cache->lock, flags); 448 + } 449 + 450 + static bool is_discarded(struct cache *cache, dm_dblock_t b) 451 + { 452 + int r; 453 + unsigned long flags; 454 + 455 + spin_lock_irqsave(&cache->lock, flags); 456 + r = test_bit(from_dblock(b), cache->discard_bitset); 457 + spin_unlock_irqrestore(&cache->lock, flags); 458 + 459 + return r; 460 + } 461 + 462 + static bool is_discarded_oblock(struct cache *cache, dm_oblock_t b) 463 + { 464 + int r; 465 + unsigned long flags; 466 + 467 + spin_lock_irqsave(&cache->lock, flags); 468 + r = test_bit(from_dblock(oblock_to_dblock(cache, b)), 469 + cache->discard_bitset); 470 + spin_unlock_irqrestore(&cache->lock, flags); 471 + 472 + return r; 473 + } 474 + 475 + /*----------------------------------------------------------------*/ 476 + 477 + static void load_stats(struct cache *cache) 478 + { 479 + struct dm_cache_statistics stats; 480 + 481 + dm_cache_metadata_get_stats(cache->cmd, &stats); 482 + atomic_set(&cache->stats.read_hit, stats.read_hits); 483 + atomic_set(&cache->stats.read_miss, stats.read_misses); 484 + atomic_set(&cache->stats.write_hit, stats.write_hits); 485 + atomic_set(&cache->stats.write_miss, stats.write_misses); 486 + } 487 + 488 + static void save_stats(struct cache *cache) 489 + { 490 + struct dm_cache_statistics stats; 491 + 492 + stats.read_hits = atomic_read(&cache->stats.read_hit); 493 + stats.read_misses = atomic_read(&cache->stats.read_miss); 494 + stats.write_hits = atomic_read(&cache->stats.write_hit); 495 + stats.write_misses = atomic_read(&cache->stats.write_miss); 496 + 497 + dm_cache_metadata_set_stats(cache->cmd, &stats); 498 + } 499 + 500 + /*---------------------------------------------------------------- 501 + * Per bio data 502 + *--------------------------------------------------------------*/ 503 + static struct per_bio_data *get_per_bio_data(struct bio *bio) 504 + { 505 + struct per_bio_data *pb = dm_per_bio_data(bio, sizeof(struct per_bio_data)); 506 + BUG_ON(!pb); 
507 + return pb; 508 + } 509 + 510 + static struct per_bio_data *init_per_bio_data(struct bio *bio) 511 + { 512 + struct per_bio_data *pb = get_per_bio_data(bio); 513 + 514 + pb->tick = false; 515 + pb->req_nr = dm_bio_get_target_bio_nr(bio); 516 + pb->all_io_entry = NULL; 517 + 518 + return pb; 519 + } 520 + 521 + /*---------------------------------------------------------------- 522 + * Remapping 523 + *--------------------------------------------------------------*/ 524 + static void remap_to_origin(struct cache *cache, struct bio *bio) 525 + { 526 + bio->bi_bdev = cache->origin_dev->bdev; 527 + } 528 + 529 + static void remap_to_cache(struct cache *cache, struct bio *bio, 530 + dm_cblock_t cblock) 531 + { 532 + sector_t bi_sector = bio->bi_sector; 533 + 534 + bio->bi_bdev = cache->cache_dev->bdev; 535 + if (!block_size_is_power_of_two(cache)) 536 + bio->bi_sector = (from_cblock(cblock) * cache->sectors_per_block) + 537 + sector_div(bi_sector, cache->sectors_per_block); 538 + else 539 + bio->bi_sector = (from_cblock(cblock) << cache->sectors_per_block_shift) | 540 + (bi_sector & (cache->sectors_per_block - 1)); 541 + } 542 + 543 + static void check_if_tick_bio_needed(struct cache *cache, struct bio *bio) 544 + { 545 + unsigned long flags; 546 + struct per_bio_data *pb = get_per_bio_data(bio); 547 + 548 + spin_lock_irqsave(&cache->lock, flags); 549 + if (cache->need_tick_bio && 550 + !(bio->bi_rw & (REQ_FUA | REQ_FLUSH | REQ_DISCARD))) { 551 + pb->tick = true; 552 + cache->need_tick_bio = false; 553 + } 554 + spin_unlock_irqrestore(&cache->lock, flags); 555 + } 556 + 557 + static void remap_to_origin_clear_discard(struct cache *cache, struct bio *bio, 558 + dm_oblock_t oblock) 559 + { 560 + check_if_tick_bio_needed(cache, bio); 561 + remap_to_origin(cache, bio); 562 + if (bio_data_dir(bio) == WRITE) 563 + clear_discard(cache, oblock_to_dblock(cache, oblock)); 564 + } 565 + 566 + static void remap_to_cache_dirty(struct cache *cache, struct bio *bio, 567 + 
dm_oblock_t oblock, dm_cblock_t cblock) 568 + { 569 + remap_to_cache(cache, bio, cblock); 570 + if (bio_data_dir(bio) == WRITE) { 571 + set_dirty(cache, oblock, cblock); 572 + clear_discard(cache, oblock_to_dblock(cache, oblock)); 573 + } 574 + } 575 + 576 + static dm_oblock_t get_bio_block(struct cache *cache, struct bio *bio) 577 + { 578 + sector_t block_nr = bio->bi_sector; 579 + 580 + if (!block_size_is_power_of_two(cache)) 581 + (void) sector_div(block_nr, cache->sectors_per_block); 582 + else 583 + block_nr >>= cache->sectors_per_block_shift; 584 + 585 + return to_oblock(block_nr); 586 + } 587 + 588 + static int bio_triggers_commit(struct cache *cache, struct bio *bio) 589 + { 590 + return bio->bi_rw & (REQ_FLUSH | REQ_FUA); 591 + } 592 + 593 + static void issue(struct cache *cache, struct bio *bio) 594 + { 595 + unsigned long flags; 596 + 597 + if (!bio_triggers_commit(cache, bio)) { 598 + generic_make_request(bio); 599 + return; 600 + } 601 + 602 + /* 603 + * Batch together any bios that trigger commits and then issue a 604 + * single commit for them in do_worker(). 605 + */ 606 + spin_lock_irqsave(&cache->lock, flags); 607 + cache->commit_requested = true; 608 + bio_list_add(&cache->deferred_flush_bios, bio); 609 + spin_unlock_irqrestore(&cache->lock, flags); 610 + } 611 + 612 + /*---------------------------------------------------------------- 613 + * Migration processing 614 + * 615 + * Migration covers moving data from the origin device to the cache, or 616 + * vice versa. 
617 + *--------------------------------------------------------------*/ 618 + static void free_migration(struct dm_cache_migration *mg) 619 + { 620 + mempool_free(mg, mg->cache->migration_pool); 621 + } 622 + 623 + static void inc_nr_migrations(struct cache *cache) 624 + { 625 + atomic_inc(&cache->nr_migrations); 626 + } 627 + 628 + static void dec_nr_migrations(struct cache *cache) 629 + { 630 + atomic_dec(&cache->nr_migrations); 631 + 632 + /* 633 + * Wake the worker in case we're suspending the target. 634 + */ 635 + wake_up(&cache->migration_wait); 636 + } 637 + 638 + static void __cell_defer(struct cache *cache, struct dm_bio_prison_cell *cell, 639 + bool holder) 640 + { 641 + (holder ? dm_cell_release : dm_cell_release_no_holder) 642 + (cache->prison, cell, &cache->deferred_bios); 643 + free_prison_cell(cache, cell); 644 + } 645 + 646 + static void cell_defer(struct cache *cache, struct dm_bio_prison_cell *cell, 647 + bool holder) 648 + { 649 + unsigned long flags; 650 + 651 + spin_lock_irqsave(&cache->lock, flags); 652 + __cell_defer(cache, cell, holder); 653 + spin_unlock_irqrestore(&cache->lock, flags); 654 + 655 + wake_worker(cache); 656 + } 657 + 658 + static void cleanup_migration(struct dm_cache_migration *mg) 659 + { 660 + dec_nr_migrations(mg->cache); 661 + free_migration(mg); 662 + } 663 + 664 + static void migration_failure(struct dm_cache_migration *mg) 665 + { 666 + struct cache *cache = mg->cache; 667 + 668 + if (mg->writeback) { 669 + DMWARN_LIMIT("writeback failed; couldn't copy block"); 670 + set_dirty(cache, mg->old_oblock, mg->cblock); 671 + cell_defer(cache, mg->old_ocell, false); 672 + 673 + } else if (mg->demote) { 674 + DMWARN_LIMIT("demotion failed; couldn't copy block"); 675 + policy_force_mapping(cache->policy, mg->new_oblock, mg->old_oblock); 676 + 677 + cell_defer(cache, mg->old_ocell, mg->promote ? 
0 : 1); 678 + if (mg->promote) 679 + cell_defer(cache, mg->new_ocell, 1); 680 + } else { 681 + DMWARN_LIMIT("promotion failed; couldn't copy block"); 682 + policy_remove_mapping(cache->policy, mg->new_oblock); 683 + cell_defer(cache, mg->new_ocell, 1); 684 + } 685 + 686 + cleanup_migration(mg); 687 + } 688 + 689 + static void migration_success_pre_commit(struct dm_cache_migration *mg) 690 + { 691 + unsigned long flags; 692 + struct cache *cache = mg->cache; 693 + 694 + if (mg->writeback) { 695 + cell_defer(cache, mg->old_ocell, false); 696 + clear_dirty(cache, mg->old_oblock, mg->cblock); 697 + cleanup_migration(mg); 698 + return; 699 + 700 + } else if (mg->demote) { 701 + if (dm_cache_remove_mapping(cache->cmd, mg->cblock)) { 702 + DMWARN_LIMIT("demotion failed; couldn't update on disk metadata"); 703 + policy_force_mapping(cache->policy, mg->new_oblock, 704 + mg->old_oblock); 705 + if (mg->promote) 706 + cell_defer(cache, mg->new_ocell, true); 707 + cleanup_migration(mg); 708 + return; 709 + } 710 + } else { 711 + if (dm_cache_insert_mapping(cache->cmd, mg->cblock, mg->new_oblock)) { 712 + DMWARN_LIMIT("promotion failed; couldn't update on disk metadata"); 713 + policy_remove_mapping(cache->policy, mg->new_oblock); 714 + cleanup_migration(mg); 715 + return; 716 + } 717 + } 718 + 719 + spin_lock_irqsave(&cache->lock, flags); 720 + list_add_tail(&mg->list, &cache->need_commit_migrations); 721 + cache->commit_requested = true; 722 + spin_unlock_irqrestore(&cache->lock, flags); 723 + } 724 + 725 + static void migration_success_post_commit(struct dm_cache_migration *mg) 726 + { 727 + unsigned long flags; 728 + struct cache *cache = mg->cache; 729 + 730 + if (mg->writeback) { 731 + DMWARN("writeback unexpectedly triggered commit"); 732 + return; 733 + 734 + } else if (mg->demote) { 735 + cell_defer(cache, mg->old_ocell, mg->promote ? 
0 : 1); 736 + 737 + if (mg->promote) { 738 + mg->demote = false; 739 + 740 + spin_lock_irqsave(&cache->lock, flags); 741 + list_add_tail(&mg->list, &cache->quiesced_migrations); 742 + spin_unlock_irqrestore(&cache->lock, flags); 743 + 744 + } else 745 + cleanup_migration(mg); 746 + 747 + } else { 748 + cell_defer(cache, mg->new_ocell, true); 749 + clear_dirty(cache, mg->new_oblock, mg->cblock); 750 + cleanup_migration(mg); 751 + } 752 + } 753 + 754 + static void copy_complete(int read_err, unsigned long write_err, void *context) 755 + { 756 + unsigned long flags; 757 + struct dm_cache_migration *mg = (struct dm_cache_migration *) context; 758 + struct cache *cache = mg->cache; 759 + 760 + if (read_err || write_err) 761 + mg->err = true; 762 + 763 + spin_lock_irqsave(&cache->lock, flags); 764 + list_add_tail(&mg->list, &cache->completed_migrations); 765 + spin_unlock_irqrestore(&cache->lock, flags); 766 + 767 + wake_worker(cache); 768 + } 769 + 770 + static void issue_copy_real(struct dm_cache_migration *mg) 771 + { 772 + int r; 773 + struct dm_io_region o_region, c_region; 774 + struct cache *cache = mg->cache; 775 + 776 + o_region.bdev = cache->origin_dev->bdev; 777 + o_region.count = cache->sectors_per_block; 778 + 779 + c_region.bdev = cache->cache_dev->bdev; 780 + c_region.sector = from_cblock(mg->cblock) * cache->sectors_per_block; 781 + c_region.count = cache->sectors_per_block; 782 + 783 + if (mg->writeback || mg->demote) { 784 + /* demote */ 785 + o_region.sector = from_oblock(mg->old_oblock) * cache->sectors_per_block; 786 + r = dm_kcopyd_copy(cache->copier, &c_region, 1, &o_region, 0, copy_complete, mg); 787 + } else { 788 + /* promote */ 789 + o_region.sector = from_oblock(mg->new_oblock) * cache->sectors_per_block; 790 + r = dm_kcopyd_copy(cache->copier, &o_region, 1, &c_region, 0, copy_complete, mg); 791 + } 792 + 793 + if (r < 0) 794 + migration_failure(mg); 795 + } 796 + 797 + static void avoid_copy(struct dm_cache_migration *mg) 798 + { 799 + 
atomic_inc(&mg->cache->stats.copies_avoided); 800 + migration_success_pre_commit(mg); 801 + } 802 + 803 + static void issue_copy(struct dm_cache_migration *mg) 804 + { 805 + bool avoid; 806 + struct cache *cache = mg->cache; 807 + 808 + if (mg->writeback || mg->demote) 809 + avoid = !is_dirty(cache, mg->cblock) || 810 + is_discarded_oblock(cache, mg->old_oblock); 811 + else 812 + avoid = is_discarded_oblock(cache, mg->new_oblock); 813 + 814 + avoid ? avoid_copy(mg) : issue_copy_real(mg); 815 + } 816 + 817 + static void complete_migration(struct dm_cache_migration *mg) 818 + { 819 + if (mg->err) 820 + migration_failure(mg); 821 + else 822 + migration_success_pre_commit(mg); 823 + } 824 + 825 + static void process_migrations(struct cache *cache, struct list_head *head, 826 + void (*fn)(struct dm_cache_migration *)) 827 + { 828 + unsigned long flags; 829 + struct list_head list; 830 + struct dm_cache_migration *mg, *tmp; 831 + 832 + INIT_LIST_HEAD(&list); 833 + spin_lock_irqsave(&cache->lock, flags); 834 + list_splice_init(head, &list); 835 + spin_unlock_irqrestore(&cache->lock, flags); 836 + 837 + list_for_each_entry_safe(mg, tmp, &list, list) 838 + fn(mg); 839 + } 840 + 841 + static void __queue_quiesced_migration(struct dm_cache_migration *mg) 842 + { 843 + list_add_tail(&mg->list, &mg->cache->quiesced_migrations); 844 + } 845 + 846 + static void queue_quiesced_migration(struct dm_cache_migration *mg) 847 + { 848 + unsigned long flags; 849 + struct cache *cache = mg->cache; 850 + 851 + spin_lock_irqsave(&cache->lock, flags); 852 + __queue_quiesced_migration(mg); 853 + spin_unlock_irqrestore(&cache->lock, flags); 854 + 855 + wake_worker(cache); 856 + } 857 + 858 + static void queue_quiesced_migrations(struct cache *cache, struct list_head *work) 859 + { 860 + unsigned long flags; 861 + struct dm_cache_migration *mg, *tmp; 862 + 863 + spin_lock_irqsave(&cache->lock, flags); 864 + list_for_each_entry_safe(mg, tmp, work, list) 865 + __queue_quiesced_migration(mg); 866 
+ spin_unlock_irqrestore(&cache->lock, flags); 867 + 868 + wake_worker(cache); 869 + } 870 + 871 + static void check_for_quiesced_migrations(struct cache *cache, 872 + struct per_bio_data *pb) 873 + { 874 + struct list_head work; 875 + 876 + if (!pb->all_io_entry) 877 + return; 878 + 879 + INIT_LIST_HEAD(&work); 880 + if (pb->all_io_entry) 881 + dm_deferred_entry_dec(pb->all_io_entry, &work); 882 + 883 + if (!list_empty(&work)) 884 + queue_quiesced_migrations(cache, &work); 885 + } 886 + 887 + static void quiesce_migration(struct dm_cache_migration *mg) 888 + { 889 + if (!dm_deferred_set_add_work(mg->cache->all_io_ds, &mg->list)) 890 + queue_quiesced_migration(mg); 891 + } 892 + 893 + static void promote(struct cache *cache, struct prealloc *structs, 894 + dm_oblock_t oblock, dm_cblock_t cblock, 895 + struct dm_bio_prison_cell *cell) 896 + { 897 + struct dm_cache_migration *mg = prealloc_get_migration(structs); 898 + 899 + mg->err = false; 900 + mg->writeback = false; 901 + mg->demote = false; 902 + mg->promote = true; 903 + mg->cache = cache; 904 + mg->new_oblock = oblock; 905 + mg->cblock = cblock; 906 + mg->old_ocell = NULL; 907 + mg->new_ocell = cell; 908 + mg->start_jiffies = jiffies; 909 + 910 + inc_nr_migrations(cache); 911 + quiesce_migration(mg); 912 + } 913 + 914 + static void writeback(struct cache *cache, struct prealloc *structs, 915 + dm_oblock_t oblock, dm_cblock_t cblock, 916 + struct dm_bio_prison_cell *cell) 917 + { 918 + struct dm_cache_migration *mg = prealloc_get_migration(structs); 919 + 920 + mg->err = false; 921 + mg->writeback = true; 922 + mg->demote = false; 923 + mg->promote = false; 924 + mg->cache = cache; 925 + mg->old_oblock = oblock; 926 + mg->cblock = cblock; 927 + mg->old_ocell = cell; 928 + mg->new_ocell = NULL; 929 + mg->start_jiffies = jiffies; 930 + 931 + inc_nr_migrations(cache); 932 + quiesce_migration(mg); 933 + } 934 + 935 + static void demote_then_promote(struct cache *cache, struct prealloc *structs, 936 + dm_oblock_t 
old_oblock, dm_oblock_t new_oblock,
937 + dm_cblock_t cblock,
938 + struct dm_bio_prison_cell *old_ocell,
939 + struct dm_bio_prison_cell *new_ocell)
940 + {
941 + struct dm_cache_migration *mg = prealloc_get_migration(structs);
942 +
943 + mg->err = false;
944 + mg->writeback = false;
945 + mg->demote = true;
946 + mg->promote = true;
947 + mg->cache = cache;
948 + mg->old_oblock = old_oblock;
949 + mg->new_oblock = new_oblock;
950 + mg->cblock = cblock;
951 + mg->old_ocell = old_ocell;
952 + mg->new_ocell = new_ocell;
953 + mg->start_jiffies = jiffies;
954 +
955 + inc_nr_migrations(cache);
956 + quiesce_migration(mg);
957 + }
958 +
959 + /*----------------------------------------------------------------
960 + * bio processing
961 + *--------------------------------------------------------------*/
962 + static void defer_bio(struct cache *cache, struct bio *bio)
963 + {
964 + unsigned long flags;
965 +
966 + spin_lock_irqsave(&cache->lock, flags);
967 + bio_list_add(&cache->deferred_bios, bio);
968 + spin_unlock_irqrestore(&cache->lock, flags);
969 +
970 + wake_worker(cache);
971 + }
972 +
973 + static void process_flush_bio(struct cache *cache, struct bio *bio)
974 + {
975 + struct per_bio_data *pb = get_per_bio_data(bio);
976 +
977 + BUG_ON(bio->bi_size);
978 + if (!pb->req_nr)
979 + remap_to_origin(cache, bio);
980 + else
981 + remap_to_cache(cache, bio, 0);
982 +
983 + issue(cache, bio);
984 + }
985 +
986 + /*
987 + * People generally discard large parts of a device, e.g., the whole device
988 + * when formatting. Splitting these large discards up into cache block
989 + * sized ios and then quiescing (always necessary for discard) takes too
990 + * long.
991 + *
992 + * We keep it simple, and allow any size of discard to come in, and just
993 + * mark off blocks on the discard bitset. No passdown occurs!
994 + *
995 + * To implement passdown we need to change the bio_prison such that a cell
996 + * can have a key that spans many blocks. 
997 + */ 998 + static void process_discard_bio(struct cache *cache, struct bio *bio) 999 + { 1000 + dm_block_t start_block = dm_sector_div_up(bio->bi_sector, 1001 + cache->discard_block_size); 1002 + dm_block_t end_block = bio->bi_sector + bio_sectors(bio); 1003 + dm_block_t b; 1004 + 1005 + (void) sector_div(end_block, cache->discard_block_size); 1006 + 1007 + for (b = start_block; b < end_block; b++) 1008 + set_discard(cache, to_dblock(b)); 1009 + 1010 + bio_endio(bio, 0); 1011 + } 1012 + 1013 + static bool spare_migration_bandwidth(struct cache *cache) 1014 + { 1015 + sector_t current_volume = (atomic_read(&cache->nr_migrations) + 1) * 1016 + cache->sectors_per_block; 1017 + return current_volume < cache->migration_threshold; 1018 + } 1019 + 1020 + static bool is_writethrough_io(struct cache *cache, struct bio *bio, 1021 + dm_cblock_t cblock) 1022 + { 1023 + return bio_data_dir(bio) == WRITE && 1024 + cache->features.write_through && !is_dirty(cache, cblock); 1025 + } 1026 + 1027 + static void inc_hit_counter(struct cache *cache, struct bio *bio) 1028 + { 1029 + atomic_inc(bio_data_dir(bio) == READ ? 1030 + &cache->stats.read_hit : &cache->stats.write_hit); 1031 + } 1032 + 1033 + static void inc_miss_counter(struct cache *cache, struct bio *bio) 1034 + { 1035 + atomic_inc(bio_data_dir(bio) == READ ? 
1036 + &cache->stats.read_miss : &cache->stats.write_miss); 1037 + } 1038 + 1039 + static void process_bio(struct cache *cache, struct prealloc *structs, 1040 + struct bio *bio) 1041 + { 1042 + int r; 1043 + bool release_cell = true; 1044 + dm_oblock_t block = get_bio_block(cache, bio); 1045 + struct dm_bio_prison_cell *cell_prealloc, *old_ocell, *new_ocell; 1046 + struct policy_result lookup_result; 1047 + struct per_bio_data *pb = get_per_bio_data(bio); 1048 + bool discarded_block = is_discarded_oblock(cache, block); 1049 + bool can_migrate = discarded_block || spare_migration_bandwidth(cache); 1050 + 1051 + /* 1052 + * Check to see if that block is currently migrating. 1053 + */ 1054 + cell_prealloc = prealloc_get_cell(structs); 1055 + r = bio_detain(cache, block, bio, cell_prealloc, 1056 + (cell_free_fn) prealloc_put_cell, 1057 + structs, &new_ocell); 1058 + if (r > 0) 1059 + return; 1060 + 1061 + r = policy_map(cache->policy, block, true, can_migrate, discarded_block, 1062 + bio, &lookup_result); 1063 + 1064 + if (r == -EWOULDBLOCK) 1065 + /* migration has been denied */ 1066 + lookup_result.op = POLICY_MISS; 1067 + 1068 + switch (lookup_result.op) { 1069 + case POLICY_HIT: 1070 + inc_hit_counter(cache, bio); 1071 + pb->all_io_entry = dm_deferred_entry_inc(cache->all_io_ds); 1072 + 1073 + if (is_writethrough_io(cache, bio, lookup_result.cblock)) { 1074 + /* 1075 + * No need to mark anything dirty in write through mode. 1076 + */ 1077 + pb->req_nr == 0 ? 
1078 + remap_to_cache(cache, bio, lookup_result.cblock) : 1079 + remap_to_origin_clear_discard(cache, bio, block); 1080 + } else 1081 + remap_to_cache_dirty(cache, bio, block, lookup_result.cblock); 1082 + 1083 + issue(cache, bio); 1084 + break; 1085 + 1086 + case POLICY_MISS: 1087 + inc_miss_counter(cache, bio); 1088 + pb->all_io_entry = dm_deferred_entry_inc(cache->all_io_ds); 1089 + 1090 + if (pb->req_nr != 0) { 1091 + /* 1092 + * This is a duplicate writethrough io that is no 1093 + * longer needed because the block has been demoted. 1094 + */ 1095 + bio_endio(bio, 0); 1096 + } else { 1097 + remap_to_origin_clear_discard(cache, bio, block); 1098 + issue(cache, bio); 1099 + } 1100 + break; 1101 + 1102 + case POLICY_NEW: 1103 + atomic_inc(&cache->stats.promotion); 1104 + promote(cache, structs, block, lookup_result.cblock, new_ocell); 1105 + release_cell = false; 1106 + break; 1107 + 1108 + case POLICY_REPLACE: 1109 + cell_prealloc = prealloc_get_cell(structs); 1110 + r = bio_detain(cache, lookup_result.old_oblock, bio, cell_prealloc, 1111 + (cell_free_fn) prealloc_put_cell, 1112 + structs, &old_ocell); 1113 + if (r > 0) { 1114 + /* 1115 + * We have to be careful to avoid lock inversion of 1116 + * the cells. So we back off, and wait for the 1117 + * old_ocell to become free. 
1118 + */ 1119 + policy_force_mapping(cache->policy, block, 1120 + lookup_result.old_oblock); 1121 + atomic_inc(&cache->stats.cache_cell_clash); 1122 + break; 1123 + } 1124 + atomic_inc(&cache->stats.demotion); 1125 + atomic_inc(&cache->stats.promotion); 1126 + 1127 + demote_then_promote(cache, structs, lookup_result.old_oblock, 1128 + block, lookup_result.cblock, 1129 + old_ocell, new_ocell); 1130 + release_cell = false; 1131 + break; 1132 + 1133 + default: 1134 + DMERR_LIMIT("%s: erroring bio, unknown policy op: %u", __func__, 1135 + (unsigned) lookup_result.op); 1136 + bio_io_error(bio); 1137 + } 1138 + 1139 + if (release_cell) 1140 + cell_defer(cache, new_ocell, false); 1141 + } 1142 + 1143 + static int need_commit_due_to_time(struct cache *cache) 1144 + { 1145 + return jiffies < cache->last_commit_jiffies || 1146 + jiffies > cache->last_commit_jiffies + COMMIT_PERIOD; 1147 + } 1148 + 1149 + static int commit_if_needed(struct cache *cache) 1150 + { 1151 + if (dm_cache_changed_this_transaction(cache->cmd) && 1152 + (cache->commit_requested || need_commit_due_to_time(cache))) { 1153 + atomic_inc(&cache->stats.commit_count); 1154 + cache->last_commit_jiffies = jiffies; 1155 + cache->commit_requested = false; 1156 + return dm_cache_commit(cache->cmd, false); 1157 + } 1158 + 1159 + return 0; 1160 + } 1161 + 1162 + static void process_deferred_bios(struct cache *cache) 1163 + { 1164 + unsigned long flags; 1165 + struct bio_list bios; 1166 + struct bio *bio; 1167 + struct prealloc structs; 1168 + 1169 + memset(&structs, 0, sizeof(structs)); 1170 + bio_list_init(&bios); 1171 + 1172 + spin_lock_irqsave(&cache->lock, flags); 1173 + bio_list_merge(&bios, &cache->deferred_bios); 1174 + bio_list_init(&cache->deferred_bios); 1175 + spin_unlock_irqrestore(&cache->lock, flags); 1176 + 1177 + while (!bio_list_empty(&bios)) { 1178 + /* 1179 + * If we've got no free migration structs, and processing 1180 + * this bio might require one, we pause until there are some 1181 + * 
prepared mappings to process. 1182 + */ 1183 + if (prealloc_data_structs(cache, &structs)) { 1184 + spin_lock_irqsave(&cache->lock, flags); 1185 + bio_list_merge(&cache->deferred_bios, &bios); 1186 + spin_unlock_irqrestore(&cache->lock, flags); 1187 + break; 1188 + } 1189 + 1190 + bio = bio_list_pop(&bios); 1191 + 1192 + if (bio->bi_rw & REQ_FLUSH) 1193 + process_flush_bio(cache, bio); 1194 + else if (bio->bi_rw & REQ_DISCARD) 1195 + process_discard_bio(cache, bio); 1196 + else 1197 + process_bio(cache, &structs, bio); 1198 + } 1199 + 1200 + prealloc_free_structs(cache, &structs); 1201 + } 1202 + 1203 + static void process_deferred_flush_bios(struct cache *cache, bool submit_bios) 1204 + { 1205 + unsigned long flags; 1206 + struct bio_list bios; 1207 + struct bio *bio; 1208 + 1209 + bio_list_init(&bios); 1210 + 1211 + spin_lock_irqsave(&cache->lock, flags); 1212 + bio_list_merge(&bios, &cache->deferred_flush_bios); 1213 + bio_list_init(&cache->deferred_flush_bios); 1214 + spin_unlock_irqrestore(&cache->lock, flags); 1215 + 1216 + while ((bio = bio_list_pop(&bios))) 1217 + submit_bios ? 
generic_make_request(bio) : bio_io_error(bio); 1218 + } 1219 + 1220 + static void writeback_some_dirty_blocks(struct cache *cache) 1221 + { 1222 + int r = 0; 1223 + dm_oblock_t oblock; 1224 + dm_cblock_t cblock; 1225 + struct prealloc structs; 1226 + struct dm_bio_prison_cell *old_ocell; 1227 + 1228 + memset(&structs, 0, sizeof(structs)); 1229 + 1230 + while (spare_migration_bandwidth(cache)) { 1231 + if (prealloc_data_structs(cache, &structs)) 1232 + break; 1233 + 1234 + r = policy_writeback_work(cache->policy, &oblock, &cblock); 1235 + if (r) 1236 + break; 1237 + 1238 + r = get_cell(cache, oblock, &structs, &old_ocell); 1239 + if (r) { 1240 + policy_set_dirty(cache->policy, oblock); 1241 + break; 1242 + } 1243 + 1244 + writeback(cache, &structs, oblock, cblock, old_ocell); 1245 + } 1246 + 1247 + prealloc_free_structs(cache, &structs); 1248 + } 1249 + 1250 + /*---------------------------------------------------------------- 1251 + * Main worker loop 1252 + *--------------------------------------------------------------*/ 1253 + static void start_quiescing(struct cache *cache) 1254 + { 1255 + unsigned long flags; 1256 + 1257 + spin_lock_irqsave(&cache->lock, flags); 1258 + cache->quiescing = 1; 1259 + spin_unlock_irqrestore(&cache->lock, flags); 1260 + } 1261 + 1262 + static void stop_quiescing(struct cache *cache) 1263 + { 1264 + unsigned long flags; 1265 + 1266 + spin_lock_irqsave(&cache->lock, flags); 1267 + cache->quiescing = 0; 1268 + spin_unlock_irqrestore(&cache->lock, flags); 1269 + } 1270 + 1271 + static bool is_quiescing(struct cache *cache) 1272 + { 1273 + int r; 1274 + unsigned long flags; 1275 + 1276 + spin_lock_irqsave(&cache->lock, flags); 1277 + r = cache->quiescing; 1278 + spin_unlock_irqrestore(&cache->lock, flags); 1279 + 1280 + return r; 1281 + } 1282 + 1283 + static void wait_for_migrations(struct cache *cache) 1284 + { 1285 + wait_event(cache->migration_wait, !atomic_read(&cache->nr_migrations)); 1286 + } 1287 + 1288 + static void 
stop_worker(struct cache *cache) 1289 + { 1290 + cancel_delayed_work(&cache->waker); 1291 + flush_workqueue(cache->wq); 1292 + } 1293 + 1294 + static void requeue_deferred_io(struct cache *cache) 1295 + { 1296 + struct bio *bio; 1297 + struct bio_list bios; 1298 + 1299 + bio_list_init(&bios); 1300 + bio_list_merge(&bios, &cache->deferred_bios); 1301 + bio_list_init(&cache->deferred_bios); 1302 + 1303 + while ((bio = bio_list_pop(&bios))) 1304 + bio_endio(bio, DM_ENDIO_REQUEUE); 1305 + } 1306 + 1307 + static int more_work(struct cache *cache) 1308 + { 1309 + if (is_quiescing(cache)) 1310 + return !list_empty(&cache->quiesced_migrations) || 1311 + !list_empty(&cache->completed_migrations) || 1312 + !list_empty(&cache->need_commit_migrations); 1313 + else 1314 + return !bio_list_empty(&cache->deferred_bios) || 1315 + !bio_list_empty(&cache->deferred_flush_bios) || 1316 + !list_empty(&cache->quiesced_migrations) || 1317 + !list_empty(&cache->completed_migrations) || 1318 + !list_empty(&cache->need_commit_migrations); 1319 + } 1320 + 1321 + static void do_worker(struct work_struct *ws) 1322 + { 1323 + struct cache *cache = container_of(ws, struct cache, worker); 1324 + 1325 + do { 1326 + if (!is_quiescing(cache)) 1327 + process_deferred_bios(cache); 1328 + 1329 + process_migrations(cache, &cache->quiesced_migrations, issue_copy); 1330 + process_migrations(cache, &cache->completed_migrations, complete_migration); 1331 + 1332 + writeback_some_dirty_blocks(cache); 1333 + 1334 + if (commit_if_needed(cache)) { 1335 + process_deferred_flush_bios(cache, false); 1336 + 1337 + /* 1338 + * FIXME: rollback metadata or just go into a 1339 + * failure mode and error everything 1340 + */ 1341 + } else { 1342 + process_deferred_flush_bios(cache, true); 1343 + process_migrations(cache, &cache->need_commit_migrations, 1344 + migration_success_post_commit); 1345 + } 1346 + } while (more_work(cache)); 1347 + } 1348 + 1349 + /* 1350 + * We want to commit periodically so that not too much 
1351 + * unwritten metadata builds up. 1352 + */ 1353 + static void do_waker(struct work_struct *ws) 1354 + { 1355 + struct cache *cache = container_of(to_delayed_work(ws), struct cache, waker); 1356 + wake_worker(cache); 1357 + queue_delayed_work(cache->wq, &cache->waker, COMMIT_PERIOD); 1358 + } 1359 + 1360 + /*----------------------------------------------------------------*/ 1361 + 1362 + static int is_congested(struct dm_dev *dev, int bdi_bits) 1363 + { 1364 + struct request_queue *q = bdev_get_queue(dev->bdev); 1365 + return bdi_congested(&q->backing_dev_info, bdi_bits); 1366 + } 1367 + 1368 + static int cache_is_congested(struct dm_target_callbacks *cb, int bdi_bits) 1369 + { 1370 + struct cache *cache = container_of(cb, struct cache, callbacks); 1371 + 1372 + return is_congested(cache->origin_dev, bdi_bits) || 1373 + is_congested(cache->cache_dev, bdi_bits); 1374 + } 1375 + 1376 + /*---------------------------------------------------------------- 1377 + * Target methods 1378 + *--------------------------------------------------------------*/ 1379 + 1380 + /* 1381 + * This function gets called on the error paths of the constructor, so we 1382 + * have to cope with a partially initialised struct. 
1383 + */ 1384 + static void destroy(struct cache *cache) 1385 + { 1386 + unsigned i; 1387 + 1388 + if (cache->next_migration) 1389 + mempool_free(cache->next_migration, cache->migration_pool); 1390 + 1391 + if (cache->migration_pool) 1392 + mempool_destroy(cache->migration_pool); 1393 + 1394 + if (cache->all_io_ds) 1395 + dm_deferred_set_destroy(cache->all_io_ds); 1396 + 1397 + if (cache->prison) 1398 + dm_bio_prison_destroy(cache->prison); 1399 + 1400 + if (cache->wq) 1401 + destroy_workqueue(cache->wq); 1402 + 1403 + if (cache->dirty_bitset) 1404 + free_bitset(cache->dirty_bitset); 1405 + 1406 + if (cache->discard_bitset) 1407 + free_bitset(cache->discard_bitset); 1408 + 1409 + if (cache->copier) 1410 + dm_kcopyd_client_destroy(cache->copier); 1411 + 1412 + if (cache->cmd) 1413 + dm_cache_metadata_close(cache->cmd); 1414 + 1415 + if (cache->metadata_dev) 1416 + dm_put_device(cache->ti, cache->metadata_dev); 1417 + 1418 + if (cache->origin_dev) 1419 + dm_put_device(cache->ti, cache->origin_dev); 1420 + 1421 + if (cache->cache_dev) 1422 + dm_put_device(cache->ti, cache->cache_dev); 1423 + 1424 + if (cache->policy) 1425 + dm_cache_policy_destroy(cache->policy); 1426 + 1427 + for (i = 0; i < cache->nr_ctr_args ; i++) 1428 + kfree(cache->ctr_args[i]); 1429 + kfree(cache->ctr_args); 1430 + 1431 + kfree(cache); 1432 + } 1433 + 1434 + static void cache_dtr(struct dm_target *ti) 1435 + { 1436 + struct cache *cache = ti->private; 1437 + 1438 + destroy(cache); 1439 + } 1440 + 1441 + static sector_t get_dev_size(struct dm_dev *dev) 1442 + { 1443 + return i_size_read(dev->bdev->bd_inode) >> SECTOR_SHIFT; 1444 + } 1445 + 1446 + /*----------------------------------------------------------------*/ 1447 + 1448 + /* 1449 + * Construct a cache device mapping. 
1450 + * 1451 + * cache <metadata dev> <cache dev> <origin dev> <block size> 1452 + * <#feature args> [<feature arg>]* 1453 + * <policy> <#policy args> [<policy arg>]* 1454 + * 1455 + * metadata dev : fast device holding the persistent metadata 1456 + * cache dev : fast device holding cached data blocks 1457 + * origin dev : slow device holding original data blocks 1458 + * block size : cache unit size in sectors 1459 + * 1460 + * #feature args : number of feature arguments passed 1461 + * feature args : writethrough. (The default is writeback.) 1462 + * 1463 + * policy : the replacement policy to use 1464 + * #policy args : an even number of policy arguments corresponding 1465 + * to key/value pairs passed to the policy 1466 + * policy args : key/value pairs passed to the policy 1467 + * E.g. 'sequential_threshold 1024' 1468 + * See cache-policies.txt for details. 1469 + * 1470 + * Optional feature arguments are: 1471 + * writethrough : write through caching that prohibits cache block 1472 + * content from being different from origin block content. 1473 + * Without this argument, the default behaviour is to write 1474 + * back cache block contents later for performance reasons, 1475 + * so they may differ from the corresponding origin blocks. 
1476 + */
1477 + struct cache_args {
1478 + struct dm_target *ti;
1479 +
1480 + struct dm_dev *metadata_dev;
1481 +
1482 + struct dm_dev *cache_dev;
1483 + sector_t cache_sectors;
1484 +
1485 + struct dm_dev *origin_dev;
1486 + sector_t origin_sectors;
1487 +
1488 + uint32_t block_size;
1489 +
1490 + const char *policy_name;
1491 + int policy_argc;
1492 + const char **policy_argv;
1493 +
1494 + struct cache_features features;
1495 + };
1496 +
1497 + static void destroy_cache_args(struct cache_args *ca)
1498 + {
1499 + if (ca->metadata_dev)
1500 + dm_put_device(ca->ti, ca->metadata_dev);
1501 +
1502 + if (ca->cache_dev)
1503 + dm_put_device(ca->ti, ca->cache_dev);
1504 +
1505 + if (ca->origin_dev)
1506 + dm_put_device(ca->ti, ca->origin_dev);
1507 +
1508 + kfree(ca);
1509 + }
1510 +
1511 + static bool at_least_one_arg(struct dm_arg_set *as, char **error)
1512 + {
1513 + if (!as->argc) {
1514 + *error = "Insufficient args";
1515 + return false;
1516 + }
1517 +
1518 + return true;
1519 + }
1520 +
1521 + static int parse_metadata_dev(struct cache_args *ca, struct dm_arg_set *as,
1522 + char **error)
1523 + {
1524 + int r;
1525 + sector_t metadata_dev_size;
1526 + char b[BDEVNAME_SIZE];
1527 +
1528 + if (!at_least_one_arg(as, error))
1529 + return -EINVAL;
1530 +
1531 + r = dm_get_device(ca->ti, dm_shift_arg(as), FMODE_READ | FMODE_WRITE,
1532 + &ca->metadata_dev);
1533 + if (r) {
1534 + *error = "Error opening metadata device";
1535 + return r;
1536 + }
1537 +
1538 + metadata_dev_size = get_dev_size(ca->metadata_dev);
1539 + if (metadata_dev_size > DM_CACHE_METADATA_MAX_SECTORS_WARNING)
1540 + DMWARN("Metadata device %s is larger than %u sectors: excess space will not be used.",
1541 + bdevname(ca->metadata_dev->bdev, b), DM_CACHE_METADATA_MAX_SECTORS_WARNING);
1542 +
1543 + return 0;
1544 + }
1545 +
1546 + static int parse_cache_dev(struct cache_args *ca, struct dm_arg_set *as,
1547 + char **error)
1548 + {
1549 + int r;
1550 +
1551 + if (!at_least_one_arg(as, error))
1552 +
return -EINVAL; 1553 + 1554 + r = dm_get_device(ca->ti, dm_shift_arg(as), FMODE_READ | FMODE_WRITE, 1555 + &ca->cache_dev); 1556 + if (r) { 1557 + *error = "Error opening cache device"; 1558 + return r; 1559 + } 1560 + ca->cache_sectors = get_dev_size(ca->cache_dev); 1561 + 1562 + return 0; 1563 + } 1564 + 1565 + static int parse_origin_dev(struct cache_args *ca, struct dm_arg_set *as, 1566 + char **error) 1567 + { 1568 + int r; 1569 + 1570 + if (!at_least_one_arg(as, error)) 1571 + return -EINVAL; 1572 + 1573 + r = dm_get_device(ca->ti, dm_shift_arg(as), FMODE_READ | FMODE_WRITE, 1574 + &ca->origin_dev); 1575 + if (r) { 1576 + *error = "Error opening origin device"; 1577 + return r; 1578 + } 1579 + 1580 + ca->origin_sectors = get_dev_size(ca->origin_dev); 1581 + if (ca->ti->len > ca->origin_sectors) { 1582 + *error = "Device size larger than cached device"; 1583 + return -EINVAL; 1584 + } 1585 + 1586 + return 0; 1587 + } 1588 + 1589 + static int parse_block_size(struct cache_args *ca, struct dm_arg_set *as, 1590 + char **error) 1591 + { 1592 + unsigned long tmp; 1593 + 1594 + if (!at_least_one_arg(as, error)) 1595 + return -EINVAL; 1596 + 1597 + if (kstrtoul(dm_shift_arg(as), 10, &tmp) || !tmp || 1598 + tmp < DATA_DEV_BLOCK_SIZE_MIN_SECTORS || 1599 + tmp & (DATA_DEV_BLOCK_SIZE_MIN_SECTORS - 1)) { 1600 + *error = "Invalid data block size"; 1601 + return -EINVAL; 1602 + } 1603 + 1604 + if (tmp > ca->cache_sectors) { 1605 + *error = "Data block size is larger than the cache device"; 1606 + return -EINVAL; 1607 + } 1608 + 1609 + ca->block_size = tmp; 1610 + 1611 + return 0; 1612 + } 1613 + 1614 + static void init_features(struct cache_features *cf) 1615 + { 1616 + cf->mode = CM_WRITE; 1617 + cf->write_through = false; 1618 + } 1619 + 1620 + static int parse_features(struct cache_args *ca, struct dm_arg_set *as, 1621 + char **error) 1622 + { 1623 + static struct dm_arg _args[] = { 1624 + {0, 1, "Invalid number of cache feature arguments"}, 1625 + }; 1626 + 1627 + int 
r; 1628 + unsigned argc; 1629 + const char *arg; 1630 + struct cache_features *cf = &ca->features; 1631 + 1632 + init_features(cf); 1633 + 1634 + r = dm_read_arg_group(_args, as, &argc, error); 1635 + if (r) 1636 + return -EINVAL; 1637 + 1638 + while (argc--) { 1639 + arg = dm_shift_arg(as); 1640 + 1641 + if (!strcasecmp(arg, "writeback")) 1642 + cf->write_through = false; 1643 + 1644 + else if (!strcasecmp(arg, "writethrough")) 1645 + cf->write_through = true; 1646 + 1647 + else { 1648 + *error = "Unrecognised cache feature requested"; 1649 + return -EINVAL; 1650 + } 1651 + } 1652 + 1653 + return 0; 1654 + } 1655 + 1656 + static int parse_policy(struct cache_args *ca, struct dm_arg_set *as, 1657 + char **error) 1658 + { 1659 + static struct dm_arg _args[] = { 1660 + {0, 1024, "Invalid number of policy arguments"}, 1661 + }; 1662 + 1663 + int r; 1664 + 1665 + if (!at_least_one_arg(as, error)) 1666 + return -EINVAL; 1667 + 1668 + ca->policy_name = dm_shift_arg(as); 1669 + 1670 + r = dm_read_arg_group(_args, as, &ca->policy_argc, error); 1671 + if (r) 1672 + return -EINVAL; 1673 + 1674 + ca->policy_argv = (const char **)as->argv; 1675 + dm_consume_args(as, ca->policy_argc); 1676 + 1677 + return 0; 1678 + } 1679 + 1680 + static int parse_cache_args(struct cache_args *ca, int argc, char **argv, 1681 + char **error) 1682 + { 1683 + int r; 1684 + struct dm_arg_set as; 1685 + 1686 + as.argc = argc; 1687 + as.argv = argv; 1688 + 1689 + r = parse_metadata_dev(ca, &as, error); 1690 + if (r) 1691 + return r; 1692 + 1693 + r = parse_cache_dev(ca, &as, error); 1694 + if (r) 1695 + return r; 1696 + 1697 + r = parse_origin_dev(ca, &as, error); 1698 + if (r) 1699 + return r; 1700 + 1701 + r = parse_block_size(ca, &as, error); 1702 + if (r) 1703 + return r; 1704 + 1705 + r = parse_features(ca, &as, error); 1706 + if (r) 1707 + return r; 1708 + 1709 + r = parse_policy(ca, &as, error); 1710 + if (r) 1711 + return r; 1712 + 1713 + return 0; 1714 + } 1715 + 1716 + 
/*----------------------------------------------------------------*/ 1717 + 1718 + static struct kmem_cache *migration_cache; 1719 + 1720 + static int set_config_values(struct dm_cache_policy *p, int argc, const char **argv) 1721 + { 1722 + int r = 0; 1723 + 1724 + if (argc & 1) { 1725 + DMWARN("Odd number of policy arguments given but they should be <key> <value> pairs."); 1726 + return -EINVAL; 1727 + } 1728 + 1729 + while (argc) { 1730 + r = policy_set_config_value(p, argv[0], argv[1]); 1731 + if (r) { 1732 + DMWARN("policy_set_config_value failed: key = '%s', value = '%s'", 1733 + argv[0], argv[1]); 1734 + return r; 1735 + } 1736 + 1737 + argc -= 2; 1738 + argv += 2; 1739 + } 1740 + 1741 + return r; 1742 + } 1743 + 1744 + static int create_cache_policy(struct cache *cache, struct cache_args *ca, 1745 + char **error) 1746 + { 1747 + int r; 1748 + 1749 + cache->policy = dm_cache_policy_create(ca->policy_name, 1750 + cache->cache_size, 1751 + cache->origin_sectors, 1752 + cache->sectors_per_block); 1753 + if (!cache->policy) { 1754 + *error = "Error creating cache's policy"; 1755 + return -ENOMEM; 1756 + } 1757 + 1758 + r = set_config_values(cache->policy, ca->policy_argc, ca->policy_argv); 1759 + if (r) 1760 + dm_cache_policy_destroy(cache->policy); 1761 + 1762 + return r; 1763 + } 1764 + 1765 + /* 1766 + * We want the discard block size to be a power of two, at least the size 1767 + * of the cache block size, and have no more than 2^14 discard blocks 1768 + * across the origin. 
1769 + */
1770 + #define MAX_DISCARD_BLOCKS (1 << 14)
1771 +
1772 + static bool too_many_discard_blocks(sector_t discard_block_size,
1773 + sector_t origin_size)
1774 + {
1775 + (void) sector_div(origin_size, discard_block_size);
1776 +
1777 + return origin_size > MAX_DISCARD_BLOCKS;
1778 + }
1779 +
1780 + static sector_t calculate_discard_block_size(sector_t cache_block_size,
1781 + sector_t origin_size)
1782 + {
1783 + sector_t discard_block_size;
1784 +
1785 + discard_block_size = roundup_pow_of_two(cache_block_size);
1786 +
1787 + if (origin_size)
1788 + while (too_many_discard_blocks(discard_block_size, origin_size))
1789 + discard_block_size *= 2;
1790 +
1791 + return discard_block_size;
1792 + }
1793 +
1794 + #define DEFAULT_MIGRATION_THRESHOLD (2048 * 100)
1795 +
1796 + static unsigned cache_num_write_bios(struct dm_target *ti, struct bio *bio);
1797 +
1798 + static int cache_create(struct cache_args *ca, struct cache **result)
1799 + {
1800 + int r = -ENOMEM; /* several allocation failures below goto bad without setting r */
1801 + char **error = &ca->ti->error;
1802 + struct cache *cache;
1803 + struct dm_target *ti = ca->ti;
1804 + dm_block_t origin_blocks;
1805 + struct dm_cache_metadata *cmd;
1806 + bool may_format = ca->features.mode == CM_WRITE;
1807 +
1808 + cache = kzalloc(sizeof(*cache), GFP_KERNEL);
1809 + if (!cache)
1810 + return -ENOMEM;
1811 +
1812 + cache->ti = ca->ti;
1813 + ti->private = cache;
1814 + ti->per_bio_data_size = sizeof(struct per_bio_data);
1815 + ti->num_flush_bios = 2;
1816 + ti->flush_supported = true;
1817 +
1818 + ti->num_discard_bios = 1;
1819 + ti->discards_supported = true;
1820 + ti->discard_zeroes_data_unsupported = true;
1821 +
1822 + memcpy(&cache->features, &ca->features, sizeof(cache->features));
1823 +
1824 + if (cache->features.write_through)
1825 + ti->num_write_bios = cache_num_write_bios;
1826 +
1827 + cache->callbacks.congested_fn = cache_is_congested;
1828 + dm_table_add_target_callbacks(ti->table, &cache->callbacks);
1829 +
1830 + cache->metadata_dev = ca->metadata_dev;
1831 +
cache->origin_dev = ca->origin_dev; 1832 + cache->cache_dev = ca->cache_dev; 1833 + 1834 + ca->metadata_dev = ca->origin_dev = ca->cache_dev = NULL; 1835 + 1836 + /* FIXME: factor out this whole section */ 1837 + origin_blocks = cache->origin_sectors = ca->origin_sectors; 1838 + (void) sector_div(origin_blocks, ca->block_size); 1839 + cache->origin_blocks = to_oblock(origin_blocks); 1840 + 1841 + cache->sectors_per_block = ca->block_size; 1842 + if (dm_set_target_max_io_len(ti, cache->sectors_per_block)) { 1843 + r = -EINVAL; 1844 + goto bad; 1845 + } 1846 + 1847 + if (ca->block_size & (ca->block_size - 1)) { 1848 + dm_block_t cache_size = ca->cache_sectors; 1849 + 1850 + cache->sectors_per_block_shift = -1; 1851 + (void) sector_div(cache_size, ca->block_size); 1852 + cache->cache_size = to_cblock(cache_size); 1853 + } else { 1854 + cache->sectors_per_block_shift = __ffs(ca->block_size); 1855 + cache->cache_size = to_cblock(ca->cache_sectors >> cache->sectors_per_block_shift); 1856 + } 1857 + 1858 + r = create_cache_policy(cache, ca, error); 1859 + if (r) 1860 + goto bad; 1861 + cache->policy_nr_args = ca->policy_argc; 1862 + 1863 + cmd = dm_cache_metadata_open(cache->metadata_dev->bdev, 1864 + ca->block_size, may_format, 1865 + dm_cache_policy_get_hint_size(cache->policy)); 1866 + if (IS_ERR(cmd)) { 1867 + *error = "Error creating metadata object"; 1868 + r = PTR_ERR(cmd); 1869 + goto bad; 1870 + } 1871 + cache->cmd = cmd; 1872 + 1873 + spin_lock_init(&cache->lock); 1874 + bio_list_init(&cache->deferred_bios); 1875 + bio_list_init(&cache->deferred_flush_bios); 1876 + INIT_LIST_HEAD(&cache->quiesced_migrations); 1877 + INIT_LIST_HEAD(&cache->completed_migrations); 1878 + INIT_LIST_HEAD(&cache->need_commit_migrations); 1879 + cache->migration_threshold = DEFAULT_MIGRATION_THRESHOLD; 1880 + atomic_set(&cache->nr_migrations, 0); 1881 + init_waitqueue_head(&cache->migration_wait); 1882 + 1883 + cache->nr_dirty = 0; 1884 + cache->dirty_bitset = 
alloc_bitset(from_cblock(cache->cache_size)); 1885 + if (!cache->dirty_bitset) { 1886 + *error = "could not allocate dirty bitset"; 1887 + goto bad; 1888 + } 1889 + clear_bitset(cache->dirty_bitset, from_cblock(cache->cache_size)); 1890 + 1891 + cache->discard_block_size = 1892 + calculate_discard_block_size(cache->sectors_per_block, 1893 + cache->origin_sectors); 1894 + cache->discard_nr_blocks = oblock_to_dblock(cache, cache->origin_blocks); 1895 + cache->discard_bitset = alloc_bitset(from_dblock(cache->discard_nr_blocks)); 1896 + if (!cache->discard_bitset) { 1897 + *error = "could not allocate discard bitset"; 1898 + goto bad; 1899 + } 1900 + clear_bitset(cache->discard_bitset, from_dblock(cache->discard_nr_blocks)); 1901 + 1902 + cache->copier = dm_kcopyd_client_create(&dm_kcopyd_throttle); 1903 + if (IS_ERR(cache->copier)) { 1904 + *error = "could not create kcopyd client"; 1905 + r = PTR_ERR(cache->copier); 1906 + goto bad; 1907 + } 1908 + 1909 + cache->wq = alloc_ordered_workqueue("dm-" DM_MSG_PREFIX, WQ_MEM_RECLAIM); 1910 + if (!cache->wq) { 1911 + *error = "could not create workqueue for metadata object"; 1912 + goto bad; 1913 + } 1914 + INIT_WORK(&cache->worker, do_worker); 1915 + INIT_DELAYED_WORK(&cache->waker, do_waker); 1916 + cache->last_commit_jiffies = jiffies; 1917 + 1918 + cache->prison = dm_bio_prison_create(PRISON_CELLS); 1919 + if (!cache->prison) { 1920 + *error = "could not create bio prison"; 1921 + goto bad; 1922 + } 1923 + 1924 + cache->all_io_ds = dm_deferred_set_create(); 1925 + if (!cache->all_io_ds) { 1926 + *error = "could not create all_io deferred set"; 1927 + goto bad; 1928 + } 1929 + 1930 + cache->migration_pool = mempool_create_slab_pool(MIGRATION_POOL_SIZE, 1931 + migration_cache); 1932 + if (!cache->migration_pool) { 1933 + *error = "Error creating cache's migration mempool"; 1934 + goto bad; 1935 + } 1936 + 1937 + cache->next_migration = NULL; 1938 + 1939 + cache->need_tick_bio = true; 1940 + cache->sized = false; 1941 + 
cache->quiescing = false;
1942 + cache->commit_requested = false;
1943 + cache->loaded_mappings = false;
1944 + cache->loaded_discards = false;
1945 +
1946 + load_stats(cache);
1947 +
1948 + atomic_set(&cache->stats.demotion, 0);
1949 + atomic_set(&cache->stats.promotion, 0);
1950 + atomic_set(&cache->stats.copies_avoided, 0);
1951 + atomic_set(&cache->stats.cache_cell_clash, 0);
1952 + atomic_set(&cache->stats.commit_count, 0);
1953 + atomic_set(&cache->stats.discard_count, 0);
1954 +
1955 + *result = cache;
1956 + return 0;
1957 +
1958 + bad:
1959 + destroy(cache);
1960 + return r;
1961 + }
1962 +
1963 + static int copy_ctr_args(struct cache *cache, int argc, const char **argv)
1964 + {
1965 + unsigned i;
1966 + const char **copy;
1967 +
1968 + copy = kcalloc(argc, sizeof(*copy), GFP_KERNEL);
1969 + if (!copy)
1970 + return -ENOMEM;
1971 + for (i = 0; i < argc; i++) {
1972 + copy[i] = kstrdup(argv[i], GFP_KERNEL);
1973 + if (!copy[i]) {
1974 + while (i--)
1975 + kfree(copy[i]);
1976 + kfree(copy);
1977 + return -ENOMEM;
1978 + }
1979 + }
1980 +
1981 + cache->nr_ctr_args = argc;
1982 + cache->ctr_args = copy;
1983 +
1984 + return 0;
1985 + }
1986 +
1987 + static int cache_ctr(struct dm_target *ti, unsigned argc, char **argv)
1988 + {
1989 + int r = -EINVAL;
1990 + struct cache_args *ca;
1991 + struct cache *cache = NULL;
1992 +
1993 + ca = kzalloc(sizeof(*ca), GFP_KERNEL);
1994 + if (!ca) {
1995 + ti->error = "Error allocating memory for cache";
1996 + return -ENOMEM;
1997 + }
1998 + ca->ti = ti;
1999 +
2000 + r = parse_cache_args(ca, argc, argv, &ti->error);
2001 + if (r)
2002 + goto out;
2003 +
2004 + r = cache_create(ca, &cache);
2005 + if (r) goto out;
2006 + r = copy_ctr_args(cache, argc - 3, (const char **)argv + 3);
2007 + if (r) {
2008 + destroy(cache);
2009 + goto out;
2010 + }
2011 +
2012 + ti->private = cache;
2013 +
2014 + out:
2015 + destroy_cache_args(ca);
2016 + return r;
2017 + }
2018 +
2019 + static unsigned cache_num_write_bios(struct dm_target *ti, struct bio
*bio) 2020 + { 2021 + int r; 2022 + struct cache *cache = ti->private; 2023 + dm_oblock_t block = get_bio_block(cache, bio); 2024 + dm_cblock_t cblock; 2025 + 2026 + r = policy_lookup(cache->policy, block, &cblock); 2027 + if (r < 0) 2028 + return 2; /* assume the worst */ 2029 + 2030 + return (!r && !is_dirty(cache, cblock)) ? 2 : 1; 2031 + } 2032 + 2033 + static int cache_map(struct dm_target *ti, struct bio *bio) 2034 + { 2035 + struct cache *cache = ti->private; 2036 + 2037 + int r; 2038 + dm_oblock_t block = get_bio_block(cache, bio); 2039 + bool can_migrate = false; 2040 + bool discarded_block; 2041 + struct dm_bio_prison_cell *cell; 2042 + struct policy_result lookup_result; 2043 + struct per_bio_data *pb; 2044 + 2045 + if (from_oblock(block) > from_oblock(cache->origin_blocks)) { 2046 + /* 2047 + * This can only occur if the io goes to a partial block at 2048 + * the end of the origin device. We don't cache these. 2049 + * Just remap to the origin and carry on. 2050 + */ 2051 + remap_to_origin_clear_discard(cache, bio, block); 2052 + return DM_MAPIO_REMAPPED; 2053 + } 2054 + 2055 + pb = init_per_bio_data(bio); 2056 + 2057 + if (bio->bi_rw & (REQ_FLUSH | REQ_FUA | REQ_DISCARD)) { 2058 + defer_bio(cache, bio); 2059 + return DM_MAPIO_SUBMITTED; 2060 + } 2061 + 2062 + /* 2063 + * Check to see if that block is currently migrating. 
2064 + */ 2065 + cell = alloc_prison_cell(cache); 2066 + if (!cell) { 2067 + defer_bio(cache, bio); 2068 + return DM_MAPIO_SUBMITTED; 2069 + } 2070 + 2071 + r = bio_detain(cache, block, bio, cell, 2072 + (cell_free_fn) free_prison_cell, 2073 + cache, &cell); 2074 + if (r) { 2075 + if (r < 0) 2076 + defer_bio(cache, bio); 2077 + 2078 + return DM_MAPIO_SUBMITTED; 2079 + } 2080 + 2081 + discarded_block = is_discarded_oblock(cache, block); 2082 + 2083 + r = policy_map(cache->policy, block, false, can_migrate, discarded_block, 2084 + bio, &lookup_result); 2085 + if (r == -EWOULDBLOCK) { 2086 + cell_defer(cache, cell, true); 2087 + return DM_MAPIO_SUBMITTED; 2088 + 2089 + } else if (r) { 2090 + DMERR_LIMIT("Unexpected return from cache replacement policy: %d", r); 2091 + bio_io_error(bio); 2092 + return DM_MAPIO_SUBMITTED; 2093 + } 2094 + 2095 + switch (lookup_result.op) { 2096 + case POLICY_HIT: 2097 + inc_hit_counter(cache, bio); 2098 + pb->all_io_entry = dm_deferred_entry_inc(cache->all_io_ds); 2099 + 2100 + if (is_writethrough_io(cache, bio, lookup_result.cblock)) { 2101 + /* 2102 + * No need to mark anything dirty in write through mode. 2103 + */ 2104 + pb->req_nr == 0 ? 2105 + remap_to_cache(cache, bio, lookup_result.cblock) : 2106 + remap_to_origin_clear_discard(cache, bio, block); 2107 + cell_defer(cache, cell, false); 2108 + } else { 2109 + remap_to_cache_dirty(cache, bio, block, lookup_result.cblock); 2110 + cell_defer(cache, cell, false); 2111 + } 2112 + break; 2113 + 2114 + case POLICY_MISS: 2115 + inc_miss_counter(cache, bio); 2116 + pb->all_io_entry = dm_deferred_entry_inc(cache->all_io_ds); 2117 + 2118 + if (pb->req_nr != 0) { 2119 + /* 2120 + * This is a duplicate writethrough io that is no 2121 + * longer needed because the block has been demoted. 
2122 + */ 2123 + bio_endio(bio, 0); 2124 + cell_defer(cache, cell, false); 2125 + return DM_MAPIO_SUBMITTED; 2126 + } else { 2127 + remap_to_origin_clear_discard(cache, bio, block); 2128 + cell_defer(cache, cell, false); 2129 + } 2130 + break; 2131 + 2132 + default: 2133 + DMERR_LIMIT("%s: erroring bio: unknown policy op: %u", __func__, 2134 + (unsigned) lookup_result.op); 2135 + bio_io_error(bio); 2136 + return DM_MAPIO_SUBMITTED; 2137 + } 2138 + 2139 + return DM_MAPIO_REMAPPED; 2140 + } 2141 + 2142 + static int cache_end_io(struct dm_target *ti, struct bio *bio, int error) 2143 + { 2144 + struct cache *cache = ti->private; 2145 + unsigned long flags; 2146 + struct per_bio_data *pb = get_per_bio_data(bio); 2147 + 2148 + if (pb->tick) { 2149 + policy_tick(cache->policy); 2150 + 2151 + spin_lock_irqsave(&cache->lock, flags); 2152 + cache->need_tick_bio = true; 2153 + spin_unlock_irqrestore(&cache->lock, flags); 2154 + } 2155 + 2156 + check_for_quiesced_migrations(cache, pb); 2157 + 2158 + return 0; 2159 + } 2160 + 2161 + static int write_dirty_bitset(struct cache *cache) 2162 + { 2163 + unsigned i, r; 2164 + 2165 + for (i = 0; i < from_cblock(cache->cache_size); i++) { 2166 + r = dm_cache_set_dirty(cache->cmd, to_cblock(i), 2167 + is_dirty(cache, to_cblock(i))); 2168 + if (r) 2169 + return r; 2170 + } 2171 + 2172 + return 0; 2173 + } 2174 + 2175 + static int write_discard_bitset(struct cache *cache) 2176 + { 2177 + unsigned i, r; 2178 + 2179 + r = dm_cache_discard_bitset_resize(cache->cmd, cache->discard_block_size, 2180 + cache->discard_nr_blocks); 2181 + if (r) { 2182 + DMERR("could not resize on-disk discard bitset"); 2183 + return r; 2184 + } 2185 + 2186 + for (i = 0; i < from_dblock(cache->discard_nr_blocks); i++) { 2187 + r = dm_cache_set_discard(cache->cmd, to_dblock(i), 2188 + is_discarded(cache, to_dblock(i))); 2189 + if (r) 2190 + return r; 2191 + } 2192 + 2193 + return 0; 2194 + } 2195 + 2196 + static int save_hint(void *context, dm_cblock_t cblock, 
dm_oblock_t oblock, 2197 + uint32_t hint) 2198 + { 2199 + struct cache *cache = context; 2200 + return dm_cache_save_hint(cache->cmd, cblock, hint); 2201 + } 2202 + 2203 + static int write_hints(struct cache *cache) 2204 + { 2205 + int r; 2206 + 2207 + r = dm_cache_begin_hints(cache->cmd, cache->policy); 2208 + if (r) { 2209 + DMERR("dm_cache_begin_hints failed"); 2210 + return r; 2211 + } 2212 + 2213 + r = policy_walk_mappings(cache->policy, save_hint, cache); 2214 + if (r) 2215 + DMERR("policy_walk_mappings failed"); 2216 + 2217 + return r; 2218 + } 2219 + 2220 + /* 2221 + * returns true on success 2222 + */ 2223 + static bool sync_metadata(struct cache *cache) 2224 + { 2225 + int r1, r2, r3, r4; 2226 + 2227 + r1 = write_dirty_bitset(cache); 2228 + if (r1) 2229 + DMERR("could not write dirty bitset"); 2230 + 2231 + r2 = write_discard_bitset(cache); 2232 + if (r2) 2233 + DMERR("could not write discard bitset"); 2234 + 2235 + save_stats(cache); 2236 + 2237 + r3 = write_hints(cache); 2238 + if (r3) 2239 + DMERR("could not write hints"); 2240 + 2241 + /* 2242 + * If writing the above metadata failed, we still commit, but don't 2243 + * set the clean shutdown flag. This will effectively force every 2244 + * dirty bit to be set on reload. 2245 + */ 2246 + r4 = dm_cache_commit(cache->cmd, !r1 && !r2 && !r3); 2247 + if (r4) 2248 + DMERR("could not write cache metadata. 
Data loss may occur."); 2249 + 2250 + return !r1 && !r2 && !r3 && !r4; 2251 + } 2252 + 2253 + static void cache_postsuspend(struct dm_target *ti) 2254 + { 2255 + struct cache *cache = ti->private; 2256 + 2257 + start_quiescing(cache); 2258 + wait_for_migrations(cache); 2259 + stop_worker(cache); 2260 + requeue_deferred_io(cache); 2261 + stop_quiescing(cache); 2262 + 2263 + (void) sync_metadata(cache); 2264 + } 2265 + 2266 + static int load_mapping(void *context, dm_oblock_t oblock, dm_cblock_t cblock, 2267 + bool dirty, uint32_t hint, bool hint_valid) 2268 + { 2269 + int r; 2270 + struct cache *cache = context; 2271 + 2272 + r = policy_load_mapping(cache->policy, oblock, cblock, hint, hint_valid); 2273 + if (r) 2274 + return r; 2275 + 2276 + if (dirty) 2277 + set_dirty(cache, oblock, cblock); 2278 + else 2279 + clear_dirty(cache, oblock, cblock); 2280 + 2281 + return 0; 2282 + } 2283 + 2284 + static int load_discard(void *context, sector_t discard_block_size, 2285 + dm_dblock_t dblock, bool discard) 2286 + { 2287 + struct cache *cache = context; 2288 + 2289 + /* FIXME: handle mis-matched block size */ 2290 + 2291 + if (discard) 2292 + set_discard(cache, dblock); 2293 + else 2294 + clear_discard(cache, dblock); 2295 + 2296 + return 0; 2297 + } 2298 + 2299 + static int cache_preresume(struct dm_target *ti) 2300 + { 2301 + int r = 0; 2302 + struct cache *cache = ti->private; 2303 + sector_t actual_cache_size = get_dev_size(cache->cache_dev); 2304 + (void) sector_div(actual_cache_size, cache->sectors_per_block); 2305 + 2306 + /* 2307 + * Check to see if the cache has resized. 
2308 + */ 2309 + if (from_cblock(cache->cache_size) != actual_cache_size || !cache->sized) { 2310 + cache->cache_size = to_cblock(actual_cache_size); 2311 + 2312 + r = dm_cache_resize(cache->cmd, cache->cache_size); 2313 + if (r) { 2314 + DMERR("could not resize cache metadata"); 2315 + return r; 2316 + } 2317 + 2318 + cache->sized = true; 2319 + } 2320 + 2321 + if (!cache->loaded_mappings) { 2322 + r = dm_cache_load_mappings(cache->cmd, 2323 + dm_cache_policy_get_name(cache->policy), 2324 + load_mapping, cache); 2325 + if (r) { 2326 + DMERR("could not load cache mappings"); 2327 + return r; 2328 + } 2329 + 2330 + cache->loaded_mappings = true; 2331 + } 2332 + 2333 + if (!cache->loaded_discards) { 2334 + r = dm_cache_load_discards(cache->cmd, load_discard, cache); 2335 + if (r) { 2336 + DMERR("could not load origin discards"); 2337 + return r; 2338 + } 2339 + 2340 + cache->loaded_discards = true; 2341 + } 2342 + 2343 + return r; 2344 + } 2345 + 2346 + static void cache_resume(struct dm_target *ti) 2347 + { 2348 + struct cache *cache = ti->private; 2349 + 2350 + cache->need_tick_bio = true; 2351 + do_waker(&cache->waker.work); 2352 + } 2353 + 2354 + /* 2355 + * Status format: 2356 + * 2357 + * <#used metadata blocks>/<#total metadata blocks> 2358 + * <#read hits> <#read misses> <#write hits> <#write misses> 2359 + * <#demotions> <#promotions> <#blocks in cache> <#dirty> 2360 + * <#features> <features>* 2361 + * <#core args> <core args> 2362 + * <#policy args> <policy args>* 2363 + */ 2364 + static void cache_status(struct dm_target *ti, status_type_t type, 2365 + unsigned status_flags, char *result, unsigned maxlen) 2366 + { 2367 + int r = 0; 2368 + unsigned i; 2369 + ssize_t sz = 0; 2370 + dm_block_t nr_free_blocks_metadata = 0; 2371 + dm_block_t nr_blocks_metadata = 0; 2372 + char buf[BDEVNAME_SIZE]; 2373 + struct cache *cache = ti->private; 2374 + dm_cblock_t residency; 2375 + 2376 + switch (type) { 2377 + case STATUSTYPE_INFO: 2378 + /* Commit to ensure 
statistics aren't out-of-date */ 2379 + if (!(status_flags & DM_STATUS_NOFLUSH_FLAG) && !dm_suspended(ti)) { 2380 + r = dm_cache_commit(cache->cmd, false); 2381 + if (r) 2382 + DMERR("could not commit metadata for accurate status"); 2383 + } 2384 + 2385 + r = dm_cache_get_free_metadata_block_count(cache->cmd, 2386 + &nr_free_blocks_metadata); 2387 + if (r) { 2388 + DMERR("could not get metadata free block count"); 2389 + goto err; 2390 + } 2391 + 2392 + r = dm_cache_get_metadata_dev_size(cache->cmd, &nr_blocks_metadata); 2393 + if (r) { 2394 + DMERR("could not get metadata device size"); 2395 + goto err; 2396 + } 2397 + 2398 + residency = policy_residency(cache->policy); 2399 + 2400 + DMEMIT("%llu/%llu %u %u %u %u %u %u %llu %u ", 2401 + (unsigned long long)(nr_blocks_metadata - nr_free_blocks_metadata), 2402 + (unsigned long long)nr_blocks_metadata, 2403 + (unsigned) atomic_read(&cache->stats.read_hit), 2404 + (unsigned) atomic_read(&cache->stats.read_miss), 2405 + (unsigned) atomic_read(&cache->stats.write_hit), 2406 + (unsigned) atomic_read(&cache->stats.write_miss), 2407 + (unsigned) atomic_read(&cache->stats.demotion), 2408 + (unsigned) atomic_read(&cache->stats.promotion), 2409 + (unsigned long long) from_cblock(residency), 2410 + cache->nr_dirty); 2411 + 2412 + if (cache->features.write_through) 2413 + DMEMIT("1 writethrough "); 2414 + else 2415 + DMEMIT("0 "); 2416 + 2417 + DMEMIT("2 migration_threshold %llu ", (unsigned long long) cache->migration_threshold); 2418 + if (sz < maxlen) { 2419 + r = policy_emit_config_values(cache->policy, result + sz, maxlen - sz); 2420 + if (r) 2421 + DMERR("policy_emit_config_values returned %d", r); 2422 + } 2423 + 2424 + break; 2425 + 2426 + case STATUSTYPE_TABLE: 2427 + format_dev_t(buf, cache->metadata_dev->bdev->bd_dev); 2428 + DMEMIT("%s ", buf); 2429 + format_dev_t(buf, cache->cache_dev->bdev->bd_dev); 2430 + DMEMIT("%s ", buf); 2431 + format_dev_t(buf, cache->origin_dev->bdev->bd_dev); 2432 + DMEMIT("%s", buf); 2433 
+ 2434 + for (i = 0; i < cache->nr_ctr_args - 1; i++) 2435 + DMEMIT(" %s", cache->ctr_args[i]); 2436 + if (cache->nr_ctr_args) 2437 + DMEMIT(" %s", cache->ctr_args[cache->nr_ctr_args - 1]); 2438 + } 2439 + 2440 + return; 2441 + 2442 + err: 2443 + DMEMIT("Error"); 2444 + } 2445 + 2446 + #define NOT_CORE_OPTION 1 2447 + 2448 + static int process_config_option(struct cache *cache, char **argv) 2449 + { 2450 + unsigned long tmp; 2451 + 2452 + if (!strcasecmp(argv[0], "migration_threshold")) { 2453 + if (kstrtoul(argv[1], 10, &tmp)) 2454 + return -EINVAL; 2455 + 2456 + cache->migration_threshold = tmp; 2457 + return 0; 2458 + } 2459 + 2460 + return NOT_CORE_OPTION; 2461 + } 2462 + 2463 + /* 2464 + * Supports <key> <value>. 2465 + * 2466 + * The key migration_threshold is supported by the cache target core. 2467 + */ 2468 + static int cache_message(struct dm_target *ti, unsigned argc, char **argv) 2469 + { 2470 + int r; 2471 + struct cache *cache = ti->private; 2472 + 2473 + if (argc != 2) 2474 + return -EINVAL; 2475 + 2476 + r = process_config_option(cache, argv); 2477 + if (r == NOT_CORE_OPTION) 2478 + return policy_set_config_value(cache->policy, argv[0], argv[1]); 2479 + 2480 + return r; 2481 + } 2482 + 2483 + static int cache_iterate_devices(struct dm_target *ti, 2484 + iterate_devices_callout_fn fn, void *data) 2485 + { 2486 + int r = 0; 2487 + struct cache *cache = ti->private; 2488 + 2489 + r = fn(ti, cache->cache_dev, 0, get_dev_size(cache->cache_dev), data); 2490 + if (!r) 2491 + r = fn(ti, cache->origin_dev, 0, ti->len, data); 2492 + 2493 + return r; 2494 + } 2495 + 2496 + /* 2497 + * We assume I/O is going to the origin (which is the volume 2498 + * more likely to have restrictions e.g. by being striped). 2499 + * (Looking up the exact location of the data would be expensive 2500 + * and could always be out of date by the time the bio is submitted.) 
2501 + */ 2502 + static int cache_bvec_merge(struct dm_target *ti, 2503 + struct bvec_merge_data *bvm, 2504 + struct bio_vec *biovec, int max_size) 2505 + { 2506 + struct cache *cache = ti->private; 2507 + struct request_queue *q = bdev_get_queue(cache->origin_dev->bdev); 2508 + 2509 + if (!q->merge_bvec_fn) 2510 + return max_size; 2511 + 2512 + bvm->bi_bdev = cache->origin_dev->bdev; 2513 + return min(max_size, q->merge_bvec_fn(q, bvm, biovec)); 2514 + } 2515 + 2516 + static void set_discard_limits(struct cache *cache, struct queue_limits *limits) 2517 + { 2518 + /* 2519 + * FIXME: these limits may be incompatible with the cache device 2520 + */ 2521 + limits->max_discard_sectors = cache->discard_block_size * 1024; 2522 + limits->discard_granularity = cache->discard_block_size << SECTOR_SHIFT; 2523 + } 2524 + 2525 + static void cache_io_hints(struct dm_target *ti, struct queue_limits *limits) 2526 + { 2527 + struct cache *cache = ti->private; 2528 + 2529 + blk_limits_io_min(limits, 0); 2530 + blk_limits_io_opt(limits, cache->sectors_per_block << SECTOR_SHIFT); 2531 + set_discard_limits(cache, limits); 2532 + } 2533 + 2534 + /*----------------------------------------------------------------*/ 2535 + 2536 + static struct target_type cache_target = { 2537 + .name = "cache", 2538 + .version = {1, 0, 0}, 2539 + .module = THIS_MODULE, 2540 + .ctr = cache_ctr, 2541 + .dtr = cache_dtr, 2542 + .map = cache_map, 2543 + .end_io = cache_end_io, 2544 + .postsuspend = cache_postsuspend, 2545 + .preresume = cache_preresume, 2546 + .resume = cache_resume, 2547 + .status = cache_status, 2548 + .message = cache_message, 2549 + .iterate_devices = cache_iterate_devices, 2550 + .merge = cache_bvec_merge, 2551 + .io_hints = cache_io_hints, 2552 + }; 2553 + 2554 + static int __init dm_cache_init(void) 2555 + { 2556 + int r; 2557 + 2558 + r = dm_register_target(&cache_target); 2559 + if (r) { 2560 + DMERR("cache target registration failed: %d", r); 2561 + return r; 2562 + } 2563 + 2564 + 
migration_cache = KMEM_CACHE(dm_cache_migration, 0); 2565 + if (!migration_cache) { 2566 + dm_unregister_target(&cache_target); 2567 + return -ENOMEM; 2568 + } 2569 + 2570 + return 0; 2571 + } 2572 + 2573 + static void __exit dm_cache_exit(void) 2574 + { 2575 + dm_unregister_target(&cache_target); 2576 + kmem_cache_destroy(migration_cache); 2577 + } 2578 + 2579 + module_init(dm_cache_init); 2580 + module_exit(dm_cache_exit); 2581 + 2582 + MODULE_DESCRIPTION(DM_NAME " cache target"); 2583 + MODULE_AUTHOR("Joe Thornber <ejt@redhat.com>"); 2584 + MODULE_LICENSE("GPL");
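The cache_map() hunk above routes each bio according to the policy's lookup result: a HIT remaps to the cache device (except that in writethrough mode only the first bio, req_nr == 0, goes to the cache and its duplicate goes to the origin so both copies stay in sync), while a MISS remaps to the origin, completing any now-redundant writethrough duplicate immediately. A minimal userspace sketch of that decision table — the names `map_decision` and `remap_target` are illustrative, not from the kernel source, and "writethrough" here stands for the full is_writethrough_io() check:

```c
/* Illustrative stand-ins for the kernel's policy result and remap choices. */
enum policy_op { POLICY_HIT, POLICY_MISS };
enum remap_target { REMAP_CACHE, REMAP_ORIGIN, BIO_COMPLETE };

/*
 * Sketch of the POLICY_HIT/POLICY_MISS switch in cache_map():
 *  - hit + writethrough: only the primary bio (req_nr == 0) is remapped
 *    to the cache; the duplicate is sent to the origin, so nothing need
 *    be marked dirty.
 *  - hit otherwise: remap to the cache (and mark dirty, not modelled here).
 *  - miss: a duplicate writethrough bio (req_nr != 0) is no longer needed
 *    because the block has been demoted, so it is completed; the primary
 *    bio is remapped to the origin.
 */
static enum remap_target map_decision(enum policy_op op, int writethrough,
				      unsigned req_nr)
{
	if (op == POLICY_HIT) {
		if (writethrough)
			return req_nr == 0 ? REMAP_CACHE : REMAP_ORIGIN;
		return REMAP_CACHE;
	}
	/* POLICY_MISS */
	return req_nr != 0 ? BIO_COMPLETE : REMAP_ORIGIN;
}
```

This also explains the `pb->req_nr == 0 ? remap_to_cache(...) : remap_to_origin_clear_discard(...)` ternary in the hunk: the two arms are the two halves of the writethrough hit case.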
+12 -33
drivers/md/dm-crypt.c
··· 1234 1234 return 0; 1235 1235 } 1236 1236 1237 - /* 1238 - * Encode key into its hex representation 1239 - */ 1240 - static void crypt_encode_key(char *hex, u8 *key, unsigned int size) 1241 - { 1242 - unsigned int i; 1243 - 1244 - for (i = 0; i < size; i++) { 1245 - sprintf(hex, "%02x", *key); 1246 - hex += 2; 1247 - key++; 1248 - } 1249 - } 1250 - 1251 1237 static void crypt_free_tfms(struct crypt_config *cc) 1252 1238 { 1253 1239 unsigned i; ··· 1637 1651 1638 1652 if (opt_params == 1 && opt_string && 1639 1653 !strcasecmp(opt_string, "allow_discards")) 1640 - ti->num_discard_requests = 1; 1654 + ti->num_discard_bios = 1; 1641 1655 else if (opt_params) { 1642 1656 ret = -EINVAL; 1643 1657 ti->error = "Invalid feature arguments"; ··· 1665 1679 goto bad; 1666 1680 } 1667 1681 1668 - ti->num_flush_requests = 1; 1682 + ti->num_flush_bios = 1; 1669 1683 ti->discard_zeroes_data_unsupported = true; 1670 1684 1671 1685 return 0; ··· 1703 1717 return DM_MAPIO_SUBMITTED; 1704 1718 } 1705 1719 1706 - static int crypt_status(struct dm_target *ti, status_type_t type, 1707 - unsigned status_flags, char *result, unsigned maxlen) 1720 + static void crypt_status(struct dm_target *ti, status_type_t type, 1721 + unsigned status_flags, char *result, unsigned maxlen) 1708 1722 { 1709 1723 struct crypt_config *cc = ti->private; 1710 - unsigned int sz = 0; 1724 + unsigned i, sz = 0; 1711 1725 1712 1726 switch (type) { 1713 1727 case STATUSTYPE_INFO: ··· 1717 1731 case STATUSTYPE_TABLE: 1718 1732 DMEMIT("%s ", cc->cipher_string); 1719 1733 1720 - if (cc->key_size > 0) { 1721 - if ((maxlen - sz) < ((cc->key_size << 1) + 1)) 1722 - return -ENOMEM; 1723 - 1724 - crypt_encode_key(result + sz, cc->key, cc->key_size); 1725 - sz += cc->key_size << 1; 1726 - } else { 1727 - if (sz >= maxlen) 1728 - return -ENOMEM; 1729 - result[sz++] = '-'; 1730 - } 1734 + if (cc->key_size > 0) 1735 + for (i = 0; i < cc->key_size; i++) 1736 + DMEMIT("%02x", cc->key[i]); 1737 + else 1738 + DMEMIT("-"); 1731 
1739 1732 1740 DMEMIT(" %llu %s %llu", (unsigned long long)cc->iv_offset, 1733 1741 cc->dev->name, (unsigned long long)cc->start); 1734 1742 1735 - if (ti->num_discard_requests) 1743 + if (ti->num_discard_bios) 1736 1744 DMEMIT(" 1 allow_discards"); 1737 1745 1738 1746 break; 1739 1747 } 1740 - return 0; 1741 1748 } 1742 1749 1743 1750 static void crypt_postsuspend(struct dm_target *ti) ··· 1824 1845 1825 1846 static struct target_type crypt_target = { 1826 1847 .name = "crypt", 1827 - .version = {1, 12, 0}, 1848 + .version = {1, 12, 1}, 1828 1849 .module = THIS_MODULE, 1829 1850 .ctr = crypt_ctr, 1830 1851 .dtr = crypt_dtr,
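The crypt_status() change above drops the bespoke crypt_encode_key() helper and its -ENOMEM handling: the key is now emitted one byte at a time with DMEMIT("%02x", cc->key[i]), and buffer truncation is detected centrally by the ioctl layer instead of by each target. A hedged userspace equivalent of the byte-to-hex loop — the function name `encode_key_hex` is mine, not the kernel's:

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Emit `keylen` bytes as lowercase hex into `out`, which must hold at
 * least 2 * keylen + 1 bytes, mirroring the DMEMIT("%02x", cc->key[i])
 * loop in crypt_status(). A zero-length key prints as "-", matching
 * the cc->key_size == 0 branch in the hunk above.
 */
static void encode_key_hex(char *out, const unsigned char *key, size_t keylen)
{
	size_t i;

	if (!keylen) {
		sprintf(out, "-");
		return;
	}
	for (i = 0; i < keylen; i++)
		sprintf(out + 2 * i, "%02x", key[i]);
}
```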
+5 -7
drivers/md/dm-delay.c
··· 198 198 mutex_init(&dc->timer_lock); 199 199 atomic_set(&dc->may_delay, 1); 200 200 201 - ti->num_flush_requests = 1; 202 - ti->num_discard_requests = 1; 201 + ti->num_flush_bios = 1; 202 + ti->num_discard_bios = 1; 203 203 ti->private = dc; 204 204 return 0; 205 205 ··· 293 293 return delay_bio(dc, dc->read_delay, bio); 294 294 } 295 295 296 - static int delay_status(struct dm_target *ti, status_type_t type, 297 - unsigned status_flags, char *result, unsigned maxlen) 296 + static void delay_status(struct dm_target *ti, status_type_t type, 297 + unsigned status_flags, char *result, unsigned maxlen) 298 298 { 299 299 struct delay_c *dc = ti->private; 300 300 int sz = 0; ··· 314 314 dc->write_delay); 315 315 break; 316 316 } 317 - 318 - return 0; 319 317 } 320 318 321 319 static int delay_iterate_devices(struct dm_target *ti, ··· 335 337 336 338 static struct target_type delay_target = { 337 339 .name = "delay", 338 - .version = {1, 2, 0}, 340 + .version = {1, 2, 1}, 339 341 .module = THIS_MODULE, 340 342 .ctr = delay_ctr, 341 343 .dtr = delay_dtr,
+5 -6
drivers/md/dm-flakey.c
··· 216 216 goto bad; 217 217 } 218 218 219 - ti->num_flush_requests = 1; 220 - ti->num_discard_requests = 1; 219 + ti->num_flush_bios = 1; 220 + ti->num_discard_bios = 1; 221 221 ti->per_bio_data_size = sizeof(struct per_bio_data); 222 222 ti->private = fc; 223 223 return 0; ··· 337 337 return error; 338 338 } 339 339 340 - static int flakey_status(struct dm_target *ti, status_type_t type, 341 - unsigned status_flags, char *result, unsigned maxlen) 340 + static void flakey_status(struct dm_target *ti, status_type_t type, 341 + unsigned status_flags, char *result, unsigned maxlen) 342 342 { 343 343 unsigned sz = 0; 344 344 struct flakey_c *fc = ti->private; ··· 368 368 369 369 break; 370 370 } 371 - return 0; 372 371 } 373 372 374 373 static int flakey_ioctl(struct dm_target *ti, unsigned int cmd, unsigned long arg) ··· 410 411 411 412 static struct target_type flakey_target = { 412 413 .name = "flakey", 413 - .version = {1, 3, 0}, 414 + .version = {1, 3, 1}, 414 415 .module = THIS_MODULE, 415 416 .ctr = flakey_ctr, 416 417 .dtr = flakey_dtr,
+117 -45
drivers/md/dm-ioctl.c
··· 1067 1067 num_targets = dm_table_get_num_targets(table); 1068 1068 for (i = 0; i < num_targets; i++) { 1069 1069 struct dm_target *ti = dm_table_get_target(table, i); 1070 + size_t l; 1070 1071 1071 1072 remaining = len - (outptr - outbuf); 1072 1073 if (remaining <= sizeof(struct dm_target_spec)) { ··· 1094 1093 if (ti->type->status) { 1095 1094 if (param->flags & DM_NOFLUSH_FLAG) 1096 1095 status_flags |= DM_STATUS_NOFLUSH_FLAG; 1097 - if (ti->type->status(ti, type, status_flags, outptr, remaining)) { 1098 - param->flags |= DM_BUFFER_FULL_FLAG; 1099 - break; 1100 - } 1096 + ti->type->status(ti, type, status_flags, outptr, remaining); 1101 1097 } else 1102 1098 outptr[0] = '\0'; 1103 1099 1104 - outptr += strlen(outptr) + 1; 1100 + l = strlen(outptr) + 1; 1101 + if (l == remaining) { 1102 + param->flags |= DM_BUFFER_FULL_FLAG; 1103 + break; 1104 + } 1105 + 1106 + outptr += l; 1105 1107 used = param->data_start + (outptr - outbuf); 1106 1108 1107 1109 outptr = align_ptr(outptr); ··· 1414 1410 return 0; 1415 1411 } 1416 1412 1413 + static bool buffer_test_overflow(char *result, unsigned maxlen) 1414 + { 1415 + return !maxlen || strlen(result) + 1 >= maxlen; 1416 + } 1417 + 1418 + /* 1419 + * Process device-mapper dependent messages. 1420 + * Returns a number <= 1 if message was processed by device mapper. 1421 + * Returns 2 if message should be delivered to the target. 1422 + */ 1423 + static int message_for_md(struct mapped_device *md, unsigned argc, char **argv, 1424 + char *result, unsigned maxlen) 1425 + { 1426 + return 2; 1427 + } 1428 + 1417 1429 /* 1418 1430 * Pass a message to the target that's at the supplied device offset. 
1419 1431 */ ··· 1441 1421 struct dm_table *table; 1442 1422 struct dm_target *ti; 1443 1423 struct dm_target_msg *tmsg = (void *) param + param->data_start; 1424 + size_t maxlen; 1425 + char *result = get_result_buffer(param, param_size, &maxlen); 1444 1426 1445 1427 md = find_device(param); 1446 1428 if (!md) ··· 1465 1443 DMWARN("Empty message received."); 1466 1444 goto out_argv; 1467 1445 } 1446 + 1447 + r = message_for_md(md, argc, argv, result, maxlen); 1448 + if (r <= 1) 1449 + goto out_argv; 1468 1450 1469 1451 table = dm_get_live_table(md); 1470 1452 if (!table) ··· 1495 1469 out_argv: 1496 1470 kfree(argv); 1497 1471 out: 1498 - param->data_size = 0; 1472 + if (r >= 0) 1473 + __dev_status(md, param); 1474 + 1475 + if (r == 1) { 1476 + param->flags |= DM_DATA_OUT_FLAG; 1477 + if (buffer_test_overflow(result, maxlen)) 1478 + param->flags |= DM_BUFFER_FULL_FLAG; 1479 + else 1480 + param->data_size = param->data_start + strlen(result) + 1; 1481 + r = 0; 1482 + } 1483 + 1499 1484 dm_put(md); 1500 1485 return r; 1501 1486 } 1487 + 1488 + /* 1489 + * The ioctl parameter block consists of two parts, a dm_ioctl struct 1490 + * followed by a data buffer. This flag is set if the second part, 1491 + * which has a variable size, is not used by the function processing 1492 + * the ioctl. 1493 + */ 1494 + #define IOCTL_FLAGS_NO_PARAMS 1 1502 1495 1503 1496 /*----------------------------------------------------------------- 1504 1497 * Implementation of open/close/ioctl on the special char 1505 1498 * device. 
1506 1499 *---------------------------------------------------------------*/ 1507 - static ioctl_fn lookup_ioctl(unsigned int cmd) 1500 + static ioctl_fn lookup_ioctl(unsigned int cmd, int *ioctl_flags) 1508 1501 { 1509 1502 static struct { 1510 1503 int cmd; 1504 + int flags; 1511 1505 ioctl_fn fn; 1512 1506 } _ioctls[] = { 1513 - {DM_VERSION_CMD, NULL}, /* version is dealt with elsewhere */ 1514 - {DM_REMOVE_ALL_CMD, remove_all}, 1515 - {DM_LIST_DEVICES_CMD, list_devices}, 1507 + {DM_VERSION_CMD, 0, NULL}, /* version is dealt with elsewhere */ 1508 + {DM_REMOVE_ALL_CMD, IOCTL_FLAGS_NO_PARAMS, remove_all}, 1509 + {DM_LIST_DEVICES_CMD, 0, list_devices}, 1516 1510 1517 - {DM_DEV_CREATE_CMD, dev_create}, 1518 - {DM_DEV_REMOVE_CMD, dev_remove}, 1519 - {DM_DEV_RENAME_CMD, dev_rename}, 1520 - {DM_DEV_SUSPEND_CMD, dev_suspend}, 1521 - {DM_DEV_STATUS_CMD, dev_status}, 1522 - {DM_DEV_WAIT_CMD, dev_wait}, 1511 + {DM_DEV_CREATE_CMD, IOCTL_FLAGS_NO_PARAMS, dev_create}, 1512 + {DM_DEV_REMOVE_CMD, IOCTL_FLAGS_NO_PARAMS, dev_remove}, 1513 + {DM_DEV_RENAME_CMD, 0, dev_rename}, 1514 + {DM_DEV_SUSPEND_CMD, IOCTL_FLAGS_NO_PARAMS, dev_suspend}, 1515 + {DM_DEV_STATUS_CMD, IOCTL_FLAGS_NO_PARAMS, dev_status}, 1516 + {DM_DEV_WAIT_CMD, 0, dev_wait}, 1523 1517 1524 - {DM_TABLE_LOAD_CMD, table_load}, 1525 - {DM_TABLE_CLEAR_CMD, table_clear}, 1526 - {DM_TABLE_DEPS_CMD, table_deps}, 1527 - {DM_TABLE_STATUS_CMD, table_status}, 1518 + {DM_TABLE_LOAD_CMD, 0, table_load}, 1519 + {DM_TABLE_CLEAR_CMD, IOCTL_FLAGS_NO_PARAMS, table_clear}, 1520 + {DM_TABLE_DEPS_CMD, 0, table_deps}, 1521 + {DM_TABLE_STATUS_CMD, 0, table_status}, 1528 1522 1529 - {DM_LIST_VERSIONS_CMD, list_versions}, 1523 + {DM_LIST_VERSIONS_CMD, 0, list_versions}, 1530 1524 1531 - {DM_TARGET_MSG_CMD, target_message}, 1532 - {DM_DEV_SET_GEOMETRY_CMD, dev_set_geometry} 1525 + {DM_TARGET_MSG_CMD, 0, target_message}, 1526 + {DM_DEV_SET_GEOMETRY_CMD, 0, dev_set_geometry} 1533 1527 }; 1534 1528 1535 - return (cmd >= ARRAY_SIZE(_ioctls)) ? 
NULL : _ioctls[cmd].fn; 1529 + if (unlikely(cmd >= ARRAY_SIZE(_ioctls))) 1530 + return NULL; 1531 + 1532 + *ioctl_flags = _ioctls[cmd].flags; 1533 + return _ioctls[cmd].fn; 1536 1534 } 1537 1535 1538 1536 /* ··· 1593 1543 return r; 1594 1544 } 1595 1545 1596 - #define DM_PARAMS_VMALLOC 0x0001 /* Params alloced with vmalloc not kmalloc */ 1546 + #define DM_PARAMS_KMALLOC 0x0001 /* Params alloced with kmalloc */ 1547 + #define DM_PARAMS_VMALLOC 0x0002 /* Params alloced with vmalloc */ 1597 1548 #define DM_WIPE_BUFFER 0x0010 /* Wipe input buffer before returning from ioctl */ 1598 1549 1599 1550 static void free_params(struct dm_ioctl *param, size_t param_size, int param_flags) ··· 1602 1551 if (param_flags & DM_WIPE_BUFFER) 1603 1552 memset(param, 0, param_size); 1604 1553 1554 + if (param_flags & DM_PARAMS_KMALLOC) 1555 + kfree(param); 1605 1556 if (param_flags & DM_PARAMS_VMALLOC) 1606 1557 vfree(param); 1607 - else 1608 - kfree(param); 1609 1558 } 1610 1559 1611 - static int copy_params(struct dm_ioctl __user *user, struct dm_ioctl **param, int *param_flags) 1560 + static int copy_params(struct dm_ioctl __user *user, struct dm_ioctl *param_kernel, 1561 + int ioctl_flags, 1562 + struct dm_ioctl **param, int *param_flags) 1612 1563 { 1613 - struct dm_ioctl tmp, *dmi; 1564 + struct dm_ioctl *dmi; 1614 1565 int secure_data; 1566 + const size_t minimum_data_size = sizeof(*param_kernel) - sizeof(param_kernel->data); 1615 1567 1616 - if (copy_from_user(&tmp, user, sizeof(tmp) - sizeof(tmp.data))) 1568 + if (copy_from_user(param_kernel, user, minimum_data_size)) 1617 1569 return -EFAULT; 1618 1570 1619 - if (tmp.data_size < (sizeof(tmp) - sizeof(tmp.data))) 1571 + if (param_kernel->data_size < minimum_data_size) 1620 1572 return -EINVAL; 1621 1573 1622 - secure_data = tmp.flags & DM_SECURE_DATA_FLAG; 1574 + secure_data = param_kernel->flags & DM_SECURE_DATA_FLAG; 1623 1575 1624 1576 *param_flags = secure_data ? 
DM_WIPE_BUFFER : 0; 1577 + 1578 + if (ioctl_flags & IOCTL_FLAGS_NO_PARAMS) { 1579 + dmi = param_kernel; 1580 + dmi->data_size = minimum_data_size; 1581 + goto data_copied; 1582 + } 1625 1583 1626 1584 /* 1627 1585 * Try to avoid low memory issues when a device is suspended. 1628 1586 * Use kmalloc() rather than vmalloc() when we can. 1629 1587 */ 1630 1588 dmi = NULL; 1631 - if (tmp.data_size <= KMALLOC_MAX_SIZE) 1632 - dmi = kmalloc(tmp.data_size, GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN); 1633 - 1634 - if (!dmi) { 1635 - dmi = __vmalloc(tmp.data_size, GFP_NOIO | __GFP_REPEAT | __GFP_HIGH, PAGE_KERNEL); 1636 - *param_flags |= DM_PARAMS_VMALLOC; 1589 + if (param_kernel->data_size <= KMALLOC_MAX_SIZE) { 1590 + dmi = kmalloc(param_kernel->data_size, GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN); 1591 + if (dmi) 1592 + *param_flags |= DM_PARAMS_KMALLOC; 1637 1593 } 1638 1594 1639 1595 if (!dmi) { 1640 - if (secure_data && clear_user(user, tmp.data_size)) 1596 + dmi = __vmalloc(param_kernel->data_size, GFP_NOIO | __GFP_REPEAT | __GFP_HIGH, PAGE_KERNEL); 1597 + if (dmi) 1598 + *param_flags |= DM_PARAMS_VMALLOC; 1599 + } 1600 + 1601 + if (!dmi) { 1602 + if (secure_data && clear_user(user, param_kernel->data_size)) 1641 1603 return -EFAULT; 1642 1604 return -ENOMEM; 1643 1605 } 1644 1606 1645 - if (copy_from_user(dmi, user, tmp.data_size)) 1607 + if (copy_from_user(dmi, user, param_kernel->data_size)) 1646 1608 goto bad; 1647 1609 1610 + data_copied: 1648 1611 /* 1649 1612 * Abort if something changed the ioctl data while it was being copied. 
1650 1613 */ 1651 - if (dmi->data_size != tmp.data_size) { 1614 + if (dmi->data_size != param_kernel->data_size) { 1652 1615 DMERR("rejecting ioctl: data size modified while processing parameters"); 1653 1616 goto bad; 1654 1617 } 1655 1618 1656 1619 /* Wipe the user buffer so we do not return it to userspace */ 1657 - if (secure_data && clear_user(user, tmp.data_size)) 1620 + if (secure_data && clear_user(user, param_kernel->data_size)) 1658 1621 goto bad; 1659 1622 1660 1623 *param = dmi; 1661 1624 return 0; 1662 1625 1663 1626 bad: 1664 - free_params(dmi, tmp.data_size, *param_flags); 1627 + free_params(dmi, param_kernel->data_size, *param_flags); 1665 1628 1666 1629 return -EFAULT; 1667 1630 } ··· 1686 1621 param->flags &= ~DM_BUFFER_FULL_FLAG; 1687 1622 param->flags &= ~DM_UEVENT_GENERATED_FLAG; 1688 1623 param->flags &= ~DM_SECURE_DATA_FLAG; 1624 + param->flags &= ~DM_DATA_OUT_FLAG; 1689 1625 1690 1626 /* Ignores parameters */ 1691 1627 if (cmd == DM_REMOVE_ALL_CMD || ··· 1714 1648 static int ctl_ioctl(uint command, struct dm_ioctl __user *user) 1715 1649 { 1716 1650 int r = 0; 1651 + int ioctl_flags; 1717 1652 int param_flags; 1718 1653 unsigned int cmd; 1719 1654 struct dm_ioctl *uninitialized_var(param); 1720 1655 ioctl_fn fn = NULL; 1721 1656 size_t input_param_size; 1657 + struct dm_ioctl param_kernel; 1722 1658 1723 1659 /* only root can play with this */ 1724 1660 if (!capable(CAP_SYS_ADMIN)) ··· 1745 1677 if (cmd == DM_VERSION_CMD) 1746 1678 return 0; 1747 1679 1748 - fn = lookup_ioctl(cmd); 1680 + fn = lookup_ioctl(cmd, &ioctl_flags); 1749 1681 if (!fn) { 1750 1682 DMWARN("dm_ctl_ioctl: unknown command 0x%x", command); 1751 1683 return -ENOTTY; ··· 1754 1686 /* 1755 1687 * Copy the parameters into kernel space. 
1756 1688 */ 1757 - r = copy_params(user, &param, &param_flags); 1689 + r = copy_params(user, &param_kernel, ioctl_flags, &param, &param_flags); 1758 1690 1759 1691 if (r) 1760 1692 return r; ··· 1766 1698 1767 1699 param->data_size = sizeof(*param); 1768 1700 r = fn(param, input_param_size); 1701 + 1702 + if (unlikely(param->flags & DM_BUFFER_FULL_FLAG) && 1703 + unlikely(ioctl_flags & IOCTL_FLAGS_NO_PARAMS)) 1704 + DMERR("ioctl %d tried to output some data but has IOCTL_FLAGS_NO_PARAMS set", cmd); 1769 1705 1770 1706 /* 1771 1707 * Copy the results back to userland.
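The reworked copy_params() above distinguishes three cases: IOCTL_FLAGS_NO_PARAMS commands reuse the small on-stack dm_ioctl, buffers up to KMALLOC_MAX_SIZE first try kmalloc (with GFP_NOIO | __GFP_NORETRY so it fails fast under memory pressure), and only then does the code fall back to vmalloc. The new DM_PARAMS_KMALLOC / DM_PARAMS_VMALLOC flags record which free path free_params() must take. A userspace sketch of the flag-tracking fallback, assuming a `try_small` hook to stand in for kmalloc failing (in this sketch both paths use malloc(), so free() works for either flag, unlike the kernel's kfree()/vfree() split):

```c
#include <stdlib.h>

#define DM_PARAMS_KMALLOC 0x0001
#define DM_PARAMS_VMALLOC 0x0002

/* Stand-in for KMALLOC_MAX_SIZE; the real value is arch-dependent. */
#define FAKE_KMALLOC_MAX 4096

/*
 * Mimics the two-step allocation in copy_params(): try the "kmalloc"
 * path for small sizes, and fall back to the "vmalloc" path for large
 * sizes or when the first attempt fails. The flag remembers which
 * allocator succeeded so the free path can match it.
 */
static void *alloc_params(size_t size, int *flags,
			  void *(*try_small)(size_t))
{
	void *p = NULL;

	if (size <= FAKE_KMALLOC_MAX) {
		p = try_small(size);
		if (p)
			*flags |= DM_PARAMS_KMALLOC;
	}
	if (!p) {
		p = malloc(size);	/* "vmalloc" fallback */
		if (p)
			*flags |= DM_PARAMS_VMALLOC;
	}
	return p;
}

/* Test hooks: a "kmalloc" that succeeds and one that simulates pressure. */
static void *small_ok(size_t n)   { return malloc(n); }
static void *small_fail(size_t n) { (void)n; return NULL; }
```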
+120 -1
drivers/md/dm-kcopyd.c
··· 22 22 #include <linux/vmalloc.h> 23 23 #include <linux/workqueue.h> 24 24 #include <linux/mutex.h> 25 + #include <linux/delay.h> 25 26 #include <linux/device-mapper.h> 26 27 #include <linux/dm-kcopyd.h> 27 28 ··· 52 51 struct workqueue_struct *kcopyd_wq; 53 52 struct work_struct kcopyd_work; 54 53 54 + struct dm_kcopyd_throttle *throttle; 55 + 55 56 /* 56 57 * We maintain three lists of jobs: 57 58 * ··· 70 67 }; 71 68 72 69 static struct page_list zero_page_list; 70 + 71 + static DEFINE_SPINLOCK(throttle_spinlock); 72 + 73 + /* 74 + * IO/IDLE accounting slowly decays after (1 << ACCOUNT_INTERVAL_SHIFT) period. 75 + * When total_period >= (1 << ACCOUNT_INTERVAL_SHIFT) the counters are divided 76 + * by 2. 77 + */ 78 + #define ACCOUNT_INTERVAL_SHIFT SHIFT_HZ 79 + 80 + /* 81 + * Sleep this number of milliseconds. 82 + * 83 + * The value was decided experimentally. 84 + * Smaller values seem to cause an increased copy rate above the limit. 85 + * The reason for this is unknown but possibly due to jiffies rounding errors 86 + * or read/write cache inside the disk. 87 + */ 88 + #define SLEEP_MSEC 100 89 + 90 + /* 91 + * Maximum number of sleep events. There is a theoretical livelock if more 92 + * kcopyd clients do work simultaneously which this limit avoids. 93 + */ 94 + #define MAX_SLEEPS 10 95 + 96 + static void io_job_start(struct dm_kcopyd_throttle *t) 97 + { 98 + unsigned throttle, now, difference; 99 + int slept = 0, skew; 100 + 101 + if (unlikely(!t)) 102 + return; 103 + 104 + try_again: 105 + spin_lock_irq(&throttle_spinlock); 106 + 107 + throttle = ACCESS_ONCE(t->throttle); 108 + 109 + if (likely(throttle >= 100)) 110 + goto skip_limit; 111 + 112 + now = jiffies; 113 + difference = now - t->last_jiffies; 114 + t->last_jiffies = now; 115 + if (t->num_io_jobs) 116 + t->io_period += difference; 117 + t->total_period += difference; 118 + 119 + /* 120 + * Maintain sane values if we got a temporary overflow. 
121 + */ 122 + if (unlikely(t->io_period > t->total_period)) 123 + t->io_period = t->total_period; 124 + 125 + if (unlikely(t->total_period >= (1 << ACCOUNT_INTERVAL_SHIFT))) { 126 + int shift = fls(t->total_period >> ACCOUNT_INTERVAL_SHIFT); 127 + t->total_period >>= shift; 128 + t->io_period >>= shift; 129 + } 130 + 131 + skew = t->io_period - throttle * t->total_period / 100; 132 + 133 + if (unlikely(skew > 0) && slept < MAX_SLEEPS) { 134 + slept++; 135 + spin_unlock_irq(&throttle_spinlock); 136 + msleep(SLEEP_MSEC); 137 + goto try_again; 138 + } 139 + 140 + skip_limit: 141 + t->num_io_jobs++; 142 + 143 + spin_unlock_irq(&throttle_spinlock); 144 + } 145 + 146 + static void io_job_finish(struct dm_kcopyd_throttle *t) 147 + { 148 + unsigned long flags; 149 + 150 + if (unlikely(!t)) 151 + return; 152 + 153 + spin_lock_irqsave(&throttle_spinlock, flags); 154 + 155 + t->num_io_jobs--; 156 + 157 + if (likely(ACCESS_ONCE(t->throttle) >= 100)) 158 + goto skip_limit; 159 + 160 + if (!t->num_io_jobs) { 161 + unsigned now, difference; 162 + 163 + now = jiffies; 164 + difference = now - t->last_jiffies; 165 + t->last_jiffies = now; 166 + 167 + t->io_period += difference; 168 + t->total_period += difference; 169 + 170 + /* 171 + * Maintain sane values if we got a temporary overflow. 
172 + */ 173 + if (unlikely(t->io_period > t->total_period)) 174 + t->io_period = t->total_period; 175 + } 176 + 177 + skip_limit: 178 + spin_unlock_irqrestore(&throttle_spinlock, flags); 179 + } 180 + 73 181 74 182 static void wake(struct dm_kcopyd_client *kc) 75 183 { ··· 462 348 struct kcopyd_job *job = (struct kcopyd_job *) context; 463 349 struct dm_kcopyd_client *kc = job->kc; 464 350 351 + io_job_finish(kc->throttle); 352 + 465 353 if (error) { 466 354 if (job->rw & WRITE) 467 355 job->write_err |= error; ··· 504 388 .notify.context = job, 505 389 .client = job->kc->io_client, 506 390 }; 391 + 392 + io_job_start(job->kc->throttle); 507 393 508 394 if (job->rw == READ) 509 395 r = dm_io(&io_req, 1, &job->source, NULL); ··· 813 695 /*----------------------------------------------------------------- 814 696 * Client setup 815 697 *---------------------------------------------------------------*/ 816 - struct dm_kcopyd_client *dm_kcopyd_client_create(void) 698 + struct dm_kcopyd_client *dm_kcopyd_client_create(struct dm_kcopyd_throttle *throttle) 817 699 { 818 700 int r = -ENOMEM; 819 701 struct dm_kcopyd_client *kc; ··· 826 708 INIT_LIST_HEAD(&kc->complete_jobs); 827 709 INIT_LIST_HEAD(&kc->io_jobs); 828 710 INIT_LIST_HEAD(&kc->pages_jobs); 711 + kc->throttle = throttle; 829 712 830 713 kc->job_pool = mempool_create_slab_pool(MIN_JOBS, _job_cache); 831 714 if (!kc->job_pool)
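The kcopyd throttle above works by keeping two slowly decaying counters: total_period (elapsed wall time) and io_period (time with at least one copy I/O in flight), both scaled down whenever total_period crosses 1 << ACCOUNT_INTERVAL_SHIFT. A job is delayed while the skew, io_period minus the permitted percentage of total_period, is positive. A hedged userspace sketch of one accounting step, with jiffies replaced by a plain tick count; `throttle_account` is my name, and the loop-based decay stands in for the kernel's fls()-derived shift:

```c
#define ACCOUNT_INTERVAL_SHIFT 8	/* the kernel uses SHIFT_HZ */

struct throttle_state {
	unsigned throttle;	/* permitted busy time, in percent */
	unsigned num_io_jobs;	/* copy I/Os currently in flight */
	unsigned io_period;	/* decayed busy time */
	unsigned total_period;	/* decayed wall time */
};

/*
 * One accounting step, as in io_job_start(): advance both periods by
 * `elapsed` ticks (io_period only while jobs are in flight), decay
 * once the accounting interval fills, and return the skew. A positive
 * skew means the client is over its throttle percentage and should
 * sleep before starting the next job.
 */
static int throttle_account(struct throttle_state *t, unsigned elapsed)
{
	if (t->num_io_jobs)
		t->io_period += elapsed;
	t->total_period += elapsed;

	/* Maintain sane values if we got a temporary overflow. */
	if (t->io_period > t->total_period)
		t->io_period = t->total_period;

	/* Slow decay: halve both counters until the interval fits again. */
	while (t->total_period >= (1u << ACCOUNT_INTERVAL_SHIFT)) {
		t->total_period >>= 1;
		t->io_period >>= 1;
	}

	return (int)t->io_period
		- (int)(t->throttle * t->total_period / 100);
}
```

With throttle = 50 and the device busy the whole interval, the skew comes out at half the elapsed time, so the client sleeps; an idle interval drives the skew negative and copies proceed immediately.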
+6 -7
drivers/md/dm-linear.c
···
53 53 goto bad;
54 54 }
55 55
56 - ti->num_flush_requests = 1;
57 - ti->num_discard_requests = 1;
58 - ti->num_write_same_requests = 1;
56 + ti->num_flush_bios = 1;
57 + ti->num_discard_bios = 1;
58 + ti->num_write_same_bios = 1;
59 59 ti->private = lc;
60 60 return 0;
61 61
···
95 95 return DM_MAPIO_REMAPPED;
96 96 }
97 97
98 - static int linear_status(struct dm_target *ti, status_type_t type,
99 - unsigned status_flags, char *result, unsigned maxlen)
98 + static void linear_status(struct dm_target *ti, status_type_t type,
99 + unsigned status_flags, char *result, unsigned maxlen)
100 100 {
101 101 struct linear_c *lc = (struct linear_c *) ti->private;
102 102
···
110 110 (unsigned long long)lc->start);
111 111 break;
112 112 }
113 - return 0;
114 113 }
115 114
116 115 static int linear_ioctl(struct dm_target *ti, unsigned int cmd,
···
154 155
155 156 static struct target_type linear_target = {
156 157 .name = "linear",
157 - .version = {1, 2, 0},
158 + .version = {1, 2, 1},
158 159 .module = THIS_MODULE,
159 160 .ctr = linear_ctr,
160 161 .dtr = linear_dtr,
+5 -7
drivers/md/dm-mpath.c
···
905 905 goto bad;
906 906 }
907 907
908 - ti->num_flush_requests = 1;
909 - ti->num_discard_requests = 1;
908 + ti->num_flush_bios = 1;
909 + ti->num_discard_bios = 1;
910 910
911 911 return 0;
912 912
···
1378 1378 * [priority selector-name num_ps_args [ps_args]*
1379 1379 * num_paths num_selector_args [path_dev [selector_args]* ]+ ]+
1380 1380 */
1381 - static int multipath_status(struct dm_target *ti, status_type_t type,
1382 - unsigned status_flags, char *result, unsigned maxlen)
1381 + static void multipath_status(struct dm_target *ti, status_type_t type,
1382 + unsigned status_flags, char *result, unsigned maxlen)
1383 1383 {
1384 1384 int sz = 0;
1385 1385 unsigned long flags;
···
1485 1485 }
1486 1486
1487 1487 spin_unlock_irqrestore(&m->lock, flags);
1488 -
1489 - return 0;
1490 1488 }
1491 1489
1492 1490 static int multipath_message(struct dm_target *ti, unsigned argc, char **argv)
···
1693 1695 *---------------------------------------------------------------*/
1694 1696 static struct target_type multipath_target = {
1695 1697 .name = "multipath",
1696 - .version = {1, 5, 0},
1698 + .version = {1, 5, 1},
1697 1699 .module = THIS_MODULE,
1698 1700 .ctr = multipath_ctr,
1699 1701 .dtr = multipath_dtr,
+4 -6
drivers/md/dm-raid.c
···
1151 1151
1152 1152 INIT_WORK(&rs->md.event_work, do_table_event);
1153 1153 ti->private = rs;
1154 - ti->num_flush_requests = 1;
1154 + ti->num_flush_bios = 1;
1155 1155
1156 1156 mutex_lock(&rs->md.reconfig_mutex);
1157 1157 ret = md_run(&rs->md);
···
1201 1201 return DM_MAPIO_SUBMITTED;
1202 1202 }
1203 1203
1204 - static int raid_status(struct dm_target *ti, status_type_t type,
1205 - unsigned status_flags, char *result, unsigned maxlen)
1204 + static void raid_status(struct dm_target *ti, status_type_t type,
1205 + unsigned status_flags, char *result, unsigned maxlen)
1206 1206 {
1207 1207 struct raid_set *rs = ti->private;
1208 1208 unsigned raid_param_cnt = 1; /* at least 1 for chunksize */
···
1344 1344 DMEMIT(" -");
1345 1345 }
1346 1346 }
1347 -
1348 - return 0;
1349 1347 }
1350 1348
1351 1349 static int raid_iterate_devices(struct dm_target *ti, iterate_devices_callout_fn fn, void *data)
···
1403 1405
1404 1406 static struct target_type raid_target = {
1405 1407 .name = "raid",
1406 - .version = {1, 4, 1},
1408 + .version = {1, 4, 2},
1407 1409 .module = THIS_MODULE,
1408 1410 .ctr = raid_ctr,
1409 1411 .dtr = raid_dtr,
+9 -8
drivers/md/dm-raid1.c
···
82 82 struct mirror mirror[0];
83 83 };
84 84
85 + DECLARE_DM_KCOPYD_THROTTLE_WITH_MODULE_PARM(raid1_resync_throttle,
86 + "A percentage of time allocated for raid resynchronization");
87 +
85 88 static void wakeup_mirrord(void *context)
86 89 {
87 90 struct mirror_set *ms = context;
···
1075 1072 if (r)
1076 1073 goto err_free_context;
1077 1074
1078 - ti->num_flush_requests = 1;
1079 - ti->num_discard_requests = 1;
1075 + ti->num_flush_bios = 1;
1076 + ti->num_discard_bios = 1;
1080 1077 ti->per_bio_data_size = sizeof(struct dm_raid1_bio_record);
1081 1078 ti->discard_zeroes_data_unsupported = true;
1082 1079
···
1114 1111 goto err_destroy_wq;
1115 1112 }
1116 1113
1117 - ms->kcopyd_client = dm_kcopyd_client_create();
1114 + ms->kcopyd_client = dm_kcopyd_client_create(&dm_kcopyd_throttle);
1118 1115 if (IS_ERR(ms->kcopyd_client)) {
1119 1116 r = PTR_ERR(ms->kcopyd_client);
1120 1117 goto err_destroy_wq;
···
1350 1347 }
1351 1348
1352 1349
1353 - static int mirror_status(struct dm_target *ti, status_type_t type,
1354 - unsigned status_flags, char *result, unsigned maxlen)
1350 + static void mirror_status(struct dm_target *ti, status_type_t type,
1351 + unsigned status_flags, char *result, unsigned maxlen)
1355 1352 {
1356 1353 unsigned int m, sz = 0;
1357 1354 struct mirror_set *ms = (struct mirror_set *) ti->private;
···
1386 1383 if (ms->features & DM_RAID1_HANDLE_ERRORS)
1387 1384 DMEMIT(" 1 handle_errors");
1388 1385 }
1389 -
1390 - return 0;
1391 1386 }
1392 1387
1393 1388 static int mirror_iterate_devices(struct dm_target *ti,
···
1404 1403
1405 1404 static struct target_type mirror_target = {
1406 1405 .name = "mirror",
1407 - .version = {1, 13, 1},
1406 + .version = {1, 13, 2},
1408 1407 .module = THIS_MODULE,
1409 1408 .ctr = mirror_ctr,
1410 1409 .dtr = mirror_dtr,
+17 -16
drivers/md/dm-snap.c
···
124 124 #define RUNNING_MERGE 0
125 125 #define SHUTDOWN_MERGE 1
126 126
127 + DECLARE_DM_KCOPYD_THROTTLE_WITH_MODULE_PARM(snapshot_copy_throttle,
128 + "A percentage of time allocated for copy on write");
129 +
127 130 struct dm_dev *dm_snap_origin(struct dm_snapshot *s)
128 131 {
129 132 return s->origin;
···
1040 1037 int i;
1041 1038 int r = -EINVAL;
1042 1039 char *origin_path, *cow_path;
1043 - unsigned args_used, num_flush_requests = 1;
1040 + unsigned args_used, num_flush_bios = 1;
1044 1041 fmode_t origin_mode = FMODE_READ;
1045 1042
1046 1043 if (argc != 4) {
···
1050 1047 }
1051 1048
1052 1049 if (dm_target_is_snapshot_merge(ti)) {
1053 - num_flush_requests = 2;
1050 + num_flush_bios = 2;
1054 1051 origin_mode = FMODE_WRITE;
1055 1052 }
1056 1053
···
1111 1108 goto bad_hash_tables;
1112 1109 }
1113 1110
1114 - s->kcopyd_client = dm_kcopyd_client_create();
1111 + s->kcopyd_client = dm_kcopyd_client_create(&dm_kcopyd_throttle);
1115 1112 if (IS_ERR(s->kcopyd_client)) {
1116 1113 r = PTR_ERR(s->kcopyd_client);
1117 1114 ti->error = "Could not create kcopyd client";
···
1130 1127 spin_lock_init(&s->tracked_chunk_lock);
1131 1128
1132 1129 ti->private = s;
1133 - ti->num_flush_requests = num_flush_requests;
1130 + ti->num_flush_bios = num_flush_bios;
1134 1131 ti->per_bio_data_size = sizeof(struct dm_snap_tracked_chunk);
1135 1132
1136 1133 /* Add snapshot to the list of snapshots for this origin */
···
1694 1691 init_tracked_chunk(bio);
1695 1692
1696 1693 if (bio->bi_rw & REQ_FLUSH) {
1697 - if (!dm_bio_get_target_request_nr(bio))
1694 + if (!dm_bio_get_target_bio_nr(bio))
1698 1695 bio->bi_bdev = s->origin->bdev;
1699 1696 else
1700 1697 bio->bi_bdev = s->cow->bdev;
···
1839 1836 start_merge(s);
1840 1837 }
1841 1838
1842 - static int snapshot_status(struct dm_target *ti, status_type_t type,
1843 - unsigned status_flags, char *result, unsigned maxlen)
1839 + static void snapshot_status(struct dm_target *ti, status_type_t type,
1840 + unsigned status_flags, char *result, unsigned maxlen)
1844 1841 {
1845 1842 unsigned sz = 0;
1846 1843 struct dm_snapshot *snap = ti->private;
···
1886 1883 maxlen - sz);
1887 1884 break;
1888 1885 }
1889 -
1890 - return 0;
1891 1886 }
1892 1887
1893 1888 static int snapshot_iterate_devices(struct dm_target *ti,
···
2105 2104 }
2106 2105
2107 2106 ti->private = dev;
2108 - ti->num_flush_requests = 1;
2107 + ti->num_flush_bios = 1;
2109 2108
2110 2109 return 0;
2111 2110 }
···
2139 2138 ti->max_io_len = get_origin_minimum_chunksize(dev->bdev);
2140 2139 }
2141 2140
2142 - static int origin_status(struct dm_target *ti, status_type_t type,
2143 - unsigned status_flags, char *result, unsigned maxlen)
2141 + static void origin_status(struct dm_target *ti, status_type_t type,
2142 + unsigned status_flags, char *result, unsigned maxlen)
2144 2143 {
2145 2144 struct dm_dev *dev = ti->private;
2146 2145
···
2153 2152 snprintf(result, maxlen, "%s", dev->name);
2154 2153 break;
2155 2154 }
2156 -
2157 - return 0;
2158 2155 }
2159 2156
2160 2157 static int origin_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
···
2179 2180
2180 2181 static struct target_type origin_target = {
2181 2182 .name = "snapshot-origin",
2182 - .version = {1, 8, 0},
2183 + .version = {1, 8, 1},
2183 2184 .module = THIS_MODULE,
2184 2185 .ctr = origin_ctr,
2185 2186 .dtr = origin_dtr,
···
2192 2193
2193 2194 static struct target_type snapshot_target = {
2194 2195 .name = "snapshot",
2195 - .version = {1, 11, 0},
2196 + .version = {1, 11, 1},
2196 2197 .module = THIS_MODULE,
2197 2198 .ctr = snapshot_ctr,
2198 2199 .dtr = snapshot_dtr,
···
2305 2306 MODULE_DESCRIPTION(DM_NAME " snapshot target");
2306 2307 MODULE_AUTHOR("Joe Thornber");
2307 2308 MODULE_LICENSE("GPL");
2309 + MODULE_ALIAS("dm-snapshot-origin");
2310 + MODULE_ALIAS("dm-snapshot-merge");
+13 -14
drivers/md/dm-stripe.c
···
160 160 if (r)
161 161 return r;
162 162
163 - ti->num_flush_requests = stripes;
164 - ti->num_discard_requests = stripes;
165 - ti->num_write_same_requests = stripes;
163 + ti->num_flush_bios = stripes;
164 + ti->num_discard_bios = stripes;
165 + ti->num_write_same_bios = stripes;
166 166
167 167 sc->chunk_size = chunk_size;
168 168 if (chunk_size & (chunk_size - 1))
···
276 276 {
277 277 struct stripe_c *sc = ti->private;
278 278 uint32_t stripe;
279 - unsigned target_request_nr;
279 + unsigned target_bio_nr;
280 280
281 281 if (bio->bi_rw & REQ_FLUSH) {
282 - target_request_nr = dm_bio_get_target_request_nr(bio);
283 - BUG_ON(target_request_nr >= sc->stripes);
284 - bio->bi_bdev = sc->stripe[target_request_nr].dev->bdev;
282 + target_bio_nr = dm_bio_get_target_bio_nr(bio);
283 + BUG_ON(target_bio_nr >= sc->stripes);
284 + bio->bi_bdev = sc->stripe[target_bio_nr].dev->bdev;
285 285 return DM_MAPIO_REMAPPED;
286 286 }
287 287 if (unlikely(bio->bi_rw & REQ_DISCARD) ||
288 288 unlikely(bio->bi_rw & REQ_WRITE_SAME)) {
289 - target_request_nr = dm_bio_get_target_request_nr(bio);
290 - BUG_ON(target_request_nr >= sc->stripes);
291 - return stripe_map_range(sc, bio, target_request_nr);
289 + target_bio_nr = dm_bio_get_target_bio_nr(bio);
290 + BUG_ON(target_bio_nr >= sc->stripes);
291 + return stripe_map_range(sc, bio, target_bio_nr);
292 292 }
293 293
294 294 stripe_map_sector(sc, bio->bi_sector, &stripe, &bio->bi_sector);
···
312 312 *
313 313 */
314 314
315 - static int stripe_status(struct dm_target *ti, status_type_t type,
316 - unsigned status_flags, char *result, unsigned maxlen)
315 + static void stripe_status(struct dm_target *ti, status_type_t type,
316 + unsigned status_flags, char *result, unsigned maxlen)
317 317 {
318 318 struct stripe_c *sc = (struct stripe_c *) ti->private;
319 319 char buffer[sc->stripes + 1];
···
340 340 (unsigned long long)sc->stripe[i].physical_start);
341 341 break;
342 342 }
343 - return 0;
344 343 }
345 344
346 345 static int stripe_end_io(struct dm_target *ti, struct bio *bio, int error)
···
427 428
428 429 static struct target_type stripe_target = {
429 430 .name = "striped",
430 - .version = {1, 5, 0},
431 + .version = {1, 5, 1},
431 432 .module = THIS_MODULE,
432 433 .ctr = stripe_ctr,
433 434 .dtr = stripe_dtr,
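Editor's note: the `stripe_map_sector()` call above picks the stripe device and the sector within it; the helper itself is outside this hunk, so the following is only a hedged model of classic round-robin chunk striping (function and parameter names are ours, not the kernel's):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of dm-stripe's sector mapping: the target is
 * carved into fixed-size chunks that are dealt round-robin across
 * the stripe devices. */
static void stripe_map(uint64_t sector, uint32_t chunk_size /* sectors */,
                       uint32_t stripes, uint32_t *stripe,
                       uint64_t *stripe_sector)
{
    uint64_t chunk = sector / chunk_size;   /* which chunk overall */
    uint32_t offset = sector % chunk_size;  /* offset inside that chunk */

    *stripe = chunk % stripes;              /* round-robin device choice */
    /* sector within the chosen stripe device */
    *stripe_sector = (chunk / stripes) * chunk_size + offset;
}
```

With a power-of-two `chunk_size`, the divisions reduce to shifts and masks, which is why the constructor above rejects non-power-of-two chunk sizes with `chunk_size & (chunk_size - 1)`.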
+5 -6
drivers/md/dm-table.c
···
217 217
218 218 if (alloc_targets(t, num_targets)) {
219 219 kfree(t);
220 - t = NULL;
221 220 return -ENOMEM;
222 221 }
223 222
···
822 823
823 824 t->highs[t->num_targets++] = tgt->begin + tgt->len - 1;
824 825
825 - if (!tgt->num_discard_requests && tgt->discards_supported)
826 - DMWARN("%s: %s: ignoring discards_supported because num_discard_requests is zero.",
826 + if (!tgt->num_discard_bios && tgt->discards_supported)
827 + DMWARN("%s: %s: ignoring discards_supported because num_discard_bios is zero.",
827 828 dm_device_name(t->md), type);
828 829
829 830 return 0;
···
1359 1360 while (i < dm_table_get_num_targets(t)) {
1360 1361 ti = dm_table_get_target(t, i++);
1361 1362
1362 - if (!ti->num_flush_requests)
1363 + if (!ti->num_flush_bios)
1363 1364 continue;
1364 1365
1365 1366 if (ti->flush_supported)
···
1438 1439 while (i < dm_table_get_num_targets(t)) {
1439 1440 ti = dm_table_get_target(t, i++);
1440 1441
1441 - if (!ti->num_write_same_requests)
1442 + if (!ti->num_write_same_bios)
1442 1443 return false;
1443 1444
1444 1445 if (!ti->type->iterate_devices ||
···
1656 1657 while (i < dm_table_get_num_targets(t)) {
1657 1658 ti = dm_table_get_target(t, i++);
1658 1659
1659 - if (!ti->num_discard_requests)
1660 + if (!ti->num_discard_bios)
1660 1661 continue;
1661 1662
1662 1663 if (ti->discards_supported)
+1 -1
drivers/md/dm-target.c
···
116 116 /*
117 117 * Return error for discards instead of -EOPNOTSUPP
118 118 */
119 - tt->num_discard_requests = 1;
119 + tt->num_discard_bios = 1;
120 120
121 121 return 0;
122 122 }
+6 -6
drivers/md/dm-thin-metadata.c
···
280 280 *t = v & ((1 << 24) - 1);
281 281 }
282 282
283 - static void data_block_inc(void *context, void *value_le)
283 + static void data_block_inc(void *context, const void *value_le)
284 284 {
285 285 struct dm_space_map *sm = context;
286 286 __le64 v_le;
···
292 292 dm_sm_inc_block(sm, b);
293 293 }
294 294
295 - static void data_block_dec(void *context, void *value_le)
295 + static void data_block_dec(void *context, const void *value_le)
296 296 {
297 297 struct dm_space_map *sm = context;
298 298 __le64 v_le;
···
304 304 dm_sm_dec_block(sm, b);
305 305 }
306 306
307 - static int data_block_equal(void *context, void *value1_le, void *value2_le)
307 + static int data_block_equal(void *context, const void *value1_le, const void *value2_le)
308 308 {
309 309 __le64 v1_le, v2_le;
310 310 uint64_t b1, b2;
···
318 318 return b1 == b2;
319 319 }
320 320
321 - static void subtree_inc(void *context, void *value)
321 + static void subtree_inc(void *context, const void *value)
322 322 {
323 323 struct dm_btree_info *info = context;
324 324 __le64 root_le;
···
329 329 dm_tm_inc(info->tm, root);
330 330 }
331 331
332 - static void subtree_dec(void *context, void *value)
332 + static void subtree_dec(void *context, const void *value)
333 333 {
334 334 struct dm_btree_info *info = context;
335 335 __le64 root_le;
···
341 341 DMERR("btree delete failed\n");
342 342 }
343 343
344 - static int subtree_equal(void *context, void *value1_le, void *value2_le)
344 + static int subtree_equal(void *context, const void *value1_le, const void *value2_le)
345 345 {
346 346 __le64 v1_le, v2_le;
347 347 memcpy(&v1_le, value1_le, sizeof(v1_le));
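Editor's note: the first context lines of this hunk show the tail of dm-thin's block/time unpacking, where one 64-bit metadata value carries a block number in the high bits and a 24-bit time stamp in the low bits. A standalone sketch of that packing (the `pack_block_time` inverse is our own helper, not from this hunk):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of dm-thin's block/time packing: one 64-bit value holds the
 * data block number shifted past a 24-bit time stamp. The unpack logic
 * mirrors the hunk above; the pack helper is our assumed inverse. */
static uint64_t pack_block_time(uint64_t block, uint32_t time)
{
    return (block << 24) | (time & ((1u << 24) - 1));
}

static void unpack_block_time(uint64_t v, uint64_t *b, uint32_t *t)
{
    *b = v >> 24;                  /* block number in the high bits */
    *t = v & ((1u << 24) - 1);     /* 24-bit time stamp in the low bits */
}
```

The `const` qualifiers added by this hunk matter because these callbacks now receive pointers directly into read-only btree value arrays.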
+180 -97
drivers/md/dm-thin.c
··· 26 26 #define PRISON_CELLS 1024 27 27 #define COMMIT_PERIOD HZ 28 28 29 + DECLARE_DM_KCOPYD_THROTTLE_WITH_MODULE_PARM(snapshot_copy_throttle, 30 + "A percentage of time allocated for copy on write"); 31 + 29 32 /* 30 33 * The block size of the device holding pool data must be 31 34 * between 64KB and 1GB. ··· 230 227 /*----------------------------------------------------------------*/ 231 228 232 229 /* 230 + * wake_worker() is used when new work is queued and when pool_resume is 231 + * ready to continue deferred IO processing. 232 + */ 233 + static void wake_worker(struct pool *pool) 234 + { 235 + queue_work(pool->wq, &pool->worker); 236 + } 237 + 238 + /*----------------------------------------------------------------*/ 239 + 240 + static int bio_detain(struct pool *pool, struct dm_cell_key *key, struct bio *bio, 241 + struct dm_bio_prison_cell **cell_result) 242 + { 243 + int r; 244 + struct dm_bio_prison_cell *cell_prealloc; 245 + 246 + /* 247 + * Allocate a cell from the prison's mempool. 248 + * This might block but it can't fail. 249 + */ 250 + cell_prealloc = dm_bio_prison_alloc_cell(pool->prison, GFP_NOIO); 251 + 252 + r = dm_bio_detain(pool->prison, key, bio, cell_prealloc, cell_result); 253 + if (r) 254 + /* 255 + * We reused an old cell; we can get rid of 256 + * the new one. 
257 + */ 258 + dm_bio_prison_free_cell(pool->prison, cell_prealloc); 259 + 260 + return r; 261 + } 262 + 263 + static void cell_release(struct pool *pool, 264 + struct dm_bio_prison_cell *cell, 265 + struct bio_list *bios) 266 + { 267 + dm_cell_release(pool->prison, cell, bios); 268 + dm_bio_prison_free_cell(pool->prison, cell); 269 + } 270 + 271 + static void cell_release_no_holder(struct pool *pool, 272 + struct dm_bio_prison_cell *cell, 273 + struct bio_list *bios) 274 + { 275 + dm_cell_release_no_holder(pool->prison, cell, bios); 276 + dm_bio_prison_free_cell(pool->prison, cell); 277 + } 278 + 279 + static void cell_defer_no_holder_no_free(struct thin_c *tc, 280 + struct dm_bio_prison_cell *cell) 281 + { 282 + struct pool *pool = tc->pool; 283 + unsigned long flags; 284 + 285 + spin_lock_irqsave(&pool->lock, flags); 286 + dm_cell_release_no_holder(pool->prison, cell, &pool->deferred_bios); 287 + spin_unlock_irqrestore(&pool->lock, flags); 288 + 289 + wake_worker(pool); 290 + } 291 + 292 + static void cell_error(struct pool *pool, 293 + struct dm_bio_prison_cell *cell) 294 + { 295 + dm_cell_error(pool->prison, cell); 296 + dm_bio_prison_free_cell(pool->prison, cell); 297 + } 298 + 299 + /*----------------------------------------------------------------*/ 300 + 301 + /* 233 302 * A global list of pools that uses a struct mapped_device as a key. 234 303 */ 235 304 static struct dm_thin_pool_table { ··· 405 330 * target. 
406 331 */ 407 332 333 + static bool block_size_is_power_of_two(struct pool *pool) 334 + { 335 + return pool->sectors_per_block_shift >= 0; 336 + } 337 + 408 338 static dm_block_t get_bio_block(struct thin_c *tc, struct bio *bio) 409 339 { 340 + struct pool *pool = tc->pool; 410 341 sector_t block_nr = bio->bi_sector; 411 342 412 - if (tc->pool->sectors_per_block_shift < 0) 413 - (void) sector_div(block_nr, tc->pool->sectors_per_block); 343 + if (block_size_is_power_of_two(pool)) 344 + block_nr >>= pool->sectors_per_block_shift; 414 345 else 415 - block_nr >>= tc->pool->sectors_per_block_shift; 346 + (void) sector_div(block_nr, pool->sectors_per_block); 416 347 417 348 return block_nr; 418 349 } ··· 429 348 sector_t bi_sector = bio->bi_sector; 430 349 431 350 bio->bi_bdev = tc->pool_dev->bdev; 432 - if (tc->pool->sectors_per_block_shift < 0) 433 - bio->bi_sector = (block * pool->sectors_per_block) + 434 - sector_div(bi_sector, pool->sectors_per_block); 435 - else 351 + if (block_size_is_power_of_two(pool)) 436 352 bio->bi_sector = (block << pool->sectors_per_block_shift) | 437 353 (bi_sector & (pool->sectors_per_block - 1)); 354 + else 355 + bio->bi_sector = (block * pool->sectors_per_block) + 356 + sector_div(bi_sector, pool->sectors_per_block); 438 357 } 439 358 440 359 static void remap_to_origin(struct thin_c *tc, struct bio *bio) ··· 499 418 { 500 419 remap(tc, bio, block); 501 420 issue(tc, bio); 502 - } 503 - 504 - /* 505 - * wake_worker() is used when new work is queued and when pool_resume is 506 - * ready to continue deferred IO processing. 
507 - */ 508 - static void wake_worker(struct pool *pool) 509 - { 510 - queue_work(pool->wq, &pool->worker); 511 421 } 512 422 513 423 /*----------------------------------------------------------------*/ ··· 587 515 unsigned long flags; 588 516 589 517 spin_lock_irqsave(&pool->lock, flags); 590 - dm_cell_release(cell, &pool->deferred_bios); 518 + cell_release(pool, cell, &pool->deferred_bios); 591 519 spin_unlock_irqrestore(&tc->pool->lock, flags); 592 520 593 521 wake_worker(pool); 594 522 } 595 523 596 524 /* 597 - * Same as cell_defer except it omits the original holder of the cell. 525 + * Same as cell_defer above, except it omits the original holder of the cell. 598 526 */ 599 527 static void cell_defer_no_holder(struct thin_c *tc, struct dm_bio_prison_cell *cell) 600 528 { ··· 602 530 unsigned long flags; 603 531 604 532 spin_lock_irqsave(&pool->lock, flags); 605 - dm_cell_release_no_holder(cell, &pool->deferred_bios); 533 + cell_release_no_holder(pool, cell, &pool->deferred_bios); 606 534 spin_unlock_irqrestore(&pool->lock, flags); 607 535 608 536 wake_worker(pool); ··· 612 540 { 613 541 if (m->bio) 614 542 m->bio->bi_end_io = m->saved_bi_end_io; 615 - dm_cell_error(m->cell); 543 + cell_error(m->tc->pool, m->cell); 616 544 list_del(&m->list); 617 545 mempool_free(m, m->tc->pool->mapping_pool); 618 546 } 547 + 619 548 static void process_prepared_mapping(struct dm_thin_new_mapping *m) 620 549 { 621 550 struct thin_c *tc = m->tc; 551 + struct pool *pool = tc->pool; 622 552 struct bio *bio; 623 553 int r; 624 554 ··· 629 555 bio->bi_end_io = m->saved_bi_end_io; 630 556 631 557 if (m->err) { 632 - dm_cell_error(m->cell); 558 + cell_error(pool, m->cell); 633 559 goto out; 634 560 } 635 561 ··· 641 567 r = dm_thin_insert_block(tc->td, m->virt_block, m->data_block); 642 568 if (r) { 643 569 DMERR_LIMIT("dm_thin_insert_block() failed"); 644 - dm_cell_error(m->cell); 570 + cell_error(pool, m->cell); 645 571 goto out; 646 572 } 647 573 ··· 659 585 660 586 out: 661 587 
list_del(&m->list); 662 - mempool_free(m, tc->pool->mapping_pool); 588 + mempool_free(m, pool->mapping_pool); 663 589 } 664 590 665 591 static void process_prepared_discard_fail(struct dm_thin_new_mapping *m) ··· 810 736 if (r < 0) { 811 737 mempool_free(m, pool->mapping_pool); 812 738 DMERR_LIMIT("dm_kcopyd_copy() failed"); 813 - dm_cell_error(cell); 739 + cell_error(pool, cell); 814 740 } 815 741 } 816 742 } ··· 876 802 if (r < 0) { 877 803 mempool_free(m, pool->mapping_pool); 878 804 DMERR_LIMIT("dm_kcopyd_zero() failed"); 879 - dm_cell_error(cell); 805 + cell_error(pool, cell); 880 806 } 881 807 } 882 808 } ··· 982 908 spin_unlock_irqrestore(&pool->lock, flags); 983 909 } 984 910 985 - static void no_space(struct dm_bio_prison_cell *cell) 911 + static void no_space(struct pool *pool, struct dm_bio_prison_cell *cell) 986 912 { 987 913 struct bio *bio; 988 914 struct bio_list bios; 989 915 990 916 bio_list_init(&bios); 991 - dm_cell_release(cell, &bios); 917 + cell_release(pool, cell, &bios); 992 918 993 919 while ((bio = bio_list_pop(&bios))) 994 920 retry_on_resume(bio); ··· 1006 932 struct dm_thin_new_mapping *m; 1007 933 1008 934 build_virtual_key(tc->td, block, &key); 1009 - if (dm_bio_detain(tc->pool->prison, &key, bio, &cell)) 935 + if (bio_detain(tc->pool, &key, bio, &cell)) 1010 936 return; 1011 937 1012 938 r = dm_thin_find_block(tc->td, block, 1, &lookup_result); ··· 1018 944 * on this block. 
1019 945 */ 1020 946 build_data_key(tc->td, lookup_result.block, &key2); 1021 - if (dm_bio_detain(tc->pool->prison, &key2, bio, &cell2)) { 947 + if (bio_detain(tc->pool, &key2, bio, &cell2)) { 1022 948 cell_defer_no_holder(tc, cell); 1023 949 break; 1024 950 } ··· 1094 1020 break; 1095 1021 1096 1022 case -ENOSPC: 1097 - no_space(cell); 1023 + no_space(tc->pool, cell); 1098 1024 break; 1099 1025 1100 1026 default: 1101 1027 DMERR_LIMIT("%s: alloc_data_block() failed: error = %d", 1102 1028 __func__, r); 1103 - dm_cell_error(cell); 1029 + cell_error(tc->pool, cell); 1104 1030 break; 1105 1031 } 1106 1032 } ··· 1118 1044 * of being broken so we have nothing further to do here. 1119 1045 */ 1120 1046 build_data_key(tc->td, lookup_result->block, &key); 1121 - if (dm_bio_detain(pool->prison, &key, bio, &cell)) 1047 + if (bio_detain(pool, &key, bio, &cell)) 1122 1048 return; 1123 1049 1124 1050 if (bio_data_dir(bio) == WRITE && bio->bi_size) ··· 1139 1065 { 1140 1066 int r; 1141 1067 dm_block_t data_block; 1068 + struct pool *pool = tc->pool; 1142 1069 1143 1070 /* 1144 1071 * Remap empty bios (flushes) immediately, without provisioning. 
1145 1072 */ 1146 1073 if (!bio->bi_size) { 1147 - inc_all_io_entry(tc->pool, bio); 1074 + inc_all_io_entry(pool, bio); 1148 1075 cell_defer_no_holder(tc, cell); 1149 1076 1150 1077 remap_and_issue(tc, bio, 0); ··· 1172 1097 break; 1173 1098 1174 1099 case -ENOSPC: 1175 - no_space(cell); 1100 + no_space(pool, cell); 1176 1101 break; 1177 1102 1178 1103 default: 1179 1104 DMERR_LIMIT("%s: alloc_data_block() failed: error = %d", 1180 1105 __func__, r); 1181 - set_pool_mode(tc->pool, PM_READ_ONLY); 1182 - dm_cell_error(cell); 1106 + set_pool_mode(pool, PM_READ_ONLY); 1107 + cell_error(pool, cell); 1183 1108 break; 1184 1109 } 1185 1110 } ··· 1187 1112 static void process_bio(struct thin_c *tc, struct bio *bio) 1188 1113 { 1189 1114 int r; 1115 + struct pool *pool = tc->pool; 1190 1116 dm_block_t block = get_bio_block(tc, bio); 1191 1117 struct dm_bio_prison_cell *cell; 1192 1118 struct dm_cell_key key; ··· 1198 1122 * being provisioned so we have nothing further to do here. 1199 1123 */ 1200 1124 build_virtual_key(tc->td, block, &key); 1201 - if (dm_bio_detain(tc->pool->prison, &key, bio, &cell)) 1125 + if (bio_detain(pool, &key, bio, &cell)) 1202 1126 return; 1203 1127 1204 1128 r = dm_thin_find_block(tc->td, block, 1, &lookup_result); ··· 1206 1130 case 0: 1207 1131 if (lookup_result.shared) { 1208 1132 process_shared_bio(tc, bio, block, &lookup_result); 1209 - cell_defer_no_holder(tc, cell); 1133 + cell_defer_no_holder(tc, cell); /* FIXME: pass this cell into process_shared? 
*/ 1210 1134 } else { 1211 - inc_all_io_entry(tc->pool, bio); 1135 + inc_all_io_entry(pool, bio); 1212 1136 cell_defer_no_holder(tc, cell); 1213 1137 1214 1138 remap_and_issue(tc, bio, lookup_result.block); ··· 1217 1141 1218 1142 case -ENODATA: 1219 1143 if (bio_data_dir(bio) == READ && tc->origin_dev) { 1220 - inc_all_io_entry(tc->pool, bio); 1144 + inc_all_io_entry(pool, bio); 1221 1145 cell_defer_no_holder(tc, cell); 1222 1146 1223 1147 remap_to_origin_and_issue(tc, bio); ··· 1454 1378 dm_block_t block = get_bio_block(tc, bio); 1455 1379 struct dm_thin_device *td = tc->td; 1456 1380 struct dm_thin_lookup_result result; 1457 - struct dm_bio_prison_cell *cell1, *cell2; 1381 + struct dm_bio_prison_cell cell1, cell2; 1382 + struct dm_bio_prison_cell *cell_result; 1458 1383 struct dm_cell_key key; 1459 1384 1460 1385 thin_hook_bio(tc, bio); ··· 1497 1420 } 1498 1421 1499 1422 build_virtual_key(tc->td, block, &key); 1500 - if (dm_bio_detain(tc->pool->prison, &key, bio, &cell1)) 1423 + if (dm_bio_detain(tc->pool->prison, &key, bio, &cell1, &cell_result)) 1501 1424 return DM_MAPIO_SUBMITTED; 1502 1425 1503 1426 build_data_key(tc->td, result.block, &key); 1504 - if (dm_bio_detain(tc->pool->prison, &key, bio, &cell2)) { 1505 - cell_defer_no_holder(tc, cell1); 1427 + if (dm_bio_detain(tc->pool->prison, &key, bio, &cell2, &cell_result)) { 1428 + cell_defer_no_holder_no_free(tc, &cell1); 1506 1429 return DM_MAPIO_SUBMITTED; 1507 1430 } 1508 1431 1509 1432 inc_all_io_entry(tc->pool, bio); 1510 - cell_defer_no_holder(tc, cell2); 1511 - cell_defer_no_holder(tc, cell1); 1433 + cell_defer_no_holder_no_free(tc, &cell2); 1434 + cell_defer_no_holder_no_free(tc, &cell1); 1512 1435 1513 1436 remap(tc, bio, result.block); 1514 1437 return DM_MAPIO_REMAPPED; ··· 1713 1636 goto bad_prison; 1714 1637 } 1715 1638 1716 - pool->copier = dm_kcopyd_client_create(); 1639 + pool->copier = dm_kcopyd_client_create(&dm_kcopyd_throttle); 1717 1640 if (IS_ERR(pool->copier)) { 1718 1641 r = 
PTR_ERR(pool->copier); 1719 1642 *error = "Error creating pool's kcopyd client"; ··· 2015 1938 pt->data_dev = data_dev; 2016 1939 pt->low_water_blocks = low_water_blocks; 2017 1940 pt->adjusted_pf = pt->requested_pf = pf; 2018 - ti->num_flush_requests = 1; 1941 + ti->num_flush_bios = 1; 2019 1942 2020 1943 /* 2021 1944 * Only need to enable discards if the pool should pass ··· 2023 1946 * processing will cause mappings to be removed from the btree. 2024 1947 */ 2025 1948 if (pf.discard_enabled && pf.discard_passdown) { 2026 - ti->num_discard_requests = 1; 1949 + ti->num_discard_bios = 1; 2027 1950 2028 1951 /* 2029 1952 * Setting 'discards_supported' circumvents the normal ··· 2376 2299 * <transaction id> <used metadata sectors>/<total metadata sectors> 2377 2300 * <used data sectors>/<total data sectors> <held metadata root> 2378 2301 */ 2379 - static int pool_status(struct dm_target *ti, status_type_t type, 2380 - unsigned status_flags, char *result, unsigned maxlen) 2302 + static void pool_status(struct dm_target *ti, status_type_t type, 2303 + unsigned status_flags, char *result, unsigned maxlen) 2381 2304 { 2382 2305 int r; 2383 2306 unsigned sz = 0; ··· 2403 2326 if (!(status_flags & DM_STATUS_NOFLUSH_FLAG) && !dm_suspended(ti)) 2404 2327 (void) commit_or_fallback(pool); 2405 2328 2406 - r = dm_pool_get_metadata_transaction_id(pool->pmd, 2407 - &transaction_id); 2408 - if (r) 2409 - return r; 2329 + r = dm_pool_get_metadata_transaction_id(pool->pmd, &transaction_id); 2330 + if (r) { 2331 + DMERR("dm_pool_get_metadata_transaction_id returned %d", r); 2332 + goto err; 2333 + } 2410 2334 2411 - r = dm_pool_get_free_metadata_block_count(pool->pmd, 2412 - &nr_free_blocks_metadata); 2413 - if (r) 2414 - return r; 2335 + r = dm_pool_get_free_metadata_block_count(pool->pmd, &nr_free_blocks_metadata); 2336 + if (r) { 2337 + DMERR("dm_pool_get_free_metadata_block_count returned %d", r); 2338 + goto err; 2339 + } 2415 2340 2416 2341 r = 
dm_pool_get_metadata_dev_size(pool->pmd, &nr_blocks_metadata); 2417 - if (r) 2418 - return r; 2342 + if (r) { 2343 + DMERR("dm_pool_get_metadata_dev_size returned %d", r); 2344 + goto err; 2345 + } 2419 2346 2420 - r = dm_pool_get_free_block_count(pool->pmd, 2421 - &nr_free_blocks_data); 2422 - if (r) 2423 - return r; 2347 + r = dm_pool_get_free_block_count(pool->pmd, &nr_free_blocks_data); 2348 + if (r) { 2349 + DMERR("dm_pool_get_free_block_count returned %d", r); 2350 + goto err; 2351 + } 2424 2352 2425 2353 r = dm_pool_get_data_dev_size(pool->pmd, &nr_blocks_data); 2426 - if (r) 2427 - return r; 2354 + if (r) { 2355 + DMERR("dm_pool_get_data_dev_size returned %d", r); 2356 + goto err; 2357 + } 2428 2358 2429 2359 r = dm_pool_get_metadata_snap(pool->pmd, &held_root); 2430 - if (r) 2431 - return r; 2360 + if (r) { 2361 + DMERR("dm_pool_get_metadata_snap returned %d", r); 2362 + goto err; 2363 + } 2432 2364 2433 2365 DMEMIT("%llu %llu/%llu %llu/%llu ", 2434 2366 (unsigned long long)transaction_id, ··· 2474 2388 emit_flags(&pt->requested_pf, result, sz, maxlen); 2475 2389 break; 2476 2390 } 2391 + return; 2477 2392 2478 - return 0; 2393 + err: 2394 + DMEMIT("Error"); 2479 2395 } 2480 2396 2481 2397 static int pool_iterate_devices(struct dm_target *ti, ··· 2502 2414 return min(max_size, q->merge_bvec_fn(q, bvm, biovec)); 2503 2415 } 2504 2416 2505 - static bool block_size_is_power_of_two(struct pool *pool) 2506 - { 2507 - return pool->sectors_per_block_shift >= 0; 2508 - } 2509 - 2510 2417 static void set_discard_limits(struct pool_c *pt, struct queue_limits *limits) 2511 2418 { 2512 2419 struct pool *pool = pt->pool; ··· 2515 2432 if (pt->adjusted_pf.discard_passdown) { 2516 2433 data_limits = &bdev_get_queue(pt->data_dev->bdev)->limits; 2517 2434 limits->discard_granularity = data_limits->discard_granularity; 2518 - } else if (block_size_is_power_of_two(pool)) 2435 + } else 2519 2436 limits->discard_granularity = pool->sectors_per_block << SECTOR_SHIFT; 2520 - 
 	else
-		/*
-		 * Use largest power of 2 that is a factor of sectors_per_block
-		 * but at least DATA_DEV_BLOCK_SIZE_MIN_SECTORS.
-		 */
-		limits->discard_granularity = max(1 << (ffs(pool->sectors_per_block) - 1),
-						  DATA_DEV_BLOCK_SIZE_MIN_SECTORS) << SECTOR_SHIFT;
 }
 
 static void pool_io_hints(struct dm_target *ti, struct queue_limits *limits)
···
 	.name = "thin-pool",
 	.features = DM_TARGET_SINGLETON | DM_TARGET_ALWAYS_WRITEABLE |
 		    DM_TARGET_IMMUTABLE,
-	.version = {1, 6, 0},
+	.version = {1, 6, 1},
 	.module = THIS_MODULE,
 	.ctr = pool_ctr,
 	.dtr = pool_dtr,
···
 	if (r)
 		goto bad_thin_open;
 
-	ti->num_flush_requests = 1;
+	ti->num_flush_bios = 1;
 	ti->flush_supported = true;
 	ti->per_bio_data_size = sizeof(struct dm_thin_endio_hook);
 
 	/* In case the pool supports discards, pass them on. */
 	if (tc->pool->pf.discard_enabled) {
 		ti->discards_supported = true;
-		ti->num_discard_requests = 1;
+		ti->num_discard_bios = 1;
 		ti->discard_zeroes_data_unsupported = true;
-		/* Discard requests must be split on a block boundary */
-		ti->split_discard_requests = true;
+		/* Discard bios must be split on a block boundary */
+		ti->split_discard_bios = true;
 	}
 
 	dm_put(pool_md);
···
 /*
  * <nr mapped sectors> <highest mapped sector>
  */
-static int thin_status(struct dm_target *ti, status_type_t type,
-		       unsigned status_flags, char *result, unsigned maxlen)
+static void thin_status(struct dm_target *ti, status_type_t type,
+			unsigned status_flags, char *result, unsigned maxlen)
 {
 	int r;
 	ssize_t sz = 0;
···
 	if (get_pool_mode(tc->pool) == PM_FAIL) {
 		DMEMIT("Fail");
-		return 0;
+		return;
 	}
 
 	if (!tc->td)
···
 	switch (type) {
 	case STATUSTYPE_INFO:
 		r = dm_thin_get_mapped_count(tc->td, &mapped);
-		if (r)
-			return r;
+		if (r) {
+			DMERR("dm_thin_get_mapped_count returned %d", r);
+			goto err;
+		}
 
 		r = dm_thin_get_highest_mapped_block(tc->td, &highest);
-		if (r < 0)
-			return r;
+		if (r < 0) {
+			DMERR("dm_thin_get_highest_mapped_block returned %d", r);
+			goto err;
+		}
 
 		DMEMIT("%llu ", mapped * tc->pool->sectors_per_block);
 		if (r)
···
 	}
 	}
 
-	return 0;
+	return;
+
+err:
+	DMEMIT("Error");
 }
 
 static int thin_iterate_devices(struct dm_target *ti,
···
 static struct target_type thin_target = {
 	.name = "thin",
-	.version = {1, 7, 0},
+	.version = {1, 7, 1},
 	.module = THIS_MODULE,
 	.ctr = thin_ctr,
 	.dtr = thin_dtr,
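The status functions in this series change from returning an errno to returning void: a metadata failure is now reported by emitting "Error" into the status buffer itself. A minimal userspace sketch of that pattern (DMEMIT here is a snprintf stand-in for the kernel macro, and get_mapped_count() is a hypothetical query, not a kernel function):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Userspace stand-in for the kernel's DMEMIT macro. */
#define DMEMIT(fmt, ...) \
	(sz += snprintf(result + sz, maxlen > sz ? maxlen - sz : 0, fmt, ##__VA_ARGS__))

/* Hypothetical metadata query; returns nonzero on failure. */
static int get_mapped_count(unsigned long long *mapped, int fail)
{
	*mapped = 1024;
	return fail ? -5 : 0;
}

/* Status emitters no longer propagate an errno: on failure they
 * emit "Error" into the result buffer and simply return. */
static void thin_status_sketch(char *result, unsigned maxlen, int fail)
{
	size_t sz = 0;
	unsigned long long mapped;

	if (get_mapped_count(&mapped, fail))
		goto err;

	DMEMIT("%llu mapped", mapped);
	return;

err:
	DMEMIT("Error");
}
```

This is what lets the `.status` hook become `void` in thin, verity and the other targets below.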
+3 -5
drivers/md/dm-verity.c
···
 /*
  * Status: V (valid) or C (corruption found)
  */
-static int verity_status(struct dm_target *ti, status_type_t type,
-			 unsigned status_flags, char *result, unsigned maxlen)
+static void verity_status(struct dm_target *ti, status_type_t type,
+			  unsigned status_flags, char *result, unsigned maxlen)
 {
 	struct dm_verity *v = ti->private;
 	unsigned sz = 0;
···
 			DMEMIT("%02x", v->salt[x]);
 		break;
 	}
-
-	return 0;
 }
 
 static int verity_ioctl(struct dm_target *ti, unsigned cmd,
···
 
 static struct target_type verity_target = {
 	.name = "verity",
-	.version = {1, 1, 0},
+	.version = {1, 1, 1},
 	.module = THIS_MODULE,
 	.ctr = verity_ctr,
 	.dtr = verity_dtr,
+1 -1
drivers/md/dm-zero.c
···
 	/*
 	 * Silently drop discards, avoiding -EOPNOTSUPP.
 	 */
-	ti->num_discard_requests = 1;
+	ti->num_discard_bios = 1;
 
 	return 0;
 }
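One refactor in the dm.c diff below pulls the bvec-accumulation loop out of the old __clone_and_map() into __len_within_target(): walk the bio's vectors from the current index, stopping before the first one that would exceed the target's remaining length. A hedged userspace sketch of that loop over a plain array of lengths (the types are simplified stand-ins, not the kernel's bio structures):

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of __len_within_target(): sum entry lengths starting at *idx,
 * stopping before the first entry that would overshoot 'max'.  Returns
 * the total length consumed and advances *idx past the entries taken. */
static unsigned long len_within_target(const unsigned long *bv_len,
				       size_t nr, unsigned long max,
				       size_t *idx)
{
	unsigned long total_len = 0;

	for (; max && *idx < nr; (*idx)++) {
		unsigned long len = bv_len[*idx];

		if (len > max)
			break;

		max -= len;
		total_len += len;
	}

	return total_len;
}
```

The caller then clones one bio covering the accumulated length, exactly as __split_and_process_non_flush() does with the returned idx.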
+241 -213
drivers/md/dm.c
··· 163 163 * io objects are allocated from here. 164 164 */ 165 165 mempool_t *io_pool; 166 - mempool_t *tio_pool; 167 166 168 167 struct bio_set *bs; 169 168 ··· 196 197 */ 197 198 struct dm_md_mempools { 198 199 mempool_t *io_pool; 199 - mempool_t *tio_pool; 200 200 struct bio_set *bs; 201 201 }; 202 202 203 203 #define MIN_IOS 256 204 204 static struct kmem_cache *_io_cache; 205 205 static struct kmem_cache *_rq_tio_cache; 206 - 207 - /* 208 - * Unused now, and needs to be deleted. But since io_pool is overloaded and it's 209 - * still used for _io_cache, I'm leaving this for a later cleanup 210 - */ 211 - static struct kmem_cache *_rq_bio_info_cache; 212 206 213 207 static int __init local_init(void) 214 208 { ··· 216 224 if (!_rq_tio_cache) 217 225 goto out_free_io_cache; 218 226 219 - _rq_bio_info_cache = KMEM_CACHE(dm_rq_clone_bio_info, 0); 220 - if (!_rq_bio_info_cache) 221 - goto out_free_rq_tio_cache; 222 - 223 227 r = dm_uevent_init(); 224 228 if (r) 225 - goto out_free_rq_bio_info_cache; 229 + goto out_free_rq_tio_cache; 226 230 227 231 _major = major; 228 232 r = register_blkdev(_major, _name); ··· 232 244 233 245 out_uevent_exit: 234 246 dm_uevent_exit(); 235 - out_free_rq_bio_info_cache: 236 - kmem_cache_destroy(_rq_bio_info_cache); 237 247 out_free_rq_tio_cache: 238 248 kmem_cache_destroy(_rq_tio_cache); 239 249 out_free_io_cache: ··· 242 256 243 257 static void local_exit(void) 244 258 { 245 - kmem_cache_destroy(_rq_bio_info_cache); 246 259 kmem_cache_destroy(_rq_tio_cache); 247 260 kmem_cache_destroy(_io_cache); 248 261 unregister_blkdev(_major, _name); ··· 433 448 static struct dm_rq_target_io *alloc_rq_tio(struct mapped_device *md, 434 449 gfp_t gfp_mask) 435 450 { 436 - return mempool_alloc(md->tio_pool, gfp_mask); 451 + return mempool_alloc(md->io_pool, gfp_mask); 437 452 } 438 453 439 454 static void free_rq_tio(struct dm_rq_target_io *tio) 440 455 { 441 - mempool_free(tio, tio->md->tio_pool); 456 + mempool_free(tio, tio->md->io_pool); 442 
457 } 443 458 444 459 static int md_in_flight(struct mapped_device *md) ··· 970 985 } 971 986 EXPORT_SYMBOL_GPL(dm_set_target_max_io_len); 972 987 973 - static void __map_bio(struct dm_target *ti, struct dm_target_io *tio) 988 + static void __map_bio(struct dm_target_io *tio) 974 989 { 975 990 int r; 976 991 sector_t sector; 977 992 struct mapped_device *md; 978 993 struct bio *clone = &tio->clone; 994 + struct dm_target *ti = tio->ti; 979 995 980 996 clone->bi_end_io = clone_endio; 981 997 clone->bi_private = tio; ··· 1017 1031 unsigned short idx; 1018 1032 }; 1019 1033 1034 + static void bio_setup_sector(struct bio *bio, sector_t sector, sector_t len) 1035 + { 1036 + bio->bi_sector = sector; 1037 + bio->bi_size = to_bytes(len); 1038 + } 1039 + 1040 + static void bio_setup_bv(struct bio *bio, unsigned short idx, unsigned short bv_count) 1041 + { 1042 + bio->bi_idx = idx; 1043 + bio->bi_vcnt = idx + bv_count; 1044 + bio->bi_flags &= ~(1 << BIO_SEG_VALID); 1045 + } 1046 + 1047 + static void clone_bio_integrity(struct bio *bio, struct bio *clone, 1048 + unsigned short idx, unsigned len, unsigned offset, 1049 + unsigned trim) 1050 + { 1051 + if (!bio_integrity(bio)) 1052 + return; 1053 + 1054 + bio_integrity_clone(clone, bio, GFP_NOIO); 1055 + 1056 + if (trim) 1057 + bio_integrity_trim(clone, bio_sector_offset(bio, idx, offset), len); 1058 + } 1059 + 1020 1060 /* 1021 1061 * Creates a little bio that just does part of a bvec. 
1022 1062 */ 1023 - static void split_bvec(struct dm_target_io *tio, struct bio *bio, 1024 - sector_t sector, unsigned short idx, unsigned int offset, 1025 - unsigned int len, struct bio_set *bs) 1063 + static void clone_split_bio(struct dm_target_io *tio, struct bio *bio, 1064 + sector_t sector, unsigned short idx, 1065 + unsigned offset, unsigned len) 1026 1066 { 1027 1067 struct bio *clone = &tio->clone; 1028 1068 struct bio_vec *bv = bio->bi_io_vec + idx; 1029 1069 1030 1070 *clone->bi_io_vec = *bv; 1031 1071 1032 - clone->bi_sector = sector; 1072 + bio_setup_sector(clone, sector, len); 1073 + 1033 1074 clone->bi_bdev = bio->bi_bdev; 1034 1075 clone->bi_rw = bio->bi_rw; 1035 1076 clone->bi_vcnt = 1; 1036 - clone->bi_size = to_bytes(len); 1037 1077 clone->bi_io_vec->bv_offset = offset; 1038 1078 clone->bi_io_vec->bv_len = clone->bi_size; 1039 1079 clone->bi_flags |= 1 << BIO_CLONED; 1040 1080 1041 - if (bio_integrity(bio)) { 1042 - bio_integrity_clone(clone, bio, GFP_NOIO); 1043 - bio_integrity_trim(clone, 1044 - bio_sector_offset(bio, idx, offset), len); 1045 - } 1081 + clone_bio_integrity(bio, clone, idx, len, offset, 1); 1046 1082 } 1047 1083 1048 1084 /* ··· 1072 1064 */ 1073 1065 static void clone_bio(struct dm_target_io *tio, struct bio *bio, 1074 1066 sector_t sector, unsigned short idx, 1075 - unsigned short bv_count, unsigned int len, 1076 - struct bio_set *bs) 1067 + unsigned short bv_count, unsigned len) 1077 1068 { 1078 1069 struct bio *clone = &tio->clone; 1070 + unsigned trim = 0; 1079 1071 1080 1072 __bio_clone(clone, bio); 1081 - clone->bi_sector = sector; 1082 - clone->bi_idx = idx; 1083 - clone->bi_vcnt = idx + bv_count; 1084 - clone->bi_size = to_bytes(len); 1085 - clone->bi_flags &= ~(1 << BIO_SEG_VALID); 1073 + bio_setup_sector(clone, sector, len); 1074 + bio_setup_bv(clone, idx, bv_count); 1086 1075 1087 - if (bio_integrity(bio)) { 1088 - bio_integrity_clone(clone, bio, GFP_NOIO); 1089 - 1090 - if (idx != bio->bi_idx || clone->bi_size < 
bio->bi_size) 1091 - bio_integrity_trim(clone, 1092 - bio_sector_offset(bio, idx, 0), len); 1093 - } 1076 + if (idx != bio->bi_idx || clone->bi_size < bio->bi_size) 1077 + trim = 1; 1078 + clone_bio_integrity(bio, clone, idx, len, 0, trim); 1094 1079 } 1095 1080 1096 1081 static struct dm_target_io *alloc_tio(struct clone_info *ci, 1097 - struct dm_target *ti, int nr_iovecs) 1082 + struct dm_target *ti, int nr_iovecs, 1083 + unsigned target_bio_nr) 1098 1084 { 1099 1085 struct dm_target_io *tio; 1100 1086 struct bio *clone; ··· 1099 1097 tio->io = ci->io; 1100 1098 tio->ti = ti; 1101 1099 memset(&tio->info, 0, sizeof(tio->info)); 1102 - tio->target_request_nr = 0; 1100 + tio->target_bio_nr = target_bio_nr; 1103 1101 1104 1102 return tio; 1105 1103 } 1106 1104 1107 - static void __issue_target_request(struct clone_info *ci, struct dm_target *ti, 1108 - unsigned request_nr, sector_t len) 1105 + static void __clone_and_map_simple_bio(struct clone_info *ci, 1106 + struct dm_target *ti, 1107 + unsigned target_bio_nr, sector_t len) 1109 1108 { 1110 - struct dm_target_io *tio = alloc_tio(ci, ti, ci->bio->bi_max_vecs); 1109 + struct dm_target_io *tio = alloc_tio(ci, ti, ci->bio->bi_max_vecs, target_bio_nr); 1111 1110 struct bio *clone = &tio->clone; 1112 - 1113 - tio->target_request_nr = request_nr; 1114 1111 1115 1112 /* 1116 1113 * Discard requests require the bio's inline iovecs be initialized. 1117 1114 * ci->bio->bi_max_vecs is BIO_INLINE_VECS anyway, for both flush 1118 1115 * and discard, so no need for concern about wasted bvec allocations. 
1119 1116 */ 1120 - 1121 1117 __bio_clone(clone, ci->bio); 1122 - if (len) { 1123 - clone->bi_sector = ci->sector; 1124 - clone->bi_size = to_bytes(len); 1125 - } 1118 + if (len) 1119 + bio_setup_sector(clone, ci->sector, len); 1126 1120 1127 - __map_bio(ti, tio); 1121 + __map_bio(tio); 1128 1122 } 1129 1123 1130 - static void __issue_target_requests(struct clone_info *ci, struct dm_target *ti, 1131 - unsigned num_requests, sector_t len) 1124 + static void __send_duplicate_bios(struct clone_info *ci, struct dm_target *ti, 1125 + unsigned num_bios, sector_t len) 1132 1126 { 1133 - unsigned request_nr; 1127 + unsigned target_bio_nr; 1134 1128 1135 - for (request_nr = 0; request_nr < num_requests; request_nr++) 1136 - __issue_target_request(ci, ti, request_nr, len); 1129 + for (target_bio_nr = 0; target_bio_nr < num_bios; target_bio_nr++) 1130 + __clone_and_map_simple_bio(ci, ti, target_bio_nr, len); 1137 1131 } 1138 1132 1139 - static int __clone_and_map_empty_flush(struct clone_info *ci) 1133 + static int __send_empty_flush(struct clone_info *ci) 1140 1134 { 1141 1135 unsigned target_nr = 0; 1142 1136 struct dm_target *ti; 1143 1137 1144 1138 BUG_ON(bio_has_data(ci->bio)); 1145 1139 while ((ti = dm_table_get_target(ci->map, target_nr++))) 1146 - __issue_target_requests(ci, ti, ti->num_flush_requests, 0); 1140 + __send_duplicate_bios(ci, ti, ti->num_flush_bios, 0); 1147 1141 1148 1142 return 0; 1149 1143 } 1150 1144 1151 - /* 1152 - * Perform all io with a single clone. 
1153 - */ 1154 - static void __clone_and_map_simple(struct clone_info *ci, struct dm_target *ti) 1145 + static void __clone_and_map_data_bio(struct clone_info *ci, struct dm_target *ti, 1146 + sector_t sector, int nr_iovecs, 1147 + unsigned short idx, unsigned short bv_count, 1148 + unsigned offset, unsigned len, 1149 + unsigned split_bvec) 1155 1150 { 1156 1151 struct bio *bio = ci->bio; 1157 1152 struct dm_target_io *tio; 1153 + unsigned target_bio_nr; 1154 + unsigned num_target_bios = 1; 1158 1155 1159 - tio = alloc_tio(ci, ti, bio->bi_max_vecs); 1160 - clone_bio(tio, bio, ci->sector, ci->idx, bio->bi_vcnt - ci->idx, 1161 - ci->sector_count, ci->md->bs); 1162 - __map_bio(ti, tio); 1163 - ci->sector_count = 0; 1156 + /* 1157 + * Does the target want to receive duplicate copies of the bio? 1158 + */ 1159 + if (bio_data_dir(bio) == WRITE && ti->num_write_bios) 1160 + num_target_bios = ti->num_write_bios(ti, bio); 1161 + 1162 + for (target_bio_nr = 0; target_bio_nr < num_target_bios; target_bio_nr++) { 1163 + tio = alloc_tio(ci, ti, nr_iovecs, target_bio_nr); 1164 + if (split_bvec) 1165 + clone_split_bio(tio, bio, sector, idx, offset, len); 1166 + else 1167 + clone_bio(tio, bio, sector, idx, bv_count, len); 1168 + __map_bio(tio); 1169 + } 1164 1170 } 1165 1171 1166 - typedef unsigned (*get_num_requests_fn)(struct dm_target *ti); 1172 + typedef unsigned (*get_num_bios_fn)(struct dm_target *ti); 1167 1173 1168 - static unsigned get_num_discard_requests(struct dm_target *ti) 1174 + static unsigned get_num_discard_bios(struct dm_target *ti) 1169 1175 { 1170 - return ti->num_discard_requests; 1176 + return ti->num_discard_bios; 1171 1177 } 1172 1178 1173 - static unsigned get_num_write_same_requests(struct dm_target *ti) 1179 + static unsigned get_num_write_same_bios(struct dm_target *ti) 1174 1180 { 1175 - return ti->num_write_same_requests; 1181 + return ti->num_write_same_bios; 1176 1182 } 1177 1183 1178 1184 typedef bool (*is_split_required_fn)(struct dm_target *ti); 
1179 1185 1180 1186 static bool is_split_required_for_discard(struct dm_target *ti) 1181 1187 { 1182 - return ti->split_discard_requests; 1188 + return ti->split_discard_bios; 1183 1189 } 1184 1190 1185 - static int __clone_and_map_changing_extent_only(struct clone_info *ci, 1186 - get_num_requests_fn get_num_requests, 1187 - is_split_required_fn is_split_required) 1191 + static int __send_changing_extent_only(struct clone_info *ci, 1192 + get_num_bios_fn get_num_bios, 1193 + is_split_required_fn is_split_required) 1188 1194 { 1189 1195 struct dm_target *ti; 1190 1196 sector_t len; 1191 - unsigned num_requests; 1197 + unsigned num_bios; 1192 1198 1193 1199 do { 1194 1200 ti = dm_table_find_target(ci->map, ci->sector); ··· 1209 1199 * reconfiguration might also have changed that since the 1210 1200 * check was performed. 1211 1201 */ 1212 - num_requests = get_num_requests ? get_num_requests(ti) : 0; 1213 - if (!num_requests) 1202 + num_bios = get_num_bios ? get_num_bios(ti) : 0; 1203 + if (!num_bios) 1214 1204 return -EOPNOTSUPP; 1215 1205 1216 1206 if (is_split_required && !is_split_required(ti)) ··· 1218 1208 else 1219 1209 len = min(ci->sector_count, max_io_len(ci->sector, ti)); 1220 1210 1221 - __issue_target_requests(ci, ti, num_requests, len); 1211 + __send_duplicate_bios(ci, ti, num_bios, len); 1222 1212 1223 1213 ci->sector += len; 1224 1214 } while (ci->sector_count -= len); ··· 1226 1216 return 0; 1227 1217 } 1228 1218 1229 - static int __clone_and_map_discard(struct clone_info *ci) 1219 + static int __send_discard(struct clone_info *ci) 1230 1220 { 1231 - return __clone_and_map_changing_extent_only(ci, get_num_discard_requests, 1232 - is_split_required_for_discard); 1221 + return __send_changing_extent_only(ci, get_num_discard_bios, 1222 + is_split_required_for_discard); 1233 1223 } 1234 1224 1235 - static int __clone_and_map_write_same(struct clone_info *ci) 1225 + static int __send_write_same(struct clone_info *ci) 1236 1226 { 1237 - return 
__clone_and_map_changing_extent_only(ci, get_num_write_same_requests, NULL); 1227 + return __send_changing_extent_only(ci, get_num_write_same_bios, NULL); 1238 1228 } 1239 1229 1240 - static int __clone_and_map(struct clone_info *ci) 1230 + /* 1231 + * Find maximum number of sectors / bvecs we can process with a single bio. 1232 + */ 1233 + static sector_t __len_within_target(struct clone_info *ci, sector_t max, int *idx) 1234 + { 1235 + struct bio *bio = ci->bio; 1236 + sector_t bv_len, total_len = 0; 1237 + 1238 + for (*idx = ci->idx; max && (*idx < bio->bi_vcnt); (*idx)++) { 1239 + bv_len = to_sector(bio->bi_io_vec[*idx].bv_len); 1240 + 1241 + if (bv_len > max) 1242 + break; 1243 + 1244 + max -= bv_len; 1245 + total_len += bv_len; 1246 + } 1247 + 1248 + return total_len; 1249 + } 1250 + 1251 + static int __split_bvec_across_targets(struct clone_info *ci, 1252 + struct dm_target *ti, sector_t max) 1253 + { 1254 + struct bio *bio = ci->bio; 1255 + struct bio_vec *bv = bio->bi_io_vec + ci->idx; 1256 + sector_t remaining = to_sector(bv->bv_len); 1257 + unsigned offset = 0; 1258 + sector_t len; 1259 + 1260 + do { 1261 + if (offset) { 1262 + ti = dm_table_find_target(ci->map, ci->sector); 1263 + if (!dm_target_is_valid(ti)) 1264 + return -EIO; 1265 + 1266 + max = max_io_len(ci->sector, ti); 1267 + } 1268 + 1269 + len = min(remaining, max); 1270 + 1271 + __clone_and_map_data_bio(ci, ti, ci->sector, 1, ci->idx, 0, 1272 + bv->bv_offset + offset, len, 1); 1273 + 1274 + ci->sector += len; 1275 + ci->sector_count -= len; 1276 + offset += to_bytes(len); 1277 + } while (remaining -= len); 1278 + 1279 + ci->idx++; 1280 + 1281 + return 0; 1282 + } 1283 + 1284 + /* 1285 + * Select the correct strategy for processing a non-flush bio. 
1286 + */ 1287 + static int __split_and_process_non_flush(struct clone_info *ci) 1241 1288 { 1242 1289 struct bio *bio = ci->bio; 1243 1290 struct dm_target *ti; 1244 - sector_t len = 0, max; 1245 - struct dm_target_io *tio; 1291 + sector_t len, max; 1292 + int idx; 1246 1293 1247 1294 if (unlikely(bio->bi_rw & REQ_DISCARD)) 1248 - return __clone_and_map_discard(ci); 1295 + return __send_discard(ci); 1249 1296 else if (unlikely(bio->bi_rw & REQ_WRITE_SAME)) 1250 - return __clone_and_map_write_same(ci); 1297 + return __send_write_same(ci); 1251 1298 1252 1299 ti = dm_table_find_target(ci->map, ci->sector); 1253 1300 if (!dm_target_is_valid(ti)) ··· 1312 1245 1313 1246 max = max_io_len(ci->sector, ti); 1314 1247 1248 + /* 1249 + * Optimise for the simple case where we can do all of 1250 + * the remaining io with a single clone. 1251 + */ 1315 1252 if (ci->sector_count <= max) { 1316 - /* 1317 - * Optimise for the simple case where we can do all of 1318 - * the remaining io with a single clone. 1319 - */ 1320 - __clone_and_map_simple(ci, ti); 1253 + __clone_and_map_data_bio(ci, ti, ci->sector, bio->bi_max_vecs, 1254 + ci->idx, bio->bi_vcnt - ci->idx, 0, 1255 + ci->sector_count, 0); 1256 + ci->sector_count = 0; 1257 + return 0; 1258 + } 1321 1259 1322 - } else if (to_sector(bio->bi_io_vec[ci->idx].bv_len) <= max) { 1323 - /* 1324 - * There are some bvecs that don't span targets. 1325 - * Do as many of these as possible. 1326 - */ 1327 - int i; 1328 - sector_t remaining = max; 1329 - sector_t bv_len; 1260 + /* 1261 + * There are some bvecs that don't span targets. 1262 + * Do as many of these as possible. 
1263 + */ 1264 + if (to_sector(bio->bi_io_vec[ci->idx].bv_len) <= max) { 1265 + len = __len_within_target(ci, max, &idx); 1330 1266 1331 - for (i = ci->idx; remaining && (i < bio->bi_vcnt); i++) { 1332 - bv_len = to_sector(bio->bi_io_vec[i].bv_len); 1333 - 1334 - if (bv_len > remaining) 1335 - break; 1336 - 1337 - remaining -= bv_len; 1338 - len += bv_len; 1339 - } 1340 - 1341 - tio = alloc_tio(ci, ti, bio->bi_max_vecs); 1342 - clone_bio(tio, bio, ci->sector, ci->idx, i - ci->idx, len, 1343 - ci->md->bs); 1344 - __map_bio(ti, tio); 1267 + __clone_and_map_data_bio(ci, ti, ci->sector, bio->bi_max_vecs, 1268 + ci->idx, idx - ci->idx, 0, len, 0); 1345 1269 1346 1270 ci->sector += len; 1347 1271 ci->sector_count -= len; 1348 - ci->idx = i; 1272 + ci->idx = idx; 1349 1273 1350 - } else { 1351 - /* 1352 - * Handle a bvec that must be split between two or more targets. 1353 - */ 1354 - struct bio_vec *bv = bio->bi_io_vec + ci->idx; 1355 - sector_t remaining = to_sector(bv->bv_len); 1356 - unsigned int offset = 0; 1357 - 1358 - do { 1359 - if (offset) { 1360 - ti = dm_table_find_target(ci->map, ci->sector); 1361 - if (!dm_target_is_valid(ti)) 1362 - return -EIO; 1363 - 1364 - max = max_io_len(ci->sector, ti); 1365 - } 1366 - 1367 - len = min(remaining, max); 1368 - 1369 - tio = alloc_tio(ci, ti, 1); 1370 - split_bvec(tio, bio, ci->sector, ci->idx, 1371 - bv->bv_offset + offset, len, ci->md->bs); 1372 - 1373 - __map_bio(ti, tio); 1374 - 1375 - ci->sector += len; 1376 - ci->sector_count -= len; 1377 - offset += to_bytes(len); 1378 - } while (remaining -= len); 1379 - 1380 - ci->idx++; 1274 + return 0; 1381 1275 } 1382 1276 1383 - return 0; 1277 + /* 1278 + * Handle a bvec that must be split between two or more targets. 1279 + */ 1280 + return __split_bvec_across_targets(ci, ti, max); 1384 1281 } 1385 1282 1386 1283 /* 1387 - * Split the bio into several clones and submit it to targets. 1284 + * Entry point to split a bio into clones and submit them to the targets. 
1388 1285 */ 1389 1286 static void __split_and_process_bio(struct mapped_device *md, struct bio *bio) 1390 1287 { ··· 1372 1341 ci.idx = bio->bi_idx; 1373 1342 1374 1343 start_io_acct(ci.io); 1344 + 1375 1345 if (bio->bi_rw & REQ_FLUSH) { 1376 1346 ci.bio = &ci.md->flush_bio; 1377 1347 ci.sector_count = 0; 1378 - error = __clone_and_map_empty_flush(&ci); 1348 + error = __send_empty_flush(&ci); 1379 1349 /* dec_pending submits any data associated with flush */ 1380 1350 } else { 1381 1351 ci.bio = bio; 1382 1352 ci.sector_count = bio_sectors(bio); 1383 1353 while (ci.sector_count && !error) 1384 - error = __clone_and_map(&ci); 1354 + error = __split_and_process_non_flush(&ci); 1385 1355 } 1386 1356 1387 1357 /* drop the extra reference count */ ··· 1955 1923 unlock_fs(md); 1956 1924 bdput(md->bdev); 1957 1925 destroy_workqueue(md->wq); 1958 - if (md->tio_pool) 1959 - mempool_destroy(md->tio_pool); 1960 1926 if (md->io_pool) 1961 1927 mempool_destroy(md->io_pool); 1962 1928 if (md->bs) ··· 1977 1947 { 1978 1948 struct dm_md_mempools *p = dm_table_get_md_mempools(t); 1979 1949 1980 - if (md->io_pool && (md->tio_pool || dm_table_get_type(t) == DM_TYPE_BIO_BASED) && md->bs) { 1981 - /* 1982 - * The md already has necessary mempools. Reload just the 1983 - * bioset because front_pad may have changed because 1984 - * a different table was loaded. 1985 - */ 1986 - bioset_free(md->bs); 1987 - md->bs = p->bs; 1988 - p->bs = NULL; 1950 + if (md->io_pool && md->bs) { 1951 + /* The md already has necessary mempools. */ 1952 + if (dm_table_get_type(t) == DM_TYPE_BIO_BASED) { 1953 + /* 1954 + * Reload bioset because front_pad may have changed 1955 + * because a different table was loaded. 1956 + */ 1957 + bioset_free(md->bs); 1958 + md->bs = p->bs; 1959 + p->bs = NULL; 1960 + } else if (dm_table_get_type(t) == DM_TYPE_REQUEST_BASED) { 1961 + /* 1962 + * There's no need to reload with request-based dm 1963 + * because the size of front_pad doesn't change. 
1964 + * Note for future: If you are to reload bioset, 1965 + * prep-ed requests in the queue may refer 1966 + * to bio from the old bioset, so you must walk 1967 + * through the queue to unprep. 1968 + */ 1969 + } 1989 1970 goto out; 1990 1971 } 1991 1972 1992 - BUG_ON(!p || md->io_pool || md->tio_pool || md->bs); 1973 + BUG_ON(!p || md->io_pool || md->bs); 1993 1974 1994 1975 md->io_pool = p->io_pool; 1995 1976 p->io_pool = NULL; 1996 - md->tio_pool = p->tio_pool; 1997 - p->tio_pool = NULL; 1998 1977 md->bs = p->bs; 1999 1978 p->bs = NULL; 2000 1979 ··· 2434 2395 */ 2435 2396 struct dm_table *dm_swap_table(struct mapped_device *md, struct dm_table *table) 2436 2397 { 2437 - struct dm_table *live_map, *map = ERR_PTR(-EINVAL); 2398 + struct dm_table *live_map = NULL, *map = ERR_PTR(-EINVAL); 2438 2399 struct queue_limits limits; 2439 2400 int r; 2440 2401 ··· 2457 2418 dm_table_put(live_map); 2458 2419 } 2459 2420 2460 - r = dm_calculate_queue_limits(table, &limits); 2461 - if (r) { 2462 - map = ERR_PTR(r); 2463 - goto out; 2421 + if (!live_map) { 2422 + r = dm_calculate_queue_limits(table, &limits); 2423 + if (r) { 2424 + map = ERR_PTR(r); 2425 + goto out; 2426 + } 2464 2427 } 2465 2428 2466 2429 map = __bind(md, table, &limits); ··· 2760 2719 2761 2720 struct dm_md_mempools *dm_alloc_md_mempools(unsigned type, unsigned integrity, unsigned per_bio_data_size) 2762 2721 { 2763 - struct dm_md_mempools *pools = kmalloc(sizeof(*pools), GFP_KERNEL); 2764 - unsigned int pool_size = (type == DM_TYPE_BIO_BASED) ? 
16 : MIN_IOS; 2722 + struct dm_md_mempools *pools = kzalloc(sizeof(*pools), GFP_KERNEL); 2723 + struct kmem_cache *cachep; 2724 + unsigned int pool_size; 2725 + unsigned int front_pad; 2765 2726 2766 2727 if (!pools) 2767 2728 return NULL; 2768 2729 2769 - per_bio_data_size = roundup(per_bio_data_size, __alignof__(struct dm_target_io)); 2730 + if (type == DM_TYPE_BIO_BASED) { 2731 + cachep = _io_cache; 2732 + pool_size = 16; 2733 + front_pad = roundup(per_bio_data_size, __alignof__(struct dm_target_io)) + offsetof(struct dm_target_io, clone); 2734 + } else if (type == DM_TYPE_REQUEST_BASED) { 2735 + cachep = _rq_tio_cache; 2736 + pool_size = MIN_IOS; 2737 + front_pad = offsetof(struct dm_rq_clone_bio_info, clone); 2738 + /* per_bio_data_size is not used. See __bind_mempools(). */ 2739 + WARN_ON(per_bio_data_size != 0); 2740 + } else 2741 + goto out; 2770 2742 2771 - pools->io_pool = (type == DM_TYPE_BIO_BASED) ? 2772 - mempool_create_slab_pool(MIN_IOS, _io_cache) : 2773 - mempool_create_slab_pool(MIN_IOS, _rq_bio_info_cache); 2743 + pools->io_pool = mempool_create_slab_pool(MIN_IOS, cachep); 2774 2744 if (!pools->io_pool) 2775 - goto free_pools_and_out; 2745 + goto out; 2776 2746 2777 - pools->tio_pool = NULL; 2778 - if (type == DM_TYPE_REQUEST_BASED) { 2779 - pools->tio_pool = mempool_create_slab_pool(MIN_IOS, _rq_tio_cache); 2780 - if (!pools->tio_pool) 2781 - goto free_io_pool_and_out; 2782 - } 2783 - 2784 - pools->bs = (type == DM_TYPE_BIO_BASED) ? 
2785 - bioset_create(pool_size, 2786 - per_bio_data_size + offsetof(struct dm_target_io, clone)) : 2787 - bioset_create(pool_size, 2788 - offsetof(struct dm_rq_clone_bio_info, clone)); 2747 + pools->bs = bioset_create(pool_size, front_pad); 2789 2748 if (!pools->bs) 2790 - goto free_tio_pool_and_out; 2749 + goto out; 2791 2750 2792 2751 if (integrity && bioset_integrity_create(pools->bs, pool_size)) 2793 - goto free_bioset_and_out; 2752 + goto out; 2794 2753 2795 2754 return pools; 2796 2755 2797 - free_bioset_and_out: 2798 - bioset_free(pools->bs); 2799 - 2800 - free_tio_pool_and_out: 2801 - if (pools->tio_pool) 2802 - mempool_destroy(pools->tio_pool); 2803 - 2804 - free_io_pool_and_out: 2805 - mempool_destroy(pools->io_pool); 2806 - 2807 - free_pools_and_out: 2808 - kfree(pools); 2756 + out: 2757 + dm_free_md_mempools(pools); 2809 2758 2810 2759 return NULL; 2811 2760 } ··· 2807 2776 2808 2777 if (pools->io_pool) 2809 2778 mempool_destroy(pools->io_pool); 2810 - 2811 - if (pools->tio_pool) 2812 - mempool_destroy(pools->tio_pool); 2813 2779 2814 2780 if (pools->bs) 2815 2781 bioset_free(pools->bs);
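dm_alloc_md_mempools() above sizes the bioset front pad so that the per-bio data, rounded up to the alignment of struct dm_target_io, sits in memory just in front of the embedded clone bio. The arithmetic can be checked in userspace (the struct below is a hypothetical stand-in with a similar shape, not the kernel definition):

```c
#include <assert.h>
#include <stddef.h>

/* Round x up to a multiple of a (a power of two), as the kernel's
 * roundup() does for the per-bio data size. */
#define ROUNDUP(x, a) ((((x) + (a) - 1) / (a)) * (a))

/* Hypothetical stand-in: per-bio data lives before this struct, so the
 * front pad must cover the aligned per-bio data plus everything up to
 * the embedded clone member. */
struct target_io_sketch {
	void *io;
	void *ti;
	unsigned target_bio_nr;
	long clone;		/* placeholder for the embedded struct bio */
};

static size_t front_pad(size_t per_bio_data_size)
{
	return ROUNDUP(per_bio_data_size, __alignof__(struct target_io_sketch))
		+ offsetof(struct target_io_sketch, clone);
}
```

With per_bio_data_size == 0 this degenerates to just the offset of the clone member, which is the request-based case in the diff above.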
+1 -1
drivers/md/persistent-data/Kconfig
···
 config DM_PERSISTENT_DATA
 	tristate
-	depends on BLK_DEV_DM && EXPERIMENTAL
+	depends on BLK_DEV_DM
 	select LIBCRC32C
 	select DM_BUFIO
 	---help---
+2
drivers/md/persistent-data/Makefile
···
 obj-$(CONFIG_DM_PERSISTENT_DATA) += dm-persistent-data.o
 dm-persistent-data-objs := \
+	dm-array.o \
+	dm-bitset.o \
 	dm-block-manager.o \
 	dm-space-map-common.o \
 	dm-space-map-disk.o \
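dm-array.c, added below, packs values after a fixed header in each block: calc_max_entries() divides the space left after the header by the value size, and element_at() indexes into the region that starts at (ab + 1). A userspace sketch of that layout (the header fields mirror struct array_block but the struct here is illustrative, not the on-disk definition):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Abbreviated header, mirroring the shape of struct array_block. */
struct ablock_hdr {
	uint32_t csum;
	uint32_t max_entries;
	uint32_t nr_entries;
	uint32_t value_size;
	uint64_t blocknr;
};

/* calc_max_entries(): how many values fit after the header. */
static uint32_t calc_max_entries(size_t value_size, size_t size_of_block)
{
	return (size_of_block - sizeof(struct ablock_hdr)) / value_size;
}

/* element_at(): values are packed immediately after the header, so
 * entry i starts at (header + 1) plus i * value_size bytes. */
static void *element_at(struct ablock_hdr *ab, unsigned value_size,
			unsigned index)
{
	unsigned char *entry = (unsigned char *)(ab + 1);

	return entry + (size_t)index * value_size;
}
```

This is why the array is more space efficient than a plain btree: one btree key per block of packed values rather than one key per value, as the comment at the top of dm-array.c explains.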
+808
drivers/md/persistent-data/dm-array.c
··· 1 + /* 2 + * Copyright (C) 2012 Red Hat, Inc. 3 + * 4 + * This file is released under the GPL. 5 + */ 6 + 7 + #include "dm-array.h" 8 + #include "dm-space-map.h" 9 + #include "dm-transaction-manager.h" 10 + 11 + #include <linux/export.h> 12 + #include <linux/device-mapper.h> 13 + 14 + #define DM_MSG_PREFIX "array" 15 + 16 + /*----------------------------------------------------------------*/ 17 + 18 + /* 19 + * The array is implemented as a fully populated btree, which points to 20 + * blocks that contain the packed values. This is more space efficient 21 + * than just using a btree since we don't store 1 key per value. 22 + */ 23 + struct array_block { 24 + __le32 csum; 25 + __le32 max_entries; 26 + __le32 nr_entries; 27 + __le32 value_size; 28 + __le64 blocknr; /* Block this node is supposed to live in. */ 29 + } __packed; 30 + 31 + /*----------------------------------------------------------------*/ 32 + 33 + /* 34 + * Validator methods. As usual we calculate a checksum, and also write the 35 + * block location into the header (paranoia about ssds remapping areas by 36 + * mistake). 
37 + */ 38 + #define CSUM_XOR 595846735 39 + 40 + static void array_block_prepare_for_write(struct dm_block_validator *v, 41 + struct dm_block *b, 42 + size_t size_of_block) 43 + { 44 + struct array_block *bh_le = dm_block_data(b); 45 + 46 + bh_le->blocknr = cpu_to_le64(dm_block_location(b)); 47 + bh_le->csum = cpu_to_le32(dm_bm_checksum(&bh_le->max_entries, 48 + size_of_block - sizeof(__le32), 49 + CSUM_XOR)); 50 + } 51 + 52 + static int array_block_check(struct dm_block_validator *v, 53 + struct dm_block *b, 54 + size_t size_of_block) 55 + { 56 + struct array_block *bh_le = dm_block_data(b); 57 + __le32 csum_disk; 58 + 59 + if (dm_block_location(b) != le64_to_cpu(bh_le->blocknr)) { 60 + DMERR_LIMIT("array_block_check failed: blocknr %llu != wanted %llu", 61 + (unsigned long long) le64_to_cpu(bh_le->blocknr), 62 + (unsigned long long) dm_block_location(b)); 63 + return -ENOTBLK; 64 + } 65 + 66 + csum_disk = cpu_to_le32(dm_bm_checksum(&bh_le->max_entries, 67 + size_of_block - sizeof(__le32), 68 + CSUM_XOR)); 69 + if (csum_disk != bh_le->csum) { 70 + DMERR_LIMIT("array_block_check failed: csum %u != wanted %u", 71 + (unsigned) le32_to_cpu(csum_disk), 72 + (unsigned) le32_to_cpu(bh_le->csum)); 73 + return -EILSEQ; 74 + } 75 + 76 + return 0; 77 + } 78 + 79 + static struct dm_block_validator array_validator = { 80 + .name = "array", 81 + .prepare_for_write = array_block_prepare_for_write, 82 + .check = array_block_check 83 + }; 84 + 85 + /*----------------------------------------------------------------*/ 86 + 87 + /* 88 + * Functions for manipulating the array blocks. 89 + */ 90 + 91 + /* 92 + * Returns a pointer to a value within an array block. 93 + * 94 + * index - The index into _this_ specific block. 
95 + */ 96 + static void *element_at(struct dm_array_info *info, struct array_block *ab, 97 + unsigned index) 98 + { 99 + unsigned char *entry = (unsigned char *) (ab + 1); 100 + 101 + entry += index * info->value_type.size; 102 + 103 + return entry; 104 + } 105 + 106 + /* 107 + * Utility function that calls one of the value_type methods on every value 108 + * in an array block. 109 + */ 110 + static void on_entries(struct dm_array_info *info, struct array_block *ab, 111 + void (*fn)(void *, const void *)) 112 + { 113 + unsigned i, nr_entries = le32_to_cpu(ab->nr_entries); 114 + 115 + for (i = 0; i < nr_entries; i++) 116 + fn(info->value_type.context, element_at(info, ab, i)); 117 + } 118 + 119 + /* 120 + * Increment every value in an array block. 121 + */ 122 + static void inc_ablock_entries(struct dm_array_info *info, struct array_block *ab) 123 + { 124 + struct dm_btree_value_type *vt = &info->value_type; 125 + 126 + if (vt->inc) 127 + on_entries(info, ab, vt->inc); 128 + } 129 + 130 + /* 131 + * Decrement every value in an array block. 132 + */ 133 + static void dec_ablock_entries(struct dm_array_info *info, struct array_block *ab) 134 + { 135 + struct dm_btree_value_type *vt = &info->value_type; 136 + 137 + if (vt->dec) 138 + on_entries(info, ab, vt->dec); 139 + } 140 + 141 + /* 142 + * Each array block can hold this many values. 143 + */ 144 + static uint32_t calc_max_entries(size_t value_size, size_t size_of_block) 145 + { 146 + return (size_of_block - sizeof(struct array_block)) / value_size; 147 + } 148 + 149 + /* 150 + * Allocate a new array block. The caller will need to unlock block. 
151 + */ 152 + static int alloc_ablock(struct dm_array_info *info, size_t size_of_block, 153 + uint32_t max_entries, 154 + struct dm_block **block, struct array_block **ab) 155 + { 156 + int r; 157 + 158 + r = dm_tm_new_block(info->btree_info.tm, &array_validator, block); 159 + if (r) 160 + return r; 161 + 162 + (*ab) = dm_block_data(*block); 163 + (*ab)->max_entries = cpu_to_le32(max_entries); 164 + (*ab)->nr_entries = cpu_to_le32(0); 165 + (*ab)->value_size = cpu_to_le32(info->value_type.size); 166 + 167 + return 0; 168 + } 169 + 170 + /* 171 + * Pad an array block out with a particular value. Every instance will 172 + * cause an increment of the value_type. new_nr must always be more than 173 + * the current number of entries. 174 + */ 175 + static void fill_ablock(struct dm_array_info *info, struct array_block *ab, 176 + const void *value, unsigned new_nr) 177 + { 178 + unsigned i; 179 + uint32_t nr_entries; 180 + struct dm_btree_value_type *vt = &info->value_type; 181 + 182 + BUG_ON(new_nr > le32_to_cpu(ab->max_entries)); 183 + BUG_ON(new_nr < le32_to_cpu(ab->nr_entries)); 184 + 185 + nr_entries = le32_to_cpu(ab->nr_entries); 186 + for (i = nr_entries; i < new_nr; i++) { 187 + if (vt->inc) 188 + vt->inc(vt->context, value); 189 + memcpy(element_at(info, ab, i), value, vt->size); 190 + } 191 + ab->nr_entries = cpu_to_le32(new_nr); 192 + } 193 + 194 + /* 195 + * Remove some entries from the back of an array block. Every value 196 + * removed will be decremented. new_nr must be <= the current number of 197 + * entries. 
198 + */ 199 + static void trim_ablock(struct dm_array_info *info, struct array_block *ab, 200 + unsigned new_nr) 201 + { 202 + unsigned i; 203 + uint32_t nr_entries; 204 + struct dm_btree_value_type *vt = &info->value_type; 205 + 206 + BUG_ON(new_nr > le32_to_cpu(ab->max_entries)); 207 + BUG_ON(new_nr > le32_to_cpu(ab->nr_entries)); 208 + 209 + nr_entries = le32_to_cpu(ab->nr_entries); 210 + for (i = nr_entries; i > new_nr; i--) 211 + if (vt->dec) 212 + vt->dec(vt->context, element_at(info, ab, i - 1)); 213 + ab->nr_entries = cpu_to_le32(new_nr); 214 + } 215 + 216 + /* 217 + * Read locks a block, and coerces it to an array block. The caller must 218 + * unlock 'block' when finished. 219 + */ 220 + static int get_ablock(struct dm_array_info *info, dm_block_t b, 221 + struct dm_block **block, struct array_block **ab) 222 + { 223 + int r; 224 + 225 + r = dm_tm_read_lock(info->btree_info.tm, b, &array_validator, block); 226 + if (r) 227 + return r; 228 + 229 + *ab = dm_block_data(*block); 230 + return 0; 231 + } 232 + 233 + /* 234 + * Unlocks an array block. 235 + */ 236 + static int unlock_ablock(struct dm_array_info *info, struct dm_block *block) 237 + { 238 + return dm_tm_unlock(info->btree_info.tm, block); 239 + } 240 + 241 + /*----------------------------------------------------------------*/ 242 + 243 + /* 244 + * Btree manipulation. 245 + */ 246 + 247 + /* 248 + * Looks up an array block in the btree, and then read locks it. 249 + * 250 + * index is the index of the array_block (ie. the array index 251 + * / max_entries).
252 + */ 253 + static int lookup_ablock(struct dm_array_info *info, dm_block_t root, 254 + unsigned index, struct dm_block **block, 255 + struct array_block **ab) 256 + { 257 + int r; 258 + uint64_t key = index; 259 + __le64 block_le; 260 + 261 + r = dm_btree_lookup(&info->btree_info, root, &key, &block_le); 262 + if (r) 263 + return r; 264 + 265 + return get_ablock(info, le64_to_cpu(block_le), block, ab); 266 + } 267 + 268 + /* 269 + * Insert an array block into the btree. The block is _not_ unlocked. 270 + */ 271 + static int insert_ablock(struct dm_array_info *info, uint64_t index, 272 + struct dm_block *block, dm_block_t *root) 273 + { 274 + __le64 block_le = cpu_to_le64(dm_block_location(block)); 275 + 276 + __dm_bless_for_disk(block_le); 277 + return dm_btree_insert(&info->btree_info, *root, &index, &block_le, root); 278 + } 279 + 280 + /* 281 + * Looks up an array block in the btree. Then shadows it, and updates the 282 + * btree to point to this new shadow. 'root' is an input/output parameter 283 + * for both the current root block, and the new one. 284 + */ 285 + static int shadow_ablock(struct dm_array_info *info, dm_block_t *root, 286 + unsigned index, struct dm_block **block, 287 + struct array_block **ab) 288 + { 289 + int r, inc; 290 + uint64_t key = index; 291 + dm_block_t b; 292 + __le64 block_le; 293 + 294 + /* 295 + * lookup 296 + */ 297 + r = dm_btree_lookup(&info->btree_info, *root, &key, &block_le); 298 + if (r) 299 + return r; 300 + b = le64_to_cpu(block_le); 301 + 302 + /* 303 + * shadow 304 + */ 305 + r = dm_tm_shadow_block(info->btree_info.tm, b, 306 + &array_validator, block, &inc); 307 + if (r) 308 + return r; 309 + 310 + *ab = dm_block_data(*block); 311 + if (inc) 312 + inc_ablock_entries(info, *ab); 313 + 314 + /* 315 + * Reinsert. 316 + * 317 + * The shadow op will often be a noop. Only insert if it really 318 + * copied data. 
319 + */ 320 + if (dm_block_location(*block) != b) 321 + r = insert_ablock(info, index, *block, root); 322 + 323 + return r; 324 + } 325 + 326 + /* 327 + * Allocate a new array block, and fill it with some values. 328 + */ 329 + static int insert_new_ablock(struct dm_array_info *info, size_t size_of_block, 330 + uint32_t max_entries, 331 + unsigned block_index, uint32_t nr, 332 + const void *value, dm_block_t *root) 333 + { 334 + int r; 335 + struct dm_block *block; 336 + struct array_block *ab; 337 + 338 + r = alloc_ablock(info, size_of_block, max_entries, &block, &ab); 339 + if (r) 340 + return r; 341 + 342 + fill_ablock(info, ab, value, nr); 343 + r = insert_ablock(info, block_index, block, root); 344 + unlock_ablock(info, block); 345 + 346 + return r; 347 + } 348 + 349 + static int insert_full_ablocks(struct dm_array_info *info, size_t size_of_block, 350 + unsigned begin_block, unsigned end_block, 351 + unsigned max_entries, const void *value, 352 + dm_block_t *root) 353 + { 354 + int r = 0; 355 + 356 + for (; !r && begin_block != end_block; begin_block++) 357 + r = insert_new_ablock(info, size_of_block, max_entries, begin_block, max_entries, value, root); 358 + 359 + return r; 360 + } 361 + 362 + /* 363 + * There are a bunch of functions involved with resizing an array. This 364 + * structure holds information that is commonly needed by them. Purely here 365 + * to reduce parameter count. 366 + */ 367 + struct resize { 368 + /* 369 + * Describes the array. 370 + */ 371 + struct dm_array_info *info; 372 + 373 + /* 374 + * The current root of the array. This gets updated. 375 + */ 376 + dm_block_t root; 377 + 378 + /* 379 + * Metadata block size. Used to calculate the nr entries in an 380 + * array block. 381 + */ 382 + size_t size_of_block; 383 + 384 + /* 385 + * Maximum nr entries in an array block. 386 + */ 387 + unsigned max_entries; 388 + 389 + /* 390 + * nr of completely full blocks in the array.
391 + * 392 + * 'old' refers to before the resize, 'new' after. 393 + */ 394 + unsigned old_nr_full_blocks, new_nr_full_blocks; 395 + 396 + /* 397 + * Number of entries in the final block. 0 iff only full blocks in 398 + * the array. 399 + */ 400 + unsigned old_nr_entries_in_last_block, new_nr_entries_in_last_block; 401 + 402 + /* 403 + * The default value used when growing the array. 404 + */ 405 + const void *value; 406 + }; 407 + 408 + /* 409 + * Removes a consecutive set of array blocks from the btree. The values 410 + * in the blocks are decremented as a side effect of the btree remove. 411 + * 412 + * begin_index - the index of the first array block to remove. 413 + * end_index - the one-past-the-end value. ie. this block is not removed. 414 + */ 415 + static int drop_blocks(struct resize *resize, unsigned begin_index, 416 + unsigned end_index) 417 + { 418 + int r; 419 + 420 + while (begin_index != end_index) { 421 + uint64_t key = begin_index++; 422 + r = dm_btree_remove(&resize->info->btree_info, resize->root, 423 + &key, &resize->root); 424 + if (r) 425 + return r; 426 + } 427 + 428 + return 0; 429 + } 430 + 431 + /* 432 + * Calculates how many blocks are needed for the array. 433 + */ 434 + static unsigned total_nr_blocks_needed(unsigned nr_full_blocks, 435 + unsigned nr_entries_in_last_block) 436 + { 437 + return nr_full_blocks + (nr_entries_in_last_block ? 1 : 0); 438 + } 439 + 440 + /* 441 + * Shrink an array. 442 + */ 443 + static int shrink(struct resize *resize) 444 + { 445 + int r; 446 + unsigned begin, end; 447 + struct dm_block *block; 448 + struct array_block *ab; 449 + 450 + /* 451 + * Lose some blocks from the back?
452 + */ 453 + if (resize->new_nr_full_blocks < resize->old_nr_full_blocks) { 454 + begin = total_nr_blocks_needed(resize->new_nr_full_blocks, 455 + resize->new_nr_entries_in_last_block); 456 + end = total_nr_blocks_needed(resize->old_nr_full_blocks, 457 + resize->old_nr_entries_in_last_block); 458 + 459 + r = drop_blocks(resize, begin, end); 460 + if (r) 461 + return r; 462 + } 463 + 464 + /* 465 + * Trim the new tail block 466 + */ 467 + if (resize->new_nr_entries_in_last_block) { 468 + r = shadow_ablock(resize->info, &resize->root, 469 + resize->new_nr_full_blocks, &block, &ab); 470 + if (r) 471 + return r; 472 + 473 + trim_ablock(resize->info, ab, resize->new_nr_entries_in_last_block); 474 + unlock_ablock(resize->info, block); 475 + } 476 + 477 + return 0; 478 + } 479 + 480 + /* 481 + * Grow an array. 482 + */ 483 + static int grow_extend_tail_block(struct resize *resize, uint32_t new_nr_entries) 484 + { 485 + int r; 486 + struct dm_block *block; 487 + struct array_block *ab; 488 + 489 + r = shadow_ablock(resize->info, &resize->root, 490 + resize->old_nr_full_blocks, &block, &ab); 491 + if (r) 492 + return r; 493 + 494 + fill_ablock(resize->info, ab, resize->value, new_nr_entries); 495 + unlock_ablock(resize->info, block); 496 + 497 + return r; 498 + } 499 + 500 + static int grow_add_tail_block(struct resize *resize) 501 + { 502 + return insert_new_ablock(resize->info, resize->size_of_block, 503 + resize->max_entries, 504 + resize->new_nr_full_blocks, 505 + resize->new_nr_entries_in_last_block, 506 + resize->value, &resize->root); 507 + } 508 + 509 + static int grow_needs_more_blocks(struct resize *resize) 510 + { 511 + int r; 512 + 513 + if (resize->old_nr_entries_in_last_block > 0) { 514 + r = grow_extend_tail_block(resize, resize->max_entries); 515 + if (r) 516 + return r; 517 + } 518 + 519 + r = insert_full_ablocks(resize->info, resize->size_of_block, 520 + resize->old_nr_full_blocks, 521 + resize->new_nr_full_blocks, 522 + resize->max_entries, 
resize->value, 523 + &resize->root); 524 + if (r) 525 + return r; 526 + 527 + if (resize->new_nr_entries_in_last_block) 528 + r = grow_add_tail_block(resize); 529 + 530 + return r; 531 + } 532 + 533 + static int grow(struct resize *resize) 534 + { 535 + if (resize->new_nr_full_blocks > resize->old_nr_full_blocks) 536 + return grow_needs_more_blocks(resize); 537 + 538 + else if (resize->old_nr_entries_in_last_block) 539 + return grow_extend_tail_block(resize, resize->new_nr_entries_in_last_block); 540 + 541 + else 542 + return grow_add_tail_block(resize); 543 + } 544 + 545 + /*----------------------------------------------------------------*/ 546 + 547 + /* 548 + * These are the value_type functions for the btree elements, which point 549 + * to array blocks. 550 + */ 551 + static void block_inc(void *context, const void *value) 552 + { 553 + __le64 block_le; 554 + struct dm_array_info *info = context; 555 + 556 + memcpy(&block_le, value, sizeof(block_le)); 557 + dm_tm_inc(info->btree_info.tm, le64_to_cpu(block_le)); 558 + } 559 + 560 + static void block_dec(void *context, const void *value) 561 + { 562 + int r; 563 + uint64_t b; 564 + __le64 block_le; 565 + uint32_t ref_count; 566 + struct dm_block *block; 567 + struct array_block *ab; 568 + struct dm_array_info *info = context; 569 + 570 + memcpy(&block_le, value, sizeof(block_le)); 571 + b = le64_to_cpu(block_le); 572 + 573 + r = dm_tm_ref(info->btree_info.tm, b, &ref_count); 574 + if (r) { 575 + DMERR_LIMIT("couldn't get reference count for block %llu", 576 + (unsigned long long) b); 577 + return; 578 + } 579 + 580 + if (ref_count == 1) { 581 + /* 582 + * We're about to drop the last reference to this ablock. 583 + * So we need to decrement the ref count of the contents. 
584 + */ 585 + r = get_ablock(info, b, &block, &ab); 586 + if (r) { 587 + DMERR_LIMIT("couldn't get array block %llu", 588 + (unsigned long long) b); 589 + return; 590 + } 591 + 592 + dec_ablock_entries(info, ab); 593 + unlock_ablock(info, block); 594 + } 595 + 596 + dm_tm_dec(info->btree_info.tm, b); 597 + } 598 + 599 + static int block_equal(void *context, const void *value1, const void *value2) 600 + { 601 + return !memcmp(value1, value2, sizeof(__le64)); 602 + } 603 + 604 + /*----------------------------------------------------------------*/ 605 + 606 + void dm_array_info_init(struct dm_array_info *info, 607 + struct dm_transaction_manager *tm, 608 + struct dm_btree_value_type *vt) 609 + { 610 + struct dm_btree_value_type *bvt = &info->btree_info.value_type; 611 + 612 + memcpy(&info->value_type, vt, sizeof(info->value_type)); 613 + info->btree_info.tm = tm; 614 + info->btree_info.levels = 1; 615 + 616 + bvt->context = info; 617 + bvt->size = sizeof(__le64); 618 + bvt->inc = block_inc; 619 + bvt->dec = block_dec; 620 + bvt->equal = block_equal; 621 + } 622 + EXPORT_SYMBOL_GPL(dm_array_info_init); 623 + 624 + int dm_array_empty(struct dm_array_info *info, dm_block_t *root) 625 + { 626 + return dm_btree_empty(&info->btree_info, root); 627 + } 628 + EXPORT_SYMBOL_GPL(dm_array_empty); 629 + 630 + static int array_resize(struct dm_array_info *info, dm_block_t root, 631 + uint32_t old_size, uint32_t new_size, 632 + const void *value, dm_block_t *new_root) 633 + { 634 + int r; 635 + struct resize resize; 636 + 637 + if (old_size == new_size) 638 + return 0; 639 + 640 + resize.info = info; 641 + resize.root = root; 642 + resize.size_of_block = dm_bm_block_size(dm_tm_get_bm(info->btree_info.tm)); 643 + resize.max_entries = calc_max_entries(info->value_type.size, 644 + resize.size_of_block); 645 + 646 + resize.old_nr_full_blocks = old_size / resize.max_entries; 647 + resize.old_nr_entries_in_last_block = old_size % resize.max_entries; 648 + resize.new_nr_full_blocks = 
new_size / resize.max_entries; 649 + resize.new_nr_entries_in_last_block = new_size % resize.max_entries; 650 + resize.value = value; 651 + 652 + r = ((new_size > old_size) ? grow : shrink)(&resize); 653 + if (r) 654 + return r; 655 + 656 + *new_root = resize.root; 657 + return 0; 658 + } 659 + 660 + int dm_array_resize(struct dm_array_info *info, dm_block_t root, 661 + uint32_t old_size, uint32_t new_size, 662 + const void *value, dm_block_t *new_root) 663 + __dm_written_to_disk(value) 664 + { 665 + int r = array_resize(info, root, old_size, new_size, value, new_root); 666 + __dm_unbless_for_disk(value); 667 + return r; 668 + } 669 + EXPORT_SYMBOL_GPL(dm_array_resize); 670 + 671 + int dm_array_del(struct dm_array_info *info, dm_block_t root) 672 + { 673 + return dm_btree_del(&info->btree_info, root); 674 + } 675 + EXPORT_SYMBOL_GPL(dm_array_del); 676 + 677 + int dm_array_get_value(struct dm_array_info *info, dm_block_t root, 678 + uint32_t index, void *value_le) 679 + { 680 + int r; 681 + struct dm_block *block; 682 + struct array_block *ab; 683 + size_t size_of_block; 684 + unsigned entry, max_entries; 685 + 686 + size_of_block = dm_bm_block_size(dm_tm_get_bm(info->btree_info.tm)); 687 + max_entries = calc_max_entries(info->value_type.size, size_of_block); 688 + 689 + r = lookup_ablock(info, root, index / max_entries, &block, &ab); 690 + if (r) 691 + return r; 692 + 693 + entry = index % max_entries; 694 + if (entry >= le32_to_cpu(ab->nr_entries)) 695 + r = -ENODATA; 696 + else 697 + memcpy(value_le, element_at(info, ab, entry), 698 + info->value_type.size); 699 + 700 + unlock_ablock(info, block); 701 + return r; 702 + } 703 + EXPORT_SYMBOL_GPL(dm_array_get_value); 704 + 705 + static int array_set_value(struct dm_array_info *info, dm_block_t root, 706 + uint32_t index, const void *value, dm_block_t *new_root) 707 + { 708 + int r; 709 + struct dm_block *block; 710 + struct array_block *ab; 711 + size_t size_of_block; 712 + unsigned max_entries; 713 + unsigned 
entry; 714 + void *old_value; 715 + struct dm_btree_value_type *vt = &info->value_type; 716 + 717 + size_of_block = dm_bm_block_size(dm_tm_get_bm(info->btree_info.tm)); 718 + max_entries = calc_max_entries(info->value_type.size, size_of_block); 719 + 720 + r = shadow_ablock(info, &root, index / max_entries, &block, &ab); 721 + if (r) 722 + return r; 723 + *new_root = root; 724 + 725 + entry = index % max_entries; 726 + if (entry >= le32_to_cpu(ab->nr_entries)) { 727 + r = -ENODATA; 728 + goto out; 729 + } 730 + 731 + old_value = element_at(info, ab, entry); 732 + if (vt->dec && 733 + (!vt->equal || !vt->equal(vt->context, old_value, value))) { 734 + vt->dec(vt->context, old_value); 735 + if (vt->inc) 736 + vt->inc(vt->context, value); 737 + } 738 + 739 + memcpy(old_value, value, info->value_type.size); 740 + 741 + out: 742 + unlock_ablock(info, block); 743 + return r; 744 + } 745 + 746 + int dm_array_set_value(struct dm_array_info *info, dm_block_t root, 747 + uint32_t index, const void *value, dm_block_t *new_root) 748 + __dm_written_to_disk(value) 749 + { 750 + int r; 751 + 752 + r = array_set_value(info, root, index, value, new_root); 753 + __dm_unbless_for_disk(value); 754 + return r; 755 + } 756 + EXPORT_SYMBOL_GPL(dm_array_set_value); 757 + 758 + struct walk_info { 759 + struct dm_array_info *info; 760 + int (*fn)(void *context, uint64_t key, void *leaf); 761 + void *context; 762 + }; 763 + 764 + static int walk_ablock(void *context, uint64_t *keys, void *leaf) 765 + { 766 + struct walk_info *wi = context; 767 + 768 + int r; 769 + unsigned i; 770 + __le64 block_le; 771 + unsigned nr_entries, max_entries; 772 + struct dm_block *block; 773 + struct array_block *ab; 774 + 775 + memcpy(&block_le, leaf, sizeof(block_le)); 776 + r = get_ablock(wi->info, le64_to_cpu(block_le), &block, &ab); 777 + if (r) 778 + return r; 779 + 780 + max_entries = le32_to_cpu(ab->max_entries); 781 + nr_entries = le32_to_cpu(ab->nr_entries); 782 + for (i = 0; i < nr_entries; i++) { 783 
+ r = wi->fn(wi->context, keys[0] * max_entries + i, 784 + element_at(wi->info, ab, i)); 785 + 786 + if (r) 787 + break; 788 + } 789 + 790 + unlock_ablock(wi->info, block); 791 + return r; 792 + } 793 + 794 + int dm_array_walk(struct dm_array_info *info, dm_block_t root, 795 + int (*fn)(void *, uint64_t key, void *leaf), 796 + void *context) 797 + { 798 + struct walk_info wi; 799 + 800 + wi.info = info; 801 + wi.fn = fn; 802 + wi.context = context; 803 + 804 + return dm_btree_walk(&info->btree_info, root, walk_ablock, &wi); 805 + } 806 + EXPORT_SYMBOL_GPL(dm_array_walk); 807 + 808 + /*----------------------------------------------------------------*/
+166
drivers/md/persistent-data/dm-array.h
··· 1 + /* 2 + * Copyright (C) 2012 Red Hat, Inc. 3 + * 4 + * This file is released under the GPL. 5 + */ 6 + #ifndef _LINUX_DM_ARRAY_H 7 + #define _LINUX_DM_ARRAY_H 8 + 9 + #include "dm-btree.h" 10 + 11 + /*----------------------------------------------------------------*/ 12 + 13 + /* 14 + * The dm-array is a persistent version of an array. It packs the data 15 + * more efficiently than a btree, which will result in less disk space use, 16 + * and a performance boost. The element get and set operations are still 17 + * O(ln(n)), but with a much smaller constant. 18 + * 19 + * The value type structure is reused from the btree type to support proper 20 + * reference counting of values. 21 + * 22 + * The arrays implicitly know their length, and bounds are checked for 23 + * lookups and updates. It doesn't store this in an accessible place 24 + * because it would waste a whole metadata block. Make sure you store the 25 + * size along with the array root in your encompassing data. 26 + * 27 + * Array entries are indexed via an unsigned integer starting from zero. 28 + * Arrays are not sparse; if you resize an array to have 'n' entries then 29 + * 'n - 1' will be the last valid index. 30 + * 31 + * Typical use: 32 + * 33 + * a) Initialise a dm_array_info structure. This describes the array 34 + * values and ties it into a specific transaction manager. It holds no 35 + * instance data; the same info can be used for many similar arrays if 36 + * you wish. 37 + * 38 + * b) Get yourself a root. The root is the index of a block of data on the 39 + * disk that holds a particular instance of an array. You may have a 40 + * pre-existing root in your metadata that you wish to use, or you may 41 + * want to create a brand new, empty array with dm_array_empty(). 42 + * 43 + * Like the other data structures in this library, dm_array objects are 44 + * immutable between transactions. Update functions will return you the 45 + * root for a _new_ array.
If you've incremented the old root, via 46 + * dm_tm_inc(), before calling the update function, you may continue to use 47 + * it in parallel with the new root. 48 + * 49 + * c) Resize an array with dm_array_resize(). 50 + * 51 + * d) Get a value from the array with dm_array_get_value(). 52 + * 53 + * e) Set a value in the array with dm_array_set_value(). 54 + * 55 + * f) Walk an array of values in index order with dm_array_walk(). More 56 + * efficient than making many calls to dm_array_get_value(). 57 + * 58 + * g) Destroy the array with dm_array_del(). This tells the transaction 59 + * manager that you're no longer using this data structure so it can 60 + * recycle its blocks. (dm_array_dec() would be a better name for it, 61 + * but del is in keeping with dm_btree_del()). 62 + */ 63 + 64 + /* 65 + * Describes an array. Don't initialise this structure yourself, use the 66 + * init function below. 67 + */ 68 + struct dm_array_info { 69 + struct dm_transaction_manager *tm; 70 + struct dm_btree_value_type value_type; 71 + struct dm_btree_info btree_info; 72 + }; 73 + 74 + /* 75 + * Sets up a dm_array_info structure. You don't need to do anything with 76 + * this structure when you finish using it. 77 + * 78 + * info - the structure being filled in. 79 + * tm - the transaction manager that should supervise this structure. 80 + * vt - describes the leaf values. 81 + */ 82 + void dm_array_info_init(struct dm_array_info *info, 83 + struct dm_transaction_manager *tm, 84 + struct dm_btree_value_type *vt); 85 + 86 + /* 87 + * Create an empty, zero length array. 88 + * 89 + * info - describes the array 90 + * root - on success this will be filled out with the root block 91 + */ 92 + int dm_array_empty(struct dm_array_info *info, dm_block_t *root); 93 + 94 + /* 95 + * Resizes the array.
96 + * 97 + * info - describes the array 98 + * root - the root block of the array on disk 99 + * old_size - the caller is responsible for remembering the size of 100 + * the array 101 + * new_size - can be bigger or smaller than old_size 102 + * value - if we're growing the array, the new entries will have this value 103 + * new_root - on success, points to the new root block 104 + * 105 + * If growing, the inc function for 'value' will be called the appropriate 106 + * number of times. So if the caller is holding a reference, they may want 107 + * to drop it. 108 + */ 109 + int dm_array_resize(struct dm_array_info *info, dm_block_t root, 110 + uint32_t old_size, uint32_t new_size, 111 + const void *value, dm_block_t *new_root) 112 + __dm_written_to_disk(value); 113 + 114 + /* 115 + * Frees a whole array. The value_type's decrement operation will be called 116 + * for all values in the array. 117 + */ 118 + int dm_array_del(struct dm_array_info *info, dm_block_t root); 119 + 120 + /* 121 + * Lookup a value in the array. 122 + * 123 + * info - describes the array 124 + * root - root block of the array 125 + * index - array index 126 + * value - the value to be read. Will be in on-disk format of course. 127 + * 128 + * -ENODATA will be returned if the index is out of bounds. 129 + */ 130 + int dm_array_get_value(struct dm_array_info *info, dm_block_t root, 131 + uint32_t index, void *value); 132 + 133 + /* 134 + * Set an entry in the array. 135 + * 136 + * info - describes the array 137 + * root - root block of the array 138 + * index - array index 139 + * value - value to be written to disk. Make sure you confirm the value is 140 + * in on-disk format with __dm_bless_for_disk() before calling. 141 + * new_root - the new root block 142 + * 143 + * The old value being overwritten will be decremented, the new value 144 + * incremented. 145 + * 146 + * -ENODATA will be returned if the index is out of bounds.
147 + */ 148 + int dm_array_set_value(struct dm_array_info *info, dm_block_t root, 149 + uint32_t index, const void *value, dm_block_t *new_root) 150 + __dm_written_to_disk(value); 151 + 152 + /* 153 + * Walk through all the entries in an array. 154 + * 155 + * info - describes the array 156 + * root - root block of the array 157 + * fn - called back for every element 158 + * context - passed to the callback 159 + */ 160 + int dm_array_walk(struct dm_array_info *info, dm_block_t root, 161 + int (*fn)(void *context, uint64_t key, void *leaf), 162 + void *context); 163 + 164 + /*----------------------------------------------------------------*/ 165 + 166 + #endif /* _LINUX_DM_ARRAY_H */
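[Editor's note] The "Typical use" steps in the comment above can be sketched as caller code. This is an illustration, not buildable on its own: error unwinding is trimmed, 'tm' is an existing transaction manager, and 'my_vt' is a hypothetical dm_btree_value_type for __le64 values; only the dm_array_* calls and their signatures come from this header.

```c
#include "dm-array.h"

static int array_example(struct dm_transaction_manager *tm,
			 struct dm_btree_value_type *my_vt)
{
	struct dm_array_info info;
	dm_block_t root, new_root;
	__le64 zero = cpu_to_le64(0), v;
	int r;

	dm_array_info_init(&info, tm, my_vt);		/* step a */

	r = dm_array_empty(&info, &root);		/* step b */
	if (r)
		return r;

	/* step c: grow to 100 entries, all initialised to 'zero' */
	__dm_bless_for_disk(&zero);
	r = dm_array_resize(&info, root, 0, 100, &zero, &new_root);
	if (r)
		return r;
	root = new_root;

	/* step e: write one entry; the update returns a new root */
	v = cpu_to_le64(42);
	__dm_bless_for_disk(&v);
	r = dm_array_set_value(&info, root, 7, &v, &new_root);
	if (r)
		return r;
	root = new_root;

	/* step d: read it back (value arrives in on-disk format) */
	r = dm_array_get_value(&info, root, 7, &v);
	if (r)
		return r;

	return dm_array_del(&info, root);		/* step g */
}
```

Note how every update threads the root through: the old root stays valid for the previous transaction's view, the returned root names the new one.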
+163
drivers/md/persistent-data/dm-bitset.c
··· 1 + /* 2 + * Copyright (C) 2012 Red Hat, Inc. 3 + * 4 + * This file is released under the GPL. 5 + */ 6 + 7 + #include "dm-bitset.h" 8 + #include "dm-transaction-manager.h" 9 + 10 + #include <linux/export.h> 11 + #include <linux/device-mapper.h> 12 + 13 + #define DM_MSG_PREFIX "bitset" 14 + #define BITS_PER_ARRAY_ENTRY 64 15 + 16 + /*----------------------------------------------------------------*/ 17 + 18 + static struct dm_btree_value_type bitset_bvt = { 19 + .context = NULL, 20 + .size = sizeof(__le64), 21 + .inc = NULL, 22 + .dec = NULL, 23 + .equal = NULL, 24 + }; 25 + 26 + /*----------------------------------------------------------------*/ 27 + 28 + void dm_disk_bitset_init(struct dm_transaction_manager *tm, 29 + struct dm_disk_bitset *info) 30 + { 31 + dm_array_info_init(&info->array_info, tm, &bitset_bvt); 32 + info->current_index_set = false; 33 + } 34 + EXPORT_SYMBOL_GPL(dm_disk_bitset_init); 35 + 36 + int dm_bitset_empty(struct dm_disk_bitset *info, dm_block_t *root) 37 + { 38 + return dm_array_empty(&info->array_info, root); 39 + } 40 + EXPORT_SYMBOL_GPL(dm_bitset_empty); 41 + 42 + int dm_bitset_resize(struct dm_disk_bitset *info, dm_block_t root, 43 + uint32_t old_nr_entries, uint32_t new_nr_entries, 44 + bool default_value, dm_block_t *new_root) 45 + { 46 + uint32_t old_blocks = dm_div_up(old_nr_entries, BITS_PER_ARRAY_ENTRY); 47 + uint32_t new_blocks = dm_div_up(new_nr_entries, BITS_PER_ARRAY_ENTRY); 48 + __le64 value = default_value ? 
cpu_to_le64(~0) : cpu_to_le64(0); 49 + 50 + __dm_bless_for_disk(&value); 51 + return dm_array_resize(&info->array_info, root, old_blocks, new_blocks, 52 + &value, new_root); 53 + } 54 + EXPORT_SYMBOL_GPL(dm_bitset_resize); 55 + 56 + int dm_bitset_del(struct dm_disk_bitset *info, dm_block_t root) 57 + { 58 + return dm_array_del(&info->array_info, root); 59 + } 60 + EXPORT_SYMBOL_GPL(dm_bitset_del); 61 + 62 + int dm_bitset_flush(struct dm_disk_bitset *info, dm_block_t root, 63 + dm_block_t *new_root) 64 + { 65 + int r; 66 + __le64 value; 67 + 68 + if (!info->current_index_set) 69 + return 0; 70 + 71 + value = cpu_to_le64(info->current_bits); 72 + 73 + __dm_bless_for_disk(&value); 74 + r = dm_array_set_value(&info->array_info, root, info->current_index, 75 + &value, new_root); 76 + if (r) 77 + return r; 78 + 79 + info->current_index_set = false; 80 + return 0; 81 + } 82 + EXPORT_SYMBOL_GPL(dm_bitset_flush); 83 + 84 + static int read_bits(struct dm_disk_bitset *info, dm_block_t root, 85 + uint32_t array_index) 86 + { 87 + int r; 88 + __le64 value; 89 + 90 + r = dm_array_get_value(&info->array_info, root, array_index, &value); 91 + if (r) 92 + return r; 93 + 94 + info->current_bits = le64_to_cpu(value); 95 + info->current_index_set = true; 96 + info->current_index = array_index; 97 + return 0; 98 + } 99 + 100 + static int get_array_entry(struct dm_disk_bitset *info, dm_block_t root, 101 + uint32_t index, dm_block_t *new_root) 102 + { 103 + int r; 104 + unsigned array_index = index / BITS_PER_ARRAY_ENTRY; 105 + 106 + if (info->current_index_set) { 107 + if (info->current_index == array_index) 108 + return 0; 109 + 110 + r = dm_bitset_flush(info, root, new_root); 111 + if (r) 112 + return r; 113 + } 114 + 115 + return read_bits(info, root, array_index); 116 + } 117 + 118 + int dm_bitset_set_bit(struct dm_disk_bitset *info, dm_block_t root, 119 + uint32_t index, dm_block_t *new_root) 120 + { 121 + int r; 122 + unsigned b = index % BITS_PER_ARRAY_ENTRY; 123 + 124 + r = 
get_array_entry(info, root, index, new_root); 125 + if (r) 126 + return r; 127 + 128 + set_bit(b, (unsigned long *) &info->current_bits); 129 + return 0; 130 + } 131 + EXPORT_SYMBOL_GPL(dm_bitset_set_bit); 132 + 133 + int dm_bitset_clear_bit(struct dm_disk_bitset *info, dm_block_t root, 134 + uint32_t index, dm_block_t *new_root) 135 + { 136 + int r; 137 + unsigned b = index % BITS_PER_ARRAY_ENTRY; 138 + 139 + r = get_array_entry(info, root, index, new_root); 140 + if (r) 141 + return r; 142 + 143 + clear_bit(b, (unsigned long *) &info->current_bits); 144 + return 0; 145 + } 146 + EXPORT_SYMBOL_GPL(dm_bitset_clear_bit); 147 + 148 + int dm_bitset_test_bit(struct dm_disk_bitset *info, dm_block_t root, 149 + uint32_t index, dm_block_t *new_root, bool *result) 150 + { 151 + int r; 152 + unsigned b = index % BITS_PER_ARRAY_ENTRY; 153 + 154 + r = get_array_entry(info, root, index, new_root); 155 + if (r) 156 + return r; 157 + 158 + *result = test_bit(b, (unsigned long *) &info->current_bits); 159 + return 0; 160 + } 161 + EXPORT_SYMBOL_GPL(dm_bitset_test_bit); 162 + 163 + /*----------------------------------------------------------------*/
+165
drivers/md/persistent-data/dm-bitset.h
··· 1 + /* 2 + * Copyright (C) 2012 Red Hat, Inc. 3 + * 4 + * This file is released under the GPL. 5 + */ 6 + #ifndef _LINUX_DM_BITSET_H 7 + #define _LINUX_DM_BITSET_H 8 + 9 + #include "dm-array.h" 10 + 11 + /*----------------------------------------------------------------*/ 12 + 13 + /* 14 + * This bitset type is a thin wrapper round a dm_array of 64-bit words. It 15 + * uses a tiny, one word cache to reduce the number of array lookups and so 16 + * increase performance. 17 + * 18 + * Like the dm-array that it's based on, the caller needs to keep track of 19 + * the size of the bitset separately. The underlying dm-array implicitly 20 + * knows how many words it's storing and will return -ENODATA if you try 21 + * and access an out of bounds word. However, an out of bounds bit in the 22 + * final word will _not_ be detected; you have been warned. 23 + * 24 + * Bits are indexed from zero. 25 + * 26 + * Typical use: 27 + * 28 + * a) Initialise a dm_disk_bitset structure with dm_disk_bitset_init(). 29 + * This describes the bitset and includes the cache. It's not called 30 + * dm_bitset_info, in line with the other data structures, because it does 31 + * include instance data. 32 + * 33 + * b) Get yourself a root. The root is the index of a block of data on the 34 + * disk that holds a particular instance of a bitset. You may have a 35 + * pre-existing root in your metadata that you wish to use, or you may 36 + * want to create a brand new, empty bitset with dm_bitset_empty(). 37 + * 38 + * Like the other data structures in this library, dm_bitset objects are 39 + * immutable between transactions. Update functions will return you the 40 + * root for a _new_ bitset. If you've incremented the old root, via 41 + * dm_tm_inc(), before calling the update function, you may continue to use 42 + * it in parallel with the new root. 43 + * 44 + * Even read operations may trigger the cache to be flushed and, as such, 45 + * return a root for a new, updated bitset.
46 + * 47 + * c) Resize a bitset with dm_bitset_resize(). 48 + * 49 + * d) Set a bit with dm_bitset_set_bit(). 50 + * 51 + * e) Clear a bit with dm_bitset_clear_bit(). 52 + * 53 + * f) Test a bit with dm_bitset_test_bit(). 54 + * 55 + * g) Flush all updates from the cache with dm_bitset_flush(). 56 + * 57 + * h) Destroy the bitset with dm_bitset_del(). This tells the transaction 58 + * manager that you're no longer using this data structure so it can 59 + * recycle its blocks. (dm_bitset_dec() would be a better name for it, 60 + * but del is in keeping with dm_btree_del()). 61 + */ 62 + 63 + /* 64 + * Opaque object. Unlike dm_array_info, you should have one of these per 65 + * bitset. Initialise with dm_disk_bitset_init(). 66 + */ 67 + struct dm_disk_bitset { 68 + struct dm_array_info array_info; 69 + 70 + uint32_t current_index; 71 + uint64_t current_bits; 72 + 73 + bool current_index_set:1; 74 + }; 75 + 76 + /* 77 + * Sets up a dm_disk_bitset structure. You don't need to do anything with 78 + * this structure when you finish using it. 79 + * 80 + * tm - the transaction manager that should supervise this structure 81 + * info - the structure being initialised 82 + */ 83 + void dm_disk_bitset_init(struct dm_transaction_manager *tm, 84 + struct dm_disk_bitset *info); 85 + 86 + /* 87 + * Create an empty, zero length bitset. 88 + * 89 + * info - describes the bitset 90 + * new_root - on success, points to the new root block 91 + */ 92 + int dm_bitset_empty(struct dm_disk_bitset *info, dm_block_t *new_root); 93 + 94 + /* 95 + * Resize the bitset.
96 + * 97 + * info - describes the bitset 98 + * old_root - the root block of the array on disk 99 + * old_nr_entries - the number of bits in the old bitset 100 + * new_nr_entries - the number of bits you want in the new bitset 101 + * default_value - the value for any new bits 102 + * new_root - on success, points to the new root block 103 + */ 104 + int dm_bitset_resize(struct dm_disk_bitset *info, dm_block_t old_root, 105 + uint32_t old_nr_entries, uint32_t new_nr_entries, 106 + bool default_value, dm_block_t *new_root); 107 + 108 + /* 109 + * Frees the bitset. 110 + */ 111 + int dm_bitset_del(struct dm_disk_bitset *info, dm_block_t root); 112 + 113 + /* 114 + * Set a bit. 115 + * 116 + * info - describes the bitset 117 + * root - the root block of the bitset 118 + * index - the bit index 119 + * new_root - on success, points to the new root block 120 + * 121 + * -ENODATA will be returned if the index is out of bounds. 122 + */ 123 + int dm_bitset_set_bit(struct dm_disk_bitset *info, dm_block_t root, 124 + uint32_t index, dm_block_t *new_root); 125 + 126 + /* 127 + * Clears a bit. 128 + * 129 + * info - describes the bitset 130 + * root - the root block of the bitset 131 + * index - the bit index 132 + * new_root - on success, points to the new root block 133 + * 134 + * -ENODATA will be returned if the index is out of bounds. 135 + */ 136 + int dm_bitset_clear_bit(struct dm_disk_bitset *info, dm_block_t root, 137 + uint32_t index, dm_block_t *new_root); 138 + 139 + /* 140 + * Tests a bit. 141 + * 142 + * info - describes the bitset 143 + * root - the root block of the bitset 144 + * index - the bit index 145 + * new_root - on success, points to the new root block (cached values may have been written) 146 + * result - the bit value you're after 147 + * 148 + * -ENODATA will be returned if the index is out of bounds. 
149 + */ 150 + int dm_bitset_test_bit(struct dm_disk_bitset *info, dm_block_t root, 151 + uint32_t index, dm_block_t *new_root, bool *result); 152 + 153 + /* 154 + * Flush any cached changes to disk. 155 + * 156 + * info - describes the bitset 157 + * root - the root block of the bitset 158 + * new_root - on success, points to the new root block 159 + */ 160 + int dm_bitset_flush(struct dm_disk_bitset *info, dm_block_t root, 161 + dm_block_t *new_root); 162 + 163 + /*----------------------------------------------------------------*/ 164 + 165 + #endif /* _LINUX_DM_BITSET_H */
+1
drivers/md/persistent-data/dm-block-manager.c
··· 613 613 614 614 return dm_bufio_write_dirty_buffers(bm->bufio); 615 615 } 616 + EXPORT_SYMBOL_GPL(dm_bm_flush_and_unlock); 616 617 617 618 void dm_bm_set_read_only(struct dm_block_manager *bm) 618 619 {
+1
drivers/md/persistent-data/dm-btree-internal.h
··· 64 64 void init_ro_spine(struct ro_spine *s, struct dm_btree_info *info); 65 65 int exit_ro_spine(struct ro_spine *s); 66 66 int ro_step(struct ro_spine *s, dm_block_t new_child); 67 + void ro_pop(struct ro_spine *s); 67 68 struct btree_node *ro_node(struct ro_spine *s); 68 69 69 70 struct shadow_spine {
+7
drivers/md/persistent-data/dm-btree-spine.c
··· 164 164 return r; 165 165 } 166 166 167 + void ro_pop(struct ro_spine *s) 168 + { 169 + BUG_ON(!s->count); 170 + --s->count; 171 + unlock_block(s->info, s->nodes[s->count]); 172 + } 173 + 167 174 struct btree_node *ro_node(struct ro_spine *s) 168 175 { 169 176 struct dm_block *block;
+52
drivers/md/persistent-data/dm-btree.c
··· 807 807 return r ? r : count; 808 808 } 809 809 EXPORT_SYMBOL_GPL(dm_btree_find_highest_key); 810 + 811 + /* 812 + * FIXME: We shouldn't use a recursive algorithm when we have limited stack 813 + * space. Also this only works for single level trees. 814 + */ 815 + static int walk_node(struct ro_spine *s, dm_block_t block, 816 + int (*fn)(void *context, uint64_t *keys, void *leaf), 817 + void *context) 818 + { 819 + int r; 820 + unsigned i, nr; 821 + struct btree_node *n; 822 + uint64_t keys; 823 + 824 + r = ro_step(s, block); 825 + n = ro_node(s); 826 + 827 + nr = le32_to_cpu(n->header.nr_entries); 828 + for (i = 0; i < nr; i++) { 829 + if (le32_to_cpu(n->header.flags) & INTERNAL_NODE) { 830 + r = walk_node(s, value64(n, i), fn, context); 831 + if (r) 832 + goto out; 833 + } else { 834 + keys = le64_to_cpu(*key_ptr(n, i)); 835 + r = fn(context, &keys, value_ptr(n, i)); 836 + if (r) 837 + goto out; 838 + } 839 + } 840 + 841 + out: 842 + ro_pop(s); 843 + return r; 844 + } 845 + 846 + int dm_btree_walk(struct dm_btree_info *info, dm_block_t root, 847 + int (*fn)(void *context, uint64_t *keys, void *leaf), 848 + void *context) 849 + { 850 + int r; 851 + struct ro_spine spine; 852 + 853 + BUG_ON(info->levels > 1); 854 + 855 + init_ro_spine(&spine, info); 856 + r = walk_node(&spine, root, fn, context); 857 + exit_ro_spine(&spine); 858 + 859 + return r; 860 + } 861 + EXPORT_SYMBOL_GPL(dm_btree_walk);
+12 -3
drivers/md/persistent-data/dm-btree.h
··· 58 58 * somewhere.) This method is _not_ called for insertion of a new
59 59 * value: It is assumed the ref count is already 1.
60 60 */
61 - void (*inc)(void *context, void *value);
61 + void (*inc)(void *context, const void *value);
62 62
63 63 /*
64 64 * This value is being deleted. The btree takes care of freeing
65 65 * the memory pointed to by @value. Often the del function just
66 66 * needs to decrement a reference count somewhere.
67 67 */
68 - void (*dec)(void *context, void *value);
68 + void (*dec)(void *context, const void *value);
69 69
70 70 /*
71 71 * A test for equality between two values. When a value is
72 72 * overwritten with a new one, the old one has the dec method
73 73 * called _unless_ the new and old value are deemed equal.
74 74 */
75 - int (*equal)(void *context, void *value1, void *value2);
75 + int (*equal)(void *context, const void *value1, const void *value2);
76 76 };
77 77
78 78 /*
··· 141 141 */
142 142 int dm_btree_find_highest_key(struct dm_btree_info *info, dm_block_t root,
143 143 uint64_t *result_keys);
144 +
145 + /*
146 + * Iterate through a btree, calling fn() on each entry.
147 + * It only works for single-level trees and is internally recursive, so
148 + * monitor stack usage carefully.
149 + */
150 + int dm_btree_walk(struct dm_btree_info *info, dm_block_t root,
151 + int (*fn)(void *context, uint64_t *keys, void *leaf),
152 + void *context);
144 153
145 154 #endif /* _LINUX_DM_BTREE_H */
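The callback contract of dm_btree_walk() — fn() is invoked once per leaf entry, and a non-zero return aborts the whole walk — can be modelled in a few lines of userspace C. The node layout here is a toy (fixed fan-out, in-memory child pointers instead of block numbers, no spine locking); only the recursion and early-exit behaviour correspond to walk_node() in dm-btree.c.

```c
#include <stdint.h>

struct leaf_entry {
	uint64_t key;
	uint64_t value;
};

/* Toy node: internal nodes use 'children', leaves use 'entries'. */
struct node {
	int is_internal;
	unsigned nr;
	struct node *children[4];
	struct leaf_entry entries[4];
};

/* Recursive walk in the style of walk_node(): visit every leaf entry,
 * propagating the first non-zero callback return up the stack. */
static int walk(struct node *n,
		int (*fn)(void *context, uint64_t *keys, void *leaf),
		void *context)
{
	unsigned i;
	int r;

	for (i = 0; i < n->nr; i++) {
		if (n->is_internal)
			r = walk(n->children[i], fn, context);
		else
			r = fn(context, &n->entries[i].key,
			       &n->entries[i].value);
		if (r)
			return r;	/* abort the walk early */
	}

	return 0;
}

/* Example callback: accumulate all leaf values into *context. */
static int sum_fn(void *context, uint64_t *keys, void *leaf)
{
	(void)keys;
	*(uint64_t *)context += *(uint64_t *)leaf;
	return 0;
}
```

As the header comment warns, the recursion depth tracks the tree height, which is why the kernel version is flagged with a FIXME about stack usage.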
+32 -17
include/linux/device-mapper.h
··· 68 68 typedef int (*dm_preresume_fn) (struct dm_target *ti);
69 69 typedef void (*dm_resume_fn) (struct dm_target *ti);
70 70
71 - typedef int (*dm_status_fn) (struct dm_target *ti, status_type_t status_type,
72 - unsigned status_flags, char *result, unsigned maxlen);
71 + typedef void (*dm_status_fn) (struct dm_target *ti, status_type_t status_type,
72 + unsigned status_flags, char *result, unsigned maxlen);
73 73
74 74 typedef int (*dm_message_fn) (struct dm_target *ti, unsigned argc, char **argv);
75 75
··· 175 175 #define DM_TARGET_IMMUTABLE 0x00000004
176 176 #define dm_target_is_immutable(type) ((type)->features & DM_TARGET_IMMUTABLE)
177 177
178 + /*
179 + * Some targets need to be sent the same WRITE bio several times so
180 + * that they can send copies of it to different devices. This function
181 + * examines any supplied bio and returns the number of copies of it the
182 + * target requires.
183 + */
184 + typedef unsigned (*dm_num_write_bios_fn) (struct dm_target *ti, struct bio *bio);
185 +
178 186 struct dm_target {
179 187 struct dm_table *table;
180 188 struct target_type *type;
··· 195 187 uint32_t max_io_len;
196 188
197 189 /*
198 - * A number of zero-length barrier requests that will be submitted
190 + * A number of zero-length barrier bios that will be submitted
199 191 * to the target for the purpose of flushing cache.
200 192 *
201 - * The request number can be accessed with dm_bio_get_target_request_nr.
202 - * It is a responsibility of the target driver to remap these requests
193 + * The bio number can be accessed with dm_bio_get_target_bio_nr.
194 + * It is the responsibility of the target driver to remap these bios
203 195 * to the real underlying devices.
204 196 */
205 - unsigned num_flush_requests;
197 + unsigned num_flush_bios;
206 198
207 199 /*
208 - * The number of discard requests that will be submitted to the target.
209 - * The request number can be accessed with dm_bio_get_target_request_nr.
200 + * The number of discard bios that will be submitted to the target. 201 + * The bio number can be accessed with dm_bio_get_target_bio_nr. 210 202 */ 211 - unsigned num_discard_requests; 203 + unsigned num_discard_bios; 212 204 213 205 /* 214 - * The number of WRITE SAME requests that will be submitted to the target. 215 - * The request number can be accessed with dm_bio_get_target_request_nr. 206 + * The number of WRITE SAME bios that will be submitted to the target. 207 + * The bio number can be accessed with dm_bio_get_target_bio_nr. 216 208 */ 217 - unsigned num_write_same_requests; 209 + unsigned num_write_same_bios; 218 210 219 211 /* 220 212 * The minimum number of extra bytes allocated in each bio for the 221 213 * target to use. dm_per_bio_data returns the data location. 222 214 */ 223 215 unsigned per_bio_data_size; 216 + 217 + /* 218 + * If defined, this function is called to find out how many 219 + * duplicate bios should be sent to the target when writing 220 + * data. 221 + */ 222 + dm_num_write_bios_fn num_write_bios; 224 223 225 224 /* target specific data */ 226 225 void *private; ··· 248 233 bool discards_supported:1; 249 234 250 235 /* 251 - * Set if the target required discard request to be split 236 + * Set if the target required discard bios to be split 252 237 * on max_io_len boundary. 253 238 */ 254 - bool split_discard_requests:1; 239 + bool split_discard_bios:1; 255 240 256 241 /* 257 242 * Set if this target does not return zeroes on discarded blocks. 
··· 276 261 struct dm_io *io; 277 262 struct dm_target *ti; 278 263 union map_info info; 279 - unsigned target_request_nr; 264 + unsigned target_bio_nr; 280 265 struct bio clone; 281 266 }; 282 267 ··· 290 275 return (struct bio *)((char *)data + data_size + offsetof(struct dm_target_io, clone)); 291 276 } 292 277 293 - static inline unsigned dm_bio_get_target_request_nr(const struct bio *bio) 278 + static inline unsigned dm_bio_get_target_bio_nr(const struct bio *bio) 294 279 { 295 - return container_of(bio, struct dm_target_io, clone)->target_request_nr; 280 + return container_of(bio, struct dm_target_io, clone)->target_bio_nr; 296 281 } 297 282 298 283 int dm_register_target(struct target_type *t);
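dm_bio_get_target_bio_nr() and dm_per_bio_data() both exploit the fact that the struct bio handed out by the core is embedded at a known offset inside a containing structure, so container_of()/offsetof() arithmetic can step from the bio back to the surrounding dm_target_io. A self-contained userspace sketch of that trick, with toy struct definitions rather than the real kernel ones:

```c
#include <stddef.h>

/* The kernel's container_of(): step back from a pointer to a member to
 * a pointer to the structure that contains it. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Toy stand-ins: 'struct target_io' embeds the bio it hands out, just
 * as struct dm_target_io embeds 'struct bio clone'. */
struct bio {
	int bi_sector;
};

struct target_io {
	unsigned target_bio_nr;		/* which duplicate bio this is */
	struct bio clone;		/* embedded; lower layers only see this */
};

/* Analogue of dm_bio_get_target_bio_nr(): recover the container from
 * the embedded bio and read a field out of it. */
static unsigned get_target_bio_nr(struct bio *bio)
{
	return container_of(bio, struct target_io, clone)->target_bio_nr;
}
```

This is also why per-bio data works without any lookup table: the caller is handed a pointer into the middle of an allocation whose layout both sides agree on, and constant-offset arithmetic recovers the rest.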
+24 -1
include/linux/dm-kcopyd.h
··· 21 21 22 22 #define DM_KCOPYD_IGNORE_ERROR 1 23 23 24 + struct dm_kcopyd_throttle { 25 + unsigned throttle; 26 + unsigned num_io_jobs; 27 + unsigned io_period; 28 + unsigned total_period; 29 + unsigned last_jiffies; 30 + }; 31 + 32 + /* 33 + * kcopyd clients that want to support throttling must pass an initialised 34 + * dm_kcopyd_throttle struct into dm_kcopyd_client_create(). 35 + * Two or more clients may share the same instance of this struct between 36 + * them if they wish to be throttled as a group. 37 + * 38 + * This macro also creates a corresponding module parameter to configure 39 + * the amount of throttling. 40 + */ 41 + #define DECLARE_DM_KCOPYD_THROTTLE_WITH_MODULE_PARM(name, description) \ 42 + static struct dm_kcopyd_throttle dm_kcopyd_throttle = { 100, 0, 0, 0, 0 }; \ 43 + module_param_named(name, dm_kcopyd_throttle.throttle, uint, 0644); \ 44 + MODULE_PARM_DESC(name, description) 45 + 24 46 /* 25 47 * To use kcopyd you must first create a dm_kcopyd_client object. 48 + * throttle can be NULL if you don't want any throttling. 26 49 */ 27 50 struct dm_kcopyd_client; 28 - struct dm_kcopyd_client *dm_kcopyd_client_create(void); 51 + struct dm_kcopyd_client *dm_kcopyd_client_create(struct dm_kcopyd_throttle *throttle); 29 52 void dm_kcopyd_client_destroy(struct dm_kcopyd_client *kc); 30 53 31 54 /*
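The struct fields above suggest how the throttling decision works: io_period counts time spent doing I/O inside an observation window of total_period, and throttle caps the I/O share as a percentage, with 100 meaning unthrottled. A userspace sketch of that decision follows; the helper name and the exact comparison are assumptions for illustration, not the kernel's implementation.

```c
/* Userspace model of the kcopyd throttle accounting. All periods are
 * in arbitrary time units (jiffies in the kernel). */
struct throttle_state {
	unsigned throttle;	/* max % of time spent on I/O; 100 disables */
	unsigned io_period;	/* time spent doing I/O this window */
	unsigned total_period;	/* length of the whole window so far */
};

/* Should a new copy job back off? Yes, when I/O has occupied more than
 * 'throttle' percent of the observation window. */
static int io_job_should_delay(const struct throttle_state *t)
{
	if (t->throttle >= 100)
		return 0;		/* throttling disabled */
	if (!t->total_period)
		return 0;		/* nothing measured yet */
	return t->io_period * 100 > t->throttle * t->total_period;
}
```

Because the throttle struct can be shared between clients (as the comment above notes), grouping several kcopyd users on one instance makes them compete within a single percentage budget rather than each getting their own.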
+8 -3
include/uapi/linux/dm-ioctl.h
··· 267 267 #define DM_DEV_SET_GEOMETRY _IOWR(DM_IOCTL, DM_DEV_SET_GEOMETRY_CMD, struct dm_ioctl) 268 268 269 269 #define DM_VERSION_MAJOR 4 270 - #define DM_VERSION_MINOR 23 271 - #define DM_VERSION_PATCHLEVEL 1 272 - #define DM_VERSION_EXTRA "-ioctl (2012-12-18)" 270 + #define DM_VERSION_MINOR 24 271 + #define DM_VERSION_PATCHLEVEL 0 272 + #define DM_VERSION_EXTRA "-ioctl (2013-01-15)" 273 273 274 274 /* Status bits */ 275 275 #define DM_READONLY_FLAG (1 << 0) /* In/Out */ ··· 335 335 * or requesting sensitive data such as an encryption key. 336 336 */ 337 337 #define DM_SECURE_DATA_FLAG (1 << 15) /* In */ 338 + 339 + /* 340 + * If set, a message generated output data. 341 + */ 342 + #define DM_DATA_OUT_FLAG (1 << 16) /* Out */ 338 343 339 344 #endif /* _LINUX_DM_IOCTL_H */