Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'dm-3.9-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm

Pull device-mapper update from Alasdair G Kergon:
"The main addition here is a long-desired target framework to allow an
SSD to be used as a cache in front of a slower device. Cache tuning
is delegated to interchangeable policy modules so these can be
developed independently of the mechanics needed to shuffle the data
around.

Other than that, kcopyd users acquire a throttling parameter, ioctl
buffer usage gets streamlined, reliance on mempools is further reduced,
and there are a few other bug fixes and tidy-ups."

* tag 'dm-3.9-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm: (30 commits)
dm cache: add cleaner policy
dm cache: add mq policy
dm: add cache target
dm persistent data: add bitset
dm persistent data: add transactional array
dm thin: remove cells from stack
dm bio prison: pass cell memory in
dm persistent data: add btree_walk
dm: add target num_write_bios fn
dm kcopyd: introduce configurable throttling
dm ioctl: allow message to return data
dm ioctl: optimize functions without variable params
dm ioctl: introduce ioctl_flags
dm: merge io_pool and tio_pool
dm: remove unused _rq_bio_info_cache
dm: fix limits initialization when there are no data devices
dm snapshot: add missing module aliases
dm persistent data: set some btree fn parms const
dm: refactor bio cloning
dm: rename bio cloning functions
...

+8792 -602
+77
Documentation/device-mapper/cache-policies.txt
Guidance for writing policies
=============================

Try to keep transactionality out of it.  The core is careful to
avoid asking about anything that is migrating.  This is a pain, but
makes it easier to write the policies.

Mappings are loaded into the policy at construction time.

Every bio that is mapped by the target is referred to the policy.
The policy can return a simple HIT or MISS or issue a migration.

Currently there's no way for the policy to issue background work,
e.g. to start writing back dirty blocks that are going to be evicted
soon.

Because we map bios, rather than requests, it's easy for the policy
to get fooled by many small bios.  For this reason the core target
issues periodic ticks to the policy.  It's suggested that the policy
doesn't update states (eg, hit counts) for a block more than once
for each tick.  The core ticks by watching bios complete, and so
trying to see when the io scheduler has let the ios run.


Overview of supplied cache replacement policies
===============================================

multiqueue
----------

This policy is the default.

The multiqueue policy has two sets of 16 queues: one set for entries
waiting for the cache and another one for those in the cache.
Cache entries in the queues are aged based on logical time.  Entry into
the cache is based on variable thresholds and queue selection is based
on hit count on entry.  The policy aims to take different cache miss
costs into account and to adjust to varying load patterns automatically.

Message and constructor argument pairs are:
	'sequential_threshold <#nr_sequential_ios>' and
	'random_threshold <#nr_random_ios>'.

The sequential threshold indicates the number of contiguous I/Os
required before a stream is treated as sequential.  The random threshold
is the number of intervening non-contiguous I/Os that must be seen
before the stream is treated as random again.

The sequential and random thresholds default to 512 and 4 respectively.

Large, sequential ios are probably better left on the origin device
since spindles tend to have good bandwidth.  The io_tracker counts
contiguous I/Os to try to spot when the io is in one of these sequential
modes.

cleaner
-------

The cleaner writes back all dirty blocks in a cache to decommission it.

Examples
========

The syntax for a table is:
	cache <metadata dev> <cache dev> <origin dev> <block size>
	<#feature_args> [<feature arg>]*
	<policy> <#policy_args> [<policy arg>]*

The syntax to send a message using the dmsetup command is:
	dmsetup message <mapped device> 0 sequential_threshold 1024
	dmsetup message <mapped device> 0 random_threshold 8

Using dmsetup:
	dmsetup create blah --table "0 268435456 cache /dev/sdb /dev/sdc \
	    /dev/sdd 512 0 mq 4 sequential_threshold 1024 random_threshold 8"
	creates a 128GB large mapped device named 'blah' with the
	sequential threshold set to 1024 and the random_threshold set to 8.
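The sequential/random classification described above can be sketched as a small userspace model: count contiguous I/Os until the sequential threshold is reached, and fall back to random after enough non-contiguous I/Os. This is an illustration of the idea only, not the kernel's io_tracker implementation; all names here are made up.

```python
class IoTracker:
    """Toy model of the io_tracker idea: classify a bio stream as
    sequential or random based on contiguity thresholds."""

    def __init__(self, seq_threshold=512, rand_threshold=4):
        self.seq_threshold = seq_threshold
        self.rand_threshold = rand_threshold
        self.next_sector = None   # sector expected if the stream is contiguous
        self.nr_seq = 0           # contiguous I/Os seen so far
        self.nr_rand = 0          # non-contiguous I/Os seen since last run
        self.mode = "random"

    def note_io(self, sector, nr_sectors):
        if sector == self.next_sector:
            self.nr_seq += 1
            if self.nr_seq >= self.seq_threshold:
                self.mode = "sequential"
                self.nr_rand = 0
        else:
            self.nr_rand += 1
            self.nr_seq = 0
            if self.nr_rand >= self.rand_threshold:
                self.mode = "random"
        self.next_sector = sector + nr_sectors
        return self.mode

# A contiguous run of 8 KiB bios trips the (lowered) sequential threshold.
tracker = IoTracker(seq_threshold=4, rand_threshold=2)
for i in range(8):
    mode = tracker.note_io(i * 8, 8)
print(mode)   # sequential
```

With the default thresholds (512 and 4), a stream needs 512 contiguous bios to be treated as sequential, which is why many small contiguous bios (rather than one large request) still get classified correctly.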
+243
Documentation/device-mapper/cache.txt
Introduction
============

dm-cache is a device mapper target written by Joe Thornber, Heinz
Mauelshagen, and Mike Snitzer.

It aims to improve performance of a block device (eg, a spindle) by
dynamically migrating some of its data to a faster, smaller device
(eg, an SSD).

This device-mapper solution allows us to insert this caching at
different levels of the dm stack, for instance above the data device for
a thin-provisioning pool.  Caching solutions that are integrated more
closely with the virtual memory system should give better performance.

The target reuses the metadata library used in the thin-provisioning
library.

The decision as to what data to migrate and when is left to a plug-in
policy module.  Several of these have been written as we experiment,
and we hope other people will contribute others for specific io
scenarios (eg. a vm image server).

Glossary
========

  Migration -  Movement of the primary copy of a logical block from one
	       device to the other.
  Promotion -  Migration from slow device to fast device.
  Demotion  -  Migration from fast device to slow device.

The origin device always contains a copy of the logical block, which
may be out of date or kept in sync with the copy on the cache device
(depending on policy).

Design
======

Sub-devices
-----------

The target is constructed by passing three devices to it (along with
other parameters detailed later):

1. An origin device - the big, slow one.

2. A cache device - the small, fast one.

3. A small metadata device - records which blocks are in the cache,
   which are dirty, and extra hints for use by the policy object.
   This information could be put on the cache device, but having it
   separate allows the volume manager to configure it differently,
   e.g. as a mirror for extra robustness.

Fixed block size
----------------

The origin is divided up into blocks of a fixed size.  This block size
is configurable when you first create the cache.  Typically we've been
using block sizes of 256k - 1024k.

Having a fixed block size simplifies the target a lot.  But it is
something of a compromise.  For instance, a small part of a block may be
getting hit a lot, yet the whole block will be promoted to the cache.
So large block sizes are bad because they waste cache space.  And small
block sizes are bad because they increase the amount of metadata (both
in core and on disk).

Writeback/writethrough
----------------------

The cache has two modes, writeback and writethrough.

If writeback, the default, is selected then a write to a block that is
cached will go only to the cache and the block will be marked dirty in
the metadata.

If writethrough is selected then a write to a cached block will not
complete until it has hit both the origin and cache devices.  Clean
blocks should remain clean.

A simple cleaner policy is provided, which will clean (write back) all
dirty blocks in a cache.  Useful for decommissioning a cache.

Migration throttling
--------------------

Migrating data between the origin and cache device uses bandwidth.
The user can set a throttle to prevent more than a certain amount of
migration occurring at any one time.  Currently we're not taking any
account of normal io traffic going to the devices.  More work needs
doing here to avoid migrating during those peak io moments.

For the time being, a message "migration_threshold <#sectors>"
can be used to set the maximum number of sectors being migrated,
the default being 204800 sectors (or 100MB).

Updating on-disk metadata
-------------------------

On-disk metadata is committed every time a REQ_SYNC or REQ_FUA bio is
written.  If no such requests are made then commits will occur every
second.  This means the cache behaves like a physical disk that has a
write cache (the same is true of the thin-provisioning target).  If
power is lost you may lose some recent writes.  The metadata should
always be consistent in spite of any crash.

The 'dirty' state for a cache block changes far too frequently for us
to keep updating it on the fly.  So we treat it as a hint.  In normal
operation it will be written when the dm device is suspended.  If the
system crashes all cache blocks will be assumed dirty when restarted.

Per-block policy hints
----------------------

Policy plug-ins can store a chunk of data per cache block.  It's up to
the policy how big this chunk is, but it should be kept small.  Like the
dirty flags this data is lost if there's a crash so a safe fallback
value should always be possible.

For instance, the 'mq' policy, which is currently the default policy,
uses this facility to store the hit count of the cache blocks.  If
there's a crash this information will be lost, which means the cache
may be less efficient until those hit counts are regenerated.

Policy hints affect performance, not correctness.

Policy messaging
----------------

Policies will have different tunables, specific to each one, so we
need a generic way of getting and setting these.  Device-mapper
messages are used.  Refer to cache-policies.txt.

Discard bitset resolution
-------------------------

We can avoid copying data during migration if we know the block has
been discarded.  A prime example of this is when mkfs discards the
whole block device.  We store a bitset tracking the discard state of
blocks.  However, we allow this bitset to have a different block size
from the cache blocks.  This is because we need to track the discard
state for all of the origin device (compare with the dirty bitset
which is just for the smaller cache device).

Target interface
================

Constructor
-----------

 cache <metadata dev> <cache dev> <origin dev> <block size>
       <#feature args> [<feature arg>]*
       <policy> <#policy args> [policy args]*

 metadata dev    : fast device holding the persistent metadata
 cache dev       : fast device holding cached data blocks
 origin dev      : slow device holding original data blocks
 block size      : cache unit size in sectors

 #feature args   : number of feature arguments passed
 feature args    : writethrough.  (The default is writeback.)

 policy          : the replacement policy to use
 #policy args    : an even number of arguments corresponding to
                   key/value pairs passed to the policy
 policy args     : key/value pairs passed to the policy
                   E.g. 'sequential_threshold 1024'
                   See cache-policies.txt for details.

Optional feature arguments are:
   writethrough  : write through caching that prohibits cache block
                   content from being different from origin block content.
                   Without this argument, the default behaviour is to write
                   back cache block contents later for performance reasons,
                   so they may differ from the corresponding origin blocks.

A policy called 'default' is always registered.  This is an alias for
the policy we currently think is giving best all round performance.

As the default policy could vary between kernels, if you are relying on
the characteristics of a specific policy, always request it by name.

Status
------

<#used metadata blocks>/<#total metadata blocks> <#read hits> <#read misses>
<#write hits> <#write misses> <#demotions> <#promotions> <#blocks in cache>
<#dirty> <#features> <features>* <#core args> <core args>* <#policy args>
<policy args>*

#used metadata blocks    : Number of metadata blocks used
#total metadata blocks   : Total number of metadata blocks
#read hits               : Number of times a READ bio has been mapped
                           to the cache
#read misses             : Number of times a READ bio has been mapped
                           to the origin
#write hits              : Number of times a WRITE bio has been mapped
                           to the cache
#write misses            : Number of times a WRITE bio has been
                           mapped to the origin
#demotions               : Number of times a block has been removed
                           from the cache
#promotions              : Number of times a block has been moved to
                           the cache
#blocks in cache         : Number of blocks resident in the cache
#dirty                   : Number of blocks in the cache that differ
                           from the origin
#feature args            : Number of feature args to follow
feature args             : 'writethrough' (optional)
#core args               : Number of core arguments (must be even)
core args                : Key/value pairs for tuning the core
                           e.g. migration_threshold
#policy args             : Number of policy arguments to follow (must be even)
policy args              : Key/value pairs
                           e.g. 'sequential_threshold 1024'

Messages
--------

Policies will have different tunables, specific to each one, so we
need a generic way of getting and setting these.  Device-mapper
messages are used.  (A sysfs interface would also be possible.)

The message format is:

   <key> <value>

E.g.
   dmsetup message my_cache 0 sequential_threshold 1024

Examples
========

The test suite can be found here:

https://github.com/jthornber/thinp-test-suite

dmsetup create my_cache --table '0 41943040 cache /dev/mapper/metadata \
	/dev/mapper/ssd /dev/mapper/origin 512 1 writeback default 0'
dmsetup create my_cache --table '0 41943040 cache /dev/mapper/metadata \
	/dev/mapper/ssd /dev/mapper/origin 1024 1 writeback \
	mq 4 sequential_threshold 1024 random_threshold 8'
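The status line above has a fixed field order followed by three counted, variable-length argument lists. A small parser makes the layout concrete; the sample line below is fabricated for illustration (real values come from `dmsetup status <mapped device>`), and the function name is ours, not part of any tool.

```python
def parse_cache_status(line):
    """Parse a dm-cache status line per the documented field order:
    fixed counters first, then counted feature/core/policy arg lists."""
    f = line.split()
    used, total = f[0].split("/")
    stat = {
        "metadata_used": int(used),
        "metadata_total": int(total),
        "read_hits": int(f[1]),
        "read_misses": int(f[2]),
        "write_hits": int(f[3]),
        "write_misses": int(f[4]),
        "demotions": int(f[5]),
        "promotions": int(f[6]),
        "blocks_in_cache": int(f[7]),
        "dirty": int(f[8]),
    }
    # Each list is prefixed by its element count, so parse count-then-slice.
    nr_features = int(f[9])
    stat["features"] = f[10:10 + nr_features]
    rest = f[10 + nr_features:]
    nr_core = int(rest[0])
    stat["core_args"] = rest[1:1 + nr_core]
    rest = rest[1 + nr_core:]
    nr_policy = int(rest[0])
    stat["policy_args"] = rest[1:1 + nr_policy]
    return stat

sample = ("23/4096 1000 80 500 20 4 12 512 3 "
          "1 writethrough 2 migration_threshold 204800 "
          "2 sequential_threshold 1024")
s = parse_cache_status(sample)
print(s["read_hits"], s["policy_args"])   # 1000 ['sequential_threshold', '1024']
```

The count-prefixed lists are why core args and policy args "must be even": they are key/value pairs, and the counts let a parser skip lists it does not understand.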
+43 -12
drivers/md/Kconfig
···
 config DM_BUFIO
 	tristate
-	depends on BLK_DEV_DM && EXPERIMENTAL
+	depends on BLK_DEV_DM
 	---help---
 	 This interface allows you to do buffered I/O on a device and acts
 	 as a cache, holding recently-read blocks in memory and performing
···
 config DM_BIO_PRISON
 	tristate
-	depends on BLK_DEV_DM && EXPERIMENTAL
+	depends on BLK_DEV_DM
 	---help---
 	 Some bio locking schemes used by other device-mapper targets
 	 including thin provisioning.
···
 	 Allow volume managers to take writable snapshots of a device.

 config DM_THIN_PROVISIONING
-	tristate "Thin provisioning target (EXPERIMENTAL)"
-	depends on BLK_DEV_DM && EXPERIMENTAL
+	tristate "Thin provisioning target"
+	depends on BLK_DEV_DM
 	select DM_PERSISTENT_DATA
 	select DM_BIO_PRISON
 	---help---
···
 	 block manager locking used by thin provisioning.

 	 If unsure, say N.
+
+config DM_CACHE
+       tristate "Cache target (EXPERIMENTAL)"
+       depends on BLK_DEV_DM
+       default n
+       select DM_PERSISTENT_DATA
+       select DM_BIO_PRISON
+       ---help---
+         dm-cache attempts to improve performance of a block device by
+         moving frequently used data to a smaller, higher performance
+         device.  Different 'policy' plugins can be used to change the
+         algorithms used to select which blocks are promoted, demoted,
+         cleaned etc.  It supports writeback and writethrough modes.
+
+config DM_CACHE_MQ
+       tristate "MQ Cache Policy (EXPERIMENTAL)"
+       depends on DM_CACHE
+       default y
+       ---help---
+         A cache policy that uses a multiqueue ordered by recent hit
+         count to select which blocks should be promoted and demoted.
+         This is meant to be a general purpose policy.  It prioritises
+         reads over writes.
+
+config DM_CACHE_CLEANER
+       tristate "Cleaner Cache Policy (EXPERIMENTAL)"
+       depends on DM_CACHE
+       default y
+       ---help---
+         A simple cache policy that writes back all data to the
+         origin.  Used when decommissioning a dm-cache.

 config DM_MIRROR
 	tristate "Mirror target"
···
 	 in one of the available parity distribution methods.

 config DM_LOG_USERSPACE
-	tristate "Mirror userspace logging (EXPERIMENTAL)"
-	depends on DM_MIRROR && EXPERIMENTAL && NET
+	tristate "Mirror userspace logging"
+	depends on DM_MIRROR && NET
 	select CONNECTOR
 	---help---
 	 The userspace logging module provides a mechanism for
···
 	 If unsure, say N.

 config DM_DELAY
-	tristate "I/O delaying target (EXPERIMENTAL)"
-	depends on BLK_DEV_DM && EXPERIMENTAL
+	tristate "I/O delaying target"
+	depends on BLK_DEV_DM
 	---help---
 	 A target that delays reads and/or writes and can send
 	 them to different devices.  Useful for testing.
···
 	 Generate udev events for DM events.

 config DM_FLAKEY
-	tristate "Flakey target (EXPERIMENTAL)"
-	depends on BLK_DEV_DM && EXPERIMENTAL
+	tristate "Flakey target"
+	depends on BLK_DEV_DM
 	---help---
 	 A target that intermittently fails I/O for debugging purposes.

 config DM_VERITY
-	tristate "Verity target support (EXPERIMENTAL)"
-	depends on BLK_DEV_DM && EXPERIMENTAL
+	tristate "Verity target support"
+	depends on BLK_DEV_DM
 	select CRYPTO
 	select CRYPTO_HASH
 	select DM_BUFIO
+6
drivers/md/Makefile
···
 dm-log-userspace-y \
 		+= dm-log-userspace-base.o dm-log-userspace-transfer.o
 dm-thin-pool-y	+= dm-thin.o dm-thin-metadata.o
+dm-cache-y	+= dm-cache-target.o dm-cache-metadata.o dm-cache-policy.o
+dm-cache-mq-y	+= dm-cache-policy-mq.o
+dm-cache-cleaner-y += dm-cache-policy-cleaner.o
 md-mod-y	+= md.o bitmap.o
 raid456-y	+= raid5.o
···
 obj-$(CONFIG_DM_RAID)			+= dm-raid.o
 obj-$(CONFIG_DM_THIN_PROVISIONING)	+= dm-thin-pool.o
 obj-$(CONFIG_DM_VERITY)			+= dm-verity.o
+obj-$(CONFIG_DM_CACHE)			+= dm-cache.o
+obj-$(CONFIG_DM_CACHE_MQ)		+= dm-cache-mq.o
+obj-$(CONFIG_DM_CACHE_CLEANER)		+= dm-cache-cleaner.o

 ifeq ($(CONFIG_DM_UEVENT),y)
 dm-mod-objs			+= dm-uevent.o
+84 -77
drivers/md/dm-bio-prison.c
···
 /*----------------------------------------------------------------*/

-struct dm_bio_prison_cell {
-	struct hlist_node list;
-	struct dm_bio_prison *prison;
-	struct dm_cell_key key;
-	struct bio *holder;
-	struct bio_list bios;
-};
-
 struct dm_bio_prison {
 	spinlock_t lock;
 	mempool_t *cell_pool;
···
 }
 EXPORT_SYMBOL_GPL(dm_bio_prison_destroy);

+struct dm_bio_prison_cell *dm_bio_prison_alloc_cell(struct dm_bio_prison *prison, gfp_t gfp)
+{
+	return mempool_alloc(prison->cell_pool, gfp);
+}
+EXPORT_SYMBOL_GPL(dm_bio_prison_alloc_cell);
+
+void dm_bio_prison_free_cell(struct dm_bio_prison *prison,
+			     struct dm_bio_prison_cell *cell)
+{
+	mempool_free(cell, prison->cell_pool);
+}
+EXPORT_SYMBOL_GPL(dm_bio_prison_free_cell);
+
 static uint32_t hash_key(struct dm_bio_prison *prison, struct dm_cell_key *key)
 {
 	const unsigned long BIG_PRIME = 4294967291UL;
···
 	return NULL;
 }

-/*
- * This may block if a new cell needs allocating.  You must ensure that
- * cells will be unlocked even if the calling thread is blocked.
- *
- * Returns 1 if the cell was already held, 0 if @inmate is the new holder.
- */
-int dm_bio_detain(struct dm_bio_prison *prison, struct dm_cell_key *key,
-		  struct bio *inmate, struct dm_bio_prison_cell **ref)
-{
-	int r = 1;
-	unsigned long flags;
-	uint32_t hash = hash_key(prison, key);
-	struct dm_bio_prison_cell *cell, *cell2;
-
-	BUG_ON(hash > prison->nr_buckets);
-
-	spin_lock_irqsave(&prison->lock, flags);
-
-	cell = __search_bucket(prison->cells + hash, key);
-	if (cell) {
-		bio_list_add(&cell->bios, inmate);
-		goto out;
-	}
-
-	/*
-	 * Allocate a new cell
-	 */
-	spin_unlock_irqrestore(&prison->lock, flags);
-	cell2 = mempool_alloc(prison->cell_pool, GFP_NOIO);
-	spin_lock_irqsave(&prison->lock, flags);
-
-	/*
-	 * We've been unlocked, so we have to double check that
-	 * nobody else has inserted this cell in the meantime.
-	 */
-	cell = __search_bucket(prison->cells + hash, key);
-	if (cell) {
-		mempool_free(cell2, prison->cell_pool);
-		bio_list_add(&cell->bios, inmate);
-		goto out;
-	}
-
-	/*
-	 * Use new cell.
-	 */
-	cell = cell2;
-
-	cell->prison = prison;
-	memcpy(&cell->key, key, sizeof(cell->key));
-	cell->holder = inmate;
-	bio_list_init(&cell->bios);
-	hlist_add_head(&cell->list, prison->cells + hash);
-
-	r = 0;
-
-out:
-	spin_unlock_irqrestore(&prison->lock, flags);
-
-	*ref = cell;
-
-	return r;
-}
+static void __setup_new_cell(struct dm_bio_prison *prison,
+			     struct dm_cell_key *key,
+			     struct bio *holder,
+			     uint32_t hash,
+			     struct dm_bio_prison_cell *cell)
+{
+	memcpy(&cell->key, key, sizeof(cell->key));
+	cell->holder = holder;
+	bio_list_init(&cell->bios);
+	hlist_add_head(&cell->list, prison->cells + hash);
+}
+
+static int __bio_detain(struct dm_bio_prison *prison,
+			struct dm_cell_key *key,
+			struct bio *inmate,
+			struct dm_bio_prison_cell *cell_prealloc,
+			struct dm_bio_prison_cell **cell_result)
+{
+	uint32_t hash = hash_key(prison, key);
+	struct dm_bio_prison_cell *cell;
+
+	cell = __search_bucket(prison->cells + hash, key);
+	if (cell) {
+		if (inmate)
+			bio_list_add(&cell->bios, inmate);
+		*cell_result = cell;
+		return 1;
+	}
+
+	__setup_new_cell(prison, key, inmate, hash, cell_prealloc);
+	*cell_result = cell_prealloc;
+	return 0;
+}
+
+static int bio_detain(struct dm_bio_prison *prison,
+		      struct dm_cell_key *key,
+		      struct bio *inmate,
+		      struct dm_bio_prison_cell *cell_prealloc,
+		      struct dm_bio_prison_cell **cell_result)
+{
+	int r;
+	unsigned long flags;
+
+	spin_lock_irqsave(&prison->lock, flags);
+	r = __bio_detain(prison, key, inmate, cell_prealloc, cell_result);
+	spin_unlock_irqrestore(&prison->lock, flags);
+
+	return r;
+}
+
+int dm_bio_detain(struct dm_bio_prison *prison,
+		  struct dm_cell_key *key,
+		  struct bio *inmate,
+		  struct dm_bio_prison_cell *cell_prealloc,
+		  struct dm_bio_prison_cell **cell_result)
+{
+	return bio_detain(prison, key, inmate, cell_prealloc, cell_result);
+}
 EXPORT_SYMBOL_GPL(dm_bio_detain);
+
+int dm_get_cell(struct dm_bio_prison *prison,
+		struct dm_cell_key *key,
+		struct dm_bio_prison_cell *cell_prealloc,
+		struct dm_bio_prison_cell **cell_result)
+{
+	return bio_detain(prison, key, NULL, cell_prealloc, cell_result);
+}
+EXPORT_SYMBOL_GPL(dm_get_cell);

 /*
  * @inmates must have been initialised prior to this call
  */
-static void __cell_release(struct dm_bio_prison_cell *cell, struct bio_list *inmates)
+static void __cell_release(struct dm_bio_prison_cell *cell,
+			   struct bio_list *inmates)
 {
-	struct dm_bio_prison *prison = cell->prison;
-
 	hlist_del(&cell->list);

 	if (inmates) {
-		bio_list_add(inmates, cell->holder);
+		if (cell->holder)
+			bio_list_add(inmates, cell->holder);
 		bio_list_merge(inmates, &cell->bios);
 	}
-
-	mempool_free(cell, prison->cell_pool);
 }

-void dm_cell_release(struct dm_bio_prison_cell *cell, struct bio_list *bios)
+void dm_cell_release(struct dm_bio_prison *prison,
+		     struct dm_bio_prison_cell *cell,
+		     struct bio_list *bios)
 {
 	unsigned long flags;
-	struct dm_bio_prison *prison = cell->prison;

 	spin_lock_irqsave(&prison->lock, flags);
 	__cell_release(cell, bios);
···
 /*
  * Sometimes we don't want the holder, just the additional bios.
  */
-static void __cell_release_no_holder(struct dm_bio_prison_cell *cell, struct bio_list *inmates)
+static void __cell_release_no_holder(struct dm_bio_prison_cell *cell,
+				     struct bio_list *inmates)
 {
-	struct dm_bio_prison *prison = cell->prison;
-
 	hlist_del(&cell->list);
 	bio_list_merge(inmates, &cell->bios);
-
-	mempool_free(cell, prison->cell_pool);
 }

-void dm_cell_release_no_holder(struct dm_bio_prison_cell *cell, struct bio_list *inmates)
+void dm_cell_release_no_holder(struct dm_bio_prison *prison,
+			       struct dm_bio_prison_cell *cell,
+			       struct bio_list *inmates)
 {
 	unsigned long flags;
-	struct dm_bio_prison *prison = cell->prison;

 	spin_lock_irqsave(&prison->lock, flags);
 	__cell_release_no_holder(cell, inmates);
···
 }
 EXPORT_SYMBOL_GPL(dm_cell_release_no_holder);

-void dm_cell_error(struct dm_bio_prison_cell *cell)
+void dm_cell_error(struct dm_bio_prison *prison,
+		   struct dm_bio_prison_cell *cell)
 {
-	struct dm_bio_prison *prison = cell->prison;
 	struct bio_list bios;
 	struct bio *bio;
 	unsigned long flags;
+48 -8
drivers/md/dm-bio-prison.h
···
  * subsequently unlocked the bios become available.
  */
 struct dm_bio_prison;
-struct dm_bio_prison_cell;

 /* FIXME: this needs to be more abstract */
 struct dm_cell_key {
···
 	dm_block_t block;
 };

+/*
+ * Treat this as opaque, only in header so callers can manage allocation
+ * themselves.
+ */
+struct dm_bio_prison_cell {
+	struct hlist_node list;
+	struct dm_cell_key key;
+	struct bio *holder;
+	struct bio_list bios;
+};
+
 struct dm_bio_prison *dm_bio_prison_create(unsigned nr_cells);
 void dm_bio_prison_destroy(struct dm_bio_prison *prison);

 /*
- * This may block if a new cell needs allocating.  You must ensure that
- * cells will be unlocked even if the calling thread is blocked.
+ * These two functions just wrap a mempool.  This is a transitory step:
+ * Eventually all bio prison clients should manage their own cell memory.
+ *
+ * Like mempool_alloc(), dm_bio_prison_alloc_cell() can only fail if called
+ * in interrupt context or passed GFP_NOWAIT.
+ */
+struct dm_bio_prison_cell *dm_bio_prison_alloc_cell(struct dm_bio_prison *prison,
+						    gfp_t gfp);
+void dm_bio_prison_free_cell(struct dm_bio_prison *prison,
+			     struct dm_bio_prison_cell *cell);
+
+/*
+ * Creates, or retrieves a cell for the given key.
+ *
+ * Returns 1 if pre-existing cell returned, zero if new cell created using
+ * @cell_prealloc.
+ */
+int dm_get_cell(struct dm_bio_prison *prison,
+		struct dm_cell_key *key,
+		struct dm_bio_prison_cell *cell_prealloc,
+		struct dm_bio_prison_cell **cell_result);
+
+/*
+ * An atomic op that combines retrieving a cell, and adding a bio to it.
  *
  * Returns 1 if the cell was already held, 0 if @inmate is the new holder.
  */
-int dm_bio_detain(struct dm_bio_prison *prison, struct dm_cell_key *key,
-		  struct bio *inmate, struct dm_bio_prison_cell **ref);
+int dm_bio_detain(struct dm_bio_prison *prison,
+		  struct dm_cell_key *key,
+		  struct bio *inmate,
+		  struct dm_bio_prison_cell *cell_prealloc,
+		  struct dm_bio_prison_cell **cell_result);

-void dm_cell_release(struct dm_bio_prison_cell *cell, struct bio_list *bios);
-void dm_cell_release_no_holder(struct dm_bio_prison_cell *cell, struct bio_list *inmates);
-void dm_cell_error(struct dm_bio_prison_cell *cell);
+void dm_cell_release(struct dm_bio_prison *prison,
+		     struct dm_bio_prison_cell *cell,
+		     struct bio_list *bios);
+void dm_cell_release_no_holder(struct dm_bio_prison *prison,
+			       struct dm_bio_prison_cell *cell,
+			       struct bio_list *inmates);
+void dm_cell_error(struct dm_bio_prison *prison,
+		   struct dm_bio_prison_cell *cell);

 /*----------------------------------------------------------------*/
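The revised calling convention in this header moves cell allocation to the caller: a cell is preallocated outside the lock, and detain either consumes the preallocation (new cell, returns 0) or queues the bio on the existing cell (returns 1). A simplified userspace model of that convention, with no locking and a plain dict in place of the kernel's hash buckets (all names and structure here are illustrative, not the kernel code):

```python
class Cell:
    """Models dm_bio_prison_cell: a key, a holder bio, and queued bios."""
    def __init__(self):
        self.key = None
        self.holder = None
        self.bios = []

class BioPrison:
    def __init__(self):
        self.cells = {}                    # key -> Cell (stand-in for buckets)

    def alloc_cell(self):
        return Cell()                      # models dm_bio_prison_alloc_cell()

    def detain(self, key, inmate, prealloc):
        """Returns (held, cell): held is 1 if a cell for key already existed
        (the preallocation is then unused and stays owned by the caller)."""
        cell = self.cells.get(key)
        if cell is not None:
            if inmate is not None:         # dm_get_cell() passes no inmate
                cell.bios.append(inmate)   # queue behind the current holder
            return 1, cell
        prealloc.key = key                 # models __setup_new_cell()
        prealloc.holder = inmate
        self.cells[key] = prealloc
        return 0, prealloc

prison = BioPrison()
held, cell = prison.detain(("dev", 42), "bio-A", prison.alloc_cell())
print(held)    # 0: bio-A is the new holder
held, same = prison.detain(("dev", 42), "bio-B", prison.alloc_cell())
print(held)    # 1: bio-B queued behind bio-A
```

The design point this models: by preallocating outside the critical section, the kernel code no longer has to drop and retake the spinlock around `mempool_alloc()` and re-check the bucket, which is exactly the dance the old `dm_bio_detain()` removed in this patch performed.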
+1 -1
drivers/md/dm-bufio.c
···
 int dm_bufio_issue_flush(struct dm_bufio_client *c)
 {
 	struct dm_io_request io_req = {
-		.bi_rw = REQ_FLUSH,
+		.bi_rw = WRITE_FLUSH,
 		.mem.type = DM_IO_KMEM,
 		.mem.ptr.addr = NULL,
 		.client = c->dm_io,
+54
drivers/md/dm-cache-block-types.h
/*
 * Copyright (C) 2012 Red Hat, Inc.
 *
 * This file is released under the GPL.
 */

#ifndef DM_CACHE_BLOCK_TYPES_H
#define DM_CACHE_BLOCK_TYPES_H

#include "persistent-data/dm-block-manager.h"

/*----------------------------------------------------------------*/

/*
 * It's helpful to get sparse to differentiate between indexes into the
 * origin device, indexes into the cache device, and indexes into the
 * discard bitset.
 */

typedef dm_block_t __bitwise__ dm_oblock_t;
typedef uint32_t __bitwise__ dm_cblock_t;
typedef dm_block_t __bitwise__ dm_dblock_t;

static inline dm_oblock_t to_oblock(dm_block_t b)
{
	return (__force dm_oblock_t) b;
}

static inline dm_block_t from_oblock(dm_oblock_t b)
{
	return (__force dm_block_t) b;
}

static inline dm_cblock_t to_cblock(uint32_t b)
{
	return (__force dm_cblock_t) b;
}

static inline uint32_t from_cblock(dm_cblock_t b)
{
	return (__force uint32_t) b;
}

static inline dm_dblock_t to_dblock(dm_block_t b)
{
	return (__force dm_dblock_t) b;
}

static inline dm_block_t from_dblock(dm_dblock_t b)
{
	return (__force dm_block_t) b;
}

#endif /* DM_CACHE_BLOCK_TYPES_H */
+1146
drivers/md/dm-cache-metadata.c
/*
 * Copyright (C) 2012 Red Hat, Inc.
 *
 * This file is released under the GPL.
 */

#include "dm-cache-metadata.h"

#include "persistent-data/dm-array.h"
#include "persistent-data/dm-bitset.h"
#include "persistent-data/dm-space-map.h"
#include "persistent-data/dm-space-map-disk.h"
#include "persistent-data/dm-transaction-manager.h"

#include <linux/device-mapper.h>

/*----------------------------------------------------------------*/

#define DM_MSG_PREFIX   "cache metadata"

#define CACHE_SUPERBLOCK_MAGIC 06142003
#define CACHE_SUPERBLOCK_LOCATION 0
#define CACHE_VERSION 1
#define CACHE_METADATA_CACHE_SIZE 64

/*
 *  3 for btree insert +
 *  2 for btree lookup used within space map
 */
#define CACHE_MAX_CONCURRENT_LOCKS 5
#define SPACE_MAP_ROOT_SIZE 128

enum superblock_flag_bits {
	/* for spotting crashes that would invalidate the dirty bitset */
	CLEAN_SHUTDOWN,
};

/*
 * Each mapping from cache block -> origin block carries a set of flags.
 */
enum mapping_bits {
	/*
	 * A valid mapping.  Because we're using an array we clear this
	 * flag for a non existant mapping.
	 */
	M_VALID = 1,

	/*
	 * The data on the cache is different from that on the origin.
	 */
	M_DIRTY = 2
};

struct cache_disk_superblock {
	__le32 csum;
	__le32 flags;
	__le64 blocknr;

	__u8 uuid[16];
	__le64 magic;
	__le32 version;

	__u8 policy_name[CACHE_POLICY_NAME_SIZE];
	__le32 policy_hint_size;

	__u8 metadata_space_map_root[SPACE_MAP_ROOT_SIZE];
	__le64 mapping_root;
	__le64 hint_root;

	__le64 discard_root;
	__le64 discard_block_size;
	__le64 discard_nr_blocks;

	__le32 data_block_size;
	__le32 metadata_block_size;
	__le32 cache_blocks;

	__le32 compat_flags;
	__le32 compat_ro_flags;
	__le32 incompat_flags;

	__le32 read_hits;
	__le32 read_misses;
	__le32 write_hits;
	__le32 write_misses;
} __packed;

struct dm_cache_metadata {
	struct block_device *bdev;
	struct dm_block_manager *bm;
	struct dm_space_map *metadata_sm;
	struct dm_transaction_manager *tm;

	struct dm_array_info info;
	struct dm_array_info hint_info;
	struct dm_disk_bitset discard_info;

	struct rw_semaphore root_lock;
	dm_block_t root;
	dm_block_t hint_root;
	dm_block_t discard_root;

	sector_t discard_block_size;
	dm_dblock_t discard_nr_blocks;

	sector_t data_block_size;
	dm_cblock_t cache_blocks;
	bool changed:1;
	bool clean_when_opened:1;

	char policy_name[CACHE_POLICY_NAME_SIZE];
	size_t policy_hint_size;
	struct dm_cache_statistics stats;
};

/*-------------------------------------------------------------------
 * superblock validator
 *-----------------------------------------------------------------*/

#define SUPERBLOCK_CSUM_XOR 9031977

static void sb_prepare_for_write(struct dm_block_validator *v,
				 struct dm_block *b,
				 size_t sb_block_size)
{
	struct cache_disk_superblock *disk_super = dm_block_data(b);

	disk_super->blocknr = cpu_to_le64(dm_block_location(b));
	disk_super->csum = cpu_to_le32(dm_bm_checksum(&disk_super->flags,
						      sb_block_size - sizeof(__le32),
						      SUPERBLOCK_CSUM_XOR));
}

static int sb_check(struct dm_block_validator *v,
		    struct dm_block *b,
		    size_t sb_block_size)
{
	struct cache_disk_superblock *disk_super = dm_block_data(b);
	__le32 csum_le;

	if (dm_block_location(b) != le64_to_cpu(disk_super->blocknr)) {
		DMERR("sb_check failed: blocknr %llu: wanted %llu",
		      le64_to_cpu(disk_super->blocknr),
		      (unsigned long long)dm_block_location(b));
		return -ENOTBLK;
	}

	if (le64_to_cpu(disk_super->magic) != CACHE_SUPERBLOCK_MAGIC) {
		DMERR("sb_check failed: magic %llu: wanted %llu",
		      le64_to_cpu(disk_super->magic),
		      (unsigned long long)CACHE_SUPERBLOCK_MAGIC);
		return -EILSEQ;
	}

	csum_le = cpu_to_le32(dm_bm_checksum(&disk_super->flags,
					     sb_block_size - sizeof(__le32),
					     SUPERBLOCK_CSUM_XOR));
	if (csum_le != disk_super->csum) {
		DMERR("sb_check failed: csum %u: wanted %u",
		      le32_to_cpu(csum_le), le32_to_cpu(disk_super->csum));
		return -EILSEQ;
	}

	return 0;
}

static struct dm_block_validator sb_validator = {
	.name = "superblock",
	.prepare_for_write = sb_prepare_for_write,
	.check = sb_check
};

/*----------------------------------------------------------------*/

static int superblock_read_lock(struct dm_cache_metadata *cmd,
				struct dm_block **sblock)
{
	return dm_bm_read_lock(cmd->bm, CACHE_SUPERBLOCK_LOCATION,
			       &sb_validator, sblock);
}

static int superblock_lock_zero(struct dm_cache_metadata *cmd,
				struct dm_block **sblock)
{
	return dm_bm_write_lock_zero(cmd->bm, CACHE_SUPERBLOCK_LOCATION,
				     &sb_validator, sblock);
}

static int
superblock_lock(struct dm_cache_metadata *cmd, 190 + struct dm_block **sblock) 191 + { 192 + return dm_bm_write_lock(cmd->bm, CACHE_SUPERBLOCK_LOCATION, 193 + &sb_validator, sblock); 194 + } 195 + 196 + /*----------------------------------------------------------------*/ 197 + 198 + static int __superblock_all_zeroes(struct dm_block_manager *bm, int *result) 199 + { 200 + int r; 201 + unsigned i; 202 + struct dm_block *b; 203 + __le64 *data_le, zero = cpu_to_le64(0); 204 + unsigned sb_block_size = dm_bm_block_size(bm) / sizeof(__le64); 205 + 206 + /* 207 + * We can't use a validator here - it may be all zeroes. 208 + */ 209 + r = dm_bm_read_lock(bm, CACHE_SUPERBLOCK_LOCATION, NULL, &b); 210 + if (r) 211 + return r; 212 + 213 + data_le = dm_block_data(b); 214 + *result = 1; 215 + for (i = 0; i < sb_block_size; i++) { 216 + if (data_le[i] != zero) { 217 + *result = 0; 218 + break; 219 + } 220 + } 221 + 222 + return dm_bm_unlock(b); 223 + } 224 + 225 + static void __setup_mapping_info(struct dm_cache_metadata *cmd) 226 + { 227 + struct dm_btree_value_type vt; 228 + 229 + vt.context = NULL; 230 + vt.size = sizeof(__le64); 231 + vt.inc = NULL; 232 + vt.dec = NULL; 233 + vt.equal = NULL; 234 + dm_array_info_init(&cmd->info, cmd->tm, &vt); 235 + 236 + if (cmd->policy_hint_size) { 237 + vt.size = sizeof(__le32); 238 + dm_array_info_init(&cmd->hint_info, cmd->tm, &vt); 239 + } 240 + } 241 + 242 + static int __write_initial_superblock(struct dm_cache_metadata *cmd) 243 + { 244 + int r; 245 + struct dm_block *sblock; 246 + size_t metadata_len; 247 + struct cache_disk_superblock *disk_super; 248 + sector_t bdev_size = i_size_read(cmd->bdev->bd_inode) >> SECTOR_SHIFT; 249 + 250 + /* FIXME: see if we can lose the max sectors limit */ 251 + if (bdev_size > DM_CACHE_METADATA_MAX_SECTORS) 252 + bdev_size = DM_CACHE_METADATA_MAX_SECTORS; 253 + 254 + r = dm_sm_root_size(cmd->metadata_sm, &metadata_len); 255 + if (r < 0) 256 + return r; 257 + 258 + r = dm_tm_pre_commit(cmd->tm); 259 + 
if (r < 0) 260 + return r; 261 + 262 + r = superblock_lock_zero(cmd, &sblock); 263 + if (r) 264 + return r; 265 + 266 + disk_super = dm_block_data(sblock); 267 + disk_super->flags = 0; 268 + memset(disk_super->uuid, 0, sizeof(disk_super->uuid)); 269 + disk_super->magic = cpu_to_le64(CACHE_SUPERBLOCK_MAGIC); 270 + disk_super->version = cpu_to_le32(CACHE_VERSION); 271 + memset(disk_super->policy_name, 0, CACHE_POLICY_NAME_SIZE); 272 + disk_super->policy_hint_size = 0; 273 + 274 + r = dm_sm_copy_root(cmd->metadata_sm, &disk_super->metadata_space_map_root, 275 + metadata_len); 276 + if (r < 0) 277 + goto bad_locked; 278 + 279 + disk_super->mapping_root = cpu_to_le64(cmd->root); 280 + disk_super->hint_root = cpu_to_le64(cmd->hint_root); 281 + disk_super->discard_root = cpu_to_le64(cmd->discard_root); 282 + disk_super->discard_block_size = cpu_to_le64(cmd->discard_block_size); 283 + disk_super->discard_nr_blocks = cpu_to_le64(from_dblock(cmd->discard_nr_blocks)); 284 + disk_super->metadata_block_size = cpu_to_le32(DM_CACHE_METADATA_BLOCK_SIZE >> SECTOR_SHIFT); 285 + disk_super->data_block_size = cpu_to_le32(cmd->data_block_size); 286 + disk_super->cache_blocks = cpu_to_le32(0); 287 + memset(disk_super->policy_name, 0, sizeof(disk_super->policy_name)); 288 + 289 + disk_super->read_hits = cpu_to_le32(0); 290 + disk_super->read_misses = cpu_to_le32(0); 291 + disk_super->write_hits = cpu_to_le32(0); 292 + disk_super->write_misses = cpu_to_le32(0); 293 + 294 + return dm_tm_commit(cmd->tm, sblock); 295 + 296 + bad_locked: 297 + dm_bm_unlock(sblock); 298 + return r; 299 + } 300 + 301 + static int __format_metadata(struct dm_cache_metadata *cmd) 302 + { 303 + int r; 304 + 305 + r = dm_tm_create_with_sm(cmd->bm, CACHE_SUPERBLOCK_LOCATION, 306 + &cmd->tm, &cmd->metadata_sm); 307 + if (r < 0) { 308 + DMERR("tm_create_with_sm failed"); 309 + return r; 310 + } 311 + 312 + __setup_mapping_info(cmd); 313 + 314 + r = dm_array_empty(&cmd->info, &cmd->root); 315 + if (r < 0) 316 + goto 
bad; 317 + 318 + dm_disk_bitset_init(cmd->tm, &cmd->discard_info); 319 + 320 + r = dm_bitset_empty(&cmd->discard_info, &cmd->discard_root); 321 + if (r < 0) 322 + goto bad; 323 + 324 + cmd->discard_block_size = 0; 325 + cmd->discard_nr_blocks = 0; 326 + 327 + r = __write_initial_superblock(cmd); 328 + if (r) 329 + goto bad; 330 + 331 + cmd->clean_when_opened = true; 332 + return 0; 333 + 334 + bad: 335 + dm_tm_destroy(cmd->tm); 336 + dm_sm_destroy(cmd->metadata_sm); 337 + 338 + return r; 339 + } 340 + 341 + static int __check_incompat_features(struct cache_disk_superblock *disk_super, 342 + struct dm_cache_metadata *cmd) 343 + { 344 + uint32_t features; 345 + 346 + features = le32_to_cpu(disk_super->incompat_flags) & ~DM_CACHE_FEATURE_INCOMPAT_SUPP; 347 + if (features) { 348 + DMERR("could not access metadata due to unsupported optional features (%lx).", 349 + (unsigned long)features); 350 + return -EINVAL; 351 + } 352 + 353 + /* 354 + * Check for read-only metadata to skip the following RDWR checks. 
355 + */ 356 + if (get_disk_ro(cmd->bdev->bd_disk)) 357 + return 0; 358 + 359 + features = le32_to_cpu(disk_super->compat_ro_flags) & ~DM_CACHE_FEATURE_COMPAT_RO_SUPP; 360 + if (features) { 361 + DMERR("could not access metadata RDWR due to unsupported optional features (%lx).", 362 + (unsigned long)features); 363 + return -EINVAL; 364 + } 365 + 366 + return 0; 367 + } 368 + 369 + static int __open_metadata(struct dm_cache_metadata *cmd) 370 + { 371 + int r; 372 + struct dm_block *sblock; 373 + struct cache_disk_superblock *disk_super; 374 + unsigned long sb_flags; 375 + 376 + r = superblock_read_lock(cmd, &sblock); 377 + if (r < 0) { 378 + DMERR("couldn't read lock superblock"); 379 + return r; 380 + } 381 + 382 + disk_super = dm_block_data(sblock); 383 + 384 + r = __check_incompat_features(disk_super, cmd); 385 + if (r < 0) 386 + goto bad; 387 + 388 + r = dm_tm_open_with_sm(cmd->bm, CACHE_SUPERBLOCK_LOCATION, 389 + disk_super->metadata_space_map_root, 390 + sizeof(disk_super->metadata_space_map_root), 391 + &cmd->tm, &cmd->metadata_sm); 392 + if (r < 0) { 393 + DMERR("tm_open_with_sm failed"); 394 + goto bad; 395 + } 396 + 397 + __setup_mapping_info(cmd); 398 + dm_disk_bitset_init(cmd->tm, &cmd->discard_info); 399 + sb_flags = le32_to_cpu(disk_super->flags); 400 + cmd->clean_when_opened = test_bit(CLEAN_SHUTDOWN, &sb_flags); 401 + return dm_bm_unlock(sblock); 402 + 403 + bad: 404 + dm_bm_unlock(sblock); 405 + return r; 406 + } 407 + 408 + static int __open_or_format_metadata(struct dm_cache_metadata *cmd, 409 + bool format_device) 410 + { 411 + int r, unformatted; 412 + 413 + r = __superblock_all_zeroes(cmd->bm, &unformatted); 414 + if (r) 415 + return r; 416 + 417 + if (unformatted) 418 + return format_device ? 
__format_metadata(cmd) : -EPERM; 419 + 420 + return __open_metadata(cmd); 421 + } 422 + 423 + static int __create_persistent_data_objects(struct dm_cache_metadata *cmd, 424 + bool may_format_device) 425 + { 426 + int r; 427 + cmd->bm = dm_block_manager_create(cmd->bdev, DM_CACHE_METADATA_BLOCK_SIZE, 428 + CACHE_METADATA_CACHE_SIZE, 429 + CACHE_MAX_CONCURRENT_LOCKS); 430 + if (IS_ERR(cmd->bm)) { 431 + DMERR("could not create block manager"); 432 + return PTR_ERR(cmd->bm); 433 + } 434 + 435 + r = __open_or_format_metadata(cmd, may_format_device); 436 + if (r) 437 + dm_block_manager_destroy(cmd->bm); 438 + 439 + return r; 440 + } 441 + 442 + static void __destroy_persistent_data_objects(struct dm_cache_metadata *cmd) 443 + { 444 + dm_sm_destroy(cmd->metadata_sm); 445 + dm_tm_destroy(cmd->tm); 446 + dm_block_manager_destroy(cmd->bm); 447 + } 448 + 449 + typedef unsigned long (*flags_mutator)(unsigned long); 450 + 451 + static void update_flags(struct cache_disk_superblock *disk_super, 452 + flags_mutator mutator) 453 + { 454 + uint32_t sb_flags = mutator(le32_to_cpu(disk_super->flags)); 455 + disk_super->flags = cpu_to_le32(sb_flags); 456 + } 457 + 458 + static unsigned long set_clean_shutdown(unsigned long flags) 459 + { 460 + set_bit(CLEAN_SHUTDOWN, &flags); 461 + return flags; 462 + } 463 + 464 + static unsigned long clear_clean_shutdown(unsigned long flags) 465 + { 466 + clear_bit(CLEAN_SHUTDOWN, &flags); 467 + return flags; 468 + } 469 + 470 + static void read_superblock_fields(struct dm_cache_metadata *cmd, 471 + struct cache_disk_superblock *disk_super) 472 + { 473 + cmd->root = le64_to_cpu(disk_super->mapping_root); 474 + cmd->hint_root = le64_to_cpu(disk_super->hint_root); 475 + cmd->discard_root = le64_to_cpu(disk_super->discard_root); 476 + cmd->discard_block_size = le64_to_cpu(disk_super->discard_block_size); 477 + cmd->discard_nr_blocks = to_dblock(le64_to_cpu(disk_super->discard_nr_blocks)); 478 + cmd->data_block_size = 
le32_to_cpu(disk_super->data_block_size); 479 + cmd->cache_blocks = to_cblock(le32_to_cpu(disk_super->cache_blocks)); 480 + strncpy(cmd->policy_name, disk_super->policy_name, sizeof(cmd->policy_name)); 481 + cmd->policy_hint_size = le32_to_cpu(disk_super->policy_hint_size); 482 + 483 + cmd->stats.read_hits = le32_to_cpu(disk_super->read_hits); 484 + cmd->stats.read_misses = le32_to_cpu(disk_super->read_misses); 485 + cmd->stats.write_hits = le32_to_cpu(disk_super->write_hits); 486 + cmd->stats.write_misses = le32_to_cpu(disk_super->write_misses); 487 + 488 + cmd->changed = false; 489 + } 490 + 491 + /* 492 + * The mutator updates the superblock flags. 493 + */ 494 + static int __begin_transaction_flags(struct dm_cache_metadata *cmd, 495 + flags_mutator mutator) 496 + { 497 + int r; 498 + struct cache_disk_superblock *disk_super; 499 + struct dm_block *sblock; 500 + 501 + r = superblock_lock(cmd, &sblock); 502 + if (r) 503 + return r; 504 + 505 + disk_super = dm_block_data(sblock); 506 + update_flags(disk_super, mutator); 507 + read_superblock_fields(cmd, disk_super); 508 + 509 + return dm_bm_flush_and_unlock(cmd->bm, sblock); 510 + } 511 + 512 + static int __begin_transaction(struct dm_cache_metadata *cmd) 513 + { 514 + int r; 515 + struct cache_disk_superblock *disk_super; 516 + struct dm_block *sblock; 517 + 518 + /* 519 + * We re-read the superblock every time. Shouldn't need to do this 520 + * really. 521 + */ 522 + r = superblock_read_lock(cmd, &sblock); 523 + if (r) 524 + return r; 525 + 526 + disk_super = dm_block_data(sblock); 527 + read_superblock_fields(cmd, disk_super); 528 + dm_bm_unlock(sblock); 529 + 530 + return 0; 531 + } 532 + 533 + static int __commit_transaction(struct dm_cache_metadata *cmd, 534 + flags_mutator mutator) 535 + { 536 + int r; 537 + size_t metadata_len; 538 + struct cache_disk_superblock *disk_super; 539 + struct dm_block *sblock; 540 + 541 + /* 542 + * We need to know if the cache_disk_superblock exceeds a 512-byte sector. 
543 + */ 544 + BUILD_BUG_ON(sizeof(struct cache_disk_superblock) > 512); 545 + 546 + r = dm_bitset_flush(&cmd->discard_info, cmd->discard_root, 547 + &cmd->discard_root); 548 + if (r) 549 + return r; 550 + 551 + r = dm_tm_pre_commit(cmd->tm); 552 + if (r < 0) 553 + return r; 554 + 555 + r = dm_sm_root_size(cmd->metadata_sm, &metadata_len); 556 + if (r < 0) 557 + return r; 558 + 559 + r = superblock_lock(cmd, &sblock); 560 + if (r) 561 + return r; 562 + 563 + disk_super = dm_block_data(sblock); 564 + 565 + if (mutator) 566 + update_flags(disk_super, mutator); 567 + 568 + disk_super->mapping_root = cpu_to_le64(cmd->root); 569 + disk_super->hint_root = cpu_to_le64(cmd->hint_root); 570 + disk_super->discard_root = cpu_to_le64(cmd->discard_root); 571 + disk_super->discard_block_size = cpu_to_le64(cmd->discard_block_size); 572 + disk_super->discard_nr_blocks = cpu_to_le64(from_dblock(cmd->discard_nr_blocks)); 573 + disk_super->cache_blocks = cpu_to_le32(from_cblock(cmd->cache_blocks)); 574 + strncpy(disk_super->policy_name, cmd->policy_name, sizeof(disk_super->policy_name)); 575 + 576 + disk_super->read_hits = cpu_to_le32(cmd->stats.read_hits); 577 + disk_super->read_misses = cpu_to_le32(cmd->stats.read_misses); 578 + disk_super->write_hits = cpu_to_le32(cmd->stats.write_hits); 579 + disk_super->write_misses = cpu_to_le32(cmd->stats.write_misses); 580 + 581 + r = dm_sm_copy_root(cmd->metadata_sm, &disk_super->metadata_space_map_root, 582 + metadata_len); 583 + if (r < 0) { 584 + dm_bm_unlock(sblock); 585 + return r; 586 + } 587 + 588 + return dm_tm_commit(cmd->tm, sblock); 589 + } 590 + 591 + /*----------------------------------------------------------------*/ 592 + 593 + /* 594 + * The mappings are held in a dm-array that has 64-bit values stored in 595 + * little-endian format. The index is the cblock, the high 48 bits of the 596 + * value are the oblock and the low 16 bits are the flags. 
597 + */ 598 + #define FLAGS_MASK ((1 << 16) - 1) 599 + 600 + static __le64 pack_value(dm_oblock_t block, unsigned flags) 601 + { 602 + uint64_t value = from_oblock(block); 603 + value <<= 16; 604 + value = value | (flags & FLAGS_MASK); 605 + return cpu_to_le64(value); 606 + } 607 + 608 + static void unpack_value(__le64 value_le, dm_oblock_t *block, unsigned *flags) 609 + { 610 + uint64_t value = le64_to_cpu(value_le); 611 + uint64_t b = value >> 16; 612 + *block = to_oblock(b); 613 + *flags = value & FLAGS_MASK; 614 + } 615 + 616 + /*----------------------------------------------------------------*/ 617 + 618 + struct dm_cache_metadata *dm_cache_metadata_open(struct block_device *bdev, 619 + sector_t data_block_size, 620 + bool may_format_device, 621 + size_t policy_hint_size) 622 + { 623 + int r; 624 + struct dm_cache_metadata *cmd; 625 + 626 + cmd = kzalloc(sizeof(*cmd), GFP_KERNEL); 627 + if (!cmd) { 628 + DMERR("could not allocate metadata struct"); 629 + return NULL; 630 + } 631 + 632 + init_rwsem(&cmd->root_lock); 633 + cmd->bdev = bdev; 634 + cmd->data_block_size = data_block_size; 635 + cmd->cache_blocks = 0; 636 + cmd->policy_hint_size = policy_hint_size; 637 + cmd->changed = true; 638 + 639 + r = __create_persistent_data_objects(cmd, may_format_device); 640 + if (r) { 641 + kfree(cmd); 642 + return ERR_PTR(r); 643 + } 644 + 645 + r = __begin_transaction_flags(cmd, clear_clean_shutdown); 646 + if (r < 0) { 647 + dm_cache_metadata_close(cmd); 648 + return ERR_PTR(r); 649 + } 650 + 651 + return cmd; 652 + } 653 + 654 + void dm_cache_metadata_close(struct dm_cache_metadata *cmd) 655 + { 656 + __destroy_persistent_data_objects(cmd); 657 + kfree(cmd); 658 + } 659 + 660 + int dm_cache_resize(struct dm_cache_metadata *cmd, dm_cblock_t new_cache_size) 661 + { 662 + int r; 663 + __le64 null_mapping = pack_value(0, 0); 664 + 665 + down_write(&cmd->root_lock); 666 + __dm_bless_for_disk(&null_mapping); 667 + r = dm_array_resize(&cmd->info, cmd->root, 
from_cblock(cmd->cache_blocks), 668 + from_cblock(new_cache_size), 669 + &null_mapping, &cmd->root); 670 + if (!r) 671 + cmd->cache_blocks = new_cache_size; 672 + cmd->changed = true; 673 + up_write(&cmd->root_lock); 674 + 675 + return r; 676 + } 677 + 678 + int dm_cache_discard_bitset_resize(struct dm_cache_metadata *cmd, 679 + sector_t discard_block_size, 680 + dm_dblock_t new_nr_entries) 681 + { 682 + int r; 683 + 684 + down_write(&cmd->root_lock); 685 + r = dm_bitset_resize(&cmd->discard_info, 686 + cmd->discard_root, 687 + from_dblock(cmd->discard_nr_blocks), 688 + from_dblock(new_nr_entries), 689 + false, &cmd->discard_root); 690 + if (!r) { 691 + cmd->discard_block_size = discard_block_size; 692 + cmd->discard_nr_blocks = new_nr_entries; 693 + } 694 + 695 + cmd->changed = true; 696 + up_write(&cmd->root_lock); 697 + 698 + return r; 699 + } 700 + 701 + static int __set_discard(struct dm_cache_metadata *cmd, dm_dblock_t b) 702 + { 703 + return dm_bitset_set_bit(&cmd->discard_info, cmd->discard_root, 704 + from_dblock(b), &cmd->discard_root); 705 + } 706 + 707 + static int __clear_discard(struct dm_cache_metadata *cmd, dm_dblock_t b) 708 + { 709 + return dm_bitset_clear_bit(&cmd->discard_info, cmd->discard_root, 710 + from_dblock(b), &cmd->discard_root); 711 + } 712 + 713 + static int __is_discarded(struct dm_cache_metadata *cmd, dm_dblock_t b, 714 + bool *is_discarded) 715 + { 716 + return dm_bitset_test_bit(&cmd->discard_info, cmd->discard_root, 717 + from_dblock(b), &cmd->discard_root, 718 + is_discarded); 719 + } 720 + 721 + static int __discard(struct dm_cache_metadata *cmd, 722 + dm_dblock_t dblock, bool discard) 723 + { 724 + int r; 725 + 726 + r = (discard ? 
__set_discard : __clear_discard)(cmd, dblock); 727 + if (r) 728 + return r; 729 + 730 + cmd->changed = true; 731 + return 0; 732 + } 733 + 734 + int dm_cache_set_discard(struct dm_cache_metadata *cmd, 735 + dm_dblock_t dblock, bool discard) 736 + { 737 + int r; 738 + 739 + down_write(&cmd->root_lock); 740 + r = __discard(cmd, dblock, discard); 741 + up_write(&cmd->root_lock); 742 + 743 + return r; 744 + } 745 + 746 + static int __load_discards(struct dm_cache_metadata *cmd, 747 + load_discard_fn fn, void *context) 748 + { 749 + int r = 0; 750 + dm_block_t b; 751 + bool discard; 752 + 753 + for (b = 0; b < from_dblock(cmd->discard_nr_blocks); b++) { 754 + dm_dblock_t dblock = to_dblock(b); 755 + 756 + if (cmd->clean_when_opened) { 757 + r = __is_discarded(cmd, dblock, &discard); 758 + if (r) 759 + return r; 760 + } else 761 + discard = false; 762 + 763 + r = fn(context, cmd->discard_block_size, dblock, discard); 764 + if (r) 765 + break; 766 + } 767 + 768 + return r; 769 + } 770 + 771 + int dm_cache_load_discards(struct dm_cache_metadata *cmd, 772 + load_discard_fn fn, void *context) 773 + { 774 + int r; 775 + 776 + down_read(&cmd->root_lock); 777 + r = __load_discards(cmd, fn, context); 778 + up_read(&cmd->root_lock); 779 + 780 + return r; 781 + } 782 + 783 + dm_cblock_t dm_cache_size(struct dm_cache_metadata *cmd) 784 + { 785 + dm_cblock_t r; 786 + 787 + down_read(&cmd->root_lock); 788 + r = cmd->cache_blocks; 789 + up_read(&cmd->root_lock); 790 + 791 + return r; 792 + } 793 + 794 + static int __remove(struct dm_cache_metadata *cmd, dm_cblock_t cblock) 795 + { 796 + int r; 797 + __le64 value = pack_value(0, 0); 798 + 799 + __dm_bless_for_disk(&value); 800 + r = dm_array_set_value(&cmd->info, cmd->root, from_cblock(cblock), 801 + &value, &cmd->root); 802 + if (r) 803 + return r; 804 + 805 + cmd->changed = true; 806 + return 0; 807 + } 808 + 809 + int dm_cache_remove_mapping(struct dm_cache_metadata *cmd, dm_cblock_t cblock) 810 + { 811 + int r; 812 + 813 + 
down_write(&cmd->root_lock); 814 + r = __remove(cmd, cblock); 815 + up_write(&cmd->root_lock); 816 + 817 + return r; 818 + } 819 + 820 + static int __insert(struct dm_cache_metadata *cmd, 821 + dm_cblock_t cblock, dm_oblock_t oblock) 822 + { 823 + int r; 824 + __le64 value = pack_value(oblock, M_VALID); 825 + __dm_bless_for_disk(&value); 826 + 827 + r = dm_array_set_value(&cmd->info, cmd->root, from_cblock(cblock), 828 + &value, &cmd->root); 829 + if (r) 830 + return r; 831 + 832 + cmd->changed = true; 833 + return 0; 834 + } 835 + 836 + int dm_cache_insert_mapping(struct dm_cache_metadata *cmd, 837 + dm_cblock_t cblock, dm_oblock_t oblock) 838 + { 839 + int r; 840 + 841 + down_write(&cmd->root_lock); 842 + r = __insert(cmd, cblock, oblock); 843 + up_write(&cmd->root_lock); 844 + 845 + return r; 846 + } 847 + 848 + struct thunk { 849 + load_mapping_fn fn; 850 + void *context; 851 + 852 + struct dm_cache_metadata *cmd; 853 + bool respect_dirty_flags; 854 + bool hints_valid; 855 + }; 856 + 857 + static bool hints_array_initialized(struct dm_cache_metadata *cmd) 858 + { 859 + return cmd->hint_root && cmd->policy_hint_size; 860 + } 861 + 862 + static bool hints_array_available(struct dm_cache_metadata *cmd, 863 + const char *policy_name) 864 + { 865 + bool policy_names_match = !strncmp(cmd->policy_name, policy_name, 866 + sizeof(cmd->policy_name)); 867 + 868 + return cmd->clean_when_opened && policy_names_match && 869 + hints_array_initialized(cmd); 870 + } 871 + 872 + static int __load_mapping(void *context, uint64_t cblock, void *leaf) 873 + { 874 + int r = 0; 875 + bool dirty; 876 + __le64 value; 877 + __le32 hint_value = 0; 878 + dm_oblock_t oblock; 879 + unsigned flags; 880 + struct thunk *thunk = context; 881 + struct dm_cache_metadata *cmd = thunk->cmd; 882 + 883 + memcpy(&value, leaf, sizeof(value)); 884 + unpack_value(value, &oblock, &flags); 885 + 886 + if (flags & M_VALID) { 887 + if (thunk->hints_valid) { 888 + r = dm_array_get_value(&cmd->hint_info, 
cmd->hint_root, 889 + cblock, &hint_value); 890 + if (r && r != -ENODATA) 891 + return r; 892 + } 893 + 894 + dirty = thunk->respect_dirty_flags ? (flags & M_DIRTY) : true; 895 + r = thunk->fn(thunk->context, oblock, to_cblock(cblock), 896 + dirty, le32_to_cpu(hint_value), thunk->hints_valid); 897 + } 898 + 899 + return r; 900 + } 901 + 902 + static int __load_mappings(struct dm_cache_metadata *cmd, const char *policy_name, 903 + load_mapping_fn fn, void *context) 904 + { 905 + struct thunk thunk; 906 + 907 + thunk.fn = fn; 908 + thunk.context = context; 909 + 910 + thunk.cmd = cmd; 911 + thunk.respect_dirty_flags = cmd->clean_when_opened; 912 + thunk.hints_valid = hints_array_available(cmd, policy_name); 913 + 914 + return dm_array_walk(&cmd->info, cmd->root, __load_mapping, &thunk); 915 + } 916 + 917 + int dm_cache_load_mappings(struct dm_cache_metadata *cmd, const char *policy_name, 918 + load_mapping_fn fn, void *context) 919 + { 920 + int r; 921 + 922 + down_read(&cmd->root_lock); 923 + r = __load_mappings(cmd, policy_name, fn, context); 924 + up_read(&cmd->root_lock); 925 + 926 + return r; 927 + } 928 + 929 + static int __dump_mapping(void *context, uint64_t cblock, void *leaf) 930 + { 931 + int r = 0; 932 + __le64 value; 933 + dm_oblock_t oblock; 934 + unsigned flags; 935 + 936 + memcpy(&value, leaf, sizeof(value)); 937 + unpack_value(value, &oblock, &flags); 938 + 939 + return r; 940 + } 941 + 942 + static int __dump_mappings(struct dm_cache_metadata *cmd) 943 + { 944 + return dm_array_walk(&cmd->info, cmd->root, __dump_mapping, NULL); 945 + } 946 + 947 + void dm_cache_dump(struct dm_cache_metadata *cmd) 948 + { 949 + down_read(&cmd->root_lock); 950 + __dump_mappings(cmd); 951 + up_read(&cmd->root_lock); 952 + } 953 + 954 + int dm_cache_changed_this_transaction(struct dm_cache_metadata *cmd) 955 + { 956 + int r; 957 + 958 + down_read(&cmd->root_lock); 959 + r = cmd->changed; 960 + up_read(&cmd->root_lock); 961 + 962 + return r; 963 + } 964 + 965 + static 
int __dirty(struct dm_cache_metadata *cmd, dm_cblock_t cblock, bool dirty) 966 + { 967 + int r; 968 + unsigned flags; 969 + dm_oblock_t oblock; 970 + __le64 value; 971 + 972 + r = dm_array_get_value(&cmd->info, cmd->root, from_cblock(cblock), &value); 973 + if (r) 974 + return r; 975 + 976 + unpack_value(value, &oblock, &flags); 977 + 978 + if (((flags & M_DIRTY) && dirty) || (!(flags & M_DIRTY) && !dirty)) 979 + /* nothing to be done */ 980 + return 0; 981 + 982 + value = pack_value(oblock, flags | (dirty ? M_DIRTY : 0)); 983 + __dm_bless_for_disk(&value); 984 + 985 + r = dm_array_set_value(&cmd->info, cmd->root, from_cblock(cblock), 986 + &value, &cmd->root); 987 + if (r) 988 + return r; 989 + 990 + cmd->changed = true; 991 + return 0; 992 + 993 + } 994 + 995 + int dm_cache_set_dirty(struct dm_cache_metadata *cmd, 996 + dm_cblock_t cblock, bool dirty) 997 + { 998 + int r; 999 + 1000 + down_write(&cmd->root_lock); 1001 + r = __dirty(cmd, cblock, dirty); 1002 + up_write(&cmd->root_lock); 1003 + 1004 + return r; 1005 + } 1006 + 1007 + void dm_cache_metadata_get_stats(struct dm_cache_metadata *cmd, 1008 + struct dm_cache_statistics *stats) 1009 + { 1010 + down_read(&cmd->root_lock); 1011 + memcpy(stats, &cmd->stats, sizeof(*stats)); 1012 + up_read(&cmd->root_lock); 1013 + } 1014 + 1015 + void dm_cache_metadata_set_stats(struct dm_cache_metadata *cmd, 1016 + struct dm_cache_statistics *stats) 1017 + { 1018 + down_write(&cmd->root_lock); 1019 + memcpy(&cmd->stats, stats, sizeof(*stats)); 1020 + up_write(&cmd->root_lock); 1021 + } 1022 + 1023 + int dm_cache_commit(struct dm_cache_metadata *cmd, bool clean_shutdown) 1024 + { 1025 + int r; 1026 + flags_mutator mutator = (clean_shutdown ? 
set_clean_shutdown : 1027 + clear_clean_shutdown); 1028 + 1029 + down_write(&cmd->root_lock); 1030 + r = __commit_transaction(cmd, mutator); 1031 + if (r) 1032 + goto out; 1033 + 1034 + r = __begin_transaction(cmd); 1035 + 1036 + out: 1037 + up_write(&cmd->root_lock); 1038 + return r; 1039 + } 1040 + 1041 + int dm_cache_get_free_metadata_block_count(struct dm_cache_metadata *cmd, 1042 + dm_block_t *result) 1043 + { 1044 + int r = -EINVAL; 1045 + 1046 + down_read(&cmd->root_lock); 1047 + r = dm_sm_get_nr_free(cmd->metadata_sm, result); 1048 + up_read(&cmd->root_lock); 1049 + 1050 + return r; 1051 + } 1052 + 1053 + int dm_cache_get_metadata_dev_size(struct dm_cache_metadata *cmd, 1054 + dm_block_t *result) 1055 + { 1056 + int r = -EINVAL; 1057 + 1058 + down_read(&cmd->root_lock); 1059 + r = dm_sm_get_nr_blocks(cmd->metadata_sm, result); 1060 + up_read(&cmd->root_lock); 1061 + 1062 + return r; 1063 + } 1064 + 1065 + /*----------------------------------------------------------------*/ 1066 + 1067 + static int begin_hints(struct dm_cache_metadata *cmd, struct dm_cache_policy *policy) 1068 + { 1069 + int r; 1070 + __le32 value; 1071 + size_t hint_size; 1072 + const char *policy_name = dm_cache_policy_get_name(policy); 1073 + 1074 + if (!policy_name[0] || 1075 + (strlen(policy_name) > sizeof(cmd->policy_name) - 1)) 1076 + return -EINVAL; 1077 + 1078 + if (strcmp(cmd->policy_name, policy_name)) { 1079 + strncpy(cmd->policy_name, policy_name, sizeof(cmd->policy_name)); 1080 + 1081 + hint_size = dm_cache_policy_get_hint_size(policy); 1082 + if (!hint_size) 1083 + return 0; /* short-circuit hints initialization */ 1084 + cmd->policy_hint_size = hint_size; 1085 + 1086 + if (cmd->hint_root) { 1087 + r = dm_array_del(&cmd->hint_info, cmd->hint_root); 1088 + if (r) 1089 + return r; 1090 + } 1091 + 1092 + r = dm_array_empty(&cmd->hint_info, &cmd->hint_root); 1093 + if (r) 1094 + return r; 1095 + 1096 + value = cpu_to_le32(0); 1097 + __dm_bless_for_disk(&value); 1098 + r = 
dm_array_resize(&cmd->hint_info, cmd->hint_root, 0, 1099 + from_cblock(cmd->cache_blocks), 1100 + &value, &cmd->hint_root); 1101 + if (r) 1102 + return r; 1103 + } 1104 + 1105 + return 0; 1106 + } 1107 + 1108 + int dm_cache_begin_hints(struct dm_cache_metadata *cmd, struct dm_cache_policy *policy) 1109 + { 1110 + int r; 1111 + 1112 + down_write(&cmd->root_lock); 1113 + r = begin_hints(cmd, policy); 1114 + up_write(&cmd->root_lock); 1115 + 1116 + return r; 1117 + } 1118 + 1119 + static int save_hint(struct dm_cache_metadata *cmd, dm_cblock_t cblock, 1120 + uint32_t hint) 1121 + { 1122 + int r; 1123 + __le32 value = cpu_to_le32(hint); 1124 + __dm_bless_for_disk(&value); 1125 + 1126 + r = dm_array_set_value(&cmd->hint_info, cmd->hint_root, 1127 + from_cblock(cblock), &value, &cmd->hint_root); 1128 + cmd->changed = true; 1129 + 1130 + return r; 1131 + } 1132 + 1133 + int dm_cache_save_hint(struct dm_cache_metadata *cmd, dm_cblock_t cblock, 1134 + uint32_t hint) 1135 + { 1136 + int r; 1137 + 1138 + if (!hints_array_initialized(cmd)) 1139 + return 0; 1140 + 1141 + down_write(&cmd->root_lock); 1142 + r = save_hint(cmd, cblock, hint); 1143 + up_write(&cmd->root_lock); 1144 + 1145 + return r; 1146 + }
+142
drivers/md/dm-cache-metadata.h
/*
 * Copyright (C) 2012 Red Hat, Inc.
 *
 * This file is released under the GPL.
 */

#ifndef DM_CACHE_METADATA_H
#define DM_CACHE_METADATA_H

#include "dm-cache-block-types.h"
#include "dm-cache-policy-internal.h"

/*----------------------------------------------------------------*/

#define DM_CACHE_METADATA_BLOCK_SIZE 4096

/* FIXME: remove this restriction */
/*
 * The metadata device is currently limited in size.
 *
 * We have one block of index, which can hold 255 index entries.  Each
 * index entry contains allocation info about 16k metadata blocks.
 */
#define DM_CACHE_METADATA_MAX_SECTORS (255 * (1 << 14) * (DM_CACHE_METADATA_BLOCK_SIZE / (1 << SECTOR_SHIFT)))

/*
 * A metadata device larger than 16GB triggers a warning.
 */
#define DM_CACHE_METADATA_MAX_SECTORS_WARNING (16 * (1024 * 1024 * 1024 >> SECTOR_SHIFT))

/*----------------------------------------------------------------*/

/*
 * Ext[234]-style compat feature flags.
 *
 * A new feature which old metadata will still be compatible with should
 * define a DM_CACHE_FEATURE_COMPAT_* flag (rarely useful).
 *
 * A new feature that is not compatible with old code should define a
 * DM_CACHE_FEATURE_INCOMPAT_* flag and guard the relevant code with
 * that flag.
 *
 * A new feature that is not compatible with old code accessing the
 * metadata RDWR should define a DM_CACHE_FEATURE_RO_COMPAT_* flag and
 * guard the relevant code with that flag.
 *
 * As these various flags are defined they should be added to the
 * following masks.
 */
#define DM_CACHE_FEATURE_COMPAT_SUPP	  0UL
#define DM_CACHE_FEATURE_COMPAT_RO_SUPP	  0UL
#define DM_CACHE_FEATURE_INCOMPAT_SUPP	  0UL

/*
 * Reopens or creates a new, empty metadata volume.
 * Returns an ERR_PTR on failure.
 */
struct dm_cache_metadata *dm_cache_metadata_open(struct block_device *bdev,
						 sector_t data_block_size,
						 bool may_format_device,
						 size_t policy_hint_size);

void dm_cache_metadata_close(struct dm_cache_metadata *cmd);

/*
 * The metadata needs to know how many cache blocks there are.  We don't
 * care about the origin, assuming the core target is giving us valid
 * origin blocks to map to.
 */
int dm_cache_resize(struct dm_cache_metadata *cmd, dm_cblock_t new_cache_size);
dm_cblock_t dm_cache_size(struct dm_cache_metadata *cmd);

int dm_cache_discard_bitset_resize(struct dm_cache_metadata *cmd,
				   sector_t discard_block_size,
				   dm_dblock_t new_nr_entries);

typedef int (*load_discard_fn)(void *context, sector_t discard_block_size,
			       dm_dblock_t dblock, bool discarded);
int dm_cache_load_discards(struct dm_cache_metadata *cmd,
			   load_discard_fn fn, void *context);

int dm_cache_set_discard(struct dm_cache_metadata *cmd, dm_dblock_t dblock, bool discard);

int dm_cache_remove_mapping(struct dm_cache_metadata *cmd, dm_cblock_t cblock);
int dm_cache_insert_mapping(struct dm_cache_metadata *cmd, dm_cblock_t cblock, dm_oblock_t oblock);
int dm_cache_changed_this_transaction(struct dm_cache_metadata *cmd);

typedef int (*load_mapping_fn)(void *context, dm_oblock_t oblock,
			       dm_cblock_t cblock, bool dirty,
			       uint32_t hint, bool hint_valid);
int dm_cache_load_mappings(struct dm_cache_metadata *cmd,
			   const char *policy_name,
			   load_mapping_fn fn,
			   void *context);

int dm_cache_set_dirty(struct dm_cache_metadata *cmd, dm_cblock_t cblock, bool dirty);

struct dm_cache_statistics {
	uint32_t read_hits;
	uint32_t read_misses;
	uint32_t write_hits;
	uint32_t write_misses;
};

void dm_cache_metadata_get_stats(struct dm_cache_metadata *cmd,
				 struct dm_cache_statistics *stats);
void dm_cache_metadata_set_stats(struct dm_cache_metadata *cmd,
				 struct dm_cache_statistics *stats);

int dm_cache_commit(struct dm_cache_metadata *cmd, bool clean_shutdown);

int dm_cache_get_free_metadata_block_count(struct dm_cache_metadata *cmd,
					   dm_block_t *result);

int dm_cache_get_metadata_dev_size(struct dm_cache_metadata *cmd,
				   dm_block_t *result);

void dm_cache_dump(struct dm_cache_metadata *cmd);

/*
 * The policy is invited to save a 32bit hint value for every cblock (eg,
 * for a hit count).  These are stored against the policy name.  If
 * policies are changed, then hints will be lost.  If the machine crashes,
 * hints will be lost.
 *
 * The hints are indexed by the cblock, but many policies will not
 * necessarily have a fast way of accessing efficiently via cblock.  So
 * rather than querying the policy for each cblock, we let it walk its data
 * structures and fill in the hints in whatever order it wishes.
 */
int dm_cache_begin_hints(struct dm_cache_metadata *cmd, struct dm_cache_policy *p);

/*
 * Requests hints for every cblock and stores them in the metadata device.
 */
int dm_cache_save_hint(struct dm_cache_metadata *cmd,
		       dm_cblock_t cblock, uint32_t hint);

/*----------------------------------------------------------------*/

#endif /* DM_CACHE_METADATA_H */
+464
drivers/md/dm-cache-policy-cleaner.c
/*
 * Copyright (C) 2012 Red Hat. All rights reserved.
 *
 * writeback cache policy supporting flushing out dirty cache blocks.
 *
 * This file is released under the GPL.
 */

#include "dm-cache-policy.h"
#include "dm.h"

#include <linux/hash.h>
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>

/*----------------------------------------------------------------*/

#define DM_MSG_PREFIX "cache cleaner"
#define CLEANER_VERSION "1.0.0"

/* Cache entry struct. */
struct wb_cache_entry {
	struct list_head list;
	struct hlist_node hlist;

	dm_oblock_t oblock;
	dm_cblock_t cblock;
	bool dirty:1;
	bool pending:1;
};

struct hash {
	struct hlist_head *table;
	dm_block_t hash_bits;
	unsigned nr_buckets;
};

struct policy {
	struct dm_cache_policy policy;
	spinlock_t lock;

	struct list_head free;
	struct list_head clean;
	struct list_head clean_pending;
	struct list_head dirty;

	/*
	 * We know exactly how many cblocks will be needed,
	 * so we can allocate them up front.
	 */
	dm_cblock_t cache_size, nr_cblocks_allocated;
	struct wb_cache_entry *cblocks;
	struct hash chash;
};

/*----------------------------------------------------------------------------*/

/*
 * Low-level functions.
 */
static unsigned next_power(unsigned n, unsigned min)
{
	return roundup_pow_of_two(max(n, min));
}

static struct policy *to_policy(struct dm_cache_policy *p)
{
	return container_of(p, struct policy, policy);
}

static struct list_head *list_pop(struct list_head *q)
{
	struct list_head *r = q->next;

	list_del(r);

	return r;
}

/*----------------------------------------------------------------------------*/

/* Allocate/free various resources. */
static int alloc_hash(struct hash *hash, unsigned elts)
{
	hash->nr_buckets = next_power(elts >> 4, 16);
	hash->hash_bits = ffs(hash->nr_buckets) - 1;
	hash->table = vzalloc(sizeof(*hash->table) * hash->nr_buckets);

	return hash->table ? 0 : -ENOMEM;
}

static void free_hash(struct hash *hash)
{
	vfree(hash->table);
}

static int alloc_cache_blocks_with_hash(struct policy *p, dm_cblock_t cache_size)
{
	int r = -ENOMEM;

	p->cblocks = vzalloc(sizeof(*p->cblocks) * from_cblock(cache_size));
	if (p->cblocks) {
		unsigned u = from_cblock(cache_size);

		while (u--)
			list_add(&p->cblocks[u].list, &p->free);

		p->nr_cblocks_allocated = 0;

		/* Cache entries hash. */
		r = alloc_hash(&p->chash, from_cblock(cache_size));
		if (r)
			vfree(p->cblocks);
	}

	return r;
}

static void free_cache_blocks_and_hash(struct policy *p)
{
	free_hash(&p->chash);
	vfree(p->cblocks);
}

static struct wb_cache_entry *alloc_cache_entry(struct policy *p)
{
	struct wb_cache_entry *e;

	BUG_ON(from_cblock(p->nr_cblocks_allocated) >= from_cblock(p->cache_size));

	e = list_entry(list_pop(&p->free), struct wb_cache_entry, list);
	p->nr_cblocks_allocated = to_cblock(from_cblock(p->nr_cblocks_allocated) + 1);

	return e;
}

/*----------------------------------------------------------------------------*/

/* Hash functions (lookup, insert, remove). */
static struct wb_cache_entry *lookup_cache_entry(struct policy *p, dm_oblock_t oblock)
{
	struct hash *hash = &p->chash;
	unsigned h = hash_64(from_oblock(oblock), hash->hash_bits);
	struct wb_cache_entry *cur;
	struct hlist_head *bucket = &hash->table[h];

	hlist_for_each_entry(cur, bucket, hlist) {
		if (cur->oblock == oblock) {
			/* Move to the front of the bucket for faster access. */
			hlist_del(&cur->hlist);
			hlist_add_head(&cur->hlist, bucket);
			return cur;
		}
	}

	return NULL;
}

static void insert_cache_hash_entry(struct policy *p, struct wb_cache_entry *e)
{
	unsigned h = hash_64(from_oblock(e->oblock), p->chash.hash_bits);

	hlist_add_head(&e->hlist, &p->chash.table[h]);
}

static void remove_cache_hash_entry(struct wb_cache_entry *e)
{
	hlist_del(&e->hlist);
}

/* Public interface (see dm-cache-policy.h). */
static int wb_map(struct dm_cache_policy *pe, dm_oblock_t oblock,
		  bool can_block, bool can_migrate, bool discarded_oblock,
		  struct bio *bio, struct policy_result *result)
{
	struct policy *p = to_policy(pe);
	struct wb_cache_entry *e;
	unsigned long flags;

	result->op = POLICY_MISS;

	if (can_block)
		spin_lock_irqsave(&p->lock, flags);
	else if (!spin_trylock_irqsave(&p->lock, flags))
		return -EWOULDBLOCK;

	e = lookup_cache_entry(p, oblock);
	if (e) {
		result->op = POLICY_HIT;
		result->cblock = e->cblock;
	}

	spin_unlock_irqrestore(&p->lock, flags);

	return 0;
}

static int wb_lookup(struct dm_cache_policy *pe, dm_oblock_t oblock, dm_cblock_t *cblock)
{
	int r;
	struct policy *p = to_policy(pe);
	struct wb_cache_entry *e;
	unsigned long flags;

	if (!spin_trylock_irqsave(&p->lock, flags))
		return -EWOULDBLOCK;

	e = lookup_cache_entry(p, oblock);
	if (e) {
		*cblock = e->cblock;
		r = 0;
	} else
		r = -ENOENT;

	spin_unlock_irqrestore(&p->lock, flags);

	return r;
}

static void __set_clear_dirty(struct dm_cache_policy *pe, dm_oblock_t oblock, bool set)
{
	struct policy *p = to_policy(pe);
	struct wb_cache_entry *e;

	e = lookup_cache_entry(p, oblock);
	BUG_ON(!e);

	if (set) {
		if (!e->dirty) {
			e->dirty = true;
			list_move(&e->list, &p->dirty);
		}
	} else {
		if (e->dirty) {
			e->pending = false;
			e->dirty = false;
			list_move(&e->list, &p->clean);
		}
	}
}

static void wb_set_dirty(struct dm_cache_policy *pe, dm_oblock_t oblock)
{
	struct policy *p = to_policy(pe);
	unsigned long flags;

	spin_lock_irqsave(&p->lock, flags);
	__set_clear_dirty(pe, oblock, true);
	spin_unlock_irqrestore(&p->lock, flags);
}

static void wb_clear_dirty(struct dm_cache_policy *pe, dm_oblock_t oblock)
{
	struct policy *p = to_policy(pe);
	unsigned long flags;

	spin_lock_irqsave(&p->lock, flags);
	__set_clear_dirty(pe, oblock, false);
	spin_unlock_irqrestore(&p->lock, flags);
}

static void add_cache_entry(struct policy *p, struct wb_cache_entry *e)
{
	insert_cache_hash_entry(p, e);
	if (e->dirty)
		list_add(&e->list, &p->dirty);
	else
		list_add(&e->list, &p->clean);
}

static int wb_load_mapping(struct dm_cache_policy *pe,
			   dm_oblock_t oblock, dm_cblock_t cblock,
			   uint32_t hint, bool hint_valid)
{
	int r;
	struct policy *p = to_policy(pe);
	struct wb_cache_entry *e = alloc_cache_entry(p);

	if (e) {
		e->cblock = cblock;
		e->oblock = oblock;
		e->dirty = false; /* blocks default to clean */
		add_cache_entry(p, e);
		r = 0;
	} else
		r = -ENOMEM;

	return r;
}

static void wb_destroy(struct dm_cache_policy *pe)
{
	struct policy *p = to_policy(pe);

	free_cache_blocks_and_hash(p);
	kfree(p);
}

static struct wb_cache_entry *__wb_force_remove_mapping(struct policy *p, dm_oblock_t oblock)
{
	struct wb_cache_entry *r = lookup_cache_entry(p, oblock);

	BUG_ON(!r);

	remove_cache_hash_entry(r);
	list_del(&r->list);

	return r;
}

static void wb_remove_mapping(struct dm_cache_policy *pe, dm_oblock_t oblock)
{
	struct policy *p = to_policy(pe);
	struct wb_cache_entry *e;
	unsigned long flags;

	spin_lock_irqsave(&p->lock, flags);
	e = __wb_force_remove_mapping(p, oblock);
	list_add_tail(&e->list, &p->free);
	BUG_ON(!from_cblock(p->nr_cblocks_allocated));
	p->nr_cblocks_allocated = to_cblock(from_cblock(p->nr_cblocks_allocated) - 1);
	spin_unlock_irqrestore(&p->lock, flags);
}

static void wb_force_mapping(struct dm_cache_policy *pe,
			     dm_oblock_t current_oblock, dm_oblock_t oblock)
{
	struct policy *p = to_policy(pe);
	struct wb_cache_entry *e;
	unsigned long flags;

	spin_lock_irqsave(&p->lock, flags);
	e = __wb_force_remove_mapping(p, current_oblock);
	e->oblock = oblock;
	add_cache_entry(p, e);
	spin_unlock_irqrestore(&p->lock, flags);
}

static struct wb_cache_entry *get_next_dirty_entry(struct policy *p)
{
	struct list_head *l;
	struct wb_cache_entry *r;

	if (list_empty(&p->dirty))
		return NULL;

	l = list_pop(&p->dirty);
	r = container_of(l, struct wb_cache_entry, list);
	list_add(l, &p->clean_pending);

	return r;
}

static int wb_writeback_work(struct dm_cache_policy *pe,
			     dm_oblock_t *oblock,
			     dm_cblock_t *cblock)
{
	int r = -ENOENT;
	struct policy *p = to_policy(pe);
	struct wb_cache_entry *e;
	unsigned long flags;

	spin_lock_irqsave(&p->lock, flags);

	e = get_next_dirty_entry(p);
	if (e) {
		*oblock = e->oblock;
		*cblock = e->cblock;
		r = 0;
	}

	spin_unlock_irqrestore(&p->lock, flags);

	return r;
}

static dm_cblock_t wb_residency(struct dm_cache_policy *pe)
{
	return to_policy(pe)->nr_cblocks_allocated;
}

/* Init the policy plugin interface function pointers. */
static void init_policy_functions(struct policy *p)
{
	p->policy.destroy = wb_destroy;
	p->policy.map = wb_map;
	p->policy.lookup = wb_lookup;
	p->policy.set_dirty = wb_set_dirty;
	p->policy.clear_dirty = wb_clear_dirty;
	p->policy.load_mapping = wb_load_mapping;
	p->policy.walk_mappings = NULL;
	p->policy.remove_mapping = wb_remove_mapping;
	p->policy.writeback_work = wb_writeback_work;
	p->policy.force_mapping = wb_force_mapping;
	p->policy.residency = wb_residency;
	p->policy.tick = NULL;
}

static struct dm_cache_policy *wb_create(dm_cblock_t cache_size,
					 sector_t origin_size,
					 sector_t cache_block_size)
{
	int r;
	struct policy *p = kzalloc(sizeof(*p), GFP_KERNEL);

	if (!p)
		return NULL;

	init_policy_functions(p);
	INIT_LIST_HEAD(&p->free);
	INIT_LIST_HEAD(&p->clean);
	INIT_LIST_HEAD(&p->clean_pending);
	INIT_LIST_HEAD(&p->dirty);

	p->cache_size = cache_size;
	spin_lock_init(&p->lock);

	/* Allocate cache entry structs and add them to the free list. */
	r = alloc_cache_blocks_with_hash(p, cache_size);
	if (!r)
		return &p->policy;

	kfree(p);

	return NULL;
}

/*----------------------------------------------------------------------------*/

static struct dm_cache_policy_type wb_policy_type = {
	.name = "cleaner",
	.hint_size = 0,
	.owner = THIS_MODULE,
	.create = wb_create
};

static int __init wb_init(void)
{
	int r = dm_cache_policy_register(&wb_policy_type);

	if (r < 0)
		DMERR("register failed %d", r);
	else
		DMINFO("version " CLEANER_VERSION " loaded");

	return r;
}

static void __exit wb_exit(void)
{
	dm_cache_policy_unregister(&wb_policy_type);
}

module_init(wb_init);
module_exit(wb_exit);

MODULE_AUTHOR("Heinz Mauelshagen <dm-devel@redhat.com>");
MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("cleaner cache policy");
+124
drivers/md/dm-cache-policy-internal.h
/*
 * Copyright (C) 2012 Red Hat. All rights reserved.
 *
 * This file is released under the GPL.
 */

#ifndef DM_CACHE_POLICY_INTERNAL_H
#define DM_CACHE_POLICY_INTERNAL_H

#include "dm-cache-policy.h"

/*----------------------------------------------------------------*/

/*
 * Little inline functions that simplify calling the policy methods.
 */
static inline int policy_map(struct dm_cache_policy *p, dm_oblock_t oblock,
			     bool can_block, bool can_migrate, bool discarded_oblock,
			     struct bio *bio, struct policy_result *result)
{
	return p->map(p, oblock, can_block, can_migrate, discarded_oblock, bio, result);
}

static inline int policy_lookup(struct dm_cache_policy *p, dm_oblock_t oblock, dm_cblock_t *cblock)
{
	BUG_ON(!p->lookup);
	return p->lookup(p, oblock, cblock);
}

static inline void policy_set_dirty(struct dm_cache_policy *p, dm_oblock_t oblock)
{
	if (p->set_dirty)
		p->set_dirty(p, oblock);
}

static inline void policy_clear_dirty(struct dm_cache_policy *p, dm_oblock_t oblock)
{
	if (p->clear_dirty)
		p->clear_dirty(p, oblock);
}

static inline int policy_load_mapping(struct dm_cache_policy *p,
				      dm_oblock_t oblock, dm_cblock_t cblock,
				      uint32_t hint, bool hint_valid)
{
	return p->load_mapping(p, oblock, cblock, hint, hint_valid);
}

static inline int policy_walk_mappings(struct dm_cache_policy *p,
				       policy_walk_fn fn, void *context)
{
	return p->walk_mappings ? p->walk_mappings(p, fn, context) : 0;
}

static inline int policy_writeback_work(struct dm_cache_policy *p,
					dm_oblock_t *oblock,
					dm_cblock_t *cblock)
{
	return p->writeback_work ? p->writeback_work(p, oblock, cblock) : -ENOENT;
}

static inline void policy_remove_mapping(struct dm_cache_policy *p, dm_oblock_t oblock)
{
	return p->remove_mapping(p, oblock);
}

static inline void policy_force_mapping(struct dm_cache_policy *p,
					dm_oblock_t current_oblock, dm_oblock_t new_oblock)
{
	return p->force_mapping(p, current_oblock, new_oblock);
}

static inline dm_cblock_t policy_residency(struct dm_cache_policy *p)
{
	return p->residency(p);
}

static inline void policy_tick(struct dm_cache_policy *p)
{
	if (p->tick)
		return p->tick(p);
}

static inline int policy_emit_config_values(struct dm_cache_policy *p, char *result, unsigned maxlen)
{
	ssize_t sz = 0;

	if (p->emit_config_values)
		return p->emit_config_values(p, result, maxlen);

	DMEMIT("0");
	return 0;
}

static inline int policy_set_config_value(struct dm_cache_policy *p,
					  const char *key, const char *value)
{
	return p->set_config_value ? p->set_config_value(p, key, value) : -EINVAL;
}

/*----------------------------------------------------------------*/

/*
 * Creates a new cache policy given a policy name, a cache size, an origin
 * size and the block size.
 */
struct dm_cache_policy *dm_cache_policy_create(const char *name, dm_cblock_t cache_size,
					       sector_t origin_size, sector_t block_size);

/*
 * Destroys the policy.  This drops references to the policy module as well
 * as calling its destroy method.  So always use this rather than calling
 * the policy->destroy method directly.
 */
void dm_cache_policy_destroy(struct dm_cache_policy *p);

/*
 * In case we've forgotten.
 */
const char *dm_cache_policy_get_name(struct dm_cache_policy *p);

size_t dm_cache_policy_get_hint_size(struct dm_cache_policy *p);

/*----------------------------------------------------------------*/

#endif /* DM_CACHE_POLICY_INTERNAL_H */
+1195
drivers/md/dm-cache-policy-mq.c
··· 1 + /* 2 + * Copyright (C) 2012 Red Hat. All rights reserved. 3 + * 4 + * This file is released under the GPL. 5 + */ 6 + 7 + #include "dm-cache-policy.h" 8 + #include "dm.h" 9 + 10 + #include <linux/hash.h> 11 + #include <linux/module.h> 12 + #include <linux/mutex.h> 13 + #include <linux/slab.h> 14 + #include <linux/vmalloc.h> 15 + 16 + #define DM_MSG_PREFIX "cache-policy-mq" 17 + #define MQ_VERSION "1.0.0" 18 + 19 + static struct kmem_cache *mq_entry_cache; 20 + 21 + /*----------------------------------------------------------------*/ 22 + 23 + static unsigned next_power(unsigned n, unsigned min) 24 + { 25 + return roundup_pow_of_two(max(n, min)); 26 + } 27 + 28 + /*----------------------------------------------------------------*/ 29 + 30 + static unsigned long *alloc_bitset(unsigned nr_entries) 31 + { 32 + size_t s = sizeof(unsigned long) * dm_div_up(nr_entries, BITS_PER_LONG); 33 + return vzalloc(s); 34 + } 35 + 36 + static void free_bitset(unsigned long *bits) 37 + { 38 + vfree(bits); 39 + } 40 + 41 + /*----------------------------------------------------------------*/ 42 + 43 + /* 44 + * Large, sequential ios are probably better left on the origin device since 45 + * spindles tend to have good bandwidth. 46 + * 47 + * The io_tracker tries to spot when the io is in one of these sequential 48 + * modes. 49 + * 50 + * Two thresholds to switch between random and sequential io mode are defaulting 51 + * as follows and can be adjusted via the constructor and message interfaces. 
52 + */ 53 + #define RANDOM_THRESHOLD_DEFAULT 4 54 + #define SEQUENTIAL_THRESHOLD_DEFAULT 512 55 + 56 + enum io_pattern { 57 + PATTERN_SEQUENTIAL, 58 + PATTERN_RANDOM 59 + }; 60 + 61 + struct io_tracker { 62 + enum io_pattern pattern; 63 + 64 + unsigned nr_seq_samples; 65 + unsigned nr_rand_samples; 66 + unsigned thresholds[2]; 67 + 68 + dm_oblock_t last_end_oblock; 69 + }; 70 + 71 + static void iot_init(struct io_tracker *t, 72 + int sequential_threshold, int random_threshold) 73 + { 74 + t->pattern = PATTERN_RANDOM; 75 + t->nr_seq_samples = 0; 76 + t->nr_rand_samples = 0; 77 + t->last_end_oblock = 0; 78 + t->thresholds[PATTERN_RANDOM] = random_threshold; 79 + t->thresholds[PATTERN_SEQUENTIAL] = sequential_threshold; 80 + } 81 + 82 + static enum io_pattern iot_pattern(struct io_tracker *t) 83 + { 84 + return t->pattern; 85 + } 86 + 87 + static void iot_update_stats(struct io_tracker *t, struct bio *bio) 88 + { 89 + if (bio->bi_sector == from_oblock(t->last_end_oblock) + 1) 90 + t->nr_seq_samples++; 91 + else { 92 + /* 93 + * Just one non-sequential IO is enough to reset the 94 + * counters. 
95 + */ 96 + if (t->nr_seq_samples) { 97 + t->nr_seq_samples = 0; 98 + t->nr_rand_samples = 0; 99 + } 100 + 101 + t->nr_rand_samples++; 102 + } 103 + 104 + t->last_end_oblock = to_oblock(bio->bi_sector + bio_sectors(bio) - 1); 105 + } 106 + 107 + static void iot_check_for_pattern_switch(struct io_tracker *t) 108 + { 109 + switch (t->pattern) { 110 + case PATTERN_SEQUENTIAL: 111 + if (t->nr_rand_samples >= t->thresholds[PATTERN_RANDOM]) { 112 + t->pattern = PATTERN_RANDOM; 113 + t->nr_seq_samples = t->nr_rand_samples = 0; 114 + } 115 + break; 116 + 117 + case PATTERN_RANDOM: 118 + if (t->nr_seq_samples >= t->thresholds[PATTERN_SEQUENTIAL]) { 119 + t->pattern = PATTERN_SEQUENTIAL; 120 + t->nr_seq_samples = t->nr_rand_samples = 0; 121 + } 122 + break; 123 + } 124 + } 125 + 126 + static void iot_examine_bio(struct io_tracker *t, struct bio *bio) 127 + { 128 + iot_update_stats(t, bio); 129 + iot_check_for_pattern_switch(t); 130 + } 131 + 132 + /*----------------------------------------------------------------*/ 133 + 134 + 135 + /* 136 + * This queue is divided up into different levels. Allowing us to push 137 + * entries to the back of any of the levels. Think of it as a partially 138 + * sorted queue. 139 + */ 140 + #define NR_QUEUE_LEVELS 16u 141 + 142 + struct queue { 143 + struct list_head qs[NR_QUEUE_LEVELS]; 144 + }; 145 + 146 + static void queue_init(struct queue *q) 147 + { 148 + unsigned i; 149 + 150 + for (i = 0; i < NR_QUEUE_LEVELS; i++) 151 + INIT_LIST_HEAD(q->qs + i); 152 + } 153 + 154 + /* 155 + * Insert an entry to the back of the given level. 156 + */ 157 + static void queue_push(struct queue *q, unsigned level, struct list_head *elt) 158 + { 159 + list_add_tail(elt, q->qs + level); 160 + } 161 + 162 + static void queue_remove(struct list_head *elt) 163 + { 164 + list_del(elt); 165 + } 166 + 167 + /* 168 + * Shifts all regions down one level. This has no effect on the order of 169 + * the queue. 
170 + */ 171 + static void queue_shift_down(struct queue *q) 172 + { 173 + unsigned level; 174 + 175 + for (level = 1; level < NR_QUEUE_LEVELS; level++) 176 + list_splice_init(q->qs + level, q->qs + level - 1); 177 + } 178 + 179 + /* 180 + * Gives us the oldest entry of the lowest popoulated level. If the first 181 + * level is emptied then we shift down one level. 182 + */ 183 + static struct list_head *queue_pop(struct queue *q) 184 + { 185 + unsigned level; 186 + struct list_head *r; 187 + 188 + for (level = 0; level < NR_QUEUE_LEVELS; level++) 189 + if (!list_empty(q->qs + level)) { 190 + r = q->qs[level].next; 191 + list_del(r); 192 + 193 + /* have we just emptied the bottom level? */ 194 + if (level == 0 && list_empty(q->qs)) 195 + queue_shift_down(q); 196 + 197 + return r; 198 + } 199 + 200 + return NULL; 201 + } 202 + 203 + static struct list_head *list_pop(struct list_head *lh) 204 + { 205 + struct list_head *r = lh->next; 206 + 207 + BUG_ON(!r); 208 + list_del_init(r); 209 + 210 + return r; 211 + } 212 + 213 + /*----------------------------------------------------------------*/ 214 + 215 + /* 216 + * Describes a cache entry. Used in both the cache and the pre_cache. 217 + */ 218 + struct entry { 219 + struct hlist_node hlist; 220 + struct list_head list; 221 + dm_oblock_t oblock; 222 + dm_cblock_t cblock; /* valid iff in_cache */ 223 + 224 + /* 225 + * FIXME: pack these better 226 + */ 227 + bool in_cache:1; 228 + unsigned hit_count; 229 + unsigned generation; 230 + unsigned tick; 231 + }; 232 + 233 + struct mq_policy { 234 + struct dm_cache_policy policy; 235 + 236 + /* protects everything */ 237 + struct mutex lock; 238 + dm_cblock_t cache_size; 239 + struct io_tracker tracker; 240 + 241 + /* 242 + * We maintain two queues of entries. The cache proper contains 243 + * the currently active mappings. Whereas the pre_cache tracks 244 + * blocks that are being hit frequently and potential candidates 245 + * for promotion to the cache. 
246 + */ 247 + struct queue pre_cache; 248 + struct queue cache; 249 + 250 + /* 251 + * Keeps track of time, incremented by the core. We use this to 252 + * avoid attributing multiple hits within the same tick. 253 + * 254 + * Access to tick_protected should be done with the spin lock held. 255 + * It's copied to tick at the start of the map function (within the 256 + * mutex). 257 + */ 258 + spinlock_t tick_lock; 259 + unsigned tick_protected; 260 + unsigned tick; 261 + 262 + /* 263 + * A count of the number of times the map function has been called 264 + * and found an entry in the pre_cache or cache. Currently used to 265 + * calculate the generation. 266 + */ 267 + unsigned hit_count; 268 + 269 + /* 270 + * A generation is a longish period that is used to trigger some 271 + * book keeping effects. eg, decrementing hit counts on entries. 272 + * This is needed to allow the cache to evolve as io patterns 273 + * change. 274 + */ 275 + unsigned generation; 276 + unsigned generation_period; /* in lookups (will probably change) */ 277 + 278 + /* 279 + * Entries in the pre_cache whose hit count passes the promotion 280 + * threshold move to the cache proper. Working out the correct 281 + * value for the promotion_threshold is crucial to this policy. 282 + */ 283 + unsigned promote_threshold; 284 + 285 + /* 286 + * We need cache_size entries for the cache, and choose to have 287 + * cache_size entries for the pre_cache too. One motivation for 288 + * using the same size is to make the hit counts directly 289 + * comparable between pre_cache and cache. 290 + */ 291 + unsigned nr_entries; 292 + unsigned nr_entries_allocated; 293 + struct list_head free; 294 + 295 + /* 296 + * Cache blocks may be unallocated. We store this info in a 297 + * bitset. 
298 + */ 299 + unsigned long *allocation_bitset; 300 + unsigned nr_cblocks_allocated; 301 + unsigned find_free_nr_words; 302 + unsigned find_free_last_word; 303 + 304 + /* 305 + * The hash table allows us to quickly find an entry by origin 306 + * block. Both pre_cache and cache entries are in here. 307 + */ 308 + unsigned nr_buckets; 309 + dm_block_t hash_bits; 310 + struct hlist_head *table; 311 + }; 312 + 313 + /*----------------------------------------------------------------*/ 314 + /* Free/alloc mq cache entry structures. */ 315 + static void takeout_queue(struct list_head *lh, struct queue *q) 316 + { 317 + unsigned level; 318 + 319 + for (level = 0; level < NR_QUEUE_LEVELS; level++) 320 + list_splice(q->qs + level, lh); 321 + } 322 + 323 + static void free_entries(struct mq_policy *mq) 324 + { 325 + struct entry *e, *tmp; 326 + 327 + takeout_queue(&mq->free, &mq->pre_cache); 328 + takeout_queue(&mq->free, &mq->cache); 329 + 330 + list_for_each_entry_safe(e, tmp, &mq->free, list) 331 + kmem_cache_free(mq_entry_cache, e); 332 + } 333 + 334 + static int alloc_entries(struct mq_policy *mq, unsigned elts) 335 + { 336 + unsigned u = mq->nr_entries; 337 + 338 + INIT_LIST_HEAD(&mq->free); 339 + mq->nr_entries_allocated = 0; 340 + 341 + while (u--) { 342 + struct entry *e = kmem_cache_zalloc(mq_entry_cache, GFP_KERNEL); 343 + 344 + if (!e) { 345 + free_entries(mq); 346 + return -ENOMEM; 347 + } 348 + 349 + 350 + list_add(&e->list, &mq->free); 351 + } 352 + 353 + return 0; 354 + } 355 + 356 + /*----------------------------------------------------------------*/ 357 + 358 + /* 359 + * Simple hash table implementation. Should replace with the standard hash 360 + * table that's making its way upstream. 
361 + */ 362 + static void hash_insert(struct mq_policy *mq, struct entry *e) 363 + { 364 + unsigned h = hash_64(from_oblock(e->oblock), mq->hash_bits); 365 + 366 + hlist_add_head(&e->hlist, mq->table + h); 367 + } 368 + 369 + static struct entry *hash_lookup(struct mq_policy *mq, dm_oblock_t oblock) 370 + { 371 + unsigned h = hash_64(from_oblock(oblock), mq->hash_bits); 372 + struct hlist_head *bucket = mq->table + h; 373 + struct entry *e; 374 + 375 + hlist_for_each_entry(e, bucket, hlist) 376 + if (e->oblock == oblock) { 377 + hlist_del(&e->hlist); 378 + hlist_add_head(&e->hlist, bucket); 379 + return e; 380 + } 381 + 382 + return NULL; 383 + } 384 + 385 + static void hash_remove(struct entry *e) 386 + { 387 + hlist_del(&e->hlist); 388 + } 389 + 390 + /*----------------------------------------------------------------*/ 391 + 392 + /* 393 + * Allocates a new entry structure. The memory is allocated in one lump, 394 + * so we just handing it out here. Returns NULL if all entries have 395 + * already been allocated. Cannot fail otherwise. 396 + */ 397 + static struct entry *alloc_entry(struct mq_policy *mq) 398 + { 399 + struct entry *e; 400 + 401 + if (mq->nr_entries_allocated >= mq->nr_entries) { 402 + BUG_ON(!list_empty(&mq->free)); 403 + return NULL; 404 + } 405 + 406 + e = list_entry(list_pop(&mq->free), struct entry, list); 407 + INIT_LIST_HEAD(&e->list); 408 + INIT_HLIST_NODE(&e->hlist); 409 + 410 + mq->nr_entries_allocated++; 411 + return e; 412 + } 413 + 414 + /*----------------------------------------------------------------*/ 415 + 416 + /* 417 + * Mark cache blocks allocated or not in the bitset. 
418 + */ 419 + static void alloc_cblock(struct mq_policy *mq, dm_cblock_t cblock) 420 + { 421 + BUG_ON(from_cblock(cblock) > from_cblock(mq->cache_size)); 422 + BUG_ON(test_bit(from_cblock(cblock), mq->allocation_bitset)); 423 + 424 + set_bit(from_cblock(cblock), mq->allocation_bitset); 425 + mq->nr_cblocks_allocated++; 426 + } 427 + 428 + static void free_cblock(struct mq_policy *mq, dm_cblock_t cblock) 429 + { 430 + BUG_ON(from_cblock(cblock) > from_cblock(mq->cache_size)); 431 + BUG_ON(!test_bit(from_cblock(cblock), mq->allocation_bitset)); 432 + 433 + clear_bit(from_cblock(cblock), mq->allocation_bitset); 434 + mq->nr_cblocks_allocated--; 435 + } 436 + 437 + static bool any_free_cblocks(struct mq_policy *mq) 438 + { 439 + return mq->nr_cblocks_allocated < from_cblock(mq->cache_size); 440 + } 441 + 442 + /* 443 + * Fills result out with a cache block that isn't in use, or return 444 + * -ENOSPC. This does _not_ mark the cblock as allocated, the caller is 445 + * reponsible for that. 446 + */ 447 + static int __find_free_cblock(struct mq_policy *mq, unsigned begin, unsigned end, 448 + dm_cblock_t *result, unsigned *last_word) 449 + { 450 + int r = -ENOSPC; 451 + unsigned w; 452 + 453 + for (w = begin; w < end; w++) { 454 + /* 455 + * ffz is undefined if no zero exists 456 + */ 457 + if (mq->allocation_bitset[w] != ~0UL) { 458 + *last_word = w; 459 + *result = to_cblock((w * BITS_PER_LONG) + ffz(mq->allocation_bitset[w])); 460 + if (from_cblock(*result) < from_cblock(mq->cache_size)) 461 + r = 0; 462 + 463 + break; 464 + } 465 + } 466 + 467 + return r; 468 + } 469 + 470 + static int find_free_cblock(struct mq_policy *mq, dm_cblock_t *result) 471 + { 472 + int r; 473 + 474 + if (!any_free_cblocks(mq)) 475 + return -ENOSPC; 476 + 477 + r = __find_free_cblock(mq, mq->find_free_last_word, mq->find_free_nr_words, result, &mq->find_free_last_word); 478 + if (r == -ENOSPC && mq->find_free_last_word) 479 + r = __find_free_cblock(mq, 0, mq->find_free_last_word, result, 
&mq->find_free_last_word); 480 + 481 + return r; 482 + } 483 + 484 + /*----------------------------------------------------------------*/ 485 + 486 + /* 487 + * Now we get to the meat of the policy. This section deals with deciding 488 + * when to add entries to the pre_cache and cache, and when to move entries 489 + * between them. 490 + */ 491 + 492 + /* 493 + * The queue level is based on the log2 of the hit count. 494 + */ 495 + static unsigned queue_level(struct entry *e) 496 + { 497 + return min((unsigned) ilog2(e->hit_count), NR_QUEUE_LEVELS - 1u); 498 + } 499 + 500 + /* 501 + * Inserts the entry into the pre_cache or the cache. Ensures the cache 502 + * block is marked as allocated if necessary. Inserts into the hash table. Sets the 503 + * tick which records when the entry was last moved about. 504 + */ 505 + static void push(struct mq_policy *mq, struct entry *e) 506 + { 507 + e->tick = mq->tick; 508 + hash_insert(mq, e); 509 + 510 + if (e->in_cache) { 511 + alloc_cblock(mq, e->cblock); 512 + queue_push(&mq->cache, queue_level(e), &e->list); 513 + } else 514 + queue_push(&mq->pre_cache, queue_level(e), &e->list); 515 + } 516 + 517 + /* 518 + * Removes an entry from pre_cache or cache. Removes from the hash table. 519 + * Frees off the cache block if necessary. 520 + */ 521 + static void del(struct mq_policy *mq, struct entry *e) 522 + { 523 + queue_remove(&e->list); 524 + hash_remove(e); 525 + if (e->in_cache) 526 + free_cblock(mq, e->cblock); 527 + } 528 + 529 + /* 530 + * Like del, except it removes the first entry in the queue (ie. the least 531 + * recently used). 532 + */ 533 + static struct entry *pop(struct mq_policy *mq, struct queue *q) 534 + { 535 + struct entry *e = container_of(queue_pop(q), struct entry, list); 536 + 537 + if (e) { 538 + hash_remove(e); 539 + 540 + if (e->in_cache) 541 + free_cblock(mq, e->cblock); 542 + } 543 + 544 + return e; 545 + } 546 + 547 + /* 548 + * Has this entry already been updated? 
549 + */ 550 + static bool updated_this_tick(struct mq_policy *mq, struct entry *e) 551 + { 552 + return mq->tick == e->tick; 553 + } 554 + 555 + /* 556 + * The promotion threshold is adjusted every generation. As are the counts 557 + * of the entries. 558 + * 559 + * At the moment the threshold is taken by averaging the hit counts of some 560 + * of the entries in the cache (the first 20 entries of the first level). 561 + * 562 + * We can be much cleverer than this though. For example, each promotion 563 + * could bump up the threshold helping to prevent churn. Much more to do 564 + * here. 565 + */ 566 + 567 + #define MAX_TO_AVERAGE 20 568 + 569 + static void check_generation(struct mq_policy *mq) 570 + { 571 + unsigned total = 0, nr = 0, count = 0, level; 572 + struct list_head *head; 573 + struct entry *e; 574 + 575 + if ((mq->hit_count >= mq->generation_period) && 576 + (mq->nr_cblocks_allocated == from_cblock(mq->cache_size))) { 577 + 578 + mq->hit_count = 0; 579 + mq->generation++; 580 + 581 + for (level = 0; level < NR_QUEUE_LEVELS && count < MAX_TO_AVERAGE; level++) { 582 + head = mq->cache.qs + level; 583 + list_for_each_entry(e, head, list) { 584 + nr++; 585 + total += e->hit_count; 586 + 587 + if (++count >= MAX_TO_AVERAGE) 588 + break; 589 + } 590 + } 591 + 592 + mq->promote_threshold = nr ? total / nr : 1; 593 + if (mq->promote_threshold * nr < total) 594 + mq->promote_threshold++; 595 + } 596 + } 597 + 598 + /* 599 + * Whenever we use an entry we bump up its hit counter, and push it to the 600 + * back of its current level. 601 + */ 602 + static void requeue_and_update_tick(struct mq_policy *mq, struct entry *e) 603 + { 604 + if (updated_this_tick(mq, e)) 605 + return; 606 + 607 + e->hit_count++; 608 + mq->hit_count++; 609 + check_generation(mq); 610 + 611 + /* generation adjustment, to stop the counts increasing forever. */ 612 + /* FIXME: divide? 
*/ 613 + /* e->hit_count -= min(e->hit_count - 1, mq->generation - e->generation); */ 614 + e->generation = mq->generation; 615 + 616 + del(mq, e); 617 + push(mq, e); 618 + } 619 + 620 + /* 621 + * Demote the least recently used entry from the cache to the pre_cache. 622 + * Returns the new cache entry to use, and the old origin block it was 623 + * mapped to. 624 + * 625 + * We drop the hit count on the demoted entry back to 1 to stop it bouncing 626 + * straight back into the cache if it's subsequently hit. There are 627 + * various options here, and more experimentation would be good: 628 + * 629 + * - just forget about the demoted entry completely (ie. don't insert it 630 + into the pre_cache). 631 + * - divide the hit count rather than setting to some hard coded value. 632 + * - set the hit count to a hard coded value other than 1, eg, is it better 633 + * if it goes in at level 2? 634 + */ 635 + static dm_cblock_t demote_cblock(struct mq_policy *mq, dm_oblock_t *oblock) 636 + { 637 + dm_cblock_t result; 638 + struct entry *demoted = pop(mq, &mq->cache); 639 + 640 + BUG_ON(!demoted); 641 + result = demoted->cblock; 642 + *oblock = demoted->oblock; 643 + demoted->in_cache = false; 644 + demoted->hit_count = 1; 645 + push(mq, demoted); 646 + 647 + return result; 648 + } 649 + 650 + /* 651 + * We modify the basic promotion_threshold depending on the specific io. 652 + * 653 + * If the origin block has been discarded then there's no cost to copy it 654 + * to the cache. 655 + * 656 + * We bias towards reads, since they can be demoted at no cost if they 657 + * haven't been dirtied. 
658 + */ 659 + #define DISCARDED_PROMOTE_THRESHOLD 1 660 + #define READ_PROMOTE_THRESHOLD 4 661 + #define WRITE_PROMOTE_THRESHOLD 8 662 + 663 + static unsigned adjusted_promote_threshold(struct mq_policy *mq, 664 + bool discarded_oblock, int data_dir) 665 + { 666 + if (discarded_oblock && any_free_cblocks(mq) && data_dir == WRITE) 667 + /* 668 + * We don't need to do any copying at all, so give this a 669 + * very low threshold. In practice this only triggers 670 + * during initial population after a format. 671 + */ 672 + return DISCARDED_PROMOTE_THRESHOLD; 673 + 674 + return data_dir == READ ? 675 + (mq->promote_threshold + READ_PROMOTE_THRESHOLD) : 676 + (mq->promote_threshold + WRITE_PROMOTE_THRESHOLD); 677 + } 678 + 679 + static bool should_promote(struct mq_policy *mq, struct entry *e, 680 + bool discarded_oblock, int data_dir) 681 + { 682 + return e->hit_count >= 683 + adjusted_promote_threshold(mq, discarded_oblock, data_dir); 684 + } 685 + 686 + static int cache_entry_found(struct mq_policy *mq, 687 + struct entry *e, 688 + struct policy_result *result) 689 + { 690 + requeue_and_update_tick(mq, e); 691 + 692 + if (e->in_cache) { 693 + result->op = POLICY_HIT; 694 + result->cblock = e->cblock; 695 + } 696 + 697 + return 0; 698 + } 699 + 700 + /* 701 + * Moves an entry from the pre_cache to the cache. The main work is 702 + * finding which cache block to use. 
703 + */ 704 + static int pre_cache_to_cache(struct mq_policy *mq, struct entry *e, 705 + struct policy_result *result) 706 + { 707 + dm_cblock_t cblock; 708 + 709 + if (find_free_cblock(mq, &cblock) == -ENOSPC) { 710 + result->op = POLICY_REPLACE; 711 + cblock = demote_cblock(mq, &result->old_oblock); 712 + } else 713 + result->op = POLICY_NEW; 714 + 715 + result->cblock = e->cblock = cblock; 716 + 717 + del(mq, e); 718 + e->in_cache = true; 719 + push(mq, e); 720 + 721 + return 0; 722 + } 723 + 724 + static int pre_cache_entry_found(struct mq_policy *mq, struct entry *e, 725 + bool can_migrate, bool discarded_oblock, 726 + int data_dir, struct policy_result *result) 727 + { 728 + int r = 0; 729 + bool updated = updated_this_tick(mq, e); 730 + 731 + requeue_and_update_tick(mq, e); 732 + 733 + if ((!discarded_oblock && updated) || 734 + !should_promote(mq, e, discarded_oblock, data_dir)) 735 + result->op = POLICY_MISS; 736 + else if (!can_migrate) 737 + r = -EWOULDBLOCK; 738 + else 739 + r = pre_cache_to_cache(mq, e, result); 740 + 741 + return r; 742 + } 743 + 744 + static void insert_in_pre_cache(struct mq_policy *mq, 745 + dm_oblock_t oblock) 746 + { 747 + struct entry *e = alloc_entry(mq); 748 + 749 + if (!e) 750 + /* 751 + * There's no spare entry structure, so we grab the least 752 + * used one from the pre_cache. 
753 + */ 754 + e = pop(mq, &mq->pre_cache); 755 + 756 + if (unlikely(!e)) { 757 + DMWARN("couldn't pop from pre cache"); 758 + return; 759 + } 760 + 761 + e->in_cache = false; 762 + e->oblock = oblock; 763 + e->hit_count = 1; 764 + e->generation = mq->generation; 765 + push(mq, e); 766 + } 767 + 768 + static void insert_in_cache(struct mq_policy *mq, dm_oblock_t oblock, 769 + struct policy_result *result) 770 + { 771 + struct entry *e; 772 + dm_cblock_t cblock; 773 + 774 + if (find_free_cblock(mq, &cblock) == -ENOSPC) { 775 + result->op = POLICY_MISS; 776 + insert_in_pre_cache(mq, oblock); 777 + return; 778 + } 779 + 780 + e = alloc_entry(mq); 781 + if (unlikely(!e)) { 782 + result->op = POLICY_MISS; 783 + return; 784 + } 785 + 786 + e->oblock = oblock; 787 + e->cblock = cblock; 788 + e->in_cache = true; 789 + e->hit_count = 1; 790 + e->generation = mq->generation; 791 + push(mq, e); 792 + 793 + result->op = POLICY_NEW; 794 + result->cblock = e->cblock; 795 + } 796 + 797 + static int no_entry_found(struct mq_policy *mq, dm_oblock_t oblock, 798 + bool can_migrate, bool discarded_oblock, 799 + int data_dir, struct policy_result *result) 800 + { 801 + if (adjusted_promote_threshold(mq, discarded_oblock, data_dir) == 1) { 802 + if (can_migrate) 803 + insert_in_cache(mq, oblock, result); 804 + else 805 + return -EWOULDBLOCK; 806 + } else { 807 + insert_in_pre_cache(mq, oblock); 808 + result->op = POLICY_MISS; 809 + } 810 + 811 + return 0; 812 + } 813 + 814 + /* 815 + * Looks the oblock up in the hash table, then decides whether to put in 816 + * pre_cache, or cache etc. 
817 + */ 818 + static int map(struct mq_policy *mq, dm_oblock_t oblock, 819 + bool can_migrate, bool discarded_oblock, 820 + int data_dir, struct policy_result *result) 821 + { 822 + int r = 0; 823 + struct entry *e = hash_lookup(mq, oblock); 824 + 825 + if (e && e->in_cache) 826 + r = cache_entry_found(mq, e, result); 827 + else if (iot_pattern(&mq->tracker) == PATTERN_SEQUENTIAL) 828 + result->op = POLICY_MISS; 829 + else if (e) 830 + r = pre_cache_entry_found(mq, e, can_migrate, discarded_oblock, 831 + data_dir, result); 832 + else 833 + r = no_entry_found(mq, oblock, can_migrate, discarded_oblock, 834 + data_dir, result); 835 + 836 + if (r == -EWOULDBLOCK) 837 + result->op = POLICY_MISS; 838 + 839 + return r; 840 + } 841 + 842 + /*----------------------------------------------------------------*/ 843 + 844 + /* 845 + * Public interface, via the policy struct. See dm-cache-policy.h for a 846 + * description of these. 847 + */ 848 + 849 + static struct mq_policy *to_mq_policy(struct dm_cache_policy *p) 850 + { 851 + return container_of(p, struct mq_policy, policy); 852 + } 853 + 854 + static void mq_destroy(struct dm_cache_policy *p) 855 + { 856 + struct mq_policy *mq = to_mq_policy(p); 857 + 858 + free_bitset(mq->allocation_bitset); 859 + kfree(mq->table); 860 + free_entries(mq); 861 + kfree(mq); 862 + } 863 + 864 + static void copy_tick(struct mq_policy *mq) 865 + { 866 + unsigned long flags; 867 + 868 + spin_lock_irqsave(&mq->tick_lock, flags); 869 + mq->tick = mq->tick_protected; 870 + spin_unlock_irqrestore(&mq->tick_lock, flags); 871 + } 872 + 873 + static int mq_map(struct dm_cache_policy *p, dm_oblock_t oblock, 874 + bool can_block, bool can_migrate, bool discarded_oblock, 875 + struct bio *bio, struct policy_result *result) 876 + { 877 + int r; 878 + struct mq_policy *mq = to_mq_policy(p); 879 + 880 + result->op = POLICY_MISS; 881 + 882 + if (can_block) 883 + mutex_lock(&mq->lock); 884 + else if (!mutex_trylock(&mq->lock)) 885 + return -EWOULDBLOCK; 886 
+ 887 + copy_tick(mq); 888 + 889 + iot_examine_bio(&mq->tracker, bio); 890 + r = map(mq, oblock, can_migrate, discarded_oblock, 891 + bio_data_dir(bio), result); 892 + 893 + mutex_unlock(&mq->lock); 894 + 895 + return r; 896 + } 897 + 898 + static int mq_lookup(struct dm_cache_policy *p, dm_oblock_t oblock, dm_cblock_t *cblock) 899 + { 900 + int r; 901 + struct mq_policy *mq = to_mq_policy(p); 902 + struct entry *e; 903 + 904 + if (!mutex_trylock(&mq->lock)) 905 + return -EWOULDBLOCK; 906 + 907 + e = hash_lookup(mq, oblock); 908 + if (e && e->in_cache) { 909 + *cblock = e->cblock; 910 + r = 0; 911 + } else 912 + r = -ENOENT; 913 + 914 + mutex_unlock(&mq->lock); 915 + 916 + return r; 917 + } 918 + 919 + static int mq_load_mapping(struct dm_cache_policy *p, 920 + dm_oblock_t oblock, dm_cblock_t cblock, 921 + uint32_t hint, bool hint_valid) 922 + { 923 + struct mq_policy *mq = to_mq_policy(p); 924 + struct entry *e; 925 + 926 + e = alloc_entry(mq); 927 + if (!e) 928 + return -ENOMEM; 929 + 930 + e->cblock = cblock; 931 + e->oblock = oblock; 932 + e->in_cache = true; 933 + e->hit_count = hint_valid ? 
hint : 1; 934 + e->generation = mq->generation; 935 + push(mq, e); 936 + 937 + return 0; 938 + } 939 + 940 + static int mq_walk_mappings(struct dm_cache_policy *p, policy_walk_fn fn, 941 + void *context) 942 + { 943 + struct mq_policy *mq = to_mq_policy(p); 944 + int r = 0; 945 + struct entry *e; 946 + unsigned level; 947 + 948 + mutex_lock(&mq->lock); 949 + 950 + for (level = 0; level < NR_QUEUE_LEVELS; level++) 951 + list_for_each_entry(e, &mq->cache.qs[level], list) { 952 + r = fn(context, e->cblock, e->oblock, e->hit_count); 953 + if (r) 954 + goto out; 955 + } 956 + 957 + out: 958 + mutex_unlock(&mq->lock); 959 + 960 + return r; 961 + } 962 + 963 + static void remove_mapping(struct mq_policy *mq, dm_oblock_t oblock) 964 + { 965 + struct entry *e = hash_lookup(mq, oblock); 966 + 967 + BUG_ON(!e || !e->in_cache); 968 + 969 + del(mq, e); 970 + e->in_cache = false; 971 + push(mq, e); 972 + } 973 + 974 + static void mq_remove_mapping(struct dm_cache_policy *p, dm_oblock_t oblock) 975 + { 976 + struct mq_policy *mq = to_mq_policy(p); 977 + 978 + mutex_lock(&mq->lock); 979 + remove_mapping(mq, oblock); 980 + mutex_unlock(&mq->lock); 981 + } 982 + 983 + static void force_mapping(struct mq_policy *mq, 984 + dm_oblock_t current_oblock, dm_oblock_t new_oblock) 985 + { 986 + struct entry *e = hash_lookup(mq, current_oblock); 987 + 988 + BUG_ON(!e || !e->in_cache); 989 + 990 + del(mq, e); 991 + e->oblock = new_oblock; 992 + push(mq, e); 993 + } 994 + 995 + static void mq_force_mapping(struct dm_cache_policy *p, 996 + dm_oblock_t current_oblock, dm_oblock_t new_oblock) 997 + { 998 + struct mq_policy *mq = to_mq_policy(p); 999 + 1000 + mutex_lock(&mq->lock); 1001 + force_mapping(mq, current_oblock, new_oblock); 1002 + mutex_unlock(&mq->lock); 1003 + } 1004 + 1005 + static dm_cblock_t mq_residency(struct dm_cache_policy *p) 1006 + { 1007 + struct mq_policy *mq = to_mq_policy(p); 1008 + 1009 + /* FIXME: lock mutex, not sure we can block here */ 1010 + return 
to_cblock(mq->nr_cblocks_allocated); 1011 + } 1012 + 1013 + static void mq_tick(struct dm_cache_policy *p) 1014 + { 1015 + struct mq_policy *mq = to_mq_policy(p); 1016 + unsigned long flags; 1017 + 1018 + spin_lock_irqsave(&mq->tick_lock, flags); 1019 + mq->tick_protected++; 1020 + spin_unlock_irqrestore(&mq->tick_lock, flags); 1021 + } 1022 + 1023 + static int mq_set_config_value(struct dm_cache_policy *p, 1024 + const char *key, const char *value) 1025 + { 1026 + struct mq_policy *mq = to_mq_policy(p); 1027 + enum io_pattern pattern; 1028 + unsigned long tmp; 1029 + 1030 + if (!strcasecmp(key, "random_threshold")) 1031 + pattern = PATTERN_RANDOM; 1032 + else if (!strcasecmp(key, "sequential_threshold")) 1033 + pattern = PATTERN_SEQUENTIAL; 1034 + else 1035 + return -EINVAL; 1036 + 1037 + if (kstrtoul(value, 10, &tmp)) 1038 + return -EINVAL; 1039 + 1040 + mq->tracker.thresholds[pattern] = tmp; 1041 + 1042 + return 0; 1043 + } 1044 + 1045 + static int mq_emit_config_values(struct dm_cache_policy *p, char *result, unsigned maxlen) 1046 + { 1047 + ssize_t sz = 0; 1048 + struct mq_policy *mq = to_mq_policy(p); 1049 + 1050 + DMEMIT("4 random_threshold %u sequential_threshold %u", 1051 + mq->tracker.thresholds[PATTERN_RANDOM], 1052 + mq->tracker.thresholds[PATTERN_SEQUENTIAL]); 1053 + 1054 + return 0; 1055 + } 1056 + 1057 + /* Init the policy plugin interface function pointers. 
*/ 1058 + static void init_policy_functions(struct mq_policy *mq) 1059 + { 1060 + mq->policy.destroy = mq_destroy; 1061 + mq->policy.map = mq_map; 1062 + mq->policy.lookup = mq_lookup; 1063 + mq->policy.load_mapping = mq_load_mapping; 1064 + mq->policy.walk_mappings = mq_walk_mappings; 1065 + mq->policy.remove_mapping = mq_remove_mapping; 1066 + mq->policy.writeback_work = NULL; 1067 + mq->policy.force_mapping = mq_force_mapping; 1068 + mq->policy.residency = mq_residency; 1069 + mq->policy.tick = mq_tick; 1070 + mq->policy.emit_config_values = mq_emit_config_values; 1071 + mq->policy.set_config_value = mq_set_config_value; 1072 + } 1073 + 1074 + static struct dm_cache_policy *mq_create(dm_cblock_t cache_size, 1075 + sector_t origin_size, 1076 + sector_t cache_block_size) 1077 + { 1078 + int r; 1079 + struct mq_policy *mq = kzalloc(sizeof(*mq), GFP_KERNEL); 1080 + 1081 + if (!mq) 1082 + return NULL; 1083 + 1084 + init_policy_functions(mq); 1085 + iot_init(&mq->tracker, SEQUENTIAL_THRESHOLD_DEFAULT, RANDOM_THRESHOLD_DEFAULT); 1086 + 1087 + mq->cache_size = cache_size; 1088 + mq->tick_protected = 0; 1089 + mq->tick = 0; 1090 + mq->hit_count = 0; 1091 + mq->generation = 0; 1092 + mq->promote_threshold = 0; 1093 + mutex_init(&mq->lock); 1094 + spin_lock_init(&mq->tick_lock); 1095 + mq->find_free_nr_words = dm_div_up(from_cblock(mq->cache_size), BITS_PER_LONG); 1096 + mq->find_free_last_word = 0; 1097 + 1098 + queue_init(&mq->pre_cache); 1099 + queue_init(&mq->cache); 1100 + mq->generation_period = max((unsigned) from_cblock(cache_size), 1024U); 1101 + 1102 + mq->nr_entries = 2 * from_cblock(cache_size); 1103 + r = alloc_entries(mq, mq->nr_entries); 1104 + if (r) 1105 + goto bad_cache_alloc; 1106 + 1107 + mq->nr_entries_allocated = 0; 1108 + mq->nr_cblocks_allocated = 0; 1109 + 1110 + mq->nr_buckets = next_power(from_cblock(cache_size) / 2, 16); 1111 + mq->hash_bits = ffs(mq->nr_buckets) - 1; 1112 + mq->table = kzalloc(sizeof(*mq->table) * mq->nr_buckets, GFP_KERNEL); 
1113 + if (!mq->table) 1114 + goto bad_alloc_table; 1115 + 1116 + mq->allocation_bitset = alloc_bitset(from_cblock(cache_size)); 1117 + if (!mq->allocation_bitset) 1118 + goto bad_alloc_bitset; 1119 + 1120 + return &mq->policy; 1121 + 1122 + bad_alloc_bitset: 1123 + kfree(mq->table); 1124 + bad_alloc_table: 1125 + free_entries(mq); 1126 + bad_cache_alloc: 1127 + kfree(mq); 1128 + 1129 + return NULL; 1130 + } 1131 + 1132 + /*----------------------------------------------------------------*/ 1133 + 1134 + static struct dm_cache_policy_type mq_policy_type = { 1135 + .name = "mq", 1136 + .hint_size = 4, 1137 + .owner = THIS_MODULE, 1138 + .create = mq_create 1139 + }; 1140 + 1141 + static struct dm_cache_policy_type default_policy_type = { 1142 + .name = "default", 1143 + .hint_size = 4, 1144 + .owner = THIS_MODULE, 1145 + .create = mq_create 1146 + }; 1147 + 1148 + static int __init mq_init(void) 1149 + { 1150 + int r; 1151 + 1152 + mq_entry_cache = kmem_cache_create("dm_mq_policy_cache_entry", 1153 + sizeof(struct entry), 1154 + __alignof__(struct entry), 1155 + 0, NULL); 1156 + if (!mq_entry_cache) 1157 + goto bad; 1158 + 1159 + r = dm_cache_policy_register(&mq_policy_type); 1160 + if (r) { 1161 + DMERR("register failed %d", r); 1162 + goto bad_register_mq; 1163 + } 1164 + 1165 + r = dm_cache_policy_register(&default_policy_type); 1166 + if (!r) { 1167 + DMINFO("version " MQ_VERSION " loaded"); 1168 + return 0; 1169 + } 1170 + 1171 + DMERR("register failed (as default) %d", r); 1172 + 1173 + dm_cache_policy_unregister(&mq_policy_type); 1174 + bad_register_mq: 1175 + kmem_cache_destroy(mq_entry_cache); 1176 + bad: 1177 + return -ENOMEM; 1178 + } 1179 + 1180 + static void __exit mq_exit(void) 1181 + { 1182 + dm_cache_policy_unregister(&mq_policy_type); 1183 + dm_cache_policy_unregister(&default_policy_type); 1184 + 1185 + kmem_cache_destroy(mq_entry_cache); 1186 + } 1187 + 1188 + module_init(mq_init); 1189 + module_exit(mq_exit); 1190 + 1191 + MODULE_AUTHOR("Joe 
Thornber <dm-devel@redhat.com>"); 1192 + MODULE_LICENSE("GPL"); 1193 + MODULE_DESCRIPTION("mq cache policy"); 1194 + 1195 + MODULE_ALIAS("dm-cache-default");
+161
drivers/md/dm-cache-policy.c
··· 1 + /* 2 + * Copyright (C) 2012 Red Hat. All rights reserved. 3 + * 4 + * This file is released under the GPL. 5 + */ 6 + 7 + #include "dm-cache-policy-internal.h" 8 + #include "dm.h" 9 + 10 + #include <linux/module.h> 11 + #include <linux/slab.h> 12 + 13 + /*----------------------------------------------------------------*/ 14 + 15 + #define DM_MSG_PREFIX "cache-policy" 16 + 17 + static DEFINE_SPINLOCK(register_lock); 18 + static LIST_HEAD(register_list); 19 + 20 + static struct dm_cache_policy_type *__find_policy(const char *name) 21 + { 22 + struct dm_cache_policy_type *t; 23 + 24 + list_for_each_entry(t, &register_list, list) 25 + if (!strcmp(t->name, name)) 26 + return t; 27 + 28 + return NULL; 29 + } 30 + 31 + static struct dm_cache_policy_type *__get_policy_once(const char *name) 32 + { 33 + struct dm_cache_policy_type *t = __find_policy(name); 34 + 35 + if (t && !try_module_get(t->owner)) { 36 + DMWARN("couldn't get module %s", name); 37 + t = ERR_PTR(-EINVAL); 38 + } 39 + 40 + return t; 41 + } 42 + 43 + static struct dm_cache_policy_type *get_policy_once(const char *name) 44 + { 45 + struct dm_cache_policy_type *t; 46 + 47 + spin_lock(&register_lock); 48 + t = __get_policy_once(name); 49 + spin_unlock(&register_lock); 50 + 51 + return t; 52 + } 53 + 54 + static struct dm_cache_policy_type *get_policy(const char *name) 55 + { 56 + struct dm_cache_policy_type *t; 57 + 58 + t = get_policy_once(name); 59 + if (IS_ERR(t)) 60 + return NULL; 61 + 62 + if (t) 63 + return t; 64 + 65 + request_module("dm-cache-%s", name); 66 + 67 + t = get_policy_once(name); 68 + if (IS_ERR(t)) 69 + return NULL; 70 + 71 + return t; 72 + } 73 + 74 + static void put_policy(struct dm_cache_policy_type *t) 75 + { 76 + module_put(t->owner); 77 + } 78 + 79 + int dm_cache_policy_register(struct dm_cache_policy_type *type) 80 + { 81 + int r; 82 + 83 + /* One size fits all for now */ 84 + if (type->hint_size != 0 && type->hint_size != 4) { 85 + DMWARN("hint size must be 0 or 4 but %llu 
supplied.", (unsigned long long) type->hint_size); 86 + return -EINVAL; 87 + } 88 + 89 + spin_lock(&register_lock); 90 + if (__find_policy(type->name)) { 91 + DMWARN("attempt to register policy under duplicate name %s", type->name); 92 + r = -EINVAL; 93 + } else { 94 + list_add(&type->list, &register_list); 95 + r = 0; 96 + } 97 + spin_unlock(&register_lock); 98 + 99 + return r; 100 + } 101 + EXPORT_SYMBOL_GPL(dm_cache_policy_register); 102 + 103 + void dm_cache_policy_unregister(struct dm_cache_policy_type *type) 104 + { 105 + spin_lock(&register_lock); 106 + list_del_init(&type->list); 107 + spin_unlock(&register_lock); 108 + } 109 + EXPORT_SYMBOL_GPL(dm_cache_policy_unregister); 110 + 111 + struct dm_cache_policy *dm_cache_policy_create(const char *name, 112 + dm_cblock_t cache_size, 113 + sector_t origin_size, 114 + sector_t cache_block_size) 115 + { 116 + struct dm_cache_policy *p = NULL; 117 + struct dm_cache_policy_type *type; 118 + 119 + type = get_policy(name); 120 + if (!type) { 121 + DMWARN("unknown policy type"); 122 + return NULL; 123 + } 124 + 125 + p = type->create(cache_size, origin_size, cache_block_size); 126 + if (!p) { 127 + put_policy(type); 128 + return NULL; 129 + } 130 + p->private = type; 131 + 132 + return p; 133 + } 134 + EXPORT_SYMBOL_GPL(dm_cache_policy_create); 135 + 136 + void dm_cache_policy_destroy(struct dm_cache_policy *p) 137 + { 138 + struct dm_cache_policy_type *t = p->private; 139 + 140 + p->destroy(p); 141 + put_policy(t); 142 + } 143 + EXPORT_SYMBOL_GPL(dm_cache_policy_destroy); 144 + 145 + const char *dm_cache_policy_get_name(struct dm_cache_policy *p) 146 + { 147 + struct dm_cache_policy_type *t = p->private; 148 + 149 + return t->name; 150 + } 151 + EXPORT_SYMBOL_GPL(dm_cache_policy_get_name); 152 + 153 + size_t dm_cache_policy_get_hint_size(struct dm_cache_policy *p) 154 + { 155 + struct dm_cache_policy_type *t = p->private; 156 + 157 + return t->hint_size; 158 + } 159 + EXPORT_SYMBOL_GPL(dm_cache_policy_get_hint_size); 
160 + 161 + /*----------------------------------------------------------------*/
+228
drivers/md/dm-cache-policy.h
··· 1 + /* 2 + * Copyright (C) 2012 Red Hat. All rights reserved. 3 + * 4 + * This file is released under the GPL. 5 + */ 6 + 7 + #ifndef DM_CACHE_POLICY_H 8 + #define DM_CACHE_POLICY_H 9 + 10 + #include "dm-cache-block-types.h" 11 + 12 + #include <linux/device-mapper.h> 13 + 14 + /*----------------------------------------------------------------*/ 15 + 16 + /* FIXME: make it clear which methods are optional. Get debug policy to 17 + * double check this at start. 18 + */ 19 + 20 + /* 21 + * The cache policy makes the important decisions about which blocks get to 22 + * live on the faster cache device. 23 + * 24 + * When the core target has to remap a bio it calls the 'map' method of the 25 + * policy. This returns an instruction telling the core target what to do. 26 + * 27 + * POLICY_HIT: 28 + * That block is in the cache. Remap to the cache and carry on. 29 + * 30 + * POLICY_MISS: 31 + * This block is on the origin device. Remap and carry on. 32 + * 33 + * POLICY_NEW: 34 + * This block is currently on the origin device, but the policy wants to 35 + * move it. The core should: 36 + * 37 + * - hold any further io to this origin block 38 + * - copy the origin to the given cache block 39 + * - release all the held blocks 40 + * - remap the original block to the cache 41 + * 42 + * POLICY_REPLACE: 43 + * This block is currently on the origin device. The policy wants to 44 + * move it to the cache, with the added complication that the destination 45 + * cache block needs a writeback first. The core should: 46 + * 47 + * - hold any further io to this origin block 48 + * - hold any further io to the origin block that's being written back 49 + * - writeback 50 + * - copy new block to cache 51 + * - release held blocks 52 + * - remap bio to cache and reissue. 53 + * 54 + * Should the core run into trouble while processing a POLICY_NEW or 55 + * POLICY_REPLACE instruction it will roll back the policy's mapping using 56 + * remove_mapping() or force_mapping(). 
These methods must not fail. This 57 + * approach avoids having transactional semantics in the policy (ie, the 58 + * core informing the policy when a migration is complete), and hence makes 59 + * it easier to write new policies. 60 + * 61 + * In general policy methods should never block, except in the case of the 62 + * map function when can_migrate is set. So be careful to implement using 63 + * bounded, preallocated memory. 64 + */ 65 + enum policy_operation { 66 + POLICY_HIT, 67 + POLICY_MISS, 68 + POLICY_NEW, 69 + POLICY_REPLACE 70 + }; 71 + 72 + /* 73 + * This is the instruction passed back to the core target. 74 + */ 75 + struct policy_result { 76 + enum policy_operation op; 77 + dm_oblock_t old_oblock; /* POLICY_REPLACE */ 78 + dm_cblock_t cblock; /* POLICY_HIT, POLICY_NEW, POLICY_REPLACE */ 79 + }; 80 + 81 + typedef int (*policy_walk_fn)(void *context, dm_cblock_t cblock, 82 + dm_oblock_t oblock, uint32_t hint); 83 + 84 + /* 85 + * The cache policy object. Just a bunch of methods. It is envisaged that 86 + * this structure will be embedded in a bigger, policy specific structure 87 + * (ie. use container_of()). 88 + */ 89 + struct dm_cache_policy { 90 + 91 + /* 92 + * FIXME: make it clear which methods are optional, and which may 93 + * block. 94 + */ 95 + 96 + /* 97 + * Destroys this object. 98 + */ 99 + void (*destroy)(struct dm_cache_policy *p); 100 + 101 + /* 102 + * See large comment above. 103 + * 104 + * oblock - the origin block we're interested in. 105 + * 106 + * can_block - indicates whether the current thread is allowed to 107 + * block. -EWOULDBLOCK returned if it can't and would. 108 + * 109 + * can_migrate - gives permission for POLICY_NEW or POLICY_REPLACE 110 + * instructions. If denied and the policy would have 111 + * returned one of these instructions it should 112 + * return -EWOULDBLOCK. 
113 + * 114 + * discarded_oblock - indicates whether the whole origin block is 115 + * in a discarded state (FIXME: better to tell the 116 + * policy about this sooner, so it can recycle that 117 + * cache block if it wants.) 118 + * bio - the bio that triggered this call. 119 + * result - gets filled in with the instruction. 120 + * 121 + * May only return 0, or -EWOULDBLOCK (if !can_migrate) 122 + */ 123 + int (*map)(struct dm_cache_policy *p, dm_oblock_t oblock, 124 + bool can_block, bool can_migrate, bool discarded_oblock, 125 + struct bio *bio, struct policy_result *result); 126 + 127 + /* 128 + * Sometimes we want to see if a block is in the cache, without 129 + * triggering any update of stats. (ie. it's not a real hit). 130 + * 131 + * Must not block. 132 + * 133 + * Returns 1 iff in cache, 0 iff not, < 0 on error (-EWOULDBLOCK 134 + * would be typical). 135 + */ 136 + int (*lookup)(struct dm_cache_policy *p, dm_oblock_t oblock, dm_cblock_t *cblock); 137 + 138 + /* 139 + * oblock must be a mapped block. Must not block. 140 + */ 141 + void (*set_dirty)(struct dm_cache_policy *p, dm_oblock_t oblock); 142 + void (*clear_dirty)(struct dm_cache_policy *p, dm_oblock_t oblock); 143 + 144 + /* 145 + * Called when a cache target is first created. Used to load a 146 + * mapping from the metadata device into the policy. 147 + */ 148 + int (*load_mapping)(struct dm_cache_policy *p, dm_oblock_t oblock, 149 + dm_cblock_t cblock, uint32_t hint, bool hint_valid); 150 + 151 + int (*walk_mappings)(struct dm_cache_policy *p, policy_walk_fn fn, 152 + void *context); 153 + 154 + /* 155 + * Override functions used on the error paths of the core target. 156 + * They must succeed. 
157 + */ 158 + void (*remove_mapping)(struct dm_cache_policy *p, dm_oblock_t oblock); 159 + void (*force_mapping)(struct dm_cache_policy *p, dm_oblock_t current_oblock, 160 + dm_oblock_t new_oblock); 161 + 162 + int (*writeback_work)(struct dm_cache_policy *p, dm_oblock_t *oblock, dm_cblock_t *cblock); 163 + 164 + 165 + /* 166 + * How full is the cache? 167 + */ 168 + dm_cblock_t (*residency)(struct dm_cache_policy *p); 169 + 170 + /* 171 + * Because of where we sit in the block layer, we can be asked to 172 + * map a lot of little bios that are all in the same block (no 173 + * queue merging has occurred). To stop the policy being fooled by 174 + * these the core target sends regular tick() calls to the policy. 175 + * The policy should only count an entry as hit once per tick. 176 + */ 177 + void (*tick)(struct dm_cache_policy *p); 178 + 179 + /* 180 + * Configuration. 181 + */ 182 + int (*emit_config_values)(struct dm_cache_policy *p, 183 + char *result, unsigned maxlen); 184 + int (*set_config_value)(struct dm_cache_policy *p, 185 + const char *key, const char *value); 186 + 187 + /* 188 + * Book keeping ptr for the policy register, not for general use. 189 + */ 190 + void *private; 191 + }; 192 + 193 + /*----------------------------------------------------------------*/ 194 + 195 + /* 196 + * We maintain a little register of the different policy types. 197 + */ 198 + #define CACHE_POLICY_NAME_SIZE 16 199 + 200 + struct dm_cache_policy_type { 201 + /* For use by the register code only. */ 202 + struct list_head list; 203 + 204 + /* 205 + * Policy writers should fill in these fields. The name field is 206 + * what gets passed on the target line to select your policy. 207 + */ 208 + char name[CACHE_POLICY_NAME_SIZE]; 209 + 210 + /* 211 + * Policies may store a hint for each cache block. 212 + * Currently the size of this hint must be 0 or 4 bytes but we 213 + * expect to relax this in future. 
214 + */ 215 + size_t hint_size; 216 + 217 + struct module *owner; 218 + struct dm_cache_policy *(*create)(dm_cblock_t cache_size, 219 + sector_t origin_size, 220 + sector_t block_size); 221 + }; 222 + 223 + int dm_cache_policy_register(struct dm_cache_policy_type *type); 224 + void dm_cache_policy_unregister(struct dm_cache_policy_type *type); 225 + 226 + /*----------------------------------------------------------------*/ 227 + 228 + #endif /* DM_CACHE_POLICY_H */
+2584
drivers/md/dm-cache-target.c
··· 1 + /* 2 + * Copyright (C) 2012 Red Hat. All rights reserved. 3 + * 4 + * This file is released under the GPL. 5 + */ 6 + 7 + #include "dm.h" 8 + #include "dm-bio-prison.h" 9 + #include "dm-cache-metadata.h" 10 + 11 + #include <linux/dm-io.h> 12 + #include <linux/dm-kcopyd.h> 13 + #include <linux/init.h> 14 + #include <linux/mempool.h> 15 + #include <linux/module.h> 16 + #include <linux/slab.h> 17 + #include <linux/vmalloc.h> 18 + 19 + #define DM_MSG_PREFIX "cache" 20 + 21 + DECLARE_DM_KCOPYD_THROTTLE_WITH_MODULE_PARM(cache_copy_throttle, 22 + "A percentage of time allocated for copying to and/or from cache"); 23 + 24 + /*----------------------------------------------------------------*/ 25 + 26 + /* 27 + * Glossary: 28 + * 29 + * oblock: index of an origin block 30 + * cblock: index of a cache block 31 + * promotion: movement of a block from origin to cache 32 + * demotion: movement of a block from cache to origin 33 + * migration: movement of a block between the origin and cache device, 34 + * either direction 35 + */ 36 + 37 + /*----------------------------------------------------------------*/ 38 + 39 + static size_t bitset_size_in_bytes(unsigned nr_entries) 40 + { 41 + return sizeof(unsigned long) * dm_div_up(nr_entries, BITS_PER_LONG); 42 + } 43 + 44 + static unsigned long *alloc_bitset(unsigned nr_entries) 45 + { 46 + size_t s = bitset_size_in_bytes(nr_entries); 47 + return vzalloc(s); 48 + } 49 + 50 + static void clear_bitset(void *bitset, unsigned nr_entries) 51 + { 52 + size_t s = bitset_size_in_bytes(nr_entries); 53 + memset(bitset, 0, s); 54 + } 55 + 56 + static void free_bitset(unsigned long *bits) 57 + { 58 + vfree(bits); 59 + } 60 + 61 + /*----------------------------------------------------------------*/ 62 + 63 + #define PRISON_CELLS 1024 64 + #define MIGRATION_POOL_SIZE 128 65 + #define COMMIT_PERIOD HZ 66 + #define MIGRATION_COUNT_WINDOW 10 67 + 68 + /* 69 + * The block size of the device holding cache data must be >= 32KB 70 + */ 71 + 
#define DATA_DEV_BLOCK_SIZE_MIN_SECTORS (32 * 1024 >> SECTOR_SHIFT) 72 + 73 + /* 74 + * FIXME: the cache is read/write for the time being. 75 + */ 76 + enum cache_mode { 77 + CM_WRITE, /* metadata may be changed */ 78 + CM_READ_ONLY, /* metadata may not be changed */ 79 + }; 80 + 81 + struct cache_features { 82 + enum cache_mode mode; 83 + bool write_through:1; 84 + }; 85 + 86 + struct cache_stats { 87 + atomic_t read_hit; 88 + atomic_t read_miss; 89 + atomic_t write_hit; 90 + atomic_t write_miss; 91 + atomic_t demotion; 92 + atomic_t promotion; 93 + atomic_t copies_avoided; 94 + atomic_t cache_cell_clash; 95 + atomic_t commit_count; 96 + atomic_t discard_count; 97 + }; 98 + 99 + struct cache { 100 + struct dm_target *ti; 101 + struct dm_target_callbacks callbacks; 102 + 103 + /* 104 + * Metadata is written to this device. 105 + */ 106 + struct dm_dev *metadata_dev; 107 + 108 + /* 109 + * The slower of the two data devices. Typically a spindle. 110 + */ 111 + struct dm_dev *origin_dev; 112 + 113 + /* 114 + * The faster of the two data devices. Typically an SSD. 115 + */ 116 + struct dm_dev *cache_dev; 117 + 118 + /* 119 + * Cache features such as write-through. 120 + */ 121 + struct cache_features features; 122 + 123 + /* 124 + * Size of the origin device in _complete_ blocks and native sectors. 125 + */ 126 + dm_oblock_t origin_blocks; 127 + sector_t origin_sectors; 128 + 129 + /* 130 + * Size of the cache device in blocks. 131 + */ 132 + dm_cblock_t cache_size; 133 + 134 + /* 135 + * Fields for converting from sectors to blocks. 
136 + */ 137 + uint32_t sectors_per_block; 138 + int sectors_per_block_shift; 139 + 140 + struct dm_cache_metadata *cmd; 141 + 142 + spinlock_t lock; 143 + struct bio_list deferred_bios; 144 + struct bio_list deferred_flush_bios; 145 + struct list_head quiesced_migrations; 146 + struct list_head completed_migrations; 147 + struct list_head need_commit_migrations; 148 + sector_t migration_threshold; 149 + atomic_t nr_migrations; 150 + wait_queue_head_t migration_wait; 151 + 152 + /* 153 + * cache_size entries, dirty if set 154 + */ 155 + dm_cblock_t nr_dirty; 156 + unsigned long *dirty_bitset; 157 + 158 + /* 159 + * origin_blocks entries, discarded if set. 160 + */ 161 + sector_t discard_block_size; /* a power of 2 times sectors per block */ 162 + dm_dblock_t discard_nr_blocks; 163 + unsigned long *discard_bitset; 164 + 165 + struct dm_kcopyd_client *copier; 166 + struct workqueue_struct *wq; 167 + struct work_struct worker; 168 + 169 + struct delayed_work waker; 170 + unsigned long last_commit_jiffies; 171 + 172 + struct dm_bio_prison *prison; 173 + struct dm_deferred_set *all_io_ds; 174 + 175 + mempool_t *migration_pool; 176 + struct dm_cache_migration *next_migration; 177 + 178 + struct dm_cache_policy *policy; 179 + unsigned policy_nr_args; 180 + 181 + bool need_tick_bio:1; 182 + bool sized:1; 183 + bool quiescing:1; 184 + bool commit_requested:1; 185 + bool loaded_mappings:1; 186 + bool loaded_discards:1; 187 + 188 + struct cache_stats stats; 189 + 190 + /* 191 + * Rather than reconstructing the table line for the status we just 192 + * save it and regurgitate. 
193 + */ 194 + unsigned nr_ctr_args; 195 + const char **ctr_args; 196 + }; 197 + 198 + struct per_bio_data { 199 + bool tick:1; 200 + unsigned req_nr:2; 201 + struct dm_deferred_entry *all_io_entry; 202 + }; 203 + 204 + struct dm_cache_migration { 205 + struct list_head list; 206 + struct cache *cache; 207 + 208 + unsigned long start_jiffies; 209 + dm_oblock_t old_oblock; 210 + dm_oblock_t new_oblock; 211 + dm_cblock_t cblock; 212 + 213 + bool err:1; 214 + bool writeback:1; 215 + bool demote:1; 216 + bool promote:1; 217 + 218 + struct dm_bio_prison_cell *old_ocell; 219 + struct dm_bio_prison_cell *new_ocell; 220 + }; 221 + 222 + /* 223 + * Processing a bio in the worker thread may require these memory 224 + * allocations. We prealloc to avoid deadlocks (the same worker thread 225 + * frees them back to the mempool). 226 + */ 227 + struct prealloc { 228 + struct dm_cache_migration *mg; 229 + struct dm_bio_prison_cell *cell1; 230 + struct dm_bio_prison_cell *cell2; 231 + }; 232 + 233 + static void wake_worker(struct cache *cache) 234 + { 235 + queue_work(cache->wq, &cache->worker); 236 + } 237 + 238 + /*----------------------------------------------------------------*/ 239 + 240 + static struct dm_bio_prison_cell *alloc_prison_cell(struct cache *cache) 241 + { 242 + /* FIXME: change to use a local slab. 
*/ 243 + return dm_bio_prison_alloc_cell(cache->prison, GFP_NOWAIT); 244 + } 245 + 246 + static void free_prison_cell(struct cache *cache, struct dm_bio_prison_cell *cell) 247 + { 248 + dm_bio_prison_free_cell(cache->prison, cell); 249 + } 250 + 251 + static int prealloc_data_structs(struct cache *cache, struct prealloc *p) 252 + { 253 + if (!p->mg) { 254 + p->mg = mempool_alloc(cache->migration_pool, GFP_NOWAIT); 255 + if (!p->mg) 256 + return -ENOMEM; 257 + } 258 + 259 + if (!p->cell1) { 260 + p->cell1 = alloc_prison_cell(cache); 261 + if (!p->cell1) 262 + return -ENOMEM; 263 + } 264 + 265 + if (!p->cell2) { 266 + p->cell2 = alloc_prison_cell(cache); 267 + if (!p->cell2) 268 + return -ENOMEM; 269 + } 270 + 271 + return 0; 272 + } 273 + 274 + static void prealloc_free_structs(struct cache *cache, struct prealloc *p) 275 + { 276 + if (p->cell2) 277 + free_prison_cell(cache, p->cell2); 278 + 279 + if (p->cell1) 280 + free_prison_cell(cache, p->cell1); 281 + 282 + if (p->mg) 283 + mempool_free(p->mg, cache->migration_pool); 284 + } 285 + 286 + static struct dm_cache_migration *prealloc_get_migration(struct prealloc *p) 287 + { 288 + struct dm_cache_migration *mg = p->mg; 289 + 290 + BUG_ON(!mg); 291 + p->mg = NULL; 292 + 293 + return mg; 294 + } 295 + 296 + /* 297 + * You must have a cell within the prealloc struct to return. If not this 298 + * function will BUG() rather than returning NULL. 299 + */ 300 + static struct dm_bio_prison_cell *prealloc_get_cell(struct prealloc *p) 301 + { 302 + struct dm_bio_prison_cell *r = NULL; 303 + 304 + if (p->cell1) { 305 + r = p->cell1; 306 + p->cell1 = NULL; 307 + 308 + } else if (p->cell2) { 309 + r = p->cell2; 310 + p->cell2 = NULL; 311 + } else 312 + BUG(); 313 + 314 + return r; 315 + } 316 + 317 + /* 318 + * You can't have more than two cells in a prealloc struct. BUG() will be 319 + * called if you try and overfill. 
320 + */ 321 + static void prealloc_put_cell(struct prealloc *p, struct dm_bio_prison_cell *cell) 322 + { 323 + if (!p->cell2) 324 + p->cell2 = cell; 325 + 326 + else if (!p->cell1) 327 + p->cell1 = cell; 328 + 329 + else 330 + BUG(); 331 + } 332 + 333 + /*----------------------------------------------------------------*/ 334 + 335 + static void build_key(dm_oblock_t oblock, struct dm_cell_key *key) 336 + { 337 + key->virtual = 0; 338 + key->dev = 0; 339 + key->block = from_oblock(oblock); 340 + } 341 + 342 + /* 343 + * The caller hands in a preallocated cell, and a free function for it. 344 + * The cell will be freed if there's an error, or if it wasn't used because 345 + * a cell with that key already exists. 346 + */ 347 + typedef void (*cell_free_fn)(void *context, struct dm_bio_prison_cell *cell); 348 + 349 + static int bio_detain(struct cache *cache, dm_oblock_t oblock, 350 + struct bio *bio, struct dm_bio_prison_cell *cell_prealloc, 351 + cell_free_fn free_fn, void *free_context, 352 + struct dm_bio_prison_cell **cell_result) 353 + { 354 + int r; 355 + struct dm_cell_key key; 356 + 357 + build_key(oblock, &key); 358 + r = dm_bio_detain(cache->prison, &key, bio, cell_prealloc, cell_result); 359 + if (r) 360 + free_fn(free_context, cell_prealloc); 361 + 362 + return r; 363 + } 364 + 365 + static int get_cell(struct cache *cache, 366 + dm_oblock_t oblock, 367 + struct prealloc *structs, 368 + struct dm_bio_prison_cell **cell_result) 369 + { 370 + int r; 371 + struct dm_cell_key key; 372 + struct dm_bio_prison_cell *cell_prealloc; 373 + 374 + cell_prealloc = prealloc_get_cell(structs); 375 + 376 + build_key(oblock, &key); 377 + r = dm_get_cell(cache->prison, &key, cell_prealloc, cell_result); 378 + if (r) 379 + prealloc_put_cell(structs, cell_prealloc); 380 + 381 + return r; 382 + } 383 + 384 + /*----------------------------------------------------------------*/ 385 + 386 + static bool is_dirty(struct cache *cache, dm_cblock_t b) 387 + { 388 + return 
test_bit(from_cblock(b), cache->dirty_bitset); 389 + } 390 + 391 + static void set_dirty(struct cache *cache, dm_oblock_t oblock, dm_cblock_t cblock) 392 + { 393 + if (!test_and_set_bit(from_cblock(cblock), cache->dirty_bitset)) { 394 + cache->nr_dirty = to_cblock(from_cblock(cache->nr_dirty) + 1); 395 + policy_set_dirty(cache->policy, oblock); 396 + } 397 + } 398 + 399 + static void clear_dirty(struct cache *cache, dm_oblock_t oblock, dm_cblock_t cblock) 400 + { 401 + if (test_and_clear_bit(from_cblock(cblock), cache->dirty_bitset)) { 402 + policy_clear_dirty(cache->policy, oblock); 403 + cache->nr_dirty = to_cblock(from_cblock(cache->nr_dirty) - 1); 404 + if (!from_cblock(cache->nr_dirty)) 405 + dm_table_event(cache->ti->table); 406 + } 407 + } 408 + 409 + /*----------------------------------------------------------------*/ 410 + static bool block_size_is_power_of_two(struct cache *cache) 411 + { 412 + return cache->sectors_per_block_shift >= 0; 413 + } 414 + 415 + static dm_dblock_t oblock_to_dblock(struct cache *cache, dm_oblock_t oblock) 416 + { 417 + sector_t discard_blocks = cache->discard_block_size; 418 + dm_block_t b = from_oblock(oblock); 419 + 420 + if (!block_size_is_power_of_two(cache)) 421 + (void) sector_div(discard_blocks, cache->sectors_per_block); 422 + else 423 + discard_blocks >>= cache->sectors_per_block_shift; 424 + 425 + (void) sector_div(b, discard_blocks); 426 + 427 + return to_dblock(b); 428 + } 429 + 430 + static void set_discard(struct cache *cache, dm_dblock_t b) 431 + { 432 + unsigned long flags; 433 + 434 + atomic_inc(&cache->stats.discard_count); 435 + 436 + spin_lock_irqsave(&cache->lock, flags); 437 + set_bit(from_dblock(b), cache->discard_bitset); 438 + spin_unlock_irqrestore(&cache->lock, flags); 439 + } 440 + 441 + static void clear_discard(struct cache *cache, dm_dblock_t b) 442 + { 443 + unsigned long flags; 444 + 445 + spin_lock_irqsave(&cache->lock, flags); 446 + clear_bit(from_dblock(b), cache->discard_bitset); 447 + 
spin_unlock_irqrestore(&cache->lock, flags); 448 + } 449 + 450 + static bool is_discarded(struct cache *cache, dm_dblock_t b) 451 + { 452 + int r; 453 + unsigned long flags; 454 + 455 + spin_lock_irqsave(&cache->lock, flags); 456 + r = test_bit(from_dblock(b), cache->discard_bitset); 457 + spin_unlock_irqrestore(&cache->lock, flags); 458 + 459 + return r; 460 + } 461 + 462 + static bool is_discarded_oblock(struct cache *cache, dm_oblock_t b) 463 + { 464 + int r; 465 + unsigned long flags; 466 + 467 + spin_lock_irqsave(&cache->lock, flags); 468 + r = test_bit(from_dblock(oblock_to_dblock(cache, b)), 469 + cache->discard_bitset); 470 + spin_unlock_irqrestore(&cache->lock, flags); 471 + 472 + return r; 473 + } 474 + 475 + /*----------------------------------------------------------------*/ 476 + 477 + static void load_stats(struct cache *cache) 478 + { 479 + struct dm_cache_statistics stats; 480 + 481 + dm_cache_metadata_get_stats(cache->cmd, &stats); 482 + atomic_set(&cache->stats.read_hit, stats.read_hits); 483 + atomic_set(&cache->stats.read_miss, stats.read_misses); 484 + atomic_set(&cache->stats.write_hit, stats.write_hits); 485 + atomic_set(&cache->stats.write_miss, stats.write_misses); 486 + } 487 + 488 + static void save_stats(struct cache *cache) 489 + { 490 + struct dm_cache_statistics stats; 491 + 492 + stats.read_hits = atomic_read(&cache->stats.read_hit); 493 + stats.read_misses = atomic_read(&cache->stats.read_miss); 494 + stats.write_hits = atomic_read(&cache->stats.write_hit); 495 + stats.write_misses = atomic_read(&cache->stats.write_miss); 496 + 497 + dm_cache_metadata_set_stats(cache->cmd, &stats); 498 + } 499 + 500 + /*---------------------------------------------------------------- 501 + * Per bio data 502 + *--------------------------------------------------------------*/ 503 + static struct per_bio_data *get_per_bio_data(struct bio *bio) 504 + { 505 + struct per_bio_data *pb = dm_per_bio_data(bio, sizeof(struct per_bio_data)); 506 + BUG_ON(!pb); 
507 + return pb; 508 + } 509 + 510 + static struct per_bio_data *init_per_bio_data(struct bio *bio) 511 + { 512 + struct per_bio_data *pb = get_per_bio_data(bio); 513 + 514 + pb->tick = false; 515 + pb->req_nr = dm_bio_get_target_bio_nr(bio); 516 + pb->all_io_entry = NULL; 517 + 518 + return pb; 519 + } 520 + 521 + /*---------------------------------------------------------------- 522 + * Remapping 523 + *--------------------------------------------------------------*/ 524 + static void remap_to_origin(struct cache *cache, struct bio *bio) 525 + { 526 + bio->bi_bdev = cache->origin_dev->bdev; 527 + } 528 + 529 + static void remap_to_cache(struct cache *cache, struct bio *bio, 530 + dm_cblock_t cblock) 531 + { 532 + sector_t bi_sector = bio->bi_sector; 533 + 534 + bio->bi_bdev = cache->cache_dev->bdev; 535 + if (!block_size_is_power_of_two(cache)) 536 + bio->bi_sector = (from_cblock(cblock) * cache->sectors_per_block) + 537 + sector_div(bi_sector, cache->sectors_per_block); 538 + else 539 + bio->bi_sector = (from_cblock(cblock) << cache->sectors_per_block_shift) | 540 + (bi_sector & (cache->sectors_per_block - 1)); 541 + } 542 + 543 + static void check_if_tick_bio_needed(struct cache *cache, struct bio *bio) 544 + { 545 + unsigned long flags; 546 + struct per_bio_data *pb = get_per_bio_data(bio); 547 + 548 + spin_lock_irqsave(&cache->lock, flags); 549 + if (cache->need_tick_bio && 550 + !(bio->bi_rw & (REQ_FUA | REQ_FLUSH | REQ_DISCARD))) { 551 + pb->tick = true; 552 + cache->need_tick_bio = false; 553 + } 554 + spin_unlock_irqrestore(&cache->lock, flags); 555 + } 556 + 557 + static void remap_to_origin_clear_discard(struct cache *cache, struct bio *bio, 558 + dm_oblock_t oblock) 559 + { 560 + check_if_tick_bio_needed(cache, bio); 561 + remap_to_origin(cache, bio); 562 + if (bio_data_dir(bio) == WRITE) 563 + clear_discard(cache, oblock_to_dblock(cache, oblock)); 564 + } 565 + 566 + static void remap_to_cache_dirty(struct cache *cache, struct bio *bio, 567 + 
dm_oblock_t oblock, dm_cblock_t cblock) 568 + { 569 + remap_to_cache(cache, bio, cblock); 570 + if (bio_data_dir(bio) == WRITE) { 571 + set_dirty(cache, oblock, cblock); 572 + clear_discard(cache, oblock_to_dblock(cache, oblock)); 573 + } 574 + } 575 + 576 + static dm_oblock_t get_bio_block(struct cache *cache, struct bio *bio) 577 + { 578 + sector_t block_nr = bio->bi_sector; 579 + 580 + if (!block_size_is_power_of_two(cache)) 581 + (void) sector_div(block_nr, cache->sectors_per_block); 582 + else 583 + block_nr >>= cache->sectors_per_block_shift; 584 + 585 + return to_oblock(block_nr); 586 + } 587 + 588 + static int bio_triggers_commit(struct cache *cache, struct bio *bio) 589 + { 590 + return bio->bi_rw & (REQ_FLUSH | REQ_FUA); 591 + } 592 + 593 + static void issue(struct cache *cache, struct bio *bio) 594 + { 595 + unsigned long flags; 596 + 597 + if (!bio_triggers_commit(cache, bio)) { 598 + generic_make_request(bio); 599 + return; 600 + } 601 + 602 + /* 603 + * Batch together any bios that trigger commits and then issue a 604 + * single commit for them in do_worker(). 605 + */ 606 + spin_lock_irqsave(&cache->lock, flags); 607 + cache->commit_requested = true; 608 + bio_list_add(&cache->deferred_flush_bios, bio); 609 + spin_unlock_irqrestore(&cache->lock, flags); 610 + } 611 + 612 + /*---------------------------------------------------------------- 613 + * Migration processing 614 + * 615 + * Migration covers moving data from the origin device to the cache, or 616 + * vice versa. 
617 + *--------------------------------------------------------------*/ 618 + static void free_migration(struct dm_cache_migration *mg) 619 + { 620 + mempool_free(mg, mg->cache->migration_pool); 621 + } 622 + 623 + static void inc_nr_migrations(struct cache *cache) 624 + { 625 + atomic_inc(&cache->nr_migrations); 626 + } 627 + 628 + static void dec_nr_migrations(struct cache *cache) 629 + { 630 + atomic_dec(&cache->nr_migrations); 631 + 632 + /* 633 + * Wake the worker in case we're suspending the target. 634 + */ 635 + wake_up(&cache->migration_wait); 636 + } 637 + 638 + static void __cell_defer(struct cache *cache, struct dm_bio_prison_cell *cell, 639 + bool holder) 640 + { 641 + (holder ? dm_cell_release : dm_cell_release_no_holder) 642 + (cache->prison, cell, &cache->deferred_bios); 643 + free_prison_cell(cache, cell); 644 + } 645 + 646 + static void cell_defer(struct cache *cache, struct dm_bio_prison_cell *cell, 647 + bool holder) 648 + { 649 + unsigned long flags; 650 + 651 + spin_lock_irqsave(&cache->lock, flags); 652 + __cell_defer(cache, cell, holder); 653 + spin_unlock_irqrestore(&cache->lock, flags); 654 + 655 + wake_worker(cache); 656 + } 657 + 658 + static void cleanup_migration(struct dm_cache_migration *mg) 659 + { 660 + dec_nr_migrations(mg->cache); 661 + free_migration(mg); 662 + } 663 + 664 + static void migration_failure(struct dm_cache_migration *mg) 665 + { 666 + struct cache *cache = mg->cache; 667 + 668 + if (mg->writeback) { 669 + DMWARN_LIMIT("writeback failed; couldn't copy block"); 670 + set_dirty(cache, mg->old_oblock, mg->cblock); 671 + cell_defer(cache, mg->old_ocell, false); 672 + 673 + } else if (mg->demote) { 674 + DMWARN_LIMIT("demotion failed; couldn't copy block"); 675 + policy_force_mapping(cache->policy, mg->new_oblock, mg->old_oblock); 676 + 677 + cell_defer(cache, mg->old_ocell, mg->promote ? 
0 : 1); 678 + if (mg->promote) 679 + cell_defer(cache, mg->new_ocell, 1); 680 + } else { 681 + DMWARN_LIMIT("promotion failed; couldn't copy block"); 682 + policy_remove_mapping(cache->policy, mg->new_oblock); 683 + cell_defer(cache, mg->new_ocell, 1); 684 + } 685 + 686 + cleanup_migration(mg); 687 + } 688 + 689 + static void migration_success_pre_commit(struct dm_cache_migration *mg) 690 + { 691 + unsigned long flags; 692 + struct cache *cache = mg->cache; 693 + 694 + if (mg->writeback) { 695 + cell_defer(cache, mg->old_ocell, false); 696 + clear_dirty(cache, mg->old_oblock, mg->cblock); 697 + cleanup_migration(mg); 698 + return; 699 + 700 + } else if (mg->demote) { 701 + if (dm_cache_remove_mapping(cache->cmd, mg->cblock)) { 702 + DMWARN_LIMIT("demotion failed; couldn't update on disk metadata"); 703 + policy_force_mapping(cache->policy, mg->new_oblock, 704 + mg->old_oblock); 705 + if (mg->promote) 706 + cell_defer(cache, mg->new_ocell, true); 707 + cleanup_migration(mg); 708 + return; 709 + } 710 + } else { 711 + if (dm_cache_insert_mapping(cache->cmd, mg->cblock, mg->new_oblock)) { 712 + DMWARN_LIMIT("promotion failed; couldn't update on disk metadata"); 713 + policy_remove_mapping(cache->policy, mg->new_oblock); 714 + cleanup_migration(mg); 715 + return; 716 + } 717 + } 718 + 719 + spin_lock_irqsave(&cache->lock, flags); 720 + list_add_tail(&mg->list, &cache->need_commit_migrations); 721 + cache->commit_requested = true; 722 + spin_unlock_irqrestore(&cache->lock, flags); 723 + } 724 + 725 + static void migration_success_post_commit(struct dm_cache_migration *mg) 726 + { 727 + unsigned long flags; 728 + struct cache *cache = mg->cache; 729 + 730 + if (mg->writeback) { 731 + DMWARN("writeback unexpectedly triggered commit"); 732 + return; 733 + 734 + } else if (mg->demote) { 735 + cell_defer(cache, mg->old_ocell, mg->promote ? 
0 : 1); 736 + 737 + if (mg->promote) { 738 + mg->demote = false; 739 + 740 + spin_lock_irqsave(&cache->lock, flags); 741 + list_add_tail(&mg->list, &cache->quiesced_migrations); 742 + spin_unlock_irqrestore(&cache->lock, flags); 743 + 744 + } else 745 + cleanup_migration(mg); 746 + 747 + } else { 748 + cell_defer(cache, mg->new_ocell, true); 749 + clear_dirty(cache, mg->new_oblock, mg->cblock); 750 + cleanup_migration(mg); 751 + } 752 + } 753 + 754 + static void copy_complete(int read_err, unsigned long write_err, void *context) 755 + { 756 + unsigned long flags; 757 + struct dm_cache_migration *mg = (struct dm_cache_migration *) context; 758 + struct cache *cache = mg->cache; 759 + 760 + if (read_err || write_err) 761 + mg->err = true; 762 + 763 + spin_lock_irqsave(&cache->lock, flags); 764 + list_add_tail(&mg->list, &cache->completed_migrations); 765 + spin_unlock_irqrestore(&cache->lock, flags); 766 + 767 + wake_worker(cache); 768 + } 769 + 770 + static void issue_copy_real(struct dm_cache_migration *mg) 771 + { 772 + int r; 773 + struct dm_io_region o_region, c_region; 774 + struct cache *cache = mg->cache; 775 + 776 + o_region.bdev = cache->origin_dev->bdev; 777 + o_region.count = cache->sectors_per_block; 778 + 779 + c_region.bdev = cache->cache_dev->bdev; 780 + c_region.sector = from_cblock(mg->cblock) * cache->sectors_per_block; 781 + c_region.count = cache->sectors_per_block; 782 + 783 + if (mg->writeback || mg->demote) { 784 + /* demote */ 785 + o_region.sector = from_oblock(mg->old_oblock) * cache->sectors_per_block; 786 + r = dm_kcopyd_copy(cache->copier, &c_region, 1, &o_region, 0, copy_complete, mg); 787 + } else { 788 + /* promote */ 789 + o_region.sector = from_oblock(mg->new_oblock) * cache->sectors_per_block; 790 + r = dm_kcopyd_copy(cache->copier, &o_region, 1, &c_region, 0, copy_complete, mg); 791 + } 792 + 793 + if (r < 0) 794 + migration_failure(mg); 795 + } 796 + 797 + static void avoid_copy(struct dm_cache_migration *mg) 798 + { 799 + 
atomic_inc(&mg->cache->stats.copies_avoided); 800 + migration_success_pre_commit(mg); 801 + } 802 + 803 + static void issue_copy(struct dm_cache_migration *mg) 804 + { 805 + bool avoid; 806 + struct cache *cache = mg->cache; 807 + 808 + if (mg->writeback || mg->demote) 809 + avoid = !is_dirty(cache, mg->cblock) || 810 + is_discarded_oblock(cache, mg->old_oblock); 811 + else 812 + avoid = is_discarded_oblock(cache, mg->new_oblock); 813 + 814 + avoid ? avoid_copy(mg) : issue_copy_real(mg); 815 + } 816 + 817 + static void complete_migration(struct dm_cache_migration *mg) 818 + { 819 + if (mg->err) 820 + migration_failure(mg); 821 + else 822 + migration_success_pre_commit(mg); 823 + } 824 + 825 + static void process_migrations(struct cache *cache, struct list_head *head, 826 + void (*fn)(struct dm_cache_migration *)) 827 + { 828 + unsigned long flags; 829 + struct list_head list; 830 + struct dm_cache_migration *mg, *tmp; 831 + 832 + INIT_LIST_HEAD(&list); 833 + spin_lock_irqsave(&cache->lock, flags); 834 + list_splice_init(head, &list); 835 + spin_unlock_irqrestore(&cache->lock, flags); 836 + 837 + list_for_each_entry_safe(mg, tmp, &list, list) 838 + fn(mg); 839 + } 840 + 841 + static void __queue_quiesced_migration(struct dm_cache_migration *mg) 842 + { 843 + list_add_tail(&mg->list, &mg->cache->quiesced_migrations); 844 + } 845 + 846 + static void queue_quiesced_migration(struct dm_cache_migration *mg) 847 + { 848 + unsigned long flags; 849 + struct cache *cache = mg->cache; 850 + 851 + spin_lock_irqsave(&cache->lock, flags); 852 + __queue_quiesced_migration(mg); 853 + spin_unlock_irqrestore(&cache->lock, flags); 854 + 855 + wake_worker(cache); 856 + } 857 + 858 + static void queue_quiesced_migrations(struct cache *cache, struct list_head *work) 859 + { 860 + unsigned long flags; 861 + struct dm_cache_migration *mg, *tmp; 862 + 863 + spin_lock_irqsave(&cache->lock, flags); 864 + list_for_each_entry_safe(mg, tmp, work, list) 865 + __queue_quiesced_migration(mg); 866 
+ spin_unlock_irqrestore(&cache->lock, flags); 867 + 868 + wake_worker(cache); 869 + } 870 + 871 + static void check_for_quiesced_migrations(struct cache *cache, 872 + struct per_bio_data *pb) 873 + { 874 + struct list_head work; 875 + 876 + if (!pb->all_io_entry) 877 + return; 878 + 879 + INIT_LIST_HEAD(&work); 880 + if (pb->all_io_entry) 881 + dm_deferred_entry_dec(pb->all_io_entry, &work); 882 + 883 + if (!list_empty(&work)) 884 + queue_quiesced_migrations(cache, &work); 885 + } 886 + 887 + static void quiesce_migration(struct dm_cache_migration *mg) 888 + { 889 + if (!dm_deferred_set_add_work(mg->cache->all_io_ds, &mg->list)) 890 + queue_quiesced_migration(mg); 891 + } 892 + 893 + static void promote(struct cache *cache, struct prealloc *structs, 894 + dm_oblock_t oblock, dm_cblock_t cblock, 895 + struct dm_bio_prison_cell *cell) 896 + { 897 + struct dm_cache_migration *mg = prealloc_get_migration(structs); 898 + 899 + mg->err = false; 900 + mg->writeback = false; 901 + mg->demote = false; 902 + mg->promote = true; 903 + mg->cache = cache; 904 + mg->new_oblock = oblock; 905 + mg->cblock = cblock; 906 + mg->old_ocell = NULL; 907 + mg->new_ocell = cell; 908 + mg->start_jiffies = jiffies; 909 + 910 + inc_nr_migrations(cache); 911 + quiesce_migration(mg); 912 + } 913 + 914 + static void writeback(struct cache *cache, struct prealloc *structs, 915 + dm_oblock_t oblock, dm_cblock_t cblock, 916 + struct dm_bio_prison_cell *cell) 917 + { 918 + struct dm_cache_migration *mg = prealloc_get_migration(structs); 919 + 920 + mg->err = false; 921 + mg->writeback = true; 922 + mg->demote = false; 923 + mg->promote = false; 924 + mg->cache = cache; 925 + mg->old_oblock = oblock; 926 + mg->cblock = cblock; 927 + mg->old_ocell = cell; 928 + mg->new_ocell = NULL; 929 + mg->start_jiffies = jiffies; 930 + 931 + inc_nr_migrations(cache); 932 + quiesce_migration(mg); 933 + } 934 + 935 + static void demote_then_promote(struct cache *cache, struct prealloc *structs, 936 + dm_oblock_t 
old_oblock, dm_oblock_t new_oblock,
937 + dm_cblock_t cblock,
938 + struct dm_bio_prison_cell *old_ocell,
939 + struct dm_bio_prison_cell *new_ocell)
940 + {
941 + struct dm_cache_migration *mg = prealloc_get_migration(structs);
942 +
943 + mg->err = false;
944 + mg->writeback = false;
945 + mg->demote = true;
946 + mg->promote = true;
947 + mg->cache = cache;
948 + mg->old_oblock = old_oblock;
949 + mg->new_oblock = new_oblock;
950 + mg->cblock = cblock;
951 + mg->old_ocell = old_ocell;
952 + mg->new_ocell = new_ocell;
953 + mg->start_jiffies = jiffies;
954 +
955 + inc_nr_migrations(cache);
956 + quiesce_migration(mg);
957 + }
958 +
959 + /*----------------------------------------------------------------
960 + * bio processing
961 + *--------------------------------------------------------------*/
962 + static void defer_bio(struct cache *cache, struct bio *bio)
963 + {
964 + unsigned long flags;
965 +
966 + spin_lock_irqsave(&cache->lock, flags);
967 + bio_list_add(&cache->deferred_bios, bio);
968 + spin_unlock_irqrestore(&cache->lock, flags);
969 +
970 + wake_worker(cache);
971 + }
972 +
973 + static void process_flush_bio(struct cache *cache, struct bio *bio)
974 + {
975 + struct per_bio_data *pb = get_per_bio_data(bio);
976 +
977 + BUG_ON(bio->bi_size);
978 + if (!pb->req_nr)
979 + remap_to_origin(cache, bio);
980 + else
981 + remap_to_cache(cache, bio, 0);
982 +
983 + issue(cache, bio);
984 + }
985 +
986 + /*
987 + * People generally discard large parts of a device, e.g., the whole device
988 + * when formatting. Splitting these large discards up into cache block
989 + * sized ios and then quiescing (always necessary for discard) takes too
990 + * long.
991 + *
992 + * We keep it simple, and allow any size of discard to come in, and just
993 + * mark off blocks on the discard bitset. No passdown occurs!
994 + *
995 + * To implement passdown we need to change the bio_prison such that a cell
996 + * can have a key that spans many blocks. 
997 + */ 998 + static void process_discard_bio(struct cache *cache, struct bio *bio) 999 + { 1000 + dm_block_t start_block = dm_sector_div_up(bio->bi_sector, 1001 + cache->discard_block_size); 1002 + dm_block_t end_block = bio->bi_sector + bio_sectors(bio); 1003 + dm_block_t b; 1004 + 1005 + (void) sector_div(end_block, cache->discard_block_size); 1006 + 1007 + for (b = start_block; b < end_block; b++) 1008 + set_discard(cache, to_dblock(b)); 1009 + 1010 + bio_endio(bio, 0); 1011 + } 1012 + 1013 + static bool spare_migration_bandwidth(struct cache *cache) 1014 + { 1015 + sector_t current_volume = (atomic_read(&cache->nr_migrations) + 1) * 1016 + cache->sectors_per_block; 1017 + return current_volume < cache->migration_threshold; 1018 + } 1019 + 1020 + static bool is_writethrough_io(struct cache *cache, struct bio *bio, 1021 + dm_cblock_t cblock) 1022 + { 1023 + return bio_data_dir(bio) == WRITE && 1024 + cache->features.write_through && !is_dirty(cache, cblock); 1025 + } 1026 + 1027 + static void inc_hit_counter(struct cache *cache, struct bio *bio) 1028 + { 1029 + atomic_inc(bio_data_dir(bio) == READ ? 1030 + &cache->stats.read_hit : &cache->stats.write_hit); 1031 + } 1032 + 1033 + static void inc_miss_counter(struct cache *cache, struct bio *bio) 1034 + { 1035 + atomic_inc(bio_data_dir(bio) == READ ? 
1036 + &cache->stats.read_miss : &cache->stats.write_miss); 1037 + } 1038 + 1039 + static void process_bio(struct cache *cache, struct prealloc *structs, 1040 + struct bio *bio) 1041 + { 1042 + int r; 1043 + bool release_cell = true; 1044 + dm_oblock_t block = get_bio_block(cache, bio); 1045 + struct dm_bio_prison_cell *cell_prealloc, *old_ocell, *new_ocell; 1046 + struct policy_result lookup_result; 1047 + struct per_bio_data *pb = get_per_bio_data(bio); 1048 + bool discarded_block = is_discarded_oblock(cache, block); 1049 + bool can_migrate = discarded_block || spare_migration_bandwidth(cache); 1050 + 1051 + /* 1052 + * Check to see if that block is currently migrating. 1053 + */ 1054 + cell_prealloc = prealloc_get_cell(structs); 1055 + r = bio_detain(cache, block, bio, cell_prealloc, 1056 + (cell_free_fn) prealloc_put_cell, 1057 + structs, &new_ocell); 1058 + if (r > 0) 1059 + return; 1060 + 1061 + r = policy_map(cache->policy, block, true, can_migrate, discarded_block, 1062 + bio, &lookup_result); 1063 + 1064 + if (r == -EWOULDBLOCK) 1065 + /* migration has been denied */ 1066 + lookup_result.op = POLICY_MISS; 1067 + 1068 + switch (lookup_result.op) { 1069 + case POLICY_HIT: 1070 + inc_hit_counter(cache, bio); 1071 + pb->all_io_entry = dm_deferred_entry_inc(cache->all_io_ds); 1072 + 1073 + if (is_writethrough_io(cache, bio, lookup_result.cblock)) { 1074 + /* 1075 + * No need to mark anything dirty in write through mode. 1076 + */ 1077 + pb->req_nr == 0 ? 
1078 + remap_to_cache(cache, bio, lookup_result.cblock) : 1079 + remap_to_origin_clear_discard(cache, bio, block); 1080 + } else 1081 + remap_to_cache_dirty(cache, bio, block, lookup_result.cblock); 1082 + 1083 + issue(cache, bio); 1084 + break; 1085 + 1086 + case POLICY_MISS: 1087 + inc_miss_counter(cache, bio); 1088 + pb->all_io_entry = dm_deferred_entry_inc(cache->all_io_ds); 1089 + 1090 + if (pb->req_nr != 0) { 1091 + /* 1092 + * This is a duplicate writethrough io that is no 1093 + * longer needed because the block has been demoted. 1094 + */ 1095 + bio_endio(bio, 0); 1096 + } else { 1097 + remap_to_origin_clear_discard(cache, bio, block); 1098 + issue(cache, bio); 1099 + } 1100 + break; 1101 + 1102 + case POLICY_NEW: 1103 + atomic_inc(&cache->stats.promotion); 1104 + promote(cache, structs, block, lookup_result.cblock, new_ocell); 1105 + release_cell = false; 1106 + break; 1107 + 1108 + case POLICY_REPLACE: 1109 + cell_prealloc = prealloc_get_cell(structs); 1110 + r = bio_detain(cache, lookup_result.old_oblock, bio, cell_prealloc, 1111 + (cell_free_fn) prealloc_put_cell, 1112 + structs, &old_ocell); 1113 + if (r > 0) { 1114 + /* 1115 + * We have to be careful to avoid lock inversion of 1116 + * the cells. So we back off, and wait for the 1117 + * old_ocell to become free. 
1118 + */ 1119 + policy_force_mapping(cache->policy, block, 1120 + lookup_result.old_oblock); 1121 + atomic_inc(&cache->stats.cache_cell_clash); 1122 + break; 1123 + } 1124 + atomic_inc(&cache->stats.demotion); 1125 + atomic_inc(&cache->stats.promotion); 1126 + 1127 + demote_then_promote(cache, structs, lookup_result.old_oblock, 1128 + block, lookup_result.cblock, 1129 + old_ocell, new_ocell); 1130 + release_cell = false; 1131 + break; 1132 + 1133 + default: 1134 + DMERR_LIMIT("%s: erroring bio, unknown policy op: %u", __func__, 1135 + (unsigned) lookup_result.op); 1136 + bio_io_error(bio); 1137 + } 1138 + 1139 + if (release_cell) 1140 + cell_defer(cache, new_ocell, false); 1141 + } 1142 + 1143 + static int need_commit_due_to_time(struct cache *cache) 1144 + { 1145 + return jiffies < cache->last_commit_jiffies || 1146 + jiffies > cache->last_commit_jiffies + COMMIT_PERIOD; 1147 + } 1148 + 1149 + static int commit_if_needed(struct cache *cache) 1150 + { 1151 + if (dm_cache_changed_this_transaction(cache->cmd) && 1152 + (cache->commit_requested || need_commit_due_to_time(cache))) { 1153 + atomic_inc(&cache->stats.commit_count); 1154 + cache->last_commit_jiffies = jiffies; 1155 + cache->commit_requested = false; 1156 + return dm_cache_commit(cache->cmd, false); 1157 + } 1158 + 1159 + return 0; 1160 + } 1161 + 1162 + static void process_deferred_bios(struct cache *cache) 1163 + { 1164 + unsigned long flags; 1165 + struct bio_list bios; 1166 + struct bio *bio; 1167 + struct prealloc structs; 1168 + 1169 + memset(&structs, 0, sizeof(structs)); 1170 + bio_list_init(&bios); 1171 + 1172 + spin_lock_irqsave(&cache->lock, flags); 1173 + bio_list_merge(&bios, &cache->deferred_bios); 1174 + bio_list_init(&cache->deferred_bios); 1175 + spin_unlock_irqrestore(&cache->lock, flags); 1176 + 1177 + while (!bio_list_empty(&bios)) { 1178 + /* 1179 + * If we've got no free migration structs, and processing 1180 + * this bio might require one, we pause until there are some 1181 + * 
prepared mappings to process. 1182 + */ 1183 + if (prealloc_data_structs(cache, &structs)) { 1184 + spin_lock_irqsave(&cache->lock, flags); 1185 + bio_list_merge(&cache->deferred_bios, &bios); 1186 + spin_unlock_irqrestore(&cache->lock, flags); 1187 + break; 1188 + } 1189 + 1190 + bio = bio_list_pop(&bios); 1191 + 1192 + if (bio->bi_rw & REQ_FLUSH) 1193 + process_flush_bio(cache, bio); 1194 + else if (bio->bi_rw & REQ_DISCARD) 1195 + process_discard_bio(cache, bio); 1196 + else 1197 + process_bio(cache, &structs, bio); 1198 + } 1199 + 1200 + prealloc_free_structs(cache, &structs); 1201 + } 1202 + 1203 + static void process_deferred_flush_bios(struct cache *cache, bool submit_bios) 1204 + { 1205 + unsigned long flags; 1206 + struct bio_list bios; 1207 + struct bio *bio; 1208 + 1209 + bio_list_init(&bios); 1210 + 1211 + spin_lock_irqsave(&cache->lock, flags); 1212 + bio_list_merge(&bios, &cache->deferred_flush_bios); 1213 + bio_list_init(&cache->deferred_flush_bios); 1214 + spin_unlock_irqrestore(&cache->lock, flags); 1215 + 1216 + while ((bio = bio_list_pop(&bios))) 1217 + submit_bios ? 
generic_make_request(bio) : bio_io_error(bio); 1218 + } 1219 + 1220 + static void writeback_some_dirty_blocks(struct cache *cache) 1221 + { 1222 + int r = 0; 1223 + dm_oblock_t oblock; 1224 + dm_cblock_t cblock; 1225 + struct prealloc structs; 1226 + struct dm_bio_prison_cell *old_ocell; 1227 + 1228 + memset(&structs, 0, sizeof(structs)); 1229 + 1230 + while (spare_migration_bandwidth(cache)) { 1231 + if (prealloc_data_structs(cache, &structs)) 1232 + break; 1233 + 1234 + r = policy_writeback_work(cache->policy, &oblock, &cblock); 1235 + if (r) 1236 + break; 1237 + 1238 + r = get_cell(cache, oblock, &structs, &old_ocell); 1239 + if (r) { 1240 + policy_set_dirty(cache->policy, oblock); 1241 + break; 1242 + } 1243 + 1244 + writeback(cache, &structs, oblock, cblock, old_ocell); 1245 + } 1246 + 1247 + prealloc_free_structs(cache, &structs); 1248 + } 1249 + 1250 + /*---------------------------------------------------------------- 1251 + * Main worker loop 1252 + *--------------------------------------------------------------*/ 1253 + static void start_quiescing(struct cache *cache) 1254 + { 1255 + unsigned long flags; 1256 + 1257 + spin_lock_irqsave(&cache->lock, flags); 1258 + cache->quiescing = 1; 1259 + spin_unlock_irqrestore(&cache->lock, flags); 1260 + } 1261 + 1262 + static void stop_quiescing(struct cache *cache) 1263 + { 1264 + unsigned long flags; 1265 + 1266 + spin_lock_irqsave(&cache->lock, flags); 1267 + cache->quiescing = 0; 1268 + spin_unlock_irqrestore(&cache->lock, flags); 1269 + } 1270 + 1271 + static bool is_quiescing(struct cache *cache) 1272 + { 1273 + int r; 1274 + unsigned long flags; 1275 + 1276 + spin_lock_irqsave(&cache->lock, flags); 1277 + r = cache->quiescing; 1278 + spin_unlock_irqrestore(&cache->lock, flags); 1279 + 1280 + return r; 1281 + } 1282 + 1283 + static void wait_for_migrations(struct cache *cache) 1284 + { 1285 + wait_event(cache->migration_wait, !atomic_read(&cache->nr_migrations)); 1286 + } 1287 + 1288 + static void 
stop_worker(struct cache *cache) 1289 + { 1290 + cancel_delayed_work(&cache->waker); 1291 + flush_workqueue(cache->wq); 1292 + } 1293 + 1294 + static void requeue_deferred_io(struct cache *cache) 1295 + { 1296 + struct bio *bio; 1297 + struct bio_list bios; 1298 + 1299 + bio_list_init(&bios); 1300 + bio_list_merge(&bios, &cache->deferred_bios); 1301 + bio_list_init(&cache->deferred_bios); 1302 + 1303 + while ((bio = bio_list_pop(&bios))) 1304 + bio_endio(bio, DM_ENDIO_REQUEUE); 1305 + } 1306 + 1307 + static int more_work(struct cache *cache) 1308 + { 1309 + if (is_quiescing(cache)) 1310 + return !list_empty(&cache->quiesced_migrations) || 1311 + !list_empty(&cache->completed_migrations) || 1312 + !list_empty(&cache->need_commit_migrations); 1313 + else 1314 + return !bio_list_empty(&cache->deferred_bios) || 1315 + !bio_list_empty(&cache->deferred_flush_bios) || 1316 + !list_empty(&cache->quiesced_migrations) || 1317 + !list_empty(&cache->completed_migrations) || 1318 + !list_empty(&cache->need_commit_migrations); 1319 + } 1320 + 1321 + static void do_worker(struct work_struct *ws) 1322 + { 1323 + struct cache *cache = container_of(ws, struct cache, worker); 1324 + 1325 + do { 1326 + if (!is_quiescing(cache)) 1327 + process_deferred_bios(cache); 1328 + 1329 + process_migrations(cache, &cache->quiesced_migrations, issue_copy); 1330 + process_migrations(cache, &cache->completed_migrations, complete_migration); 1331 + 1332 + writeback_some_dirty_blocks(cache); 1333 + 1334 + if (commit_if_needed(cache)) { 1335 + process_deferred_flush_bios(cache, false); 1336 + 1337 + /* 1338 + * FIXME: rollback metadata or just go into a 1339 + * failure mode and error everything 1340 + */ 1341 + } else { 1342 + process_deferred_flush_bios(cache, true); 1343 + process_migrations(cache, &cache->need_commit_migrations, 1344 + migration_success_post_commit); 1345 + } 1346 + } while (more_work(cache)); 1347 + } 1348 + 1349 + /* 1350 + * We want to commit periodically so that not too much 
1351 + * unwritten metadata builds up. 1352 + */ 1353 + static void do_waker(struct work_struct *ws) 1354 + { 1355 + struct cache *cache = container_of(to_delayed_work(ws), struct cache, waker); 1356 + wake_worker(cache); 1357 + queue_delayed_work(cache->wq, &cache->waker, COMMIT_PERIOD); 1358 + } 1359 + 1360 + /*----------------------------------------------------------------*/ 1361 + 1362 + static int is_congested(struct dm_dev *dev, int bdi_bits) 1363 + { 1364 + struct request_queue *q = bdev_get_queue(dev->bdev); 1365 + return bdi_congested(&q->backing_dev_info, bdi_bits); 1366 + } 1367 + 1368 + static int cache_is_congested(struct dm_target_callbacks *cb, int bdi_bits) 1369 + { 1370 + struct cache *cache = container_of(cb, struct cache, callbacks); 1371 + 1372 + return is_congested(cache->origin_dev, bdi_bits) || 1373 + is_congested(cache->cache_dev, bdi_bits); 1374 + } 1375 + 1376 + /*---------------------------------------------------------------- 1377 + * Target methods 1378 + *--------------------------------------------------------------*/ 1379 + 1380 + /* 1381 + * This function gets called on the error paths of the constructor, so we 1382 + * have to cope with a partially initialised struct. 
1383 + */ 1384 + static void destroy(struct cache *cache) 1385 + { 1386 + unsigned i; 1387 + 1388 + if (cache->next_migration) 1389 + mempool_free(cache->next_migration, cache->migration_pool); 1390 + 1391 + if (cache->migration_pool) 1392 + mempool_destroy(cache->migration_pool); 1393 + 1394 + if (cache->all_io_ds) 1395 + dm_deferred_set_destroy(cache->all_io_ds); 1396 + 1397 + if (cache->prison) 1398 + dm_bio_prison_destroy(cache->prison); 1399 + 1400 + if (cache->wq) 1401 + destroy_workqueue(cache->wq); 1402 + 1403 + if (cache->dirty_bitset) 1404 + free_bitset(cache->dirty_bitset); 1405 + 1406 + if (cache->discard_bitset) 1407 + free_bitset(cache->discard_bitset); 1408 + 1409 + if (cache->copier) 1410 + dm_kcopyd_client_destroy(cache->copier); 1411 + 1412 + if (cache->cmd) 1413 + dm_cache_metadata_close(cache->cmd); 1414 + 1415 + if (cache->metadata_dev) 1416 + dm_put_device(cache->ti, cache->metadata_dev); 1417 + 1418 + if (cache->origin_dev) 1419 + dm_put_device(cache->ti, cache->origin_dev); 1420 + 1421 + if (cache->cache_dev) 1422 + dm_put_device(cache->ti, cache->cache_dev); 1423 + 1424 + if (cache->policy) 1425 + dm_cache_policy_destroy(cache->policy); 1426 + 1427 + for (i = 0; i < cache->nr_ctr_args ; i++) 1428 + kfree(cache->ctr_args[i]); 1429 + kfree(cache->ctr_args); 1430 + 1431 + kfree(cache); 1432 + } 1433 + 1434 + static void cache_dtr(struct dm_target *ti) 1435 + { 1436 + struct cache *cache = ti->private; 1437 + 1438 + destroy(cache); 1439 + } 1440 + 1441 + static sector_t get_dev_size(struct dm_dev *dev) 1442 + { 1443 + return i_size_read(dev->bdev->bd_inode) >> SECTOR_SHIFT; 1444 + } 1445 + 1446 + /*----------------------------------------------------------------*/ 1447 + 1448 + /* 1449 + * Construct a cache device mapping. 
1450 + * 1451 + * cache <metadata dev> <cache dev> <origin dev> <block size> 1452 + * <#feature args> [<feature arg>]* 1453 + * <policy> <#policy args> [<policy arg>]* 1454 + * 1455 + * metadata dev : fast device holding the persistent metadata 1456 + * cache dev : fast device holding cached data blocks 1457 + * origin dev : slow device holding original data blocks 1458 + * block size : cache unit size in sectors 1459 + * 1460 + * #feature args : number of feature arguments passed 1461 + * feature args : writethrough. (The default is writeback.) 1462 + * 1463 + * policy : the replacement policy to use 1464 + * #policy args : an even number of policy arguments corresponding 1465 + * to key/value pairs passed to the policy 1466 + * policy args : key/value pairs passed to the policy 1467 + * E.g. 'sequential_threshold 1024' 1468 + * See cache-policies.txt for details. 1469 + * 1470 + * Optional feature arguments are: 1471 + * writethrough : write through caching that prohibits cache block 1472 + * content from being different from origin block content. 1473 + * Without this argument, the default behaviour is to write 1474 + * back cache block contents later for performance reasons, 1475 + * so they may differ from the corresponding origin blocks. 
1476 + */
1477 + struct cache_args {
1478 + struct dm_target *ti;
1479 +
1480 + struct dm_dev *metadata_dev;
1481 +
1482 + struct dm_dev *cache_dev;
1483 + sector_t cache_sectors;
1484 +
1485 + struct dm_dev *origin_dev;
1486 + sector_t origin_sectors;
1487 +
1488 + uint32_t block_size;
1489 +
1490 + const char *policy_name;
1491 + int policy_argc;
1492 + const char **policy_argv;
1493 +
1494 + struct cache_features features;
1495 + };
1496 +
1497 + static void destroy_cache_args(struct cache_args *ca)
1498 + {
1499 + if (ca->metadata_dev)
1500 + dm_put_device(ca->ti, ca->metadata_dev);
1501 +
1502 + if (ca->cache_dev)
1503 + dm_put_device(ca->ti, ca->cache_dev);
1504 +
1505 + if (ca->origin_dev)
1506 + dm_put_device(ca->ti, ca->origin_dev);
1507 +
1508 + kfree(ca);
1509 + }
1510 +
1511 + static bool at_least_one_arg(struct dm_arg_set *as, char **error)
1512 + {
1513 + if (!as->argc) {
1514 + *error = "Insufficient args";
1515 + return false;
1516 + }
1517 +
1518 + return true;
1519 + }
1520 +
1521 + static int parse_metadata_dev(struct cache_args *ca, struct dm_arg_set *as,
1522 + char **error)
1523 + {
1524 + int r;
1525 + sector_t metadata_dev_size;
1526 + char b[BDEVNAME_SIZE];
1527 +
1528 + if (!at_least_one_arg(as, error))
1529 + return -EINVAL;
1530 +
1531 + r = dm_get_device(ca->ti, dm_shift_arg(as), FMODE_READ | FMODE_WRITE,
1532 + &ca->metadata_dev);
1533 + if (r) {
1534 + *error = "Error opening metadata device";
1535 + return r;
1536 + }
1537 +
1538 + metadata_dev_size = get_dev_size(ca->metadata_dev);
1539 + if (metadata_dev_size > DM_CACHE_METADATA_MAX_SECTORS_WARNING)
1540 + DMWARN("Metadata device %s is larger than %u sectors: excess space will not be used.",
1541 + bdevname(ca->metadata_dev->bdev, b), DM_CACHE_METADATA_MAX_SECTORS_WARNING);
1542 +
1543 + return 0;
1544 + }
1545 +
1546 + static int parse_cache_dev(struct cache_args *ca, struct dm_arg_set *as,
1547 + char **error)
1548 + {
1549 + int r;
1550 +
1551 + if (!at_least_one_arg(as, error))
1552 +
return -EINVAL; 1553 + 1554 + r = dm_get_device(ca->ti, dm_shift_arg(as), FMODE_READ | FMODE_WRITE, 1555 + &ca->cache_dev); 1556 + if (r) { 1557 + *error = "Error opening cache device"; 1558 + return r; 1559 + } 1560 + ca->cache_sectors = get_dev_size(ca->cache_dev); 1561 + 1562 + return 0; 1563 + } 1564 + 1565 + static int parse_origin_dev(struct cache_args *ca, struct dm_arg_set *as, 1566 + char **error) 1567 + { 1568 + int r; 1569 + 1570 + if (!at_least_one_arg(as, error)) 1571 + return -EINVAL; 1572 + 1573 + r = dm_get_device(ca->ti, dm_shift_arg(as), FMODE_READ | FMODE_WRITE, 1574 + &ca->origin_dev); 1575 + if (r) { 1576 + *error = "Error opening origin device"; 1577 + return r; 1578 + } 1579 + 1580 + ca->origin_sectors = get_dev_size(ca->origin_dev); 1581 + if (ca->ti->len > ca->origin_sectors) { 1582 + *error = "Device size larger than cached device"; 1583 + return -EINVAL; 1584 + } 1585 + 1586 + return 0; 1587 + } 1588 + 1589 + static int parse_block_size(struct cache_args *ca, struct dm_arg_set *as, 1590 + char **error) 1591 + { 1592 + unsigned long tmp; 1593 + 1594 + if (!at_least_one_arg(as, error)) 1595 + return -EINVAL; 1596 + 1597 + if (kstrtoul(dm_shift_arg(as), 10, &tmp) || !tmp || 1598 + tmp < DATA_DEV_BLOCK_SIZE_MIN_SECTORS || 1599 + tmp & (DATA_DEV_BLOCK_SIZE_MIN_SECTORS - 1)) { 1600 + *error = "Invalid data block size"; 1601 + return -EINVAL; 1602 + } 1603 + 1604 + if (tmp > ca->cache_sectors) { 1605 + *error = "Data block size is larger than the cache device"; 1606 + return -EINVAL; 1607 + } 1608 + 1609 + ca->block_size = tmp; 1610 + 1611 + return 0; 1612 + } 1613 + 1614 + static void init_features(struct cache_features *cf) 1615 + { 1616 + cf->mode = CM_WRITE; 1617 + cf->write_through = false; 1618 + } 1619 + 1620 + static int parse_features(struct cache_args *ca, struct dm_arg_set *as, 1621 + char **error) 1622 + { 1623 + static struct dm_arg _args[] = { 1624 + {0, 1, "Invalid number of cache feature arguments"}, 1625 + }; 1626 + 1627 + int 
r; 1628 + unsigned argc; 1629 + const char *arg; 1630 + struct cache_features *cf = &ca->features; 1631 + 1632 + init_features(cf); 1633 + 1634 + r = dm_read_arg_group(_args, as, &argc, error); 1635 + if (r) 1636 + return -EINVAL; 1637 + 1638 + while (argc--) { 1639 + arg = dm_shift_arg(as); 1640 + 1641 + if (!strcasecmp(arg, "writeback")) 1642 + cf->write_through = false; 1643 + 1644 + else if (!strcasecmp(arg, "writethrough")) 1645 + cf->write_through = true; 1646 + 1647 + else { 1648 + *error = "Unrecognised cache feature requested"; 1649 + return -EINVAL; 1650 + } 1651 + } 1652 + 1653 + return 0; 1654 + } 1655 + 1656 + static int parse_policy(struct cache_args *ca, struct dm_arg_set *as, 1657 + char **error) 1658 + { 1659 + static struct dm_arg _args[] = { 1660 + {0, 1024, "Invalid number of policy arguments"}, 1661 + }; 1662 + 1663 + int r; 1664 + 1665 + if (!at_least_one_arg(as, error)) 1666 + return -EINVAL; 1667 + 1668 + ca->policy_name = dm_shift_arg(as); 1669 + 1670 + r = dm_read_arg_group(_args, as, &ca->policy_argc, error); 1671 + if (r) 1672 + return -EINVAL; 1673 + 1674 + ca->policy_argv = (const char **)as->argv; 1675 + dm_consume_args(as, ca->policy_argc); 1676 + 1677 + return 0; 1678 + } 1679 + 1680 + static int parse_cache_args(struct cache_args *ca, int argc, char **argv, 1681 + char **error) 1682 + { 1683 + int r; 1684 + struct dm_arg_set as; 1685 + 1686 + as.argc = argc; 1687 + as.argv = argv; 1688 + 1689 + r = parse_metadata_dev(ca, &as, error); 1690 + if (r) 1691 + return r; 1692 + 1693 + r = parse_cache_dev(ca, &as, error); 1694 + if (r) 1695 + return r; 1696 + 1697 + r = parse_origin_dev(ca, &as, error); 1698 + if (r) 1699 + return r; 1700 + 1701 + r = parse_block_size(ca, &as, error); 1702 + if (r) 1703 + return r; 1704 + 1705 + r = parse_features(ca, &as, error); 1706 + if (r) 1707 + return r; 1708 + 1709 + r = parse_policy(ca, &as, error); 1710 + if (r) 1711 + return r; 1712 + 1713 + return 0; 1714 + } 1715 + 1716 + 
/*----------------------------------------------------------------*/ 1717 + 1718 + static struct kmem_cache *migration_cache; 1719 + 1720 + static int set_config_values(struct dm_cache_policy *p, int argc, const char **argv) 1721 + { 1722 + int r = 0; 1723 + 1724 + if (argc & 1) { 1725 + DMWARN("Odd number of policy arguments given but they should be <key> <value> pairs."); 1726 + return -EINVAL; 1727 + } 1728 + 1729 + while (argc) { 1730 + r = policy_set_config_value(p, argv[0], argv[1]); 1731 + if (r) { 1732 + DMWARN("policy_set_config_value failed: key = '%s', value = '%s'", 1733 + argv[0], argv[1]); 1734 + return r; 1735 + } 1736 + 1737 + argc -= 2; 1738 + argv += 2; 1739 + } 1740 + 1741 + return r; 1742 + } 1743 + 1744 + static int create_cache_policy(struct cache *cache, struct cache_args *ca, 1745 + char **error) 1746 + { 1747 + int r; 1748 + 1749 + cache->policy = dm_cache_policy_create(ca->policy_name, 1750 + cache->cache_size, 1751 + cache->origin_sectors, 1752 + cache->sectors_per_block); 1753 + if (!cache->policy) { 1754 + *error = "Error creating cache's policy"; 1755 + return -ENOMEM; 1756 + } 1757 + 1758 + r = set_config_values(cache->policy, ca->policy_argc, ca->policy_argv); 1759 + if (r) 1760 + dm_cache_policy_destroy(cache->policy); 1761 + 1762 + return r; 1763 + } 1764 + 1765 + /* 1766 + * We want the discard block size to be a power of two, at least the size 1767 + * of the cache block size, and have no more than 2^14 discard blocks 1768 + * across the origin. 
1769 + */
1770 + #define MAX_DISCARD_BLOCKS (1 << 14)
1771 +
1772 + static bool too_many_discard_blocks(sector_t discard_block_size,
1773 + sector_t origin_size)
1774 + {
1775 + (void) sector_div(origin_size, discard_block_size);
1776 +
1777 + return origin_size > MAX_DISCARD_BLOCKS;
1778 + }
1779 +
1780 + static sector_t calculate_discard_block_size(sector_t cache_block_size,
1781 + sector_t origin_size)
1782 + {
1783 + sector_t discard_block_size;
1784 +
1785 + discard_block_size = roundup_pow_of_two(cache_block_size);
1786 +
1787 + if (origin_size)
1788 + while (too_many_discard_blocks(discard_block_size, origin_size))
1789 + discard_block_size *= 2;
1790 +
1791 + return discard_block_size;
1792 + }
1793 +
1794 + #define DEFAULT_MIGRATION_THRESHOLD (2048 * 100)
1795 +
1796 + static unsigned cache_num_write_bios(struct dm_target *ti, struct bio *bio);
1797 +
1798 + static int cache_create(struct cache_args *ca, struct cache **result)
1799 + {
1800 + int r = -ENOMEM; /* several allocation failures below goto bad without setting r */
1801 + char **error = &ca->ti->error;
1802 + struct cache *cache;
1803 + struct dm_target *ti = ca->ti;
1804 + dm_block_t origin_blocks;
1805 + struct dm_cache_metadata *cmd;
1806 + bool may_format = ca->features.mode == CM_WRITE;
1807 +
1808 + cache = kzalloc(sizeof(*cache), GFP_KERNEL);
1809 + if (!cache)
1810 + return -ENOMEM;
1811 +
1812 + cache->ti = ca->ti;
1813 + ti->private = cache;
1814 + ti->per_bio_data_size = sizeof(struct per_bio_data);
1815 + ti->num_flush_bios = 2;
1816 + ti->flush_supported = true;
1817 +
1818 + ti->num_discard_bios = 1;
1819 + ti->discards_supported = true;
1820 + ti->discard_zeroes_data_unsupported = true;
1821 +
1822 + memcpy(&cache->features, &ca->features, sizeof(cache->features));
1823 +
1824 + if (cache->features.write_through)
1825 + ti->num_write_bios = cache_num_write_bios;
1826 +
1827 + cache->callbacks.congested_fn = cache_is_congested;
1828 + dm_table_add_target_callbacks(ti->table, &cache->callbacks);
1829 +
1830 + cache->metadata_dev = ca->metadata_dev;
1831 +
cache->origin_dev = ca->origin_dev; 1832 + cache->cache_dev = ca->cache_dev; 1833 + 1834 + ca->metadata_dev = ca->origin_dev = ca->cache_dev = NULL; 1835 + 1836 + /* FIXME: factor out this whole section */ 1837 + origin_blocks = cache->origin_sectors = ca->origin_sectors; 1838 + (void) sector_div(origin_blocks, ca->block_size); 1839 + cache->origin_blocks = to_oblock(origin_blocks); 1840 + 1841 + cache->sectors_per_block = ca->block_size; 1842 + if (dm_set_target_max_io_len(ti, cache->sectors_per_block)) { 1843 + r = -EINVAL; 1844 + goto bad; 1845 + } 1846 + 1847 + if (ca->block_size & (ca->block_size - 1)) { 1848 + dm_block_t cache_size = ca->cache_sectors; 1849 + 1850 + cache->sectors_per_block_shift = -1; 1851 + (void) sector_div(cache_size, ca->block_size); 1852 + cache->cache_size = to_cblock(cache_size); 1853 + } else { 1854 + cache->sectors_per_block_shift = __ffs(ca->block_size); 1855 + cache->cache_size = to_cblock(ca->cache_sectors >> cache->sectors_per_block_shift); 1856 + } 1857 + 1858 + r = create_cache_policy(cache, ca, error); 1859 + if (r) 1860 + goto bad; 1861 + cache->policy_nr_args = ca->policy_argc; 1862 + 1863 + cmd = dm_cache_metadata_open(cache->metadata_dev->bdev, 1864 + ca->block_size, may_format, 1865 + dm_cache_policy_get_hint_size(cache->policy)); 1866 + if (IS_ERR(cmd)) { 1867 + *error = "Error creating metadata object"; 1868 + r = PTR_ERR(cmd); 1869 + goto bad; 1870 + } 1871 + cache->cmd = cmd; 1872 + 1873 + spin_lock_init(&cache->lock); 1874 + bio_list_init(&cache->deferred_bios); 1875 + bio_list_init(&cache->deferred_flush_bios); 1876 + INIT_LIST_HEAD(&cache->quiesced_migrations); 1877 + INIT_LIST_HEAD(&cache->completed_migrations); 1878 + INIT_LIST_HEAD(&cache->need_commit_migrations); 1879 + cache->migration_threshold = DEFAULT_MIGRATION_THRESHOLD; 1880 + atomic_set(&cache->nr_migrations, 0); 1881 + init_waitqueue_head(&cache->migration_wait); 1882 + 1883 + cache->nr_dirty = 0; 1884 + cache->dirty_bitset = 
alloc_bitset(from_cblock(cache->cache_size)); 1885 + if (!cache->dirty_bitset) { 1886 + *error = "could not allocate dirty bitset"; 1887 + goto bad; 1888 + } 1889 + clear_bitset(cache->dirty_bitset, from_cblock(cache->cache_size)); 1890 + 1891 + cache->discard_block_size = 1892 + calculate_discard_block_size(cache->sectors_per_block, 1893 + cache->origin_sectors); 1894 + cache->discard_nr_blocks = oblock_to_dblock(cache, cache->origin_blocks); 1895 + cache->discard_bitset = alloc_bitset(from_dblock(cache->discard_nr_blocks)); 1896 + if (!cache->discard_bitset) { 1897 + *error = "could not allocate discard bitset"; 1898 + goto bad; 1899 + } 1900 + clear_bitset(cache->discard_bitset, from_dblock(cache->discard_nr_blocks)); 1901 + 1902 + cache->copier = dm_kcopyd_client_create(&dm_kcopyd_throttle); 1903 + if (IS_ERR(cache->copier)) { 1904 + *error = "could not create kcopyd client"; 1905 + r = PTR_ERR(cache->copier); 1906 + goto bad; 1907 + } 1908 + 1909 + cache->wq = alloc_ordered_workqueue("dm-" DM_MSG_PREFIX, WQ_MEM_RECLAIM); 1910 + if (!cache->wq) { 1911 + *error = "could not create workqueue for metadata object"; 1912 + goto bad; 1913 + } 1914 + INIT_WORK(&cache->worker, do_worker); 1915 + INIT_DELAYED_WORK(&cache->waker, do_waker); 1916 + cache->last_commit_jiffies = jiffies; 1917 + 1918 + cache->prison = dm_bio_prison_create(PRISON_CELLS); 1919 + if (!cache->prison) { 1920 + *error = "could not create bio prison"; 1921 + goto bad; 1922 + } 1923 + 1924 + cache->all_io_ds = dm_deferred_set_create(); 1925 + if (!cache->all_io_ds) { 1926 + *error = "could not create all_io deferred set"; 1927 + goto bad; 1928 + } 1929 + 1930 + cache->migration_pool = mempool_create_slab_pool(MIGRATION_POOL_SIZE, 1931 + migration_cache); 1932 + if (!cache->migration_pool) { 1933 + *error = "Error creating cache's migration mempool"; 1934 + goto bad; 1935 + } 1936 + 1937 + cache->next_migration = NULL; 1938 + 1939 + cache->need_tick_bio = true; 1940 + cache->sized = false; 1941 + 
cache->quiescing = false;
1942 + cache->commit_requested = false;
1943 + cache->loaded_mappings = false;
1944 + cache->loaded_discards = false;
1945 +
1946 + load_stats(cache);
1947 +
1948 + atomic_set(&cache->stats.demotion, 0);
1949 + atomic_set(&cache->stats.promotion, 0);
1950 + atomic_set(&cache->stats.copies_avoided, 0);
1951 + atomic_set(&cache->stats.cache_cell_clash, 0);
1952 + atomic_set(&cache->stats.commit_count, 0);
1953 + atomic_set(&cache->stats.discard_count, 0);
1954 +
1955 + *result = cache;
1956 + return 0;
1957 +
1958 + bad:
1959 + destroy(cache);
1960 + return r;
1961 + }
1962 +
1963 + static int copy_ctr_args(struct cache *cache, int argc, const char **argv)
1964 + {
1965 + unsigned i;
1966 + const char **copy;
1967 +
1968 + copy = kcalloc(argc, sizeof(*copy), GFP_KERNEL);
1969 + if (!copy)
1970 + return -ENOMEM;
1971 + for (i = 0; i < argc; i++) {
1972 + copy[i] = kstrdup(argv[i], GFP_KERNEL);
1973 + if (!copy[i]) {
1974 + while (i--)
1975 + kfree(copy[i]);
1976 + kfree(copy);
1977 + return -ENOMEM;
1978 + }
1979 + }
1980 +
1981 + cache->nr_ctr_args = argc;
1982 + cache->ctr_args = copy;
1983 +
1984 + return 0;
1985 + }
1986 +
1987 + static int cache_ctr(struct dm_target *ti, unsigned argc, char **argv)
1988 + {
1989 + int r = -EINVAL;
1990 + struct cache_args *ca;
1991 + struct cache *cache = NULL;
1992 +
1993 + ca = kzalloc(sizeof(*ca), GFP_KERNEL);
1994 + if (!ca) {
1995 + ti->error = "Error allocating memory for cache";
1996 + return -ENOMEM;
1997 + }
1998 + ca->ti = ti;
1999 +
2000 + r = parse_cache_args(ca, argc, argv, &ti->error);
2001 + if (r)
2002 + goto out;
2003 +
2004 + r = cache_create(ca, &cache);
2005 + if (r) goto out;
2006 + r = copy_ctr_args(cache, argc - 3, (const char **)argv + 3);
2007 + if (r) {
2008 + destroy(cache);
2009 + goto out;
2010 + }
2011 +
2012 + ti->private = cache;
2013 +
2014 + out:
2015 + destroy_cache_args(ca);
2016 + return r;
2017 + }
2018 +
2019 + static unsigned cache_num_write_bios(struct dm_target *ti, struct bio
*bio) 2020 + { 2021 + int r; 2022 + struct cache *cache = ti->private; 2023 + dm_oblock_t block = get_bio_block(cache, bio); 2024 + dm_cblock_t cblock; 2025 + 2026 + r = policy_lookup(cache->policy, block, &cblock); 2027 + if (r < 0) 2028 + return 2; /* assume the worst */ 2029 + 2030 + return (!r && !is_dirty(cache, cblock)) ? 2 : 1; 2031 + } 2032 + 2033 + static int cache_map(struct dm_target *ti, struct bio *bio) 2034 + { 2035 + struct cache *cache = ti->private; 2036 + 2037 + int r; 2038 + dm_oblock_t block = get_bio_block(cache, bio); 2039 + bool can_migrate = false; 2040 + bool discarded_block; 2041 + struct dm_bio_prison_cell *cell; 2042 + struct policy_result lookup_result; 2043 + struct per_bio_data *pb; 2044 + 2045 + if (from_oblock(block) > from_oblock(cache->origin_blocks)) { 2046 + /* 2047 + * This can only occur if the io goes to a partial block at 2048 + * the end of the origin device. We don't cache these. 2049 + * Just remap to the origin and carry on. 2050 + */ 2051 + remap_to_origin_clear_discard(cache, bio, block); 2052 + return DM_MAPIO_REMAPPED; 2053 + } 2054 + 2055 + pb = init_per_bio_data(bio); 2056 + 2057 + if (bio->bi_rw & (REQ_FLUSH | REQ_FUA | REQ_DISCARD)) { 2058 + defer_bio(cache, bio); 2059 + return DM_MAPIO_SUBMITTED; 2060 + } 2061 + 2062 + /* 2063 + * Check to see if that block is currently migrating. 
2064 + */ 2065 + cell = alloc_prison_cell(cache); 2066 + if (!cell) { 2067 + defer_bio(cache, bio); 2068 + return DM_MAPIO_SUBMITTED; 2069 + } 2070 + 2071 + r = bio_detain(cache, block, bio, cell, 2072 + (cell_free_fn) free_prison_cell, 2073 + cache, &cell); 2074 + if (r) { 2075 + if (r < 0) 2076 + defer_bio(cache, bio); 2077 + 2078 + return DM_MAPIO_SUBMITTED; 2079 + } 2080 + 2081 + discarded_block = is_discarded_oblock(cache, block); 2082 + 2083 + r = policy_map(cache->policy, block, false, can_migrate, discarded_block, 2084 + bio, &lookup_result); 2085 + if (r == -EWOULDBLOCK) { 2086 + cell_defer(cache, cell, true); 2087 + return DM_MAPIO_SUBMITTED; 2088 + 2089 + } else if (r) { 2090 + DMERR_LIMIT("Unexpected return from cache replacement policy: %d", r); 2091 + bio_io_error(bio); 2092 + return DM_MAPIO_SUBMITTED; 2093 + } 2094 + 2095 + switch (lookup_result.op) { 2096 + case POLICY_HIT: 2097 + inc_hit_counter(cache, bio); 2098 + pb->all_io_entry = dm_deferred_entry_inc(cache->all_io_ds); 2099 + 2100 + if (is_writethrough_io(cache, bio, lookup_result.cblock)) { 2101 + /* 2102 + * No need to mark anything dirty in write through mode. 2103 + */ 2104 + pb->req_nr == 0 ? 2105 + remap_to_cache(cache, bio, lookup_result.cblock) : 2106 + remap_to_origin_clear_discard(cache, bio, block); 2107 + cell_defer(cache, cell, false); 2108 + } else { 2109 + remap_to_cache_dirty(cache, bio, block, lookup_result.cblock); 2110 + cell_defer(cache, cell, false); 2111 + } 2112 + break; 2113 + 2114 + case POLICY_MISS: 2115 + inc_miss_counter(cache, bio); 2116 + pb->all_io_entry = dm_deferred_entry_inc(cache->all_io_ds); 2117 + 2118 + if (pb->req_nr != 0) { 2119 + /* 2120 + * This is a duplicate writethrough io that is no 2121 + * longer needed because the block has been demoted. 
2122 + */ 2123 + bio_endio(bio, 0); 2124 + cell_defer(cache, cell, false); 2125 + return DM_MAPIO_SUBMITTED; 2126 + } else { 2127 + remap_to_origin_clear_discard(cache, bio, block); 2128 + cell_defer(cache, cell, false); 2129 + } 2130 + break; 2131 + 2132 + default: 2133 + DMERR_LIMIT("%s: erroring bio: unknown policy op: %u", __func__, 2134 + (unsigned) lookup_result.op); 2135 + bio_io_error(bio); 2136 + return DM_MAPIO_SUBMITTED; 2137 + } 2138 + 2139 + return DM_MAPIO_REMAPPED; 2140 + } 2141 + 2142 + static int cache_end_io(struct dm_target *ti, struct bio *bio, int error) 2143 + { 2144 + struct cache *cache = ti->private; 2145 + unsigned long flags; 2146 + struct per_bio_data *pb = get_per_bio_data(bio); 2147 + 2148 + if (pb->tick) { 2149 + policy_tick(cache->policy); 2150 + 2151 + spin_lock_irqsave(&cache->lock, flags); 2152 + cache->need_tick_bio = true; 2153 + spin_unlock_irqrestore(&cache->lock, flags); 2154 + } 2155 + 2156 + check_for_quiesced_migrations(cache, pb); 2157 + 2158 + return 0; 2159 + } 2160 + 2161 + static int write_dirty_bitset(struct cache *cache) 2162 + { 2163 + unsigned i, r; 2164 + 2165 + for (i = 0; i < from_cblock(cache->cache_size); i++) { 2166 + r = dm_cache_set_dirty(cache->cmd, to_cblock(i), 2167 + is_dirty(cache, to_cblock(i))); 2168 + if (r) 2169 + return r; 2170 + } 2171 + 2172 + return 0; 2173 + } 2174 + 2175 + static int write_discard_bitset(struct cache *cache) 2176 + { 2177 + unsigned i, r; 2178 + 2179 + r = dm_cache_discard_bitset_resize(cache->cmd, cache->discard_block_size, 2180 + cache->discard_nr_blocks); 2181 + if (r) { 2182 + DMERR("could not resize on-disk discard bitset"); 2183 + return r; 2184 + } 2185 + 2186 + for (i = 0; i < from_dblock(cache->discard_nr_blocks); i++) { 2187 + r = dm_cache_set_discard(cache->cmd, to_dblock(i), 2188 + is_discarded(cache, to_dblock(i))); 2189 + if (r) 2190 + return r; 2191 + } 2192 + 2193 + return 0; 2194 + } 2195 + 2196 + static int save_hint(void *context, dm_cblock_t cblock, 
dm_oblock_t oblock, 2197 + uint32_t hint) 2198 + { 2199 + struct cache *cache = context; 2200 + return dm_cache_save_hint(cache->cmd, cblock, hint); 2201 + } 2202 + 2203 + static int write_hints(struct cache *cache) 2204 + { 2205 + int r; 2206 + 2207 + r = dm_cache_begin_hints(cache->cmd, cache->policy); 2208 + if (r) { 2209 + DMERR("dm_cache_begin_hints failed"); 2210 + return r; 2211 + } 2212 + 2213 + r = policy_walk_mappings(cache->policy, save_hint, cache); 2214 + if (r) 2215 + DMERR("policy_walk_mappings failed"); 2216 + 2217 + return r; 2218 + } 2219 + 2220 + /* 2221 + * returns true on success 2222 + */ 2223 + static bool sync_metadata(struct cache *cache) 2224 + { 2225 + int r1, r2, r3, r4; 2226 + 2227 + r1 = write_dirty_bitset(cache); 2228 + if (r1) 2229 + DMERR("could not write dirty bitset"); 2230 + 2231 + r2 = write_discard_bitset(cache); 2232 + if (r2) 2233 + DMERR("could not write discard bitset"); 2234 + 2235 + save_stats(cache); 2236 + 2237 + r3 = write_hints(cache); 2238 + if (r3) 2239 + DMERR("could not write hints"); 2240 + 2241 + /* 2242 + * If writing the above metadata failed, we still commit, but don't 2243 + * set the clean shutdown flag. This will effectively force every 2244 + * dirty bit to be set on reload. 2245 + */ 2246 + r4 = dm_cache_commit(cache->cmd, !r1 && !r2 && !r3); 2247 + if (r4) 2248 + DMERR("could not write cache metadata. 
Data loss may occur."); 2249 + 2250 + return !r1 && !r2 && !r3 && !r4; 2251 + } 2252 + 2253 + static void cache_postsuspend(struct dm_target *ti) 2254 + { 2255 + struct cache *cache = ti->private; 2256 + 2257 + start_quiescing(cache); 2258 + wait_for_migrations(cache); 2259 + stop_worker(cache); 2260 + requeue_deferred_io(cache); 2261 + stop_quiescing(cache); 2262 + 2263 + (void) sync_metadata(cache); 2264 + } 2265 + 2266 + static int load_mapping(void *context, dm_oblock_t oblock, dm_cblock_t cblock, 2267 + bool dirty, uint32_t hint, bool hint_valid) 2268 + { 2269 + int r; 2270 + struct cache *cache = context; 2271 + 2272 + r = policy_load_mapping(cache->policy, oblock, cblock, hint, hint_valid); 2273 + if (r) 2274 + return r; 2275 + 2276 + if (dirty) 2277 + set_dirty(cache, oblock, cblock); 2278 + else 2279 + clear_dirty(cache, oblock, cblock); 2280 + 2281 + return 0; 2282 + } 2283 + 2284 + static int load_discard(void *context, sector_t discard_block_size, 2285 + dm_dblock_t dblock, bool discard) 2286 + { 2287 + struct cache *cache = context; 2288 + 2289 + /* FIXME: handle mis-matched block size */ 2290 + 2291 + if (discard) 2292 + set_discard(cache, dblock); 2293 + else 2294 + clear_discard(cache, dblock); 2295 + 2296 + return 0; 2297 + } 2298 + 2299 + static int cache_preresume(struct dm_target *ti) 2300 + { 2301 + int r = 0; 2302 + struct cache *cache = ti->private; 2303 + sector_t actual_cache_size = get_dev_size(cache->cache_dev); 2304 + (void) sector_div(actual_cache_size, cache->sectors_per_block); 2305 + 2306 + /* 2307 + * Check to see if the cache has resized. 
2308 + */ 2309 + if (from_cblock(cache->cache_size) != actual_cache_size || !cache->sized) { 2310 + cache->cache_size = to_cblock(actual_cache_size); 2311 + 2312 + r = dm_cache_resize(cache->cmd, cache->cache_size); 2313 + if (r) { 2314 + DMERR("could not resize cache metadata"); 2315 + return r; 2316 + } 2317 + 2318 + cache->sized = true; 2319 + } 2320 + 2321 + if (!cache->loaded_mappings) { 2322 + r = dm_cache_load_mappings(cache->cmd, 2323 + dm_cache_policy_get_name(cache->policy), 2324 + load_mapping, cache); 2325 + if (r) { 2326 + DMERR("could not load cache mappings"); 2327 + return r; 2328 + } 2329 + 2330 + cache->loaded_mappings = true; 2331 + } 2332 + 2333 + if (!cache->loaded_discards) { 2334 + r = dm_cache_load_discards(cache->cmd, load_discard, cache); 2335 + if (r) { 2336 + DMERR("could not load origin discards"); 2337 + return r; 2338 + } 2339 + 2340 + cache->loaded_discards = true; 2341 + } 2342 + 2343 + return r; 2344 + } 2345 + 2346 + static void cache_resume(struct dm_target *ti) 2347 + { 2348 + struct cache *cache = ti->private; 2349 + 2350 + cache->need_tick_bio = true; 2351 + do_waker(&cache->waker.work); 2352 + } 2353 + 2354 + /* 2355 + * Status format: 2356 + * 2357 + * <#used metadata blocks>/<#total metadata blocks> 2358 + * <#read hits> <#read misses> <#write hits> <#write misses> 2359 + * <#demotions> <#promotions> <#blocks in cache> <#dirty> 2360 + * <#features> <features>* 2361 + * <#core args> <core args> 2362 + * <#policy args> <policy args>* 2363 + */ 2364 + static void cache_status(struct dm_target *ti, status_type_t type, 2365 + unsigned status_flags, char *result, unsigned maxlen) 2366 + { 2367 + int r = 0; 2368 + unsigned i; 2369 + ssize_t sz = 0; 2370 + dm_block_t nr_free_blocks_metadata = 0; 2371 + dm_block_t nr_blocks_metadata = 0; 2372 + char buf[BDEVNAME_SIZE]; 2373 + struct cache *cache = ti->private; 2374 + dm_cblock_t residency; 2375 + 2376 + switch (type) { 2377 + case STATUSTYPE_INFO: 2378 + /* Commit to ensure 
statistics aren't out-of-date */ 2379 + if (!(status_flags & DM_STATUS_NOFLUSH_FLAG) && !dm_suspended(ti)) { 2380 + r = dm_cache_commit(cache->cmd, false); 2381 + if (r) 2382 + DMERR("could not commit metadata for accurate status"); 2383 + } 2384 + 2385 + r = dm_cache_get_free_metadata_block_count(cache->cmd, 2386 + &nr_free_blocks_metadata); 2387 + if (r) { 2388 + DMERR("could not get metadata free block count"); 2389 + goto err; 2390 + } 2391 + 2392 + r = dm_cache_get_metadata_dev_size(cache->cmd, &nr_blocks_metadata); 2393 + if (r) { 2394 + DMERR("could not get metadata device size"); 2395 + goto err; 2396 + } 2397 + 2398 + residency = policy_residency(cache->policy); 2399 + 2400 + DMEMIT("%llu/%llu %u %u %u %u %u %u %llu %u ", 2401 + (unsigned long long)(nr_blocks_metadata - nr_free_blocks_metadata), 2402 + (unsigned long long)nr_blocks_metadata, 2403 + (unsigned) atomic_read(&cache->stats.read_hit), 2404 + (unsigned) atomic_read(&cache->stats.read_miss), 2405 + (unsigned) atomic_read(&cache->stats.write_hit), 2406 + (unsigned) atomic_read(&cache->stats.write_miss), 2407 + (unsigned) atomic_read(&cache->stats.demotion), 2408 + (unsigned) atomic_read(&cache->stats.promotion), 2409 + (unsigned long long) from_cblock(residency), 2410 + cache->nr_dirty); 2411 + 2412 + if (cache->features.write_through) 2413 + DMEMIT("1 writethrough "); 2414 + else 2415 + DMEMIT("0 "); 2416 + 2417 + DMEMIT("2 migration_threshold %llu ", (unsigned long long) cache->migration_threshold); 2418 + if (sz < maxlen) { 2419 + r = policy_emit_config_values(cache->policy, result + sz, maxlen - sz); 2420 + if (r) 2421 + DMERR("policy_emit_config_values returned %d", r); 2422 + } 2423 + 2424 + break; 2425 + 2426 + case STATUSTYPE_TABLE: 2427 + format_dev_t(buf, cache->metadata_dev->bdev->bd_dev); 2428 + DMEMIT("%s ", buf); 2429 + format_dev_t(buf, cache->cache_dev->bdev->bd_dev); 2430 + DMEMIT("%s ", buf); 2431 + format_dev_t(buf, cache->origin_dev->bdev->bd_dev); 2432 + DMEMIT("%s", buf); 2433 
+ 2434 + for (i = 0; i < cache->nr_ctr_args - 1; i++) 2435 + DMEMIT(" %s", cache->ctr_args[i]); 2436 + if (cache->nr_ctr_args) 2437 + DMEMIT(" %s", cache->ctr_args[cache->nr_ctr_args - 1]); 2438 + } 2439 + 2440 + return; 2441 + 2442 + err: 2443 + DMEMIT("Error"); 2444 + } 2445 + 2446 + #define NOT_CORE_OPTION 1 2447 + 2448 + static int process_config_option(struct cache *cache, char **argv) 2449 + { 2450 + unsigned long tmp; 2451 + 2452 + if (!strcasecmp(argv[0], "migration_threshold")) { 2453 + if (kstrtoul(argv[1], 10, &tmp)) 2454 + return -EINVAL; 2455 + 2456 + cache->migration_threshold = tmp; 2457 + return 0; 2458 + } 2459 + 2460 + return NOT_CORE_OPTION; 2461 + } 2462 + 2463 + /* 2464 + * Supports <key> <value>. 2465 + * 2466 + * The key migration_threshold is supported by the cache target core. 2467 + */ 2468 + static int cache_message(struct dm_target *ti, unsigned argc, char **argv) 2469 + { 2470 + int r; 2471 + struct cache *cache = ti->private; 2472 + 2473 + if (argc != 2) 2474 + return -EINVAL; 2475 + 2476 + r = process_config_option(cache, argv); 2477 + if (r == NOT_CORE_OPTION) 2478 + return policy_set_config_value(cache->policy, argv[0], argv[1]); 2479 + 2480 + return r; 2481 + } 2482 + 2483 + static int cache_iterate_devices(struct dm_target *ti, 2484 + iterate_devices_callout_fn fn, void *data) 2485 + { 2486 + int r = 0; 2487 + struct cache *cache = ti->private; 2488 + 2489 + r = fn(ti, cache->cache_dev, 0, get_dev_size(cache->cache_dev), data); 2490 + if (!r) 2491 + r = fn(ti, cache->origin_dev, 0, ti->len, data); 2492 + 2493 + return r; 2494 + } 2495 + 2496 + /* 2497 + * We assume I/O is going to the origin (which is the volume 2498 + * more likely to have restrictions e.g. by being striped). 2499 + * (Looking up the exact location of the data would be expensive 2500 + * and could always be out of date by the time the bio is submitted.) 
2501 + */ 2502 + static int cache_bvec_merge(struct dm_target *ti, 2503 + struct bvec_merge_data *bvm, 2504 + struct bio_vec *biovec, int max_size) 2505 + { 2506 + struct cache *cache = ti->private; 2507 + struct request_queue *q = bdev_get_queue(cache->origin_dev->bdev); 2508 + 2509 + if (!q->merge_bvec_fn) 2510 + return max_size; 2511 + 2512 + bvm->bi_bdev = cache->origin_dev->bdev; 2513 + return min(max_size, q->merge_bvec_fn(q, bvm, biovec)); 2514 + } 2515 + 2516 + static void set_discard_limits(struct cache *cache, struct queue_limits *limits) 2517 + { 2518 + /* 2519 + * FIXME: these limits may be incompatible with the cache device 2520 + */ 2521 + limits->max_discard_sectors = cache->discard_block_size * 1024; 2522 + limits->discard_granularity = cache->discard_block_size << SECTOR_SHIFT; 2523 + } 2524 + 2525 + static void cache_io_hints(struct dm_target *ti, struct queue_limits *limits) 2526 + { 2527 + struct cache *cache = ti->private; 2528 + 2529 + blk_limits_io_min(limits, 0); 2530 + blk_limits_io_opt(limits, cache->sectors_per_block << SECTOR_SHIFT); 2531 + set_discard_limits(cache, limits); 2532 + } 2533 + 2534 + /*----------------------------------------------------------------*/ 2535 + 2536 + static struct target_type cache_target = { 2537 + .name = "cache", 2538 + .version = {1, 0, 0}, 2539 + .module = THIS_MODULE, 2540 + .ctr = cache_ctr, 2541 + .dtr = cache_dtr, 2542 + .map = cache_map, 2543 + .end_io = cache_end_io, 2544 + .postsuspend = cache_postsuspend, 2545 + .preresume = cache_preresume, 2546 + .resume = cache_resume, 2547 + .status = cache_status, 2548 + .message = cache_message, 2549 + .iterate_devices = cache_iterate_devices, 2550 + .merge = cache_bvec_merge, 2551 + .io_hints = cache_io_hints, 2552 + }; 2553 + 2554 + static int __init dm_cache_init(void) 2555 + { 2556 + int r; 2557 + 2558 + r = dm_register_target(&cache_target); 2559 + if (r) { 2560 + DMERR("cache target registration failed: %d", r); 2561 + return r; 2562 + } 2563 + 2564 + 
migration_cache = KMEM_CACHE(dm_cache_migration, 0); 2565 + if (!migration_cache) { 2566 + dm_unregister_target(&cache_target); 2567 + return -ENOMEM; 2568 + } 2569 + 2570 + return 0; 2571 + } 2572 + 2573 + static void __exit dm_cache_exit(void) 2574 + { 2575 + dm_unregister_target(&cache_target); 2576 + kmem_cache_destroy(migration_cache); 2577 + } 2578 + 2579 + module_init(dm_cache_init); 2580 + module_exit(dm_cache_exit); 2581 + 2582 + MODULE_DESCRIPTION(DM_NAME " cache target"); 2583 + MODULE_AUTHOR("Joe Thornber <ejt@redhat.com>"); 2584 + MODULE_LICENSE("GPL");
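The cache_map() hunk above routes each bio according to the policy's lookup result: a HIT remaps to the cache device (except that in writethrough mode only the first bio, req_nr == 0, goes to the cache and its duplicate goes to the origin so both copies stay in sync), while a MISS remaps to the origin, completing any now-redundant writethrough duplicate immediately. A minimal userspace sketch of that decision table — the names `map_decision` and `remap_target` are illustrative, not from the kernel source, and "writethrough" here stands for the full is_writethrough_io() check:

```c
/* Illustrative stand-ins for the kernel's policy result and remap choices. */
enum policy_op { POLICY_HIT, POLICY_MISS };
enum remap_target { REMAP_CACHE, REMAP_ORIGIN, BIO_COMPLETE };

/*
 * Sketch of the POLICY_HIT/POLICY_MISS switch in cache_map():
 *  - hit + writethrough: only the primary bio (req_nr == 0) is remapped
 *    to the cache; the duplicate is sent to the origin, so nothing need
 *    be marked dirty.
 *  - hit otherwise: remap to the cache (and mark dirty, not modelled here).
 *  - miss: a duplicate writethrough bio (req_nr != 0) is no longer needed
 *    because the block has been demoted, so it is completed; the primary
 *    bio is remapped to the origin.
 */
static enum remap_target map_decision(enum policy_op op, int writethrough,
				      unsigned req_nr)
{
	if (op == POLICY_HIT) {
		if (writethrough)
			return req_nr == 0 ? REMAP_CACHE : REMAP_ORIGIN;
		return REMAP_CACHE;
	}
	/* POLICY_MISS */
	return req_nr != 0 ? BIO_COMPLETE : REMAP_ORIGIN;
}
```

This also explains the `pb->req_nr == 0 ? remap_to_cache(...) : remap_to_origin_clear_discard(...)` ternary in the hunk: the two arms are the two halves of the writethrough hit case.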
+12 -33
drivers/md/dm-crypt.c
··· 1234 1234 return 0; 1235 1235 } 1236 1236 1237 - /* 1238 - * Encode key into its hex representation 1239 - */ 1240 - static void crypt_encode_key(char *hex, u8 *key, unsigned int size) 1241 - { 1242 - unsigned int i; 1243 - 1244 - for (i = 0; i < size; i++) { 1245 - sprintf(hex, "%02x", *key); 1246 - hex += 2; 1247 - key++; 1248 - } 1249 - } 1250 - 1251 1237 static void crypt_free_tfms(struct crypt_config *cc) 1252 1238 { 1253 1239 unsigned i; ··· 1637 1651 1638 1652 if (opt_params == 1 && opt_string && 1639 1653 !strcasecmp(opt_string, "allow_discards")) 1640 - ti->num_discard_requests = 1; 1654 + ti->num_discard_bios = 1; 1641 1655 else if (opt_params) { 1642 1656 ret = -EINVAL; 1643 1657 ti->error = "Invalid feature arguments"; ··· 1665 1679 goto bad; 1666 1680 } 1667 1681 1668 - ti->num_flush_requests = 1; 1682 + ti->num_flush_bios = 1; 1669 1683 ti->discard_zeroes_data_unsupported = true; 1670 1684 1671 1685 return 0; ··· 1703 1717 return DM_MAPIO_SUBMITTED; 1704 1718 } 1705 1719 1706 - static int crypt_status(struct dm_target *ti, status_type_t type, 1707 - unsigned status_flags, char *result, unsigned maxlen) 1720 + static void crypt_status(struct dm_target *ti, status_type_t type, 1721 + unsigned status_flags, char *result, unsigned maxlen) 1708 1722 { 1709 1723 struct crypt_config *cc = ti->private; 1710 - unsigned int sz = 0; 1724 + unsigned i, sz = 0; 1711 1725 1712 1726 switch (type) { 1713 1727 case STATUSTYPE_INFO: ··· 1717 1731 case STATUSTYPE_TABLE: 1718 1732 DMEMIT("%s ", cc->cipher_string); 1719 1733 1720 - if (cc->key_size > 0) { 1721 - if ((maxlen - sz) < ((cc->key_size << 1) + 1)) 1722 - return -ENOMEM; 1723 - 1724 - crypt_encode_key(result + sz, cc->key, cc->key_size); 1725 - sz += cc->key_size << 1; 1726 - } else { 1727 - if (sz >= maxlen) 1728 - return -ENOMEM; 1729 - result[sz++] = '-'; 1730 - } 1734 + if (cc->key_size > 0) 1735 + for (i = 0; i < cc->key_size; i++) 1736 + DMEMIT("%02x", cc->key[i]); 1737 + else 1738 + DMEMIT("-"); 1731 
1739 1732 1740 DMEMIT(" %llu %s %llu", (unsigned long long)cc->iv_offset, 1733 1741 cc->dev->name, (unsigned long long)cc->start); 1734 1742 1735 - if (ti->num_discard_requests) 1743 + if (ti->num_discard_bios) 1736 1744 DMEMIT(" 1 allow_discards"); 1737 1745 1738 1746 break; 1739 1747 } 1740 - return 0; 1741 1748 } 1742 1749 1743 1750 static void crypt_postsuspend(struct dm_target *ti) ··· 1824 1845 1825 1846 static struct target_type crypt_target = { 1826 1847 .name = "crypt", 1827 - .version = {1, 12, 0}, 1848 + .version = {1, 12, 1}, 1828 1849 .module = THIS_MODULE, 1829 1850 .ctr = crypt_ctr, 1830 1851 .dtr = crypt_dtr,
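The crypt_status() change above drops the bespoke crypt_encode_key() helper and its -ENOMEM handling: the key is now emitted one byte at a time with DMEMIT("%02x", cc->key[i]), and buffer truncation is detected centrally by the ioctl layer instead of by each target. A hedged userspace equivalent of the byte-to-hex loop — the function name `encode_key_hex` is mine, not the kernel's:

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Emit `keylen` bytes as lowercase hex into `out`, which must hold at
 * least 2 * keylen + 1 bytes, mirroring the DMEMIT("%02x", cc->key[i])
 * loop in crypt_status(). A zero-length key prints as "-", matching
 * the cc->key_size == 0 branch in the hunk above.
 */
static void encode_key_hex(char *out, const unsigned char *key, size_t keylen)
{
	size_t i;

	if (!keylen) {
		sprintf(out, "-");
		return;
	}
	for (i = 0; i < keylen; i++)
		sprintf(out + 2 * i, "%02x", key[i]);
}
```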
+5 -7
drivers/md/dm-delay.c
··· 198 198 mutex_init(&dc->timer_lock); 199 199 atomic_set(&dc->may_delay, 1); 200 200 201 - ti->num_flush_requests = 1; 202 - ti->num_discard_requests = 1; 201 + ti->num_flush_bios = 1; 202 + ti->num_discard_bios = 1; 203 203 ti->private = dc; 204 204 return 0; 205 205 ··· 293 293 return delay_bio(dc, dc->read_delay, bio); 294 294 } 295 295 296 - static int delay_status(struct dm_target *ti, status_type_t type, 297 - unsigned status_flags, char *result, unsigned maxlen) 296 + static void delay_status(struct dm_target *ti, status_type_t type, 297 + unsigned status_flags, char *result, unsigned maxlen) 298 298 { 299 299 struct delay_c *dc = ti->private; 300 300 int sz = 0; ··· 314 314 dc->write_delay); 315 315 break; 316 316 } 317 - 318 - return 0; 319 317 } 320 318 321 319 static int delay_iterate_devices(struct dm_target *ti, ··· 335 337 336 338 static struct target_type delay_target = { 337 339 .name = "delay", 338 - .version = {1, 2, 0}, 340 + .version = {1, 2, 1}, 339 341 .module = THIS_MODULE, 340 342 .ctr = delay_ctr, 341 343 .dtr = delay_dtr,
+5 -6
drivers/md/dm-flakey.c
··· 216 216 goto bad; 217 217 } 218 218 219 - ti->num_flush_requests = 1; 220 - ti->num_discard_requests = 1; 219 + ti->num_flush_bios = 1; 220 + ti->num_discard_bios = 1; 221 221 ti->per_bio_data_size = sizeof(struct per_bio_data); 222 222 ti->private = fc; 223 223 return 0; ··· 337 337 return error; 338 338 } 339 339 340 - static int flakey_status(struct dm_target *ti, status_type_t type, 341 - unsigned status_flags, char *result, unsigned maxlen) 340 + static void flakey_status(struct dm_target *ti, status_type_t type, 341 + unsigned status_flags, char *result, unsigned maxlen) 342 342 { 343 343 unsigned sz = 0; 344 344 struct flakey_c *fc = ti->private; ··· 368 368 369 369 break; 370 370 } 371 - return 0; 372 371 } 373 372 374 373 static int flakey_ioctl(struct dm_target *ti, unsigned int cmd, unsigned long arg) ··· 410 411 411 412 static struct target_type flakey_target = { 412 413 .name = "flakey", 413 - .version = {1, 3, 0}, 414 + .version = {1, 3, 1}, 414 415 .module = THIS_MODULE, 415 416 .ctr = flakey_ctr, 416 417 .dtr = flakey_dtr,
+117 -45
drivers/md/dm-ioctl.c
··· 1067 1067 num_targets = dm_table_get_num_targets(table); 1068 1068 for (i = 0; i < num_targets; i++) { 1069 1069 struct dm_target *ti = dm_table_get_target(table, i); 1070 + size_t l; 1070 1071 1071 1072 remaining = len - (outptr - outbuf); 1072 1073 if (remaining <= sizeof(struct dm_target_spec)) { ··· 1094 1093 if (ti->type->status) { 1095 1094 if (param->flags & DM_NOFLUSH_FLAG) 1096 1095 status_flags |= DM_STATUS_NOFLUSH_FLAG; 1097 - if (ti->type->status(ti, type, status_flags, outptr, remaining)) { 1098 - param->flags |= DM_BUFFER_FULL_FLAG; 1099 - break; 1100 - } 1096 + ti->type->status(ti, type, status_flags, outptr, remaining); 1101 1097 } else 1102 1098 outptr[0] = '\0'; 1103 1099 1104 - outptr += strlen(outptr) + 1; 1100 + l = strlen(outptr) + 1; 1101 + if (l == remaining) { 1102 + param->flags |= DM_BUFFER_FULL_FLAG; 1103 + break; 1104 + } 1105 + 1106 + outptr += l; 1105 1107 used = param->data_start + (outptr - outbuf); 1106 1108 1107 1109 outptr = align_ptr(outptr); ··· 1414 1410 return 0; 1415 1411 } 1416 1412 1413 + static bool buffer_test_overflow(char *result, unsigned maxlen) 1414 + { 1415 + return !maxlen || strlen(result) + 1 >= maxlen; 1416 + } 1417 + 1418 + /* 1419 + * Process device-mapper dependent messages. 1420 + * Returns a number <= 1 if message was processed by device mapper. 1421 + * Returns 2 if message should be delivered to the target. 1422 + */ 1423 + static int message_for_md(struct mapped_device *md, unsigned argc, char **argv, 1424 + char *result, unsigned maxlen) 1425 + { 1426 + return 2; 1427 + } 1428 + 1417 1429 /* 1418 1430 * Pass a message to the target that's at the supplied device offset. 
1419 1431 */ ··· 1441 1421 struct dm_table *table; 1442 1422 struct dm_target *ti; 1443 1423 struct dm_target_msg *tmsg = (void *) param + param->data_start; 1424 + size_t maxlen; 1425 + char *result = get_result_buffer(param, param_size, &maxlen); 1444 1426 1445 1427 md = find_device(param); 1446 1428 if (!md) ··· 1465 1443 DMWARN("Empty message received."); 1466 1444 goto out_argv; 1467 1445 } 1446 + 1447 + r = message_for_md(md, argc, argv, result, maxlen); 1448 + if (r <= 1) 1449 + goto out_argv; 1468 1450 1469 1451 table = dm_get_live_table(md); 1470 1452 if (!table) ··· 1495 1469 out_argv: 1496 1470 kfree(argv); 1497 1471 out: 1498 - param->data_size = 0; 1472 + if (r >= 0) 1473 + __dev_status(md, param); 1474 + 1475 + if (r == 1) { 1476 + param->flags |= DM_DATA_OUT_FLAG; 1477 + if (buffer_test_overflow(result, maxlen)) 1478 + param->flags |= DM_BUFFER_FULL_FLAG; 1479 + else 1480 + param->data_size = param->data_start + strlen(result) + 1; 1481 + r = 0; 1482 + } 1483 + 1499 1484 dm_put(md); 1500 1485 return r; 1501 1486 } 1487 + 1488 + /* 1489 + * The ioctl parameter block consists of two parts, a dm_ioctl struct 1490 + * followed by a data buffer. This flag is set if the second part, 1491 + * which has a variable size, is not used by the function processing 1492 + * the ioctl. 1493 + */ 1494 + #define IOCTL_FLAGS_NO_PARAMS 1 1502 1495 1503 1496 /*----------------------------------------------------------------- 1504 1497 * Implementation of open/close/ioctl on the special char 1505 1498 * device. 
1506 1499 *---------------------------------------------------------------*/ 1507 - static ioctl_fn lookup_ioctl(unsigned int cmd) 1500 + static ioctl_fn lookup_ioctl(unsigned int cmd, int *ioctl_flags) 1508 1501 { 1509 1502 static struct { 1510 1503 int cmd; 1504 + int flags; 1511 1505 ioctl_fn fn; 1512 1506 } _ioctls[] = { 1513 - {DM_VERSION_CMD, NULL}, /* version is dealt with elsewhere */ 1514 - {DM_REMOVE_ALL_CMD, remove_all}, 1515 - {DM_LIST_DEVICES_CMD, list_devices}, 1507 + {DM_VERSION_CMD, 0, NULL}, /* version is dealt with elsewhere */ 1508 + {DM_REMOVE_ALL_CMD, IOCTL_FLAGS_NO_PARAMS, remove_all}, 1509 + {DM_LIST_DEVICES_CMD, 0, list_devices}, 1516 1510 1517 - {DM_DEV_CREATE_CMD, dev_create}, 1518 - {DM_DEV_REMOVE_CMD, dev_remove}, 1519 - {DM_DEV_RENAME_CMD, dev_rename}, 1520 - {DM_DEV_SUSPEND_CMD, dev_suspend}, 1521 - {DM_DEV_STATUS_CMD, dev_status}, 1522 - {DM_DEV_WAIT_CMD, dev_wait}, 1511 + {DM_DEV_CREATE_CMD, IOCTL_FLAGS_NO_PARAMS, dev_create}, 1512 + {DM_DEV_REMOVE_CMD, IOCTL_FLAGS_NO_PARAMS, dev_remove}, 1513 + {DM_DEV_RENAME_CMD, 0, dev_rename}, 1514 + {DM_DEV_SUSPEND_CMD, IOCTL_FLAGS_NO_PARAMS, dev_suspend}, 1515 + {DM_DEV_STATUS_CMD, IOCTL_FLAGS_NO_PARAMS, dev_status}, 1516 + {DM_DEV_WAIT_CMD, 0, dev_wait}, 1523 1517 1524 - {DM_TABLE_LOAD_CMD, table_load}, 1525 - {DM_TABLE_CLEAR_CMD, table_clear}, 1526 - {DM_TABLE_DEPS_CMD, table_deps}, 1527 - {DM_TABLE_STATUS_CMD, table_status}, 1518 + {DM_TABLE_LOAD_CMD, 0, table_load}, 1519 + {DM_TABLE_CLEAR_CMD, IOCTL_FLAGS_NO_PARAMS, table_clear}, 1520 + {DM_TABLE_DEPS_CMD, 0, table_deps}, 1521 + {DM_TABLE_STATUS_CMD, 0, table_status}, 1528 1522 1529 - {DM_LIST_VERSIONS_CMD, list_versions}, 1523 + {DM_LIST_VERSIONS_CMD, 0, list_versions}, 1530 1524 1531 - {DM_TARGET_MSG_CMD, target_message}, 1532 - {DM_DEV_SET_GEOMETRY_CMD, dev_set_geometry} 1525 + {DM_TARGET_MSG_CMD, 0, target_message}, 1526 + {DM_DEV_SET_GEOMETRY_CMD, 0, dev_set_geometry} 1533 1527 }; 1534 1528 1535 - return (cmd >= ARRAY_SIZE(_ioctls)) ? 
NULL : _ioctls[cmd].fn; 1529 + if (unlikely(cmd >= ARRAY_SIZE(_ioctls))) 1530 + return NULL; 1531 + 1532 + *ioctl_flags = _ioctls[cmd].flags; 1533 + return _ioctls[cmd].fn; 1536 1534 } 1537 1535 1538 1536 /* ··· 1593 1543 return r; 1594 1544 } 1595 1545 1596 - #define DM_PARAMS_VMALLOC 0x0001 /* Params alloced with vmalloc not kmalloc */ 1546 + #define DM_PARAMS_KMALLOC 0x0001 /* Params alloced with kmalloc */ 1547 + #define DM_PARAMS_VMALLOC 0x0002 /* Params alloced with vmalloc */ 1597 1548 #define DM_WIPE_BUFFER 0x0010 /* Wipe input buffer before returning from ioctl */ 1598 1549 1599 1550 static void free_params(struct dm_ioctl *param, size_t param_size, int param_flags) ··· 1602 1551 if (param_flags & DM_WIPE_BUFFER) 1603 1552 memset(param, 0, param_size); 1604 1553 1554 + if (param_flags & DM_PARAMS_KMALLOC) 1555 + kfree(param); 1605 1556 if (param_flags & DM_PARAMS_VMALLOC) 1606 1557 vfree(param); 1607 - else 1608 - kfree(param); 1609 1558 } 1610 1559 1611 - static int copy_params(struct dm_ioctl __user *user, struct dm_ioctl **param, int *param_flags) 1560 + static int copy_params(struct dm_ioctl __user *user, struct dm_ioctl *param_kernel, 1561 + int ioctl_flags, 1562 + struct dm_ioctl **param, int *param_flags) 1612 1563 { 1613 - struct dm_ioctl tmp, *dmi; 1564 + struct dm_ioctl *dmi; 1614 1565 int secure_data; 1566 + const size_t minimum_data_size = sizeof(*param_kernel) - sizeof(param_kernel->data); 1615 1567 1616 - if (copy_from_user(&tmp, user, sizeof(tmp) - sizeof(tmp.data))) 1568 + if (copy_from_user(param_kernel, user, minimum_data_size)) 1617 1569 return -EFAULT; 1618 1570 1619 - if (tmp.data_size < (sizeof(tmp) - sizeof(tmp.data))) 1571 + if (param_kernel->data_size < minimum_data_size) 1620 1572 return -EINVAL; 1621 1573 1622 - secure_data = tmp.flags & DM_SECURE_DATA_FLAG; 1574 + secure_data = param_kernel->flags & DM_SECURE_DATA_FLAG; 1623 1575 1624 1576 *param_flags = secure_data ? 
DM_WIPE_BUFFER : 0; 1577 + 1578 + if (ioctl_flags & IOCTL_FLAGS_NO_PARAMS) { 1579 + dmi = param_kernel; 1580 + dmi->data_size = minimum_data_size; 1581 + goto data_copied; 1582 + } 1625 1583 1626 1584 /* 1627 1585 * Try to avoid low memory issues when a device is suspended. 1628 1586 * Use kmalloc() rather than vmalloc() when we can. 1629 1587 */ 1630 1588 dmi = NULL; 1631 - if (tmp.data_size <= KMALLOC_MAX_SIZE) 1632 - dmi = kmalloc(tmp.data_size, GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN); 1633 - 1634 - if (!dmi) { 1635 - dmi = __vmalloc(tmp.data_size, GFP_NOIO | __GFP_REPEAT | __GFP_HIGH, PAGE_KERNEL); 1636 - *param_flags |= DM_PARAMS_VMALLOC; 1589 + if (param_kernel->data_size <= KMALLOC_MAX_SIZE) { 1590 + dmi = kmalloc(param_kernel->data_size, GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN); 1591 + if (dmi) 1592 + *param_flags |= DM_PARAMS_KMALLOC; 1637 1593 } 1638 1594 1639 1595 if (!dmi) { 1640 - if (secure_data && clear_user(user, tmp.data_size)) 1596 + dmi = __vmalloc(param_kernel->data_size, GFP_NOIO | __GFP_REPEAT | __GFP_HIGH, PAGE_KERNEL); 1597 + if (dmi) 1598 + *param_flags |= DM_PARAMS_VMALLOC; 1599 + } 1600 + 1601 + if (!dmi) { 1602 + if (secure_data && clear_user(user, param_kernel->data_size)) 1641 1603 return -EFAULT; 1642 1604 return -ENOMEM; 1643 1605 } 1644 1606 1645 - if (copy_from_user(dmi, user, tmp.data_size)) 1607 + if (copy_from_user(dmi, user, param_kernel->data_size)) 1646 1608 goto bad; 1647 1609 1610 + data_copied: 1648 1611 /* 1649 1612 * Abort if something changed the ioctl data while it was being copied. 
1650 1613 */ 1651 - if (dmi->data_size != tmp.data_size) { 1614 + if (dmi->data_size != param_kernel->data_size) { 1652 1615 DMERR("rejecting ioctl: data size modified while processing parameters"); 1653 1616 goto bad; 1654 1617 } 1655 1618 1656 1619 /* Wipe the user buffer so we do not return it to userspace */ 1657 - if (secure_data && clear_user(user, tmp.data_size)) 1620 + if (secure_data && clear_user(user, param_kernel->data_size)) 1658 1621 goto bad; 1659 1622 1660 1623 *param = dmi; 1661 1624 return 0; 1662 1625 1663 1626 bad: 1664 - free_params(dmi, tmp.data_size, *param_flags); 1627 + free_params(dmi, param_kernel->data_size, *param_flags); 1665 1628 1666 1629 return -EFAULT; 1667 1630 } ··· 1686 1621 param->flags &= ~DM_BUFFER_FULL_FLAG; 1687 1622 param->flags &= ~DM_UEVENT_GENERATED_FLAG; 1688 1623 param->flags &= ~DM_SECURE_DATA_FLAG; 1624 + param->flags &= ~DM_DATA_OUT_FLAG; 1689 1625 1690 1626 /* Ignores parameters */ 1691 1627 if (cmd == DM_REMOVE_ALL_CMD || ··· 1714 1648 static int ctl_ioctl(uint command, struct dm_ioctl __user *user) 1715 1649 { 1716 1650 int r = 0; 1651 + int ioctl_flags; 1717 1652 int param_flags; 1718 1653 unsigned int cmd; 1719 1654 struct dm_ioctl *uninitialized_var(param); 1720 1655 ioctl_fn fn = NULL; 1721 1656 size_t input_param_size; 1657 + struct dm_ioctl param_kernel; 1722 1658 1723 1659 /* only root can play with this */ 1724 1660 if (!capable(CAP_SYS_ADMIN)) ··· 1745 1677 if (cmd == DM_VERSION_CMD) 1746 1678 return 0; 1747 1679 1748 - fn = lookup_ioctl(cmd); 1680 + fn = lookup_ioctl(cmd, &ioctl_flags); 1749 1681 if (!fn) { 1750 1682 DMWARN("dm_ctl_ioctl: unknown command 0x%x", command); 1751 1683 return -ENOTTY; ··· 1754 1686 /* 1755 1687 * Copy the parameters into kernel space. 
1756 1688 */ 1757 - r = copy_params(user, &param, &param_flags); 1689 + r = copy_params(user, &param_kernel, ioctl_flags, &param, &param_flags); 1758 1690 1759 1691 if (r) 1760 1692 return r; ··· 1766 1698 1767 1699 param->data_size = sizeof(*param); 1768 1700 r = fn(param, input_param_size); 1701 + 1702 + if (unlikely(param->flags & DM_BUFFER_FULL_FLAG) && 1703 + unlikely(ioctl_flags & IOCTL_FLAGS_NO_PARAMS)) 1704 + DMERR("ioctl %d tried to output some data but has IOCTL_FLAGS_NO_PARAMS set", cmd); 1769 1705 1770 1706 /* 1771 1707 * Copy the results back to userland.
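The reworked copy_params() above distinguishes three cases: IOCTL_FLAGS_NO_PARAMS commands reuse the small on-stack dm_ioctl, buffers up to KMALLOC_MAX_SIZE first try kmalloc (with GFP_NOIO | __GFP_NORETRY so it fails fast under memory pressure), and only then does the code fall back to vmalloc. The new DM_PARAMS_KMALLOC / DM_PARAMS_VMALLOC flags record which free path free_params() must take. A userspace sketch of the flag-tracking fallback, assuming a `try_small` hook to stand in for kmalloc failing (in this sketch both paths use malloc(), so free() works for either flag, unlike the kernel's kfree()/vfree() split):

```c
#include <stdlib.h>

#define DM_PARAMS_KMALLOC 0x0001
#define DM_PARAMS_VMALLOC 0x0002

/* Stand-in for KMALLOC_MAX_SIZE; the real value is arch-dependent. */
#define FAKE_KMALLOC_MAX 4096

/*
 * Mimics the two-step allocation in copy_params(): try the "kmalloc"
 * path for small sizes, and fall back to the "vmalloc" path for large
 * sizes or when the first attempt fails. The flag remembers which
 * allocator succeeded so the free path can match it.
 */
static void *alloc_params(size_t size, int *flags,
			  void *(*try_small)(size_t))
{
	void *p = NULL;

	if (size <= FAKE_KMALLOC_MAX) {
		p = try_small(size);
		if (p)
			*flags |= DM_PARAMS_KMALLOC;
	}
	if (!p) {
		p = malloc(size);	/* "vmalloc" fallback */
		if (p)
			*flags |= DM_PARAMS_VMALLOC;
	}
	return p;
}

/* Test hooks: a "kmalloc" that succeeds and one that simulates pressure. */
static void *small_ok(size_t n)   { return malloc(n); }
static void *small_fail(size_t n) { (void)n; return NULL; }
```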
+120 -1
drivers/md/dm-kcopyd.c
··· 22 22 #include <linux/vmalloc.h> 23 23 #include <linux/workqueue.h> 24 24 #include <linux/mutex.h> 25 + #include <linux/delay.h> 25 26 #include <linux/device-mapper.h> 26 27 #include <linux/dm-kcopyd.h> 27 28 ··· 52 51 struct workqueue_struct *kcopyd_wq; 53 52 struct work_struct kcopyd_work; 54 53 54 + struct dm_kcopyd_throttle *throttle; 55 + 55 56 /* 56 57 * We maintain three lists of jobs: 57 58 * ··· 70 67 }; 71 68 72 69 static struct page_list zero_page_list; 70 + 71 + static DEFINE_SPINLOCK(throttle_spinlock); 72 + 73 + /* 74 + * IO/IDLE accounting slowly decays after (1 << ACCOUNT_INTERVAL_SHIFT) period. 75 + * When total_period >= (1 << ACCOUNT_INTERVAL_SHIFT) the counters are divided 76 + * by 2. 77 + */ 78 + #define ACCOUNT_INTERVAL_SHIFT SHIFT_HZ 79 + 80 + /* 81 + * Sleep this number of milliseconds. 82 + * 83 + * The value was decided experimentally. 84 + * Smaller values seem to cause an increased copy rate above the limit. 85 + * The reason for this is unknown but possibly due to jiffies rounding errors 86 + * or read/write cache inside the disk. 87 + */ 88 + #define SLEEP_MSEC 100 89 + 90 + /* 91 + * Maximum number of sleep events. There is a theoretical livelock if more 92 + * kcopyd clients do work simultaneously which this limit avoids. 93 + */ 94 + #define MAX_SLEEPS 10 95 + 96 + static void io_job_start(struct dm_kcopyd_throttle *t) 97 + { 98 + unsigned throttle, now, difference; 99 + int slept = 0, skew; 100 + 101 + if (unlikely(!t)) 102 + return; 103 + 104 + try_again: 105 + spin_lock_irq(&throttle_spinlock); 106 + 107 + throttle = ACCESS_ONCE(t->throttle); 108 + 109 + if (likely(throttle >= 100)) 110 + goto skip_limit; 111 + 112 + now = jiffies; 113 + difference = now - t->last_jiffies; 114 + t->last_jiffies = now; 115 + if (t->num_io_jobs) 116 + t->io_period += difference; 117 + t->total_period += difference; 118 + 119 + /* 120 + * Maintain sane values if we got a temporary overflow. 
121 + */ 122 + if (unlikely(t->io_period > t->total_period)) 123 + t->io_period = t->total_period; 124 + 125 + if (unlikely(t->total_period >= (1 << ACCOUNT_INTERVAL_SHIFT))) { 126 + int shift = fls(t->total_period >> ACCOUNT_INTERVAL_SHIFT); 127 + t->total_period >>= shift; 128 + t->io_period >>= shift; 129 + } 130 + 131 + skew = t->io_period - throttle * t->total_period / 100; 132 + 133 + if (unlikely(skew > 0) && slept < MAX_SLEEPS) { 134 + slept++; 135 + spin_unlock_irq(&throttle_spinlock); 136 + msleep(SLEEP_MSEC); 137 + goto try_again; 138 + } 139 + 140 + skip_limit: 141 + t->num_io_jobs++; 142 + 143 + spin_unlock_irq(&throttle_spinlock); 144 + } 145 + 146 + static void io_job_finish(struct dm_kcopyd_throttle *t) 147 + { 148 + unsigned long flags; 149 + 150 + if (unlikely(!t)) 151 + return; 152 + 153 + spin_lock_irqsave(&throttle_spinlock, flags); 154 + 155 + t->num_io_jobs--; 156 + 157 + if (likely(ACCESS_ONCE(t->throttle) >= 100)) 158 + goto skip_limit; 159 + 160 + if (!t->num_io_jobs) { 161 + unsigned now, difference; 162 + 163 + now = jiffies; 164 + difference = now - t->last_jiffies; 165 + t->last_jiffies = now; 166 + 167 + t->io_period += difference; 168 + t->total_period += difference; 169 + 170 + /* 171 + * Maintain sane values if we got a temporary overflow. 
172 + */ 173 + if (unlikely(t->io_period > t->total_period)) 174 + t->io_period = t->total_period; 175 + } 176 + 177 + skip_limit: 178 + spin_unlock_irqrestore(&throttle_spinlock, flags); 179 + } 180 + 73 181 74 182 static void wake(struct dm_kcopyd_client *kc) 75 183 { ··· 462 348 struct kcopyd_job *job = (struct kcopyd_job *) context; 463 349 struct dm_kcopyd_client *kc = job->kc; 464 350 351 + io_job_finish(kc->throttle); 352 + 465 353 if (error) { 466 354 if (job->rw & WRITE) 467 355 job->write_err |= error; ··· 504 388 .notify.context = job, 505 389 .client = job->kc->io_client, 506 390 }; 391 + 392 + io_job_start(job->kc->throttle); 507 393 508 394 if (job->rw == READ) 509 395 r = dm_io(&io_req, 1, &job->source, NULL); ··· 813 695 /*----------------------------------------------------------------- 814 696 * Client setup 815 697 *---------------------------------------------------------------*/ 816 - struct dm_kcopyd_client *dm_kcopyd_client_create(void) 698 + struct dm_kcopyd_client *dm_kcopyd_client_create(struct dm_kcopyd_throttle *throttle) 817 699 { 818 700 int r = -ENOMEM; 819 701 struct dm_kcopyd_client *kc; ··· 826 708 INIT_LIST_HEAD(&kc->complete_jobs); 827 709 INIT_LIST_HEAD(&kc->io_jobs); 828 710 INIT_LIST_HEAD(&kc->pages_jobs); 711 + kc->throttle = throttle; 829 712 830 713 kc->job_pool = mempool_create_slab_pool(MIN_JOBS, _job_cache); 831 714 if (!kc->job_pool)
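The kcopyd throttle above works by keeping two slowly decaying counters: total_period (elapsed wall time) and io_period (time with at least one copy I/O in flight), both scaled down whenever total_period crosses 1 << ACCOUNT_INTERVAL_SHIFT. A job is delayed while the skew, io_period minus the permitted percentage of total_period, is positive. A hedged userspace sketch of one accounting step, with jiffies replaced by a plain tick count; `throttle_account` is my name, and the loop-based decay stands in for the kernel's fls()-derived shift:

```c
#define ACCOUNT_INTERVAL_SHIFT 8	/* the kernel uses SHIFT_HZ */

struct throttle_state {
	unsigned throttle;	/* permitted busy time, in percent */
	unsigned num_io_jobs;	/* copy I/Os currently in flight */
	unsigned io_period;	/* decayed busy time */
	unsigned total_period;	/* decayed wall time */
};

/*
 * One accounting step, as in io_job_start(): advance both periods by
 * `elapsed` ticks (io_period only while jobs are in flight), decay
 * once the accounting interval fills, and return the skew. A positive
 * skew means the client is over its throttle percentage and should
 * sleep before starting the next job.
 */
static int throttle_account(struct throttle_state *t, unsigned elapsed)
{
	if (t->num_io_jobs)
		t->io_period += elapsed;
	t->total_period += elapsed;

	/* Maintain sane values if we got a temporary overflow. */
	if (t->io_period > t->total_period)
		t->io_period = t->total_period;

	/* Slow decay: halve both counters until the interval fits again. */
	while (t->total_period >= (1u << ACCOUNT_INTERVAL_SHIFT)) {
		t->total_period >>= 1;
		t->io_period >>= 1;
	}

	return (int)t->io_period
		- (int)(t->throttle * t->total_period / 100);
}
```

With throttle = 50 and the device busy the whole interval, the skew comes out at half the elapsed time, so the client sleeps; an idle interval drives the skew negative and copies proceed immediately.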
+6 -7
drivers/md/dm-linear.c
···
53 53 goto bad;
54 54 }
55 55
56 - ti->num_flush_requests = 1;
57 - ti->num_discard_requests = 1;
58 - ti->num_write_same_requests = 1;
56 + ti->num_flush_bios = 1;
57 + ti->num_discard_bios = 1;
58 + ti->num_write_same_bios = 1;
59 59 ti->private = lc;
60 60 return 0;
61 61
···
95 95 return DM_MAPIO_REMAPPED;
96 96 }
97 97
98 - static int linear_status(struct dm_target *ti, status_type_t type,
99 - unsigned status_flags, char *result, unsigned maxlen)
98 + static void linear_status(struct dm_target *ti, status_type_t type,
99 + unsigned status_flags, char *result, unsigned maxlen)
100 100 {
101 101 struct linear_c *lc = (struct linear_c *) ti->private;
102 102
···
110 110 (unsigned long long)lc->start);
111 111 break;
112 112 }
113 - return 0;
114 113 }
115 114
116 115 static int linear_ioctl(struct dm_target *ti, unsigned int cmd,
···
154 155
155 156 static struct target_type linear_target = {
156 157 .name = "linear",
157 - .version = {1, 2, 0},
158 + .version = {1, 2, 1},
158 159 .module = THIS_MODULE,
159 160 .ctr = linear_ctr,
160 161 .dtr = linear_dtr,
+5 -7
drivers/md/dm-mpath.c
···
905 905 goto bad;
906 906 }
907 907
908 - ti->num_flush_requests = 1;
909 - ti->num_discard_requests = 1;
908 + ti->num_flush_bios = 1;
909 + ti->num_discard_bios = 1;
910 910
911 911 return 0;
912 912
···
1378 1378 * [priority selector-name num_ps_args [ps_args]*
1379 1379 * num_paths num_selector_args [path_dev [selector_args]* ]+ ]+
1380 1380 */
1381 - static int multipath_status(struct dm_target *ti, status_type_t type,
1382 - unsigned status_flags, char *result, unsigned maxlen)
1381 + static void multipath_status(struct dm_target *ti, status_type_t type,
1382 + unsigned status_flags, char *result, unsigned maxlen)
1383 1383 {
1384 1384 int sz = 0;
1385 1385 unsigned long flags;
···
1485 1485 }
1486 1486
1487 1487 spin_unlock_irqrestore(&m->lock, flags);
1488 -
1489 - return 0;
1490 1488 }
1491 1489
1492 1490 static int multipath_message(struct dm_target *ti, unsigned argc, char **argv)
···
1693 1695 *---------------------------------------------------------------*/
1694 1696 static struct target_type multipath_target = {
1695 1697 .name = "multipath",
1696 - .version = {1, 5, 0},
1698 + .version = {1, 5, 1},
1697 1699 .module = THIS_MODULE,
1698 1700 .ctr = multipath_ctr,
1699 1701 .dtr = multipath_dtr,
+4 -6
drivers/md/dm-raid.c
···
1151 1151
1152 1152 INIT_WORK(&rs->md.event_work, do_table_event);
1153 1153 ti->private = rs;
1154 - ti->num_flush_requests = 1;
1154 + ti->num_flush_bios = 1;
1155 1155
1156 1156 mutex_lock(&rs->md.reconfig_mutex);
1157 1157 ret = md_run(&rs->md);
···
1201 1201 return DM_MAPIO_SUBMITTED;
1202 1202 }
1203 1203
1204 - static int raid_status(struct dm_target *ti, status_type_t type,
1205 - unsigned status_flags, char *result, unsigned maxlen)
1204 + static void raid_status(struct dm_target *ti, status_type_t type,
1205 + unsigned status_flags, char *result, unsigned maxlen)
1206 1206 {
1207 1207 struct raid_set *rs = ti->private;
1208 1208 unsigned raid_param_cnt = 1; /* at least 1 for chunksize */
···
1344 1344 DMEMIT(" -");
1345 1345 }
1346 1346 }
1347 -
1348 - return 0;
1349 1347 }
1350 1348
1351 1349 static int raid_iterate_devices(struct dm_target *ti, iterate_devices_callout_fn fn, void *data)
···
1403 1405
1404 1406 static struct target_type raid_target = {
1405 1407 .name = "raid",
1406 - .version = {1, 4, 1},
1408 + .version = {1, 4, 2},
1407 1409 .module = THIS_MODULE,
1408 1410 .ctr = raid_ctr,
1409 1411 .dtr = raid_dtr,
+9 -8
drivers/md/dm-raid1.c
···
82 82 struct mirror mirror[0];
83 83 };
84 84
85 + DECLARE_DM_KCOPYD_THROTTLE_WITH_MODULE_PARM(raid1_resync_throttle,
86 + "A percentage of time allocated for raid resynchronization");
87 +
85 88 static void wakeup_mirrord(void *context)
86 89 {
87 90 struct mirror_set *ms = context;
···
1075 1072 if (r)
1076 1073 goto err_free_context;
1077 1074
1078 - ti->num_flush_requests = 1;
1079 - ti->num_discard_requests = 1;
1075 + ti->num_flush_bios = 1;
1076 + ti->num_discard_bios = 1;
1080 1077 ti->per_bio_data_size = sizeof(struct dm_raid1_bio_record);
1081 1078 ti->discard_zeroes_data_unsupported = true;
1082 1079
···
1114 1111 goto err_destroy_wq;
1115 1112 }
1116 1113
1117 - ms->kcopyd_client = dm_kcopyd_client_create();
1114 + ms->kcopyd_client = dm_kcopyd_client_create(&dm_kcopyd_throttle);
1118 1115 if (IS_ERR(ms->kcopyd_client)) {
1119 1116 r = PTR_ERR(ms->kcopyd_client);
1120 1117 goto err_destroy_wq;
···
1350 1347 }
1351 1348
1352 1349
1353 - static int mirror_status(struct dm_target *ti, status_type_t type,
1354 - unsigned status_flags, char *result, unsigned maxlen)
1350 + static void mirror_status(struct dm_target *ti, status_type_t type,
1351 + unsigned status_flags, char *result, unsigned maxlen)
1355 1352 {
1356 1353 unsigned int m, sz = 0;
1357 1354 struct mirror_set *ms = (struct mirror_set *) ti->private;
···
1386 1383 if (ms->features & DM_RAID1_HANDLE_ERRORS)
1387 1384 DMEMIT(" 1 handle_errors");
1388 1385 }
1389 -
1390 - return 0;
1391 1386 }
1392 1387
1393 1388 static int mirror_iterate_devices(struct dm_target *ti,
···
1404 1403
1405 1404 static struct target_type mirror_target = {
1406 1405 .name = "mirror",
1407 - .version = {1, 13, 1},
1406 + .version = {1, 13, 2},
1408 1407 .module = THIS_MODULE,
1409 1408 .ctr = mirror_ctr,
1410 1409 .dtr = mirror_dtr,
+17 -16
drivers/md/dm-snap.c
···
124 124 #define RUNNING_MERGE 0
125 125 #define SHUTDOWN_MERGE 1
126 126
127 + DECLARE_DM_KCOPYD_THROTTLE_WITH_MODULE_PARM(snapshot_copy_throttle,
128 + "A percentage of time allocated for copy on write");
129 +
127 130 struct dm_dev *dm_snap_origin(struct dm_snapshot *s)
128 131 {
129 132 return s->origin;
···
1040 1037 int i;
1041 1038 int r = -EINVAL;
1042 1039 char *origin_path, *cow_path;
1043 - unsigned args_used, num_flush_requests = 1;
1040 + unsigned args_used, num_flush_bios = 1;
1044 1041 fmode_t origin_mode = FMODE_READ;
1045 1042
1046 1043 if (argc != 4) {
···
1050 1047 }
1051 1048
1052 1049 if (dm_target_is_snapshot_merge(ti)) {
1053 - num_flush_requests = 2;
1050 + num_flush_bios = 2;
1054 1051 origin_mode = FMODE_WRITE;
1055 1052 }
1056 1053
···
1111 1108 goto bad_hash_tables;
1112 1109 }
1113 1110
1114 - s->kcopyd_client = dm_kcopyd_client_create();
1111 + s->kcopyd_client = dm_kcopyd_client_create(&dm_kcopyd_throttle);
1115 1112 if (IS_ERR(s->kcopyd_client)) {
1116 1113 r = PTR_ERR(s->kcopyd_client);
1117 1114 ti->error = "Could not create kcopyd client";
···
1130 1127 spin_lock_init(&s->tracked_chunk_lock);
1131 1128
1132 1129 ti->private = s;
1133 - ti->num_flush_requests = num_flush_requests;
1130 + ti->num_flush_bios = num_flush_bios;
1134 1131 ti->per_bio_data_size = sizeof(struct dm_snap_tracked_chunk);
1135 1132
1136 1133 /* Add snapshot to the list of snapshots for this origin */
···
1694 1691 init_tracked_chunk(bio);
1695 1692
1696 1693 if (bio->bi_rw & REQ_FLUSH) {
1697 - if (!dm_bio_get_target_request_nr(bio))
1694 + if (!dm_bio_get_target_bio_nr(bio))
1698 1695 bio->bi_bdev = s->origin->bdev;
1699 1696 else
1700 1697 bio->bi_bdev = s->cow->bdev;
···
1839 1836 start_merge(s);
1840 1837 }
1841 1838
1842 - static int snapshot_status(struct dm_target *ti, status_type_t type,
1843 - unsigned status_flags, char *result, unsigned maxlen)
1839 + static void snapshot_status(struct dm_target *ti, status_type_t type,
1840 + unsigned status_flags, char *result, unsigned maxlen)
1844 1841 {
1845 1842 unsigned sz = 0;
1846 1843 struct dm_snapshot *snap = ti->private;
···
1886 1883 maxlen - sz);
1887 1884 break;
1888 1885 }
1889 -
1890 - return 0;
1891 1886 }
1892 1887
1893 1888 static int snapshot_iterate_devices(struct dm_target *ti,
···
2105 2104 }
2106 2105
2107 2106 ti->private = dev;
2108 - ti->num_flush_requests = 1;
2107 + ti->num_flush_bios = 1;
2109 2108
2110 2109 return 0;
2111 2110 }
···
2139 2138 ti->max_io_len = get_origin_minimum_chunksize(dev->bdev);
2140 2139 }
2141 2140
2142 - static int origin_status(struct dm_target *ti, status_type_t type,
2143 - unsigned status_flags, char *result, unsigned maxlen)
2141 + static void origin_status(struct dm_target *ti, status_type_t type,
2142 + unsigned status_flags, char *result, unsigned maxlen)
2144 2143 {
2145 2144 struct dm_dev *dev = ti->private;
2146 2145
···
2153 2152 snprintf(result, maxlen, "%s", dev->name);
2154 2153 break;
2155 2154 }
2156 -
2157 - return 0;
2158 2155 }
2159 2156
2160 2157 static int origin_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
···
2179 2180
2180 2181 static struct target_type origin_target = {
2181 2182 .name = "snapshot-origin",
2182 - .version = {1, 8, 0},
2183 + .version = {1, 8, 1},
2183 2184 .module = THIS_MODULE,
2184 2185 .ctr = origin_ctr,
2185 2186 .dtr = origin_dtr,
···
2192 2193
2193 2194 static struct target_type snapshot_target = {
2194 2195 .name = "snapshot",
2195 - .version = {1, 11, 0},
2196 + .version = {1, 11, 1},
2196 2197 .module = THIS_MODULE,
2197 2198 .ctr = snapshot_ctr,
2198 2199 .dtr = snapshot_dtr,
···
2305 2306 MODULE_DESCRIPTION(DM_NAME " snapshot target");
2306 2307 MODULE_AUTHOR("Joe Thornber");
2307 2308 MODULE_LICENSE("GPL");
2309 + MODULE_ALIAS("dm-snapshot-origin");
2310 + MODULE_ALIAS("dm-snapshot-merge");
+13 -14
drivers/md/dm-stripe.c
···
160 160 if (r)
161 161 return r;
162 162
163 - ti->num_flush_requests = stripes;
164 - ti->num_discard_requests = stripes;
165 - ti->num_write_same_requests = stripes;
163 + ti->num_flush_bios = stripes;
164 + ti->num_discard_bios = stripes;
165 + ti->num_write_same_bios = stripes;
166 166
167 167 sc->chunk_size = chunk_size;
168 168 if (chunk_size & (chunk_size - 1))
···
276 276 {
277 277 struct stripe_c *sc = ti->private;
278 278 uint32_t stripe;
279 - unsigned target_request_nr;
279 + unsigned target_bio_nr;
280 280
281 281 if (bio->bi_rw & REQ_FLUSH) {
282 - target_request_nr = dm_bio_get_target_request_nr(bio);
283 - BUG_ON(target_request_nr >= sc->stripes);
284 - bio->bi_bdev = sc->stripe[target_request_nr].dev->bdev;
282 + target_bio_nr = dm_bio_get_target_bio_nr(bio);
283 + BUG_ON(target_bio_nr >= sc->stripes);
284 + bio->bi_bdev = sc->stripe[target_bio_nr].dev->bdev;
285 285 return DM_MAPIO_REMAPPED;
286 286 }
287 287 if (unlikely(bio->bi_rw & REQ_DISCARD) ||
288 288 unlikely(bio->bi_rw & REQ_WRITE_SAME)) {
289 - target_request_nr = dm_bio_get_target_request_nr(bio);
290 - BUG_ON(target_request_nr >= sc->stripes);
291 - return stripe_map_range(sc, bio, target_request_nr);
289 + target_bio_nr = dm_bio_get_target_bio_nr(bio);
290 + BUG_ON(target_bio_nr >= sc->stripes);
291 + return stripe_map_range(sc, bio, target_bio_nr);
292 292 }
293 293
294 294 stripe_map_sector(sc, bio->bi_sector, &stripe, &bio->bi_sector);
···
312 312 *
313 313 */
314 314
315 - static int stripe_status(struct dm_target *ti, status_type_t type,
316 - unsigned status_flags, char *result, unsigned maxlen)
315 + static void stripe_status(struct dm_target *ti, status_type_t type,
316 + unsigned status_flags, char *result, unsigned maxlen)
317 317 {
318 318 struct stripe_c *sc = (struct stripe_c *) ti->private;
319 319 char buffer[sc->stripes + 1];
···
340 340 (unsigned long long)sc->stripe[i].physical_start);
341 341 break;
342 342 }
343 - return 0;
344 343 }
345 344
346 345 static int stripe_end_io(struct dm_target *ti, struct bio *bio, int error)
···
427 428
428 429 static struct target_type stripe_target = {
429 430 .name = "striped",
430 - .version = {1, 5, 0},
431 + .version = {1, 5, 1},
431 432 .module = THIS_MODULE,
432 433 .ctr = stripe_ctr,
433 434 .dtr = stripe_dtr,
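Editor's note: the `stripe_map_sector()` call above picks the stripe device and the sector within it; the helper itself is outside this hunk, so the following is only a hedged model of classic round-robin chunk striping (function and parameter names are ours, not the kernel's):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of dm-stripe's sector mapping: the target is
 * carved into fixed-size chunks that are dealt round-robin across
 * the stripe devices. */
static void stripe_map(uint64_t sector, uint32_t chunk_size /* sectors */,
                       uint32_t stripes, uint32_t *stripe,
                       uint64_t *stripe_sector)
{
    uint64_t chunk = sector / chunk_size;   /* which chunk overall */
    uint32_t offset = sector % chunk_size;  /* offset inside that chunk */

    *stripe = chunk % stripes;              /* round-robin device choice */
    /* sector within the chosen stripe device */
    *stripe_sector = (chunk / stripes) * chunk_size + offset;
}
```

With a power-of-two `chunk_size`, the divisions reduce to shifts and masks, which is why the constructor above rejects non-power-of-two chunk sizes with `chunk_size & (chunk_size - 1)`.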
+5 -6
drivers/md/dm-table.c
···
217 217
218 218 if (alloc_targets(t, num_targets)) {
219 219 kfree(t);
220 - t = NULL;
221 220 return -ENOMEM;
222 221 }
223 222
···
822 823
823 824 t->highs[t->num_targets++] = tgt->begin + tgt->len - 1;
824 825
825 - if (!tgt->num_discard_requests && tgt->discards_supported)
826 - DMWARN("%s: %s: ignoring discards_supported because num_discard_requests is zero.",
826 + if (!tgt->num_discard_bios && tgt->discards_supported)
827 + DMWARN("%s: %s: ignoring discards_supported because num_discard_bios is zero.",
827 828 dm_device_name(t->md), type);
828 829
829 830 return 0;
···
1359 1360 while (i < dm_table_get_num_targets(t)) {
1360 1361 ti = dm_table_get_target(t, i++);
1361 1362
1362 - if (!ti->num_flush_requests)
1363 + if (!ti->num_flush_bios)
1363 1364 continue;
1364 1365
1365 1366 if (ti->flush_supported)
···
1438 1439 while (i < dm_table_get_num_targets(t)) {
1439 1440 ti = dm_table_get_target(t, i++);
1440 1441
1441 - if (!ti->num_write_same_requests)
1442 + if (!ti->num_write_same_bios)
1442 1443 return false;
1443 1444
1444 1445 if (!ti->type->iterate_devices ||
···
1656 1657 while (i < dm_table_get_num_targets(t)) {
1657 1658 ti = dm_table_get_target(t, i++);
1658 1659
1659 - if (!ti->num_discard_requests)
1660 + if (!ti->num_discard_bios)
1660 1661 continue;
1661 1662
1662 1663 if (ti->discards_supported)
+1 -1
drivers/md/dm-target.c
···
116 116 /*
117 117 * Return error for discards instead of -EOPNOTSUPP
118 118 */
119 - tt->num_discard_requests = 1;
119 + tt->num_discard_bios = 1;
120 120
121 121 return 0;
122 122 }
+6 -6
drivers/md/dm-thin-metadata.c
···
280 280 *t = v & ((1 << 24) - 1);
281 281 }
282 282
283 - static void data_block_inc(void *context, void *value_le)
283 + static void data_block_inc(void *context, const void *value_le)
284 284 {
285 285 struct dm_space_map *sm = context;
286 286 __le64 v_le;
···
292 292 dm_sm_inc_block(sm, b);
293 293 }
294 294
295 - static void data_block_dec(void *context, void *value_le)
295 + static void data_block_dec(void *context, const void *value_le)
296 296 {
297 297 struct dm_space_map *sm = context;
298 298 __le64 v_le;
···
304 304 dm_sm_dec_block(sm, b);
305 305 }
306 306
307 - static int data_block_equal(void *context, void *value1_le, void *value2_le)
307 + static int data_block_equal(void *context, const void *value1_le, const void *value2_le)
308 308 {
309 309 __le64 v1_le, v2_le;
310 310 uint64_t b1, b2;
···
318 318 return b1 == b2;
319 319 }
320 320
321 - static void subtree_inc(void *context, void *value)
321 + static void subtree_inc(void *context, const void *value)
322 322 {
323 323 struct dm_btree_info *info = context;
324 324 __le64 root_le;
···
329 329 dm_tm_inc(info->tm, root);
330 330 }
331 331
332 - static void subtree_dec(void *context, void *value)
332 + static void subtree_dec(void *context, const void *value)
333 333 {
334 334 struct dm_btree_info *info = context;
335 335 __le64 root_le;
···
341 341 DMERR("btree delete failed\n");
342 342 }
343 343
344 - static int subtree_equal(void *context, void *value1_le, void *value2_le)
344 + static int subtree_equal(void *context, const void *value1_le, const void *value2_le)
345 345 {
346 346 __le64 v1_le, v2_le;
347 347 memcpy(&v1_le, value1_le, sizeof(v1_le));
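Editor's note: the first context lines of this hunk show the tail of dm-thin's block/time unpacking, where one 64-bit metadata value carries a block number in the high bits and a 24-bit time stamp in the low bits. A standalone sketch of that packing (the `pack_block_time` inverse is our own helper, not from this hunk):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of dm-thin's block/time packing: one 64-bit value holds the
 * data block number shifted past a 24-bit time stamp. The unpack logic
 * mirrors the hunk above; the pack helper is our assumed inverse. */
static uint64_t pack_block_time(uint64_t block, uint32_t time)
{
    return (block << 24) | (time & ((1u << 24) - 1));
}

static void unpack_block_time(uint64_t v, uint64_t *b, uint32_t *t)
{
    *b = v >> 24;                  /* block number in the high bits */
    *t = v & ((1u << 24) - 1);     /* 24-bit time stamp in the low bits */
}
```

The `const` qualifiers added by this hunk matter because these callbacks now receive pointers directly into read-only btree value arrays.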
+180 -97
drivers/md/dm-thin.c
··· 26 26 #define PRISON_CELLS 1024 27 27 #define COMMIT_PERIOD HZ 28 28 29 + DECLARE_DM_KCOPYD_THROTTLE_WITH_MODULE_PARM(snapshot_copy_throttle, 30 + "A percentage of time allocated for copy on write"); 31 + 29 32 /* 30 33 * The block size of the device holding pool data must be 31 34 * between 64KB and 1GB. ··· 230 227 /*----------------------------------------------------------------*/ 231 228 232 229 /* 230 + * wake_worker() is used when new work is queued and when pool_resume is 231 + * ready to continue deferred IO processing. 232 + */ 233 + static void wake_worker(struct pool *pool) 234 + { 235 + queue_work(pool->wq, &pool->worker); 236 + } 237 + 238 + /*----------------------------------------------------------------*/ 239 + 240 + static int bio_detain(struct pool *pool, struct dm_cell_key *key, struct bio *bio, 241 + struct dm_bio_prison_cell **cell_result) 242 + { 243 + int r; 244 + struct dm_bio_prison_cell *cell_prealloc; 245 + 246 + /* 247 + * Allocate a cell from the prison's mempool. 248 + * This might block but it can't fail. 249 + */ 250 + cell_prealloc = dm_bio_prison_alloc_cell(pool->prison, GFP_NOIO); 251 + 252 + r = dm_bio_detain(pool->prison, key, bio, cell_prealloc, cell_result); 253 + if (r) 254 + /* 255 + * We reused an old cell; we can get rid of 256 + * the new one. 
257 + */ 258 + dm_bio_prison_free_cell(pool->prison, cell_prealloc); 259 + 260 + return r; 261 + } 262 + 263 + static void cell_release(struct pool *pool, 264 + struct dm_bio_prison_cell *cell, 265 + struct bio_list *bios) 266 + { 267 + dm_cell_release(pool->prison, cell, bios); 268 + dm_bio_prison_free_cell(pool->prison, cell); 269 + } 270 + 271 + static void cell_release_no_holder(struct pool *pool, 272 + struct dm_bio_prison_cell *cell, 273 + struct bio_list *bios) 274 + { 275 + dm_cell_release_no_holder(pool->prison, cell, bios); 276 + dm_bio_prison_free_cell(pool->prison, cell); 277 + } 278 + 279 + static void cell_defer_no_holder_no_free(struct thin_c *tc, 280 + struct dm_bio_prison_cell *cell) 281 + { 282 + struct pool *pool = tc->pool; 283 + unsigned long flags; 284 + 285 + spin_lock_irqsave(&pool->lock, flags); 286 + dm_cell_release_no_holder(pool->prison, cell, &pool->deferred_bios); 287 + spin_unlock_irqrestore(&pool->lock, flags); 288 + 289 + wake_worker(pool); 290 + } 291 + 292 + static void cell_error(struct pool *pool, 293 + struct dm_bio_prison_cell *cell) 294 + { 295 + dm_cell_error(pool->prison, cell); 296 + dm_bio_prison_free_cell(pool->prison, cell); 297 + } 298 + 299 + /*----------------------------------------------------------------*/ 300 + 301 + /* 233 302 * A global list of pools that uses a struct mapped_device as a key. 234 303 */ 235 304 static struct dm_thin_pool_table { ··· 405 330 * target. 
406 331 */ 407 332 333 + static bool block_size_is_power_of_two(struct pool *pool) 334 + { 335 + return pool->sectors_per_block_shift >= 0; 336 + } 337 + 408 338 static dm_block_t get_bio_block(struct thin_c *tc, struct bio *bio) 409 339 { 340 + struct pool *pool = tc->pool; 410 341 sector_t block_nr = bio->bi_sector; 411 342 412 - if (tc->pool->sectors_per_block_shift < 0) 413 - (void) sector_div(block_nr, tc->pool->sectors_per_block); 343 + if (block_size_is_power_of_two(pool)) 344 + block_nr >>= pool->sectors_per_block_shift; 414 345 else 415 - block_nr >>= tc->pool->sectors_per_block_shift; 346 + (void) sector_div(block_nr, pool->sectors_per_block); 416 347 417 348 return block_nr; 418 349 } ··· 429 348 sector_t bi_sector = bio->bi_sector; 430 349 431 350 bio->bi_bdev = tc->pool_dev->bdev; 432 - if (tc->pool->sectors_per_block_shift < 0) 433 - bio->bi_sector = (block * pool->sectors_per_block) + 434 - sector_div(bi_sector, pool->sectors_per_block); 435 - else 351 + if (block_size_is_power_of_two(pool)) 436 352 bio->bi_sector = (block << pool->sectors_per_block_shift) | 437 353 (bi_sector & (pool->sectors_per_block - 1)); 354 + else 355 + bio->bi_sector = (block * pool->sectors_per_block) + 356 + sector_div(bi_sector, pool->sectors_per_block); 438 357 } 439 358 440 359 static void remap_to_origin(struct thin_c *tc, struct bio *bio) ··· 499 418 { 500 419 remap(tc, bio, block); 501 420 issue(tc, bio); 502 - } 503 - 504 - /* 505 - * wake_worker() is used when new work is queued and when pool_resume is 506 - * ready to continue deferred IO processing. 
507 - */ 508 - static void wake_worker(struct pool *pool) 509 - { 510 - queue_work(pool->wq, &pool->worker); 511 421 } 512 422 513 423 /*----------------------------------------------------------------*/ ··· 587 515 unsigned long flags; 588 516 589 517 spin_lock_irqsave(&pool->lock, flags); 590 - dm_cell_release(cell, &pool->deferred_bios); 518 + cell_release(pool, cell, &pool->deferred_bios); 591 519 spin_unlock_irqrestore(&tc->pool->lock, flags); 592 520 593 521 wake_worker(pool); 594 522 } 595 523 596 524 /* 597 - * Same as cell_defer except it omits the original holder of the cell. 525 + * Same as cell_defer above, except it omits the original holder of the cell. 598 526 */ 599 527 static void cell_defer_no_holder(struct thin_c *tc, struct dm_bio_prison_cell *cell) 600 528 { ··· 602 530 unsigned long flags; 603 531 604 532 spin_lock_irqsave(&pool->lock, flags); 605 - dm_cell_release_no_holder(cell, &pool->deferred_bios); 533 + cell_release_no_holder(pool, cell, &pool->deferred_bios); 606 534 spin_unlock_irqrestore(&pool->lock, flags); 607 535 608 536 wake_worker(pool); ··· 612 540 { 613 541 if (m->bio) 614 542 m->bio->bi_end_io = m->saved_bi_end_io; 615 - dm_cell_error(m->cell); 543 + cell_error(m->tc->pool, m->cell); 616 544 list_del(&m->list); 617 545 mempool_free(m, m->tc->pool->mapping_pool); 618 546 } 547 + 619 548 static void process_prepared_mapping(struct dm_thin_new_mapping *m) 620 549 { 621 550 struct thin_c *tc = m->tc; 551 + struct pool *pool = tc->pool; 622 552 struct bio *bio; 623 553 int r; 624 554 ··· 629 555 bio->bi_end_io = m->saved_bi_end_io; 630 556 631 557 if (m->err) { 632 - dm_cell_error(m->cell); 558 + cell_error(pool, m->cell); 633 559 goto out; 634 560 } 635 561 ··· 641 567 r = dm_thin_insert_block(tc->td, m->virt_block, m->data_block); 642 568 if (r) { 643 569 DMERR_LIMIT("dm_thin_insert_block() failed"); 644 - dm_cell_error(m->cell); 570 + cell_error(pool, m->cell); 645 571 goto out; 646 572 } 647 573 ··· 659 585 660 586 out: 661 587 
list_del(&m->list); 662 - mempool_free(m, tc->pool->mapping_pool); 588 + mempool_free(m, pool->mapping_pool); 663 589 } 664 590 665 591 static void process_prepared_discard_fail(struct dm_thin_new_mapping *m) ··· 810 736 if (r < 0) { 811 737 mempool_free(m, pool->mapping_pool); 812 738 DMERR_LIMIT("dm_kcopyd_copy() failed"); 813 - dm_cell_error(cell); 739 + cell_error(pool, cell); 814 740 } 815 741 } 816 742 } ··· 876 802 if (r < 0) { 877 803 mempool_free(m, pool->mapping_pool); 878 804 DMERR_LIMIT("dm_kcopyd_zero() failed"); 879 - dm_cell_error(cell); 805 + cell_error(pool, cell); 880 806 } 881 807 } 882 808 } ··· 982 908 spin_unlock_irqrestore(&pool->lock, flags); 983 909 } 984 910 985 - static void no_space(struct dm_bio_prison_cell *cell) 911 + static void no_space(struct pool *pool, struct dm_bio_prison_cell *cell) 986 912 { 987 913 struct bio *bio; 988 914 struct bio_list bios; 989 915 990 916 bio_list_init(&bios); 991 - dm_cell_release(cell, &bios); 917 + cell_release(pool, cell, &bios); 992 918 993 919 while ((bio = bio_list_pop(&bios))) 994 920 retry_on_resume(bio); ··· 1006 932 struct dm_thin_new_mapping *m; 1007 933 1008 934 build_virtual_key(tc->td, block, &key); 1009 - if (dm_bio_detain(tc->pool->prison, &key, bio, &cell)) 935 + if (bio_detain(tc->pool, &key, bio, &cell)) 1010 936 return; 1011 937 1012 938 r = dm_thin_find_block(tc->td, block, 1, &lookup_result); ··· 1018 944 * on this block. 
1019 945 */ 1020 946 build_data_key(tc->td, lookup_result.block, &key2); 1021 - if (dm_bio_detain(tc->pool->prison, &key2, bio, &cell2)) { 947 + if (bio_detain(tc->pool, &key2, bio, &cell2)) { 1022 948 cell_defer_no_holder(tc, cell); 1023 949 break; 1024 950 } ··· 1094 1020 break; 1095 1021 1096 1022 case -ENOSPC: 1097 - no_space(cell); 1023 + no_space(tc->pool, cell); 1098 1024 break; 1099 1025 1100 1026 default: 1101 1027 DMERR_LIMIT("%s: alloc_data_block() failed: error = %d", 1102 1028 __func__, r); 1103 - dm_cell_error(cell); 1029 + cell_error(tc->pool, cell); 1104 1030 break; 1105 1031 } 1106 1032 } ··· 1118 1044 * of being broken so we have nothing further to do here. 1119 1045 */ 1120 1046 build_data_key(tc->td, lookup_result->block, &key); 1121 - if (dm_bio_detain(pool->prison, &key, bio, &cell)) 1047 + if (bio_detain(pool, &key, bio, &cell)) 1122 1048 return; 1123 1049 1124 1050 if (bio_data_dir(bio) == WRITE && bio->bi_size) ··· 1139 1065 { 1140 1066 int r; 1141 1067 dm_block_t data_block; 1068 + struct pool *pool = tc->pool; 1142 1069 1143 1070 /* 1144 1071 * Remap empty bios (flushes) immediately, without provisioning. 
1145 1072 */ 1146 1073 if (!bio->bi_size) { 1147 - inc_all_io_entry(tc->pool, bio); 1074 + inc_all_io_entry(pool, bio); 1148 1075 cell_defer_no_holder(tc, cell); 1149 1076 1150 1077 remap_and_issue(tc, bio, 0); ··· 1172 1097 break; 1173 1098 1174 1099 case -ENOSPC: 1175 - no_space(cell); 1100 + no_space(pool, cell); 1176 1101 break; 1177 1102 1178 1103 default: 1179 1104 DMERR_LIMIT("%s: alloc_data_block() failed: error = %d", 1180 1105 __func__, r); 1181 - set_pool_mode(tc->pool, PM_READ_ONLY); 1182 - dm_cell_error(cell); 1106 + set_pool_mode(pool, PM_READ_ONLY); 1107 + cell_error(pool, cell); 1183 1108 break; 1184 1109 } 1185 1110 } ··· 1187 1112 static void process_bio(struct thin_c *tc, struct bio *bio) 1188 1113 { 1189 1114 int r; 1115 + struct pool *pool = tc->pool; 1190 1116 dm_block_t block = get_bio_block(tc, bio); 1191 1117 struct dm_bio_prison_cell *cell; 1192 1118 struct dm_cell_key key; ··· 1198 1122 * being provisioned so we have nothing further to do here. 1199 1123 */ 1200 1124 build_virtual_key(tc->td, block, &key); 1201 - if (dm_bio_detain(tc->pool->prison, &key, bio, &cell)) 1125 + if (bio_detain(pool, &key, bio, &cell)) 1202 1126 return; 1203 1127 1204 1128 r = dm_thin_find_block(tc->td, block, 1, &lookup_result); ··· 1206 1130 case 0: 1207 1131 if (lookup_result.shared) { 1208 1132 process_shared_bio(tc, bio, block, &lookup_result); 1209 - cell_defer_no_holder(tc, cell); 1133 + cell_defer_no_holder(tc, cell); /* FIXME: pass this cell into process_shared? 
*/ 1210 1134 } else { 1211 - inc_all_io_entry(tc->pool, bio); 1135 + inc_all_io_entry(pool, bio); 1212 1136 cell_defer_no_holder(tc, cell); 1213 1137 1214 1138 remap_and_issue(tc, bio, lookup_result.block); ··· 1217 1141 1218 1142 case -ENODATA: 1219 1143 if (bio_data_dir(bio) == READ && tc->origin_dev) { 1220 - inc_all_io_entry(tc->pool, bio); 1144 + inc_all_io_entry(pool, bio); 1221 1145 cell_defer_no_holder(tc, cell); 1222 1146 1223 1147 remap_to_origin_and_issue(tc, bio); ··· 1454 1378 dm_block_t block = get_bio_block(tc, bio); 1455 1379 struct dm_thin_device *td = tc->td; 1456 1380 struct dm_thin_lookup_result result; 1457 - struct dm_bio_prison_cell *cell1, *cell2; 1381 + struct dm_bio_prison_cell cell1, cell2; 1382 + struct dm_bio_prison_cell *cell_result; 1458 1383 struct dm_cell_key key; 1459 1384 1460 1385 thin_hook_bio(tc, bio); ··· 1497 1420 } 1498 1421 1499 1422 build_virtual_key(tc->td, block, &key); 1500 - if (dm_bio_detain(tc->pool->prison, &key, bio, &cell1)) 1423 + if (dm_bio_detain(tc->pool->prison, &key, bio, &cell1, &cell_result)) 1501 1424 return DM_MAPIO_SUBMITTED; 1502 1425 1503 1426 build_data_key(tc->td, result.block, &key); 1504 - if (dm_bio_detain(tc->pool->prison, &key, bio, &cell2)) { 1505 - cell_defer_no_holder(tc, cell1); 1427 + if (dm_bio_detain(tc->pool->prison, &key, bio, &cell2, &cell_result)) { 1428 + cell_defer_no_holder_no_free(tc, &cell1); 1506 1429 return DM_MAPIO_SUBMITTED; 1507 1430 } 1508 1431 1509 1432 inc_all_io_entry(tc->pool, bio); 1510 - cell_defer_no_holder(tc, cell2); 1511 - cell_defer_no_holder(tc, cell1); 1433 + cell_defer_no_holder_no_free(tc, &cell2); 1434 + cell_defer_no_holder_no_free(tc, &cell1); 1512 1435 1513 1436 remap(tc, bio, result.block); 1514 1437 return DM_MAPIO_REMAPPED; ··· 1713 1636 goto bad_prison; 1714 1637 } 1715 1638 1716 - pool->copier = dm_kcopyd_client_create(); 1639 + pool->copier = dm_kcopyd_client_create(&dm_kcopyd_throttle); 1717 1640 if (IS_ERR(pool->copier)) { 1718 1641 r = 
PTR_ERR(pool->copier); 1719 1642 *error = "Error creating pool's kcopyd client"; ··· 2015 1938 pt->data_dev = data_dev; 2016 1939 pt->low_water_blocks = low_water_blocks; 2017 1940 pt->adjusted_pf = pt->requested_pf = pf; 2018 - ti->num_flush_requests = 1; 1941 + ti->num_flush_bios = 1; 2019 1942 2020 1943 /* 2021 1944 * Only need to enable discards if the pool should pass ··· 2023 1946 * processing will cause mappings to be removed from the btree. 2024 1947 */ 2025 1948 if (pf.discard_enabled && pf.discard_passdown) { 2026 - ti->num_discard_requests = 1; 1949 + ti->num_discard_bios = 1; 2027 1950 2028 1951 /* 2029 1952 * Setting 'discards_supported' circumvents the normal ··· 2376 2299 * <transaction id> <used metadata sectors>/<total metadata sectors> 2377 2300 * <used data sectors>/<total data sectors> <held metadata root> 2378 2301 */ 2379 - static int pool_status(struct dm_target *ti, status_type_t type, 2380 - unsigned status_flags, char *result, unsigned maxlen) 2302 + static void pool_status(struct dm_target *ti, status_type_t type, 2303 + unsigned status_flags, char *result, unsigned maxlen) 2381 2304 { 2382 2305 int r; 2383 2306 unsigned sz = 0; ··· 2403 2326 if (!(status_flags & DM_STATUS_NOFLUSH_FLAG) && !dm_suspended(ti)) 2404 2327 (void) commit_or_fallback(pool); 2405 2328 2406 - r = dm_pool_get_metadata_transaction_id(pool->pmd, 2407 - &transaction_id); 2408 - if (r) 2409 - return r; 2329 + r = dm_pool_get_metadata_transaction_id(pool->pmd, &transaction_id); 2330 + if (r) { 2331 + DMERR("dm_pool_get_metadata_transaction_id returned %d", r); 2332 + goto err; 2333 + } 2410 2334 2411 - r = dm_pool_get_free_metadata_block_count(pool->pmd, 2412 - &nr_free_blocks_metadata); 2413 - if (r) 2414 - return r; 2335 + r = dm_pool_get_free_metadata_block_count(pool->pmd, &nr_free_blocks_metadata); 2336 + if (r) { 2337 + DMERR("dm_pool_get_free_metadata_block_count returned %d", r); 2338 + goto err; 2339 + } 2415 2340 2416 2341 r = 
dm_pool_get_metadata_dev_size(pool->pmd, &nr_blocks_metadata); 2417 - if (r) 2418 - return r; 2342 + if (r) { 2343 + DMERR("dm_pool_get_metadata_dev_size returned %d", r); 2344 + goto err; 2345 + } 2419 2346 2420 - r = dm_pool_get_free_block_count(pool->pmd, 2421 - &nr_free_blocks_data); 2422 - if (r) 2423 - return r; 2347 + r = dm_pool_get_free_block_count(pool->pmd, &nr_free_blocks_data); 2348 + if (r) { 2349 + DMERR("dm_pool_get_free_block_count returned %d", r); 2350 + goto err; 2351 + } 2424 2352 2425 2353 r = dm_pool_get_data_dev_size(pool->pmd, &nr_blocks_data); 2426 - if (r) 2427 - return r; 2354 + if (r) { 2355 + DMERR("dm_pool_get_data_dev_size returned %d", r); 2356 + goto err; 2357 + } 2428 2358 2429 2359 r = dm_pool_get_metadata_snap(pool->pmd, &held_root); 2430 - if (r) 2431 - return r; 2360 + if (r) { 2361 + DMERR("dm_pool_get_metadata_snap returned %d", r); 2362 + goto err; 2363 + } 2432 2364 2433 2365 DMEMIT("%llu %llu/%llu %llu/%llu ", 2434 2366 (unsigned long long)transaction_id, ··· 2474 2388 emit_flags(&pt->requested_pf, result, sz, maxlen); 2475 2389 break; 2476 2390 } 2391 + return; 2477 2392 2478 - return 0; 2393 + err: 2394 + DMEMIT("Error"); 2479 2395 } 2480 2396 2481 2397 static int pool_iterate_devices(struct dm_target *ti, ··· 2502 2414 return min(max_size, q->merge_bvec_fn(q, bvm, biovec)); 2503 2415 } 2504 2416 2505 - static bool block_size_is_power_of_two(struct pool *pool) 2506 - { 2507 - return pool->sectors_per_block_shift >= 0; 2508 - } 2509 - 2510 2417 static void set_discard_limits(struct pool_c *pt, struct queue_limits *limits) 2511 2418 { 2512 2419 struct pool *pool = pt->pool; ··· 2515 2432 if (pt->adjusted_pf.discard_passdown) { 2516 2433 data_limits = &bdev_get_queue(pt->data_dev->bdev)->limits; 2517 2434 limits->discard_granularity = data_limits->discard_granularity; 2518 - } else if (block_size_is_power_of_two(pool)) 2435 + } else 2519 2436 limits->discard_granularity = pool->sectors_per_block << SECTOR_SHIFT; 2520 - 
 	else
-		/*
-		 * Use largest power of 2 that is a factor of sectors_per_block
-		 * but at least DATA_DEV_BLOCK_SIZE_MIN_SECTORS.
-		 */
-		limits->discard_granularity = max(1 << (ffs(pool->sectors_per_block) - 1),
-						  DATA_DEV_BLOCK_SIZE_MIN_SECTORS) << SECTOR_SHIFT;
 }
 
 static void pool_io_hints(struct dm_target *ti, struct queue_limits *limits)
···
 	.name = "thin-pool",
 	.features = DM_TARGET_SINGLETON | DM_TARGET_ALWAYS_WRITEABLE |
 		    DM_TARGET_IMMUTABLE,
-	.version = {1, 6, 0},
+	.version = {1, 6, 1},
 	.module = THIS_MODULE,
 	.ctr = pool_ctr,
 	.dtr = pool_dtr,
···
 	if (r)
 		goto bad_thin_open;
 
-	ti->num_flush_requests = 1;
+	ti->num_flush_bios = 1;
 	ti->flush_supported = true;
 	ti->per_bio_data_size = sizeof(struct dm_thin_endio_hook);
 
 	/* In case the pool supports discards, pass them on. */
 	if (tc->pool->pf.discard_enabled) {
 		ti->discards_supported = true;
-		ti->num_discard_requests = 1;
+		ti->num_discard_bios = 1;
 		ti->discard_zeroes_data_unsupported = true;
-		/* Discard requests must be split on a block boundary */
-		ti->split_discard_requests = true;
+		/* Discard bios must be split on a block boundary */
+		ti->split_discard_bios = true;
 	}
 
 	dm_put(pool_md);
···
 /*
  * <nr mapped sectors> <highest mapped sector>
  */
-static int thin_status(struct dm_target *ti, status_type_t type,
-		       unsigned status_flags, char *result, unsigned maxlen)
+static void thin_status(struct dm_target *ti, status_type_t type,
+			unsigned status_flags, char *result, unsigned maxlen)
 {
 	int r;
 	ssize_t sz = 0;
···
 	if (get_pool_mode(tc->pool) == PM_FAIL) {
 		DMEMIT("Fail");
-		return 0;
+		return;
 	}
 
 	if (!tc->td)
···
 	switch (type) {
 	case STATUSTYPE_INFO:
 		r = dm_thin_get_mapped_count(tc->td, &mapped);
-		if (r)
-			return r;
+		if (r) {
+			DMERR("dm_thin_get_mapped_count returned %d", r);
+			goto err;
+		}
 
 		r = dm_thin_get_highest_mapped_block(tc->td, &highest);
-		if (r < 0)
-			return r;
+		if (r < 0) {
+			DMERR("dm_thin_get_highest_mapped_block returned %d", r);
+			goto err;
+		}
 
 		DMEMIT("%llu ", mapped * tc->pool->sectors_per_block);
 		if (r)
···
 	}
 	}
 
-	return 0;
+	return;
+
+err:
+	DMEMIT("Error");
 }
 
 static int thin_iterate_devices(struct dm_target *ti,
···
 static struct target_type thin_target = {
 	.name = "thin",
-	.version = {1, 7, 0},
+	.version = {1, 7, 1},
 	.module = THIS_MODULE,
 	.ctr = thin_ctr,
 	.dtr = thin_dtr,
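The status functions in this series change from returning an errno to returning void: a metadata failure is now reported by emitting "Error" into the status buffer itself. A minimal userspace sketch of that pattern (DMEMIT here is a snprintf stand-in for the kernel macro, and get_mapped_count() is a hypothetical query, not a kernel function):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Userspace stand-in for the kernel's DMEMIT macro. */
#define DMEMIT(fmt, ...) \
	(sz += snprintf(result + sz, maxlen > sz ? maxlen - sz : 0, fmt, ##__VA_ARGS__))

/* Hypothetical metadata query; returns nonzero on failure. */
static int get_mapped_count(unsigned long long *mapped, int fail)
{
	*mapped = 1024;
	return fail ? -5 : 0;
}

/* Status emitters no longer propagate an errno: on failure they
 * emit "Error" into the result buffer and simply return. */
static void thin_status_sketch(char *result, unsigned maxlen, int fail)
{
	size_t sz = 0;
	unsigned long long mapped;

	if (get_mapped_count(&mapped, fail))
		goto err;

	DMEMIT("%llu mapped", mapped);
	return;

err:
	DMEMIT("Error");
}
```

This is what lets the `.status` hook become `void` in thin, verity and the other targets below.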
+3 -5
drivers/md/dm-verity.c
···
 /*
  * Status: V (valid) or C (corruption found)
  */
-static int verity_status(struct dm_target *ti, status_type_t type,
-			 unsigned status_flags, char *result, unsigned maxlen)
+static void verity_status(struct dm_target *ti, status_type_t type,
+			  unsigned status_flags, char *result, unsigned maxlen)
 {
 	struct dm_verity *v = ti->private;
 	unsigned sz = 0;
···
 			DMEMIT("%02x", v->salt[x]);
 		break;
 	}
-
-	return 0;
 }
 
 static int verity_ioctl(struct dm_target *ti, unsigned cmd,
···
 
 static struct target_type verity_target = {
 	.name = "verity",
-	.version = {1, 1, 0},
+	.version = {1, 1, 1},
 	.module = THIS_MODULE,
 	.ctr = verity_ctr,
 	.dtr = verity_dtr,
+1 -1
drivers/md/dm-zero.c
···
 	/*
 	 * Silently drop discards, avoiding -EOPNOTSUPP.
 	 */
-	ti->num_discard_requests = 1;
+	ti->num_discard_bios = 1;
 
 	return 0;
 }
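One refactor in the dm.c diff below pulls the bvec-accumulation loop out of the old __clone_and_map() into __len_within_target(): walk the bio's vectors from the current index, stopping before the first one that would exceed the target's remaining length. A hedged userspace sketch of that loop over a plain array of lengths (the types are simplified stand-ins, not the kernel's bio structures):

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of __len_within_target(): sum entry lengths starting at *idx,
 * stopping before the first entry that would overshoot 'max'.  Returns
 * the total length consumed and advances *idx past the entries taken. */
static unsigned long len_within_target(const unsigned long *bv_len,
				       size_t nr, unsigned long max,
				       size_t *idx)
{
	unsigned long total_len = 0;

	for (; max && *idx < nr; (*idx)++) {
		unsigned long len = bv_len[*idx];

		if (len > max)
			break;

		max -= len;
		total_len += len;
	}

	return total_len;
}
```

The caller then clones one bio covering the accumulated length, exactly as __split_and_process_non_flush() does with the returned idx.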
+241 -213
drivers/md/dm.c
··· 163 163 * io objects are allocated from here. 164 164 */ 165 165 mempool_t *io_pool; 166 - mempool_t *tio_pool; 167 166 168 167 struct bio_set *bs; 169 168 ··· 196 197 */ 197 198 struct dm_md_mempools { 198 199 mempool_t *io_pool; 199 - mempool_t *tio_pool; 200 200 struct bio_set *bs; 201 201 }; 202 202 203 203 #define MIN_IOS 256 204 204 static struct kmem_cache *_io_cache; 205 205 static struct kmem_cache *_rq_tio_cache; 206 - 207 - /* 208 - * Unused now, and needs to be deleted. But since io_pool is overloaded and it's 209 - * still used for _io_cache, I'm leaving this for a later cleanup 210 - */ 211 - static struct kmem_cache *_rq_bio_info_cache; 212 206 213 207 static int __init local_init(void) 214 208 { ··· 216 224 if (!_rq_tio_cache) 217 225 goto out_free_io_cache; 218 226 219 - _rq_bio_info_cache = KMEM_CACHE(dm_rq_clone_bio_info, 0); 220 - if (!_rq_bio_info_cache) 221 - goto out_free_rq_tio_cache; 222 - 223 227 r = dm_uevent_init(); 224 228 if (r) 225 - goto out_free_rq_bio_info_cache; 229 + goto out_free_rq_tio_cache; 226 230 227 231 _major = major; 228 232 r = register_blkdev(_major, _name); ··· 232 244 233 245 out_uevent_exit: 234 246 dm_uevent_exit(); 235 - out_free_rq_bio_info_cache: 236 - kmem_cache_destroy(_rq_bio_info_cache); 237 247 out_free_rq_tio_cache: 238 248 kmem_cache_destroy(_rq_tio_cache); 239 249 out_free_io_cache: ··· 242 256 243 257 static void local_exit(void) 244 258 { 245 - kmem_cache_destroy(_rq_bio_info_cache); 246 259 kmem_cache_destroy(_rq_tio_cache); 247 260 kmem_cache_destroy(_io_cache); 248 261 unregister_blkdev(_major, _name); ··· 433 448 static struct dm_rq_target_io *alloc_rq_tio(struct mapped_device *md, 434 449 gfp_t gfp_mask) 435 450 { 436 - return mempool_alloc(md->tio_pool, gfp_mask); 451 + return mempool_alloc(md->io_pool, gfp_mask); 437 452 } 438 453 439 454 static void free_rq_tio(struct dm_rq_target_io *tio) 440 455 { 441 - mempool_free(tio, tio->md->tio_pool); 456 + mempool_free(tio, tio->md->io_pool); 442 
457 } 443 458 444 459 static int md_in_flight(struct mapped_device *md) ··· 970 985 } 971 986 EXPORT_SYMBOL_GPL(dm_set_target_max_io_len); 972 987 973 - static void __map_bio(struct dm_target *ti, struct dm_target_io *tio) 988 + static void __map_bio(struct dm_target_io *tio) 974 989 { 975 990 int r; 976 991 sector_t sector; 977 992 struct mapped_device *md; 978 993 struct bio *clone = &tio->clone; 994 + struct dm_target *ti = tio->ti; 979 995 980 996 clone->bi_end_io = clone_endio; 981 997 clone->bi_private = tio; ··· 1017 1031 unsigned short idx; 1018 1032 }; 1019 1033 1034 + static void bio_setup_sector(struct bio *bio, sector_t sector, sector_t len) 1035 + { 1036 + bio->bi_sector = sector; 1037 + bio->bi_size = to_bytes(len); 1038 + } 1039 + 1040 + static void bio_setup_bv(struct bio *bio, unsigned short idx, unsigned short bv_count) 1041 + { 1042 + bio->bi_idx = idx; 1043 + bio->bi_vcnt = idx + bv_count; 1044 + bio->bi_flags &= ~(1 << BIO_SEG_VALID); 1045 + } 1046 + 1047 + static void clone_bio_integrity(struct bio *bio, struct bio *clone, 1048 + unsigned short idx, unsigned len, unsigned offset, 1049 + unsigned trim) 1050 + { 1051 + if (!bio_integrity(bio)) 1052 + return; 1053 + 1054 + bio_integrity_clone(clone, bio, GFP_NOIO); 1055 + 1056 + if (trim) 1057 + bio_integrity_trim(clone, bio_sector_offset(bio, idx, offset), len); 1058 + } 1059 + 1020 1060 /* 1021 1061 * Creates a little bio that just does part of a bvec. 
1022 1062 */ 1023 - static void split_bvec(struct dm_target_io *tio, struct bio *bio, 1024 - sector_t sector, unsigned short idx, unsigned int offset, 1025 - unsigned int len, struct bio_set *bs) 1063 + static void clone_split_bio(struct dm_target_io *tio, struct bio *bio, 1064 + sector_t sector, unsigned short idx, 1065 + unsigned offset, unsigned len) 1026 1066 { 1027 1067 struct bio *clone = &tio->clone; 1028 1068 struct bio_vec *bv = bio->bi_io_vec + idx; 1029 1069 1030 1070 *clone->bi_io_vec = *bv; 1031 1071 1032 - clone->bi_sector = sector; 1072 + bio_setup_sector(clone, sector, len); 1073 + 1033 1074 clone->bi_bdev = bio->bi_bdev; 1034 1075 clone->bi_rw = bio->bi_rw; 1035 1076 clone->bi_vcnt = 1; 1036 - clone->bi_size = to_bytes(len); 1037 1077 clone->bi_io_vec->bv_offset = offset; 1038 1078 clone->bi_io_vec->bv_len = clone->bi_size; 1039 1079 clone->bi_flags |= 1 << BIO_CLONED; 1040 1080 1041 - if (bio_integrity(bio)) { 1042 - bio_integrity_clone(clone, bio, GFP_NOIO); 1043 - bio_integrity_trim(clone, 1044 - bio_sector_offset(bio, idx, offset), len); 1045 - } 1081 + clone_bio_integrity(bio, clone, idx, len, offset, 1); 1046 1082 } 1047 1083 1048 1084 /* ··· 1072 1064 */ 1073 1065 static void clone_bio(struct dm_target_io *tio, struct bio *bio, 1074 1066 sector_t sector, unsigned short idx, 1075 - unsigned short bv_count, unsigned int len, 1076 - struct bio_set *bs) 1067 + unsigned short bv_count, unsigned len) 1077 1068 { 1078 1069 struct bio *clone = &tio->clone; 1070 + unsigned trim = 0; 1079 1071 1080 1072 __bio_clone(clone, bio); 1081 - clone->bi_sector = sector; 1082 - clone->bi_idx = idx; 1083 - clone->bi_vcnt = idx + bv_count; 1084 - clone->bi_size = to_bytes(len); 1085 - clone->bi_flags &= ~(1 << BIO_SEG_VALID); 1073 + bio_setup_sector(clone, sector, len); 1074 + bio_setup_bv(clone, idx, bv_count); 1086 1075 1087 - if (bio_integrity(bio)) { 1088 - bio_integrity_clone(clone, bio, GFP_NOIO); 1089 - 1090 - if (idx != bio->bi_idx || clone->bi_size < 
bio->bi_size) 1091 - bio_integrity_trim(clone, 1092 - bio_sector_offset(bio, idx, 0), len); 1093 - } 1076 + if (idx != bio->bi_idx || clone->bi_size < bio->bi_size) 1077 + trim = 1; 1078 + clone_bio_integrity(bio, clone, idx, len, 0, trim); 1094 1079 } 1095 1080 1096 1081 static struct dm_target_io *alloc_tio(struct clone_info *ci, 1097 - struct dm_target *ti, int nr_iovecs) 1082 + struct dm_target *ti, int nr_iovecs, 1083 + unsigned target_bio_nr) 1098 1084 { 1099 1085 struct dm_target_io *tio; 1100 1086 struct bio *clone; ··· 1099 1097 tio->io = ci->io; 1100 1098 tio->ti = ti; 1101 1099 memset(&tio->info, 0, sizeof(tio->info)); 1102 - tio->target_request_nr = 0; 1100 + tio->target_bio_nr = target_bio_nr; 1103 1101 1104 1102 return tio; 1105 1103 } 1106 1104 1107 - static void __issue_target_request(struct clone_info *ci, struct dm_target *ti, 1108 - unsigned request_nr, sector_t len) 1105 + static void __clone_and_map_simple_bio(struct clone_info *ci, 1106 + struct dm_target *ti, 1107 + unsigned target_bio_nr, sector_t len) 1109 1108 { 1110 - struct dm_target_io *tio = alloc_tio(ci, ti, ci->bio->bi_max_vecs); 1109 + struct dm_target_io *tio = alloc_tio(ci, ti, ci->bio->bi_max_vecs, target_bio_nr); 1111 1110 struct bio *clone = &tio->clone; 1112 - 1113 - tio->target_request_nr = request_nr; 1114 1111 1115 1112 /* 1116 1113 * Discard requests require the bio's inline iovecs be initialized. 1117 1114 * ci->bio->bi_max_vecs is BIO_INLINE_VECS anyway, for both flush 1118 1115 * and discard, so no need for concern about wasted bvec allocations. 
1119 1116 */ 1120 - 1121 1117 __bio_clone(clone, ci->bio); 1122 - if (len) { 1123 - clone->bi_sector = ci->sector; 1124 - clone->bi_size = to_bytes(len); 1125 - } 1118 + if (len) 1119 + bio_setup_sector(clone, ci->sector, len); 1126 1120 1127 - __map_bio(ti, tio); 1121 + __map_bio(tio); 1128 1122 } 1129 1123 1130 - static void __issue_target_requests(struct clone_info *ci, struct dm_target *ti, 1131 - unsigned num_requests, sector_t len) 1124 + static void __send_duplicate_bios(struct clone_info *ci, struct dm_target *ti, 1125 + unsigned num_bios, sector_t len) 1132 1126 { 1133 - unsigned request_nr; 1127 + unsigned target_bio_nr; 1134 1128 1135 - for (request_nr = 0; request_nr < num_requests; request_nr++) 1136 - __issue_target_request(ci, ti, request_nr, len); 1129 + for (target_bio_nr = 0; target_bio_nr < num_bios; target_bio_nr++) 1130 + __clone_and_map_simple_bio(ci, ti, target_bio_nr, len); 1137 1131 } 1138 1132 1139 - static int __clone_and_map_empty_flush(struct clone_info *ci) 1133 + static int __send_empty_flush(struct clone_info *ci) 1140 1134 { 1141 1135 unsigned target_nr = 0; 1142 1136 struct dm_target *ti; 1143 1137 1144 1138 BUG_ON(bio_has_data(ci->bio)); 1145 1139 while ((ti = dm_table_get_target(ci->map, target_nr++))) 1146 - __issue_target_requests(ci, ti, ti->num_flush_requests, 0); 1140 + __send_duplicate_bios(ci, ti, ti->num_flush_bios, 0); 1147 1141 1148 1142 return 0; 1149 1143 } 1150 1144 1151 - /* 1152 - * Perform all io with a single clone. 
1153 - */ 1154 - static void __clone_and_map_simple(struct clone_info *ci, struct dm_target *ti) 1145 + static void __clone_and_map_data_bio(struct clone_info *ci, struct dm_target *ti, 1146 + sector_t sector, int nr_iovecs, 1147 + unsigned short idx, unsigned short bv_count, 1148 + unsigned offset, unsigned len, 1149 + unsigned split_bvec) 1155 1150 { 1156 1151 struct bio *bio = ci->bio; 1157 1152 struct dm_target_io *tio; 1153 + unsigned target_bio_nr; 1154 + unsigned num_target_bios = 1; 1158 1155 1159 - tio = alloc_tio(ci, ti, bio->bi_max_vecs); 1160 - clone_bio(tio, bio, ci->sector, ci->idx, bio->bi_vcnt - ci->idx, 1161 - ci->sector_count, ci->md->bs); 1162 - __map_bio(ti, tio); 1163 - ci->sector_count = 0; 1156 + /* 1157 + * Does the target want to receive duplicate copies of the bio? 1158 + */ 1159 + if (bio_data_dir(bio) == WRITE && ti->num_write_bios) 1160 + num_target_bios = ti->num_write_bios(ti, bio); 1161 + 1162 + for (target_bio_nr = 0; target_bio_nr < num_target_bios; target_bio_nr++) { 1163 + tio = alloc_tio(ci, ti, nr_iovecs, target_bio_nr); 1164 + if (split_bvec) 1165 + clone_split_bio(tio, bio, sector, idx, offset, len); 1166 + else 1167 + clone_bio(tio, bio, sector, idx, bv_count, len); 1168 + __map_bio(tio); 1169 + } 1164 1170 } 1165 1171 1166 - typedef unsigned (*get_num_requests_fn)(struct dm_target *ti); 1172 + typedef unsigned (*get_num_bios_fn)(struct dm_target *ti); 1167 1173 1168 - static unsigned get_num_discard_requests(struct dm_target *ti) 1174 + static unsigned get_num_discard_bios(struct dm_target *ti) 1169 1175 { 1170 - return ti->num_discard_requests; 1176 + return ti->num_discard_bios; 1171 1177 } 1172 1178 1173 - static unsigned get_num_write_same_requests(struct dm_target *ti) 1179 + static unsigned get_num_write_same_bios(struct dm_target *ti) 1174 1180 { 1175 - return ti->num_write_same_requests; 1181 + return ti->num_write_same_bios; 1176 1182 } 1177 1183 1178 1184 typedef bool (*is_split_required_fn)(struct dm_target *ti); 
1179 1185 1180 1186 static bool is_split_required_for_discard(struct dm_target *ti) 1181 1187 { 1182 - return ti->split_discard_requests; 1188 + return ti->split_discard_bios; 1183 1189 } 1184 1190 1185 - static int __clone_and_map_changing_extent_only(struct clone_info *ci, 1186 - get_num_requests_fn get_num_requests, 1187 - is_split_required_fn is_split_required) 1191 + static int __send_changing_extent_only(struct clone_info *ci, 1192 + get_num_bios_fn get_num_bios, 1193 + is_split_required_fn is_split_required) 1188 1194 { 1189 1195 struct dm_target *ti; 1190 1196 sector_t len; 1191 - unsigned num_requests; 1197 + unsigned num_bios; 1192 1198 1193 1199 do { 1194 1200 ti = dm_table_find_target(ci->map, ci->sector); ··· 1209 1199 * reconfiguration might also have changed that since the 1210 1200 * check was performed. 1211 1201 */ 1212 - num_requests = get_num_requests ? get_num_requests(ti) : 0; 1213 - if (!num_requests) 1202 + num_bios = get_num_bios ? get_num_bios(ti) : 0; 1203 + if (!num_bios) 1214 1204 return -EOPNOTSUPP; 1215 1205 1216 1206 if (is_split_required && !is_split_required(ti)) ··· 1218 1208 else 1219 1209 len = min(ci->sector_count, max_io_len(ci->sector, ti)); 1220 1210 1221 - __issue_target_requests(ci, ti, num_requests, len); 1211 + __send_duplicate_bios(ci, ti, num_bios, len); 1222 1212 1223 1213 ci->sector += len; 1224 1214 } while (ci->sector_count -= len); ··· 1226 1216 return 0; 1227 1217 } 1228 1218 1229 - static int __clone_and_map_discard(struct clone_info *ci) 1219 + static int __send_discard(struct clone_info *ci) 1230 1220 { 1231 - return __clone_and_map_changing_extent_only(ci, get_num_discard_requests, 1232 - is_split_required_for_discard); 1221 + return __send_changing_extent_only(ci, get_num_discard_bios, 1222 + is_split_required_for_discard); 1233 1223 } 1234 1224 1235 - static int __clone_and_map_write_same(struct clone_info *ci) 1225 + static int __send_write_same(struct clone_info *ci) 1236 1226 { 1237 - return 
__clone_and_map_changing_extent_only(ci, get_num_write_same_requests, NULL); 1227 + return __send_changing_extent_only(ci, get_num_write_same_bios, NULL); 1238 1228 } 1239 1229 1240 - static int __clone_and_map(struct clone_info *ci) 1230 + /* 1231 + * Find maximum number of sectors / bvecs we can process with a single bio. 1232 + */ 1233 + static sector_t __len_within_target(struct clone_info *ci, sector_t max, int *idx) 1234 + { 1235 + struct bio *bio = ci->bio; 1236 + sector_t bv_len, total_len = 0; 1237 + 1238 + for (*idx = ci->idx; max && (*idx < bio->bi_vcnt); (*idx)++) { 1239 + bv_len = to_sector(bio->bi_io_vec[*idx].bv_len); 1240 + 1241 + if (bv_len > max) 1242 + break; 1243 + 1244 + max -= bv_len; 1245 + total_len += bv_len; 1246 + } 1247 + 1248 + return total_len; 1249 + } 1250 + 1251 + static int __split_bvec_across_targets(struct clone_info *ci, 1252 + struct dm_target *ti, sector_t max) 1253 + { 1254 + struct bio *bio = ci->bio; 1255 + struct bio_vec *bv = bio->bi_io_vec + ci->idx; 1256 + sector_t remaining = to_sector(bv->bv_len); 1257 + unsigned offset = 0; 1258 + sector_t len; 1259 + 1260 + do { 1261 + if (offset) { 1262 + ti = dm_table_find_target(ci->map, ci->sector); 1263 + if (!dm_target_is_valid(ti)) 1264 + return -EIO; 1265 + 1266 + max = max_io_len(ci->sector, ti); 1267 + } 1268 + 1269 + len = min(remaining, max); 1270 + 1271 + __clone_and_map_data_bio(ci, ti, ci->sector, 1, ci->idx, 0, 1272 + bv->bv_offset + offset, len, 1); 1273 + 1274 + ci->sector += len; 1275 + ci->sector_count -= len; 1276 + offset += to_bytes(len); 1277 + } while (remaining -= len); 1278 + 1279 + ci->idx++; 1280 + 1281 + return 0; 1282 + } 1283 + 1284 + /* 1285 + * Select the correct strategy for processing a non-flush bio. 
1286 + */ 1287 + static int __split_and_process_non_flush(struct clone_info *ci) 1241 1288 { 1242 1289 struct bio *bio = ci->bio; 1243 1290 struct dm_target *ti; 1244 - sector_t len = 0, max; 1245 - struct dm_target_io *tio; 1291 + sector_t len, max; 1292 + int idx; 1246 1293 1247 1294 if (unlikely(bio->bi_rw & REQ_DISCARD)) 1248 - return __clone_and_map_discard(ci); 1295 + return __send_discard(ci); 1249 1296 else if (unlikely(bio->bi_rw & REQ_WRITE_SAME)) 1250 - return __clone_and_map_write_same(ci); 1297 + return __send_write_same(ci); 1251 1298 1252 1299 ti = dm_table_find_target(ci->map, ci->sector); 1253 1300 if (!dm_target_is_valid(ti)) ··· 1312 1245 1313 1246 max = max_io_len(ci->sector, ti); 1314 1247 1248 + /* 1249 + * Optimise for the simple case where we can do all of 1250 + * the remaining io with a single clone. 1251 + */ 1315 1252 if (ci->sector_count <= max) { 1316 - /* 1317 - * Optimise for the simple case where we can do all of 1318 - * the remaining io with a single clone. 1319 - */ 1320 - __clone_and_map_simple(ci, ti); 1253 + __clone_and_map_data_bio(ci, ti, ci->sector, bio->bi_max_vecs, 1254 + ci->idx, bio->bi_vcnt - ci->idx, 0, 1255 + ci->sector_count, 0); 1256 + ci->sector_count = 0; 1257 + return 0; 1258 + } 1321 1259 1322 - } else if (to_sector(bio->bi_io_vec[ci->idx].bv_len) <= max) { 1323 - /* 1324 - * There are some bvecs that don't span targets. 1325 - * Do as many of these as possible. 1326 - */ 1327 - int i; 1328 - sector_t remaining = max; 1329 - sector_t bv_len; 1260 + /* 1261 + * There are some bvecs that don't span targets. 1262 + * Do as many of these as possible. 
1263 + */ 1264 + if (to_sector(bio->bi_io_vec[ci->idx].bv_len) <= max) { 1265 + len = __len_within_target(ci, max, &idx); 1330 1266 1331 - for (i = ci->idx; remaining && (i < bio->bi_vcnt); i++) { 1332 - bv_len = to_sector(bio->bi_io_vec[i].bv_len); 1333 - 1334 - if (bv_len > remaining) 1335 - break; 1336 - 1337 - remaining -= bv_len; 1338 - len += bv_len; 1339 - } 1340 - 1341 - tio = alloc_tio(ci, ti, bio->bi_max_vecs); 1342 - clone_bio(tio, bio, ci->sector, ci->idx, i - ci->idx, len, 1343 - ci->md->bs); 1344 - __map_bio(ti, tio); 1267 + __clone_and_map_data_bio(ci, ti, ci->sector, bio->bi_max_vecs, 1268 + ci->idx, idx - ci->idx, 0, len, 0); 1345 1269 1346 1270 ci->sector += len; 1347 1271 ci->sector_count -= len; 1348 - ci->idx = i; 1272 + ci->idx = idx; 1349 1273 1350 - } else { 1351 - /* 1352 - * Handle a bvec that must be split between two or more targets. 1353 - */ 1354 - struct bio_vec *bv = bio->bi_io_vec + ci->idx; 1355 - sector_t remaining = to_sector(bv->bv_len); 1356 - unsigned int offset = 0; 1357 - 1358 - do { 1359 - if (offset) { 1360 - ti = dm_table_find_target(ci->map, ci->sector); 1361 - if (!dm_target_is_valid(ti)) 1362 - return -EIO; 1363 - 1364 - max = max_io_len(ci->sector, ti); 1365 - } 1366 - 1367 - len = min(remaining, max); 1368 - 1369 - tio = alloc_tio(ci, ti, 1); 1370 - split_bvec(tio, bio, ci->sector, ci->idx, 1371 - bv->bv_offset + offset, len, ci->md->bs); 1372 - 1373 - __map_bio(ti, tio); 1374 - 1375 - ci->sector += len; 1376 - ci->sector_count -= len; 1377 - offset += to_bytes(len); 1378 - } while (remaining -= len); 1379 - 1380 - ci->idx++; 1274 + return 0; 1381 1275 } 1382 1276 1383 - return 0; 1277 + /* 1278 + * Handle a bvec that must be split between two or more targets. 1279 + */ 1280 + return __split_bvec_across_targets(ci, ti, max); 1384 1281 } 1385 1282 1386 1283 /* 1387 - * Split the bio into several clones and submit it to targets. 1284 + * Entry point to split a bio into clones and submit them to the targets. 
1388 1285 */ 1389 1286 static void __split_and_process_bio(struct mapped_device *md, struct bio *bio) 1390 1287 { ··· 1372 1341 ci.idx = bio->bi_idx; 1373 1342 1374 1343 start_io_acct(ci.io); 1344 + 1375 1345 if (bio->bi_rw & REQ_FLUSH) { 1376 1346 ci.bio = &ci.md->flush_bio; 1377 1347 ci.sector_count = 0; 1378 - error = __clone_and_map_empty_flush(&ci); 1348 + error = __send_empty_flush(&ci); 1379 1349 /* dec_pending submits any data associated with flush */ 1380 1350 } else { 1381 1351 ci.bio = bio; 1382 1352 ci.sector_count = bio_sectors(bio); 1383 1353 while (ci.sector_count && !error) 1384 - error = __clone_and_map(&ci); 1354 + error = __split_and_process_non_flush(&ci); 1385 1355 } 1386 1356 1387 1357 /* drop the extra reference count */ ··· 1955 1923 unlock_fs(md); 1956 1924 bdput(md->bdev); 1957 1925 destroy_workqueue(md->wq); 1958 - if (md->tio_pool) 1959 - mempool_destroy(md->tio_pool); 1960 1926 if (md->io_pool) 1961 1927 mempool_destroy(md->io_pool); 1962 1928 if (md->bs) ··· 1977 1947 { 1978 1948 struct dm_md_mempools *p = dm_table_get_md_mempools(t); 1979 1949 1980 - if (md->io_pool && (md->tio_pool || dm_table_get_type(t) == DM_TYPE_BIO_BASED) && md->bs) { 1981 - /* 1982 - * The md already has necessary mempools. Reload just the 1983 - * bioset because front_pad may have changed because 1984 - * a different table was loaded. 1985 - */ 1986 - bioset_free(md->bs); 1987 - md->bs = p->bs; 1988 - p->bs = NULL; 1950 + if (md->io_pool && md->bs) { 1951 + /* The md already has necessary mempools. */ 1952 + if (dm_table_get_type(t) == DM_TYPE_BIO_BASED) { 1953 + /* 1954 + * Reload bioset because front_pad may have changed 1955 + * because a different table was loaded. 1956 + */ 1957 + bioset_free(md->bs); 1958 + md->bs = p->bs; 1959 + p->bs = NULL; 1960 + } else if (dm_table_get_type(t) == DM_TYPE_REQUEST_BASED) { 1961 + /* 1962 + * There's no need to reload with request-based dm 1963 + * because the size of front_pad doesn't change. 
1964 + * Note for future: If you are to reload bioset, 1965 + * prep-ed requests in the queue may refer 1966 + * to bio from the old bioset, so you must walk 1967 + * through the queue to unprep. 1968 + */ 1969 + } 1989 1970 goto out; 1990 1971 } 1991 1972 1992 - BUG_ON(!p || md->io_pool || md->tio_pool || md->bs); 1973 + BUG_ON(!p || md->io_pool || md->bs); 1993 1974 1994 1975 md->io_pool = p->io_pool; 1995 1976 p->io_pool = NULL; 1996 - md->tio_pool = p->tio_pool; 1997 - p->tio_pool = NULL; 1998 1977 md->bs = p->bs; 1999 1978 p->bs = NULL; 2000 1979 ··· 2434 2395 */ 2435 2396 struct dm_table *dm_swap_table(struct mapped_device *md, struct dm_table *table) 2436 2397 { 2437 - struct dm_table *live_map, *map = ERR_PTR(-EINVAL); 2398 + struct dm_table *live_map = NULL, *map = ERR_PTR(-EINVAL); 2438 2399 struct queue_limits limits; 2439 2400 int r; 2440 2401 ··· 2457 2418 dm_table_put(live_map); 2458 2419 } 2459 2420 2460 - r = dm_calculate_queue_limits(table, &limits); 2461 - if (r) { 2462 - map = ERR_PTR(r); 2463 - goto out; 2421 + if (!live_map) { 2422 + r = dm_calculate_queue_limits(table, &limits); 2423 + if (r) { 2424 + map = ERR_PTR(r); 2425 + goto out; 2426 + } 2464 2427 } 2465 2428 2466 2429 map = __bind(md, table, &limits); ··· 2760 2719 2761 2720 struct dm_md_mempools *dm_alloc_md_mempools(unsigned type, unsigned integrity, unsigned per_bio_data_size) 2762 2721 { 2763 - struct dm_md_mempools *pools = kmalloc(sizeof(*pools), GFP_KERNEL); 2764 - unsigned int pool_size = (type == DM_TYPE_BIO_BASED) ? 
16 : MIN_IOS; 2722 + struct dm_md_mempools *pools = kzalloc(sizeof(*pools), GFP_KERNEL); 2723 + struct kmem_cache *cachep; 2724 + unsigned int pool_size; 2725 + unsigned int front_pad; 2765 2726 2766 2727 if (!pools) 2767 2728 return NULL; 2768 2729 2769 - per_bio_data_size = roundup(per_bio_data_size, __alignof__(struct dm_target_io)); 2730 + if (type == DM_TYPE_BIO_BASED) { 2731 + cachep = _io_cache; 2732 + pool_size = 16; 2733 + front_pad = roundup(per_bio_data_size, __alignof__(struct dm_target_io)) + offsetof(struct dm_target_io, clone); 2734 + } else if (type == DM_TYPE_REQUEST_BASED) { 2735 + cachep = _rq_tio_cache; 2736 + pool_size = MIN_IOS; 2737 + front_pad = offsetof(struct dm_rq_clone_bio_info, clone); 2738 + /* per_bio_data_size is not used. See __bind_mempools(). */ 2739 + WARN_ON(per_bio_data_size != 0); 2740 + } else 2741 + goto out; 2770 2742 2771 - pools->io_pool = (type == DM_TYPE_BIO_BASED) ? 2772 - mempool_create_slab_pool(MIN_IOS, _io_cache) : 2773 - mempool_create_slab_pool(MIN_IOS, _rq_bio_info_cache); 2743 + pools->io_pool = mempool_create_slab_pool(MIN_IOS, cachep); 2774 2744 if (!pools->io_pool) 2775 - goto free_pools_and_out; 2745 + goto out; 2776 2746 2777 - pools->tio_pool = NULL; 2778 - if (type == DM_TYPE_REQUEST_BASED) { 2779 - pools->tio_pool = mempool_create_slab_pool(MIN_IOS, _rq_tio_cache); 2780 - if (!pools->tio_pool) 2781 - goto free_io_pool_and_out; 2782 - } 2783 - 2784 - pools->bs = (type == DM_TYPE_BIO_BASED) ? 
2785 - bioset_create(pool_size, 2786 - per_bio_data_size + offsetof(struct dm_target_io, clone)) : 2787 - bioset_create(pool_size, 2788 - offsetof(struct dm_rq_clone_bio_info, clone)); 2747 + pools->bs = bioset_create(pool_size, front_pad); 2789 2748 if (!pools->bs) 2790 - goto free_tio_pool_and_out; 2749 + goto out; 2791 2750 2792 2751 if (integrity && bioset_integrity_create(pools->bs, pool_size)) 2793 - goto free_bioset_and_out; 2752 + goto out; 2794 2753 2795 2754 return pools; 2796 2755 2797 - free_bioset_and_out: 2798 - bioset_free(pools->bs); 2799 - 2800 - free_tio_pool_and_out: 2801 - if (pools->tio_pool) 2802 - mempool_destroy(pools->tio_pool); 2803 - 2804 - free_io_pool_and_out: 2805 - mempool_destroy(pools->io_pool); 2806 - 2807 - free_pools_and_out: 2808 - kfree(pools); 2756 + out: 2757 + dm_free_md_mempools(pools); 2809 2758 2810 2759 return NULL; 2811 2760 } ··· 2807 2776 2808 2777 if (pools->io_pool) 2809 2778 mempool_destroy(pools->io_pool); 2810 - 2811 - if (pools->tio_pool) 2812 - mempool_destroy(pools->tio_pool); 2813 2779 2814 2780 if (pools->bs) 2815 2781 bioset_free(pools->bs);
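dm_alloc_md_mempools() above sizes the bioset front pad so that the per-bio data, rounded up to the alignment of struct dm_target_io, sits in memory just in front of the embedded clone bio. The arithmetic can be checked in userspace (the struct below is a hypothetical stand-in with a similar shape, not the kernel definition):

```c
#include <assert.h>
#include <stddef.h>

/* Round x up to a multiple of a (a power of two), as the kernel's
 * roundup() does for the per-bio data size. */
#define ROUNDUP(x, a) ((((x) + (a) - 1) / (a)) * (a))

/* Hypothetical stand-in: per-bio data lives before this struct, so the
 * front pad must cover the aligned per-bio data plus everything up to
 * the embedded clone member. */
struct target_io_sketch {
	void *io;
	void *ti;
	unsigned target_bio_nr;
	long clone;		/* placeholder for the embedded struct bio */
};

static size_t front_pad(size_t per_bio_data_size)
{
	return ROUNDUP(per_bio_data_size, __alignof__(struct target_io_sketch))
		+ offsetof(struct target_io_sketch, clone);
}
```

With per_bio_data_size == 0 this degenerates to just the offset of the clone member, which is the request-based case in the diff above.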
+1 -1
drivers/md/persistent-data/Kconfig
···
 config DM_PERSISTENT_DATA
 	tristate
-	depends on BLK_DEV_DM && EXPERIMENTAL
+	depends on BLK_DEV_DM
 	select LIBCRC32C
 	select DM_BUFIO
 	---help---
+2
drivers/md/persistent-data/Makefile
···
 obj-$(CONFIG_DM_PERSISTENT_DATA) += dm-persistent-data.o
 dm-persistent-data-objs := \
+	dm-array.o \
+	dm-bitset.o \
 	dm-block-manager.o \
 	dm-space-map-common.o \
 	dm-space-map-disk.o \
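dm-array.c, added below, packs values after a fixed header in each block: calc_max_entries() divides the space left after the header by the value size, and element_at() indexes into the region that starts at (ab + 1). A userspace sketch of that layout (the header fields mirror struct array_block but the struct here is illustrative, not the on-disk definition):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Abbreviated header, mirroring the shape of struct array_block. */
struct ablock_hdr {
	uint32_t csum;
	uint32_t max_entries;
	uint32_t nr_entries;
	uint32_t value_size;
	uint64_t blocknr;
};

/* calc_max_entries(): how many values fit after the header. */
static uint32_t calc_max_entries(size_t value_size, size_t size_of_block)
{
	return (size_of_block - sizeof(struct ablock_hdr)) / value_size;
}

/* element_at(): values are packed immediately after the header, so
 * entry i starts at (header + 1) plus i * value_size bytes. */
static void *element_at(struct ablock_hdr *ab, unsigned value_size,
			unsigned index)
{
	unsigned char *entry = (unsigned char *)(ab + 1);

	return entry + (size_t)index * value_size;
}
```

This is why the array is more space efficient than a plain btree: one btree key per block of packed values rather than one key per value, as the comment at the top of dm-array.c explains.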
+808
drivers/md/persistent-data/dm-array.c
··· 1 + /* 2 + * Copyright (C) 2012 Red Hat, Inc. 3 + * 4 + * This file is released under the GPL. 5 + */ 6 + 7 + #include "dm-array.h" 8 + #include "dm-space-map.h" 9 + #include "dm-transaction-manager.h" 10 + 11 + #include <linux/export.h> 12 + #include <linux/device-mapper.h> 13 + 14 + #define DM_MSG_PREFIX "array" 15 + 16 + /*----------------------------------------------------------------*/ 17 + 18 + /* 19 + * The array is implemented as a fully populated btree, which points to 20 + * blocks that contain the packed values. This is more space efficient 21 + * than just using a btree since we don't store 1 key per value. 22 + */ 23 + struct array_block { 24 + __le32 csum; 25 + __le32 max_entries; 26 + __le32 nr_entries; 27 + __le32 value_size; 28 + __le64 blocknr; /* Block this node is supposed to live in. */ 29 + } __packed; 30 + 31 + /*----------------------------------------------------------------*/ 32 + 33 + /* 34 + * Validator methods. As usual we calculate a checksum, and also write the 35 + * block location into the header (paranoia about ssds remapping areas by 36 + * mistake). 
37 + */ 38 + #define CSUM_XOR 595846735 39 + 40 + static void array_block_prepare_for_write(struct dm_block_validator *v, 41 + struct dm_block *b, 42 + size_t size_of_block) 43 + { 44 + struct array_block *bh_le = dm_block_data(b); 45 + 46 + bh_le->blocknr = cpu_to_le64(dm_block_location(b)); 47 + bh_le->csum = cpu_to_le32(dm_bm_checksum(&bh_le->max_entries, 48 + size_of_block - sizeof(__le32), 49 + CSUM_XOR)); 50 + } 51 + 52 + static int array_block_check(struct dm_block_validator *v, 53 + struct dm_block *b, 54 + size_t size_of_block) 55 + { 56 + struct array_block *bh_le = dm_block_data(b); 57 + __le32 csum_disk; 58 + 59 + if (dm_block_location(b) != le64_to_cpu(bh_le->blocknr)) { 60 + DMERR_LIMIT("array_block_check failed: blocknr %llu != wanted %llu", 61 + (unsigned long long) le64_to_cpu(bh_le->blocknr), 62 + (unsigned long long) dm_block_location(b)); 63 + return -ENOTBLK; 64 + } 65 + 66 + csum_disk = cpu_to_le32(dm_bm_checksum(&bh_le->max_entries, 67 + size_of_block - sizeof(__le32), 68 + CSUM_XOR)); 69 + if (csum_disk != bh_le->csum) { 70 + DMERR_LIMIT("array_block_check failed: csum %u != wanted %u", 71 + (unsigned) le32_to_cpu(csum_disk), 72 + (unsigned) le32_to_cpu(bh_le->csum)); 73 + return -EILSEQ; 74 + } 75 + 76 + return 0; 77 + } 78 + 79 + static struct dm_block_validator array_validator = { 80 + .name = "array", 81 + .prepare_for_write = array_block_prepare_for_write, 82 + .check = array_block_check 83 + }; 84 + 85 + /*----------------------------------------------------------------*/ 86 + 87 + /* 88 + * Functions for manipulating the array blocks. 89 + */ 90 + 91 + /* 92 + * Returns a pointer to a value within an array block. 93 + * 94 + * index - The index into _this_ specific block. 
95 + */ 96 + static void *element_at(struct dm_array_info *info, struct array_block *ab, 97 + unsigned index) 98 + { 99 + unsigned char *entry = (unsigned char *) (ab + 1); 100 + 101 + entry += index * info->value_type.size; 102 + 103 + return entry; 104 + } 105 + 106 + /* 107 + * Utility function that calls one of the value_type methods on every value 108 + * in an array block. 109 + */ 110 + static void on_entries(struct dm_array_info *info, struct array_block *ab, 111 + void (*fn)(void *, const void *)) 112 + { 113 + unsigned i, nr_entries = le32_to_cpu(ab->nr_entries); 114 + 115 + for (i = 0; i < nr_entries; i++) 116 + fn(info->value_type.context, element_at(info, ab, i)); 117 + } 118 + 119 + /* 120 + * Increment every value in an array block. 121 + */ 122 + static void inc_ablock_entries(struct dm_array_info *info, struct array_block *ab) 123 + { 124 + struct dm_btree_value_type *vt = &info->value_type; 125 + 126 + if (vt->inc) 127 + on_entries(info, ab, vt->inc); 128 + } 129 + 130 + /* 131 + * Decrement every value in an array block. 132 + */ 133 + static void dec_ablock_entries(struct dm_array_info *info, struct array_block *ab) 134 + { 135 + struct dm_btree_value_type *vt = &info->value_type; 136 + 137 + if (vt->dec) 138 + on_entries(info, ab, vt->dec); 139 + } 140 + 141 + /* 142 + * Each array block can hold this many values. 143 + */ 144 + static uint32_t calc_max_entries(size_t value_size, size_t size_of_block) 145 + { 146 + return (size_of_block - sizeof(struct array_block)) / value_size; 147 + } 148 + 149 + /* 150 + * Allocate a new array block. The caller will need to unlock block. 
151 + */ 152 + static int alloc_ablock(struct dm_array_info *info, size_t size_of_block, 153 + uint32_t max_entries, 154 + struct dm_block **block, struct array_block **ab) 155 + { 156 + int r; 157 + 158 + r = dm_tm_new_block(info->btree_info.tm, &array_validator, block); 159 + if (r) 160 + return r; 161 + 162 + (*ab) = dm_block_data(*block); 163 + (*ab)->max_entries = cpu_to_le32(max_entries); 164 + (*ab)->nr_entries = cpu_to_le32(0); 165 + (*ab)->value_size = cpu_to_le32(info->value_type.size); 166 + 167 + return 0; 168 + } 169 + 170 + /* 171 + * Pad an array block out with a particular value. Every instance will 172 + * cause an increment of the value_type. new_nr must always be more than 173 + * the current number of entries. 174 + */ 175 + static void fill_ablock(struct dm_array_info *info, struct array_block *ab, 176 + const void *value, unsigned new_nr) 177 + { 178 + unsigned i; 179 + uint32_t nr_entries; 180 + struct dm_btree_value_type *vt = &info->value_type; 181 + 182 + BUG_ON(new_nr > le32_to_cpu(ab->max_entries)); 183 + BUG_ON(new_nr < le32_to_cpu(ab->nr_entries)); 184 + 185 + nr_entries = le32_to_cpu(ab->nr_entries); 186 + for (i = nr_entries; i < new_nr; i++) { 187 + if (vt->inc) 188 + vt->inc(vt->context, value); 189 + memcpy(element_at(info, ab, i), value, vt->size); 190 + } 191 + ab->nr_entries = cpu_to_le32(new_nr); 192 + } 193 + 194 + /* 195 + * Remove some entries from the back of an array block. Every value 196 + * removed will be decremented. new_nr must be <= the current number of 197 + * entries. 
198 + */ 199 + static void trim_ablock(struct dm_array_info *info, struct array_block *ab, 200 + unsigned new_nr) 201 + { 202 + unsigned i; 203 + uint32_t nr_entries; 204 + struct dm_btree_value_type *vt = &info->value_type; 205 + 206 + BUG_ON(new_nr > le32_to_cpu(ab->max_entries)); 207 + BUG_ON(new_nr > le32_to_cpu(ab->nr_entries)); 208 + 209 + nr_entries = le32_to_cpu(ab->nr_entries); 210 + for (i = nr_entries; i > new_nr; i--) 211 + if (vt->dec) 212 + vt->dec(vt->context, element_at(info, ab, i - 1)); 213 + ab->nr_entries = cpu_to_le32(new_nr); 214 + } 215 + 216 + /* 217 + * Read locks a block, and coerces it to an array block. The caller must 218 + * unlock 'block' when finished. 219 + */ 220 + static int get_ablock(struct dm_array_info *info, dm_block_t b, 221 + struct dm_block **block, struct array_block **ab) 222 + { 223 + int r; 224 + 225 + r = dm_tm_read_lock(info->btree_info.tm, b, &array_validator, block); 226 + if (r) 227 + return r; 228 + 229 + *ab = dm_block_data(*block); 230 + return 0; 231 + } 232 + 233 + /* 234 + * Unlocks an array block. 235 + */ 236 + static int unlock_ablock(struct dm_array_info *info, struct dm_block *block) 237 + { 238 + return dm_tm_unlock(info->btree_info.tm, block); 239 + } 240 + 241 + /*----------------------------------------------------------------*/ 242 + 243 + /* 244 + * Btree manipulation. 245 + */ 246 + 247 + /* 248 + * Looks up an array block in the btree, and then read locks it. 249 + * 250 + * index is the index of the array_block (ie. the array index 251 + * / max_entries).
252 + */ 253 + static int lookup_ablock(struct dm_array_info *info, dm_block_t root, 254 + unsigned index, struct dm_block **block, 255 + struct array_block **ab) 256 + { 257 + int r; 258 + uint64_t key = index; 259 + __le64 block_le; 260 + 261 + r = dm_btree_lookup(&info->btree_info, root, &key, &block_le); 262 + if (r) 263 + return r; 264 + 265 + return get_ablock(info, le64_to_cpu(block_le), block, ab); 266 + } 267 + 268 + /* 269 + * Insert an array block into the btree. The block is _not_ unlocked. 270 + */ 271 + static int insert_ablock(struct dm_array_info *info, uint64_t index, 272 + struct dm_block *block, dm_block_t *root) 273 + { 274 + __le64 block_le = cpu_to_le64(dm_block_location(block)); 275 + 276 + __dm_bless_for_disk(block_le); 277 + return dm_btree_insert(&info->btree_info, *root, &index, &block_le, root); 278 + } 279 + 280 + /* 281 + * Looks up an array block in the btree. Then shadows it, and updates the 282 + * btree to point to this new shadow. 'root' is an input/output parameter 283 + * for both the current root block, and the new one. 284 + */ 285 + static int shadow_ablock(struct dm_array_info *info, dm_block_t *root, 286 + unsigned index, struct dm_block **block, 287 + struct array_block **ab) 288 + { 289 + int r, inc; 290 + uint64_t key = index; 291 + dm_block_t b; 292 + __le64 block_le; 293 + 294 + /* 295 + * lookup 296 + */ 297 + r = dm_btree_lookup(&info->btree_info, *root, &key, &block_le); 298 + if (r) 299 + return r; 300 + b = le64_to_cpu(block_le); 301 + 302 + /* 303 + * shadow 304 + */ 305 + r = dm_tm_shadow_block(info->btree_info.tm, b, 306 + &array_validator, block, &inc); 307 + if (r) 308 + return r; 309 + 310 + *ab = dm_block_data(*block); 311 + if (inc) 312 + inc_ablock_entries(info, *ab); 313 + 314 + /* 315 + * Reinsert. 316 + * 317 + * The shadow op will often be a noop. Only insert if it really 318 + * copied data. 
319 + */ 320 + if (dm_block_location(*block) != b) 321 + r = insert_ablock(info, index, *block, root); 322 + 323 + return r; 324 + } 325 + 326 + /* 327 + * Allocate a new array block, and fill it with some values. 328 + */ 329 + static int insert_new_ablock(struct dm_array_info *info, size_t size_of_block, 330 + uint32_t max_entries, 331 + unsigned block_index, uint32_t nr, 332 + const void *value, dm_block_t *root) 333 + { 334 + int r; 335 + struct dm_block *block; 336 + struct array_block *ab; 337 + 338 + r = alloc_ablock(info, size_of_block, max_entries, &block, &ab); 339 + if (r) 340 + return r; 341 + 342 + fill_ablock(info, ab, value, nr); 343 + r = insert_ablock(info, block_index, block, root); 344 + unlock_ablock(info, block); 345 + 346 + return r; 347 + } 348 + 349 + static int insert_full_ablocks(struct dm_array_info *info, size_t size_of_block, 350 + unsigned begin_block, unsigned end_block, 351 + unsigned max_entries, const void *value, 352 + dm_block_t *root) 353 + { 354 + int r = 0; 355 + 356 + for (; !r && begin_block != end_block; begin_block++) 357 + r = insert_new_ablock(info, size_of_block, max_entries, begin_block, max_entries, value, root); 358 + 359 + return r; 360 + } 361 + 362 + /* 363 + * There are a bunch of functions involved with resizing an array. This 364 + * structure holds information that is commonly needed by them. Purely here 365 + * to reduce parameter count. 366 + */ 367 + struct resize { 368 + /* 369 + * Describes the array. 370 + */ 371 + struct dm_array_info *info; 372 + 373 + /* 374 + * The current root of the array. This gets updated. 375 + */ 376 + dm_block_t root; 377 + 378 + /* 379 + * Metadata block size. Used to calculate the nr entries in an 380 + * array block. 381 + */ 382 + size_t size_of_block; 383 + 384 + /* 385 + * Maximum nr entries in an array block. 386 + */ 387 + unsigned max_entries; 388 + 389 + /* 390 + * nr of completely full blocks in the array.
391 + * 392 + * 'old' refers to before the resize, 'new' after. 393 + */ 394 + unsigned old_nr_full_blocks, new_nr_full_blocks; 395 + 396 + /* 397 + * Number of entries in the final block. 0 iff only full blocks in 398 + * the array. 399 + */ 400 + unsigned old_nr_entries_in_last_block, new_nr_entries_in_last_block; 401 + 402 + /* 403 + * The default value used when growing the array. 404 + */ 405 + const void *value; 406 + }; 407 + 408 + /* 409 + * Removes a consecutive set of array blocks from the btree. The values 410 + * in the blocks are decremented as a side effect of the btree remove. 411 + * 412 + * begin_index - the index of the first array block to remove. 413 + * end_index - the one-past-the-end value. ie. this block is not removed. 414 + */ 415 + static int drop_blocks(struct resize *resize, unsigned begin_index, 416 + unsigned end_index) 417 + { 418 + int r; 419 + 420 + while (begin_index != end_index) { 421 + uint64_t key = begin_index++; 422 + r = dm_btree_remove(&resize->info->btree_info, resize->root, 423 + &key, &resize->root); 424 + if (r) 425 + return r; 426 + } 427 + 428 + return 0; 429 + } 430 + 431 + /* 432 + * Calculates how many blocks are needed for the array. 433 + */ 434 + static unsigned total_nr_blocks_needed(unsigned nr_full_blocks, 435 + unsigned nr_entries_in_last_block) 436 + { 437 + return nr_full_blocks + (nr_entries_in_last_block ? 1 : 0); 438 + } 439 + 440 + /* 441 + * Shrink an array. 442 + */ 443 + static int shrink(struct resize *resize) 444 + { 445 + int r; 446 + unsigned begin, end; 447 + struct dm_block *block; 448 + struct array_block *ab; 449 + 450 + /* 451 + * Lose some blocks from the back?
452 + */ 453 + if (resize->new_nr_full_blocks < resize->old_nr_full_blocks) { 454 + begin = total_nr_blocks_needed(resize->new_nr_full_blocks, 455 + resize->new_nr_entries_in_last_block); 456 + end = total_nr_blocks_needed(resize->old_nr_full_blocks, 457 + resize->old_nr_entries_in_last_block); 458 + 459 + r = drop_blocks(resize, begin, end); 460 + if (r) 461 + return r; 462 + } 463 + 464 + /* 465 + * Trim the new tail block 466 + */ 467 + if (resize->new_nr_entries_in_last_block) { 468 + r = shadow_ablock(resize->info, &resize->root, 469 + resize->new_nr_full_blocks, &block, &ab); 470 + if (r) 471 + return r; 472 + 473 + trim_ablock(resize->info, ab, resize->new_nr_entries_in_last_block); 474 + unlock_ablock(resize->info, block); 475 + } 476 + 477 + return 0; 478 + } 479 + 480 + /* 481 + * Grow an array. 482 + */ 483 + static int grow_extend_tail_block(struct resize *resize, uint32_t new_nr_entries) 484 + { 485 + int r; 486 + struct dm_block *block; 487 + struct array_block *ab; 488 + 489 + r = shadow_ablock(resize->info, &resize->root, 490 + resize->old_nr_full_blocks, &block, &ab); 491 + if (r) 492 + return r; 493 + 494 + fill_ablock(resize->info, ab, resize->value, new_nr_entries); 495 + unlock_ablock(resize->info, block); 496 + 497 + return r; 498 + } 499 + 500 + static int grow_add_tail_block(struct resize *resize) 501 + { 502 + return insert_new_ablock(resize->info, resize->size_of_block, 503 + resize->max_entries, 504 + resize->new_nr_full_blocks, 505 + resize->new_nr_entries_in_last_block, 506 + resize->value, &resize->root); 507 + } 508 + 509 + static int grow_needs_more_blocks(struct resize *resize) 510 + { 511 + int r; 512 + 513 + if (resize->old_nr_entries_in_last_block > 0) { 514 + r = grow_extend_tail_block(resize, resize->max_entries); 515 + if (r) 516 + return r; 517 + } 518 + 519 + r = insert_full_ablocks(resize->info, resize->size_of_block, 520 + resize->old_nr_full_blocks, 521 + resize->new_nr_full_blocks, 522 + resize->max_entries, 
resize->value, 523 + &resize->root); 524 + if (r) 525 + return r; 526 + 527 + if (resize->new_nr_entries_in_last_block) 528 + r = grow_add_tail_block(resize); 529 + 530 + return r; 531 + } 532 + 533 + static int grow(struct resize *resize) 534 + { 535 + if (resize->new_nr_full_blocks > resize->old_nr_full_blocks) 536 + return grow_needs_more_blocks(resize); 537 + 538 + else if (resize->old_nr_entries_in_last_block) 539 + return grow_extend_tail_block(resize, resize->new_nr_entries_in_last_block); 540 + 541 + else 542 + return grow_add_tail_block(resize); 543 + } 544 + 545 + /*----------------------------------------------------------------*/ 546 + 547 + /* 548 + * These are the value_type functions for the btree elements, which point 549 + * to array blocks. 550 + */ 551 + static void block_inc(void *context, const void *value) 552 + { 553 + __le64 block_le; 554 + struct dm_array_info *info = context; 555 + 556 + memcpy(&block_le, value, sizeof(block_le)); 557 + dm_tm_inc(info->btree_info.tm, le64_to_cpu(block_le)); 558 + } 559 + 560 + static void block_dec(void *context, const void *value) 561 + { 562 + int r; 563 + uint64_t b; 564 + __le64 block_le; 565 + uint32_t ref_count; 566 + struct dm_block *block; 567 + struct array_block *ab; 568 + struct dm_array_info *info = context; 569 + 570 + memcpy(&block_le, value, sizeof(block_le)); 571 + b = le64_to_cpu(block_le); 572 + 573 + r = dm_tm_ref(info->btree_info.tm, b, &ref_count); 574 + if (r) { 575 + DMERR_LIMIT("couldn't get reference count for block %llu", 576 + (unsigned long long) b); 577 + return; 578 + } 579 + 580 + if (ref_count == 1) { 581 + /* 582 + * We're about to drop the last reference to this ablock. 583 + * So we need to decrement the ref count of the contents. 
584 + */ 585 + r = get_ablock(info, b, &block, &ab); 586 + if (r) { 587 + DMERR_LIMIT("couldn't get array block %llu", 588 + (unsigned long long) b); 589 + return; 590 + } 591 + 592 + dec_ablock_entries(info, ab); 593 + unlock_ablock(info, block); 594 + } 595 + 596 + dm_tm_dec(info->btree_info.tm, b); 597 + } 598 + 599 + static int block_equal(void *context, const void *value1, const void *value2) 600 + { 601 + return !memcmp(value1, value2, sizeof(__le64)); 602 + } 603 + 604 + /*----------------------------------------------------------------*/ 605 + 606 + void dm_array_info_init(struct dm_array_info *info, 607 + struct dm_transaction_manager *tm, 608 + struct dm_btree_value_type *vt) 609 + { 610 + struct dm_btree_value_type *bvt = &info->btree_info.value_type; 611 + 612 + memcpy(&info->value_type, vt, sizeof(info->value_type)); 613 + info->btree_info.tm = tm; 614 + info->btree_info.levels = 1; 615 + 616 + bvt->context = info; 617 + bvt->size = sizeof(__le64); 618 + bvt->inc = block_inc; 619 + bvt->dec = block_dec; 620 + bvt->equal = block_equal; 621 + } 622 + EXPORT_SYMBOL_GPL(dm_array_info_init); 623 + 624 + int dm_array_empty(struct dm_array_info *info, dm_block_t *root) 625 + { 626 + return dm_btree_empty(&info->btree_info, root); 627 + } 628 + EXPORT_SYMBOL_GPL(dm_array_empty); 629 + 630 + static int array_resize(struct dm_array_info *info, dm_block_t root, 631 + uint32_t old_size, uint32_t new_size, 632 + const void *value, dm_block_t *new_root) 633 + { 634 + int r; 635 + struct resize resize; 636 + 637 + if (old_size == new_size) 638 + return 0; 639 + 640 + resize.info = info; 641 + resize.root = root; 642 + resize.size_of_block = dm_bm_block_size(dm_tm_get_bm(info->btree_info.tm)); 643 + resize.max_entries = calc_max_entries(info->value_type.size, 644 + resize.size_of_block); 645 + 646 + resize.old_nr_full_blocks = old_size / resize.max_entries; 647 + resize.old_nr_entries_in_last_block = old_size % resize.max_entries; 648 + resize.new_nr_full_blocks = 
new_size / resize.max_entries; 649 + resize.new_nr_entries_in_last_block = new_size % resize.max_entries; 650 + resize.value = value; 651 + 652 + r = ((new_size > old_size) ? grow : shrink)(&resize); 653 + if (r) 654 + return r; 655 + 656 + *new_root = resize.root; 657 + return 0; 658 + } 659 + 660 + int dm_array_resize(struct dm_array_info *info, dm_block_t root, 661 + uint32_t old_size, uint32_t new_size, 662 + const void *value, dm_block_t *new_root) 663 + __dm_written_to_disk(value) 664 + { 665 + int r = array_resize(info, root, old_size, new_size, value, new_root); 666 + __dm_unbless_for_disk(value); 667 + return r; 668 + } 669 + EXPORT_SYMBOL_GPL(dm_array_resize); 670 + 671 + int dm_array_del(struct dm_array_info *info, dm_block_t root) 672 + { 673 + return dm_btree_del(&info->btree_info, root); 674 + } 675 + EXPORT_SYMBOL_GPL(dm_array_del); 676 + 677 + int dm_array_get_value(struct dm_array_info *info, dm_block_t root, 678 + uint32_t index, void *value_le) 679 + { 680 + int r; 681 + struct dm_block *block; 682 + struct array_block *ab; 683 + size_t size_of_block; 684 + unsigned entry, max_entries; 685 + 686 + size_of_block = dm_bm_block_size(dm_tm_get_bm(info->btree_info.tm)); 687 + max_entries = calc_max_entries(info->value_type.size, size_of_block); 688 + 689 + r = lookup_ablock(info, root, index / max_entries, &block, &ab); 690 + if (r) 691 + return r; 692 + 693 + entry = index % max_entries; 694 + if (entry >= le32_to_cpu(ab->nr_entries)) 695 + r = -ENODATA; 696 + else 697 + memcpy(value_le, element_at(info, ab, entry), 698 + info->value_type.size); 699 + 700 + unlock_ablock(info, block); 701 + return r; 702 + } 703 + EXPORT_SYMBOL_GPL(dm_array_get_value); 704 + 705 + static int array_set_value(struct dm_array_info *info, dm_block_t root, 706 + uint32_t index, const void *value, dm_block_t *new_root) 707 + { 708 + int r; 709 + struct dm_block *block; 710 + struct array_block *ab; 711 + size_t size_of_block; 712 + unsigned max_entries; 713 + unsigned 
entry; 714 + void *old_value; 715 + struct dm_btree_value_type *vt = &info->value_type; 716 + 717 + size_of_block = dm_bm_block_size(dm_tm_get_bm(info->btree_info.tm)); 718 + max_entries = calc_max_entries(info->value_type.size, size_of_block); 719 + 720 + r = shadow_ablock(info, &root, index / max_entries, &block, &ab); 721 + if (r) 722 + return r; 723 + *new_root = root; 724 + 725 + entry = index % max_entries; 726 + if (entry >= le32_to_cpu(ab->nr_entries)) { 727 + r = -ENODATA; 728 + goto out; 729 + } 730 + 731 + old_value = element_at(info, ab, entry); 732 + if (vt->dec && 733 + (!vt->equal || !vt->equal(vt->context, old_value, value))) { 734 + vt->dec(vt->context, old_value); 735 + if (vt->inc) 736 + vt->inc(vt->context, value); 737 + } 738 + 739 + memcpy(old_value, value, info->value_type.size); 740 + 741 + out: 742 + unlock_ablock(info, block); 743 + return r; 744 + } 745 + 746 + int dm_array_set_value(struct dm_array_info *info, dm_block_t root, 747 + uint32_t index, const void *value, dm_block_t *new_root) 748 + __dm_written_to_disk(value) 749 + { 750 + int r; 751 + 752 + r = array_set_value(info, root, index, value, new_root); 753 + __dm_unbless_for_disk(value); 754 + return r; 755 + } 756 + EXPORT_SYMBOL_GPL(dm_array_set_value); 757 + 758 + struct walk_info { 759 + struct dm_array_info *info; 760 + int (*fn)(void *context, uint64_t key, void *leaf); 761 + void *context; 762 + }; 763 + 764 + static int walk_ablock(void *context, uint64_t *keys, void *leaf) 765 + { 766 + struct walk_info *wi = context; 767 + 768 + int r; 769 + unsigned i; 770 + __le64 block_le; 771 + unsigned nr_entries, max_entries; 772 + struct dm_block *block; 773 + struct array_block *ab; 774 + 775 + memcpy(&block_le, leaf, sizeof(block_le)); 776 + r = get_ablock(wi->info, le64_to_cpu(block_le), &block, &ab); 777 + if (r) 778 + return r; 779 + 780 + max_entries = le32_to_cpu(ab->max_entries); 781 + nr_entries = le32_to_cpu(ab->nr_entries); 782 + for (i = 0; i < nr_entries; i++) { 783 
+ r = wi->fn(wi->context, keys[0] * max_entries + i, 784 + element_at(wi->info, ab, i)); 785 + 786 + if (r) 787 + break; 788 + } 789 + 790 + unlock_ablock(wi->info, block); 791 + return r; 792 + } 793 + 794 + int dm_array_walk(struct dm_array_info *info, dm_block_t root, 795 + int (*fn)(void *, uint64_t key, void *leaf), 796 + void *context) 797 + { 798 + struct walk_info wi; 799 + 800 + wi.info = info; 801 + wi.fn = fn; 802 + wi.context = context; 803 + 804 + return dm_btree_walk(&info->btree_info, root, walk_ablock, &wi); 805 + } 806 + EXPORT_SYMBOL_GPL(dm_array_walk); 807 + 808 + /*----------------------------------------------------------------*/
+166
drivers/md/persistent-data/dm-array.h
··· 1 + /* 2 + * Copyright (C) 2012 Red Hat, Inc. 3 + * 4 + * This file is released under the GPL. 5 + */ 6 + #ifndef _LINUX_DM_ARRAY_H 7 + #define _LINUX_DM_ARRAY_H 8 + 9 + #include "dm-btree.h" 10 + 11 + /*----------------------------------------------------------------*/ 12 + 13 + /* 14 + * The dm-array is a persistent version of an array. It packs the data 15 + * more efficiently than a btree, which will result in less disk space use, 16 + * and a performance boost. The element get and set operations are still 17 + * O(ln(n)), but with a much smaller constant. 18 + * 19 + * The value type structure is reused from the btree type to support proper 20 + * reference counting of values. 21 + * 22 + * The arrays implicitly know their length, and bounds are checked for 23 + * lookups and updates. It doesn't store this in an accessible place 24 + * because it would waste a whole metadata block. Make sure you store the 25 + * size along with the array root in your encompassing data. 26 + * 27 + * Array entries are indexed via an unsigned integer starting from zero. 28 + * Arrays are not sparse; if you resize an array to have 'n' entries then 29 + * 'n - 1' will be the last valid index. 30 + * 31 + * Typical use: 32 + * 33 + * a) Initialise a dm_array_info structure. This describes the array 34 + * values and ties it into a specific transaction manager. It holds no 35 + * instance data; the same info can be used for many similar arrays if 36 + * you wish. 37 + * 38 + * b) Get yourself a root. The root is the index of a block of data on the 39 + * disk that holds a particular instance of an array. You may have a 40 + * pre-existing root in your metadata that you wish to use, or you may 41 + * want to create a brand new, empty array with dm_array_empty(). 42 + * 43 + * Like the other data structures in this library, dm_array objects are 44 + * immutable between transactions. Update functions will return you the 45 + * root for a _new_ array.
If you've incremented the old root, via 46 + * dm_tm_inc(), before calling the update function, you may continue to use 47 + * it in parallel with the new root. 48 + * 49 + * c) Resize an array with dm_array_resize(). 50 + * 51 + * d) Get a value from the array with dm_array_get_value(). 52 + * 53 + * e) Set a value in the array with dm_array_set_value(). 54 + * 55 + * f) Walk an array of values in index order with dm_array_walk(). More 56 + * efficient than making many calls to dm_array_get_value(). 57 + * 58 + * g) Destroy the array with dm_array_del(). This tells the transaction 59 + * manager that you're no longer using this data structure so it can 60 + * recycle its blocks. (dm_array_dec() would be a better name for it, 61 + * but del is in keeping with dm_btree_del()). 62 + */ 63 + 64 + /* 65 + * Describes an array. Don't initialise this structure yourself, use the 66 + * init function below. 67 + */ 68 + struct dm_array_info { 69 + struct dm_transaction_manager *tm; 70 + struct dm_btree_value_type value_type; 71 + struct dm_btree_info btree_info; 72 + }; 73 + 74 + /* 75 + * Sets up a dm_array_info structure. You don't need to do anything with 76 + * this structure when you finish using it. 77 + * 78 + * info - the structure being filled in. 79 + * tm - the transaction manager that should supervise this structure. 80 + * vt - describes the leaf values. 81 + */ 82 + void dm_array_info_init(struct dm_array_info *info, 83 + struct dm_transaction_manager *tm, 84 + struct dm_btree_value_type *vt); 85 + 86 + /* 87 + * Create an empty, zero length array. 88 + * 89 + * info - describes the array 90 + * root - on success this will be filled out with the root block 91 + */ 92 + int dm_array_empty(struct dm_array_info *info, dm_block_t *root); 93 + 94 + /* 95 + * Resizes the array.
96 + * 97 + * info - describes the array 98 + * root - the root block of the array on disk 99 + * old_size - the caller is responsible for remembering the size of 100 + * the array 101 + * new_size - can be bigger or smaller than old_size 102 + * value - if we're growing the array, the new entries will have this value 103 + * new_root - on success, points to the new root block 104 + * 105 + * If growing, the inc function for 'value' will be called the appropriate 106 + * number of times. So if the caller is holding a reference, they may want 107 + * to drop it. 108 + */ 109 + int dm_array_resize(struct dm_array_info *info, dm_block_t root, 110 + uint32_t old_size, uint32_t new_size, 111 + const void *value, dm_block_t *new_root) 112 + __dm_written_to_disk(value); 113 + 114 + /* 115 + * Frees a whole array. The value_type's decrement operation will be called 116 + * for all values in the array. 117 + */ 118 + int dm_array_del(struct dm_array_info *info, dm_block_t root); 119 + 120 + /* 121 + * Lookup a value in the array. 122 + * 123 + * info - describes the array 124 + * root - root block of the array 125 + * index - array index 126 + * value - the value to be read. Will be in on-disk format of course. 127 + * 128 + * -ENODATA will be returned if the index is out of bounds. 129 + */ 130 + int dm_array_get_value(struct dm_array_info *info, dm_block_t root, 131 + uint32_t index, void *value); 132 + 133 + /* 134 + * Set an entry in the array. 135 + * 136 + * info - describes the array 137 + * root - root block of the array 138 + * index - array index 139 + * value - value to be written to disk. Make sure you confirm the value is 140 + * in on-disk format with __dm_bless_for_disk() before calling. 141 + * new_root - the new root block 142 + * 143 + * The old value being overwritten will be decremented, the new value 144 + * incremented. 145 + * 146 + * -ENODATA will be returned if the index is out of bounds.
147 + */ 148 + int dm_array_set_value(struct dm_array_info *info, dm_block_t root, 149 + uint32_t index, const void *value, dm_block_t *new_root) 150 + __dm_written_to_disk(value); 151 + 152 + /* 153 + * Walk through all the entries in an array. 154 + * 155 + * info - describes the array 156 + * root - root block of the array 157 + * fn - called back for every element 158 + * context - passed to the callback 159 + */ 160 + int dm_array_walk(struct dm_array_info *info, dm_block_t root, 161 + int (*fn)(void *context, uint64_t key, void *leaf), 162 + void *context); 163 + 164 + /*----------------------------------------------------------------*/ 165 + 166 + #endif /* _LINUX_DM_ARRAY_H */
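[Editor's note] The "Typical use" steps in the comment above can be sketched as caller code. This is an illustration, not buildable on its own: error unwinding is trimmed, 'tm' is an existing transaction manager, and 'my_vt' is a hypothetical dm_btree_value_type for __le64 values; only the dm_array_* calls and their signatures come from this header.

```c
#include "dm-array.h"

static int array_example(struct dm_transaction_manager *tm,
			 struct dm_btree_value_type *my_vt)
{
	struct dm_array_info info;
	dm_block_t root, new_root;
	__le64 zero = cpu_to_le64(0), v;
	int r;

	dm_array_info_init(&info, tm, my_vt);		/* step a */

	r = dm_array_empty(&info, &root);		/* step b */
	if (r)
		return r;

	/* step c: grow to 100 entries, all initialised to 'zero' */
	__dm_bless_for_disk(&zero);
	r = dm_array_resize(&info, root, 0, 100, &zero, &new_root);
	if (r)
		return r;
	root = new_root;

	/* step e: write one entry; the update returns a new root */
	v = cpu_to_le64(42);
	__dm_bless_for_disk(&v);
	r = dm_array_set_value(&info, root, 7, &v, &new_root);
	if (r)
		return r;
	root = new_root;

	/* step d: read it back (value arrives in on-disk format) */
	r = dm_array_get_value(&info, root, 7, &v);
	if (r)
		return r;

	return dm_array_del(&info, root);		/* step g */
}
```

Note how every update threads the root through: the old root stays valid for the previous transaction's view, the returned root names the new one.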
+163
drivers/md/persistent-data/dm-bitset.c
··· 1 + /* 2 + * Copyright (C) 2012 Red Hat, Inc. 3 + * 4 + * This file is released under the GPL. 5 + */ 6 + 7 + #include "dm-bitset.h" 8 + #include "dm-transaction-manager.h" 9 + 10 + #include <linux/export.h> 11 + #include <linux/device-mapper.h> 12 + 13 + #define DM_MSG_PREFIX "bitset" 14 + #define BITS_PER_ARRAY_ENTRY 64 15 + 16 + /*----------------------------------------------------------------*/ 17 + 18 + static struct dm_btree_value_type bitset_bvt = { 19 + .context = NULL, 20 + .size = sizeof(__le64), 21 + .inc = NULL, 22 + .dec = NULL, 23 + .equal = NULL, 24 + }; 25 + 26 + /*----------------------------------------------------------------*/ 27 + 28 + void dm_disk_bitset_init(struct dm_transaction_manager *tm, 29 + struct dm_disk_bitset *info) 30 + { 31 + dm_array_info_init(&info->array_info, tm, &bitset_bvt); 32 + info->current_index_set = false; 33 + } 34 + EXPORT_SYMBOL_GPL(dm_disk_bitset_init); 35 + 36 + int dm_bitset_empty(struct dm_disk_bitset *info, dm_block_t *root) 37 + { 38 + return dm_array_empty(&info->array_info, root); 39 + } 40 + EXPORT_SYMBOL_GPL(dm_bitset_empty); 41 + 42 + int dm_bitset_resize(struct dm_disk_bitset *info, dm_block_t root, 43 + uint32_t old_nr_entries, uint32_t new_nr_entries, 44 + bool default_value, dm_block_t *new_root) 45 + { 46 + uint32_t old_blocks = dm_div_up(old_nr_entries, BITS_PER_ARRAY_ENTRY); 47 + uint32_t new_blocks = dm_div_up(new_nr_entries, BITS_PER_ARRAY_ENTRY); 48 + __le64 value = default_value ? 
cpu_to_le64(~0) : cpu_to_le64(0); 49 + 50 + __dm_bless_for_disk(&value); 51 + return dm_array_resize(&info->array_info, root, old_blocks, new_blocks, 52 + &value, new_root); 53 + } 54 + EXPORT_SYMBOL_GPL(dm_bitset_resize); 55 + 56 + int dm_bitset_del(struct dm_disk_bitset *info, dm_block_t root) 57 + { 58 + return dm_array_del(&info->array_info, root); 59 + } 60 + EXPORT_SYMBOL_GPL(dm_bitset_del); 61 + 62 + int dm_bitset_flush(struct dm_disk_bitset *info, dm_block_t root, 63 + dm_block_t *new_root) 64 + { 65 + int r; 66 + __le64 value; 67 + 68 + if (!info->current_index_set) 69 + return 0; 70 + 71 + value = cpu_to_le64(info->current_bits); 72 + 73 + __dm_bless_for_disk(&value); 74 + r = dm_array_set_value(&info->array_info, root, info->current_index, 75 + &value, new_root); 76 + if (r) 77 + return r; 78 + 79 + info->current_index_set = false; 80 + return 0; 81 + } 82 + EXPORT_SYMBOL_GPL(dm_bitset_flush); 83 + 84 + static int read_bits(struct dm_disk_bitset *info, dm_block_t root, 85 + uint32_t array_index) 86 + { 87 + int r; 88 + __le64 value; 89 + 90 + r = dm_array_get_value(&info->array_info, root, array_index, &value); 91 + if (r) 92 + return r; 93 + 94 + info->current_bits = le64_to_cpu(value); 95 + info->current_index_set = true; 96 + info->current_index = array_index; 97 + return 0; 98 + } 99 + 100 + static int get_array_entry(struct dm_disk_bitset *info, dm_block_t root, 101 + uint32_t index, dm_block_t *new_root) 102 + { 103 + int r; 104 + unsigned array_index = index / BITS_PER_ARRAY_ENTRY; 105 + 106 + if (info->current_index_set) { 107 + if (info->current_index == array_index) 108 + return 0; 109 + 110 + r = dm_bitset_flush(info, root, new_root); 111 + if (r) 112 + return r; 113 + } 114 + 115 + return read_bits(info, root, array_index); 116 + } 117 + 118 + int dm_bitset_set_bit(struct dm_disk_bitset *info, dm_block_t root, 119 + uint32_t index, dm_block_t *new_root) 120 + { 121 + int r; 122 + unsigned b = index % BITS_PER_ARRAY_ENTRY; 123 + 124 + r = 
get_array_entry(info, root, index, new_root); 125 + if (r) 126 + return r; 127 + 128 + set_bit(b, (unsigned long *) &info->current_bits); 129 + return 0; 130 + } 131 + EXPORT_SYMBOL_GPL(dm_bitset_set_bit); 132 + 133 + int dm_bitset_clear_bit(struct dm_disk_bitset *info, dm_block_t root, 134 + uint32_t index, dm_block_t *new_root) 135 + { 136 + int r; 137 + unsigned b = index % BITS_PER_ARRAY_ENTRY; 138 + 139 + r = get_array_entry(info, root, index, new_root); 140 + if (r) 141 + return r; 142 + 143 + clear_bit(b, (unsigned long *) &info->current_bits); 144 + return 0; 145 + } 146 + EXPORT_SYMBOL_GPL(dm_bitset_clear_bit); 147 + 148 + int dm_bitset_test_bit(struct dm_disk_bitset *info, dm_block_t root, 149 + uint32_t index, dm_block_t *new_root, bool *result) 150 + { 151 + int r; 152 + unsigned b = index % BITS_PER_ARRAY_ENTRY; 153 + 154 + r = get_array_entry(info, root, index, new_root); 155 + if (r) 156 + return r; 157 + 158 + *result = test_bit(b, (unsigned long *) &info->current_bits); 159 + return 0; 160 + } 161 + EXPORT_SYMBOL_GPL(dm_bitset_test_bit); 162 + 163 + /*----------------------------------------------------------------*/
+165
drivers/md/persistent-data/dm-bitset.h
··· 1 + /* 2 + * Copyright (C) 2012 Red Hat, Inc. 3 + * 4 + * This file is released under the GPL. 5 + */ 6 + #ifndef _LINUX_DM_BITSET_H 7 + #define _LINUX_DM_BITSET_H 8 + 9 + #include "dm-array.h" 10 + 11 + /*----------------------------------------------------------------*/ 12 + 13 + /* 14 + * This bitset type is a thin wrapper round a dm_array of 64-bit words. It 15 + * uses a tiny, one word cache to reduce the number of array lookups and so 16 + * increase performance. 17 + * 18 + * Like the dm-array that it's based on, the caller needs to keep track of 19 + * the size of the bitset separately. The underlying dm-array implicitly 20 + * knows how many words it's storing and will return -ENODATA if you try 21 + * and access an out of bounds word. However, an out of bounds bit in the 22 + * final word will _not_ be detected; you have been warned. 23 + * 24 + * Bits are indexed from zero. 25 + * 26 + * Typical use: 27 + * 28 + * a) Initialise a dm_disk_bitset structure with dm_disk_bitset_init(). 29 + * This describes the bitset and includes the cache. It's not called 30 + * dm_bitset_info, in line with the other data structures, because it does 31 + * include instance data. 32 + * 33 + * b) Get yourself a root. The root is the index of a block of data on the 34 + * disk that holds a particular instance of a bitset. You may have a 35 + * pre-existing root in your metadata that you wish to use, or you may 36 + * want to create a brand new, empty bitset with dm_bitset_empty(). 37 + * 38 + * Like the other data structures in this library, dm_bitset objects are 39 + * immutable between transactions. Update functions will return you the 40 + * root for a _new_ bitset. If you've incremented the old root, via 41 + * dm_tm_inc(), before calling the update function, you may continue to use 42 + * it in parallel with the new root. 43 + * 44 + * Even read operations may trigger the cache to be flushed and, as such, 45 + * return a root for a new, updated bitset.
46 + * 47 + * c) Resize a bitset with dm_bitset_resize(). 48 + * 49 + * d) Set a bit with dm_bitset_set_bit(). 50 + * 51 + * e) Clear a bit with dm_bitset_clear_bit(). 52 + * 53 + * f) Test a bit with dm_bitset_test_bit(). 54 + * 55 + * g) Flush all updates from the cache with dm_bitset_flush(). 56 + * 57 + * h) Destroy the bitset with dm_bitset_del(). This tells the transaction 58 + * manager that you're no longer using this data structure so it can 59 + * recycle its blocks. (dm_bitset_dec() would be a better name for it, 60 + * but del is in keeping with dm_btree_del()). 61 + */ 62 + 63 + /* 64 + * Opaque object. Unlike dm_array_info, you should have one of these per 65 + * bitset. Initialise with dm_disk_bitset_init(). 66 + */ 67 + struct dm_disk_bitset { 68 + struct dm_array_info array_info; 69 + 70 + uint32_t current_index; 71 + uint64_t current_bits; 72 + 73 + bool current_index_set:1; 74 + }; 75 + 76 + /* 77 + * Sets up a dm_disk_bitset structure. You don't need to do anything with 78 + * this structure when you finish using it. 79 + * 80 + * tm - the transaction manager that should supervise this structure 81 + * info - the structure being initialised 82 + */ 83 + void dm_disk_bitset_init(struct dm_transaction_manager *tm, 84 + struct dm_disk_bitset *info); 85 + 86 + /* 87 + * Create an empty, zero length bitset. 88 + * 89 + * info - describes the bitset 90 + * new_root - on success, points to the new root block 91 + */ 92 + int dm_bitset_empty(struct dm_disk_bitset *info, dm_block_t *new_root); 93 + 94 + /* 95 + * Resize the bitset.
96 + * 97 + * info - describes the bitset 98 + * old_root - the root block of the array on disk 99 + * old_nr_entries - the number of bits in the old bitset 100 + * new_nr_entries - the number of bits you want in the new bitset 101 + * default_value - the value for any new bits 102 + * new_root - on success, points to the new root block 103 + */ 104 + int dm_bitset_resize(struct dm_disk_bitset *info, dm_block_t old_root, 105 + uint32_t old_nr_entries, uint32_t new_nr_entries, 106 + bool default_value, dm_block_t *new_root); 107 + 108 + /* 109 + * Frees the bitset. 110 + */ 111 + int dm_bitset_del(struct dm_disk_bitset *info, dm_block_t root); 112 + 113 + /* 114 + * Set a bit. 115 + * 116 + * info - describes the bitset 117 + * root - the root block of the bitset 118 + * index - the bit index 119 + * new_root - on success, points to the new root block 120 + * 121 + * -ENODATA will be returned if the index is out of bounds. 122 + */ 123 + int dm_bitset_set_bit(struct dm_disk_bitset *info, dm_block_t root, 124 + uint32_t index, dm_block_t *new_root); 125 + 126 + /* 127 + * Clears a bit. 128 + * 129 + * info - describes the bitset 130 + * root - the root block of the bitset 131 + * index - the bit index 132 + * new_root - on success, points to the new root block 133 + * 134 + * -ENODATA will be returned if the index is out of bounds. 135 + */ 136 + int dm_bitset_clear_bit(struct dm_disk_bitset *info, dm_block_t root, 137 + uint32_t index, dm_block_t *new_root); 138 + 139 + /* 140 + * Tests a bit. 141 + * 142 + * info - describes the bitset 143 + * root - the root block of the bitset 144 + * index - the bit index 145 + * new_root - on success, points to the new root block (cached values may have been written) 146 + * result - the bit value you're after 147 + * 148 + * -ENODATA will be returned if the index is out of bounds. 
149 + */ 150 + int dm_bitset_test_bit(struct dm_disk_bitset *info, dm_block_t root, 151 + uint32_t index, dm_block_t *new_root, bool *result); 152 + 153 + /* 154 + * Flush any cached changes to disk. 155 + * 156 + * info - describes the bitset 157 + * root - the root block of the bitset 158 + * new_root - on success, points to the new root block 159 + */ 160 + int dm_bitset_flush(struct dm_disk_bitset *info, dm_block_t root, 161 + dm_block_t *new_root); 162 + 163 + /*----------------------------------------------------------------*/ 164 + 165 + #endif /* _LINUX_DM_BITSET_H */
+1
drivers/md/persistent-data/dm-block-manager.c
··· 613 613 614 614 return dm_bufio_write_dirty_buffers(bm->bufio); 615 615 } 616 + EXPORT_SYMBOL_GPL(dm_bm_flush_and_unlock); 616 617 617 618 void dm_bm_set_read_only(struct dm_block_manager *bm) 618 619 {
+1
drivers/md/persistent-data/dm-btree-internal.h
··· 64 64 void init_ro_spine(struct ro_spine *s, struct dm_btree_info *info); 65 65 int exit_ro_spine(struct ro_spine *s); 66 66 int ro_step(struct ro_spine *s, dm_block_t new_child); 67 + void ro_pop(struct ro_spine *s); 67 68 struct btree_node *ro_node(struct ro_spine *s); 68 69 69 70 struct shadow_spine {
+7
drivers/md/persistent-data/dm-btree-spine.c
··· 164 164 return r; 165 165 } 166 166 167 + void ro_pop(struct ro_spine *s) 168 + { 169 + BUG_ON(!s->count); 170 + --s->count; 171 + unlock_block(s->info, s->nodes[s->count]); 172 + } 173 + 167 174 struct btree_node *ro_node(struct ro_spine *s) 168 175 { 169 176 struct dm_block *block;
+52
drivers/md/persistent-data/dm-btree.c
··· 807 807 return r ? r : count; 808 808 } 809 809 EXPORT_SYMBOL_GPL(dm_btree_find_highest_key); 810 + 811 + /* 812 + * FIXME: We shouldn't use a recursive algorithm when we have limited stack 813 + * space. Also this only works for single level trees. 814 + */ 815 + static int walk_node(struct ro_spine *s, dm_block_t block, 816 + int (*fn)(void *context, uint64_t *keys, void *leaf), 817 + void *context) 818 + { 819 + int r; 820 + unsigned i, nr; 821 + struct btree_node *n; 822 + uint64_t keys; 823 + 824 + r = ro_step(s, block); 825 + n = ro_node(s); 826 + 827 + nr = le32_to_cpu(n->header.nr_entries); 828 + for (i = 0; i < nr; i++) { 829 + if (le32_to_cpu(n->header.flags) & INTERNAL_NODE) { 830 + r = walk_node(s, value64(n, i), fn, context); 831 + if (r) 832 + goto out; 833 + } else { 834 + keys = le64_to_cpu(*key_ptr(n, i)); 835 + r = fn(context, &keys, value_ptr(n, i)); 836 + if (r) 837 + goto out; 838 + } 839 + } 840 + 841 + out: 842 + ro_pop(s); 843 + return r; 844 + } 845 + 846 + int dm_btree_walk(struct dm_btree_info *info, dm_block_t root, 847 + int (*fn)(void *context, uint64_t *keys, void *leaf), 848 + void *context) 849 + { 850 + int r; 851 + struct ro_spine spine; 852 + 853 + BUG_ON(info->levels > 1); 854 + 855 + init_ro_spine(&spine, info); 856 + r = walk_node(&spine, root, fn, context); 857 + exit_ro_spine(&spine); 858 + 859 + return r; 860 + } 861 + EXPORT_SYMBOL_GPL(dm_btree_walk);
+12 -3
drivers/md/persistent-data/dm-btree.h
··· 58 58 * somewhere.) This method is _not_ called for insertion of a new
59 59 * value: It is assumed the ref count is already 1.
60 60 */
61 - void (*inc)(void *context, void *value);
61 + void (*inc)(void *context, const void *value);
62 62
63 63 /*
64 64 * This value is being deleted. The btree takes care of freeing
65 65 * the memory pointed to by @value. Often the del function just
66 66 * needs to decrement a reference count somewhere.
67 67 */
68 - void (*dec)(void *context, void *value);
68 + void (*dec)(void *context, const void *value);
69 69
70 70 /*
71 71 * A test for equality between two values. When a value is
72 72 * overwritten with a new one, the old one has the dec method
73 73 * called _unless_ the new and old value are deemed equal.
74 74 */
75 - int (*equal)(void *context, void *value1, void *value2);
75 + int (*equal)(void *context, const void *value1, const void *value2);
76 76 };
77 77
78 78 /*
··· 141 141 */
142 142 int dm_btree_find_highest_key(struct dm_btree_info *info, dm_block_t root,
143 143 uint64_t *result_keys);
144 +
145 + /*
146 + * Iterate through a btree, calling fn() on each entry.
147 + * It only works for single-level trees and is internally recursive, so
148 + * monitor stack usage carefully.
149 + */
150 + int dm_btree_walk(struct dm_btree_info *info, dm_block_t root,
151 + int (*fn)(void *context, uint64_t *keys, void *leaf),
152 + void *context);
144 153
145 154 #endif /* _LINUX_DM_BTREE_H */
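The callback contract of dm_btree_walk() — fn() is invoked once per leaf entry, and a non-zero return aborts the whole walk — can be modelled in a few lines of userspace C. The node layout here is a toy (fixed fan-out, in-memory child pointers instead of block numbers, no spine locking); only the recursion and early-exit behaviour correspond to walk_node() in dm-btree.c.

```c
#include <stdint.h>

struct leaf_entry {
	uint64_t key;
	uint64_t value;
};

/* Toy node: internal nodes use 'children', leaves use 'entries'. */
struct node {
	int is_internal;
	unsigned nr;
	struct node *children[4];
	struct leaf_entry entries[4];
};

/* Recursive walk in the style of walk_node(): visit every leaf entry,
 * propagating the first non-zero callback return up the stack. */
static int walk(struct node *n,
		int (*fn)(void *context, uint64_t *keys, void *leaf),
		void *context)
{
	unsigned i;
	int r;

	for (i = 0; i < n->nr; i++) {
		if (n->is_internal)
			r = walk(n->children[i], fn, context);
		else
			r = fn(context, &n->entries[i].key,
			       &n->entries[i].value);
		if (r)
			return r;	/* abort the walk early */
	}

	return 0;
}

/* Example callback: accumulate all leaf values into *context. */
static int sum_fn(void *context, uint64_t *keys, void *leaf)
{
	(void)keys;
	*(uint64_t *)context += *(uint64_t *)leaf;
	return 0;
}
```

As the header comment warns, the recursion depth tracks the tree height, which is why the kernel version is flagged with a FIXME about stack usage.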
+32 -17
include/linux/device-mapper.h
··· 68 68 typedef int (*dm_preresume_fn) (struct dm_target *ti);
69 69 typedef void (*dm_resume_fn) (struct dm_target *ti);
70 70
71 - typedef int (*dm_status_fn) (struct dm_target *ti, status_type_t status_type,
72 - unsigned status_flags, char *result, unsigned maxlen);
71 + typedef void (*dm_status_fn) (struct dm_target *ti, status_type_t status_type,
72 + unsigned status_flags, char *result, unsigned maxlen);
73 73
74 74 typedef int (*dm_message_fn) (struct dm_target *ti, unsigned argc, char **argv);
75 75
··· 175 175 #define DM_TARGET_IMMUTABLE 0x00000004
176 176 #define dm_target_is_immutable(type) ((type)->features & DM_TARGET_IMMUTABLE)
177 177
178 + /*
179 + * Some targets need to be sent the same WRITE bio several times so
180 + * that they can send copies of it to different devices. This function
181 + * examines any supplied bio and returns the number of copies of it the
182 + * target requires.
183 + */
184 + typedef unsigned (*dm_num_write_bios_fn) (struct dm_target *ti, struct bio *bio);
185 +
178 186 struct dm_target {
179 187 struct dm_table *table;
180 188 struct target_type *type;
··· 195 187 uint32_t max_io_len;
196 188
197 189 /*
198 - * A number of zero-length barrier requests that will be submitted
190 + * A number of zero-length barrier bios that will be submitted
199 191 * to the target for the purpose of flushing cache.
200 192 *
201 - * The request number can be accessed with dm_bio_get_target_request_nr.
202 - * It is a responsibility of the target driver to remap these requests
193 + * The bio number can be accessed with dm_bio_get_target_bio_nr.
194 + * It is the responsibility of the target driver to remap these bios
203 195 * to the real underlying devices.
204 196 */
205 - unsigned num_flush_requests;
197 + unsigned num_flush_bios;
206 198
207 199 /*
208 - * The number of discard requests that will be submitted to the target.
209 - * The request number can be accessed with dm_bio_get_target_request_nr.
200 + * The number of discard bios that will be submitted to the target. 201 + * The bio number can be accessed with dm_bio_get_target_bio_nr. 210 202 */ 211 - unsigned num_discard_requests; 203 + unsigned num_discard_bios; 212 204 213 205 /* 214 - * The number of WRITE SAME requests that will be submitted to the target. 215 - * The request number can be accessed with dm_bio_get_target_request_nr. 206 + * The number of WRITE SAME bios that will be submitted to the target. 207 + * The bio number can be accessed with dm_bio_get_target_bio_nr. 216 208 */ 217 - unsigned num_write_same_requests; 209 + unsigned num_write_same_bios; 218 210 219 211 /* 220 212 * The minimum number of extra bytes allocated in each bio for the 221 213 * target to use. dm_per_bio_data returns the data location. 222 214 */ 223 215 unsigned per_bio_data_size; 216 + 217 + /* 218 + * If defined, this function is called to find out how many 219 + * duplicate bios should be sent to the target when writing 220 + * data. 221 + */ 222 + dm_num_write_bios_fn num_write_bios; 224 223 225 224 /* target specific data */ 226 225 void *private; ··· 248 233 bool discards_supported:1; 249 234 250 235 /* 251 - * Set if the target required discard request to be split 236 + * Set if the target required discard bios to be split 252 237 * on max_io_len boundary. 253 238 */ 254 - bool split_discard_requests:1; 239 + bool split_discard_bios:1; 255 240 256 241 /* 257 242 * Set if this target does not return zeroes on discarded blocks. 
··· 276 261 struct dm_io *io; 277 262 struct dm_target *ti; 278 263 union map_info info; 279 - unsigned target_request_nr; 264 + unsigned target_bio_nr; 280 265 struct bio clone; 281 266 }; 282 267 ··· 290 275 return (struct bio *)((char *)data + data_size + offsetof(struct dm_target_io, clone)); 291 276 } 292 277 293 - static inline unsigned dm_bio_get_target_request_nr(const struct bio *bio) 278 + static inline unsigned dm_bio_get_target_bio_nr(const struct bio *bio) 294 279 { 295 - return container_of(bio, struct dm_target_io, clone)->target_request_nr; 280 + return container_of(bio, struct dm_target_io, clone)->target_bio_nr; 296 281 } 297 282 298 283 int dm_register_target(struct target_type *t);
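dm_bio_get_target_bio_nr() and dm_per_bio_data() both exploit the fact that the struct bio handed out by the core is embedded at a known offset inside a containing structure, so container_of()/offsetof() arithmetic can step from the bio back to the surrounding dm_target_io. A self-contained userspace sketch of that trick, with toy struct definitions rather than the real kernel ones:

```c
#include <stddef.h>

/* The kernel's container_of(): step back from a pointer to a member to
 * a pointer to the structure that contains it. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Toy stand-ins: 'struct target_io' embeds the bio it hands out, just
 * as struct dm_target_io embeds 'struct bio clone'. */
struct bio {
	int bi_sector;
};

struct target_io {
	unsigned target_bio_nr;		/* which duplicate bio this is */
	struct bio clone;		/* embedded; lower layers only see this */
};

/* Analogue of dm_bio_get_target_bio_nr(): recover the container from
 * the embedded bio and read a field out of it. */
static unsigned get_target_bio_nr(struct bio *bio)
{
	return container_of(bio, struct target_io, clone)->target_bio_nr;
}
```

This is also why per-bio data works without any lookup table: the caller is handed a pointer into the middle of an allocation whose layout both sides agree on, and constant-offset arithmetic recovers the rest.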
+24 -1
include/linux/dm-kcopyd.h
··· 21 21 22 22 #define DM_KCOPYD_IGNORE_ERROR 1 23 23 24 + struct dm_kcopyd_throttle { 25 + unsigned throttle; 26 + unsigned num_io_jobs; 27 + unsigned io_period; 28 + unsigned total_period; 29 + unsigned last_jiffies; 30 + }; 31 + 32 + /* 33 + * kcopyd clients that want to support throttling must pass an initialised 34 + * dm_kcopyd_throttle struct into dm_kcopyd_client_create(). 35 + * Two or more clients may share the same instance of this struct between 36 + * them if they wish to be throttled as a group. 37 + * 38 + * This macro also creates a corresponding module parameter to configure 39 + * the amount of throttling. 40 + */ 41 + #define DECLARE_DM_KCOPYD_THROTTLE_WITH_MODULE_PARM(name, description) \ 42 + static struct dm_kcopyd_throttle dm_kcopyd_throttle = { 100, 0, 0, 0, 0 }; \ 43 + module_param_named(name, dm_kcopyd_throttle.throttle, uint, 0644); \ 44 + MODULE_PARM_DESC(name, description) 45 + 24 46 /* 25 47 * To use kcopyd you must first create a dm_kcopyd_client object. 48 + * throttle can be NULL if you don't want any throttling. 26 49 */ 27 50 struct dm_kcopyd_client; 28 - struct dm_kcopyd_client *dm_kcopyd_client_create(void); 51 + struct dm_kcopyd_client *dm_kcopyd_client_create(struct dm_kcopyd_throttle *throttle); 29 52 void dm_kcopyd_client_destroy(struct dm_kcopyd_client *kc); 30 53 31 54 /*
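The struct fields above suggest how the throttling decision works: io_period counts time spent doing I/O inside an observation window of total_period, and throttle caps the I/O share as a percentage, with 100 meaning unthrottled. A userspace sketch of that decision follows; the helper name and the exact comparison are assumptions for illustration, not the kernel's implementation.

```c
/* Userspace model of the kcopyd throttle accounting. All periods are
 * in arbitrary time units (jiffies in the kernel). */
struct throttle_state {
	unsigned throttle;	/* max % of time spent on I/O; 100 disables */
	unsigned io_period;	/* time spent doing I/O this window */
	unsigned total_period;	/* length of the whole window so far */
};

/* Should a new copy job back off? Yes, when I/O has occupied more than
 * 'throttle' percent of the observation window. */
static int io_job_should_delay(const struct throttle_state *t)
{
	if (t->throttle >= 100)
		return 0;		/* throttling disabled */
	if (!t->total_period)
		return 0;		/* nothing measured yet */
	return t->io_period * 100 > t->throttle * t->total_period;
}
```

Because the throttle struct can be shared between clients (as the comment above notes), grouping several kcopyd users on one instance makes them compete within a single percentage budget rather than each getting their own.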
+8 -3
include/uapi/linux/dm-ioctl.h
··· 267 267 #define DM_DEV_SET_GEOMETRY _IOWR(DM_IOCTL, DM_DEV_SET_GEOMETRY_CMD, struct dm_ioctl) 268 268 269 269 #define DM_VERSION_MAJOR 4 270 - #define DM_VERSION_MINOR 23 271 - #define DM_VERSION_PATCHLEVEL 1 272 - #define DM_VERSION_EXTRA "-ioctl (2012-12-18)" 270 + #define DM_VERSION_MINOR 24 271 + #define DM_VERSION_PATCHLEVEL 0 272 + #define DM_VERSION_EXTRA "-ioctl (2013-01-15)" 273 273 274 274 /* Status bits */ 275 275 #define DM_READONLY_FLAG (1 << 0) /* In/Out */ ··· 335 335 * or requesting sensitive data such as an encryption key. 336 336 */ 337 337 #define DM_SECURE_DATA_FLAG (1 << 15) /* In */ 338 + 339 + /* 340 + * If set, a message generated output data. 341 + */ 342 + #define DM_DATA_OUT_FLAG (1 << 16) /* Out */ 338 343 339 344 #endif /* _LINUX_DM_IOCTL_H */