Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

cleancache: remove limit on the number of cleancache enabled filesystems

The limit equals 32 and is imposed by the number of entries in the
fs_poolid_map and shared_fs_poolid_map. Nowadays it is insufficient,
because with containers on board a Linux host can have hundreds of
active fs mounts.

These maps were introduced by commit 49a9ab815acb8 ("mm: cleancache:
lazy initialization to allow tmem backends to build/run as modules") in
order to allow compiling cleancache drivers as modules. Real pool ids
are stored in these maps while super_block->cleancache_poolid points to
an entry in the map, so that on cleancache registration we can walk over
all (if there are <= 32 of them, of course) cleancache-enabled super
blocks and assign real pool ids.

Actually, there is no need for these maps at all, because we can
iterate over all super blocks directly using iterate_supers. This is
not racy, because cleancache_init_fs is called from mount_fs with
super_block->s_umount held for writing, while iterate_supers takes this
semaphore for reading. So if we call iterate_supers after setting
cleancache_ops, all super blocks that were created before
cleancache_register_ops was called will be assigned pool ids by the
action function of iterate_supers, while all newer super blocks will
receive them in cleancache_init_fs.

This patch therefore removes the maps and hence the artificial limit on
the number of cleancache enabled filesystems.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Stefan Hengelein <ilendir@googlemail.com>
Cc: Florian Schmaus <fschmaus@gmail.com>
Cc: Andor Daam <andor.daam@googlemail.com>
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Bob Liu <lliubbo@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Authored by Vladimir Davydov, committed by Linus Torvalds
commit 3cb29d11, parent 53d85c98

3 files changed: +94 -182
fs/super.c (+1 -1)

···
 	s->s_maxbytes = MAX_NON_LFS;
 	s->s_op = &default_op;
 	s->s_time_gran = 1000000000;
-	s->cleancache_poolid = -1;
+	s->cleancache_poolid = CLEANCACHE_NO_POOL;
 
 	s->s_shrink.seeks = DEFAULT_SEEKS;
 	s->s_shrink.scan_objects = super_cache_scan;
include/linux/cleancache.h (+4)

···
 #include <linux/exportfs.h>
 #include <linux/mm.h>
 
+#define CLEANCACHE_NO_POOL		-1
+#define CLEANCACHE_NO_BACKEND		-2
+#define CLEANCACHE_NO_BACKEND_SHARED	-3
+
 #define CLEANCACHE_KEY_MAX 6
 
 /*
mm/cleancache.c (+89 -181)

···
 #include <linux/cleancache.h>
 
 /*
- * cleancache_ops is set by cleancache_ops_register to contain the pointers
+ * cleancache_ops is set by cleancache_register_ops to contain the pointers
  * to the cleancache "backend" implementation functions.
  */
 static struct cleancache_ops *cleancache_ops __read_mostly;
···
 static u64 cleancache_puts;
 static u64 cleancache_invalidates;
 
-/*
- * When no backend is registered all calls to init_fs and init_shared_fs
- * are registered and fake poolids (FAKE_FS_POOLID_OFFSET or
- * FAKE_SHARED_FS_POOLID_OFFSET, plus offset in the respective array
- * [shared_|]fs_poolid_map) are given to the respective super block
- * (sb->cleancache_poolid) and no tmem_pools are created. When a backend
- * registers with cleancache the previous calls to init_fs and init_shared_fs
- * are executed to create tmem_pools and set the respective poolids. While no
- * backend is registered all "puts", "gets" and "flushes" are ignored or failed.
- */
-#define MAX_INITIALIZABLE_FS 32
-#define FAKE_FS_POOLID_OFFSET 1000
-#define FAKE_SHARED_FS_POOLID_OFFSET 2000
-
-#define FS_NO_BACKEND (-1)
-#define FS_UNKNOWN (-2)
-static int fs_poolid_map[MAX_INITIALIZABLE_FS];
-static int shared_fs_poolid_map[MAX_INITIALIZABLE_FS];
-static char *uuids[MAX_INITIALIZABLE_FS];
-/*
- * Mutex for the [shared_|]fs_poolid_map to guard against multiple threads
- * invoking umount (and ending in __cleancache_invalidate_fs) and also multiple
- * threads calling mount (and ending up in __cleancache_init_[shared|]fs).
- */
-static DEFINE_MUTEX(poolid_mutex);
-/*
- * When set to false (default) all calls to the cleancache functions, except
- * the __cleancache_invalidate_fs and __cleancache_init_[shared|]fs are guarded
- * by the if (!cleancache_ops) return. This means multiple threads (from
- * different filesystems) will be checking cleancache_ops. The usage of a
- * bool instead of a atomic_t or a bool guarded by a spinlock is OK - we are
- * OK if the time between the backend's have been initialized (and
- * cleancache_ops has been set to not NULL) and when the filesystems start
- * actually calling the backends. The inverse (when unloading) is obviously
- * not good - but this shim does not do that (yet).
- */
-
-/*
- * The backends and filesystems work all asynchronously. This is b/c the
- * backends can be built as modules.
- * The usual sequence of events is:
- *	a) mount /	-> __cleancache_init_fs is called. We set the
- *		[shared_|]fs_poolid_map and uuids for.
- *
- *	b). user does I/Os -> we call the rest of __cleancache_* functions
- *		which return immediately as cleancache_ops is false.
- *
- *	c). modprobe zcache -> cleancache_register_ops. We init the backend
- *		and set cleancache_ops to true, and for any fs_poolid_map
- *		(which is set by __cleancache_init_fs) we initialize the poolid.
- *
- *	d). user does I/Os -> now that cleancache_ops is true all the
- *		__cleancache_* functions can call the backend. They all check
- *		that fs_poolid_map is valid and if so invoke the backend.
- *
- *	e). umount /	-> __cleancache_invalidate_fs, the fs_poolid_map is
- *		reset (which is the second check in the __cleancache_* ops
- *		to call the backend).
- *
- * The sequence of event could also be c), followed by a), and d). and e). The
- * c) would not happen anymore. There is also the chance of c), and one thread
- * doing a) + d), and another doing e). For that case we depend on the
- * filesystem calling __cleancache_invalidate_fs in the proper sequence (so
- * that it handles all I/Os before it invalidates the fs (which is last part
- * of unmounting process).
- *
- * Note: The acute reader will notice that there is no "rmmod zcache" case.
- * This is b/c the functionality for that is not yet implemented and when
- * done, will require some extra locking not yet devised.
- */
+static void cleancache_register_ops_sb(struct super_block *sb, void *unused)
+{
+	switch (sb->cleancache_poolid) {
+	case CLEANCACHE_NO_BACKEND:
+		__cleancache_init_fs(sb);
+		break;
+	case CLEANCACHE_NO_BACKEND_SHARED:
+		__cleancache_init_shared_fs(sb);
+		break;
+	}
+}
 
 /*
  * Register operations for cleancache. Returns 0 on success.
  */
 int cleancache_register_ops(struct cleancache_ops *ops)
 {
-	int i;
-
-	mutex_lock(&poolid_mutex);
-	if (cleancache_ops) {
-		mutex_unlock(&poolid_mutex);
+	if (cmpxchg(&cleancache_ops, NULL, ops))
 		return -EBUSY;
-	}
-	for (i = 0; i < MAX_INITIALIZABLE_FS; i++) {
-		if (fs_poolid_map[i] == FS_NO_BACKEND)
-			fs_poolid_map[i] = ops->init_fs(PAGE_SIZE);
-		if (shared_fs_poolid_map[i] == FS_NO_BACKEND)
-			shared_fs_poolid_map[i] = ops->init_shared_fs
-					(uuids[i], PAGE_SIZE);
-	}
+
 	/*
-	 * We MUST set cleancache_ops _after_ we have called the backends
-	 * init_fs or init_shared_fs functions. Otherwise the compiler might
-	 * re-order where cleancache_ops is set in this function.
+	 * A cleancache backend can be built as a module and hence loaded after
+	 * a cleancache enabled filesystem has called cleancache_init_fs. To
+	 * handle such a scenario, here we call ->init_fs or ->init_shared_fs
+	 * for each active super block. To differentiate between local and
+	 * shared filesystems, we temporarily initialize sb->cleancache_poolid
+	 * to CLEANCACHE_NO_BACKEND or CLEANCACHE_NO_BACKEND_SHARED
+	 * respectively in case there is no backend registered at the time
+	 * cleancache_init_fs or cleancache_init_shared_fs is called.
+	 *
+	 * Since filesystems can be mounted concurrently with cleancache
+	 * backend registration, we have to be careful to guarantee that all
+	 * cleancache enabled filesystems that has been mounted by the time
+	 * cleancache_register_ops is called has got and all mounted later will
+	 * get cleancache_poolid. This is assured by the following statements
+	 * tied together:
+	 *
+	 *  a) iterate_supers skips only those super blocks that has started
+	 *     ->kill_sb
+	 *
+	 *  b) if iterate_supers encounters a super block that has not finished
+	 *     ->mount yet, it waits until it is finished
+	 *
+	 *  c) cleancache_init_fs is called from ->mount and
+	 *     cleancache_invalidate_fs is called from ->kill_sb
+	 *
+	 *  d) we call iterate_supers after cleancache_ops has been set
+	 *
+	 * From a) it follows that if iterate_supers skips a super block, then
+	 * either the super block is already dead, in which case we do not need
+	 * to bother initializing cleancache for it, or it was mounted after we
+	 * initiated iterate_supers. In the latter case, it must have seen
+	 * cleancache_ops set according to d) and initialized cleancache from
+	 * ->mount by itself according to c). This proves that we call
+	 * ->init_fs at least once for each active super block.
+	 *
+	 * From b) and c) it follows that if iterate_supers encounters a super
+	 * block that has already started ->init_fs, it will wait until ->mount
+	 * and hence ->init_fs has finished, then check cleancache_poolid, see
+	 * that it has already been set and therefore do nothing. This proves
+	 * that we call ->init_fs no more than once for each super block.
+	 *
+	 * Combined together, the last two paragraphs prove the function
+	 * correctness.
+	 *
+	 * Note that various cleancache callbacks may proceed before this
+	 * function is called or even concurrently with it, but since
+	 * CLEANCACHE_NO_BACKEND is negative, they will all result in a noop
+	 * until the corresponding ->init_fs has been actually called and
+	 * cleancache_ops has been set.
 	 */
-	barrier();
-	cleancache_ops = ops;
-	mutex_unlock(&poolid_mutex);
+	iterate_supers(cleancache_register_ops_sb, NULL);
 	return 0;
 }
 EXPORT_SYMBOL(cleancache_register_ops);
···
 /* Called by a cleancache-enabled filesystem at time of mount */
 void __cleancache_init_fs(struct super_block *sb)
 {
-	int i;
+	int pool_id = CLEANCACHE_NO_BACKEND;
 
-	mutex_lock(&poolid_mutex);
-	for (i = 0; i < MAX_INITIALIZABLE_FS; i++) {
-		if (fs_poolid_map[i] == FS_UNKNOWN) {
-			sb->cleancache_poolid = i + FAKE_FS_POOLID_OFFSET;
-			if (cleancache_ops)
-				fs_poolid_map[i] = cleancache_ops->init_fs(PAGE_SIZE);
-			else
-				fs_poolid_map[i] = FS_NO_BACKEND;
-			break;
-		}
+	if (cleancache_ops) {
+		pool_id = cleancache_ops->init_fs(PAGE_SIZE);
+		if (pool_id < 0)
+			pool_id = CLEANCACHE_NO_POOL;
 	}
-	mutex_unlock(&poolid_mutex);
+	sb->cleancache_poolid = pool_id;
 }
 EXPORT_SYMBOL(__cleancache_init_fs);
 
 /* Called by a cleancache-enabled clustered filesystem at time of mount */
 void __cleancache_init_shared_fs(struct super_block *sb)
 {
-	int i;
+	int pool_id = CLEANCACHE_NO_BACKEND_SHARED;
 
-	mutex_lock(&poolid_mutex);
-	for (i = 0; i < MAX_INITIALIZABLE_FS; i++) {
-		if (shared_fs_poolid_map[i] == FS_UNKNOWN) {
-			sb->cleancache_poolid = i + FAKE_SHARED_FS_POOLID_OFFSET;
-			uuids[i] = sb->s_uuid;
-			if (cleancache_ops)
-				shared_fs_poolid_map[i] = cleancache_ops->init_shared_fs
-						(sb->s_uuid, PAGE_SIZE);
-			else
-				shared_fs_poolid_map[i] = FS_NO_BACKEND;
-			break;
-		}
+	if (cleancache_ops) {
+		pool_id = cleancache_ops->init_shared_fs(sb->s_uuid, PAGE_SIZE);
+		if (pool_id < 0)
+			pool_id = CLEANCACHE_NO_POOL;
 	}
-	mutex_unlock(&poolid_mutex);
+	sb->cleancache_poolid = pool_id;
 }
 EXPORT_SYMBOL(__cleancache_init_shared_fs);
···
 }
 
 /*
- * Returns a pool_id that is associated with a given fake poolid.
- */
-static int get_poolid_from_fake(int fake_pool_id)
-{
-	if (fake_pool_id >= FAKE_SHARED_FS_POOLID_OFFSET)
-		return shared_fs_poolid_map[fake_pool_id -
-			FAKE_SHARED_FS_POOLID_OFFSET];
-	else if (fake_pool_id >= FAKE_FS_POOLID_OFFSET)
-		return fs_poolid_map[fake_pool_id - FAKE_FS_POOLID_OFFSET];
-	return FS_NO_BACKEND;
-}
-
-/*
  * "Get" data from cleancache associated with the poolid/inode/index
  * that were specified when the data was put to cleanache and, if
  * successful, use it to fill the specified page with data and return 0.
···
 {
 	int ret = -1;
 	int pool_id;
-	int fake_pool_id;
 	struct cleancache_filekey key = { .u.key = { 0 } };
 
 	if (!cleancache_ops) {
···
 	}
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
-	fake_pool_id = page->mapping->host->i_sb->cleancache_poolid;
-	if (fake_pool_id < 0)
+	pool_id = page->mapping->host->i_sb->cleancache_poolid;
+	if (pool_id < 0)
 		goto out;
-	pool_id = get_poolid_from_fake(fake_pool_id);
 
 	if (cleancache_get_key(page->mapping->host, &key) < 0)
 		goto out;
 
-	if (pool_id >= 0)
-		ret = cleancache_ops->get_page(pool_id,
-				key, page->index, page);
+	ret = cleancache_ops->get_page(pool_id, key, page->index, page);
 	if (ret == 0)
 		cleancache_succ_gets++;
 	else
···
 void __cleancache_put_page(struct page *page)
 {
 	int pool_id;
-	int fake_pool_id;
 	struct cleancache_filekey key = { .u.key = { 0 } };
 
 	if (!cleancache_ops) {
···
 	}
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
-	fake_pool_id = page->mapping->host->i_sb->cleancache_poolid;
-	if (fake_pool_id < 0)
-		return;
-
-	pool_id = get_poolid_from_fake(fake_pool_id);
-
+	pool_id = page->mapping->host->i_sb->cleancache_poolid;
 	if (pool_id >= 0 &&
 		cleancache_get_key(page->mapping->host, &key) >= 0) {
 		cleancache_ops->put_page(pool_id, key, page->index, page);
···
 		struct page *page)
 {
 	/* careful... page->mapping is NULL sometimes when this is called */
-	int pool_id;
-	int fake_pool_id = mapping->host->i_sb->cleancache_poolid;
+	int pool_id = mapping->host->i_sb->cleancache_poolid;
 	struct cleancache_filekey key = { .u.key = { 0 } };
 
 	if (!cleancache_ops)
 		return;
 
-	if (fake_pool_id >= 0) {
-		pool_id = get_poolid_from_fake(fake_pool_id);
-		if (pool_id < 0)
-			return;
-
+	if (pool_id >= 0) {
 		VM_BUG_ON_PAGE(!PageLocked(page), page);
 		if (cleancache_get_key(mapping->host, &key) >= 0) {
 			cleancache_ops->invalidate_page(pool_id,
···
  */
 void __cleancache_invalidate_inode(struct address_space *mapping)
 {
-	int pool_id;
-	int fake_pool_id = mapping->host->i_sb->cleancache_poolid;
+	int pool_id = mapping->host->i_sb->cleancache_poolid;
 	struct cleancache_filekey key = { .u.key = { 0 } };
 
 	if (!cleancache_ops)
 		return;
-
-	if (fake_pool_id < 0)
-		return;
-
-	pool_id = get_poolid_from_fake(fake_pool_id);
 
 	if (pool_id >= 0 && cleancache_get_key(mapping->host, &key) >= 0)
 		cleancache_ops->invalidate_inode(pool_id, key);
···
  */
 void __cleancache_invalidate_fs(struct super_block *sb)
 {
-	int index;
-	int fake_pool_id = sb->cleancache_poolid;
-	int old_poolid = fake_pool_id;
+	int pool_id;
 
-	mutex_lock(&poolid_mutex);
-	if (fake_pool_id >= FAKE_SHARED_FS_POOLID_OFFSET) {
-		index = fake_pool_id - FAKE_SHARED_FS_POOLID_OFFSET;
-		old_poolid = shared_fs_poolid_map[index];
-		shared_fs_poolid_map[index] = FS_UNKNOWN;
-		uuids[index] = NULL;
-	} else if (fake_pool_id >= FAKE_FS_POOLID_OFFSET) {
-		index = fake_pool_id - FAKE_FS_POOLID_OFFSET;
-		old_poolid = fs_poolid_map[index];
-		fs_poolid_map[index] = FS_UNKNOWN;
-	}
-	sb->cleancache_poolid = -1;
-	if (cleancache_ops)
-		cleancache_ops->invalidate_fs(old_poolid);
-	mutex_unlock(&poolid_mutex);
+	pool_id = sb->cleancache_poolid;
+	sb->cleancache_poolid = CLEANCACHE_NO_POOL;
+
+	if (cleancache_ops && pool_id >= 0)
+		cleancache_ops->invalidate_fs(pool_id);
 }
 EXPORT_SYMBOL(__cleancache_invalidate_fs);
 
 static int __init init_cleancache(void)
 {
-	int i;
-
 #ifdef CONFIG_DEBUG_FS
 	struct dentry *root = debugfs_create_dir("cleancache", NULL);
 	if (root == NULL)
···
 	debugfs_create_u64("invalidates", S_IRUGO,
 				root, &cleancache_invalidates);
 #endif
-	for (i = 0; i < MAX_INITIALIZABLE_FS; i++) {
-		fs_poolid_map[i] = FS_UNKNOWN;
-		shared_fs_poolid_map[i] = FS_UNKNOWN;
-	}
 	return 0;
 }
 module_init(init_cleancache)