Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux

CacheFiles: Fix occasional EIO on call to vfs_unlink()

Fix an occasional EIO returned by a call to vfs_unlink():

[ 4868.465413] CacheFiles: I/O Error: Unlink failed
[ 4868.465444] FS-Cache: Cache cachefiles stopped due to I/O error
[ 4947.320011] CacheFiles: File cache on md3 unregistering
[ 4947.320041] FS-Cache: Withdrawing cache "mycache"
[ 5127.348683] FS-Cache: Cache "mycache" added (type cachefiles)
[ 5127.348716] CacheFiles: File cache on md3 registered
[ 7076.871081] CacheFiles: I/O Error: Unlink failed
[ 7076.871130] FS-Cache: Cache cachefiles stopped due to I/O error
[ 7116.780891] CacheFiles: File cache on md3 unregistering
[ 7116.780937] FS-Cache: Withdrawing cache "mycache"
[ 7296.813394] FS-Cache: Cache "mycache" added (type cachefiles)
[ 7296.813432] CacheFiles: File cache on md3 registered

What happens is this:

(1) A cached NFS file is seen to have become out of date, so NFS retires the
object and immediately acquires a new object with the same key.

(2) Retirement of the old object is done asynchronously - so the lookup/create
to generate the new object may be done first.

This can be a problem as the old object and the new object must exist at
the same point in the backing filesystem (i.e. they must have the same
pathname).

(3) The lookup for the new object sees that a backing file already exists,
checks to see whether it is valid and sees that it isn't. It then deletes
that file and creates a new one on disk.

(4) The retirement phase for the old file is then performed. It tries to
delete the dentry it has, but ext4_unlink() returns -EIO because the inode
attached to that dentry no longer matches the inode number associated with
the filename in the parent directory.

The trace below shows this quite well.

[md5sum] ==> __fscache_relinquish_cookie(ffff88002d12fb58{NFS.fh,ffff88002ce62100},1)
[md5sum] ==> __fscache_acquire_cookie({NFS.server},{NFS.fh},ffff88002ce62100)

NFS has retired the old cookie and asked for a new one.

[kslowd] ==> fscache_object_state_machine({OBJ52,OBJECT_ACTIVE,24})
[kslowd] <== fscache_object_state_machine() [->OBJECT_DYING]
[kslowd] ==> fscache_object_state_machine({OBJ53,OBJECT_INIT,0})
[kslowd] <== fscache_object_state_machine() [->OBJECT_LOOKING_UP]
[kslowd] ==> fscache_object_state_machine({OBJ52,OBJECT_DYING,24})
[kslowd] <== fscache_object_state_machine() [->OBJECT_RECYCLING]

The old object (OBJ52) is going through the terminal states to get rid of it,
whilst the new object - (OBJ53) - is coming into being.

[kslowd] ==> fscache_object_state_machine({OBJ53,OBJECT_LOOKING_UP,0})
[kslowd] ==> cachefiles_walk_to_object({ffff88003029d8b8},OBJ53,@68,)
[kslowd] lookup '@68'
[kslowd] next -> ffff88002ce41bd0 positive
[kslowd] advance
[kslowd] lookup 'Es0g00og0_Nd_XCYe3BOzvXrsBLMlN6aw16M1htaA'
[kslowd] next -> ffff8800369faac8 positive

The new object has looked up the subdir in which the file would be in (getting
dentry ffff88002ce41bd0) and then looked up the file itself (getting dentry
ffff8800369faac8).

[kslowd] validate 'Es0g00og0_Nd_XCYe3BOzvXrsBLMlN6aw16M1htaA'
[kslowd] ==> cachefiles_bury_object(,'@68','Es0g00og0_Nd_XCYe3BOzvXrsBLMlN6aw16M1htaA')
[kslowd] remove ffff8800369faac8 from ffff88002ce41bd0
[kslowd] unlink stale object
[kslowd] <== cachefiles_bury_object() = 0

It then checks the file's xattrs to see if it's valid. NFS says that the
auxiliary data indicate the file is out of date (obvious to us - that's why NFS
ditched the old version and got a new one). CacheFiles then deletes the old
file (dentry ffff8800369faac8).

[kslowd] redo lookup
[kslowd] lookup 'Es0g00og0_Nd_XCYe3BOzvXrsBLMlN6aw16M1htaA'
[kslowd] next -> ffff88002cd94288 negative
[kslowd] create -> ffff88002cd94288{ffff88002cdaf238{ino=148247}}

CacheFiles then redoes the lookup and gets a negative result in a new dentry
(ffff88002cd94288) which it then creates a file for.

[kslowd] ==> cachefiles_mark_object_active(,OBJ53)
[kslowd] <== cachefiles_mark_object_active() = 0
[kslowd] === OBTAINED_OBJECT ===
[kslowd] <== cachefiles_walk_to_object() = 0 [148247]
[kslowd] <== fscache_object_state_machine() [->OBJECT_AVAILABLE]

The new object is then marked active and the state machine moves to the
available state - at which point NFS can start filling the object.

[kslowd] ==> fscache_object_state_machine({OBJ52,OBJECT_RECYCLING,20})
[kslowd] ==> fscache_release_object()
[kslowd] ==> cachefiles_drop_object({OBJ52,2})
[kslowd] ==> cachefiles_delete_object(,OBJ52{ffff8800369faac8})

The old object, meanwhile, goes on with being retired. If allocation occurs
first, cachefiles_delete_object() has to wait for dir->d_inode->i_mutex to
become available before it can continue.

[kslowd] ==> cachefiles_bury_object(,'@68','Es0g00og0_Nd_XCYe3BOzvXrsBLMlN6aw16M1htaA')
[kslowd] remove ffff8800369faac8 from ffff88002ce41bd0
[kslowd] unlink stale object
EXT4-fs warning (device sda6): ext4_unlink: Inode number mismatch in unlink (148247!=148193)
CacheFiles: I/O Error: Unlink failed
FS-Cache: Cache cachefiles stopped due to I/O error

CacheFiles then tries to delete the file for the old object, but the dentry it
has (ffff8800369faac8) no longer points to a valid inode for that directory
entry, and so ext4_unlink() returns -EIO when de->inode does not match i_ino.

[kslowd] <== cachefiles_bury_object() = -5
[kslowd] <== cachefiles_delete_object() = -5
[kslowd] <== fscache_object_state_machine() [->OBJECT_DEAD]
[kslowd] ==> fscache_object_state_machine({OBJ53,OBJECT_AVAILABLE,0})
[kslowd] <== fscache_object_state_machine() [->OBJECT_ACTIVE]

(Note that the above trace includes extra information beyond that produced by
the upstream code).

The fix is to note when an object that is being retired has had its object
deleted preemptively by a replacement object that is being created, and to
skip the second removal attempt in such a case.

Reported-by: Greg M <gregm@servu.net.au>
Reported-by: Mark Moseley <moseleymark@gmail.com>
Reported-by: Romain DEGEZ <romain.degez@smartjog.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

authored by

David Howells and committed by
Linus Torvalds
c61ea31d 7d6fb7bd

+87 -12
+1
fs/cachefiles/internal.h
··· 34 34 loff_t i_size; /* object size */ 35 35 unsigned long flags; 36 36 #define CACHEFILES_OBJECT_ACTIVE 0 /* T if marked active */ 37 + #define CACHEFILES_OBJECT_BURIED 1 /* T if preemptively buried */ 37 38 atomic_t usage; /* object usage count */ 38 39 uint8_t type; /* object type */ 39 40 uint8_t new; /* T if object new */
+86 -12
fs/cachefiles/namei.c
··· 93 93 } 94 94 95 95 /* 96 + * mark the owner of a dentry, if there is one, to indicate that that dentry 97 + * has been preemptively deleted 98 + * - the caller must hold the i_mutex on the dentry's parent as required to 99 + * call vfs_unlink(), vfs_rmdir() or vfs_rename() 100 + */ 101 + static void cachefiles_mark_object_buried(struct cachefiles_cache *cache, 102 + struct dentry *dentry) 103 + { 104 + struct cachefiles_object *object; 105 + struct rb_node *p; 106 + 107 + _enter(",'%*.*s'", 108 + dentry->d_name.len, dentry->d_name.len, dentry->d_name.name); 109 + 110 + write_lock(&cache->active_lock); 111 + 112 + p = cache->active_nodes.rb_node; 113 + while (p) { 114 + object = rb_entry(p, struct cachefiles_object, active_node); 115 + if (object->dentry > dentry) 116 + p = p->rb_left; 117 + else if (object->dentry < dentry) 118 + p = p->rb_right; 119 + else 120 + goto found_dentry; 121 + } 122 + 123 + write_unlock(&cache->active_lock); 124 + _leave(" [no owner]"); 125 + return; 126 + 127 + /* found the dentry for */ 128 + found_dentry: 129 + kdebug("preemptive burial: OBJ%x [%s] %p", 130 + object->fscache.debug_id, 131 + fscache_object_states[object->fscache.state], 132 + dentry); 133 + 134 + if (object->fscache.state < FSCACHE_OBJECT_DYING) { 135 + printk(KERN_ERR "\n"); 136 + printk(KERN_ERR "CacheFiles: Error:" 137 + " Can't preemptively bury live object\n"); 138 + cachefiles_printk_object(object, NULL); 139 + } else if (test_and_set_bit(CACHEFILES_OBJECT_BURIED, &object->flags)) { 140 + printk(KERN_ERR "CacheFiles: Error:" 141 + " Object already preemptively buried\n"); 142 + } 143 + 144 + write_unlock(&cache->active_lock); 145 + _leave(" [owner marked]"); 146 + } 147 + 148 + /* 96 149 * record the fact that an object is now active 97 150 */ 98 151 static int cachefiles_mark_object_active(struct cachefiles_cache *cache, ··· 272 219 */ 273 220 static int cachefiles_bury_object(struct cachefiles_cache *cache, 274 221 struct dentry *dir, 275 - struct dentry *rep) 222 + struct dentry *rep, 223 + bool preemptive) 276 224 { 277 225 struct dentry *grave, *trap; 278 226 char nbuffer[8 + 8 + 1]; ··· 283 229 dir->d_name.len, dir->d_name.len, dir->d_name.name, 284 230 rep->d_name.len, rep->d_name.len, rep->d_name.name); 285 231 232 + _debug("remove %p from %p", rep, dir); 233 + 286 234 /* non-directories can just be unlinked */ 287 235 if (!S_ISDIR(rep->d_inode->i_mode)) { 288 236 _debug("unlink stale object"); 289 237 ret = vfs_unlink(dir->d_inode, rep); 238 + 239 + if (preemptive) 240 + cachefiles_mark_object_buried(cache, rep); 290 241 291 242 mutex_unlock(&dir->d_inode->i_mutex); 292 243 ··· 384 325 if (ret != 0 && ret != -ENOMEM) 385 326 cachefiles_io_error(cache, "Rename failed with error %d", ret); 386 327 328 + if (preemptive) 329 + cachefiles_mark_object_buried(cache, rep); 330 + 387 331 unlock_rename(cache->graveyard, dir); 388 332 dput(grave); 389 333 _leave(" = 0"); ··· 402 340 struct dentry *dir; 403 341 int ret; 404 342 405 - _enter(",{%p}", object->dentry); 343 + _enter(",OBJ%x{%p}", object->fscache.debug_id, object->dentry); 406 344 407 345 ASSERT(object->dentry); 408 346 ASSERT(object->dentry->d_inode); ··· 412 350 413 351 mutex_lock_nested(&dir->d_inode->i_mutex, I_MUTEX_PARENT); 414 352 415 - /* we need to check that our parent is _still_ our parent - it may have 416 - * been renamed */ 417 - if (dir == object->dentry->d_parent) { 418 - ret = cachefiles_bury_object(cache, dir, object->dentry); 419 - } else { 420 - /* it got moved, presumably by cachefilesd culling it, so it's 421 - * no longer in the key path and we can ignore it */ 353 + if (test_bit(CACHEFILES_OBJECT_BURIED, &object->flags)) { 354 + /* object allocation for the same key preemptively deleted this 355 + * object's file so that it could create its own file */ 356 + _debug("object preemptively buried"); 422 357 mutex_unlock(&dir->d_inode->i_mutex); 423 358 ret = 0; 359 + } else { 360 + /* we need to check that our parent is _still_ our parent - it 361 + * may have been renamed */ 362 + if (dir == object->dentry->d_parent) { 363 + ret = cachefiles_bury_object(cache, dir, 364 + object->dentry, false); 365 + } else { 366 + /* it got moved, presumably by cachefilesd culling it, 367 + * so it's no longer in the key path and we can ignore 368 + * it */ 369 + mutex_unlock(&dir->d_inode->i_mutex); 370 + ret = 0; 371 + } 424 372 } 425 373 426 374 dput(dir); ··· 453 381 const char *name; 454 382 int ret, nlen; 455 383 456 - _enter("{%p},,%s,", parent->dentry, key); 384 + _enter("OBJ%x{%p},OBJ%x,%s,", 385 + parent->fscache.debug_id, parent->dentry, 386 + object->fscache.debug_id, key); 457 387 458 388 cache = container_of(parent->fscache.cache, 459 389 struct cachefiles_cache, cache); ··· 583 509 * mutex) */ 584 510 object->dentry = NULL; 585 511 586 - ret = cachefiles_bury_object(cache, dir, next); 512 + ret = cachefiles_bury_object(cache, dir, next, true); 587 513 dput(next); 588 514 next = NULL; 589 515 ··· 902 828 /* actually remove the victim (drops the dir mutex) */ 903 829 _debug("bury"); 904 830 905 - ret = cachefiles_bury_object(cache, dir, victim); 831 + ret = cachefiles_bury_object(cache, dir, victim, false); 906 832 if (ret < 0) 907 833 goto error; 908 834