Btrfs: fix loss of prealloc extents past i_size after fsync log replay

Currently if we allocate extents beyond an inode's i_size (through the
fallocate system call) and then fsync the file, we log the extents but
after a power failure we replay them and then immediately drop them.
This behaviour has existed since about 2009, commit c71bf099abdd
("Btrfs: Avoid orphan inodes cleanup while replaying log"). Instead of
dropping any extents beyond i_size before replaying logged extents,
that commit marks the inode as an orphan. As a result, after the log
replay, and while the mount operation is still ongoing, we find the
inode marked as an orphan and then perform a truncation (drop extents
beyond the inode's i_size). Because the processing of orphan inodes is
still done right after replaying the log and before the mount operation
finishes, the intention of that commit does not make any sense (at
least as of today). However, reverting that behaviour is not enough:
we cannot simply discard all extents beyond i_size and then replay
logged extents, because we risk dropping extents beyond i_size created
in past transactions, for example:

add prealloc extent beyond i_size
fsync - clears the flag BTRFS_INODE_NEEDS_FULL_SYNC from the inode
transaction commit
add another prealloc extent beyond i_size
fsync - triggers the fast fsync path
power failure

In that scenario, we would drop the first extent and then replay the
second one. To fix this, just make sure that all prealloc extents
beyond i_size are logged, and if we find too many (which is far from
a common case), fall back to a full transaction commit (like we do when
logging regular extents in the fast fsync path).
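
The logging-side rule described above can be sketched as a small
userspace model. This is only an illustration, not the kernel code: the
struct model_extent type, the queue_prealloc_past_isize() helper and
the array-based extent list are assumptions made for the example. Only
the backwards walk, the stop at i_size, the skip of already queued
extents and the -EFBIG fallback past 32768 extents mirror the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Same threshold the patch uses before giving up on the fast path. */
#define MAX_LOGGED_EXTENTS 32768

/* Hypothetical stand-in for an extent map entry. */
struct model_extent {
	uint64_t start;	/* file offset where the extent begins */
	int queued;	/* non-zero if already on the list of extents to log */
};

/*
 * Walk the extents (sorted by start offset) from the last one backwards
 * and queue every extent that starts at or beyond i_size. "num" is the
 * count of extents already queued by the caller. Returns the new total,
 * or -EFBIG if the total would exceed "limit", in which case the caller
 * is expected to fall back to a full transaction commit.
 */
static long queue_prealloc_past_isize(struct model_extent *ext, size_t count,
				      uint64_t i_size, long num, long limit)
{
	size_t i;

	for (i = count; i-- > 0; ) {
		if (ext[i].start < i_size)
			break;		/* everything before this is within i_size */
		if (ext[i].queued)
			continue;	/* already picked up as a modified extent */
		if (++num > limit)
			return -EFBIG;
		ext[i].queued = 1;
	}
	return num;
}
```

In this model a caller would pass MAX_LOGGED_EXTENTS as the limit and
treat a negative return value as the signal to commit the transaction
instead of logging.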

Trivial reproducer:

$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ xfs_io -f -c "pwrite -S 0xab 0 256K" /mnt/foo
$ sync
$ xfs_io -c "falloc -k 256K 1M" /mnt/foo
$ xfs_io -c "fsync" /mnt/foo
<power failure>

# mount to replay log
$ mount /dev/sdb /mnt
# at this point the file only has one extent, at offset 0, size 256K
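
For reference, the file operations of the reproducer can be
approximated from C. This is a sketch under assumptions: the
prealloc_past_eof() helper is hypothetical, syncfs() stands in for the
"sync" command, and the power failure and remount steps cannot be
expressed here. It only shows that fallocate() with FALLOC_FL_KEEP_SIZE
adds space past EOF without changing i_size.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write data_len bytes of 0xab, sync the filesystem,
 * preallocate prealloc_len bytes starting at data_len without changing
 * i_size, then fsync. Stores the resulting file size in *size_out.
 * Returns 0 on success and -1 on any error.
 */
static int prealloc_past_eof(const char *path, off_t data_len,
			     off_t prealloc_len, off_t *size_out)
{
	char buf[4096];
	struct stat st;
	off_t off;
	int fd;

	fd = open(path, O_CREAT | O_TRUNC | O_RDWR, 0644);
	if (fd < 0)
		return -1;

	memset(buf, 0xab, sizeof(buf));
	for (off = 0; off < data_len; off += (off_t)sizeof(buf)) {
		size_t n = sizeof(buf);

		if (data_len - off < (off_t)n)
			n = (size_t)(data_len - off);
		if (pwrite(fd, buf, n, off) != (ssize_t)n)
			goto fail;
	}

	/* Stand-in for "sync": flush the filesystem holding the file. */
	if (syncfs(fd) != 0)
		goto fail;

	/* Same as "falloc -k": allocate past EOF, keep i_size unchanged. */
	if (fallocate(fd, FALLOC_FL_KEEP_SIZE, data_len, prealloc_len) != 0)
		goto fail;
	if (fsync(fd) != 0)
		goto fail;
	if (fstat(fd, &st) != 0)
		goto fail;

	*size_out = st.st_size;
	close(fd);
	return 0;
fail:
	close(fd);
	return -1;
}
```

Calling prealloc_past_eof("/mnt/foo", 256 * 1024, 1024 * 1024, &size)
corresponds to the xfs_io steps above, up to the point of the power
failure.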

A test case for fstests will follow soon, covering multiple scenarios
that involve adding prealloc extents with previous shrinking truncates
and without such truncates.

Fixes: c71bf099abdd ("Btrfs: Avoid orphan inodes cleanup while replaying log")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

authored by Filipe Manana and committed by David Sterba 471d557a af722733

+58 -5
fs/btrfs/tree-log.c
···
 			if (ret)
 				break;
 
-			/* for regular files, make sure corresponding
-			 * orphan item exist. extents past the new EOF
-			 * will be truncated later by orphan cleanup.
+			/*
+			 * Before replaying extents, truncate the inode to its
+			 * size. We need to do it now and not after log replay
+			 * because before an fsync we can have prealloc extents
+			 * added beyond the inode's i_size. If we did it after,
+			 * through orphan cleanup for example, we would drop
+			 * those prealloc extents just after replaying them.
 			 */
 			if (S_ISREG(mode)) {
-				ret = insert_orphan_item(wc->trans, root,
-							 key.objectid);
+				struct inode *inode;
+				u64 from;
+
+				inode = read_one_inode(root, key.objectid);
+				if (!inode) {
+					ret = -EIO;
+					break;
+				}
+				from = ALIGN(i_size_read(inode),
+					     root->fs_info->sectorsize);
+				ret = btrfs_drop_extents(wc->trans, root, inode,
+							 from, (u64)-1, 1);
+				/*
+				 * If the nlink count is zero here, the iput
+				 * will free the inode. We bump it to make
+				 * sure it doesn't get freed until the link
+				 * count fixup is done.
+				 */
+				if (!ret) {
+					if (inode->i_nlink == 0)
+						inc_nlink(inode);
+					/* Update link count and nbytes. */
+					ret = btrfs_update_inode(wc->trans,
+								 root, inode);
+				}
+				iput(inode);
 				if (ret)
 					break;
 			}
···
 		set_bit(EXTENT_FLAG_LOGGING, &em->flags);
 		list_add_tail(&em->list, &extents);
 		num++;
+	}
+
+	/*
+	 * Add all prealloc extents beyond the inode's i_size to make sure we
+	 * don't lose them after doing a fast fsync and replaying the log.
+	 */
+	if (inode->flags & BTRFS_INODE_PREALLOC) {
+		struct rb_node *node;
+
+		for (node = rb_last(&tree->map); node; node = rb_prev(node)) {
+			em = rb_entry(node, struct extent_map, rb_node);
+			if (em->start < i_size_read(&inode->vfs_inode))
+				break;
+			if (!list_empty(&em->list))
+				continue;
+			/* Same as above loop. */
+			if (++num > 32768) {
+				list_del_init(&tree->modified_extents);
+				ret = -EFBIG;
+				goto process;
+			}
+			refcount_inc(&em->refs);
+			set_bit(EXTENT_FLAG_LOGGING, &em->flags);
+			list_add_tail(&em->list, &extents);
+		}
 	}
 
 	list_sort(NULL, &extents, extent_cmp);
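
On the replay side, the patch drops extents starting at i_size rounded
up to the sector size, via ALIGN(i_size_read(inode), sectorsize). The
rounding can be sketched in userspace as follows. The align_up() helper
is a hypothetical stand-in for the kernel's ALIGN() macro and assumes
the alignment is a power of two, which holds for sector sizes.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Round x up to the next multiple of a. Hypothetical stand-in for the
 * kernel's ALIGN() macro; a must be a power of two.
 */
static uint64_t align_up(uint64_t x, uint64_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/*
 * First byte offset from which extents would be dropped before replay:
 * the inode size rounded up to the sector size, so a partially used
 * last sector is kept.
 */
static uint64_t drop_from_offset(uint64_t i_size, uint64_t sectorsize)
{
	return align_up(i_size, sectorsize);
}
```

With a 4096 byte sector size, an i_size that is already sector aligned
is left unchanged, while any other i_size moves up to the start of the
next sector.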