Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Btrfs: fix fsync of files with multiple hard links in new directories

The log tree has a long-standing problem: when a file is fsync'ed we
only check for new ancestors, created in the current transaction, by
following only the hard link for which the fsync was issued. We follow the
ancestors using the VFS' dget_parent() API. This means that if we create a
new link for a file in a directory that is new (or in any other new
ancestor directory) and then fsync the file using an old hard link, we end
up not logging the new ancestor, and on log replay that new hard link and
ancestor do not exist. In some cases, involving renames, the file will not
exist at all.

Example:

mkfs.btrfs -f /dev/sdb
mount /dev/sdb /mnt

mkdir /mnt/A
touch /mnt/foo
ln /mnt/foo /mnt/A/bar
xfs_io -c fsync /mnt/foo

<power failure>

In this example, after log replay only the hard link named 'foo' exists
and directory A does not exist, which is unexpected. In other major Linux
filesystems, such as ext4, xfs and f2fs, both hard links exist and so does
directory A after mounting the filesystem again.
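
The same sequence of operations can be issued from a small C helper, which
may be handy when scripting the scenario. This is only a sketch: the
function name link_then_fsync and the base_dir parameter are made up for
illustration, and the power failure itself cannot be triggered from the
program, so it has to be simulated by other means before checking what
survives log replay.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: replay the example's steps under base_dir.
 * Returns 0 on success, -1 as soon as any call fails.
 */
int link_then_fsync(const char *base_dir)
{
	char dir_a[4096], foo[4096], bar[4096];
	int fd, ret;

	snprintf(dir_a, sizeof(dir_a), "%s/A", base_dir);
	snprintf(foo, sizeof(foo), "%s/foo", base_dir);
	snprintf(bar, sizeof(bar), "%s/A/bar", base_dir);

	/* mkdir A: the new ancestor directory */
	if (mkdir(dir_a, 0755) != 0)
		return -1;

	/* touch foo: the old hard link, kept open for the fsync below */
	fd = open(foo, O_CREAT | O_WRONLY, 0644);
	if (fd < 0)
		return -1;

	/* ln foo A/bar: the new hard link inside the new directory */
	if (link(foo, bar) != 0) {
		close(fd);
		return -1;
	}

	/* fsync through the old name, as "xfs_io -c fsync foo" does */
	ret = fsync(fd);
	close(fd);
	return ret;
}
```

Calling link_then_fsync("/mnt") on a freshly created filesystem issues the
same mkdir, touch, ln and fsync steps as the shell example.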

Checking whether any ancestors are new and need to be logged was added in
2009 by commit 12fcfd22fe5b ("Btrfs: tree logging unlink/rename fixes"),
however only for the ancestors of the hard link (dentry) for which the
fsync was issued, instead of checking all ancestors of all of the
inode's hard links.

So fix this by tracking the id of the last transaction where a hard link
was created for an inode and then, on fsync, fall back to a full
transaction commit when an inode has more than one hard link and at least
one new hard link was created in the current transaction. This is the
simplest solution since this is not a common use case (frequently adding
hard links for which there's an ancestor created in the current
transaction and then fsyncing the file). In case it ever becomes a common
use case, a solution that consists of iterating the fs/subvol btree for
each hard link and checking if any ancestor is new could be implemented.
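
The decision described above can be modelled in a few lines of userspace
C. This is a simplified sketch, not the kernel code: struct model_inode
and needs_full_commit are hypothetical names, and last_committed is
assumed to be the id of the last fully committed transaction, as in the
tree-log.c hunk of this commit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two inode fields the new check reads. */
struct model_inode {
	unsigned int nlink;       /* plays the role of vfs_inode.i_nlink */
	uint64_t last_link_trans; /* transid of the last hard link creation */
};

/*
 * True when fsync should give up on the log tree and fall back to a
 * full transaction commit: the inode has more than one hard link and
 * at least one link was created after the last committed transaction.
 */
bool needs_full_commit(const struct model_inode *inode,
		       uint64_t last_committed)
{
	return inode->nlink > 1 &&
	       inode->last_link_trans > last_committed;
}
```

With last_committed equal to 7, an inode with two links and
last_link_trans equal to 8 takes the fallback, while the same inode with
last_link_trans equal to 7, or an inode with a single link, keeps using
the log tree.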

This solves many unexpected scenarios reported by Jayashree Mohan and
Vijay Chidambaram, and for which there is a new test case for fstests
under review.

Fixes: 12fcfd22fe5b ("Btrfs: tree logging unlink/rename fixes")
CC: stable@vger.kernel.org # 4.4+
Reported-by: Vijay Chidambaram <vvijay03@gmail.com>
Reported-by: Jayashree Mohan <jayashree2912@gmail.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Authored by Filipe Manana and committed by David Sterba
41bd6067 bbe339cc

+39
+6
fs/btrfs/btrfs_inode.h
···
 	u64 last_unlink_trans;
 
 	/*
+	 * Track the transaction id of the last transaction used to create a
+	 * hard link for the inode. This is used by the log tree (fsync).
+	 */
+	u64 last_link_trans;
+
+	/*
 	 * Number of bytes outstanding that are going to need csums. This is
 	 * used in ENOSPC accounting.
 	 */
+17
fs/btrfs/inode.c
···
 	 * inode is not a directory, logging its parent unnecessarily.
 	 */
 	BTRFS_I(inode)->last_unlink_trans = BTRFS_I(inode)->last_trans;
+	/*
+	 * Similar reasoning for last_link_trans, needs to be set otherwise
+	 * for a case like the following:
+	 *
+	 * mkdir A
+	 * touch foo
+	 * ln foo A/bar
+	 * echo 2 > /proc/sys/vm/drop_caches
+	 * fsync foo
+	 * <power failure>
+	 *
+	 * Would result in link bar and directory A not existing after the power
+	 * failure.
+	 */
+	BTRFS_I(inode)->last_link_trans = BTRFS_I(inode)->last_trans;
 
 	path->slots[0]++;
 	if (inode->i_nlink != 1 ||
···
 		if (err)
 			goto fail;
 	}
+	BTRFS_I(inode)->last_link_trans = trans->transid;
 	d_instantiate(dentry, inode);
 	ret = btrfs_log_new_name(trans, BTRFS_I(inode), NULL, parent,
 				 true, NULL);
···
 	ei->index_cnt = (u64)-1;
 	ei->dir_index = 0;
 	ei->last_unlink_trans = 0;
+	ei->last_link_trans = 0;
 	ei->last_log_commit = 0;
 
 	spin_lock_init(&ei->lock);
+16
fs/btrfs/tree-log.c
···
 		goto end_trans;
 	}
 
+	/*
+	 * If a new hard link was added to the inode in the current transaction
+	 * and its link count is now greater than 1, we need to fallback to a
+	 * transaction commit, otherwise we can end up not logging all its new
+	 * parents for all the hard links. Here just from the dentry used to
+	 * fsync, we can not visit the ancestor inodes for all the other hard
+	 * links to figure out if any is new, so we fallback to a transaction
+	 * commit (instead of adding a lot of complexity of scanning a btree,
+	 * since this scenario is not a common use case).
+	 */
+	if (inode->vfs_inode.i_nlink > 1 &&
+	    inode->last_link_trans > last_committed) {
+		ret = -EMLINK;
+		goto end_trans;
+	}
+
 	while (1) {
 		if (!parent || d_really_is_negative(parent) || sb != parent->d_sb)
 			break;