Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux

xfs: ensure truncate forces zeroed blocks to disk

A new fsync vs power fail test in xfstests indicated that XFS can
have unreliable data consistency when doing extending truncates that
require block zeroing. The blocks beyond EOF get zeroed in memory,
but we never force those changes to disk before we run the
transaction that extends the file size and exposes those blocks to
userspace. This can result in the blocks not being correctly zeroed
after a crash.

Because in-memory behaviour is correct, tools like fsx don't pick up
any coherency problems - it's not until the filesystem is shutdown
or the system crashes after writing the truncate transaction to the
journal but before the zeroed data in the page cache is flushed that
the issue is exposed.

Fix this by also flushing the dirty data in memory region between
the old size and new size when we've found blocks that need zeroing
in the truncate process.

Reported-by: Liu Bo <bo.li.liu@oracle.com>
cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

authored by

Dave Chinner and committed by
Dave Chinner
5885ebda dfcc70a8

+29 -30
+10 -4
fs/xfs/xfs_file.c
··· 397 397 xfs_zero_last_block( 398 398 struct xfs_inode *ip, 399 399 xfs_fsize_t offset, 400 - xfs_fsize_t isize) 400 + xfs_fsize_t isize, 401 + bool *did_zeroing) 401 402 { 402 403 struct xfs_mount *mp = ip->i_mount; 403 404 xfs_fileoff_t last_fsb = XFS_B_TO_FSBT(mp, isize); ··· 426 425 zero_len = mp->m_sb.sb_blocksize - zero_offset; 427 426 if (isize + zero_len > offset) 428 427 zero_len = offset - isize; 428 + *did_zeroing = true; 429 429 return xfs_iozero(ip, isize, zero_len); 430 430 } 431 431 ··· 445 443 xfs_zero_eof( 446 444 struct xfs_inode *ip, 447 445 xfs_off_t offset, /* starting I/O offset */ 448 - xfs_fsize_t isize) /* current inode size */ 446 + xfs_fsize_t isize, /* current inode size */ 447 + bool *did_zeroing) 449 448 { 450 449 struct xfs_mount *mp = ip->i_mount; 451 450 xfs_fileoff_t start_zero_fsb; ··· 468 465 * We only zero a part of that block so it is handled specially. 469 466 */ 470 467 if (XFS_B_FSB_OFFSET(mp, isize) != 0) { 471 - error = xfs_zero_last_block(ip, offset, isize); 468 + error = xfs_zero_last_block(ip, offset, isize, did_zeroing); 472 469 if (error) 473 470 return error; 474 471 } ··· 528 525 if (error) 529 526 return error; 530 527 528 + *did_zeroing = true; 531 529 start_zero_fsb = imap.br_startoff + imap.br_blockcount; 532 530 ASSERT(start_zero_fsb <= (end_zero_fsb + 1)); 533 531 } ··· 571 567 * having to redo all checks before. 572 568 */ 573 569 if (*pos > i_size_read(inode)) { 570 + bool zero = false; 571 + 574 572 if (*iolock == XFS_IOLOCK_SHARED) { 575 573 xfs_rw_iunlock(ip, *iolock); 576 574 *iolock = XFS_IOLOCK_EXCL; 577 575 xfs_rw_ilock(ip, *iolock); 578 576 goto restart; 579 577 } 580 - error = xfs_zero_eof(ip, *pos, i_size_read(inode)); 578 + error = xfs_zero_eof(ip, *pos, i_size_read(inode), &zero); 581 579 if (error) 582 580 return error; 583 581 }
+5 -4
fs/xfs/xfs_inode.h
··· 384 384 XFS_PREALLOC_INVISIBLE = (1 << 4), 385 385 }; 386 386 387 - int xfs_update_prealloc_flags(struct xfs_inode *, 388 - enum xfs_prealloc_flags); 389 - int xfs_zero_eof(struct xfs_inode *, xfs_off_t, xfs_fsize_t); 390 - int xfs_iozero(struct xfs_inode *, loff_t, size_t); 387 + int xfs_update_prealloc_flags(struct xfs_inode *ip, 388 + enum xfs_prealloc_flags flags); 389 + int xfs_zero_eof(struct xfs_inode *ip, xfs_off_t offset, 390 + xfs_fsize_t isize, bool *did_zeroing); 391 + int xfs_iozero(struct xfs_inode *ip, loff_t pos, size_t count); 391 392 392 393 393 394 #define IHOLD(ip) \
+14 -22
fs/xfs/xfs_iops.c
··· 751 751 int error; 752 752 uint lock_flags = 0; 753 753 uint commit_flags = 0; 754 + bool did_zeroing = false; 754 755 755 756 trace_xfs_setattr(ip); 756 757 ··· 795 794 return error; 796 795 797 796 /* 798 - * Now we can make the changes. Before we join the inode to the 799 - * transaction, take care of the part of the truncation that must be 800 - * done without the inode lock. This needs to be done before joining 801 - * the inode to the transaction, because the inode cannot be unlocked 802 - * once it is a part of the transaction. 797 + * File data changes must be complete before we start the transaction to 798 + * modify the inode. This needs to be done before joining the inode to 799 + * the transaction because the inode cannot be unlocked once it is a 800 + * part of the transaction. 801 + * 802 + * Start with zeroing any data block beyond EOF that we may expose on 803 + * file extension. 803 804 */ 804 805 if (newsize > oldsize) { 805 - /* 806 - * Do the first part of growing a file: zero any data in the 807 - * last block that is beyond the old EOF. We need to do this 808 - * before the inode is joined to the transaction to modify 809 - * i_size. 810 - */ 811 - error = xfs_zero_eof(ip, newsize, oldsize); 806 + error = xfs_zero_eof(ip, newsize, oldsize, &did_zeroing); 812 807 if (error) 813 808 return error; 814 809 } ··· 814 817 * any previous writes that are beyond the on disk EOF and the new 815 818 * EOF that have not been written out need to be written here. If we 816 819 * do not write the data out, we expose ourselves to the null files 817 - * problem. 818 - * 819 - * Only flush from the on disk size to the smaller of the in memory 820 - * file size or the new size as that's the range we really care about 821 - * here and prevents waiting for other data not within the range we 822 - * care about here. 820 + * problem. Note that this includes any block zeroing we did above; 821 + * otherwise those blocks may not be zeroed after a crash. 823 822 */ 824 - if (oldsize != ip->i_d.di_size && newsize > ip->i_d.di_size) { 823 + if (newsize > ip->i_d.di_size && 824 + (oldsize != ip->i_d.di_size || did_zeroing)) { 825 825 error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping, 826 826 ip->i_d.di_size, newsize); 827 827 if (error) 828 828 return error; 829 829 } 830 830 831 - /* 832 - * Wait for all direct I/O to complete. 833 - */ 831 + /* Now wait for all direct I/O to complete. */ 834 832 inode_dio_wait(inode); 835 833 836 834 /*