Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux

xfs: properly serialise fallocate against AIO+DIO

AIO+DIO can extend the file size on IO completion, and it holds
no inode locks while the IO is in flight. Therefore, a race
condition exists in file size updates if we do something like this:

aio-thread fallocate-thread

lock inode
submit IO beyond inode->i_size
unlock inode
.....
lock inode
break layouts
if (off + len > inode->i_size)
new_size = off + len
.....
inode_dio_wait()
<blocks>
.....
completes
inode->i_size updated
inode_dio_done()
....
<wakes>
<does stuff no long beyond EOF>
if (new_size)
xfs_vn_setattr(inode, new_size)


Yup, that attempt to extend the file size in the fallocate code
turns into a truncate - it removes the whatever the aio write
allocated and put to disk, and reduced the inode size back down to
where the fallocate operation ends.

Fundamentally, xfs_file_fallocate() not compatible with racing
AIO+DIO completions, so we need to move the inode_dio_wait() call
up to where the lock the inode and break the layouts.

Secondly, storing the inode size and then using it unchecked without
holding the ILOCK is not safe; we can only do such a thing if we've
locked out and drained all IO and other modification operations,
which we don't do initially in xfs_file_fallocate.

It should be noted that some of the fallocate operations are
compound operations - they are made up of multiple manipulations
that may zero data, and so we may need to flush and invalidate the
file multiple times during an operation. However, we only need to
lock out IO and other space manipulation operations once, as that
lockout is maintained until the entire fallocate operation has been
completed.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

authored by

Dave Chinner and committed by
Darrick J. Wong
249bd908 21f55993

+32 -7
+1 -7
fs/xfs/xfs_bmap_util.c
··· 930 930 goto out_unlock; 931 931 } 932 932 933 + /* Caller must first wait for the completion of any pending DIOs if required. */ 933 934 int 934 935 xfs_flush_unmap_range( 935 936 struct xfs_inode *ip, ··· 941 940 struct inode *inode = VFS_I(ip); 942 941 xfs_off_t rounding, start, end; 943 942 int error; 944 - 945 - /* wait for the completion of any pending DIOs */ 946 - inode_dio_wait(inode); 947 943 948 944 rounding = max_t(xfs_off_t, 1 << mp->m_sb.sb_blocklog, PAGE_SIZE); 949 945 start = round_down(offset, rounding); ··· 972 974 973 975 if (len <= 0) /* if nothing being freed */ 974 976 return 0; 975 - 976 - error = xfs_flush_unmap_range(ip, offset, len); 977 - if (error) 978 - return error; 979 977 980 978 startoffset_fsb = XFS_B_TO_FSB(mp, offset); 981 979 endoffset_fsb = XFS_B_TO_FSBT(mp, offset + len);
+30
fs/xfs/xfs_file.c
··· 817 817 if (error) 818 818 goto out_unlock; 819 819 820 + /* 821 + * Must wait for all AIO to complete before we continue as AIO can 822 + * change the file size on completion without holding any locks we 823 + * currently hold. We must do this first because AIO can update both 824 + * the on disk and in memory inode sizes, and the operations that follow 825 + * require the in-memory size to be fully up-to-date. 826 + */ 827 + inode_dio_wait(inode); 828 + 829 + /* 830 + * Now AIO and DIO has drained we flush and (if necessary) invalidate 831 + * the cached range over the first operation we are about to run. 832 + * 833 + * We care about zero and collapse here because they both run a hole 834 + * punch over the range first. Because that can zero data, and the range 835 + * of invalidation for the shift operations is much larger, we still do 836 + * the required flush for collapse in xfs_prepare_shift(). 837 + * 838 + * Insert has the same range requirements as collapse, and we extend the 839 + * file first which can zero data. Hence insert has the same 840 + * flush/invalidate requirements as collapse and so they are both 841 + * handled at the right time by xfs_prepare_shift(). 842 + */ 843 + if (mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE | 844 + FALLOC_FL_COLLAPSE_RANGE)) { 845 + error = xfs_flush_unmap_range(ip, offset, len); 846 + if (error) 847 + goto out_unlock; 848 + } 849 + 820 850 if (mode & FALLOC_FL_PUNCH_HOLE) { 821 851 error = xfs_free_file_space(ip, offset, len); 822 852 if (error)
+1
fs/xfs/xfs_ioctl.c
··· 623 623 error = xfs_break_layouts(inode, &iolock, BREAK_UNMAP); 624 624 if (error) 625 625 goto out_unlock; 626 + inode_dio_wait(inode); 626 627 627 628 switch (bf->l_whence) { 628 629 case 0: /*SEEK_SET*/