Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux

xfs: serialise unaligned direct IOs

When two concurrent unaligned, non-overlapping direct IOs are issued
to the same block, the direct Io layer will race to zero the block.
The result is that one of the concurrent IOs will overwrite data
written by the other IO with zeros. This is demonstrated by the
xfsqa test 240.

To avoid this problem, serialise all unaligned direct IOs to an
inode with a big hammer. We need a big hammer approach as we need to
serialise AIO as well, so we can't just block writes on locks.
Hence, the big hammer is calling xfs_ioend_wait() while holding out
other unaligned direct IOs from starting.

We don't bother trying to serialised aligned vs unaligned IOs as
they are overlapping IO and the result of concurrent overlapping IOs
is undefined - the result of either IO is a valid result so we let
them race. Hence we only penalise unaligned IO, which already has a
major overhead compared to aligned IO so this isn't a major problem.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>

authored by

Dave Chinner and committed by
Dave Chinner
eda77982 4d8d1581

+28 -10
+28 -10
fs/xfs/linux-2.6/xfs_file.c
··· 684 684 * xfs_file_dio_aio_write - handle direct IO writes 685 685 * 686 686 * Lock the inode appropriately to prepare for and issue a direct IO write. 687 - * By spearating it from the buffered write path we remove all the tricky to 687 + * By separating it from the buffered write path we remove all the tricky to 688 688 * follow locking changes and looping. 689 + * 690 + * If there are cached pages or we're extending the file, we need IOLOCK_EXCL 691 + * until we're sure the bytes at the new EOF have been zeroed and/or the cached 692 + * pages are flushed out. 693 + * 694 + * In most cases the direct IO writes will be done holding IOLOCK_SHARED 695 + * allowing them to be done in parallel with reads and other direct IO writes. 696 + * However, if the IO is not aligned to filesystem blocks, the direct IO layer 697 + * needs to do sub-block zeroing and that requires serialisation against other 698 + * direct IOs to the same block. In this case we need to serialise the 699 + * submission of the unaligned IOs so that we don't get racing block zeroing in 700 + * the dio layer. To avoid the problem with aio, we also need to wait for 701 + * outstanding IOs to complete so that unwritten extent conversion is completed 702 + * before we try to map the overlapping block. This is currently implemented by 703 + * hitting it with a big hammer (i.e. xfs_ioend_wait()). 689 704 * 690 705 * Returns with locks held indicated by @iolock and errors indicated by 691 706 * negative return values. ··· 721 706 struct xfs_mount *mp = ip->i_mount; 722 707 ssize_t ret = 0; 723 708 size_t count = ocount; 709 + int unaligned_io = 0; 724 710 struct xfs_buftarg *target = XFS_IS_REALTIME_INODE(ip) ? 725 711 mp->m_rtdev_targp : mp->m_ddev_targp; 726 712 ··· 729 713 if ((pos & target->bt_smask) || (count & target->bt_smask)) 730 714 return -XFS_ERROR(EINVAL); 731 715 732 - /* 733 - * For direct I/O, if there are cached pages or we're extending 734 - * the file, we need IOLOCK_EXCL until we're sure the bytes at 735 - * the new EOF have been zeroed and/or the cached pages are 736 - * flushed out. 737 - */ 738 - if (mapping->nrpages || pos > ip->i_size) 716 + if ((pos & mp->m_blockmask) || ((pos + count) & mp->m_blockmask)) 717 + unaligned_io = 1; 718 + 719 + if (unaligned_io || mapping->nrpages || pos > ip->i_size) 739 720 *iolock = XFS_IOLOCK_EXCL; 740 721 else 741 722 *iolock = XFS_IOLOCK_SHARED; ··· 750 737 return ret; 751 738 } 752 739 753 - if (*iolock == XFS_IOLOCK_EXCL) { 754 - /* demote the lock now the cached pages are gone */ 740 + /* 741 + * If we are doing unaligned IO, wait for all other IO to drain, 742 + * otherwise demote the lock if we had to flush cached pages 743 + */ 744 + if (unaligned_io) 745 + xfs_ioend_wait(ip); 746 + else if (*iolock == XFS_IOLOCK_EXCL) { 755 747 xfs_rw_ilock_demote(ip, XFS_IOLOCK_EXCL); 756 748 *iolock = XFS_IOLOCK_SHARED; 757 749 }