Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux

btrfs: invalidate pages instead of truncate after reflinking

Qu reported that generic/164 often fails because the read operations get
zeroes when it expects to either get all bytes with a value of 0x61 or
0x62. The issue stems from truncating the pages from the page cache
instead of invalidating, as truncating can zero page contents. This
zeroing is not just in case the range is not page sized (as it's commented
in truncate_inode_pages_range()) but also in case we are using large
folios, they need to be split and the splitting fails. Stealing Qu's
comment in the thread linked below:

"We can have the following case:

0 4K 8K 12K 16K
| | | | |
|<---- Extent A ----->|<----- Extent B ------>|

The page size is still 4K, but the folio we got is 16K.

Then if we remap the range for [8K, 16K), then
truncate_inode_pages_range() will get the large folio 0 sized 16K,
then call truncate_inode_partial_folio().

Which later calls folio_zero_range() for the [8K, 16K) range first,
then tries to split the folio into smaller ones to properly drop them
from the cache.

But if splitting failed (e.g. racing with other operations holding the
filemap lock), the partially zeroed large folio will be kept, resulting
the range [8K, 16K) being zeroed meanwhile the folio is still a 16K
sized large one."

So instead of truncating, invalidate the page cache range with a call to
filemap_invalidate_inode(), which besides not doing any zeroing also
ensures that while it's invalidating folios, no new folios are added.

This helps ensure that buffered reads that happen while a reflink
operation is in progress always get either the whole old data (the one
before the reflink) or the whole new data, which is what generic/164
expects.

Link: https://lore.kernel.org/linux-btrfs/7fb9b44f-9680-4c22-a47f-6648cb109ddf@suse.com/
Reported-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

authored by

Filipe Manana and committed by
David Sterba
88268077 64dd1caf

+13 -10
+13 -10
fs/btrfs/reflink.c
··· 705 705 struct inode *src = file_inode(file_src); 706 706 struct btrfs_fs_info *fs_info = inode_to_fs_info(inode); 707 707 int ret; 708 - int wb_ret; 709 708 u64 len = olen; 710 709 u64 bs = fs_info->sectorsize; 711 710 u64 end; ··· 749 750 btrfs_lock_extent(&BTRFS_I(inode)->io_tree, destoff, end, &cached_state); 750 751 ret = btrfs_clone(src, inode, off, olen, len, destoff, 0); 751 752 btrfs_unlock_extent(&BTRFS_I(inode)->io_tree, destoff, end, &cached_state); 753 + if (ret < 0) 754 + return ret; 752 755 753 756 /* 754 757 * We may have copied an inline extent into a page of the destination 755 - * range, so wait for writeback to complete before truncating pages 758 + * range, so wait for writeback to complete before invalidating pages 756 759 * from the page cache. This is a rare case. 757 760 */ 758 - wb_ret = btrfs_wait_ordered_range(BTRFS_I(inode), destoff, len); 759 - ret = ret ? ret : wb_ret; 761 + ret = btrfs_wait_ordered_range(BTRFS_I(inode), destoff, len); 762 + if (ret < 0) 763 + return ret; 764 + 760 765 /* 761 - * Truncate page cache pages so that future reads will see the cloned 762 - * data immediately and not the previous data. 766 + * Invalidate page cache so that future reads will see the cloned data 767 + * immediately and not the previous data. 763 768 */ 764 - truncate_inode_pages_range(&inode->i_data, 765 - round_down(destoff, PAGE_SIZE), 766 - round_up(destoff + len, PAGE_SIZE) - 1); 769 + ret = filemap_invalidate_inode(inode, false, destoff, end); 770 + if (ret < 0) 771 + return ret; 767 772 768 773 btrfs_btree_balance_dirty(fs_info); 769 774 770 - return ret; 775 + return 0; 771 776 } 772 777 773 778 static int btrfs_remap_file_range_prep(struct file *file_in, loff_t pos_in,