Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

btrfs: fix space cache corruption and potential double allocations

When testing space_cache v2 on a large set of machines, we encountered a
few symptoms:

1. "unable to add free space :-17" (EEXIST) errors.
2. Missing free space info items, sometimes caught with a "missing free
space info for X" error.
3. Double-accounted space: ranges that were allocated in the extent tree
and also marked as free in the free space tree, ranges that were
marked as allocated twice in the extent tree, or ranges that were
marked as free twice in the free space tree. If the latter made it
onto disk, the next reboot would hit the BUG_ON() in
add_new_free_space().
4. On some hosts with no on-disk corruption or error messages, the
in-memory space cache (dumped with drgn) disagreed with the free
space tree.

All of these symptoms have the same underlying cause: a race between
caching the free space for a block group and returning free space to the
in-memory space cache for pinned extents causes us to double-add a free
range to the space cache. This race exists when free space is cached
from the free space tree (space_cache=v2) or the extent tree
(nospace_cache, or space_cache=v1 if the cache needs to be regenerated).
struct btrfs_block_group::last_byte_to_unpin and struct
btrfs_block_group::progress are supposed to protect against this race,
but commit d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when
waiting for a transaction commit") subtly broke this by allowing
multiple transactions to be unpinning extents at the same time.

Specifically, the race is as follows:

1. An extent is deleted from an uncached block group in transaction A.
2. btrfs_commit_transaction() is called for transaction A.
3. btrfs_run_delayed_refs() -> __btrfs_free_extent() runs the delayed
ref for the deleted extent.
4. __btrfs_free_extent() -> do_free_extent_accounting() ->
add_to_free_space_tree() adds the deleted extent back to the free
space tree.
5. do_free_extent_accounting() -> btrfs_update_block_group() ->
btrfs_cache_block_group() queues up the block group to get cached.
block_group->progress is set to block_group->start.
6. btrfs_commit_transaction() for transaction A calls
switch_commit_roots(). It sets block_group->last_byte_to_unpin to
block_group->progress, which is block_group->start because the block
group hasn't been cached yet.
7. The caching thread gets to our block group. Since the commit roots
were already switched, load_free_space_tree() sees the deleted extent
as free and adds it to the space cache. It finishes caching and sets
block_group->progress to U64_MAX.
8. btrfs_commit_transaction() advances transaction A to
TRANS_STATE_SUPER_COMMITTED.
9. fsync calls btrfs_commit_transaction() for transaction B. Since
transaction A is already in TRANS_STATE_SUPER_COMMITTED and the
commit is for fsync, it advances.
10. btrfs_commit_transaction() for transaction B calls
switch_commit_roots(). This time, the block group has already been
cached, so it sets block_group->last_byte_to_unpin to U64_MAX.
11. btrfs_commit_transaction() for transaction A calls
btrfs_finish_extent_commit(), which calls unpin_extent_range() for
the deleted extent. It sees last_byte_to_unpin set to U64_MAX (by
transaction B!), so it adds the deleted extent to the space cache
again!

This explains all of our symptoms above:

* If the sequence of events is exactly as described above, when the free
space is re-added in step 11, it will fail with EEXIST.
* If another thread reallocates the deleted extent in between steps 7
and 11, then step 11 will silently re-add that space to the space
cache as free even though it is actually allocated. Then, if that
space is allocated *again*, the free space tree will be corrupted
(namely, the wrong item will be deleted).
* If we don't catch this free space tree corruption, it will continue
to get worse as extents are deleted and reallocated.

The v1 space_cache is synchronously loaded when an extent is deleted
(btrfs_update_block_group() with alloc=0 calls btrfs_cache_block_group()
with load_cache_only=1), so it is not normally affected by this bug.
However, as noted above, if we fail to load the space cache, we will
fall back to caching from the extent tree and may hit this bug.

The easiest fix for this race is to also make caching from the free
space tree or extent tree synchronous. Josef tested this and found no
performance regressions.

A few extra changes fall out of this fix. Namely, it does the
following, with change 2 being the crucial one:

1. Factor btrfs_caching_ctl_wait_done() out of
btrfs_wait_block_group_cache_done() to allow waiting on a caching_ctl
that we already hold a reference to.
2. Change the call in btrfs_cache_block_group() of
btrfs_wait_space_cache_v1_finished() to
btrfs_caching_ctl_wait_done(), which makes us wait regardless of the
space_cache option.
3. Delete the now unused btrfs_wait_space_cache_v1_finished() and
space_cache_v1_done().
4. Change btrfs_cache_block_group()'s `int load_cache_only` parameter to
`bool wait` to more accurately describe its new meaning.
5. Change a few callers which had a separate call to
btrfs_wait_block_group_cache_done() to use wait = true instead.
6. Make btrfs_wait_block_group_cache_done() static now that it's not
used outside of block-group.c anymore.

Fixes: d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when waiting for a transaction commit")
CC: stable@vger.kernel.org # 5.12+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Authored by Omar Sandoval, committed by David Sterba
ced8ecf0 79d3d1d1

4 files changed: +22 -60
fs/btrfs/block-group.c | +15 -32

--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -440,37 +440,24 @@
 	btrfs_put_caching_control(caching_ctl);
 }
 
-int btrfs_wait_block_group_cache_done(struct btrfs_block_group *cache)
+static int btrfs_caching_ctl_wait_done(struct btrfs_block_group *cache,
+				       struct btrfs_caching_control *caching_ctl)
+{
+	wait_event(caching_ctl->wait, btrfs_block_group_done(cache));
+	return cache->cached == BTRFS_CACHE_ERROR ? -EIO : 0;
+}
+
+static int btrfs_wait_block_group_cache_done(struct btrfs_block_group *cache)
 {
 	struct btrfs_caching_control *caching_ctl;
-	int ret = 0;
+	int ret;
 
 	caching_ctl = btrfs_get_caching_control(cache);
 	if (!caching_ctl)
 		return (cache->cached == BTRFS_CACHE_ERROR) ? -EIO : 0;
-
-	wait_event(caching_ctl->wait, btrfs_block_group_done(cache));
-	if (cache->cached == BTRFS_CACHE_ERROR)
-		ret = -EIO;
+	ret = btrfs_caching_ctl_wait_done(cache, caching_ctl);
 	btrfs_put_caching_control(caching_ctl);
 	return ret;
-}
-
-static bool space_cache_v1_done(struct btrfs_block_group *cache)
-{
-	bool ret;
-
-	spin_lock(&cache->lock);
-	ret = cache->cached != BTRFS_CACHE_FAST;
-	spin_unlock(&cache->lock);
-
-	return ret;
-}
-
-void btrfs_wait_space_cache_v1_finished(struct btrfs_block_group *cache,
-					struct btrfs_caching_control *caching_ctl)
-{
-	wait_event(caching_ctl->wait, space_cache_v1_done(cache));
 }
 
 #ifdef CONFIG_BTRFS_DEBUG
@@ -737,9 +750,8 @@
 	btrfs_put_block_group(block_group);
 }
 
-int btrfs_cache_block_group(struct btrfs_block_group *cache, int load_cache_only)
+int btrfs_cache_block_group(struct btrfs_block_group *cache, bool wait)
 {
-	DEFINE_WAIT(wait);
 	struct btrfs_fs_info *fs_info = cache->fs_info;
 	struct btrfs_caching_control *caching_ctl = NULL;
 	int ret = 0;
@@ -771,9 +785,6 @@
 	}
 	WARN_ON(cache->caching_ctl);
 	cache->caching_ctl = caching_ctl;
-	if (btrfs_test_opt(fs_info, SPACE_CACHE))
-		cache->cached = BTRFS_CACHE_FAST;
-	else
-		cache->cached = BTRFS_CACHE_STARTED;
+	cache->cached = BTRFS_CACHE_STARTED;
 	cache->has_caching_ctl = 1;
 	spin_unlock(&cache->lock);
@@ -784,8 +801,8 @@
 
 	btrfs_queue_work(fs_info->caching_workers, &caching_ctl->work);
 out:
-	if (load_cache_only && caching_ctl)
-		btrfs_wait_space_cache_v1_finished(cache, caching_ctl);
+	if (wait && caching_ctl)
+		ret = btrfs_caching_ctl_wait_done(cache, caching_ctl);
 	if (caching_ctl)
 		btrfs_put_caching_control(caching_ctl);
@@ -3295,7 +3312,7 @@
 	 * space back to the block group, otherwise we will leak space.
 	 */
 	if (!alloc && !btrfs_block_group_done(cache))
-		btrfs_cache_block_group(cache, 1);
+		btrfs_cache_block_group(cache, true);
 
 	byte_in_group = bytenr - cache->start;
 	WARN_ON(byte_in_group > cache->length);
fs/btrfs/block-group.h | +1 -3

--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -263,9 +263,7 @@
 void btrfs_wait_nocow_writers(struct btrfs_block_group *bg);
 void btrfs_wait_block_group_cache_progress(struct btrfs_block_group *cache,
 					   u64 num_bytes);
-int btrfs_wait_block_group_cache_done(struct btrfs_block_group *cache);
-int btrfs_cache_block_group(struct btrfs_block_group *cache,
-			    int load_cache_only);
+int btrfs_cache_block_group(struct btrfs_block_group *cache, bool wait);
 void btrfs_put_caching_control(struct btrfs_caching_control *ctl);
 struct btrfs_caching_control *btrfs_get_caching_control(
 		struct btrfs_block_group *cache);
fs/btrfs/ctree.h | -1

--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -505,7 +505,6 @@
 enum btrfs_caching_type {
 	BTRFS_CACHE_NO,
 	BTRFS_CACHE_STARTED,
-	BTRFS_CACHE_FAST,
 	BTRFS_CACHE_FINISHED,
 	BTRFS_CACHE_ERROR,
 };
fs/btrfs/extent-tree.c | +6 -24

--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2551,16 +2551,9 @@
 		return -EINVAL;
 
 	/*
-	 * pull in the free space cache (if any) so that our pin
-	 * removes the free space from the cache. We have load_only set
-	 * to one because the slow code to read in the free extents does check
-	 * the pinned extents.
+	 * Fully cache the free space first so that our pin removes the free space
+	 * from the cache.
 	 */
-	btrfs_cache_block_group(cache, 1);
-	/*
-	 * Make sure we wait until the cache is completely built in case it is
-	 * missing or is invalid and therefore needs to be rebuilt.
-	 */
-	ret = btrfs_wait_block_group_cache_done(cache);
+	ret = btrfs_cache_block_group(cache, true);
 	if (ret)
 		goto out;
@@ -2577,11 +2584,6 @@
 	if (!block_group)
 		return -EINVAL;
 
-	btrfs_cache_block_group(block_group, 1);
-	/*
-	 * Make sure we wait until the cache is completely built in case it is
-	 * missing or is invalid and therefore needs to be rebuilt.
-	 */
-	ret = btrfs_wait_block_group_cache_done(block_group);
+	ret = btrfs_cache_block_group(block_group, true);
 	if (ret)
 		goto out;
@@ -4387,7 +4399,7 @@
 	ffe_ctl->cached = btrfs_block_group_done(block_group);
 	if (unlikely(!ffe_ctl->cached)) {
 		ffe_ctl->have_caching_bg = true;
-		ret = btrfs_cache_block_group(block_group, 0);
+		ret = btrfs_cache_block_group(block_group, false);
 
 		/*
 		 * If we get ENOMEM here or something else we want to
@@ -6157,13 +6169,7 @@
 
 	if (end - start >= range->minlen) {
 		if (!btrfs_block_group_done(cache)) {
-			ret = btrfs_cache_block_group(cache, 0);
-			if (ret) {
-				bg_failed++;
-				bg_ret = ret;
-				continue;
-			}
-			ret = btrfs_wait_block_group_cache_done(cache);
+			ret = btrfs_cache_block_group(cache, true);
 			if (ret) {
 				bg_failed++;
 				bg_ret = ret;