Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux

Btrfs: fix race between adding and putting tree mod seq elements and nodes

There is a race between adding and removing elements to the tree mod log
list and rbtree that can lead to use-after-free problems.

Consider the following example that explains how/why the problems happens:

1) Task A has mod log element with sequence number 200. It currently is
the only element in the mod log list;

2) Task A calls btrfs_put_tree_mod_seq() because it no longer needs to
access the tree mod log. When it enters the function, it initializes
'min_seq' to (u64)-1. Then it acquires the lock 'tree_mod_seq_lock'
before checking if there are other elements in the mod seq list.
Since the list it empty, 'min_seq' remains set to (u64)-1. Then it
unlocks the lock 'tree_mod_seq_lock';

3) Before task A acquires the lock 'tree_mod_log_lock', task B adds
itself to the mod seq list through btrfs_get_tree_mod_seq() and gets a
sequence number of 201;

4) Some other task, name it task C, modifies a btree and because there
elements in the mod seq list, it adds a tree mod elem to the tree
mod log rbtree. That node added to the mod log rbtree is assigned
a sequence number of 202;

5) Task B, which is doing fiemap and resolving indirect back references,
calls btrfs get_old_root(), with 'time_seq' == 201, which in turn
calls tree_mod_log_search() - the search returns the mod log node
from the rbtree with sequence number 202, created by task C;

6) Task A now acquires the lock 'tree_mod_log_lock', starts iterating
the mod log rbtree and finds the node with sequence number 202. Since
202 is less than the previously computed 'min_seq', (u64)-1, it
removes the node and frees it;

7) Task B still has a pointer to the node with sequence number 202, and
it dereferences the pointer itself and through the call to
__tree_mod_log_rewind(), resulting in a use-after-free problem.

This issue can be triggered sporadically with the test case generic/561
from fstests, and it happens more frequently with a higher number of
duperemove processes. When it happens to me, it either freezes the VM or
it produces a trace like the following before crashing:

[ 1245.321140] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
[ 1245.321200] CPU: 1 PID: 26997 Comm: pool Not tainted 5.5.0-rc6-btrfs-next-52 #1
[ 1245.321235] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
[ 1245.321287] RIP: 0010:rb_next+0x16/0x50
[ 1245.321307] Code: ....
[ 1245.321372] RSP: 0018:ffffa151c4d039b0 EFLAGS: 00010202
[ 1245.321388] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8ae221363c80 RCX: 6b6b6b6b6b6b6b6b
[ 1245.321409] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8ae221363c80
[ 1245.321439] RBP: ffff8ae20fcc4688 R08: 0000000000000002 R09: 0000000000000000
[ 1245.321475] R10: ffff8ae20b120910 R11: 00000000243f8bb1 R12: 0000000000000038
[ 1245.321506] R13: ffff8ae221363c80 R14: 000000000000075f R15: ffff8ae223f762b8
[ 1245.321539] FS: 00007fdee1ec7700(0000) GS:ffff8ae236c80000(0000) knlGS:0000000000000000
[ 1245.321591] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1245.321614] CR2: 00007fded4030c48 CR3: 000000021da16003 CR4: 00000000003606e0
[ 1245.321642] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1245.321668] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1245.321706] Call Trace:
[ 1245.321798] __tree_mod_log_rewind+0xbf/0x280 [btrfs]
[ 1245.321841] btrfs_search_old_slot+0x105/0xd00 [btrfs]
[ 1245.321877] resolve_indirect_refs+0x1eb/0xc60 [btrfs]
[ 1245.321912] find_parent_nodes+0x3dc/0x11b0 [btrfs]
[ 1245.321947] btrfs_check_shared+0x115/0x1c0 [btrfs]
[ 1245.321980] ? extent_fiemap+0x59d/0x6d0 [btrfs]
[ 1245.322029] extent_fiemap+0x59d/0x6d0 [btrfs]
[ 1245.322066] do_vfs_ioctl+0x45a/0x750
[ 1245.322081] ksys_ioctl+0x70/0x80
[ 1245.322092] ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 1245.322113] __x64_sys_ioctl+0x16/0x20
[ 1245.322126] do_syscall_64+0x5c/0x280
[ 1245.322139] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 1245.322155] RIP: 0033:0x7fdee3942dd7
[ 1245.322177] Code: ....
[ 1245.322258] RSP: 002b:00007fdee1ec6c88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 1245.322294] RAX: ffffffffffffffda RBX: 00007fded40210d8 RCX: 00007fdee3942dd7
[ 1245.322314] RDX: 00007fded40210d8 RSI: 00000000c020660b RDI: 0000000000000004
[ 1245.322337] RBP: 0000562aa89e7510 R08: 0000000000000000 R09: 00007fdee1ec6d44
[ 1245.322369] R10: 0000000000000073 R11: 0000000000000246 R12: 00007fdee1ec6d48
[ 1245.322390] R13: 00007fdee1ec6d40 R14: 00007fded40210d0 R15: 00007fdee1ec6d50
[ 1245.322423] Modules linked in: ....
[ 1245.323443] ---[ end trace 01de1e9ec5dff3cd ]---

Fix this by ensuring that btrfs_put_tree_mod_seq() computes the minimum
sequence number and iterates the rbtree while holding the lock
'tree_mod_log_lock' in write mode. Also get rid of the 'tree_mod_seq_lock'
lock, since it is now redundant.

Fixes: bd989ba359f2ac ("Btrfs: add tree modification log functions")
Fixes: 097b8a7c9e48e2 ("Btrfs: join tree mod log code with the code holding back delayed refs")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

authored by

Filipe Manana and committed by
David Sterba
7227ff4d 4e19443d

+8 -16
+2 -6
fs/btrfs/ctree.c
··· 326 326 struct seq_list *elem) 327 327 { 328 328 write_lock(&fs_info->tree_mod_log_lock); 329 - spin_lock(&fs_info->tree_mod_seq_lock); 330 329 if (!elem->seq) { 331 330 elem->seq = btrfs_inc_tree_mod_seq(fs_info); 332 331 list_add_tail(&elem->list, &fs_info->tree_mod_seq_list); 333 332 } 334 - spin_unlock(&fs_info->tree_mod_seq_lock); 335 333 write_unlock(&fs_info->tree_mod_log_lock); 336 334 337 335 return elem->seq; ··· 349 351 if (!seq_putting) 350 352 return; 351 353 352 - spin_lock(&fs_info->tree_mod_seq_lock); 354 + write_lock(&fs_info->tree_mod_log_lock); 353 355 list_del(&elem->list); 354 356 elem->seq = 0; 355 357 ··· 360 362 * blocker with lower sequence number exists, we 361 363 * cannot remove anything from the log 362 364 */ 363 - spin_unlock(&fs_info->tree_mod_seq_lock); 365 + write_unlock(&fs_info->tree_mod_log_lock); 364 366 return; 365 367 } 366 368 min_seq = cur_elem->seq; 367 369 } 368 370 } 369 - spin_unlock(&fs_info->tree_mod_seq_lock); 370 371 371 372 /* 372 373 * anything that's lower than the lowest existing (read: blocked) 373 374 * sequence number can be removed from the tree. 374 375 */ 375 - write_lock(&fs_info->tree_mod_log_lock); 376 376 tm_root = &fs_info->tree_mod_log; 377 377 for (node = rb_first(tm_root); node; node = next) { 378 378 next = rb_next(node);
+2 -4
fs/btrfs/ctree.h
··· 714 714 atomic_t nr_delayed_iputs; 715 715 wait_queue_head_t delayed_iputs_wait; 716 716 717 - /* this protects tree_mod_seq_list */ 718 - spinlock_t tree_mod_seq_lock; 719 717 atomic64_t tree_mod_seq; 720 - struct list_head tree_mod_seq_list; 721 718 722 - /* this protects tree_mod_log */ 719 + /* this protects tree_mod_log and tree_mod_seq_list */ 723 720 rwlock_t tree_mod_log_lock; 724 721 struct rb_root tree_mod_log; 722 + struct list_head tree_mod_seq_list; 725 723 726 724 atomic_t async_delalloc_pages; 727 725
+4 -4
fs/btrfs/delayed-ref.c
··· 492 492 if (head->is_data) 493 493 return; 494 494 495 - spin_lock(&fs_info->tree_mod_seq_lock); 495 + read_lock(&fs_info->tree_mod_log_lock); 496 496 if (!list_empty(&fs_info->tree_mod_seq_list)) { 497 497 struct seq_list *elem; 498 498 ··· 500 500 struct seq_list, list); 501 501 seq = elem->seq; 502 502 } 503 - spin_unlock(&fs_info->tree_mod_seq_lock); 503 + read_unlock(&fs_info->tree_mod_log_lock); 504 504 505 505 again: 506 506 for (node = rb_first_cached(&head->ref_tree); node; ··· 518 518 struct seq_list *elem; 519 519 int ret = 0; 520 520 521 - spin_lock(&fs_info->tree_mod_seq_lock); 521 + read_lock(&fs_info->tree_mod_log_lock); 522 522 if (!list_empty(&fs_info->tree_mod_seq_list)) { 523 523 elem = list_first_entry(&fs_info->tree_mod_seq_list, 524 524 struct seq_list, list); ··· 531 531 } 532 532 } 533 533 534 - spin_unlock(&fs_info->tree_mod_seq_lock); 534 + read_unlock(&fs_info->tree_mod_log_lock); 535 535 return ret; 536 536 } 537 537
-1
fs/btrfs/disk-io.c
··· 2697 2697 spin_lock_init(&fs_info->fs_roots_radix_lock); 2698 2698 spin_lock_init(&fs_info->delayed_iput_lock); 2699 2699 spin_lock_init(&fs_info->defrag_inodes_lock); 2700 - spin_lock_init(&fs_info->tree_mod_seq_lock); 2701 2700 spin_lock_init(&fs_info->super_lock); 2702 2701 spin_lock_init(&fs_info->buffer_lock); 2703 2702 spin_lock_init(&fs_info->unused_bgs_lock);
-1
fs/btrfs/tests/btrfs-tests.c
··· 142 142 spin_lock_init(&fs_info->qgroup_lock); 143 143 spin_lock_init(&fs_info->super_lock); 144 144 spin_lock_init(&fs_info->fs_roots_radix_lock); 145 - spin_lock_init(&fs_info->tree_mod_seq_lock); 146 145 mutex_init(&fs_info->qgroup_ioctl_lock); 147 146 mutex_init(&fs_info->qgroup_rescan_lock); 148 147 rwlock_init(&fs_info->tree_mod_log_lock);