Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux
1
fork

Configure Feed

Select the types of activity you want to include in your feed.

Btrfs: fix missing delayed iputs on unmount

There's a race between close_ctree() and cleaner_kthread().
close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
sees it set, but this is racy; the cleaner might have already checked
the bit and could be cleaning stuff. In particular, if it deletes unused
block groups, it will create delayed iputs for the free space cache
inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
longer running delayed iputs after a commit. Therefore, if the cleaner
creates more delayed iputs after delayed iputs are run in
btrfs_commit_super(), we will leak inodes on unmount and get a busy
inode crash from the VFS.

Fix it by parking the cleaner before we actually close anything. Then,
any remaining delayed iputs will always be handled in
btrfs_commit_super(). This also ensures that the commit in close_ctree()
is really the last commit, so we can get rid of the commit in
cleaner_kthread().

The fstest/generic/475 followed by 476 can trigger a crash that
manifests as a slab corruption caused by accessing the freed kthread
structure by a wake up function. Sample trace:

[ 5657.077612] BUG: unable to handle kernel NULL pointer dereference at 00000000000000cc
[ 5657.079432] PGD 1c57a067 P4D 1c57a067 PUD da10067 PMD 0
[ 5657.080661] Oops: 0000 [#1] PREEMPT SMP
[ 5657.081592] CPU: 1 PID: 5157 Comm: fsstress Tainted: G W 4.19.0-rc8-default+ #323
[ 5657.083703] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014
[ 5657.086577] RIP: 0010:shrink_page_list+0x2f9/0xe90
[ 5657.091937] RSP: 0018:ffffb5c745c8f728 EFLAGS: 00010287
[ 5657.092953] RAX: 0000000000000074 RBX: ffffb5c745c8f830 RCX: 0000000000000000
[ 5657.094590] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff9a8747fdf3d0
[ 5657.095987] RBP: ffffb5c745c8f9e0 R08: 0000000000000000 R09: 0000000000000000
[ 5657.097159] R10: ffff9a8747fdf5e8 R11: 0000000000000000 R12: ffffb5c745c8f788
[ 5657.098513] R13: ffff9a877f6ff2c0 R14: ffff9a877f6ff2c8 R15: dead000000000200
[ 5657.099689] FS: 00007f948d853b80(0000) GS:ffff9a877d600000(0000) knlGS:0000000000000000
[ 5657.101032] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5657.101953] CR2: 00000000000000cc CR3: 00000000684bd000 CR4: 00000000000006e0
[ 5657.103159] Call Trace:
[ 5657.103776] shrink_inactive_list+0x194/0x410
[ 5657.104671] shrink_node_memcg.constprop.84+0x39a/0x6a0
[ 5657.105750] shrink_node+0x62/0x1c0
[ 5657.106529] try_to_free_pages+0x1a4/0x500
[ 5657.107408] __alloc_pages_slowpath+0x2c9/0xb20
[ 5657.108418] __alloc_pages_nodemask+0x268/0x2b0
[ 5657.109348] kmalloc_large_node+0x37/0x90
[ 5657.110205] __kmalloc_node+0x236/0x310
[ 5657.111014] kvmalloc_node+0x3e/0x70

Fixes: 30928e9baac2 ("btrfs: don't run delayed_iputs in commit")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add trace ]
Signed-off-by: David Sterba <dsterba@suse.com>

authored by

Omar Sandoval and committed by
David Sterba
d6fd0ae2 ac765f83

+15 -36
+15 -36
fs/btrfs/disk-io.c
··· 1664 1664 struct btrfs_root *root = arg; 1665 1665 struct btrfs_fs_info *fs_info = root->fs_info; 1666 1666 int again; 1667 - struct btrfs_trans_handle *trans; 1668 1667 1669 - do { 1668 + while (1) { 1670 1669 again = 0; 1671 1670 1672 1671 /* Make the cleaner go to sleep early. */ ··· 1714 1715 */ 1715 1716 btrfs_delete_unused_bgs(fs_info); 1716 1717 sleep: 1718 + if (kthread_should_park()) 1719 + kthread_parkme(); 1720 + if (kthread_should_stop()) 1721 + return 0; 1717 1722 if (!again) { 1718 1723 set_current_state(TASK_INTERRUPTIBLE); 1719 - if (!kthread_should_stop()) 1720 - schedule(); 1724 + schedule(); 1721 1725 __set_current_state(TASK_RUNNING); 1722 1726 } 1723 - } while (!kthread_should_stop()); 1724 - 1725 - /* 1726 - * Transaction kthread is stopped before us and wakes us up. 1727 - * However we might have started a new transaction and COWed some 1728 - * tree blocks when deleting unused block groups for example. So 1729 - * make sure we commit the transaction we started to have a clean 1730 - * shutdown when evicting the btree inode - if it has dirty pages 1731 - * when we do the final iput() on it, eviction will trigger a 1732 - * writeback for it which will fail with null pointer dereferences 1733 - * since work queues and other resources were already released and 1734 - * destroyed by the time the iput/eviction/writeback is made. 1735 - */ 1736 - trans = btrfs_attach_transaction(root); 1737 - if (IS_ERR(trans)) { 1738 - if (PTR_ERR(trans) != -ENOENT) 1739 - btrfs_err(fs_info, 1740 - "cleaner transaction attach returned %ld", 1741 - PTR_ERR(trans)); 1742 - } else { 1743 - int ret; 1744 - 1745 - ret = btrfs_commit_transaction(trans); 1746 - if (ret) 1747 - btrfs_err(fs_info, 1748 - "cleaner open transaction commit returned %d", 1749 - ret); 1750 1727 } 1751 - 1752 - return 0; 1753 1728 } 1754 1729 1755 1730 static int transaction_kthread(void *arg) ··· 3904 3931 int ret; 3905 3932 3906 3933 set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags); 3934 + /* 3935 + * We don't want the cleaner to start new transactions, add more delayed 3936 + * iputs, etc. while we're closing. We can't use kthread_stop() yet 3937 + * because that frees the task_struct, and the transaction kthread might 3938 + * still try to wake up the cleaner. 3939 + */ 3940 + kthread_park(fs_info->cleaner_kthread); 3907 3941 3908 3942 /* wait for the qgroup rescan worker to stop */ 3909 3943 btrfs_qgroup_wait_for_completion(fs_info, false); ··· 3938 3958 3939 3959 if (!sb_rdonly(fs_info->sb)) { 3940 3960 /* 3941 - * If the cleaner thread is stopped and there are 3942 - * block groups queued for removal, the deletion will be 3943 - * skipped when we quit the cleaner thread. 3961 + * The cleaner kthread is stopped, so do one final pass over 3962 + * unused block groups. 3944 3963 */ 3945 3964 btrfs_delete_unused_bgs(fs_info); 3946 3965