btrfs: release path before initializing extent tree in btrfs_read_locked_inode()

In btrfs_read_locked_inode() we are calling btrfs_init_file_extent_tree()
while holding a path with a read locked leaf from a subvolume tree, and
btrfs_init_file_extent_tree() may do a GFP_KERNEL allocation, which can
trigger reclaim.

This can create a circular lock dependency which lockdep warns about with
the following splat:

[6.1433] ======================================================
[6.1574] WARNING: possible circular locking dependency detected
[6.1583] 6.18.0+ #4 Tainted: G U
[6.1591] ------------------------------------------------------
[6.1599] kswapd0/117 is trying to acquire lock:
[6.1606] ffff8d9b6333c5b8 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x39/0x2f0
[6.1625]
but task is already holding lock:
[6.1633] ffffffffa4ab8ce0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x195/0xc60
[6.1646]
which lock already depends on the new lock.

[6.1657]
the existing dependency chain (in reverse order) is:
[6.1667]
-> #2 (fs_reclaim){+.+.}-{0:0}:
[6.1677] fs_reclaim_acquire+0x9d/0xd0
[6.1685] __kmalloc_cache_noprof+0x59/0x750
[6.1694] btrfs_init_file_extent_tree+0x90/0x100
[6.1702] btrfs_read_locked_inode+0xc3/0x6b0
[6.1710] btrfs_iget+0xbb/0xf0
[6.1716] btrfs_lookup_dentry+0x3c5/0x8e0
[6.1724] btrfs_lookup+0x12/0x30
[6.1731] lookup_open.isra.0+0x1aa/0x6a0
[6.1739] path_openat+0x5f7/0xc60
[6.1746] do_filp_open+0xd6/0x180
[6.1753] do_sys_openat2+0x8b/0xe0
[6.1760] __x64_sys_openat+0x54/0xa0
[6.1768] do_syscall_64+0x97/0x3e0
[6.1776] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[6.1784]
-> #1 (btrfs-tree-00){++++}-{3:3}:
[6.1794] lock_release+0x127/0x2a0
[6.1801] up_read+0x1b/0x30
[6.1808] btrfs_search_slot+0x8e0/0xff0
[6.1817] btrfs_lookup_inode+0x52/0xd0
[6.1825] __btrfs_update_delayed_inode+0x73/0x520
[6.1833] btrfs_commit_inode_delayed_inode+0x11a/0x120
[6.1842] btrfs_log_inode+0x608/0x1aa0
[6.1849] btrfs_log_inode_parent+0x249/0xf80
[6.1857] btrfs_log_dentry_safe+0x3e/0x60
[6.1865] btrfs_sync_file+0x431/0x690
[6.1872] do_fsync+0x39/0x80
[6.1879] __x64_sys_fsync+0x13/0x20
[6.1887] do_syscall_64+0x97/0x3e0
[6.1894] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[6.1903]
-> #0 (&delayed_node->mutex){+.+.}-{3:3}:
[6.1913] __lock_acquire+0x15e9/0x2820
[6.1920] lock_acquire+0xc9/0x2d0
[6.1927] __mutex_lock+0xcc/0x10a0
[6.1934] __btrfs_release_delayed_node.part.0+0x39/0x2f0
[6.1944] btrfs_evict_inode+0x20b/0x4b0
[6.1952] evict+0x15a/0x2f0
[6.1958] prune_icache_sb+0x91/0xd0
[6.1966] super_cache_scan+0x150/0x1d0
[6.1974] do_shrink_slab+0x155/0x6f0
[6.1981] shrink_slab+0x48e/0x890
[6.1988] shrink_one+0x11a/0x1f0
[6.1995] shrink_node+0xbfd/0x1320
[6.1002] balance_pgdat+0x67f/0xc60
[6.1321] kswapd+0x1dc/0x3e0
[6.1643] kthread+0xff/0x240
[6.1965] ret_from_fork+0x223/0x280
[6.1287] ret_from_fork_asm+0x1a/0x30
[6.1616]
other info that might help us debug this:

[6.1561] Chain exists of:
&delayed_node->mutex --> btrfs-tree-00 --> fs_reclaim

[6.1503] Possible unsafe locking scenario:

[6.1110] CPU0 CPU1
[6.1411] ---- ----
[6.1707] lock(fs_reclaim);
[6.1998] lock(btrfs-tree-00);
[6.1291] lock(fs_reclaim);
[6.1581] lock(&delayed_node->mutex);
[6.1874]
*** DEADLOCK ***

[6.1716] 2 locks held by kswapd0/117:
[6.1999] #0: ffffffffa4ab8ce0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x195/0xc60
[6.1294] #1: ffff8d998344b0e0 (&type->s_umount_key#40){++++}- {3:3}, at: super_cache_scan+0x37/0x1d0
[6.1596]
stack backtrace:
[6.1183] CPU: 11 UID: 0 PID: 117 Comm: kswapd0 Tainted: G U 6.18.0+ #4 PREEMPT(lazy)
[6.1185] Tainted: [U]=USER
[6.1186] Hardware name: ASUS System Product Name/PRIME B560M-A AC, BIOS 2001 02/01/2023
[6.1187] Call Trace:
[6.1187] <TASK>
[6.1189] dump_stack_lvl+0x6e/0xa0
[6.1192] print_circular_bug.cold+0x17a/0x1c0
[6.1194] check_noncircular+0x175/0x190
[6.1197] __lock_acquire+0x15e9/0x2820
[6.1200] lock_acquire+0xc9/0x2d0
[6.1201] ? __btrfs_release_delayed_node.part.0+0x39/0x2f0
[6.1204] __mutex_lock+0xcc/0x10a0
[6.1206] ? __btrfs_release_delayed_node.part.0+0x39/0x2f0
[6.1208] ? __btrfs_release_delayed_node.part.0+0x39/0x2f0
[6.1211] ? __btrfs_release_delayed_node.part.0+0x39/0x2f0
[6.1213] __btrfs_release_delayed_node.part.0+0x39/0x2f0
[6.1215] btrfs_evict_inode+0x20b/0x4b0
[6.1217] ? lock_acquire+0xc9/0x2d0
[6.1220] evict+0x15a/0x2f0
[6.1222] prune_icache_sb+0x91/0xd0
[6.1224] super_cache_scan+0x150/0x1d0
[6.1226] do_shrink_slab+0x155/0x6f0
[6.1228] shrink_slab+0x48e/0x890
[6.1229] ? shrink_slab+0x2d2/0x890
[6.1231] shrink_one+0x11a/0x1f0
[6.1234] shrink_node+0xbfd/0x1320
[6.1236] ? shrink_node+0xa2d/0x1320
[6.1236] ? shrink_node+0xbd3/0x1320
[6.1239] ? balance_pgdat+0x67f/0xc60
[6.1239] balance_pgdat+0x67f/0xc60
[6.1241] ? finish_task_switch.isra.0+0xc4/0x2a0
[6.1246] kswapd+0x1dc/0x3e0
[6.1247] ? __pfx_autoremove_wake_function+0x10/0x10
[6.1249] ? __pfx_kswapd+0x10/0x10
[6.1250] kthread+0xff/0x240
[6.1251] ? __pfx_kthread+0x10/0x10
[6.1253] ret_from_fork+0x223/0x280
[6.1255] ? __pfx_kthread+0x10/0x10
[6.1257] ret_from_fork_asm+0x1a/0x30
[6.1260] </TASK>

This is because:

1) The fsync task is holding an inode's delayed node mutex (for a
directory) while calling __btrfs_update_delayed_inode() and that needs
to do a search on the subvolume's btree (therefore read lock some
extent buffers);

2) The lookup task, at btrfs_lookup(), triggered reclaim with the
GFP_KERNEL allocation done by btrfs_init_file_extent_tree() while
holding a read lock on a subvolume leaf;

3) The reclaim triggered kswapd which is doing inode eviction for the
directory inode the fsync task is using as an argument to
btrfs_commit_inode_delayed_inode() - but in that call chain we are
trying to read lock the same leaf that the lookup task is holding
while calling btrfs_init_file_extent_tree() and doing the GFP_KERNEL
allocation.

Fix this by calling btrfs_init_file_extent_tree() after we don't need the
path anymore and release it in btrfs_read_locked_inode().

Reported-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Link: https://lore.kernel.org/linux-btrfs/6e55113a22347c3925458a5d840a18401a38b276.camel@linux.intel.com/
Fixes: 8679d2687c35 ("btrfs: initialize inode::file_extent_tree after i_mode has been set")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

authored by Filipe Manana and committed by David Sterba 8731f2c5 1aff297f

Changed files
+14 -5
fs
btrfs
+14 -5
fs/btrfs/inode.c
··· 4046 4046 btrfs_set_inode_mapping_order(inode); 4047 4047 4048 4048 cache_index: 4049 - ret = btrfs_init_file_extent_tree(inode); 4050 - if (ret) 4051 - goto out; 4052 - btrfs_inode_set_file_extent_range(inode, 0, 4053 - round_up(i_size_read(vfs_inode), fs_info->sectorsize)); 4054 4049 /* 4055 4050 * If we were modified in the current generation and evicted from memory 4056 4051 * and then re-read we need to do a full sync since we don't have any ··· 4131 4136 "error loading props for ino %llu (root %llu): %d", 4132 4137 btrfs_ino(inode), btrfs_root_id(root), ret); 4133 4138 } 4139 + 4140 + /* 4141 + * We don't need the path anymore, so release it to avoid holding a read 4142 + * lock on a leaf while calling btrfs_init_file_extent_tree(), which can 4143 + * allocate memory that triggers reclaim (GFP_KERNEL) and cause a locking 4144 + * dependency. 4145 + */ 4146 + btrfs_release_path(path); 4147 + 4148 + ret = btrfs_init_file_extent_tree(inode); 4149 + if (ret) 4150 + goto out; 4151 + btrfs_inode_set_file_extent_range(inode, 0, 4152 + round_up(i_size_read(vfs_inode), fs_info->sectorsize)); 4134 4153 4135 4154 if (!maybe_acls) 4136 4155 cache_no_acl(vfs_inode);