Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

xfs: swap leaf buffer into path struct atomically during path shift

The node directory lookup code uses a state structure that tracks the
path of buffers used to search for the hash of a filename through the
leaf blocks. When the lookup encounters a block that ends with the
requested hash, but the entry has not yet been found, it must shift over
to the next block and continue looking for the entry (i.e., duplicate
hashes could continue over into the next block). This shift mechanism
involves walking back up and down the state structure, replacing buffers
at the appropriate btree levels as necessary.
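
The shift condition described above can be illustrated with a minimal
user-space sketch. This is not the XFS implementation: the demo_* names,
the flat array of leaf blocks and the integer id (standing in for the
filename comparison) are all assumptions made for illustration. It only
models the rule that the search moves to the next block when the current
block ends with the requested hash.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified model: a leaf block is a hash-sorted array. */
struct demo_entry {
	uint32_t hash;
	int id;		/* stands in for the filename comparison */
};

struct demo_leaf {
	const struct demo_entry *ents;
	size_t count;
};

/*
 * Search leaves[start..nleaves) for an entry matching both hash and id.
 * Returns the index of the leaf block holding the match, or -1.
 */
int demo_lookup(const struct demo_leaf *leaves, size_t nleaves, size_t start,
		uint32_t hash, int id)
{
	for (size_t i = start; i < nleaves; i++) {
		const struct demo_leaf *leaf = &leaves[i];

		for (size_t j = 0; j < leaf->count; j++) {
			if (leaf->ents[j].hash == hash &&
			    leaf->ents[j].id == id)
				return (int)i;
		}

		/*
		 * Only shift to the next block if this one ends with the
		 * requested hash; otherwise no duplicates can follow.
		 */
		if (leaf->count == 0 ||
		    leaf->ents[leaf->count - 1].hash != hash)
			break;
	}
	return -1;
}
```

In this model, a lookup whose hash matches the last entry of one block
but whose id lives in the following block is found only because of the
shift to the next block.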

When a buffer is replaced, the old buffer is released and the new buffer
read into the active slot in the path structure. Because the buffer is
read directly into the path slot, a buffer read failure can result in
setting a NULL buffer pointer in an active slot. This throws off the
state cleanup code in xfs_dir2_node_lookup(), which expects to release a
buffer from each active slot. Instead, a BUG occurs due to a NULL
pointer dereference:

BUG: unable to handle kernel NULL pointer dereference at 00000000000001e8
IP: [<ffffffffa0585063>] xfs_trans_brelse+0x2a3/0x3c0 [xfs]
...
RIP: 0010:[<ffffffffa0585063>] [<ffffffffa0585063>] xfs_trans_brelse+0x2a3/0x3c0 [xfs]
...
Call Trace:
[<ffffffffa05250c6>] xfs_dir2_node_lookup+0xa6/0x2c0 [xfs]
[<ffffffffa0519f7c>] xfs_dir_lookup+0x1ac/0x1c0 [xfs]
[<ffffffffa055d0e1>] xfs_lookup+0x91/0x290 [xfs]
[<ffffffffa05580b3>] xfs_vn_lookup+0x73/0xb0 [xfs]
[<ffffffff8122de8d>] lookup_real+0x1d/0x50
[<ffffffff8123330e>] path_openat+0x91e/0x1490
[<ffffffff81235079>] do_filp_open+0x89/0x100
...

This has been reproduced via a parallel fsstress and filesystem shutdown
workload in a loop. The shutdown triggers the read error in the
aforementioned codepath and causes the BUG in xfs_dir2_node_lookup().

Change xfs_da3_path_shift() to read the new buffer into a local variable
and swap it into the active path slot only after the read succeeds, so
that the slot update is atomic with respect to the caller. This ensures
that the caller always sees either the old or the new buffer in the slot
and prevents the NULL pointer dereference.
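
The read-then-swap pattern can be sketched in user space as follows.
This is a simplified model under stated assumptions, not the kernel
code: the demo_* types and functions are hypothetical, the read is
simulated with malloc() and a failure flag, and the old buffer is always
released, whereas the real code only does so when 'release' is set.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for a buffer and a path slot. */
struct demo_buf {
	int blkno;
};

struct demo_slot {
	struct demo_buf *bp;
	int blkno;
};

/* Simulated block read: fails with -EIO when 'fail' is non-zero. */
int demo_read(int blkno, int fail, struct demo_buf **bpp)
{
	struct demo_buf *bp;

	*bpp = NULL;
	if (fail)
		return -EIO;
	bp = malloc(sizeof(*bp));
	if (!bp)
		return -ENOMEM;
	bp->blkno = blkno;
	*bpp = bp;
	return 0;
}

/* Models cleanup code that expects a valid buffer in every slot. */
void demo_release(struct demo_buf *bp)
{
	assert(bp != NULL);
	free(bp);
}

/*
 * Replace the buffer in 'slot' with the one for 'blkno'. The new buffer
 * is read into a local pointer first, so on failure the slot still
 * holds the old buffer and never a NULL pointer.
 */
int demo_shift_slot(struct demo_slot *slot, int blkno, int fail)
{
	struct demo_buf *bp;
	int error;

	error = demo_read(blkno, fail, &bp);
	if (error)
		return error;	/* slot is untouched */

	demo_release(slot->bp);
	slot->blkno = blkno;
	slot->bp = bp;
	return 0;
}
```

If demo_shift_slot() fails, a caller that later releases every active
slot still finds the old buffer there, which is the property the fix
relies on.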

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

Authored by Brian Foster and committed by Dave Chinner
7df1c170 1b867d3a

+15 -10
fs/xfs/libxfs/xfs_da_btree.c
···
 	struct xfs_da_args *args;
 	struct xfs_da_node_entry *btree;
 	struct xfs_da3_icnode_hdr nodehdr;
+	struct xfs_buf *bp;
 	xfs_dablk_t blkno = 0;
 	int level;
 	int error;
···
 	 */
 	for (blk++, level++; level < path->active; blk++, level++) {
 		/*
-		 * Release the old block.
-		 * (if it's dirty, trans won't actually let go)
+		 * Read the next child block into a local buffer.
+		 */
+		error = xfs_da3_node_read(args->trans, dp, blkno, -1, &bp,
+					  args->whichfork);
+		if (error)
+			return error;
+
+		/*
+		 * Release the old block (if it's dirty, the trans doesn't
+		 * actually let go) and swap the local buffer into the path
+		 * structure. This ensures failure of the above read doesn't set
+		 * a NULL buffer in an active slot in the path.
 		 */
 		if (release)
 			xfs_trans_brelse(args->trans, blk->bp);
-
-		/*
-		 * Read the next child block.
-		 */
 		blk->blkno = blkno;
-		error = xfs_da3_node_read(args->trans, dp, blkno, -1,
-					&blk->bp, args->whichfork);
-		if (error)
-			return error;
+		blk->bp = bp;
+
 		info = blk->bp->b_addr;
 		ASSERT(info->magic == cpu_to_be16(XFS_DA_NODE_MAGIC) ||
 		       info->magic == cpu_to_be16(XFS_DA3_NODE_MAGIC) ||