Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux

btrfs: fix race setting file private on concurrent lseek using same fd

When doing concurrent lseek(2) system calls against the same file
descriptor, using multiple threads belonging to the same process, we have
a short time window where a race happens and can result in a memory leak.

The race happens like this:

1) A program opens a file descriptor for a file and then spawns two
threads (with the pthreads library for example), lets call them
task A and task B;

2) Task A calls lseek with SEEK_DATA or SEEK_HOLE and ends up at
file.c:find_desired_extent() while holding a read lock on the inode;

3) At the start of find_desired_extent(), it extracts the file's
private_data pointer into a local variable named 'private', which has
a value of NULL;

4) Task B also calls lseek with SEEK_DATA or SEEK_HOLE, locks the inode
in shared mode and enters file.c:find_desired_extent(), where it also
extracts file->private_data into its local variable 'private', which
has a NULL value;

5) Because it saw a NULL file private, task A allocates a private
structure and assigns to the file structure;

6) Task B also saw a NULL file private so it also allocates its own file
private and then assigns it to the same file structure, since both
tasks are using the same file descriptor.

At this point we leak the private structure allocated by task A.

Besides the memory leak, there's also the detail that both tasks end up
using the same cached state record in the private structure (struct
btrfs_file_private::llseek_cached_state), which can result in a
use-after-free problem since one task can free it while the other is
still using it (only one task took a reference count on it). Also, sharing
the cached state is not a good idea since it could result in incorrect
results in the future - right now it should not be a problem because it
end ups being used only in extent-io-tree.c:count_range_bits() where we do
range validation before using the cached state.

Fix this by protecting the private assignment and check of a file while
holding the inode's spinlock and keep track of the task that allocated
the private, so that it's used only by that task in order to prevent
user-after-free issues with the cached state record as well as potentially
using it incorrectly in the future.

Fixes: 3c32c7212f16 ("btrfs: use cached state when looking for delalloc ranges with lseek")
CC: stable@vger.kernel.org # 6.6+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

authored by

Filipe Manana and committed by
David Sterba
7ee85f55 bd610c09

+34 -3
+1
fs/btrfs/btrfs_inode.h
··· 152 152 * logged_trans), to access/update delalloc_bytes, new_delalloc_bytes, 153 153 * defrag_bytes, disk_i_size, outstanding_extents, csum_bytes and to 154 154 * update the VFS' inode number of bytes used. 155 + * Also protects setting struct file::private_data. 155 156 */ 156 157 spinlock_t lock; 157 158
+2
fs/btrfs/ctree.h
··· 463 463 void *filldir_buf; 464 464 u64 last_index; 465 465 struct extent_state *llseek_cached_state; 466 + /* Task that allocated this structure. */ 467 + struct task_struct *owner_task; 466 468 }; 467 469 468 470 static inline u32 BTRFS_LEAF_DATA_SIZE(const struct btrfs_fs_info *info)
+31 -3
fs/btrfs/file.c
··· 3485 3485 static loff_t find_desired_extent(struct file *file, loff_t offset, int whence) 3486 3486 { 3487 3487 struct btrfs_inode *inode = BTRFS_I(file->f_mapping->host); 3488 - struct btrfs_file_private *private = file->private_data; 3488 + struct btrfs_file_private *private; 3489 3489 struct btrfs_fs_info *fs_info = inode->root->fs_info; 3490 3490 struct extent_state *cached_state = NULL; 3491 3491 struct extent_state **delalloc_cached_state; ··· 3513 3513 inode_get_bytes(&inode->vfs_inode) == i_size) 3514 3514 return i_size; 3515 3515 3516 - if (!private) { 3516 + spin_lock(&inode->lock); 3517 + private = file->private_data; 3518 + spin_unlock(&inode->lock); 3519 + 3520 + if (private && private->owner_task != current) { 3521 + /* 3522 + * Not allocated by us, don't use it as its cached state is used 3523 + * by the task that allocated it and we don't want neither to 3524 + * mess with it nor get incorrect results because it reflects an 3525 + * invalid state for the current task. 3526 + */ 3527 + private = NULL; 3528 + } else if (!private) { 3517 3529 private = kzalloc(sizeof(*private), GFP_KERNEL); 3518 3530 /* 3519 3531 * No worries if memory allocation failed. ··· 3533 3521 * lseek SEEK_HOLE/DATA calls to a file when there's delalloc, 3534 3522 * so everything will still be correct. 3535 3523 */ 3536 - file->private_data = private; 3524 + if (private) { 3525 + bool free = false; 3526 + 3527 + private->owner_task = current; 3528 + 3529 + spin_lock(&inode->lock); 3530 + if (file->private_data) 3531 + free = true; 3532 + else 3533 + file->private_data = private; 3534 + spin_unlock(&inode->lock); 3535 + 3536 + if (free) { 3537 + kfree(private); 3538 + private = NULL; 3539 + } 3540 + } 3537 3541 } 3538 3542 3539 3543 if (private)