Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'vfs-6.12.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs

Pull misc vfs updates from Christian Brauner:
"This contains the usual pile of misc updates:

Features:

- Add F_CREATED_QUERY fcntl() that allows userspace to query whether
a file was actually created. Often userspace wants to know whether
an O_CREAT request did actually create a file without using
O_EXCL. The current logic is to first attempt to open the
file without O_CREAT | O_EXCL and, if ENOENT is returned, userspace
tries again with both flags. If that succeeds all is well. If it
now reports EEXIST it retries.

That works fairly well but some corner cases make this more
involved. If this operates on a dangling symlink the first openat()
without O_CREAT | O_EXCL will return ENOENT but the second openat()
with O_CREAT | O_EXCL will fail with EEXIST.

The reason is that openat() without O_CREAT | O_EXCL follows the
symlink while O_CREAT | O_EXCL doesn't for security reasons. So
it's not something we can really change unless we add an explicit
opt-in via O_FOLLOW which seems really ugly.

All available workarounds are really nasty (fanotify, bpf lsm etc)
so add a simple fcntl().

- Try an opportunistic lookup for O_CREAT. Today, when opening a file
we'll typically do a fast lookup, but if O_CREAT is set, the kernel
always takes the exclusive inode lock. This was likely done with
the expectation that O_CREAT means that we always expect to do the
create, but that's often not the case. Many programs set O_CREAT
even in scenarios where the file already exists (see related
F_CREATED_QUERY patch motivation above).

The series contained in this PR rearranges the pathwalk-for-open
code to also attempt a fast_lookup in certain O_CREAT cases. If a
positive dentry is found, the inode_lock can be avoided altogether
and it can stay in rcuwalk mode for the last step_into.

- Expose the 64 bit mount id via name_to_handle_at()

Now that we provide a unique 64-bit mount ID interface in statx(2),
we can provide a race-free way for name_to_handle_at(2) to
provide a file handle and corresponding mount without needing to
worry about racing with /proc/mountinfo parsing or having to open a
file just to do statx(2).

While this is not necessary if you are using AT_EMPTY_PATH and
don't care about an extra statx(2) call, users that pass full paths
into name_to_handle_at(2) need to know which mount the file handle
comes from (to make sure they don't try to open_by_handle_at a file
handle from a different filesystem) and switching to AT_EMPTY_PATH
would require allocating a file for every name_to_handle_at(2) call.

- Add a per dentry expire timeout to autofs

There are two fairly well known automounter map formats, the autofs
format and the amd format (more or less System V and Berkeley).

Some time ago Linux autofs added an amd map format parser that
implemented a fair amount of the amd functionality. This was done
within the autofs infrastructure and some functionality wasn't
implemented because it either didn't make sense or required extra
kernel changes. The idea was to restrict changes to be within the
existing autofs functionality as much as possible and leave changes
with a wider scope to be considered later.

One of these changes is implementing the amd options:
1) "unmount", expire this mount according to a timeout (same as
the current autofs default).
2) "nounmount", don't expire this mount (same as setting the
autofs timeout to 0, except only for this specific mount).
3) "utimeout=<seconds>", expire this mount using the specified
timeout (again, same as setting the autofs timeout but only for
this mount).

To implement these options per-dentry expire timeouts need to be
implemented for autofs indirect mounts. This is because all map
keys (mounts) for autofs indirect mounts use an expire timeout
stored in the autofs mount super block info structure, and all
indirect mounts use the same expire timeout.
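For illustration, assuming the usual amd opts syntax (server names and
export paths here are made up), the three options could appear in an
amd map like this:

```
# hypothetical amd map entries showing the three expire options
projects  type:=nfs;rhost:=fs1;rfs:=/export/projects;opts:=rw,unmount
archive   type:=nfs;rhost:=fs1;rfs:=/export/archive;opts:=rw,nounmount
scratch   type:=nfs;rhost:=fs1;rfs:=/export/scratch;opts:=rw,utimeout=300
```

With per-dentry expire timeouts in place, each of these keys can carry
its own timeout even though they all live under the same autofs
indirect mount.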

Fixes:

- Fix missing fput for FSCONFIG_SET_FD in autofs

- Use param->file for FSCONFIG_SET_FD in coda

- Delete the 'fs/netfs' proc subtree when the netfs module exits

- Make sure that struct uid_gid_map fits into a single cacheline

- Don't flush in-flight wb switches for superblocks without cgroup
writeback

- Correct the idmapping mount example in the idmapping
documentation

- Fix a race between evict_inodes() and find_inode()/iput()

- Refine the show_inode_state() macro definition in writeback code

- Prevent dump_mapping() from accessing invalid dentry.d_name.name

- Show actual source for debugfs in /proc/mounts

- Annotate data-race of busy_poll_usecs in eventpoll

- Don't WARN for racy path_noexec check in exec code

- Handle OOM on mnt_warn_timestamp_expiry()

- Fix some spelling in the iomap design documentation

- Fix typo in procfs comment

- Fix typo in fs/namespace.c comment

Cleanups:

- Add the VFS git tree to the MAINTAINERS file

- Move FMODE_UNSIGNED_OFFSET to fop_flags freeing up another f_mode
bit in struct file bringing us to 5 free f_mode bits

- Remove the __I_DIO_WAKEUP bit from i_state flags as we can simplify
the wait mechanism

- Remove the unused path_put_init() helper

- Replace a __u32 with u32 for s_fsnotify_mask as __u32 is uapi
specific

- Replace the unsigned long i_state member with a u32 i_state member
in struct inode, freeing up 4 bytes in struct inode. Instead of
using the bit-based wait APIs we're now using the var event APIs
and using the individual bytes of the i_state member to wait on
state changes

- Explain how per-syscall AT_* flags should be allocated

- Use the in_group_or_capable() helper to simplify the POSIX ACL mode
update code

- Switch to LIST_HEAD() in fsync_buffers_list() to simplify the code

- Remove a comment about d_rcu_to_refcount() as that function doesn't
exist anymore

- Add kernel documentation for lookup_fast()

- Don't re-zero eventpoll fields

- Remove outdated comment after close_fd()

- Fix imprecise wording in comment about the pipe filesystem

- Drop GFP_NOFAIL mode from alloc_page_buffers

- Fix missing-blank-line warnings and improve a struct declaration in
file_table

- Annotate struct poll_list with __counted_by()

- Remove the unused read parameter in percpu-rwsem

- Remove linux/prefetch.h include from direct-io code

- Use kmemdup_array instead of kmemdup for multiple allocation in
mnt_idmapping code

- Remove unused mnt_cursor_del() declaration

Performance tweaks:

- Dodge smp_mb in break_lease and break_deleg in the common case

- Only read fops once in fops_{get,put}()

- Use RCU in ilookup()

- Elide smp_mb in iversion handling in the common case

- Drop one lock trip in evict()"

* tag 'vfs-6.12.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (58 commits)
uidgid: make sure we fit into one cacheline
proc: Fix typo in the comment
fs/pipe: Correct imprecise wording in comment
fhandle: expose u64 mount id to name_to_handle_at(2)
uapi: explain how per-syscall AT_* flags should be allocated
fs: drop GFP_NOFAIL mode from alloc_page_buffers
writeback: Refine the show_inode_state() macro definition
fs/inode: Prevent dump_mapping() accessing invalid dentry.d_name.name
mnt_idmapping: Use kmemdup_array instead of kmemdup for multiple allocation
netfs: Delete subtree of 'fs/netfs' when netfs module exits
fs: use LIST_HEAD() to simplify code
inode: make i_state a u32
inode: port __I_LRU_ISOLATING to var event
vfs: fix race between evice_inodes() and find_inode()&iput()
inode: port __I_NEW to var event
inode: port __I_SYNC to var event
fs: reorder i_state bits
fs: add i_state helpers
MAINTAINERS: add the VFS git tree
fs: s/__u32/u32/ for s_fsnotify_mask
...

+665 -313
+4 -4
Documentation/filesystems/idmappings.rst
··· 821 821 /* Map the userspace id down into a kernel id in the filesystem's idmapping. */ 822 822 make_kuid(u0:k20000:r10000, u1000) = k21000 823 823 824 - 2. Verify that the caller's kernel ids can be mapped to userspace ids in the 824 + 3. Verify that the caller's kernel ids can be mapped to userspace ids in the 825 825 filesystem's idmapping:: 826 826 827 827 from_kuid(u0:k20000:r10000, k21000) = u1000 ··· 854 854 /* Map the userspace id down into a kernel id in the filesystem's idmapping. */ 855 855 make_kuid(u0:k0:r4294967295, u1000) = k1000 856 856 857 - 2. Verify that the caller's kernel ids can be mapped to userspace ids in the 857 + 3. Verify that the caller's kernel ids can be mapped to userspace ids in the 858 858 filesystem's idmapping:: 859 859 860 - from_kuid(u0:k0:r4294967295, k21000) = u1000 860 + from_kuid(u0:k0:r4294967295, k1000) = u1000 861 861 862 862 So the ownership that lands on disk will be ``u1000``. 863 863 ··· 994 994 /* Map the userspace id down into a kernel id in the filesystem's idmapping. */ 995 995 make_kuid(u0:k0:r4294967295, u1000) = k1000 996 996 997 - 2. Verify that the caller's filesystem ids can be mapped to userspace ids in the 997 + 3. Verify that the caller's filesystem ids can be mapped to userspace ids in the 998 998 filesystem's idmapping:: 999 999 1000 1000 from_kuid(u0:k0:r4294967295, k1000) = u1000
+3 -3
Documentation/filesystems/iomap/design.rst
··· 142 142 * **pure overwrite**: A write operation that does not require any 143 143 metadata or zeroing operations to perform during either submission 144 144 or completion. 145 - This implies that the fileystem must have already allocated space 145 + This implies that the filesystem must have already allocated space 146 146 on disk as ``IOMAP_MAPPED`` and the filesystem must not place any 147 - constaints on IO alignment or size. 147 + constraints on IO alignment or size. 148 148 The only constraints on I/O alignment are device level (minimum I/O 149 149 size and alignment, typically sector size). 150 150 ··· 394 394 395 395 * The **upper** level primitive is provided by the filesystem to 396 396 coordinate access to different iomap operations. 397 - The exact primitive is specifc to the filesystem and operation, 397 + The exact primitive is specific to the filesystem and operation, 398 398 but is often a VFS inode, pagecache invalidation, or folio lock. 399 399 For example, a filesystem might take ``i_rwsem`` before calling 400 400 ``iomap_file_buffered_write`` and ``iomap_file_unshare`` to prevent
+1
MAINTAINERS
··· 8635 8635 R: Jan Kara <jack@suse.cz> 8636 8636 L: linux-fsdevel@vger.kernel.org 8637 8637 S: Maintained 8638 + T: git https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git 8638 8639 F: fs/* 8639 8640 F: include/linux/fs.h 8640 8641 F: include/linux/fs_types.h
+1 -7
drivers/char/adi.c
··· 14 14 15 15 #define MAX_BUF_SZ PAGE_SIZE 16 16 17 - static int adi_open(struct inode *inode, struct file *file) 18 - { 19 - file->f_mode |= FMODE_UNSIGNED_OFFSET; 20 - return 0; 21 - } 22 - 23 17 static int read_mcd_tag(unsigned long addr) 24 18 { 25 19 long err; ··· 200 206 static const struct file_operations adi_fops = { 201 207 .owner = THIS_MODULE, 202 208 .llseek = adi_llseek, 203 - .open = adi_open, 204 209 .read = adi_read, 205 210 .write = adi_write, 211 + .fop_flags = FOP_UNSIGNED_OFFSET, 206 212 }; 207 213 208 214 static struct miscdevice adi_miscdev = {
+2 -1
drivers/char/mem.c
··· 643 643 .get_unmapped_area = get_unmapped_area_mem, 644 644 .mmap_capabilities = memory_mmap_capabilities, 645 645 #endif 646 + .fop_flags = FOP_UNSIGNED_OFFSET, 646 647 }; 647 648 648 649 static const struct file_operations null_fops = { ··· 694 693 umode_t mode; 695 694 } devlist[] = { 696 695 #ifdef CONFIG_DEVMEM 697 - [DEVMEM_MINOR] = { "mem", &mem_fops, FMODE_UNSIGNED_OFFSET, 0 }, 696 + [DEVMEM_MINOR] = { "mem", &mem_fops, 0, 0 }, 698 697 #endif 699 698 [3] = { "null", &null_fops, FMODE_NOWAIT, 0666 }, 700 699 #ifdef CONFIG_DEVPORT
+1
drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
··· 2908 2908 #ifdef CONFIG_PROC_FS 2909 2909 .show_fdinfo = drm_show_fdinfo, 2910 2910 #endif 2911 + .fop_flags = FOP_UNSIGNED_OFFSET, 2911 2912 }; 2912 2913 2913 2914 int amdgpu_file_to_fpriv(struct file *filp, struct amdgpu_fpriv **fpriv)
+2 -1
drivers/gpu/drm/drm_file.c
··· 318 318 if (dev->switch_power_state != DRM_SWITCH_POWER_ON && 319 319 dev->switch_power_state != DRM_SWITCH_POWER_DYNAMIC_OFF) 320 320 return -EINVAL; 321 + if (WARN_ON_ONCE(!(filp->f_op->fop_flags & FOP_UNSIGNED_OFFSET))) 322 + return -EINVAL; 321 323 322 324 drm_dbg_core(dev, "comm=\"%s\", pid=%d, minor=%d\n", 323 325 current->comm, task_pid_nr(current), minor->index); ··· 337 335 } 338 336 339 337 filp->private_data = priv; 340 - filp->f_mode |= FMODE_UNSIGNED_OFFSET; 341 338 priv->filp = filp; 342 339 343 340 mutex_lock(&dev->filelist_mutex);
+1
drivers/gpu/drm/gma500/psb_drv.c
··· 498 498 .mmap = drm_gem_mmap, 499 499 .poll = drm_poll, 500 500 .read = drm_read, 501 + .fop_flags = FOP_UNSIGNED_OFFSET, 501 502 }; 502 503 503 504 static const struct drm_driver driver = {
+1
drivers/gpu/drm/i915/i915_driver.c
··· 1671 1671 #ifdef CONFIG_PROC_FS 1672 1672 .show_fdinfo = drm_show_fdinfo, 1673 1673 #endif 1674 + .fop_flags = FOP_UNSIGNED_OFFSET, 1674 1675 }; 1675 1676 1676 1677 static int
+1
drivers/gpu/drm/nouveau/nouveau_drm.c
··· 1274 1274 .compat_ioctl = nouveau_compat_ioctl, 1275 1275 #endif 1276 1276 .llseek = noop_llseek, 1277 + .fop_flags = FOP_UNSIGNED_OFFSET, 1277 1278 }; 1278 1279 1279 1280 static struct drm_driver
+1
drivers/gpu/drm/radeon/radeon_drv.c
··· 520 520 #ifdef CONFIG_COMPAT 521 521 .compat_ioctl = radeon_kms_compat_ioctl, 522 522 #endif 523 + .fop_flags = FOP_UNSIGNED_OFFSET, 523 524 }; 524 525 525 526 static const struct drm_ioctl_desc radeon_ioctls_kms[] = {
+1
drivers/gpu/drm/tegra/drm.c
··· 801 801 .read = drm_read, 802 802 .compat_ioctl = drm_compat_ioctl, 803 803 .llseek = noop_llseek, 804 + .fop_flags = FOP_UNSIGNED_OFFSET, 804 805 }; 805 806 806 807 static int tegra_drm_context_cleanup(int id, void *p, void *data)
+1
drivers/gpu/drm/vmwgfx/vmwgfx_drv.c
··· 1609 1609 .compat_ioctl = vmw_compat_ioctl, 1610 1610 #endif 1611 1611 .llseek = noop_llseek, 1612 + .fop_flags = FOP_UNSIGNED_OFFSET, 1612 1613 }; 1613 1614 1614 1615 static const struct drm_driver driver = {
+1
drivers/gpu/drm/xe/xe_device.c
··· 241 241 #ifdef CONFIG_PROC_FS 242 242 .show_fdinfo = drm_show_fdinfo, 243 243 #endif 244 + .fop_flags = FOP_UNSIGNED_OFFSET, 244 245 }; 245 246 246 247 static struct drm_driver driver = {
+1 -1
drivers/md/md-bitmap.c
··· 360 360 pr_debug("read bitmap file (%dB @ %llu)\n", (int)PAGE_SIZE, 361 361 (unsigned long long)index << PAGE_SHIFT); 362 362 363 - bh = alloc_page_buffers(page, blocksize, false); 363 + bh = alloc_page_buffers(page, blocksize); 364 364 if (!bh) { 365 365 ret = -ENOMEM; 366 366 goto out;
+1 -1
fs/aio.c
··· 100 100 101 101 unsigned long user_id; 102 102 103 - struct __percpu kioctx_cpu *cpu; 103 + struct kioctx_cpu __percpu *cpu; 104 104 105 105 /* 106 106 * For percpu reqs_available, number of slots we move to/from global
+4
fs/autofs/autofs_i.h
··· 62 62 struct list_head expiring; 63 63 64 64 struct autofs_sb_info *sbi; 65 + unsigned long exp_timeout; 65 66 unsigned long last_used; 66 67 int count; 67 68 ··· 82 81 */ 83 82 #define AUTOFS_INF_PENDING (1<<2) /* dentry pending mount */ 84 83 84 + #define AUTOFS_INF_EXPIRE_SET (1<<3) /* per-dentry expire timeout set for 85 + this mount point. 86 + */ 85 87 struct autofs_wait_queue { 86 88 wait_queue_head_t queue; 87 89 struct autofs_wait_queue *next;
+92 -5
fs/autofs/dev-ioctl.c
··· 128 128 goto out; 129 129 } 130 130 131 + /* Setting the per-dentry expire timeout requires a trailing 132 + * path component, ie. no '/', so invert the logic of the 133 + * check_name() return for AUTOFS_DEV_IOCTL_TIMEOUT_CMD. 134 + */ 131 135 err = check_name(param->path); 136 + if (cmd == AUTOFS_DEV_IOCTL_TIMEOUT_CMD) 137 + err = err ? 0 : -EINVAL; 132 138 if (err) { 133 139 pr_warn("invalid path supplied for cmd(0x%08x)\n", 134 140 cmd); ··· 402 396 return 0; 403 397 } 404 398 405 - /* Set the autofs mount timeout */ 399 + /* 400 + * Set the autofs mount expire timeout. 401 + * 402 + * There are two places an expire timeout can be set, in the autofs 403 + * super block info. (this is all that's needed for direct and offset 404 + * mounts because there's a distinct mount corresponding to each of 405 + * these) and per-dentry within within the dentry info. If a per-dentry 406 + * timeout is set it will override the expire timeout set in the parent 407 + * autofs super block info. 408 + * 409 + * If setting the autofs super block expire timeout the autofs_dev_ioctl 410 + * size field will be equal to the autofs_dev_ioctl structure size. If 411 + * setting the per-dentry expire timeout the mount point name is passed 412 + * in the autofs_dev_ioctl path field and the size field updated to 413 + * reflect this. 414 + * 415 + * Setting the autofs mount expire timeout sets the timeout in the super 416 + * block info. struct. Setting the per-dentry timeout does a little more. 417 + * If the timeout is equal to -1 the per-dentry timeout (and flag) is 418 + * cleared which reverts to using the super block timeout, otherwise if 419 + * timeout is 0 the timeout is set to this value and the flag is left 420 + * set which disables expiration for the mount point, lastly the flag 421 + * and the timeout are set enabling the dentry to use this timeout. 
422 + */ 406 423 static int autofs_dev_ioctl_timeout(struct file *fp, 407 424 struct autofs_sb_info *sbi, 408 425 struct autofs_dev_ioctl *param) 409 426 { 410 - unsigned long timeout; 427 + unsigned long timeout = param->timeout.timeout; 411 428 412 - timeout = param->timeout.timeout; 413 - param->timeout.timeout = sbi->exp_timeout / HZ; 414 - sbi->exp_timeout = timeout * HZ; 429 + /* If setting the expire timeout for an individual indirect 430 + * mount point dentry the mount trailing component path is 431 + * placed in param->path and param->size adjusted to account 432 + * for it otherwise param->size it is set to the structure 433 + * size. 434 + */ 435 + if (param->size == AUTOFS_DEV_IOCTL_SIZE) { 436 + param->timeout.timeout = sbi->exp_timeout / HZ; 437 + sbi->exp_timeout = timeout * HZ; 438 + } else { 439 + struct dentry *base = fp->f_path.dentry; 440 + struct inode *inode = base->d_inode; 441 + int path_len = param->size - AUTOFS_DEV_IOCTL_SIZE - 1; 442 + struct dentry *dentry; 443 + struct autofs_info *ino; 444 + 445 + if (!autofs_type_indirect(sbi->type)) 446 + return -EINVAL; 447 + 448 + /* An expire timeout greater than the superblock timeout 449 + * could be a problem at shutdown but the super block 450 + * timeout itself can change so all we can really do is 451 + * warn the user. 452 + */ 453 + if (timeout >= sbi->exp_timeout) 454 + pr_warn("per-mount expire timeout is greater than " 455 + "the parent autofs mount timeout which could " 456 + "prevent shutdown\n"); 457 + 458 + inode_lock_shared(inode); 459 + dentry = try_lookup_one_len(param->path, base, path_len); 460 + inode_unlock_shared(inode); 461 + if (IS_ERR_OR_NULL(dentry)) 462 + return dentry ? 
PTR_ERR(dentry) : -ENOENT; 463 + ino = autofs_dentry_ino(dentry); 464 + if (!ino) { 465 + dput(dentry); 466 + return -ENOENT; 467 + } 468 + 469 + if (ino->exp_timeout && ino->flags & AUTOFS_INF_EXPIRE_SET) 470 + param->timeout.timeout = ino->exp_timeout / HZ; 471 + else 472 + param->timeout.timeout = sbi->exp_timeout / HZ; 473 + 474 + if (timeout == -1) { 475 + /* Revert to using the super block timeout */ 476 + ino->flags &= ~AUTOFS_INF_EXPIRE_SET; 477 + ino->exp_timeout = 0; 478 + } else { 479 + /* Set the dentry expire flag and timeout. 480 + * 481 + * If timeout is 0 it will prevent the expire 482 + * of this particular automount. 483 + */ 484 + ino->flags |= AUTOFS_INF_EXPIRE_SET; 485 + ino->exp_timeout = timeout * HZ; 486 + } 487 + dput(dentry); 488 + } 489 + 415 490 return 0; 416 491 } 417 492
+5 -2
fs/autofs/expire.c
··· 429 429 if (!root) 430 430 return NULL; 431 431 432 - timeout = sbi->exp_timeout; 433 - 434 432 dentry = NULL; 435 433 while ((dentry = get_next_positive_subdir(dentry, root))) { 436 434 spin_lock(&sbi->fs_lock); ··· 438 440 continue; 439 441 } 440 442 spin_unlock(&sbi->fs_lock); 443 + 444 + if (ino->flags & AUTOFS_INF_EXPIRE_SET) 445 + timeout = ino->exp_timeout; 446 + else 447 + timeout = sbi->exp_timeout; 441 448 442 449 expired = should_expire(dentry, mnt, timeout, how); 443 450 if (!expired)
+3 -2
fs/autofs/inode.c
··· 19 19 INIT_LIST_HEAD(&ino->expiring); 20 20 ino->last_used = jiffies; 21 21 ino->sbi = sbi; 22 + ino->exp_timeout = -1; 22 23 ino->count = 1; 23 24 } 24 25 return ino; ··· 29 28 { 30 29 ino->uid = GLOBAL_ROOT_UID; 31 30 ino->gid = GLOBAL_ROOT_GID; 31 + ino->exp_timeout = -1; 32 32 ino->last_used = jiffies; 33 33 } 34 34 ··· 174 172 ret = autofs_check_pipe(pipe); 175 173 if (ret < 0) { 176 174 errorf(fc, "Invalid/unusable pipe"); 177 - if (param->type != fs_value_is_file) 178 - fput(pipe); 175 + fput(pipe); 179 176 return -EBADF; 180 177 } 181 178
+6 -4
fs/bcachefs/fs.c
··· 1652 1652 break; 1653 1653 } 1654 1654 } else if (clean_pass && this_pass_clean) { 1655 - wait_queue_head_t *wq = bit_waitqueue(&inode->v.i_state, __I_NEW); 1656 - DEFINE_WAIT_BIT(wait, &inode->v.i_state, __I_NEW); 1655 + struct wait_bit_queue_entry wqe; 1656 + struct wait_queue_head *wq_head; 1657 1657 1658 - prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE); 1658 + wq_head = inode_bit_waitqueue(&wqe, &inode->v, __I_NEW); 1659 + prepare_to_wait_event(wq_head, &wqe.wq_entry, 1660 + TASK_UNINTERRUPTIBLE); 1659 1661 mutex_unlock(&c->vfs_inodes_lock); 1660 1662 1661 1663 schedule(); 1662 - finish_wait(wq, &wait.wq_entry); 1664 + finish_wait(wq_head, &wqe.wq_entry); 1663 1665 goto again; 1664 1666 } 1665 1667 }
+2 -6
fs/buffer.c
··· 774 774 static int fsync_buffers_list(spinlock_t *lock, struct list_head *list) 775 775 { 776 776 struct buffer_head *bh; 777 - struct list_head tmp; 778 777 struct address_space *mapping; 779 778 int err = 0, err2; 780 779 struct blk_plug plug; 780 + LIST_HEAD(tmp); 781 781 782 - INIT_LIST_HEAD(&tmp); 783 782 blk_start_plug(&plug); 784 783 785 784 spin_lock(lock); ··· 957 958 } 958 959 EXPORT_SYMBOL_GPL(folio_alloc_buffers); 959 960 960 - struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size, 961 - bool retry) 961 + struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size) 962 962 { 963 963 gfp_t gfp = GFP_NOFS | __GFP_ACCOUNT; 964 - if (retry) 965 - gfp |= __GFP_NOFAIL; 966 964 967 965 return folio_alloc_buffers(page_folio(page), size, gfp); 968 966 }
+30 -13
fs/coda/inode.c
··· 119 119 {} 120 120 }; 121 121 122 - static int coda_parse_fd(struct fs_context *fc, int fd) 122 + static int coda_set_idx(struct fs_context *fc, struct file *file) 123 123 { 124 124 struct coda_fs_context *ctx = fc->fs_private; 125 - struct fd f; 126 125 struct inode *inode; 127 126 int idx; 128 127 129 - f = fdget(fd); 130 - if (!f.file) 131 - return -EBADF; 132 - inode = file_inode(f.file); 128 + inode = file_inode(file); 133 129 if (!S_ISCHR(inode->i_mode) || imajor(inode) != CODA_PSDEV_MAJOR) { 134 - fdput(f); 135 - return invalf(fc, "code: Not coda psdev"); 130 + return invalf(fc, "coda: Not coda psdev"); 136 131 } 137 - 138 132 idx = iminor(inode); 139 - fdput(f); 140 - 141 133 if (idx < 0 || idx >= MAX_CODADEVS) 142 134 return invalf(fc, "coda: Bad minor number"); 143 135 ctx->idx = idx; 144 136 return 0; 137 + } 138 + 139 + static int coda_parse_fd(struct fs_context *fc, struct fs_parameter *param, 140 + struct fs_parse_result *result) 141 + { 142 + struct file *file; 143 + int err; 144 + 145 + if (param->type == fs_value_is_file) { 146 + file = param->file; 147 + param->file = NULL; 148 + } else { 149 + file = fget(result->uint_32); 150 + } 151 + if (!file) 152 + return -EBADF; 153 + 154 + err = coda_set_idx(fc, file); 155 + fput(file); 156 + return err; 145 157 } 146 158 147 159 static int coda_parse_param(struct fs_context *fc, struct fs_parameter *param) ··· 167 155 168 156 switch (opt) { 169 157 case Opt_fd: 170 - return coda_parse_fd(fc, result.uint_32); 158 + return coda_parse_fd(fc, param, &result); 171 159 } 172 160 173 161 return 0; ··· 179 167 */ 180 168 static int coda_parse_monolithic(struct fs_context *fc, void *_data) 181 169 { 170 + struct file *file; 182 171 struct coda_mount_data *data = _data; 183 172 184 173 if (!data) ··· 188 175 if (data->version != CODA_MOUNT_VERSION) 189 176 return invalf(fc, "coda: Bad mount version"); 190 177 191 - coda_parse_fd(fc, data->fd); 178 + file = fget(data->fd); 179 + if (file) { 180 + 
coda_set_idx(fc, file); 181 + fput(file); 182 + } 192 183 return 0; 193 184 } 194 185
+6 -4
fs/dcache.c
··· 1913 1913 __d_instantiate(entry, inode); 1914 1914 WARN_ON(!(inode->i_state & I_NEW)); 1915 1915 inode->i_state &= ~I_NEW & ~I_CREATING; 1916 + /* 1917 + * Pairs with the barrier in prepare_to_wait_event() to make sure 1918 + * ___wait_var_event() either sees the bit cleared or 1919 + * waitqueue_active() check in wake_up_var() sees the waiter. 1920 + */ 1916 1921 smp_mb(); 1917 - wake_up_bit(&inode->i_state, __I_NEW); 1922 + inode_wake_up_bit(inode, __I_NEW); 1918 1923 spin_unlock(&inode->i_lock); 1919 1924 } 1920 1925 EXPORT_SYMBOL(d_instantiate_new); ··· 2172 2167 * held, and rcu_read_lock held. The returned dentry must not be stored into 2173 2168 * without taking d_lock and checking d_seq sequence count against @seq 2174 2169 * returned here. 2175 - * 2176 - * A refcount may be taken on the found dentry with the d_rcu_to_refcount 2177 - * function. 2178 2170 * 2179 2171 * Alternatively, __d_lookup_rcu may be called again to look up the child of 2180 2172 * the returned dentry, so long as its parent's seqlock is checked after the
+8
fs/debugfs/inode.c
··· 89 89 Opt_uid, 90 90 Opt_gid, 91 91 Opt_mode, 92 + Opt_source, 92 93 }; 93 94 94 95 static const struct fs_parameter_spec debugfs_param_specs[] = { 95 96 fsparam_gid ("gid", Opt_gid), 96 97 fsparam_u32oct ("mode", Opt_mode), 97 98 fsparam_uid ("uid", Opt_uid), 99 + fsparam_string ("source", Opt_source), 98 100 {} 99 101 }; 100 102 ··· 127 125 break; 128 126 case Opt_mode: 129 127 opts->mode = result.uint_32 & S_IALLUGO; 128 + break; 129 + case Opt_source: 130 + if (fc->source) 131 + return invalfc(fc, "Multiple sources specified"); 132 + fc->source = param->string; 133 + param->string = NULL; 130 134 break; 131 135 /* 132 136 * We might like to report bad mount options here;
-6
fs/direct-io.c
··· 37 37 #include <linux/rwsem.h> 38 38 #include <linux/uio.h> 39 39 #include <linux/atomic.h> 40 - #include <linux/prefetch.h> 41 40 42 41 #include "internal.h" 43 42 ··· 1119 1120 struct buffer_head map_bh = { 0, }; 1120 1121 struct blk_plug plug; 1121 1122 unsigned long align = offset | iov_iter_alignment(iter); 1122 - 1123 - /* 1124 - * Avoid references to bdev if not absolutely needed to give 1125 - * the early prefetch in the caller enough time. 1126 - */ 1127 1123 1128 1124 /* watch out for a 0 len io from a tricksy fs */ 1129 1125 if (iov_iter_rw(iter) == READ && !count)
+1 -6
fs/eventpoll.c
··· 420 420 421 421 static bool ep_busy_loop_on(struct eventpoll *ep) 422 422 { 423 - return !!ep->busy_poll_usecs || net_busy_loop_on(); 423 + return !!READ_ONCE(ep->busy_poll_usecs) || net_busy_loop_on(); 424 424 } 425 425 426 426 static bool ep_busy_loop_end(void *p, unsigned long start_time) ··· 2200 2200 error = PTR_ERR(file); 2201 2201 goto out_free_fd; 2202 2202 } 2203 - #ifdef CONFIG_NET_RX_BUSY_POLL 2204 - ep->busy_poll_usecs = 0; 2205 - ep->busy_poll_budget = 0; 2206 - ep->prefer_busy_poll = false; 2207 - #endif 2208 2203 ep->file = file; 2209 2204 fd_install(fd, file); 2210 2205 return fd;
+12 -19
fs/exec.c
··· 145 145 goto out; 146 146 147 147 /* 148 - * may_open() has already checked for this, so it should be 149 - * impossible to trip now. But we need to be extra cautious 150 - * and check again at the very end too. 148 + * Check do_open_execat() for an explanation. 151 149 */ 152 150 error = -EACCES; 153 - if (WARN_ON_ONCE(!S_ISREG(file_inode(file)->i_mode) || 154 - path_noexec(&file->f_path))) 151 + if (WARN_ON_ONCE(!S_ISREG(file_inode(file)->i_mode)) || 152 + path_noexec(&file->f_path)) 155 153 goto exit; 156 154 157 155 error = -ENOEXEC; ··· 952 954 static struct file *do_open_execat(int fd, struct filename *name, int flags) 953 955 { 954 956 struct file *file; 955 - int err; 956 957 struct open_flags open_exec_flags = { 957 958 .open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC, 958 959 .acc_mode = MAY_EXEC, ··· 968 971 969 972 file = do_filp_open(fd, name, &open_exec_flags); 970 973 if (IS_ERR(file)) 971 - goto out; 974 + return file; 972 975 973 976 /* 974 - * may_open() has already checked for this, so it should be 975 - * impossible to trip now. But we need to be extra cautious 976 - * and check again at the very end too. 977 + * In the past the regular type check was here. It moved to may_open() in 978 + * 633fb6ac3980 ("exec: move S_ISREG() check earlier"). Since then it is 979 + * an invariant that all non-regular files error out before we get here. 977 980 */ 978 - err = -EACCES; 979 - if (WARN_ON_ONCE(!S_ISREG(file_inode(file)->i_mode) || 980 - path_noexec(&file->f_path))) 981 - goto exit; 981 + if (WARN_ON_ONCE(!S_ISREG(file_inode(file)->i_mode)) || 982 + path_noexec(&file->f_path)) { 983 + fput(file); 984 + return ERR_PTR(-EACCES); 985 + } 982 986 983 - out: 984 987 return file; 985 - 986 - exit: 987 - fput(file); 988 - return ERR_PTR(err); 989 988 } 990 989 991 990 /**
+10
fs/fcntl.c
··· 343 343 return f.file == filp; 344 344 } 345 345 346 + /* Let the caller figure out whether a given file was just created. */ 347 + static long f_created_query(const struct file *filp) 348 + { 349 + return !!(filp->f_mode & FMODE_CREATED); 350 + } 351 + 346 352 static long do_fcntl(int fd, unsigned int cmd, unsigned long arg, 347 353 struct file *filp) 348 354 { ··· 358 352 long err = -EINVAL; 359 353 360 354 switch (cmd) { 355 + case F_CREATED_QUERY: 356 + err = f_created_query(filp); 357 + break; 361 358 case F_DUPFD: 362 359 err = f_dupfd(argi, filp, 0); 363 360 break; ··· 472 463 static int check_fcntl_cmd(unsigned cmd) 473 464 { 474 465 switch (cmd) { 466 + case F_CREATED_QUERY: 475 467 case F_DUPFD: 476 468 case F_DUPFD_CLOEXEC: 477 469 case F_DUPFD_QUERY:
+22 -7
fs/fhandle.c
··· 16 16 17 17 static long do_sys_name_to_handle(const struct path *path, 18 18 struct file_handle __user *ufh, 19 - int __user *mnt_id, int fh_flags) 19 + void __user *mnt_id, bool unique_mntid, 20 + int fh_flags) 20 21 { 21 22 long retval; 22 23 struct file_handle f_handle; ··· 70 69 } else 71 70 retval = 0; 72 71 /* copy the mount id */ 73 - if (put_user(real_mount(path->mnt)->mnt_id, mnt_id) || 74 - copy_to_user(ufh, handle, 75 - struct_size(handle, f_handle, handle_bytes))) 72 + if (unique_mntid) { 73 + if (put_user(real_mount(path->mnt)->mnt_id_unique, 74 + (u64 __user *) mnt_id)) 75 + retval = -EFAULT; 76 + } else { 77 + if (put_user(real_mount(path->mnt)->mnt_id, 78 + (int __user *) mnt_id)) 79 + retval = -EFAULT; 80 + } 81 + /* copy the handle */ 82 + if (retval != -EFAULT && 83 + copy_to_user(ufh, handle, 84 + struct_size(handle, f_handle, handle_bytes))) 76 85 retval = -EFAULT; 77 86 kfree(handle); 78 87 return retval; ··· 94 83 * @name: name that should be converted to handle. 95 84 * @handle: resulting file handle 96 85 * @mnt_id: mount id of the file system containing the file 86 + * (u64 if AT_HANDLE_MNT_ID_UNIQUE, otherwise int) 97 87 * @flag: flag value to indicate whether to follow symlink or not 98 88 * and whether a decodable file handle is required. 99 89 * ··· 104 92 * value required. 105 93 */ 106 94 SYSCALL_DEFINE5(name_to_handle_at, int, dfd, const char __user *, name, 107 - struct file_handle __user *, handle, int __user *, mnt_id, 95 + struct file_handle __user *, handle, void __user *, mnt_id, 108 96 int, flag) 109 97 { 110 98 struct path path; ··· 112 100 int fh_flags; 113 101 int err; 114 102 115 - if (flag & ~(AT_SYMLINK_FOLLOW | AT_EMPTY_PATH | AT_HANDLE_FID)) 103 + if (flag & ~(AT_SYMLINK_FOLLOW | AT_EMPTY_PATH | AT_HANDLE_FID | 104 + AT_HANDLE_MNT_ID_UNIQUE)) 116 105 return -EINVAL; 117 106 118 107 lookup_flags = (flag & AT_SYMLINK_FOLLOW) ? 
LOOKUP_FOLLOW : 0; ··· 122 109 lookup_flags |= LOOKUP_EMPTY; 123 110 err = user_path_at(dfd, name, lookup_flags, &path); 124 111 if (!err) { 125 - err = do_sys_name_to_handle(&path, handle, mnt_id, fh_flags); 112 + err = do_sys_name_to_handle(&path, handle, mnt_id, 113 + flag & AT_HANDLE_MNT_ID_UNIQUE, 114 + fh_flags); 126 115 path_put(&path); 127 116 } 128 117 return err;
+1 -1
fs/file.c
··· 672 672 673 673 return filp_close(file, files); 674 674 } 675 - EXPORT_SYMBOL(close_fd); /* for ksys_close() */ 675 + EXPORT_SYMBOL(close_fd); 676 676 677 677 /** 678 678 * last_fd - return last valid index into fd table
+4 -1
fs/file_table.c
··· 136 136 register_sysctl_init("fs", fs_stat_sysctls); 137 137 if (IS_ENABLED(CONFIG_BINFMT_MISC)) { 138 138 struct ctl_table_header *hdr; 139 + 139 140 hdr = register_sysctl_mount_point("fs/binfmt_misc"); 140 141 kmemleak_not_leak(hdr); 141 142 } ··· 384 383 struct file *alloc_file_clone(struct file *base, int flags, 385 384 const struct file_operations *fops) 386 385 { 387 - struct file *f = alloc_file(&base->f_path, flags, fops); 386 + struct file *f; 387 + 388 + f = alloc_file(&base->f_path, flags, fops); 388 389 if (!IS_ERR(f)) { 389 390 path_get(&f->f_path); 390 391 f->f_mapping = base->f_mapping;
+40 -33
fs/fs-writeback.c
··· 1132 1132 1133 1133 /** 1134 1134 * cgroup_writeback_umount - flush inode wb switches for umount 1135 + * @sb: target super_block 1135 1136 * 1136 1137 * This function is called when a super_block is about to be destroyed and 1137 1138 * flushes in-flight inode wb switches. An inode wb switch goes through ··· 1141 1140 * rare occurrences and synchronize_rcu() can take a while, perform 1142 1141 * flushing iff wb switches are in flight. 1143 1142 */ 1144 - void cgroup_writeback_umount(void) 1143 + void cgroup_writeback_umount(struct super_block *sb) 1145 1144 { 1145 + 1146 + if (!(sb->s_bdi->capabilities & BDI_CAP_WRITEBACK)) 1147 + return; 1148 + 1146 1149 /* 1147 1150 * SB_ACTIVE should be reliably cleared before checking 1148 1151 * isw_nr_in_flight, see generic_shutdown_super(). ··· 1386 1381 1387 1382 static void inode_sync_complete(struct inode *inode) 1388 1383 { 1384 + assert_spin_locked(&inode->i_lock); 1385 + 1389 1386 inode->i_state &= ~I_SYNC; 1390 1387 /* If inode is clean an unused, put it into LRU now... */ 1391 1388 inode_add_lru(inode); 1392 - /* Waiters must see I_SYNC cleared before being woken up */ 1393 - smp_mb(); 1394 - wake_up_bit(&inode->i_state, __I_SYNC); 1389 + /* Called with inode->i_lock which ensures memory ordering. */ 1390 + inode_wake_up_bit(inode, __I_SYNC); 1395 1391 } 1396 1392 1397 1393 static bool inode_dirtied_after(struct inode *inode, unsigned long t) ··· 1511 1505 * Wait for writeback on an inode to complete. Called with i_lock held. 1512 1506 * Caller must make sure inode cannot go away when we drop i_lock. 
1513 1507 */ 1514 - static void __inode_wait_for_writeback(struct inode *inode) 1515 - __releases(inode->i_lock) 1516 - __acquires(inode->i_lock) 1517 - { 1518 - DEFINE_WAIT_BIT(wq, &inode->i_state, __I_SYNC); 1519 - wait_queue_head_t *wqh; 1520 - 1521 - wqh = bit_waitqueue(&inode->i_state, __I_SYNC); 1522 - while (inode->i_state & I_SYNC) { 1523 - spin_unlock(&inode->i_lock); 1524 - __wait_on_bit(wqh, &wq, bit_wait, 1525 - TASK_UNINTERRUPTIBLE); 1526 - spin_lock(&inode->i_lock); 1527 - } 1528 - } 1529 - 1530 - /* 1531 - * Wait for writeback on an inode to complete. Caller must have inode pinned. 1532 - */ 1533 1508 void inode_wait_for_writeback(struct inode *inode) 1534 1509 { 1535 - spin_lock(&inode->i_lock); 1536 - __inode_wait_for_writeback(inode); 1537 - spin_unlock(&inode->i_lock); 1510 + struct wait_bit_queue_entry wqe; 1511 + struct wait_queue_head *wq_head; 1512 + 1513 + assert_spin_locked(&inode->i_lock); 1514 + 1515 + if (!(inode->i_state & I_SYNC)) 1516 + return; 1517 + 1518 + wq_head = inode_bit_waitqueue(&wqe, inode, __I_SYNC); 1519 + for (;;) { 1520 + prepare_to_wait_event(wq_head, &wqe.wq_entry, TASK_UNINTERRUPTIBLE); 1521 + /* Checking I_SYNC with inode->i_lock guarantees memory ordering. 
*/ 1522 + if (!(inode->i_state & I_SYNC)) 1523 + break; 1524 + spin_unlock(&inode->i_lock); 1525 + schedule(); 1526 + spin_lock(&inode->i_lock); 1527 + } 1528 + finish_wait(wq_head, &wqe.wq_entry); 1538 1529 } 1539 1530 1540 1531 /* ··· 1542 1539 static void inode_sleep_on_writeback(struct inode *inode) 1543 1540 __releases(inode->i_lock) 1544 1541 { 1545 - DEFINE_WAIT(wait); 1546 - wait_queue_head_t *wqh = bit_waitqueue(&inode->i_state, __I_SYNC); 1547 - int sleep; 1542 + struct wait_bit_queue_entry wqe; 1543 + struct wait_queue_head *wq_head; 1544 + bool sleep; 1548 1545 1549 - prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE); 1550 - sleep = inode->i_state & I_SYNC; 1546 + assert_spin_locked(&inode->i_lock); 1547 + 1548 + wq_head = inode_bit_waitqueue(&wqe, inode, __I_SYNC); 1549 + prepare_to_wait_event(wq_head, &wqe.wq_entry, TASK_UNINTERRUPTIBLE); 1550 + /* Checking I_SYNC with inode->i_lock guarantees memory ordering. */ 1551 + sleep = !!(inode->i_state & I_SYNC); 1551 1552 spin_unlock(&inode->i_lock); 1552 1553 if (sleep) 1553 1554 schedule(); 1554 - finish_wait(wqh, &wait); 1555 + finish_wait(wq_head, &wqe.wq_entry); 1555 1556 } 1556 1557 1557 1558 /* ··· 1759 1752 */ 1760 1753 if (wbc->sync_mode != WB_SYNC_ALL) 1761 1754 goto out; 1762 - __inode_wait_for_writeback(inode); 1755 + inode_wait_for_writeback(inode); 1763 1756 } 1764 1757 WARN_ON(inode->i_state & I_SYNC); 1765 1758 /*
+80 -36
fs/inode.c
··· 472 472 inode->i_state |= I_REFERENCED; 473 473 } 474 474 475 + struct wait_queue_head *inode_bit_waitqueue(struct wait_bit_queue_entry *wqe, 476 + struct inode *inode, u32 bit) 477 + { 478 + void *bit_address; 479 + 480 + bit_address = inode_state_wait_address(inode, bit); 481 + init_wait_var_entry(wqe, bit_address, 0); 482 + return __var_waitqueue(bit_address); 483 + } 484 + EXPORT_SYMBOL(inode_bit_waitqueue); 485 + 475 486 /* 476 487 * Add inode to LRU if needed (inode is unused and clean). 477 488 * ··· 511 500 spin_lock(&inode->i_lock); 512 501 WARN_ON(!(inode->i_state & I_LRU_ISOLATING)); 513 502 inode->i_state &= ~I_LRU_ISOLATING; 514 - smp_mb(); 515 - wake_up_bit(&inode->i_state, __I_LRU_ISOLATING); 503 + /* Called with inode->i_lock which ensures memory ordering. */ 504 + inode_wake_up_bit(inode, __I_LRU_ISOLATING); 516 505 spin_unlock(&inode->i_lock); 517 506 } 518 507 519 508 static void inode_wait_for_lru_isolating(struct inode *inode) 520 509 { 521 - spin_lock(&inode->i_lock); 522 - if (inode->i_state & I_LRU_ISOLATING) { 523 - DEFINE_WAIT_BIT(wq, &inode->i_state, __I_LRU_ISOLATING); 524 - wait_queue_head_t *wqh; 510 + struct wait_bit_queue_entry wqe; 511 + struct wait_queue_head *wq_head; 525 512 526 - wqh = bit_waitqueue(&inode->i_state, __I_LRU_ISOLATING); 513 + lockdep_assert_held(&inode->i_lock); 514 + if (!(inode->i_state & I_LRU_ISOLATING)) 515 + return; 516 + 517 + wq_head = inode_bit_waitqueue(&wqe, inode, __I_LRU_ISOLATING); 518 + for (;;) { 519 + prepare_to_wait_event(wq_head, &wqe.wq_entry, TASK_UNINTERRUPTIBLE); 520 + /* 521 + * Checking I_LRU_ISOLATING with inode->i_lock guarantees 522 + * memory ordering. 
523 + */ 524 + if (!(inode->i_state & I_LRU_ISOLATING)) 525 + break; 527 526 spin_unlock(&inode->i_lock); 528 - __wait_on_bit(wqh, &wq, bit_wait, TASK_UNINTERRUPTIBLE); 527 + schedule(); 529 528 spin_lock(&inode->i_lock); 530 - WARN_ON(inode->i_state & I_LRU_ISOLATING); 531 529 } 532 - spin_unlock(&inode->i_lock); 530 + finish_wait(wq_head, &wqe.wq_entry); 531 + WARN_ON(inode->i_state & I_LRU_ISOLATING); 533 532 } 534 533 535 534 /** ··· 616 595 struct hlist_node *dentry_first; 617 596 struct dentry *dentry_ptr; 618 597 struct dentry dentry; 598 + char fname[64] = {}; 619 599 unsigned long ino; 620 600 621 601 /* ··· 653 631 return; 654 632 } 655 633 634 + if (strncpy_from_kernel_nofault(fname, dentry.d_name.name, 63) < 0) 635 + strscpy(fname, "<invalid>"); 656 636 /* 657 - * if dentry is corrupted, the %pd handler may still crash, 658 - * but it's unlikely that we reach here with a corrupt mapping 637 + * Even if strncpy_from_kernel_nofault() succeeded, 638 + * the fname could be unreliable 659 639 */ 660 - pr_warn("aops:%ps ino:%lx dentry name:\"%pd\"\n", a_ops, ino, &dentry); 640 + pr_warn("aops:%ps ino:%lx dentry name(?):\"%s\"\n", 641 + a_ops, ino, fname); 661 642 } 662 643 663 644 void clear_inode(struct inode *inode) ··· 715 690 716 691 inode_sb_list_del(inode); 717 692 693 + spin_lock(&inode->i_lock); 718 694 inode_wait_for_lru_isolating(inode); 719 695 720 696 /* ··· 725 699 * the inode. We just have to wait for running writeback to finish. 726 700 */ 727 701 inode_wait_for_writeback(inode); 702 + spin_unlock(&inode->i_lock); 728 703 729 704 if (op->evict_inode) { 730 705 op->evict_inode(inode); ··· 749 722 * used as an indicator whether blocking on it is safe. 750 723 */ 751 724 spin_lock(&inode->i_lock); 752 - wake_up_bit(&inode->i_state, __I_NEW); 725 + /* 726 + * Pairs with the barrier in prepare_to_wait_event() to make sure 727 + * ___wait_var_event() either sees the bit cleared or 728 + * waitqueue_active() check in wake_up_var() sees the waiter. 
729 + */ 730 + smp_mb(); 731 + inode_wake_up_bit(inode, __I_NEW); 753 732 BUG_ON(inode->i_state != (I_FREEING | I_CLEAR)); 754 733 spin_unlock(&inode->i_lock); 755 734 ··· 803 770 continue; 804 771 805 772 spin_lock(&inode->i_lock); 773 + if (atomic_read(&inode->i_count)) { 774 + spin_unlock(&inode->i_lock); 775 + continue; 776 + } 806 777 if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) { 807 778 spin_unlock(&inode->i_lock); 808 779 continue; ··· 1167 1130 spin_lock(&inode->i_lock); 1168 1131 WARN_ON(!(inode->i_state & I_NEW)); 1169 1132 inode->i_state &= ~I_NEW & ~I_CREATING; 1133 + /* 1134 + * Pairs with the barrier in prepare_to_wait_event() to make sure 1135 + * ___wait_var_event() either sees the bit cleared or 1136 + * waitqueue_active() check in wake_up_var() sees the waiter. 1137 + */ 1170 1138 smp_mb(); 1171 - wake_up_bit(&inode->i_state, __I_NEW); 1139 + inode_wake_up_bit(inode, __I_NEW); 1172 1140 spin_unlock(&inode->i_lock); 1173 1141 } 1174 1142 EXPORT_SYMBOL(unlock_new_inode); ··· 1184 1142 spin_lock(&inode->i_lock); 1185 1143 WARN_ON(!(inode->i_state & I_NEW)); 1186 1144 inode->i_state &= ~I_NEW; 1145 + /* 1146 + * Pairs with the barrier in prepare_to_wait_event() to make sure 1147 + * ___wait_var_event() either sees the bit cleared or 1148 + * waitqueue_active() check in wake_up_var() sees the waiter. 
1149 + */ 1187 1150 smp_mb(); 1188 - wake_up_bit(&inode->i_state, __I_NEW); 1151 + inode_wake_up_bit(inode, __I_NEW); 1189 1152 spin_unlock(&inode->i_lock); 1190 1153 iput(inode); 1191 1154 } ··· 1617 1570 struct hlist_head *head = inode_hashtable + hash(sb, ino); 1618 1571 struct inode *inode; 1619 1572 again: 1620 - spin_lock(&inode_hash_lock); 1621 - inode = find_inode_fast(sb, head, ino, true); 1622 - spin_unlock(&inode_hash_lock); 1573 + inode = find_inode_fast(sb, head, ino, false); 1623 1574 1624 1575 if (inode) { 1625 1576 if (IS_ERR(inode)) ··· 2379 2334 */ 2380 2335 static void __wait_on_freeing_inode(struct inode *inode, bool is_inode_hash_locked) 2381 2336 { 2382 - wait_queue_head_t *wq; 2383 - DEFINE_WAIT_BIT(wait, &inode->i_state, __I_NEW); 2337 + struct wait_bit_queue_entry wqe; 2338 + struct wait_queue_head *wq_head; 2384 2339 2385 2340 /* 2386 2341 * Handle racing against evict(), see that routine for more details. ··· 2391 2346 return; 2392 2347 } 2393 2348 2394 - wq = bit_waitqueue(&inode->i_state, __I_NEW); 2395 - prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE); 2349 + wq_head = inode_bit_waitqueue(&wqe, inode, __I_NEW); 2350 + prepare_to_wait_event(wq_head, &wqe.wq_entry, TASK_UNINTERRUPTIBLE); 2396 2351 spin_unlock(&inode->i_lock); 2397 2352 rcu_read_unlock(); 2398 2353 if (is_inode_hash_locked) 2399 2354 spin_unlock(&inode_hash_lock); 2400 2355 schedule(); 2401 - finish_wait(wq, &wait.wq_entry); 2356 + finish_wait(wq_head, &wqe.wq_entry); 2402 2357 if (is_inode_hash_locked) 2403 2358 spin_lock(&inode_hash_lock); 2404 2359 rcu_read_lock(); ··· 2547 2502 /* 2548 2503 * Direct i/o helper functions 2549 2504 */ 2550 - static void __inode_dio_wait(struct inode *inode) 2505 + bool inode_dio_finished(const struct inode *inode) 2551 2506 { 2552 - wait_queue_head_t *wq = bit_waitqueue(&inode->i_state, __I_DIO_WAKEUP); 2553 - DEFINE_WAIT_BIT(q, &inode->i_state, __I_DIO_WAKEUP); 2554 - 2555 - do { 2556 - prepare_to_wait(wq, &q.wq_entry, 
TASK_UNINTERRUPTIBLE); 2557 - if (atomic_read(&inode->i_dio_count)) 2558 - schedule(); 2559 - } while (atomic_read(&inode->i_dio_count)); 2560 - finish_wait(wq, &q.wq_entry); 2507 + return atomic_read(&inode->i_dio_count) == 0; 2561 2508 } 2509 + EXPORT_SYMBOL(inode_dio_finished); 2562 2510 2563 2511 /** 2564 2512 * inode_dio_wait - wait for outstanding DIO requests to finish ··· 2565 2527 */ 2566 2528 void inode_dio_wait(struct inode *inode) 2567 2529 { 2568 - if (atomic_read(&inode->i_dio_count)) 2569 - __inode_dio_wait(inode); 2530 + wait_var_event(&inode->i_dio_count, inode_dio_finished(inode)); 2570 2531 } 2571 2532 EXPORT_SYMBOL(inode_dio_wait); 2533 + 2534 + void inode_dio_wait_interruptible(struct inode *inode) 2535 + { 2536 + wait_var_event_interruptible(&inode->i_dio_count, 2537 + inode_dio_finished(inode)); 2538 + } 2539 + EXPORT_SYMBOL(inode_dio_wait_interruptible); 2572 2540 2573 2541 /* 2574 2542 * inode_set_flags - atomically set some inode flags
+18 -10
fs/libfs.c
··· 2003 2003 * information, but the legacy inode_inc_iversion code used a spinlock 2004 2004 * to serialize increments. 2005 2005 * 2006 - * Here, we add full memory barriers to ensure that any de-facto 2007 - * ordering with other info is preserved. 2006 + * We add a full memory barrier to ensure that any de facto ordering 2007 + * with other state is preserved (either implicitly coming from cmpxchg 2008 + * or explicitly from smp_mb if we don't know upfront if we will execute 2009 + * the former). 2008 2010 * 2009 - * This barrier pairs with the barrier in inode_query_iversion() 2011 + * These barriers pair with inode_query_iversion(). 2010 2012 */ 2011 - smp_mb(); 2012 2013 cur = inode_peek_iversion_raw(inode); 2014 + if (!force && !(cur & I_VERSION_QUERIED)) { 2015 + smp_mb(); 2016 + cur = inode_peek_iversion_raw(inode); 2017 + } 2018 + 2013 2019 do { 2014 2020 /* If flag is clear then we needn't do anything */ 2015 2021 if (!force && !(cur & I_VERSION_QUERIED)) ··· 2044 2038 u64 inode_query_iversion(struct inode *inode) 2045 2039 { 2046 2040 u64 cur, new; 2041 + bool fenced = false; 2047 2042 2043 + /* 2044 + * Memory barriers (implicit in cmpxchg, explicit in smp_mb) pair with 2045 + * inode_maybe_inc_iversion(), see that routine for more details. 2046 + */ 2048 2047 cur = inode_peek_iversion_raw(inode); 2049 2048 do { 2050 2049 /* If flag is already set, then no need to swap */ 2051 2050 if (cur & I_VERSION_QUERIED) { 2052 - /* 2053 - * This barrier (and the implicit barrier in the 2054 - * cmpxchg below) pairs with the barrier in 2055 - * inode_maybe_inc_iversion(). 2056 - */ 2057 - smp_mb(); 2051 + if (!fenced) 2052 + smp_mb(); 2058 2053 break; 2059 2054 } 2060 2055 2056 + fenced = true; 2061 2057 new = cur | I_VERSION_QUERIED; 2062 2058 } while (!atomic64_try_cmpxchg(&inode->i_version, &cur, new)); 2063 2059 return cur >> I_VERSION_QUERIED_SHIFT;
+6 -6
fs/mnt_idmapping.c
··· 228 228 return 0; 229 229 } 230 230 231 - forward = kmemdup(map_from->forward, 232 - nr_extents * sizeof(struct uid_gid_extent), 233 - GFP_KERNEL_ACCOUNT); 231 + forward = kmemdup_array(map_from->forward, nr_extents, 232 + sizeof(struct uid_gid_extent), 233 + GFP_KERNEL_ACCOUNT); 234 234 if (!forward) 235 235 return -ENOMEM; 236 236 237 - reverse = kmemdup(map_from->reverse, 238 - nr_extents * sizeof(struct uid_gid_extent), 239 - GFP_KERNEL_ACCOUNT); 237 + reverse = kmemdup_array(map_from->reverse, nr_extents, 238 + sizeof(struct uid_gid_extent), 239 + GFP_KERNEL_ACCOUNT); 240 240 if (!reverse) { 241 241 kfree(forward); 242 242 return -ENOMEM;
-1
fs/mount.h
··· 153 153 list_add_tail(&mnt->mnt_list, dt_list); 154 154 } 155 155 156 - extern void mnt_cursor_del(struct mnt_namespace *ns, struct mount *cursor); 157 156 bool has_locked_children(struct mount *mnt, struct dentry *dentry);
+61 -14
fs/namei.c
··· 1639 1639 } 1640 1640 EXPORT_SYMBOL(lookup_one_qstr_excl); 1641 1641 1642 + /** 1643 + * lookup_fast - do fast lockless (but racy) lookup of a dentry 1644 + * @nd: current nameidata 1645 + * 1646 + * Do a fast, but racy lookup in the dcache for the given dentry, and 1647 + * revalidate it. Returns a valid dentry pointer or NULL if one wasn't 1648 + * found. On error, an ERR_PTR will be returned. 1649 + * 1650 + * If this function returns a valid dentry and the walk is no longer 1651 + * lazy, the dentry will carry a reference that must later be put. If 1652 + * RCU mode is still in force, then this is not the case and the dentry 1653 + * must be legitimized before use. If this returns NULL, then the walk 1654 + * will no longer be in RCU mode. 1655 + */ 1642 1656 static struct dentry *lookup_fast(struct nameidata *nd) 1643 1657 { 1644 1658 struct dentry *dentry, *parent = nd->path.dentry; ··· 3535 3521 return dentry; 3536 3522 } 3537 3523 3524 + if (open_flag & O_CREAT) 3525 + audit_inode(nd->name, dir, AUDIT_INODE_PARENT); 3526 + 3538 3527 /* 3539 3528 * Checking write permission is tricky, bacuse we don't know if we are 3540 3529 * going to actually need it: O_CREAT opens should work as long as the ··· 3608 3591 return ERR_PTR(error); 3609 3592 } 3610 3593 3594 + static inline bool trailing_slashes(struct nameidata *nd) 3595 + { 3596 + return (bool)nd->last.name[nd->last.len]; 3597 + } 3598 + 3599 + static struct dentry *lookup_fast_for_open(struct nameidata *nd, int open_flag) 3600 + { 3601 + struct dentry *dentry; 3602 + 3603 + if (open_flag & O_CREAT) { 3604 + if (trailing_slashes(nd)) 3605 + return ERR_PTR(-EISDIR); 3606 + 3607 + /* Don't bother on an O_EXCL create */ 3608 + if (open_flag & O_EXCL) 3609 + return NULL; 3610 + } 3611 + 3612 + if (trailing_slashes(nd)) 3613 + nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY; 3614 + 3615 + dentry = lookup_fast(nd); 3616 + if (IS_ERR_OR_NULL(dentry)) 3617 + return dentry; 3618 + 3619 + if (open_flag & O_CREAT) { 
3620 + /* Discard negative dentries. Need inode_lock to do the create */ 3621 + if (!dentry->d_inode) { 3622 + if (!(nd->flags & LOOKUP_RCU)) 3623 + dput(dentry); 3624 + dentry = NULL; 3625 + } 3626 + } 3627 + return dentry; 3628 + } 3629 + 3611 3630 static const char *open_last_lookups(struct nameidata *nd, 3612 3631 struct file *file, const struct open_flags *op) 3613 3632 { ··· 3661 3608 return handle_dots(nd, nd->last_type); 3662 3609 } 3663 3610 3664 - if (!(open_flag & O_CREAT)) { 3665 - if (nd->last.name[nd->last.len]) 3666 - nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY; 3667 - /* we _can_ be in RCU mode here */ 3668 - dentry = lookup_fast(nd); 3669 - if (IS_ERR(dentry)) 3670 - return ERR_CAST(dentry); 3671 - if (likely(dentry)) 3672 - goto finish_lookup; 3611 + /* We _can_ be in RCU mode here */ 3612 + dentry = lookup_fast_for_open(nd, open_flag); 3613 + if (IS_ERR(dentry)) 3614 + return ERR_CAST(dentry); 3673 3615 3616 + if (likely(dentry)) 3617 + goto finish_lookup; 3618 + 3619 + if (!(open_flag & O_CREAT)) { 3674 3620 if (WARN_ON_ONCE(nd->flags & LOOKUP_RCU)) 3675 3621 return ERR_PTR(-ECHILD); 3676 3622 } else { 3677 - /* create side of things */ 3678 3623 if (nd->flags & LOOKUP_RCU) { 3679 3624 if (!try_to_unlazy(nd)) 3680 3625 return ERR_PTR(-ECHILD); 3681 3626 } 3682 - audit_inode(nd->name, dir, AUDIT_INODE_PARENT); 3683 - /* trailing slashes? */ 3684 - if (unlikely(nd->last.name[nd->last.len])) 3685 - return ERR_PTR(-EISDIR); 3686 3627 } 3687 3628 3688 3629 if (open_flag & (O_CREAT | O_TRUNC | O_WRONLY | O_RDWR)) {
+13 -5
fs/namespace.c
··· 1774 1774 list_del_init(&p->mnt_child); 1775 1775 } 1776 1776 1777 - /* Add propogated mounts to the tmp_list */ 1777 + /* Add propagated mounts to the tmp_list */ 1778 1778 if (how & UMOUNT_PROPAGATE) 1779 1779 propagate_umount(&tmp_list); 1780 1780 ··· 2921 2921 if (!__mnt_is_readonly(mnt) && 2922 2922 (!(sb->s_iflags & SB_I_TS_EXPIRY_WARNED)) && 2923 2923 (ktime_get_real_seconds() + TIME_UPTIME_SEC_MAX > sb->s_time_max)) { 2924 - char *buf = (char *)__get_free_page(GFP_KERNEL); 2925 - char *mntpath = buf ? d_path(mountpoint, buf, PAGE_SIZE) : ERR_PTR(-ENOMEM); 2924 + char *buf, *mntpath; 2925 + 2926 + buf = (char *)__get_free_page(GFP_KERNEL); 2927 + if (buf) 2928 + mntpath = d_path(mountpoint, buf, PAGE_SIZE); 2929 + else 2930 + mntpath = ERR_PTR(-ENOMEM); 2931 + if (IS_ERR(mntpath)) 2932 + mntpath = "(unknown)"; 2926 2933 2927 2934 pr_warn("%s filesystem being %s at %s supports timestamps until %ptTd (0x%llx)\n", 2928 2935 sb->s_type->name, ··· 2937 2930 mntpath, &sb->s_time_max, 2938 2931 (unsigned long long)sb->s_time_max); 2939 2932 2940 - free_page((unsigned long)buf); 2941 2933 sb->s_iflags |= SB_I_TS_EXPIRY_WARNED; 2934 + if (buf) 2935 + free_page((unsigned long)buf); 2942 2936 } 2943 2937 } 2944 2938 ··· 5613 5605 /* Only worry about locked mounts */ 5614 5606 if (!(child->mnt.mnt_flags & MNT_LOCKED)) 5615 5607 continue; 5616 - /* Is the directory permanetly empty? */ 5608 + /* Is the directory permanently empty? */ 5617 5609 if (!is_empty_dir_inode(inode)) 5618 5610 goto next; 5619 5611 }
+5 -17
fs/netfs/locking.c
··· 19 19 * Must be called under a lock that serializes taking new references 20 20 * to i_dio_count, usually by inode->i_mutex. 21 21 */ 22 - static int inode_dio_wait_interruptible(struct inode *inode) 22 + static int netfs_inode_dio_wait_interruptible(struct inode *inode) 23 23 { 24 - if (!atomic_read(&inode->i_dio_count)) 24 + if (inode_dio_finished(inode)) 25 25 return 0; 26 26 27 - wait_queue_head_t *wq = bit_waitqueue(&inode->i_state, __I_DIO_WAKEUP); 28 - DEFINE_WAIT_BIT(q, &inode->i_state, __I_DIO_WAKEUP); 29 - 30 - for (;;) { 31 - prepare_to_wait(wq, &q.wq_entry, TASK_INTERRUPTIBLE); 32 - if (!atomic_read(&inode->i_dio_count)) 33 - break; 34 - if (signal_pending(current)) 35 - break; 36 - schedule(); 37 - } 38 - finish_wait(wq, &q.wq_entry); 39 - 40 - return atomic_read(&inode->i_dio_count) ? -ERESTARTSYS : 0; 27 + inode_dio_wait_interruptible(inode); 28 + return !inode_dio_finished(inode) ? -ERESTARTSYS : 0; 41 29 } 42 30 43 31 /* Call with exclusively locked inode->i_rwsem */ ··· 34 46 if (!test_bit(NETFS_ICTX_ODIRECT, &ictx->flags)) 35 47 return 0; 36 48 clear_bit(NETFS_ICTX_ODIRECT, &ictx->flags); 37 - return inode_dio_wait_interruptible(&ictx->inode); 49 + return netfs_inode_dio_wait_interruptible(&ictx->inode); 38 50 } 39 51 40 52 /**
+2 -2
fs/netfs/main.c
··· 142 142 143 143 error_fscache: 144 144 error_procfile: 145 - remove_proc_entry("fs/netfs", NULL); 145 + remove_proc_subtree("fs/netfs", NULL); 146 146 error_proc: 147 147 mempool_exit(&netfs_subrequest_pool); 148 148 error_subreqpool: ··· 159 159 static void __exit netfs_exit(void) 160 160 { 161 161 fscache_exit(); 162 - remove_proc_entry("fs/netfs", NULL); 162 + remove_proc_subtree("fs/netfs", NULL); 163 163 mempool_exit(&netfs_subrequest_pool); 164 164 kmem_cache_destroy(netfs_subrequest_slab); 165 165 mempool_exit(&netfs_request_pool);
+1 -1
fs/pipe.c
··· 1427 1427 1428 1428 /* 1429 1429 * pipefs should _never_ be mounted by userland - too much of security hassle, 1430 - * no real gain from having the whole whorehouse mounted. So we don't need 1430 + * no real gain from having the whole file system mounted. So we don't need 1431 1431 * any operations on the root directory. However, we need a non-trivial 1432 1432 * d_name - pipe: will go nicely and kill the special-casing in procfs. 1433 1433 */
+2 -2
fs/posix_acl.c
··· 715 715 return error; 716 716 if (error == 0) 717 717 *acl = NULL; 718 - if (!vfsgid_in_group_p(i_gid_into_vfsgid(idmap, inode)) && 719 - !capable_wrt_inode_uidgid(idmap, inode, CAP_FSETID)) 718 + if (!in_group_or_capable(idmap, inode, 719 + i_gid_into_vfsgid(idmap, inode))) 720 720 mode &= ~S_ISGID; 721 721 *mode_p = mode; 722 722 return 0;
+4 -6
fs/proc/base.c
··· 827 827 828 828 static int mem_open(struct inode *inode, struct file *file) 829 829 { 830 - int ret = __mem_open(inode, file, PTRACE_MODE_ATTACH); 831 - 832 - /* OK to pass negative loff_t, we can catch out-of-range */ 833 - file->f_mode |= FMODE_UNSIGNED_OFFSET; 834 - 835 - return ret; 830 + if (WARN_ON_ONCE(!(file->f_op->fop_flags & FOP_UNSIGNED_OFFSET))) 831 + return -EINVAL; 832 + return __mem_open(inode, file, PTRACE_MODE_ATTACH); 836 833 } 837 834 838 835 static ssize_t mem_rw(struct file *file, char __user *buf, ··· 929 932 .write = mem_write, 930 933 .open = mem_open, 931 934 .release = mem_release, 935 + .fop_flags = FOP_UNSIGNED_OFFSET, 932 936 }; 933 937 934 938 static int environ_open(struct inode *inode, struct file *file)
+1 -1
fs/proc/fd.c
··· 59 59 real_mount(file->f_path.mnt)->mnt_id, 60 60 file_inode(file)->i_ino); 61 61 62 - /* show_fd_locks() never deferences files so a stale value is safe */ 62 + /* show_fd_locks() never dereferences files, so a stale value is safe */ 63 63 show_fd_locks(m, file, files); 64 64 if (seq_has_overflowed(m)) 65 65 goto out;
+1 -1
fs/proc/kcore.c
··· 235 235 int nid, ret; 236 236 unsigned long end_pfn; 237 237 238 - /* Not inialized....update now */ 238 + /* Not initialized....update now */ 239 239 /* find out "max pfn" */ 240 240 end_pfn = 0; 241 241 for_each_node_state(nid, N_MEMORY) {
+1 -1
fs/read_write.c
··· 36 36 37 37 static inline bool unsigned_offsets(struct file *file) 38 38 { 39 - return file->f_mode & FMODE_UNSIGNED_OFFSET; 39 + return file->f_op->fop_flags & FOP_UNSIGNED_OFFSET; 40 40 } 41 41 42 42 /**
+1 -1
fs/select.c
··· 840 840 struct poll_list { 841 841 struct poll_list *next; 842 842 unsigned int len; 843 - struct pollfd entries[]; 843 + struct pollfd entries[] __counted_by(len); 844 844 }; 845 845 846 846 #define POLLFD_PER_PAGE ((PAGE_SIZE-sizeof(struct poll_list)) / sizeof(struct pollfd))
+2 -2
fs/super.c
··· 621 621 sync_filesystem(sb); 622 622 sb->s_flags &= ~SB_ACTIVE; 623 623 624 - cgroup_writeback_umount(); 624 + cgroup_writeback_umount(sb); 625 625 626 626 /* Evict all inodes with zero refcount. */ 627 627 evict_inodes(sb); ··· 1905 1905 int level; 1906 1906 1907 1907 for (level = SB_FREEZE_LEVELS - 1; level >= 0; level--) 1908 - percpu_rwsem_release(sb->s_writers.rw_sem + level, 0, _THIS_IP_); 1908 + percpu_rwsem_release(sb->s_writers.rw_sem + level, _THIS_IP_); 1909 1909 } 1910 1910 1911 1911 /*
+2 -1
include/drm/drm_accel.h
··· 28 28 .poll = drm_poll,\ 29 29 .read = drm_read,\ 30 30 .llseek = noop_llseek, \ 31 - .mmap = drm_gem_mmap 31 + .mmap = drm_gem_mmap, \ 32 + .fop_flags = FOP_UNSIGNED_OFFSET 32 33 33 34 /** 34 35 * DEFINE_DRM_ACCEL_FOPS() - macro to generate file operations for accelerators drivers
+2 -1
include/drm/drm_gem.h
··· 447 447 .poll = drm_poll,\ 448 448 .read = drm_read,\ 449 449 .llseek = noop_llseek,\ 450 - .mmap = drm_gem_mmap 450 + .mmap = drm_gem_mmap, \ 451 + .fop_flags = FOP_UNSIGNED_OFFSET 451 452 452 453 /** 453 454 * DEFINE_DRM_GEM_FOPS() - macro to generate file operations for GEM drivers
+1
include/drm/drm_gem_dma_helper.h
··· 267 267 .read = drm_read,\ 268 268 .llseek = noop_llseek,\ 269 269 .mmap = drm_gem_mmap,\ 270 + .fop_flags = FOP_UNSIGNED_OFFSET, \ 270 271 DRM_GEM_DMA_UNMAPPED_AREA_FOPS \ 271 272 } 272 273
+1 -2
include/linux/buffer_head.h
··· 199 199 unsigned long offset); 200 200 struct buffer_head *folio_alloc_buffers(struct folio *folio, unsigned long size, 201 201 gfp_t gfp); 202 - struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size, 203 - bool retry); 202 + struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size); 204 203 struct buffer_head *create_empty_buffers(struct folio *folio, 205 204 unsigned long blocksize, unsigned long b_state); 206 205 void end_buffer_read_sync(struct buffer_head *bh, int uptodate);
+12 -2
include/linux/filelock.h
··· 420 420 #ifdef CONFIG_FILE_LOCKING 421 421 static inline int break_lease(struct inode *inode, unsigned int mode) 422 422 { 423 + struct file_lock_context *flctx; 424 + 423 425 /* 424 426 * Since this check is lockless, we must ensure that any refcounts 425 427 * taken are done before checking i_flctx->flc_lease. Otherwise, we 426 428 * could end up racing with tasks trying to set a new lease on this 427 429 * file. 428 430 */ 431 + flctx = READ_ONCE(inode->i_flctx); 432 + if (!flctx) 433 + return 0; 429 434 smp_mb(); 430 - if (inode->i_flctx && !list_empty_careful(&inode->i_flctx->flc_lease)) 435 + if (!list_empty_careful(&flctx->flc_lease)) 431 436 return __break_lease(inode, mode, FL_LEASE); 432 437 return 0; 433 438 } 434 439 435 440 static inline int break_deleg(struct inode *inode, unsigned int mode) 436 441 { 442 + struct file_lock_context *flctx; 443 + 437 444 /* 438 445 * Since this check is lockless, we must ensure that any refcounts 439 446 * taken are done before checking i_flctx->flc_lease. Otherwise, we 440 447 * could end up racing with tasks trying to set a new lease on this 441 448 * file. 442 449 */ 450 + flctx = READ_ONCE(inode->i_flctx); 451 + if (!flctx) 452 + return 0; 443 453 smp_mb(); 444 - if (inode->i_flctx && !list_empty_careful(&inode->i_flctx->flc_lease)) 454 + if (!list_empty_careful(&flctx->flc_lease)) 445 455 return __break_lease(inode, mode, FL_DELEG); 446 456 return 0; 447 457 }
+57 -31
include/linux/fs.h
···
 /* Expect random access pattern */
 #define FMODE_RANDOM		((__force fmode_t)(1 << 12))
 
-/* File is huge (eg. /dev/mem): treat loff_t as unsigned */
-#define FMODE_UNSIGNED_OFFSET	((__force fmode_t)(1 << 13))
+/* FMODE_* bit 13 */
 
 /* File is opened with O_PATH; almost nothing can be done with it */
 #define FMODE_PATH		((__force fmode_t)(1 << 14))
···
 #endif
 
 	/* Misc */
-	unsigned long		i_state;
+	u32			i_state;
+	/* 32-bit hole */
 	struct rw_semaphore	i_rwsem;
 
 	unsigned long		dirtied_when;	/* jiffies of first dirtying */
···
 
 	void			*i_private; /* fs or device private pointer */
 } __randomize_layout;
+
+/*
+ * Get bit address from inode->i_state to use with wait_var_event()
+ * infrastructure.
+ */
+#define inode_state_wait_address(inode, bit) ((char *)&(inode)->i_state + (bit))
+
+struct wait_queue_head *inode_bit_waitqueue(struct wait_bit_queue_entry *wqe,
+					    struct inode *inode, u32 bit);
+
+static inline void inode_wake_up_bit(struct inode *inode, u32 bit)
+{
+	/* Caller is responsible for correct memory barriers. */
+	wake_up_var(inode_state_wait_address(inode, bit));
+}
 
 struct timespec64 timestamp_truncate(struct timespec64 t, struct inode *inode);
···
 	time64_t		s_time_min;
 	time64_t		s_time_max;
 #ifdef CONFIG_FSNOTIFY
-	__u32			s_fsnotify_mask;
+	u32			s_fsnotify_mask;
 	struct fsnotify_sb_info	*s_fsnotify_info;
 #endif
···
 #define __sb_writers_acquired(sb, lev)	\
 	percpu_rwsem_acquire(&(sb)->s_writers.rw_sem[(lev)-1], 1, _THIS_IP_)
 #define __sb_writers_release(sb, lev)	\
-	percpu_rwsem_release(&(sb)->s_writers.rw_sem[(lev)-1], 1, _THIS_IP_)
+	percpu_rwsem_release(&(sb)->s_writers.rw_sem[(lev)-1], _THIS_IP_)
 
 /**
  * __sb_write_started - check if sb freeze level is held
···
 #define FOP_DIO_PARALLEL_WRITE	((__force fop_flags_t)(1 << 3))
 /* Contains huge pages */
 #define FOP_HUGE_PAGES		((__force fop_flags_t)(1 << 4))
+/* Treat loff_t as unsigned (e.g., /dev/mem) */
+#define FOP_UNSIGNED_OFFSET	((__force fop_flags_t)(1 << 5))
 
 /* Wrap a directory iterator that needs exclusive inode access */
 int wrap_directory_iterator(struct file *, struct dir_context *,
···
  *
  * I_REFERENCED		Marks the inode as recently references on the LRU list.
  *
- * I_DIO_WAKEUP		Never set.  Only used as a key for wait_on_bit().
- *
  * I_WB_SWITCH		Cgroup bdi_writeback switching in progress.  Used to
  *			synchronize competing switching instances and to tell
  *			wb stat updates to grab the i_pages lock.  See
···
  *			i_count.
  *
  * Q: What is the difference between I_WILL_FREE and I_FREEING?
+ *
+ * __I_{SYNC,NEW,LRU_ISOLATING} are used to derive unique addresses to wait
+ * upon. There's one free address left.
  */
-#define I_DIRTY_SYNC		(1 << 0)
-#define I_DIRTY_DATASYNC	(1 << 1)
-#define I_DIRTY_PAGES		(1 << 2)
-#define __I_NEW			3
+#define __I_NEW			0
 #define I_NEW			(1 << __I_NEW)
-#define I_WILL_FREE		(1 << 4)
-#define I_FREEING		(1 << 5)
-#define I_CLEAR			(1 << 6)
-#define __I_SYNC		7
+#define __I_SYNC		1
 #define I_SYNC			(1 << __I_SYNC)
-#define I_REFERENCED		(1 << 8)
-#define __I_DIO_WAKEUP		9
-#define I_DIO_WAKEUP		(1 << __I_DIO_WAKEUP)
+#define __I_LRU_ISOLATING	2
+#define I_LRU_ISOLATING		(1 << __I_LRU_ISOLATING)
+
+#define I_DIRTY_SYNC		(1 << 3)
+#define I_DIRTY_DATASYNC	(1 << 4)
+#define I_DIRTY_PAGES		(1 << 5)
+#define I_WILL_FREE		(1 << 6)
+#define I_FREEING		(1 << 7)
+#define I_CLEAR			(1 << 8)
+#define I_REFERENCED		(1 << 9)
 #define I_LINKABLE		(1 << 10)
 #define I_DIRTY_TIME		(1 << 11)
-#define I_WB_SWITCH		(1 << 13)
-#define I_OVL_INUSE		(1 << 14)
-#define I_CREATING		(1 << 15)
-#define I_DONTCACHE		(1 << 16)
-#define I_SYNC_QUEUED		(1 << 17)
-#define I_PINNING_NETFS_WB	(1 << 18)
-#define __I_LRU_ISOLATING	19
-#define I_LRU_ISOLATING		(1 << __I_LRU_ISOLATING)
+#define I_WB_SWITCH		(1 << 12)
+#define I_OVL_INUSE		(1 << 13)
+#define I_CREATING		(1 << 14)
+#define I_DONTCACHE		(1 << 15)
+#define I_SYNC_QUEUED		(1 << 16)
+#define I_PINNING_NETFS_WB	(1 << 17)
 
 #define I_DIRTY_INODE		(I_DIRTY_SYNC | I_DIRTY_DATASYNC)
 #define I_DIRTY			(I_DIRTY_INODE | I_DIRTY_PAGES)
···
 struct super_block *sget_dev(struct fs_context *fc, dev_t dev);
 
 /* Alas, no aliases. Too much hassle with bringing module.h everywhere */
-#define fops_get(fops) \
-	(((fops) && try_module_get((fops)->owner) ? (fops) : NULL))
-#define fops_put(fops) \
-	do { if (fops) module_put((fops)->owner); } while(0)
+#define fops_get(fops) ({						\
+	const struct file_operations *_fops = (fops);			\
+	(((_fops) && try_module_get((_fops)->owner) ? (_fops) : NULL));	\
+})
+
+#define fops_put(fops) ({						\
+	const struct file_operations *_fops = (fops);			\
+	if (_fops)							\
+		module_put((_fops)->owner);				\
+})
+
 /*
  * This one is to be used *ONLY* from ->open() instances.
  * fops must be non-NULL, pinned down *and* module dependencies
···
 }
 #endif
 
+bool inode_dio_finished(const struct inode *inode);
 void inode_dio_wait(struct inode *inode);
+void inode_dio_wait_interruptible(struct inode *inode);
 
 /**
  * inode_dio_begin - signal start of a direct I/O requests
···
 static inline void inode_dio_end(struct inode *inode)
 {
 	if (atomic_dec_and_test(&inode->i_dio_count))
-		wake_up_bit(&inode->i_state, __I_DIO_WAKEUP);
+		wake_up_var(&inode->i_dio_count);
 }
 
 extern void inode_set_flags(struct inode *inode, unsigned int flags,
-6
include/linux/path.h
···
 	return path1->mnt == path2->mnt && path1->dentry == path2->dentry;
 }
 
-static inline void path_put_init(struct path *path)
-{
-	path_put(path);
-	*path = (struct path) { };
-}
-
 /*
  * Cleanup macro for use with __free(path_put). Avoids dereference and
  * copying @path unlike DEFINE_FREE(). path_put() will handle the empty
+1 -1
include/linux/percpu-rwsem.h
···
 #define percpu_rwsem_assert_held(sem)	lockdep_assert_held(sem)
 
 static inline void percpu_rwsem_release(struct percpu_rw_semaphore *sem,
-					bool read, unsigned long ip)
+					unsigned long ip)
 {
 	lock_release(&sem->dep_map, ip);
 }
+1 -1
include/linux/syscalls.h
···
 #endif
 asmlinkage long sys_name_to_handle_at(int dfd, const char __user *name,
 				      struct file_handle __user *handle,
-				      int __user *mnt_id, int flag);
+				      void __user *mnt_id, int flag);
 asmlinkage long sys_open_by_handle_at(int mountdirfd,
 				      struct file_handle __user *handle,
 				      int flags);
+4 -2
include/linux/user_namespace.h
···
 };
 
 struct uid_gid_map { /* 64 bytes -- 1 cache line */
-	u32 nr_extents;
 	union {
-		struct uid_gid_extent extent[UID_GID_MAP_MAX_BASE_EXTENTS];
+		struct {
+			struct uid_gid_extent extent[UID_GID_MAP_MAX_BASE_EXTENTS];
+			u32 nr_extents;
+		};
 		struct {
 			struct uid_gid_extent *forward;
 			struct uid_gid_extent *reverse;
+4 -3
include/linux/writeback.h
···
 /* writeback.h requires fs.h; it, too, is not included from here. */
 static inline void wait_on_inode(struct inode *inode)
 {
-	wait_on_bit(&inode->i_state, __I_NEW, TASK_UNINTERRUPTIBLE);
+	wait_var_event(inode_state_wait_address(inode, __I_NEW),
+		       !(READ_ONCE(inode->i_state) & I_NEW));
 }
 
 #ifdef CONFIG_CGROUP_WRITEBACK
···
 			 size_t bytes);
 int cgroup_writeback_by_id(u64 bdi_id, int memcg_id,
 			   enum wb_reason reason, struct wb_completion *done);
-void cgroup_writeback_umount(void);
+void cgroup_writeback_umount(struct super_block *sb);
 bool cleanup_offline_cgwb(struct bdi_writeback *wb);
 
 /**
···
 {
 }
 
-static inline void cgroup_writeback_umount(void)
+static inline void cgroup_writeback_umount(struct super_block *sb)
 {
 }
 
+9 -1
include/trace/events/writeback.h
···
 		{I_CLEAR,		"I_CLEAR"},		\
 		{I_SYNC,		"I_SYNC"},		\
 		{I_DIRTY_TIME,		"I_DIRTY_TIME"},	\
-		{I_REFERENCED,		"I_REFERENCED"}		\
+		{I_REFERENCED,		"I_REFERENCED"},	\
+		{I_LINKABLE,		"I_LINKABLE"},		\
+		{I_WB_SWITCH,		"I_WB_SWITCH"},		\
+		{I_OVL_INUSE,		"I_OVL_INUSE"},		\
+		{I_CREATING,		"I_CREATING"},		\
+		{I_DONTCACHE,		"I_DONTCACHE"},		\
+		{I_SYNC_QUEUED,		"I_SYNC_QUEUED"},	\
+		{I_PINNING_NETFS_WB,	"I_PINNING_NETFS_WB"},	\
+		{I_LRU_ISOLATING,	"I_LRU_ISOLATING"}	\
 	)
 
 /* enums need to be exported to user space */
+1 -1
include/uapi/linux/auto_fs.h
···
 #define AUTOFS_MIN_PROTO_VERSION	3
 #define AUTOFS_MAX_PROTO_VERSION	5
 
-#define AUTOFS_PROTO_SUBVERSION	5
+#define AUTOFS_PROTO_SUBVERSION	6
 
 /*
  * The wait_queue_token (autofs_wqt_t) is part of a structure which is passed
+60 -24
include/uapi/linux/fcntl.h
···
 
 #define F_DUPFD_QUERY	(F_LINUX_SPECIFIC_BASE + 3)
 
+/* Was the file just created? */
+#define F_CREATED_QUERY	(F_LINUX_SPECIFIC_BASE + 4)
+
 /*
  * Cancel a blocking posix lock; internal use only until we expose an
  * asynchronous lock api to userspace:
···
 #define DN_ATTRIB	0x00000020	/* File changed attibutes */
 #define DN_MULTISHOT	0x80000000	/* Don't remove notifier */
 
+#define AT_FDCWD		-100	/* Special value for dirfd used to
+					   indicate openat should use the
+					   current working directory. */
+
+
+/* Generic flags for the *at(2) family of syscalls. */
+
+/* Reserved for per-syscall flags	0xff. */
+#define AT_SYMLINK_NOFOLLOW	0x100	/* Do not follow symbolic
+					   links. */
+/* Reserved for per-syscall flags	0x200 */
+#define AT_SYMLINK_FOLLOW	0x400	/* Follow symbolic links. */
+#define AT_NO_AUTOMOUNT		0x800	/* Suppress terminal automount
+					   traversal. */
+#define AT_EMPTY_PATH		0x1000	/* Allow empty relative
+					   pathname to operate on dirfd
+					   directly. */
 /*
- * The constants AT_REMOVEDIR and AT_EACCESS have the same value. AT_EACCESS is
- * meaningful only to faccessat, while AT_REMOVEDIR is meaningful only to
- * unlinkat. The two functions do completely different things and therefore,
- * the flags can be allowed to overlap. For example, passing AT_REMOVEDIR to
- * faccessat would be undefined behavior and thus treating it equivalent to
- * AT_EACCESS is valid undefined behavior.
+ * These flags are currently statx(2)-specific, but they could be made generic
+ * in the future and so they should not be used for other per-syscall flags.
  */
-#define AT_FDCWD		-100	/* Special value used to indicate
-					   openat should use the current
-					   working directory. */
-#define AT_SYMLINK_NOFOLLOW	0x100	/* Do not follow symbolic links. */
+#define AT_STATX_SYNC_TYPE	0x6000	/* Type of synchronisation required from statx() */
+#define AT_STATX_SYNC_AS_STAT	0x0000	/* - Do whatever stat() does */
+#define AT_STATX_FORCE_SYNC	0x2000	/* - Force the attributes to be sync'd with the server */
+#define AT_STATX_DONT_SYNC	0x4000	/* - Don't sync attributes with the server */
+
+#define AT_RECURSIVE		0x8000	/* Apply to the entire subtree */
+
+/*
+ * Per-syscall flags for the *at(2) family of syscalls.
+ *
+ * These are flags that are so syscall-specific that a user passing these flags
+ * to the wrong syscall is so "clearly wrong" that we can safely call such
+ * usage "undefined behaviour".
+ *
+ * For example, the constants AT_REMOVEDIR and AT_EACCESS have the same value.
+ * AT_EACCESS is meaningful only to faccessat, while AT_REMOVEDIR is meaningful
+ * only to unlinkat. The two functions do completely different things and
+ * therefore, the flags can be allowed to overlap. For example, passing
+ * AT_REMOVEDIR to faccessat would be undefined behavior and thus treating it
+ * equivalent to AT_EACCESS is valid undefined behavior.
+ *
+ * Note for implementers: When picking a new per-syscall AT_* flag, try to
+ * reuse already existing flags first. This leaves us with as many unused bits
+ * as possible, so we can use them for generic bits in the future if necessary.
+ */
+
+/* Flags for renameat2(2) (must match legacy RENAME_* flags). */
+#define AT_RENAME_NOREPLACE	0x0001
+#define AT_RENAME_EXCHANGE	0x0002
+#define AT_RENAME_WHITEOUT	0x0004
+
+/* Flag for faccessat(2). */
 #define AT_EACCESS		0x200	/* Test access permitted for
 					   effective IDs, not real IDs. */
+/* Flag for unlinkat(2). */
 #define AT_REMOVEDIR		0x200	/* Remove directory instead of
 					   unlinking file. */
-#define AT_SYMLINK_FOLLOW	0x400	/* Follow symbolic links. */
-#define AT_NO_AUTOMOUNT		0x800	/* Suppress terminal automount traversal */
-#define AT_EMPTY_PATH		0x1000	/* Allow empty relative pathname */
+/* Flags for name_to_handle_at(2). */
+#define AT_HANDLE_FID		0x200	/* File handle is needed to compare
+					   object identity and may not be
+					   usable with open_by_handle_at(2). */
+#define AT_HANDLE_MNT_ID_UNIQUE	0x001	/* Return the u64 unique mount ID. */
 
-#define AT_STATX_SYNC_TYPE	0x6000	/* Type of synchronisation required from statx() */
-#define AT_STATX_SYNC_AS_STAT	0x0000	/* - Do whatever stat() does */
-#define AT_STATX_FORCE_SYNC	0x2000	/* - Force the attributes to be sync'd with the server */
-#define AT_STATX_DONT_SYNC	0x4000	/* - Don't sync attributes with the server */
-
-#define AT_RECURSIVE		0x8000	/* Apply to the entire subtree */
-
-/* Flags for name_to_handle_at(2). We reuse AT_ flag space to save bits... */
-#define AT_HANDLE_FID		AT_REMOVEDIR	/* file handle is needed to
-					compare object identity and may not
-					be usable to open_by_handle_at(2) */
 #if defined(__KERNEL__)
 #define AT_GETATTR_NOSEC	0x80000000
 #endif
+3 -3
kernel/user.c
···
  */
 struct user_namespace init_user_ns = {
 	.uid_map = {
-		.nr_extents = 1,
 		{
 			.extent[0] = {
 				.first = 0,
 				.lower_first = 0,
 				.count = 4294967295U,
 			},
+			.nr_extents = 1,
 		},
 	},
 	.gid_map = {
-		.nr_extents = 1,
 		{
 			.extent[0] = {
 				.first = 0,
 				.lower_first = 0,
 				.count = 4294967295U,
 			},
+			.nr_extents = 1,
 		},
 	},
 	.projid_map = {
-		.nr_extents = 1,
 		{
 			.extent[0] = {
 				.first = 0,
 				.lower_first = 0,
 				.count = 4294967295U,
 			},
+			.nr_extents = 1,
 		},
 	},
 	.ns.count = REFCOUNT_INIT(3),
+1 -1
mm/mmap.c
···
 		return MAX_LFS_FILESIZE;
 
 	/* Special "we do even unsigned file positions" case */
-	if (file->f_mode & FMODE_UNSIGNED_OFFSET)
+	if (file->f_op->fop_flags & FOP_UNSIGNED_OFFSET)
 		return 0;
 
 	/* Yes, random drivers might want more. But I'm tired of buggy drivers */
+39
tools/testing/selftests/core/close_range_test.c
···
 #define F_DUPFD_QUERY (F_LINUX_SPECIFIC_BASE + 3)
 #endif
 
+#ifndef F_CREATED_QUERY
+#define F_CREATED_QUERY (F_LINUX_SPECIFIC_BASE + 4)
+#endif
+
 static inline int sys_close_range(unsigned int fd, unsigned int max_fd,
 				  unsigned int flags)
 {
···
 	EXPECT_EQ(waitpid(pid, &status, 0), pid);
 	EXPECT_EQ(true, WIFEXITED(status));
 	EXPECT_EQ(0, WEXITSTATUS(status));
+}
+
+TEST(fcntl_created)
+{
+	for (int i = 0; i < 101; i++) {
+		int fd;
+		char path[PATH_MAX];
+
+		fd = open("/dev/null", O_RDONLY | O_CLOEXEC);
+		ASSERT_GE(fd, 0) {
+			if (errno == ENOENT)
+				SKIP(return,
+				     "Skipping test since /dev/null does not exist");
+		}
+
+		/* We didn't create "/dev/null". */
+		EXPECT_EQ(fcntl(fd, F_CREATED_QUERY, 0), 0);
+		close(fd);
+
+		sprintf(path, "aaaa_%d", i);
+		fd = open(path, O_CREAT | O_RDONLY | O_CLOEXEC, 0600);
+		ASSERT_GE(fd, 0);
+
+		/* We created "aaaa_%d". */
+		EXPECT_EQ(fcntl(fd, F_CREATED_QUERY, 0), 1);
+		close(fd);
+
+		fd = open(path, O_RDONLY | O_CLOEXEC);
+		ASSERT_GE(fd, 0);
+
+		/* We're opening it again, so no positive creation check. */
+		EXPECT_EQ(fcntl(fd, F_CREATED_QUERY, 0), 0);
+		close(fd);
+		unlink(path);
+	}
 }
 
 TEST_HARNESS_MAIN