Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge patch series "pidfs: persistent info & xattrs"

Christian Brauner <brauner@kernel.org> says:

Persist exit and coredump information independent of whether anyone
currently holds a pidfd for the struct pid.

The current scheme allocates pidfs dentries on demand, repeatedly.
This scheme is reaching its limits as it makes it impossible to pin
information that needs to be available after the task has exited or
coredumped and that should not be lost simply because the pidfd got
closed temporarily. The next opener should still see the stashed
information.

This is also a prerequisite for supporting extended attributes on
pidfds to allow attaching meta information to them.

If someone opens a pidfd for a struct pid, a pidfs dentry is allocated
and stashed in pid->stashed. Once the last pidfd for the struct pid is
closed the pidfs dentry is released and removed from pid->stashed.

So if 10 callers create a pidfs dentry for the same struct pid
sequentially, i.e., each closes its pidfd before the next one is
created, then a new pidfs dentry is allocated every time.

Because multiple tasks acquiring and releasing a pidfd for the same
struct pid can race with one another, a task may still find a valid
pidfs dentry from the previous task in pid->stashed and reuse it. Or it
might find a dead dentry in there, fail to reuse it, and stash a new
pidfs dentry. Multiple tasks may race to stash a new pidfs dentry, but
only one will succeed; the others will put their dentry.

The current scheme aims to ensure that a pidfs dentry for a struct pid
can only be created if the task is still alive or if a pidfs dentry
already existed before the task was reaped and so exit information has
been stashed in the pidfs inode.

That's great except that it's buggy. If a pidfs dentry is stashed in
pid->stashed after pidfs_exit() but before __unhash_process() is called,
we will return a pidfd for a reaped task without exit information being
available.

The pidfs_pid_valid() check does not guard against this race as it
doesn't sync at all with pidfs_exit(). The pid_has_task() check might be
successful simply because we're before __unhash_process() but after
pidfs_exit().

Introduce a new scheme where the lifetime of information associated with
a pidfs dentry (coredump and exit information) isn't bound to the
lifetime of the pidfs inode but to the struct pid itself.

The first time a pidfs dentry is allocated for a struct pid a struct
pidfs_attr will be allocated which will be used to store exit and
coredump information.

If all pidfds for the pidfs dentry are closed the dentry and inode can
be cleaned up but the struct pidfs_attr will stick around until the
struct pid itself is freed. This ensures minimal memory usage while
persisting relevant information.

The new scheme has various advantages. First, it allows us to close the
race where we end up handing out a pidfd for a reaped task for which no
exit information is available. Second, it minimizes memory usage.
Third, it allows us to remove the complex lifetime tracking via dentries
when registering a struct pid with pidfs. There's no need to get or put
a reference. Instead, the lifetime of exit and coredump information
associated with a struct pid is bound to the lifetime of struct pid
itself.

Now that we have a way to persist information for pidfs dentries we can
start supporting extended attributes on pidfds. This will allow
userspace to attach meta information to tasks.

One natural extension would be to introduce a custom pidfs.* extended
attribute space and allow for the inheritance of extended attributes
across fork() and exec().

The first simple scheme will allow privileged userspace to set trusted
extended attributes on pidfs inodes.

* patches from https://lore.kernel.org/20250618-work-pidfs-persistent-v2-0-98f3456fd552@kernel.org:
pidfs: add some CONFIG_DEBUG_VFS asserts
selftests/pidfd: test setattr support
selftests/pidfd: test extended attribute support
pidfs: support xattrs on pidfds
pidfs: make inodes mutable
libfs: prepare to allow for non-immutable pidfd inodes
pidfs: remove pidfs_pid_valid()
pidfs: remove pidfs_{get,put}_pid()
pidfs: remove custom inode allocation
pidfs: remove unused members from struct pidfs_inode
pidfs: persist information
pidfs: move to anonymous struct
libfs: massage path_from_stashed()
libfs: massage path_from_stashed() to allow custom stashing behavior
pidfs: raise SB_I_NODEV and SB_I_NOEXEC

Link: https://lore.kernel.org/20250618-work-pidfs-persistent-v2-0-98f3456fd552@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>

+483 -217
-6
fs/coredump.c
···
 898  898  	retval = kernel_connect(socket, (struct sockaddr *)(&addr),
 899  899  				addr_len, O_NONBLOCK | SOCK_COREDUMP);
 900  900  
 901       -	/*
 902       -	 * ... Make sure to only put our reference after connect() took
 903       -	 * its own reference keeping the pidfs entry alive ...
 904       -	 */
 905       -	pidfs_put_pid(cprm.pid);
 906       -
 907  901  	if (retval) {
 908  902  		if (retval == -EAGAIN)
 909  903  			coredump_report_failure("Coredump socket %s receive queue full", addr.sun_path);
+3
fs/internal.h
···
 322  322  struct mnt_idmap *mnt_idmap_get(struct mnt_idmap *idmap);
 323  323  void mnt_idmap_put(struct mnt_idmap *idmap);
 324  324  struct stashed_operations {
      325  +	struct dentry *(*stash_dentry)(struct dentry **stashed,
      326  +				       struct dentry *dentry);
 325  327  	void (*put_data)(void *data);
 326  328  	int (*init_inode)(struct inode *inode, void *data);
 327  329  };
 328  330  int path_from_stashed(struct dentry **stashed, struct vfsmount *mnt, void *data,
 329  331  		      struct path *path);
 330  332  void stashed_dentry_prune(struct dentry *dentry);
      333  +struct dentry *stash_dentry(struct dentry **stashed, struct dentry *dentry);
 331  334  struct dentry *stashed_dentry_get(struct dentry **stashed);
 332  335  /**
 333  336   * path_mounted - check whether path is mounted
+22 -12
fs/libfs.c
··· 2128 2128 dentry = rcu_dereference(*stashed); 2129 2129 if (!dentry) 2130 2130 return NULL; 2131 + if (IS_ERR(dentry)) 2132 + return dentry; 2131 2133 if (!lockref_get_not_dead(&dentry->d_lockref)) 2132 2134 return NULL; 2133 2135 return dentry; ··· 2162 2160 2163 2161 /* Notice when this is changed. */ 2164 2162 WARN_ON_ONCE(!S_ISREG(inode->i_mode)); 2165 - WARN_ON_ONCE(!IS_IMMUTABLE(inode)); 2166 2163 2167 2164 dentry = d_alloc_anon(sb); 2168 2165 if (!dentry) { ··· 2177 2176 return dentry; 2178 2177 } 2179 2178 2180 - static struct dentry *stash_dentry(struct dentry **stashed, 2181 - struct dentry *dentry) 2179 + struct dentry *stash_dentry(struct dentry **stashed, struct dentry *dentry) 2182 2180 { 2183 2181 guard(rcu)(); 2184 2182 for (;;) { ··· 2218 2218 int path_from_stashed(struct dentry **stashed, struct vfsmount *mnt, void *data, 2219 2219 struct path *path) 2220 2220 { 2221 - struct dentry *dentry; 2221 + struct dentry *dentry, *res; 2222 2222 const struct stashed_operations *sops = mnt->mnt_sb->s_fs_info; 2223 2223 2224 2224 /* See if dentry can be reused. */ 2225 - path->dentry = stashed_dentry_get(stashed); 2226 - if (path->dentry) { 2225 + res = stashed_dentry_get(stashed); 2226 + if (IS_ERR(res)) 2227 + return PTR_ERR(res); 2228 + if (res) { 2227 2229 sops->put_data(data); 2228 - goto out_path; 2230 + goto make_path; 2229 2231 } 2230 2232 2231 2233 /* Allocate a new dentry. */ ··· 2236 2234 return PTR_ERR(dentry); 2237 2235 2238 2236 /* Added a new dentry. @data is now owned by the filesystem. 
*/ 2239 - path->dentry = stash_dentry(stashed, dentry); 2240 - if (path->dentry != dentry) 2237 + if (sops->stash_dentry) 2238 + res = sops->stash_dentry(stashed, dentry); 2239 + else 2240 + res = stash_dentry(stashed, dentry); 2241 + if (IS_ERR(res)) { 2242 + dput(dentry); 2243 + return PTR_ERR(res); 2244 + } 2245 + if (res != dentry) 2241 2246 dput(dentry); 2242 2247 2243 - out_path: 2244 - WARN_ON_ONCE(path->dentry->d_fsdata != stashed); 2245 - WARN_ON_ONCE(d_inode(path->dentry)->i_private != data); 2248 + make_path: 2249 + path->dentry = res; 2246 2250 path->mnt = mntget(mnt); 2251 + VFS_WARN_ON_ONCE(path->dentry->d_fsdata != stashed); 2252 + VFS_WARN_ON_ONCE(d_inode(path->dentry)->i_private != data); 2247 2253 return 0; 2248 2254 } 2249 2255
+242 -185
fs/pidfs.c
··· 21 21 #include <linux/utsname.h> 22 22 #include <net/net_namespace.h> 23 23 #include <linux/coredump.h> 24 + #include <linux/xattr.h> 24 25 25 26 #include "internal.h" 26 27 #include "mount.h" 27 28 28 - static struct kmem_cache *pidfs_cachep __ro_after_init; 29 + #define PIDFS_PID_DEAD ERR_PTR(-ESRCH) 30 + 31 + static struct kmem_cache *pidfs_attr_cachep __ro_after_init; 32 + static struct kmem_cache *pidfs_xattr_cachep __ro_after_init; 29 33 30 34 /* 31 35 * Stashes information that userspace needs to access even after the ··· 41 37 __u32 coredump_mask; 42 38 }; 43 39 44 - struct pidfs_inode { 40 + struct pidfs_attr { 41 + struct simple_xattrs *xattrs; 45 42 struct pidfs_exit_info __pei; 46 43 struct pidfs_exit_info *exit_info; 47 - struct inode vfs_inode; 48 44 }; 49 - 50 - static inline struct pidfs_inode *pidfs_i(struct inode *inode) 51 - { 52 - return container_of(inode, struct pidfs_inode, vfs_inode); 53 - } 54 45 55 46 static struct rb_root pidfs_ino_tree = RB_ROOT; 56 47 ··· 124 125 125 126 pid->ino = pidfs_ino_nr; 126 127 pid->stashed = NULL; 128 + pid->attr = NULL; 127 129 pidfs_ino_nr++; 128 130 129 131 write_seqcount_begin(&pidmap_lock_seq); ··· 137 137 write_seqcount_begin(&pidmap_lock_seq); 138 138 rb_erase(&pid->pidfs_node, &pidfs_ino_tree); 139 139 write_seqcount_end(&pidmap_lock_seq); 140 + } 141 + 142 + void pidfs_free_pid(struct pid *pid) 143 + { 144 + struct pidfs_attr *attr __free(kfree) = no_free_ptr(pid->attr); 145 + struct simple_xattrs *xattrs __free(kfree) = NULL; 146 + 147 + /* 148 + * Any dentry must've been wiped from the pid by now. 149 + * Otherwise there's a reference count bug. 150 + */ 151 + VFS_WARN_ON_ONCE(pid->stashed); 152 + 153 + if (IS_ERR(attr)) 154 + return; 155 + 156 + /* 157 + * Any dentry must've been wiped from the pid by now. Otherwise 158 + * there's a reference count bug. 
159 + */ 160 + VFS_WARN_ON_ONCE(pid->stashed); 161 + 162 + xattrs = attr->xattrs; 163 + if (xattrs) 164 + simple_xattrs_free(attr->xattrs, NULL); 140 165 } 141 166 142 167 #ifdef CONFIG_PROC_FS ··· 286 261 static long pidfd_info(struct file *file, unsigned int cmd, unsigned long arg) 287 262 { 288 263 struct pidfd_info __user *uinfo = (struct pidfd_info __user *)arg; 289 - struct inode *inode = file_inode(file); 290 264 struct pid *pid = pidfd_pid(file); 291 265 size_t usize = _IOC_SIZE(cmd); 292 266 struct pidfd_info kinfo = {}; 293 267 struct pidfs_exit_info *exit_info; 294 268 struct user_namespace *user_ns; 295 269 struct task_struct *task; 270 + struct pidfs_attr *attr; 296 271 const struct cred *c; 297 272 __u64 mask; 298 273 ··· 311 286 if (!pid_in_current_pidns(pid)) 312 287 return -ESRCH; 313 288 289 + attr = READ_ONCE(pid->attr); 314 290 if (mask & PIDFD_INFO_EXIT) { 315 - exit_info = READ_ONCE(pidfs_i(inode)->exit_info); 291 + exit_info = READ_ONCE(attr->exit_info); 316 292 if (exit_info) { 317 293 kinfo.mask |= PIDFD_INFO_EXIT; 318 294 #ifdef CONFIG_CGROUPS ··· 326 300 327 301 if (mask & PIDFD_INFO_COREDUMP) { 328 302 kinfo.mask |= PIDFD_INFO_COREDUMP; 329 - kinfo.coredump_mask = READ_ONCE(pidfs_i(inode)->__pei.coredump_mask); 303 + kinfo.coredump_mask = READ_ONCE(attr->__pei.coredump_mask); 330 304 } 331 305 332 306 task = get_pid_task(pid, PIDTYPE_PID); ··· 578 552 * task has been reaped which cannot happen until we're out of 579 553 * release_task(). 580 554 * 581 - * If this struct pid is referred to by a pidfd then 582 - * stashed_dentry_get() will return the dentry and inode for that struct 583 - * pid. Since we've taken a reference on it there's now an additional 584 - * reference from the exit path on it. Which is fine. We're going to put 585 - * it again in a second and we know that the pid is kept alive anyway. 555 + * If this struct pid has at least once been referred to by a pidfd then 556 + * pid->attr will be allocated. 
If not we mark the struct pid as dead so 557 + * anyone who is trying to register it with pidfs will fail to do so. 558 + * Otherwise we would hand out pidfs for reaped tasks without having 559 + * exit information available. 586 560 * 587 - * Worst case is that we've filled in the info and immediately free the 588 - * dentry and inode afterwards since the pidfd has been closed. Since 561 + * Worst case is that we've filled in the info and the pid gets freed 562 + * right away in free_pid() when no one holds a pidfd anymore. Since 589 563 * pidfs_exit() currently is placed after exit_task_work() we know that 590 - * it cannot be us aka the exiting task holding a pidfd to ourselves. 564 + * it cannot be us aka the exiting task holding a pidfd to itself. 591 565 */ 592 566 void pidfs_exit(struct task_struct *tsk) 593 567 { 594 - struct dentry *dentry; 568 + struct pid *pid = task_pid(tsk); 569 + struct pidfs_attr *attr; 570 + struct pidfs_exit_info *exit_info; 571 + #ifdef CONFIG_CGROUPS 572 + struct cgroup *cgrp; 573 + #endif 595 574 596 575 might_sleep(); 597 576 598 - dentry = stashed_dentry_get(&task_pid(tsk)->stashed); 599 - if (dentry) { 600 - struct inode *inode = d_inode(dentry); 601 - struct pidfs_exit_info *exit_info = &pidfs_i(inode)->__pei; 602 - #ifdef CONFIG_CGROUPS 603 - struct cgroup *cgrp; 604 - 605 - rcu_read_lock(); 606 - cgrp = task_dfl_cgroup(tsk); 607 - exit_info->cgroupid = cgroup_id(cgrp); 608 - rcu_read_unlock(); 609 - #endif 610 - exit_info->exit_code = tsk->exit_code; 611 - 612 - /* Ensure that PIDFD_GET_INFO sees either all or nothing. */ 613 - smp_store_release(&pidfs_i(inode)->exit_info, &pidfs_i(inode)->__pei); 614 - dput(dentry); 577 + guard(spinlock_irq)(&pid->wait_pidfd.lock); 578 + attr = pid->attr; 579 + if (!attr) { 580 + /* 581 + * No one ever held a pidfd for this struct pid. 582 + * Mark it as dead so no one can add a pidfs 583 + * entry anymore. We're about to be reaped and 584 + * so no exit information would be available. 
585 + */ 586 + pid->attr = PIDFS_PID_DEAD; 587 + return; 615 588 } 589 + 590 + /* 591 + * If @pid->attr is set someone might still legitimately hold a 592 + * pidfd to @pid or someone might concurrently still be getting 593 + * a reference to an already stashed dentry from @pid->stashed. 594 + * So defer cleaning @pid->attr until the last reference to @pid 595 + * is put 596 + */ 597 + 598 + exit_info = &attr->__pei; 599 + 600 + #ifdef CONFIG_CGROUPS 601 + rcu_read_lock(); 602 + cgrp = task_dfl_cgroup(tsk); 603 + exit_info->cgroupid = cgroup_id(cgrp); 604 + rcu_read_unlock(); 605 + #endif 606 + exit_info->exit_code = tsk->exit_code; 607 + 608 + /* Ensure that PIDFD_GET_INFO sees either all or nothing. */ 609 + smp_store_release(&attr->exit_info, &attr->__pei); 616 610 } 617 611 618 612 #ifdef CONFIG_COREDUMP ··· 640 594 { 641 595 struct pid *pid = cprm->pid; 642 596 struct pidfs_exit_info *exit_info; 643 - struct dentry *dentry; 644 - struct inode *inode; 597 + struct pidfs_attr *attr; 645 598 __u32 coredump_mask = 0; 646 599 647 - dentry = pid->stashed; 648 - if (WARN_ON_ONCE(!dentry)) 649 - return; 600 + attr = READ_ONCE(pid->attr); 650 601 651 - inode = d_inode(dentry); 652 - exit_info = &pidfs_i(inode)->__pei; 602 + VFS_WARN_ON_ONCE(!attr); 603 + VFS_WARN_ON_ONCE(attr == PIDFS_PID_DEAD); 604 + 605 + exit_info = &attr->__pei; 653 606 /* Note how we were coredumped. */ 654 607 coredump_mask = pidfs_coredump_mask(cprm->mm_flags); 655 608 /* Note that we actually did coredump. 
*/ ··· 679 634 return anon_inode_getattr(idmap, path, stat, request_mask, query_flags); 680 635 } 681 636 637 + static ssize_t pidfs_listxattr(struct dentry *dentry, char *buf, size_t size) 638 + { 639 + struct inode *inode = d_inode(dentry); 640 + struct pid *pid = inode->i_private; 641 + struct pidfs_attr *attr = pid->attr; 642 + struct simple_xattrs *xattrs; 643 + 644 + xattrs = READ_ONCE(attr->xattrs); 645 + if (!xattrs) 646 + return 0; 647 + 648 + return simple_xattr_list(inode, xattrs, buf, size); 649 + } 650 + 682 651 static const struct inode_operations pidfs_inode_operations = { 683 - .getattr = pidfs_getattr, 684 - .setattr = pidfs_setattr, 652 + .getattr = pidfs_getattr, 653 + .setattr = pidfs_setattr, 654 + .listxattr = pidfs_listxattr, 685 655 }; 686 656 687 657 static void pidfs_evict_inode(struct inode *inode) ··· 707 647 put_pid(pid); 708 648 } 709 649 710 - static struct inode *pidfs_alloc_inode(struct super_block *sb) 711 - { 712 - struct pidfs_inode *pi; 713 - 714 - pi = alloc_inode_sb(sb, pidfs_cachep, GFP_KERNEL); 715 - if (!pi) 716 - return NULL; 717 - 718 - memset(&pi->__pei, 0, sizeof(pi->__pei)); 719 - pi->exit_info = NULL; 720 - 721 - return &pi->vfs_inode; 722 - } 723 - 724 - static void pidfs_free_inode(struct inode *inode) 725 - { 726 - kmem_cache_free(pidfs_cachep, pidfs_i(inode)); 727 - } 728 - 729 650 static const struct super_operations pidfs_sops = { 730 - .alloc_inode = pidfs_alloc_inode, 731 651 .drop_inode = generic_delete_inode, 732 652 .evict_inode = pidfs_evict_inode, 733 - .free_inode = pidfs_free_inode, 734 653 .statfs = simple_statfs, 735 654 }; 736 655 ··· 809 770 if (ret < 0) 810 771 return ERR_PTR(ret); 811 772 773 + VFS_WARN_ON_ONCE(!pid->attr); 774 + 812 775 mntput(path.mnt); 813 776 return path.dentry; 814 777 } ··· 837 796 return 0; 838 797 } 839 798 840 - static inline bool pidfs_pid_valid(struct pid *pid, const struct path *path, 841 - unsigned int flags) 842 - { 843 - enum pid_type type; 844 - 845 - if (flags & 
PIDFD_STALE) 846 - return true; 847 - 848 - /* 849 - * Make sure that if a pidfd is created PIDFD_INFO_EXIT 850 - * information will be available. So after an inode for the 851 - * pidfd has been allocated perform another check that the pid 852 - * is still alive. If it is exit information is available even 853 - * if the task gets reaped before the pidfd is returned to 854 - * userspace. The only exception are indicated by PIDFD_STALE: 855 - * 856 - * (1) The kernel is in the middle of task creation and thus no 857 - * task linkage has been established yet. 858 - * (2) The caller knows @pid has been registered in pidfs at a 859 - * time when the task was still alive. 860 - * 861 - * In both cases exit information will have been reported. 862 - */ 863 - if (flags & PIDFD_THREAD) 864 - type = PIDTYPE_PID; 865 - else 866 - type = PIDTYPE_TGID; 867 - 868 - /* 869 - * Since pidfs_exit() is called before struct pid's task linkage 870 - * is removed the case where the task got reaped but a dentry 871 - * was already attached to struct pid and exit information was 872 - * recorded and published can be handled correctly. 873 - */ 874 - if (unlikely(!pid_has_task(pid, type))) { 875 - struct inode *inode = d_inode(path->dentry); 876 - return !!READ_ONCE(pidfs_i(inode)->exit_info); 877 - } 878 - 879 - return true; 880 - } 881 - 882 799 static struct file *pidfs_export_open(struct path *path, unsigned int oflags) 883 800 { 884 - if (!pidfs_pid_valid(d_inode(path->dentry)->i_private, path, oflags)) 885 - return ERR_PTR(-ESRCH); 886 - 887 801 /* 888 802 * Clear O_LARGEFILE as open_by_handle_at() forces it and raise 889 803 * O_RDWR as pidfds always are. ··· 860 864 861 865 inode->i_private = data; 862 866 inode->i_flags |= S_PRIVATE | S_ANON_INODE; 867 + /* We allow to set xattrs. 
*/ 868 + inode->i_flags &= ~S_IMMUTABLE; 863 869 inode->i_mode |= S_IRWXU; 864 870 inode->i_op = &pidfs_inode_operations; 865 871 inode->i_fop = &pidfs_file_operations; ··· 876 878 put_pid(pid); 877 879 } 878 880 881 + /** 882 + * pidfs_register_pid - register a struct pid in pidfs 883 + * @pid: pid to pin 884 + * 885 + * Register a struct pid in pidfs. 886 + * 887 + * Return: On success zero, on error a negative error code is returned. 888 + */ 889 + int pidfs_register_pid(struct pid *pid) 890 + { 891 + struct pidfs_attr *new_attr __free(kfree) = NULL; 892 + struct pidfs_attr *attr; 893 + 894 + might_sleep(); 895 + 896 + if (!pid) 897 + return 0; 898 + 899 + attr = READ_ONCE(pid->attr); 900 + if (unlikely(attr == PIDFS_PID_DEAD)) 901 + return PTR_ERR(PIDFS_PID_DEAD); 902 + if (attr) 903 + return 0; 904 + 905 + new_attr = kmem_cache_zalloc(pidfs_attr_cachep, GFP_KERNEL); 906 + if (!new_attr) 907 + return -ENOMEM; 908 + 909 + /* Synchronize with pidfs_exit(). */ 910 + guard(spinlock_irq)(&pid->wait_pidfd.lock); 911 + 912 + attr = pid->attr; 913 + if (unlikely(attr == PIDFS_PID_DEAD)) 914 + return PTR_ERR(PIDFS_PID_DEAD); 915 + if (unlikely(attr)) 916 + return 0; 917 + 918 + pid->attr = no_free_ptr(new_attr); 919 + return 0; 920 + } 921 + 922 + static struct dentry *pidfs_stash_dentry(struct dentry **stashed, 923 + struct dentry *dentry) 924 + { 925 + int ret; 926 + struct pid *pid = d_inode(dentry)->i_private; 927 + 928 + VFS_WARN_ON_ONCE(stashed != &pid->stashed); 929 + 930 + ret = pidfs_register_pid(pid); 931 + if (ret) 932 + return ERR_PTR(ret); 933 + 934 + return stash_dentry(stashed, dentry); 935 + } 936 + 879 937 static const struct stashed_operations pidfs_stashed_ops = { 880 - .init_inode = pidfs_init_inode, 881 - .put_data = pidfs_put_data, 938 + .stash_dentry = pidfs_stash_dentry, 939 + .init_inode = pidfs_init_inode, 940 + .put_data = pidfs_put_data, 941 + }; 942 + 943 + static int pidfs_xattr_get(const struct xattr_handler *handler, 944 + struct dentry 
*unused, struct inode *inode, 945 + const char *suffix, void *value, size_t size) 946 + { 947 + struct pid *pid = inode->i_private; 948 + struct pidfs_attr *attr = pid->attr; 949 + const char *name; 950 + struct simple_xattrs *xattrs; 951 + 952 + xattrs = READ_ONCE(attr->xattrs); 953 + if (!xattrs) 954 + return 0; 955 + 956 + name = xattr_full_name(handler, suffix); 957 + return simple_xattr_get(xattrs, name, value, size); 958 + } 959 + 960 + static int pidfs_xattr_set(const struct xattr_handler *handler, 961 + struct mnt_idmap *idmap, struct dentry *unused, 962 + struct inode *inode, const char *suffix, 963 + const void *value, size_t size, int flags) 964 + { 965 + struct pid *pid = inode->i_private; 966 + struct pidfs_attr *attr = pid->attr; 967 + const char *name; 968 + struct simple_xattrs *xattrs; 969 + struct simple_xattr *old_xattr; 970 + 971 + /* Ensure we're the only one to set @attr->xattrs. */ 972 + WARN_ON_ONCE(!inode_is_locked(inode)); 973 + 974 + xattrs = READ_ONCE(attr->xattrs); 975 + if (!xattrs) { 976 + xattrs = kmem_cache_zalloc(pidfs_xattr_cachep, GFP_KERNEL); 977 + if (!xattrs) 978 + return -ENOMEM; 979 + 980 + simple_xattrs_init(xattrs); 981 + smp_store_release(&pid->attr->xattrs, xattrs); 982 + } 983 + 984 + name = xattr_full_name(handler, suffix); 985 + old_xattr = simple_xattr_set(xattrs, name, value, size, flags); 986 + if (IS_ERR(old_xattr)) 987 + return PTR_ERR(old_xattr); 988 + 989 + simple_xattr_free(old_xattr); 990 + return 0; 991 + } 992 + 993 + static const struct xattr_handler pidfs_trusted_xattr_handler = { 994 + .prefix = XATTR_TRUSTED_PREFIX, 995 + .get = pidfs_xattr_get, 996 + .set = pidfs_xattr_set, 997 + }; 998 + 999 + static const struct xattr_handler *const pidfs_xattr_handlers[] = { 1000 + &pidfs_trusted_xattr_handler, 1001 + NULL 882 1002 }; 883 1003 884 1004 static int pidfs_init_fs_context(struct fs_context *fc) ··· 1007 891 if (!ctx) 1008 892 return -ENOMEM; 1009 893 894 + fc->s_iflags |= SB_I_NOEXEC; 895 + fc->s_iflags 
|= SB_I_NODEV; 1010 896 ctx->ops = &pidfs_sops; 1011 897 ctx->eops = &pidfs_export_operations; 1012 898 ctx->dops = &pidfs_dentry_operations; 899 + ctx->xattr = pidfs_xattr_handlers; 1013 900 fc->s_fs_info = (void *)&pidfs_stashed_ops; 1014 901 return 0; 1015 902 } ··· 1040 921 if (ret < 0) 1041 922 return ERR_PTR(ret); 1042 923 1043 - if (!pidfs_pid_valid(pid, &path, flags)) 1044 - return ERR_PTR(-ESRCH); 924 + VFS_WARN_ON_ONCE(!pid->attr); 1045 925 1046 926 flags &= ~PIDFD_STALE; 1047 927 flags |= O_RDWR; ··· 1052 934 return pidfd_file; 1053 935 } 1054 936 1055 - /** 1056 - * pidfs_register_pid - register a struct pid in pidfs 1057 - * @pid: pid to pin 1058 - * 1059 - * Register a struct pid in pidfs. Needs to be paired with 1060 - * pidfs_put_pid() to not risk leaking the pidfs dentry and inode. 1061 - * 1062 - * Return: On success zero, on error a negative error code is returned. 1063 - */ 1064 - int pidfs_register_pid(struct pid *pid) 1065 - { 1066 - struct path path __free(path_put) = {}; 1067 - int ret; 1068 - 1069 - might_sleep(); 1070 - 1071 - if (!pid) 1072 - return 0; 1073 - 1074 - ret = path_from_stashed(&pid->stashed, pidfs_mnt, get_pid(pid), &path); 1075 - if (unlikely(ret)) 1076 - return ret; 1077 - /* Keep the dentry and only put the reference to the mount. */ 1078 - path.dentry = NULL; 1079 - return 0; 1080 - } 1081 - 1082 - /** 1083 - * pidfs_get_pid - pin a struct pid through pidfs 1084 - * @pid: pid to pin 1085 - * 1086 - * Similar to pidfs_register_pid() but only valid if the caller knows 1087 - * there's a reference to the @pid through a dentry already that can't 1088 - * go away. 1089 - */ 1090 - void pidfs_get_pid(struct pid *pid) 1091 - { 1092 - if (!pid) 1093 - return; 1094 - WARN_ON_ONCE(!stashed_dentry_get(&pid->stashed)); 1095 - } 1096 - 1097 - /** 1098 - * pidfs_put_pid - drop a pidfs reference 1099 - * @pid: pid to drop 1100 - * 1101 - * Drop a reference to @pid via pidfs. 
This is only safe if the 1102 - * reference has been taken via pidfs_get_pid(). 1103 - */ 1104 - void pidfs_put_pid(struct pid *pid) 1105 - { 1106 - might_sleep(); 1107 - 1108 - if (!pid) 1109 - return; 1110 - VFS_WARN_ON_ONCE(!pid->stashed); 1111 - dput(pid->stashed); 1112 - } 1113 - 1114 - static void pidfs_inode_init_once(void *data) 1115 - { 1116 - struct pidfs_inode *pi = data; 1117 - 1118 - inode_init_once(&pi->vfs_inode); 1119 - } 1120 - 1121 937 void __init pidfs_init(void) 1122 938 { 1123 - pidfs_cachep = kmem_cache_create("pidfs_cache", sizeof(struct pidfs_inode), 0, 939 + pidfs_attr_cachep = kmem_cache_create("pidfs_attr_cache", sizeof(struct pidfs_attr), 0, 1124 940 (SLAB_HWCACHE_ALIGN | SLAB_RECLAIM_ACCOUNT | 1125 - SLAB_ACCOUNT | SLAB_PANIC), 1126 - pidfs_inode_init_once); 941 + SLAB_ACCOUNT | SLAB_PANIC), NULL); 942 + 943 + pidfs_xattr_cachep = kmem_cache_create("pidfs_xattr_cache", 944 + sizeof(struct simple_xattrs), 0, 945 + (SLAB_HWCACHE_ALIGN | SLAB_RECLAIM_ACCOUNT | 946 + SLAB_ACCOUNT | SLAB_PANIC), NULL); 947 + 1127 948 pidfs_mnt = kern_mount(&pidfs_type); 1128 949 if (IS_ERR(pidfs_mnt)) 1129 950 panic("Failed to mount pidfs pseudo filesystem");
+9 -5
include/linux/pid.h
···
  47   47  
  48   48  #define RESERVED_PIDS 300
  49   49  
       50  +struct pidfs_attr;
       51  +
  50   52  struct upid {
  51   53  	int nr;
  52   54  	struct pid_namespace *ns;
  53   55  };
  54   56  
  55       -struct pid
  56       -{
       57  +struct pid {
  57   58  	refcount_t count;
  58   59  	unsigned int level;
  59   60  	spinlock_t lock;
  60       -	struct dentry *stashed;
  61       -	u64 ino;
  62       -	struct rb_node pidfs_node;
       61  +	struct {
       62  +		u64 ino;
       63  +		struct rb_node pidfs_node;
       64  +		struct dentry *stashed;
       65  +		struct pidfs_attr *attr;
       66  +	};
  63   67  	/* lists of tasks that use this pid */
  64   68  	struct hlist_head tasks[PIDTYPE_MAX];
  65   69  	struct hlist_head inodes;
+1 -2
include/linux/pidfs.h
···
  14   14  #endif
  15   15  extern const struct dentry_operations pidfs_dentry_operations;
  16   16  int pidfs_register_pid(struct pid *pid);
  17       -void pidfs_get_pid(struct pid *pid);
  18       -void pidfs_put_pid(struct pid *pid);
       17  +void pidfs_free_pid(struct pid *pid);
  19   18  
  20   19  #endif /* _LINUX_PID_FS_H */
+1 -1
kernel/pid.c
···
 100  100  
 101  101  	ns = pid->numbers[pid->level].ns;
 102  102  	if (refcount_dec_and_test(&pid->count)) {
 103       -		WARN_ON_ONCE(pid->stashed);
      103  +		pidfs_free_pid(pid);
 104  104  		kmem_cache_free(ns->pid_cachep, pid);
 105  105  		put_pid_ns(ns);
 106  106  	}
-5
net/unix/af_unix.c
···
 646  646  		return;
 647  647  	}
 648  648  
 649       -	if (sk->sk_peer_pid)
 650       -		pidfs_put_pid(sk->sk_peer_pid);
 651       -
 652  649  	if (u->addr)
 653  650  		unix_release_addr(u->addr);
 654  651  
···
 766  769  	swap(peercred->peer_pid, pid);
 767  770  	swap(peercred->peer_cred, cred);
 768  771  
 769       -	pidfs_put_pid(pid);
 770  772  	put_pid(pid);
 771  773  	put_cred(cred);
 772  774  }
···
 798  802  
 799  803  	spin_lock(&sk->sk_peer_lock);
 800  804  	sk->sk_peer_pid = get_pid(peersk->sk_peer_pid);
 801       -	pidfs_get_pid(sk->sk_peer_pid);
 802  805  	sk->sk_peer_cred = get_cred(peersk->sk_peer_cred);
 803  806  	spin_unlock(&sk->sk_peer_lock);
+2
tools/testing/selftests/pidfd/.gitignore
···
  10   10  pidfd_bind_mount
  11   11  pidfd_info_test
  12   12  pidfd_exec_helper
       13  +pidfd_xattr_test
       14  +pidfd_setattr_test
+2 -1
tools/testing/selftests/pidfd/Makefile
···
   3    3  
   4    4  TEST_GEN_PROGS := pidfd_test pidfd_fdinfo_test pidfd_open_test \
   5    5  	pidfd_poll_test pidfd_wait pidfd_getfd_test pidfd_setns_test \
   6        -	pidfd_file_handle_test pidfd_bind_mount pidfd_info_test
        6  +	pidfd_file_handle_test pidfd_bind_mount pidfd_info_test \
        7  +	pidfd_xattr_test pidfd_setattr_test
   7    8  
   8    9  TEST_GEN_PROGS_EXTENDED := pidfd_exec_helper
   9   10  
+69
tools/testing/selftests/pidfd/pidfd_setattr_test.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + 3 + #define _GNU_SOURCE 4 + #include <errno.h> 5 + #include <fcntl.h> 6 + #include <limits.h> 7 + #include <linux/types.h> 8 + #include <poll.h> 9 + #include <pthread.h> 10 + #include <sched.h> 11 + #include <signal.h> 12 + #include <stdio.h> 13 + #include <stdlib.h> 14 + #include <string.h> 15 + #include <syscall.h> 16 + #include <sys/prctl.h> 17 + #include <sys/wait.h> 18 + #include <unistd.h> 19 + #include <sys/socket.h> 20 + #include <linux/kcmp.h> 21 + #include <sys/stat.h> 22 + #include <sys/xattr.h> 23 + 24 + #include "pidfd.h" 25 + #include "../kselftest_harness.h" 26 + 27 + FIXTURE(pidfs_setattr) 28 + { 29 + pid_t child_pid; 30 + int child_pidfd; 31 + }; 32 + 33 + FIXTURE_SETUP(pidfs_setattr) 34 + { 35 + self->child_pid = create_child(&self->child_pidfd, CLONE_NEWUSER | CLONE_NEWPID); 36 + EXPECT_GE(self->child_pid, 0); 37 + 38 + if (self->child_pid == 0) 39 + _exit(EXIT_SUCCESS); 40 + } 41 + 42 + FIXTURE_TEARDOWN(pidfs_setattr) 43 + { 44 + sys_waitid(P_PID, self->child_pid, NULL, WEXITED); 45 + EXPECT_EQ(close(self->child_pidfd), 0); 46 + } 47 + 48 + TEST_F(pidfs_setattr, no_chown) 49 + { 50 + ASSERT_LT(fchown(self->child_pidfd, 1234, 5678), 0); 51 + ASSERT_EQ(errno, EOPNOTSUPP); 52 + } 53 + 54 + TEST_F(pidfs_setattr, no_chmod) 55 + { 56 + ASSERT_LT(fchmod(self->child_pidfd, 0777), 0); 57 + ASSERT_EQ(errno, EOPNOTSUPP); 58 + } 59 + 60 + TEST_F(pidfs_setattr, no_exec) 61 + { 62 + char *const argv[] = { NULL }; 63 + char *const envp[] = { NULL }; 64 + 65 + ASSERT_LT(execveat(self->child_pidfd, "", argv, envp, AT_EMPTY_PATH), 0); 66 + ASSERT_EQ(errno, EACCES); 67 + } 68 + 69 + TEST_HARNESS_MAIN
+132
tools/testing/selftests/pidfd/pidfd_xattr_test.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + 3 + #define _GNU_SOURCE 4 + #include <errno.h> 5 + #include <fcntl.h> 6 + #include <limits.h> 7 + #include <linux/types.h> 8 + #include <poll.h> 9 + #include <pthread.h> 10 + #include <sched.h> 11 + #include <signal.h> 12 + #include <stdio.h> 13 + #include <stdlib.h> 14 + #include <string.h> 15 + #include <syscall.h> 16 + #include <sys/prctl.h> 17 + #include <sys/wait.h> 18 + #include <unistd.h> 19 + #include <sys/socket.h> 20 + #include <linux/kcmp.h> 21 + #include <sys/stat.h> 22 + #include <sys/xattr.h> 23 + 24 + #include "pidfd.h" 25 + #include "../kselftest_harness.h" 26 + 27 + FIXTURE(pidfs_xattr) 28 + { 29 + pid_t child_pid; 30 + int child_pidfd; 31 + }; 32 + 33 + FIXTURE_SETUP(pidfs_xattr) 34 + { 35 + self->child_pid = create_child(&self->child_pidfd, CLONE_NEWUSER | CLONE_NEWPID); 36 + EXPECT_GE(self->child_pid, 0); 37 + 38 + if (self->child_pid == 0) 39 + _exit(EXIT_SUCCESS); 40 + } 41 + 42 + FIXTURE_TEARDOWN(pidfs_xattr) 43 + { 44 + sys_waitid(P_PID, self->child_pid, NULL, WEXITED); 45 + } 46 + 47 + TEST_F(pidfs_xattr, set_get_list_xattr_multiple) 48 + { 49 + int ret, i; 50 + char xattr_name[32]; 51 + char xattr_value[32]; 52 + char buf[32]; 53 + const int num_xattrs = 10; 54 + char list[PATH_MAX] = {}; 55 + 56 + for (i = 0; i < num_xattrs; i++) { 57 + snprintf(xattr_name, sizeof(xattr_name), "trusted.testattr%d", i); 58 + snprintf(xattr_value, sizeof(xattr_value), "testvalue%d", i); 59 + ret = fsetxattr(self->child_pidfd, xattr_name, xattr_value, strlen(xattr_value), 0); 60 + ASSERT_EQ(ret, 0); 61 + } 62 + 63 + for (i = 0; i < num_xattrs; i++) { 64 + snprintf(xattr_name, sizeof(xattr_name), "trusted.testattr%d", i); 65 + snprintf(xattr_value, sizeof(xattr_value), "testvalue%d", i); 66 + memset(buf, 0, sizeof(buf)); 67 + ret = fgetxattr(self->child_pidfd, xattr_name, buf, sizeof(buf)); 68 + ASSERT_EQ(ret, strlen(xattr_value)); 69 + ASSERT_EQ(strcmp(buf, xattr_value), 0); 70 + } 71 + 72 + ret = 
flistxattr(self->child_pidfd, list, sizeof(list)); 73 + ASSERT_GT(ret, 0); 74 + for (i = 0; i < num_xattrs; i++) { 75 + snprintf(xattr_name, sizeof(xattr_name), "trusted.testattr%d", i); 76 + bool found = false; 77 + for (char *it = list; it < list + ret; it += strlen(it) + 1) { 78 + if (strcmp(it, xattr_name)) 79 + continue; 80 + found = true; 81 + break; 82 + } 83 + ASSERT_TRUE(found); 84 + } 85 + 86 + for (i = 0; i < num_xattrs; i++) { 87 + snprintf(xattr_name, sizeof(xattr_name), "trusted.testattr%d", i); 88 + ret = fremovexattr(self->child_pidfd, xattr_name); 89 + ASSERT_EQ(ret, 0); 90 + 91 + ret = fgetxattr(self->child_pidfd, xattr_name, buf, sizeof(buf)); 92 + ASSERT_EQ(ret, -1); 93 + ASSERT_EQ(errno, ENODATA); 94 + } 95 + } 96 + 97 + TEST_F(pidfs_xattr, set_get_list_xattr_persistent) 98 + { 99 + int ret; 100 + char buf[32]; 101 + char list[PATH_MAX] = {}; 102 + 103 + ret = fsetxattr(self->child_pidfd, "trusted.persistent", "persistent value", strlen("persistent value"), 0); 104 + ASSERT_EQ(ret, 0); 105 + 106 + memset(buf, 0, sizeof(buf)); 107 + ret = fgetxattr(self->child_pidfd, "trusted.persistent", buf, sizeof(buf)); 108 + ASSERT_EQ(ret, strlen("persistent value")); 109 + ASSERT_EQ(strcmp(buf, "persistent value"), 0); 110 + 111 + ret = flistxattr(self->child_pidfd, list, sizeof(list)); 112 + ASSERT_GT(ret, 0); 113 + ASSERT_EQ(strcmp(list, "trusted.persistent"), 0) 114 + 115 + ASSERT_EQ(close(self->child_pidfd), 0); 116 + self->child_pidfd = -EBADF; 117 + sleep(2); 118 + 119 + self->child_pidfd = sys_pidfd_open(self->child_pid, 0); 120 + ASSERT_GE(self->child_pidfd, 0); 121 + 122 + memset(buf, 0, sizeof(buf)); 123 + ret = fgetxattr(self->child_pidfd, "trusted.persistent", buf, sizeof(buf)); 124 + ASSERT_EQ(ret, strlen("persistent value")); 125 + ASSERT_EQ(strcmp(buf, "persistent value"), 0); 126 + 127 + ret = flistxattr(self->child_pidfd, list, sizeof(list)); 128 + ASSERT_GT(ret, 0); 129 + ASSERT_EQ(strcmp(list, "trusted.persistent"), 0); 130 + } 131 + 132 
+ TEST_HARNESS_MAIN