Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

ns: add active reference count

The namespace tree is, among other things, currently used to support
file handles for namespaces. When a namespace is created it is placed on
the namespace trees and when it is destroyed it is removed from the
namespace trees.

While a namespace is on the namespace trees with a valid reference count
it is possible to reopen it through a namespace file handle. This is all
fine but has some issues that should be addressed.

On current kernels a namespace is visible to userspace in the
following cases:

(1) The namespace is in use by a task.
(2) The namespace is persisted through a VFS object (namespace file
    descriptor or bind-mount).
    Note that (2) only cares about direct persistence of the namespace
    itself, not indirect persistence via e.g., file->f_cred file
    references or similar.
(3) The namespace is a hierarchical namespace type and is the parent of
    one or more child namespaces.

Case (3) is interesting because a parent namespace might fulfill neither
(1) nor (2), i.e., be invisible to userspace, and yet it may still be
resurrected through the NS_GET_PARENT ioctl().

Currently namespace file handles allow much broader access to namespaces
than what is currently possible via (1)-(3). The reason is that
namespaces may remain pinned for completely internal reasons yet are
inaccessible to userspace.

For example, a user namespace may remain pinned by get_cred() calls that
stash the opener's credentials in file->f_cred. As it stands, file
handles allow resurrecting such a user namespace even though this
should not be possible via (1)-(3). This is a fundamental uapi change
that we shouldn't make if we don't have to.

Consider the following insane case: Various architectures support the
CONFIG_MMU_LAZY_TLB_REFCOUNT option which uses lazy TLB destruction.
When this option is set a userspace task's struct mm_struct may be used
for kernel threads such as the idle task and will only be destroyed once
the cpu's runqueue switches back to another task. But because of ptrace()
permission checks struct mm_struct stashes the user namespace of the
task that struct mm_struct originally belonged to. The kernel thread
will take a reference on the struct mm_struct and thus pin it.

So on an idle system user namespaces can be persisted for arbitrary
amounts of time, which also means that they can be resurrected using
namespace file handles. That makes no sense whatsoever. The problem is
of course exacerbated on large systems with a huge number of cpus.

To handle this nicely we introduce an active reference count which
tracks (1)-(3). This is easy to do as all of these things are already
managed centrally. Only (1)-(3) will count towards the active reference
count and only namespaces which are active may be opened via namespace
file handles.
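
The split between the two counters can be sketched in userspace with
C11 atomics. This is only a model under stated assumptions: struct
ns_model and the model_*() helpers are hypothetical stand-ins and not
the kernel's struct ns_common or its helpers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a namespace object. */
struct ns_model {
	atomic_int ref;		/* lifetime: object is freed at zero */
	atomic_int ref_active;	/* visibility: counts (1)-(3) only */
};

/* Take a lifetime reference unless the count already dropped to zero. */
static bool model_ref_get(struct ns_model *ns)
{
	int old = atomic_load(&ns->ref);

	while (old != 0) {
		/* On failure @old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&ns->ref, &old, old + 1))
			return true;
	}
	return false;
}

/*
 * File-handle style lookup: succeed only for active namespaces. On
 * success the caller owns a lifetime reference, not an active one.
 */
static struct ns_model *model_get_unless_inactive(struct ns_model *ns)
{
	if (atomic_load(&ns->ref_active) == 0)
		return NULL;
	if (!model_ref_get(ns))
		return NULL;
	return ns;
}

static void model_demo(void)
{
	struct ns_model ns;

	/* Pinned internally (ref == 1) but not active: lookup fails. */
	atomic_init(&ns.ref, 1);
	atomic_init(&ns.ref_active, 0);
	assert(model_get_unless_inactive(&ns) == NULL);

	/* Once something from (1)-(3) holds it, lookup succeeds. */
	atomic_store(&ns.ref_active, 1);
	assert(model_get_unless_inactive(&ns) == &ns);
	assert(atomic_load(&ns.ref) == 2);
}
```

The point of the model is that a non-zero lifetime count alone is not
enough for a lookup to succeed.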

The problem is that namespaces may be resurrected, which means that they
can become temporarily inactive and be reactivated some time later.
Currently the only example of this is the SIOCGSKNS socket ioctl. The
SIOCGSKNS ioctl allows opening a network namespace file descriptor based
on a socket file descriptor.

If a socket is tied to a network namespace that subsequently becomes
inactive but that socket is persisted by another process in another
network namespace (e.g., via SCM_RIGHTS or pidfd_getfd()) then the
SIOCGSKNS ioctl will resurrect this network namespace.
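
For reference, a minimal userspace sketch of the SIOCGSKNS path
described above. The helper names netns_fd_from_socket(), same_ns() and
sockns_demo() are hypothetical. Note that the ioctl is privileged, so an
unprivileged caller should expect it to fail with EPERM.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <unistd.h>
#include <linux/sockios.h>	/* SIOCGSKNS */

/*
 * Hypothetical helper: return a file descriptor for the network
 * namespace that @sockfd belongs to, or -1 with errno set.
 */
static int netns_fd_from_socket(int sockfd)
{
	return ioctl(sockfd, SIOCGSKNS);
}

/*
 * Two namespace file descriptors refer to the same namespace if their
 * device and inode numbers match.
 */
static bool same_ns(int fd_a, int fd_b)
{
	struct stat a, b;

	if (fstat(fd_a, &a) < 0 || fstat(fd_b, &b) < 0)
		return false;
	return a.st_dev == b.st_dev && a.st_ino == b.st_ino;
}

/* An invalid socket descriptor can never yield a namespace. */
static void sockns_demo(void)
{
	assert(netns_fd_from_socket(-1) == -1);
	assert(errno == EBADF);
}
```

A receiver of a socket passed via SCM_RIGHTS could call
netns_fd_from_socket() on it and compare the result against its own
/proc/self/ns/net with same_ns() to see whether the socket lives in a
foreign network namespace.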

So calls to open_related_ns() and open_namespace() will end up
resurrecting the corresponding namespace tree.
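
The cascade along the ownership chain can be modelled in userspace as
follows. This is a sketch under stated assumptions: struct ns_node and
the node_*() helpers are hypothetical, and a NULL owner stands in for
the always-active initial user namespace, which is never counted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical model of a namespace and its owning user namespace. */
struct ns_node {
	struct ns_node *owner;	/* NULL: owned by the initial user ns */
	atomic_int active;
};

/* Drop one active reference; continue upwards while counts hit zero. */
static void node_active_put(struct ns_node *ns)
{
	while (ns && atomic_fetch_sub(&ns->active, 1) == 1)
		ns = ns->owner;
}

/* Take one active reference; continue upwards while reviving nodes. */
static void node_active_resurrect(struct ns_node *ns)
{
	while (ns && atomic_fetch_add(&ns->active, 1) == 0)
		ns = ns->owner;
}

static void node_demo(void)
{
	/* user_ns2 is pinned by two other children and by user_ns1. */
	struct ns_node user_ns2 = { .owner = NULL, .active = 3 };
	struct ns_node user_ns1 = { .owner = &user_ns2, .active = 2 };
	struct ns_node net_ns = { .owner = &user_ns1, .active = 1 };
	struct ns_node pid_ns = { .owner = &user_ns1, .active = 1 };

	/* Both children go inactive: user_ns1 dies, user_ns2 loses one. */
	node_active_put(&net_ns);
	node_active_put(&pid_ns);
	assert(atomic_load(&user_ns1.active) == 0);
	assert(atomic_load(&user_ns2.active) == 2);

	/* Resurrecting net_ns revives user_ns1 and re-pins user_ns2. */
	node_active_resurrect(&net_ns);
	assert(atomic_load(&net_ns.active) == 1);
	assert(atomic_load(&user_ns1.active) == 1);
	assert(atomic_load(&user_ns2.active) == 3);
}
```

Both walks stop at the first owner whose count was already non-zero,
so the cost is bounded by the nesting depth of user namespaces.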

Note that the active reference count does not regulate the lifetime of
the namespace itself. This is still done by the normal reference count.
The active reference count can only be elevated if the regular reference
count is elevated.

The active reference count also doesn't regulate the presence of a
namespace on the namespace trees. It only regulates its visibility to
namespace file handles (and in later patches to listns()).

A namespace remains on the namespace trees from creation until its
actual destruction. This will allow the kernel to always reach any
namespace trivially and it will also enable subsystems like bpf to walk
the namespace lists on the system for tracing or general introspection
purposes.

Note that different namespaces have different visibility lifetimes on
current kernels. While most namespaces are immediately released when the
last task using them exits, the user- and pid namespace are persisted
and thus both remain accessible via /proc/<pid>/ns/<ns_type>.

The user namespace lifetime is aligned with struct cred and is only
released through exit_creds(). However, it becomes inaccessible to
userspace once the last task using it is reaped, i.e., when
release_task() is called and all proc entries are flushed. Similarly,
the pid namespace is also visible until the last task using it has been
reaped and the associated pid numbers are freed.

The active reference counts of the user- and pid namespace are
decremented once the task is reaped.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-11-2e6f823ebdc0@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>

+449 -4
+47 -1
fs/nsfs.c
···
 static void nsfs_evict(struct inode *inode)
 {
 	struct ns_common *ns = inode->i_private;
+
+	__ns_ref_active_put(ns);
 	clear_inode(inode);
 	ns->ops->put(ns);
 }
···
 	inode->i_mode |= S_IRUGO;
 	inode->i_fop = &ns_file_operations;
 	inode->i_ino = ns->inum;
+
+	/*
+	 * Bring the namespace subtree back to life if we have to. This
+	 * can happen when e.g., all processes using a network namespace
+	 * and all namespace files or namespace file bind-mounts have
+	 * died but there are still sockets pinning it. The SIOCGSKNS
+	 * ioctl on such a socket will resurrect the relevant namespace
+	 * subtree.
+	 */
+	__ns_ref_active_resurrect(ns);
 	return 0;
 }

···
 		if (ns->inum != fid->ns_inum)
 			return NULL;

-		if (!__ns_ref_get(ns))
+		/*
+		 * This is racy because we're not actually taking an
+		 * active reference. IOW, it could happen that the
+		 * namespace becomes inactive after this check.
+		 * We don't care because nsfs_init_inode() will just
+		 * resurrect the relevant namespace tree for us. If it
+		 * has been active here we just allow it's resurrection.
+		 * We could try to take an active reference here and
+		 * then drop it again. But really, why bother.
+		 */
+		if (!ns_get_unless_inactive(ns))
 			return NULL;
 	}

···
 	nsfs_mnt->mnt_sb->s_flags &= ~SB_NOUSER;
 	nsfs_root_path.mnt = nsfs_mnt;
 	nsfs_root_path.dentry = nsfs_mnt->mnt_root;
+}
+
+void nsproxy_ns_active_get(struct nsproxy *ns)
+{
+	ns_ref_active_get(ns->mnt_ns);
+	ns_ref_active_get(ns->uts_ns);
+	ns_ref_active_get(ns->ipc_ns);
+	ns_ref_active_get(ns->pid_ns_for_children);
+	ns_ref_active_get(ns->cgroup_ns);
+	ns_ref_active_get(ns->net_ns);
+	ns_ref_active_get(ns->time_ns);
+	ns_ref_active_get(ns->time_ns_for_children);
+}
+
+void nsproxy_ns_active_put(struct nsproxy *ns)
+{
+	ns_ref_active_put(ns->mnt_ns);
+	ns_ref_active_put(ns->uts_ns);
+	ns_ref_active_put(ns->ipc_ns);
+	ns_ref_active_put(ns->pid_ns_for_children);
+	ns_ref_active_put(ns->cgroup_ns);
+	ns_ref_active_put(ns->net_ns);
+	ns_ref_active_put(ns->time_ns);
+	ns_ref_active_put(ns->time_ns_for_children);
 }
+139 -2
include/linux/ns_common.h
···

 #include <linux/refcount.h>
 #include <linux/rbtree.h>
+#include <linux/vfsdebug.h>
 #include <uapi/linux/sched.h>
+#include <uapi/linux/nsfs.h>

 struct proc_ns_operations;

···
 extern const struct proc_ns_operations timens_operations;
 extern const struct proc_ns_operations timens_for_children_operations;

+/*
+ * Namespace lifetimes are managed via a two-tier reference counting model:
+ *
+ * (1) __ns_ref (refcount_t): Main reference count tracking memory
+ *     lifetime. Controls when the namespace structure itself is freed.
+ *     It also pins the namespace on the namespace trees whereas (2)
+ *     only regulates their visibility to userspace.
+ *
+ * (2) __ns_ref_active (atomic_t): Reference count tracking active users.
+ *     Controls visibility of the namespace in the namespace trees.
+ *     Any live task that uses the namespace (via nsproxy or cred) holds
+ *     an active reference. Any open file descriptor or bind-mount of
+ *     the namespace holds an active reference. Once all tasks have
+ *     called exited their namespaces and all file descriptors and
+ *     bind-mounts have been released the active reference count drops
+ *     to zero and the namespace becomes inactive. IOW, the namespace
+ *     cannot be listed or opened via file handles anymore.
+ *
+ *     Note that it is valid to transition from active to inactive and
+ *     back from inactive to active e.g., when resurrecting an inactive
+ *     namespace tree via the SIOCGSKNS ioctl().
+ *
+ * Relationship and lifecycle states:
+ *
+ * - Active (__ns_ref_active > 0):
+ *   Namespace is actively used and visible to userspace. The namespace
+ *   can be reopened via /proc/<pid>/ns/<ns_type>, via namespace file
+ *   handles, or discovered via listns().
+ *
+ * - Inactive (__ns_ref_active == 0, __ns_ref > 0):
+ *   No tasks are actively using the namespace and it isn't pinned by
+ *   any bind-mounts or open file descriptors anymore. But the namespace
+ *   is still kept alive by internal references. For example, the user
+ *   namespace could be pinned by an open file through file->f_cred
+ *   references when one of the now defunct tasks had opened a file and
+ *   handed the file descriptor off to another process via a UNIX
+ *   sockets. Such references keep the namespace structure alive through
+ *   __ns_ref but will not hold an active reference.
+ *
+ * - Destroyed (__ns_ref == 0):
+ *   No references remain. The namespace is removed from the tree and freed.
+ *
+ * State transitions:
+ *
+ * Active -> Inactive:
+ *   When the last task using the namespace exits it drops its active
+ *   references to all namespaces. However, user and pid namespaces
+ *   remain accessible until the task has been reaped.
+ *
+ * Inactive -> Active:
+ *   An inactive namespace tree might be resurrected due to e.g., the
+ *   SIOCGSKNS ioctl() on a socket.
+ *
+ * Inactive -> Destroyed:
+ *   When __ns_ref drops to zero the namespace is removed from the
+ *   namespaces trees and the memory is freed (after RCU grace period).
+ *
+ * Initial namespaces:
+ *   Boot-time namespaces (init_net, init_pid_ns, etc.) start with
+ *   __ns_ref_active = 1 and remain active forever.
+ */
 struct ns_common {
 	u32 ns_type;
 	struct dentry *stashed;
···
 			u64 ns_id;
 			struct rb_node ns_tree_node;
 			struct list_head ns_list_node;
+			atomic_t __ns_ref_active; /* do not use directly */
 		};
 		struct rcu_head ns_rcu;
 	};
···

 int __ns_common_init(struct ns_common *ns, u32 ns_type, const struct proc_ns_operations *ops, int inum);
 void __ns_common_free(struct ns_common *ns);
+
+static __always_inline bool is_initial_namespace(struct ns_common *ns)
+{
+	VFS_WARN_ON_ONCE(ns->inum == 0);
+	return unlikely(in_range(ns->inum, MNT_NS_INIT_INO,
+				 IPC_NS_INIT_INO - MNT_NS_INIT_INO + 1));
+}

 #define to_ns_common(__ns) \
 	_Generic((__ns), \
···
 	.ops = to_ns_operations(&nsname), \
 	.stashed = NULL, \
 	.__ns_ref = REFCOUNT_INIT(refs), \
+	.__ns_ref_active = ATOMIC_INIT(1), \
 	.ns_list_node = LIST_HEAD_INIT(nsname.ns.ns_list_node), \
 }

···

 #define ns_common_free(__ns) __ns_common_free(to_ns_common((__ns)))

+static __always_inline __must_check int __ns_ref_active_read(const struct ns_common *ns)
+{
+	return atomic_read(&ns->__ns_ref_active);
+}
+
 static __always_inline __must_check bool __ns_ref_put(struct ns_common *ns)
 {
-	return refcount_dec_and_test(&ns->__ns_ref);
+	if (refcount_dec_and_test(&ns->__ns_ref)) {
+		VFS_WARN_ON_ONCE(__ns_ref_active_read(ns));
+		return true;
+	}
+	return false;
 }

 static __always_inline __must_check bool __ns_ref_get(struct ns_common *ns)
 {
-	return refcount_inc_not_zero(&ns->__ns_ref);
+	if (refcount_inc_not_zero(&ns->__ns_ref))
+		return true;
+	VFS_WARN_ON_ONCE(__ns_ref_active_read(ns));
+	return false;
 }

 static __always_inline __must_check int __ns_ref_read(const struct ns_common *ns)
···
 #define ns_ref_put(__ns) __ns_ref_put(to_ns_common((__ns)))
 #define ns_ref_put_and_lock(__ns, __lock) \
 	refcount_dec_and_lock(&to_ns_common((__ns))->__ns_ref, (__lock))
+
+#define ns_ref_active_read(__ns) \
+	((__ns) ? __ns_ref_active_read(to_ns_common(__ns)) : 0)
+
+void __ns_ref_active_get_owner(struct ns_common *ns);
+
+static __always_inline void __ns_ref_active_get(struct ns_common *ns)
+{
+	WARN_ON_ONCE(atomic_add_negative(1, &ns->__ns_ref_active));
+	VFS_WARN_ON_ONCE(is_initial_namespace(ns) && __ns_ref_active_read(ns) <= 0);
+}
+#define ns_ref_active_get(__ns) \
+	do { if (__ns) __ns_ref_active_get(to_ns_common(__ns)); } while (0)
+
+static __always_inline bool __ns_ref_active_get_not_zero(struct ns_common *ns)
+{
+	if (atomic_inc_not_zero(&ns->__ns_ref_active)) {
+		VFS_WARN_ON_ONCE(!__ns_ref_read(ns));
+		return true;
+	}
+	return false;
+}
+
+#define ns_ref_active_get_owner(__ns) \
+	do { if (__ns) __ns_ref_active_get_owner(to_ns_common(__ns)); } while (0)
+
+void __ns_ref_active_put_owner(struct ns_common *ns);
+
+static __always_inline void __ns_ref_active_put(struct ns_common *ns)
+{
+	if (atomic_dec_and_test(&ns->__ns_ref_active)) {
+		VFS_WARN_ON_ONCE(is_initial_namespace(ns));
+		VFS_WARN_ON_ONCE(!__ns_ref_read(ns));
+		__ns_ref_active_put_owner(ns);
+	}
+}
+#define ns_ref_active_put(__ns) \
+	do { if (__ns) __ns_ref_active_put(to_ns_common(__ns)); } while (0)
+
+static __always_inline struct ns_common *__must_check ns_get_unless_inactive(struct ns_common *ns)
+{
+	VFS_WARN_ON_ONCE(__ns_ref_active_read(ns) && !__ns_ref_read(ns));
+	if (!__ns_ref_active_read(ns))
+		return NULL;
+	if (!__ns_ref_get(ns))
+		return NULL;
+	return ns;
+}
+
+void __ns_ref_active_resurrect(struct ns_common *ns);
+
+#define ns_ref_active_resurrect(__ns) \
+	do { if (__ns) __ns_ref_active_resurrect(to_ns_common(__ns)); } while (0)

 #endif
+3
include/linux/nsfs.h
···

 #define current_in_namespace(__ns) (__current_namespace_from_type(__ns) == __ns)

+void nsproxy_ns_active_get(struct nsproxy *ns);
+void nsproxy_ns_active_put(struct nsproxy *ns);
+
 #endif /* _LINUX_NSFS_H */
+3
include/linux/nsproxy.h
···
  */

 int copy_namespaces(u64 flags, struct task_struct *tsk);
+void switch_cred_namespaces(const struct cred *old, const struct cred *new);
 void exit_nsproxy_namespaces(struct task_struct *tsk);
+void get_cred_namespaces(struct task_struct *tsk);
+void exit_cred_namespaces(struct task_struct *tsk);
 void switch_task_namespaces(struct task_struct *tsk, struct nsproxy *new);
 int exec_task_namespaces(void);
 void free_nsproxy(struct nsproxy *ns);
+6
kernel/cred.c
···
 		kdebug("share_creds(%p{%ld})",
 		       p->cred, atomic_long_read(&p->cred->usage));
 		inc_rlimit_ucounts(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1);
+		get_cred_namespaces(p);
 		return 0;
 	}

···

 	p->cred = p->real_cred = get_cred(new);
 	inc_rlimit_ucounts(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1);
+	get_cred_namespaces(p);
+
 	return 0;

 error_put:
···
 	 */
 	if (new->user != old->user || new->user_ns != old->user_ns)
 		inc_rlimit_ucounts(new->ucounts, UCOUNT_RLIMIT_NPROC, 1);
+
 	rcu_assign_pointer(task->real_cred, new);
 	rcu_assign_pointer(task->cred, new);
 	if (new->user != old->user || new->user_ns != old->user_ns)
 		dec_rlimit_ucounts(old->ucounts, UCOUNT_RLIMIT_NPROC, 1);
+	if (new->user_ns != old->user_ns)
+		switch_cred_namespaces(old, new);

 	/* send notifications */
 	if (!uid_eq(new->uid, old->uid) ||
+1
kernel/exit.c
···
 	write_unlock_irq(&tasklist_lock);
 	/* @thread_pid can't go away until free_pids() below */
 	proc_flush_pid(thread_pid);
+	exit_cred_namespaces(p);
 	add_device_randomness(&p->se.sum_exec_runtime,
 			      sizeof(p->se.sum_exec_runtime));
 	free_pids(post.pids);
+1
kernel/fork.c
···
 	delayacct_tsk_free(p);
 bad_fork_cleanup_count:
 	dec_rlimit_ucounts(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1);
+	exit_cred_namespaces(p);
 	exit_creds(p);
 bad_fork_free:
 	WRITE_ONCE(p->__state, TASK_DEAD);
+213 -1
kernel/nscommon.c
···

 #include <linux/ns_common.h>
 #include <linux/proc_ns.h>
+#include <linux/user_namespace.h>
 #include <linux/vfsdebug.h>

 #ifdef CONFIG_DEBUG_VFS
···

 int __ns_common_init(struct ns_common *ns, u32 ns_type, const struct proc_ns_operations *ops, int inum)
 {
+	int ret;
+
 	refcount_set(&ns->__ns_ref, 1);
 	ns->stashed = NULL;
 	ns->ops = ops;
···
 		ns->inum = inum;
 		return 0;
 	}
-	return proc_alloc_inum(&ns->inum);
+	ret = proc_alloc_inum(&ns->inum);
+	if (ret)
+		return ret;
+	/*
+	 * Tree ref starts at 0. It's incremented when namespace enters
+	 * active use (installed in nsproxy) and decremented when all
+	 * active uses are gone. Initial namespaces are always active.
+	 */
+	if (is_initial_namespace(ns))
+		atomic_set(&ns->__ns_ref_active, 1);
+	else
+		atomic_set(&ns->__ns_ref_active, 0);
+	return 0;
 }

 void __ns_common_free(struct ns_common *ns)
 {
 	proc_free_inum(ns->inum);
+}
+
+static struct ns_common *ns_owner(struct ns_common *ns)
+{
+	struct user_namespace *owner;
+
+	if (unlikely(!ns->ops))
+		return NULL;
+	VFS_WARN_ON_ONCE(!ns->ops->owner);
+	owner = ns->ops->owner(ns);
+	VFS_WARN_ON_ONCE(!owner && ns != to_ns_common(&init_user_ns));
+	if (!owner)
+		return NULL;
+	/* Skip init_user_ns as it's always active */
+	if (owner == &init_user_ns)
+		return NULL;
+	return to_ns_common(owner);
+}
+
+void __ns_ref_active_get_owner(struct ns_common *ns)
+{
+	ns = ns_owner(ns);
+	if (ns)
+		WARN_ON_ONCE(atomic_add_negative(1, &ns->__ns_ref_active));
+}
+
+/*
+ * The active reference count works by having each namespace that gets
+ * created take a single active reference on its owning user namespace.
+ * That single reference is only released once the child namespace's
+ * active count itself goes down.
+ *
+ * A regular namespace tree might look as follow:
+ * Legend:
+ * + : adding active reference
+ * - : dropping active reference
+ * x : always active (initial namespace)
+ *
+ *
+ *                 net_ns          pid_ns
+ *                       \        /
+ *                        +      +
+ *                        user_ns1 (2)
+ *                            |
+ *                 ipc_ns     |     uts_ns
+ *                       \    |    /
+ *                        +   +   +
+ *                        user_ns2 (3)
+ *                            |
+ *            cgroup_ns       |       mnt_ns
+ *                     \      |      /
+ *                      x     x     x
+ *                      init_user_ns (1)
+ *
+ * If both net_ns and pid_ns put their last active reference on
+ * themselves it will cascade to user_ns1 dropping its own active
+ * reference and dropping one active reference on user_ns2:
+ *
+ *                 net_ns          pid_ns
+ *                       \        /
+ *                        -      -
+ *                        user_ns1 (0)
+ *                            |
+ *                 ipc_ns     |     uts_ns
+ *                       \    |    /
+ *                        +   -   +
+ *                        user_ns2 (2)
+ *                            |
+ *            cgroup_ns       |       mnt_ns
+ *                     \      |      /
+ *                      x     x     x
+ *                      init_user_ns (1)
+ *
+ * The iteration stops once we reach a namespace that still has active
+ * references.
+ */
+void __ns_ref_active_put_owner(struct ns_common *ns)
+{
+	for (;;) {
+		ns = ns_owner(ns);
+		if (!ns)
+			return;
+		if (!atomic_dec_and_test(&ns->__ns_ref_active))
+			return;
+	}
+}
+
+/*
+ * The active reference count works by having each namespace that gets
+ * created take a single active reference on its owning user namespace.
+ * That single reference is only released once the child namespace's
+ * active count itself goes down. This makes it possible to efficiently
+ * resurrect a namespace tree:
+ *
+ * A regular namespace tree might look as follow:
+ * Legend:
+ * + : adding active reference
+ * - : dropping active reference
+ * x : always active (initial namespace)
+ *
+ *
+ *                 net_ns          pid_ns
+ *                       \        /
+ *                        +      +
+ *                        user_ns1 (2)
+ *                            |
+ *                 ipc_ns     |     uts_ns
+ *                       \    |    /
+ *                        +   +   +
+ *                        user_ns2 (3)
+ *                            |
+ *            cgroup_ns       |       mnt_ns
+ *                     \      |      /
+ *                      x     x     x
+ *                      init_user_ns (1)
+ *
+ * If both net_ns and pid_ns put their last active reference on
+ * themselves it will cascade to user_ns1 dropping its own active
+ * reference and dropping one active reference on user_ns2:
+ *
+ *                 net_ns          pid_ns
+ *                       \        /
+ *                        -      -
+ *                        user_ns1 (0)
+ *                            |
+ *                 ipc_ns     |     uts_ns
+ *                       \    |    /
+ *                        +   -   +
+ *                        user_ns2 (2)
+ *                            |
+ *            cgroup_ns       |       mnt_ns
+ *                     \      |      /
+ *                      x     x     x
+ *                      init_user_ns (1)
+ *
+ * Assume the whole tree is dead but all namespaces are still active:
+ *
+ *                 net_ns          pid_ns
+ *                       \        /
+ *                        -      -
+ *                        user_ns1 (0)
+ *                            |
+ *                 ipc_ns     |     uts_ns
+ *                       \    |    /
+ *                        -   -   -
+ *                        user_ns2 (0)
+ *                            |
+ *            cgroup_ns       |       mnt_ns
+ *                     \      |      /
+ *                      x     x     x
+ *                      init_user_ns (1)
+ *
+ * Now assume the net_ns gets resurrected (.e.g., via the SIOCGSKNS ioctl()):
+ *
+ *                 net_ns          pid_ns
+ *                       \        /
+ *                        +      -
+ *                        user_ns1 (0)
+ *                            |
+ *                 ipc_ns     |     uts_ns
+ *                       \    |    /
+ *                        -   +   -
+ *                        user_ns2 (0)
+ *                            |
+ *            cgroup_ns       |       mnt_ns
+ *                     \      |      /
+ *                      x     x     x
+ *                      init_user_ns (1)
+ *
+ * If net_ns had a zero reference count and we bumped it we also need to
+ * take another reference on its owning user namespace. Similarly, if
+ * pid_ns had a zero reference count it also needs to take another
+ * reference on its owning user namespace. So both net_ns and pid_ns
+ * will each have their own reference on the owning user namespace.
+ *
+ * If the owning user namespace user_ns1 had a zero reference count then
+ * it also needs to take another reference on its owning user namespace
+ * and so on.
+ */
+void __ns_ref_active_resurrect(struct ns_common *ns)
+{
+	/* If we didn't resurrect the namespace we're done. */
+	if (atomic_fetch_add(1, &ns->__ns_ref_active))
+		return;
+
+	/*
+	 * We did resurrect it. Walk the ownership hierarchy upwards
+	 * until we found an owning user namespace that is active.
+	 */
+	for (;;) {
+		ns = ns_owner(ns);
+		if (!ns)
+			return;
+
+		if (atomic_fetch_add(1, &ns->__ns_ref_active))
+			return;
+	}
 }
+23
kernel/nsproxy.c
···
 #include <linux/syscalls.h>
 #include <linux/cgroup.h>
 #include <linux/perf_event.h>
+#include <linux/nstree.h>

 static struct kmem_cache *nsproxy_cachep;

···
 	if ((flags & CLONE_VM) == 0)
 		timens_on_fork(new_ns, tsk);

+	nsproxy_ns_active_get(new_ns);
 	tsk->nsproxy = new_ns;
 	return 0;
 }

 void free_nsproxy(struct nsproxy *ns)
 {
+	nsproxy_ns_active_put(ns);
+
 	put_mnt_ns(ns->mnt_ns);
 	put_uts_ns(ns->uts_ns);
 	put_ipc_ns(ns->ipc_ns);
···

 	might_sleep();

+	if (new)
+		nsproxy_ns_active_get(new);
+
 	task_lock(p);
 	ns = p->nsproxy;
 	p->nsproxy = new;
···
 void exit_nsproxy_namespaces(struct task_struct *p)
 {
 	switch_task_namespaces(p, NULL);
+}
+
+void switch_cred_namespaces(const struct cred *old, const struct cred *new)
+{
+	ns_ref_active_get(new->user_ns);
+	ns_ref_active_put(old->user_ns);
+}
+
+void get_cred_namespaces(struct task_struct *tsk)
+{
+	ns_ref_active_get(tsk->real_cred->user_ns);
+}
+
+void exit_cred_namespaces(struct task_struct *tsk)
+{
+	ns_ref_active_put(tsk->real_cred->user_ns);
 }

 int exec_task_namespaces(void)
+8
kernel/nstree.c
···
 	write_sequnlock(&ns_tree->ns_tree_lock);

 	VFS_WARN_ON_ONCE(node);
+
+	/*
+	 * Take an active reference on the owner namespace. This ensures
+	 * that the owner remains visible while any of its child namespaces
+	 * are active. For init namespaces this is a no-op as ns_owner()
+	 * returns NULL for namespaces owned by init_user_ns.
+	 */
+	__ns_ref_active_get_owner(ns);
 }

 void __ns_tree_remove(struct ns_common *ns, struct ns_tree *ns_tree)
+5
kernel/pid.c
···
 void free_pid(struct pid *pid)
 {
 	int i;
+	struct pid_namespace *active_ns;

 	lockdep_assert_not_held(&tasklist_lock);
+
+	active_ns = pid->numbers[pid->level].ns;
+	ns_ref_active_put(active_ns);

 	spin_lock(&pidmap_lock);
 	for (i = 0; i <= pid->level; i++) {
···
 	}
 	spin_unlock(&pidmap_lock);
 	idr_preload_end();
+	ns_ref_active_get(ns);

 	return pid;
