Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

nsproxy: attach to namespaces via pidfds

For quite a while we have been thinking about using pidfds to attach to
namespaces. This patchset has existed for about a year already but we've
wanted to wait to see how the general api would be received and adopted.
Now that more and more programs in userspace have started using pidfds
for process management it's time to send this one out.

This patch makes it possible to use pidfds to attach to the namespaces
of another process, i.e. they can be passed as the first argument to the
setns() syscall. When only a single namespace type is specified the
semantics are equivalent to passing an nsfd. That means
setns(nsfd, CLONE_NEWNET) equals setns(pidfd, CLONE_NEWNET). However,
when a pidfd is passed, multiple namespace flags can be specified in the
second setns() argument and setns() will attach the caller to all the
specified namespaces all at once or to none of them. Specifying 0 is not
valid together with a pidfd.

Here are just two obvious examples:
setns(pidfd, CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET);
setns(pidfd, CLONE_NEWUSER);
Allowing callers to attach to subsets of namespaces as well supports
various use-cases where a caller attaches to one subset of namespaces to
retain privilege, performs an action, and then attaches to another subset.
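
The flag rules above can be sketched as a small userspace pre-check. This
is only an illustration: pidfd_setns_flags_ok() is a hypothetical helper,
not part of this patch, and it does not reproduce the kernel's
config-dependent checks (a namespace type compiled out of the kernel is
still rejected by the kernel itself).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>

/* Older libc headers may lack this constant; the value is from the kernel UAPI. */
#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000
#endif

/* Namespace types this patch accepts together with a pidfd. */
#define PIDFD_SETNS_ALLOWED                                         \
	(CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWNET | \
	 CLONE_NEWUSER | CLONE_NEWPID | CLONE_NEWCGROUP)

/*
 * Hypothetical pre-check: returns 0 if `flags` is an acceptable second
 * argument for setns(pidfd, flags), or -EINVAL otherwise. Zero is rejected
 * because "join whatever this fd refers to" only makes sense for an nsfd.
 */
int pidfd_setns_flags_ok(unsigned long flags)
{
	if (flags == 0 || (flags & ~(unsigned long)PIDFD_SETNS_ALLOWED))
		return -EINVAL;
	return 0;
}
```

A caller could run this check before setns(pidfd, flags) to report a bad
flag combination without entering the kernel.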

If the need arises, as Eric suggested, we can extend this patchset to
assume even more context than just attaching all namespaces. His suggestion
specifically was about assuming the process' root directory when
setns(pidfd, 0) or setns(pidfd, SETNS_PIDFD) is specified. For now, just
keep it flexible in terms of supporting subsets of namespaces but let's
wait until we have users asking for even more context to be assumed. At
that point we can add an extension.

The obvious example where this is useful is a standard container
manager interacting with a running container: pushing and pulling files
or directories, injecting mounts, attaching/execing any kind of process,
and managing network devices. All of these operations require attaching
to all or at least multiple namespaces at the same time. Given that nowadays
most containers are spawned with all namespaces enabled we're currently
looking at at least 14 syscalls, 7 to open the /proc/<pid>/ns/<ns>
nsfds, another 7 to actually perform the namespace switch. With time
namespaces we're looking at about 16 syscalls.
(We could amortize the first 7 or 8 syscalls for opening the nsfds by
stashing them in each container's monitor process but that would mean
we need to send around those file descriptors through unix sockets
every time we want to interact with the container or keep on-disk
state. Even in scenarios where a caller wants to join a particular
namespace in a particular order callers still profit from batching
other namespaces. That mostly applies to the user namespace but
all container runtimes I found join the user namespace first no matter
if it privileges or deprivileges the container similar to how unshare
behaves.)
With pidfds this becomes a single syscall no matter how many namespaces
are supposed to be attached to.
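
To make the syscall arithmetic concrete, here is a rough userspace sketch
contrasting the two approaches. The function names are made up for
illustration, the counts ignore close(), and attach_pidfd() assumes a
kernel that carries this patch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sched.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Pre-patch cost: one open() plus one setns() per namespace. */
unsigned int legacy_syscalls(unsigned int nr_namespaces)
{
	return 2 * nr_namespaces;
}

/* With a pidfd the cost no longer depends on the number of namespaces. */
unsigned int pidfd_syscalls(unsigned int nr_namespaces)
{
	(void)nr_namespaces;
	return 1;
}

/* Pre-patch approach: attach to each named namespace of `pid` in turn. */
int attach_legacy(pid_t pid, const char *const names[], size_t n)
{
	char path[64];

	assert(names != NULL);
	for (size_t i = 0; i < n; i++) {
		snprintf(path, sizeof(path), "/proc/%d/ns/%s", (int)pid, names[i]);
		int nsfd = open(path, O_RDONLY | O_CLOEXEC);
		if (nsfd < 0)
			return -1;
		int ret = setns(nsfd, 0);
		close(nsfd);
		if (ret < 0)
			return -1; /* namespaces joined earlier stay switched */
	}
	return 0;
}

/* Post-patch approach: one call for every namespace named in `flags`. */
int attach_pidfd(int pidfd, int flags)
{
	return setns(pidfd, flags);
}
```

For seven namespaces legacy_syscalls() gives the 14 calls mentioned above,
and eight namespaces (with time namespaces) give 16.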

A decently designed, large-scale container manager usually isn't the
parent of any of the containers it spawns so the containers don't die
when it crashes or needs to update or reinitialize. This means that
interacting with containers through pids is inherently racy for the
manager, especially on systems where the maximum pid number has not been
raised significantly. This is even more problematic since we often spawn
and manage thousands or tens of thousands of containers. Interacting with
a container through a pid can thus become risky quite quickly, especially
since we allow an administrator to enable advanced features such as
syscall interception where we're performing syscalls in lieu of the
container. In all of those cases we use pidfds if they are available and
we pass them around as stable references. Using them to setns() to the
target process' namespaces is as reliable as using nsfds. Either the
target process is already dead and we get ESRCH or we manage to attach
to its namespaces but we can't accidentally attach to another process'
namespaces. So pidfds lend themselves to be used with this api.
The other main advantage is that with this change the pidfd becomes the
only relevant token for most container interactions and it's the only
token we need to create and send around.
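
A container manager might wrap the call roughly like this. The enum and
both function names are hypothetical; the only behaviour taken from this
patch is that a dead target yields ESRCH and that a failed call leaves the
caller's namespaces untouched.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>

enum attach_result {
	ATTACH_OK,          /* now inside the target's namespaces */
	ATTACH_TARGET_GONE, /* the process behind the pidfd has exited */
	ATTACH_FAILED,      /* any other error; nothing was switched */
};

/* Map a setns() return value and the errno it left behind to an outcome. */
enum attach_result classify_attach(int ret, int err)
{
	if (ret == 0)
		return ATTACH_OK;
	return err == ESRCH ? ATTACH_TARGET_GONE : ATTACH_FAILED;
}

/* Attach to the namespaces in `flags` of the process referred to by `pidfd`. */
enum attach_result attach_container(int pidfd, int flags)
{
	assert(pidfd >= 0);
	int ret = setns(pidfd, flags);
	return classify_attach(ret, errno);
}
```

Because the pidfd pins one specific process, ATTACH_TARGET_GONE means
exactly that the container's process exited; there is no window in which
a recycled pid could be attached to instead.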

Apart from significantly reducing the number of syscalls from double
digits to single digits, which is a decent reason on its own
post-Spectre/Meltdown, this also allows switching to a set of namespaces
atomically, i.e. either attaching to all the specified namespaces
succeeds or we fail. If we fail we haven't changed a single namespace.
There are currently three namespaces that can fail for non-trivial
reasons (other than ENOMEM, which really is not very interesting since
we then have other problems anyway): the user, mount, and pid
namespaces. We can fail to
attach to a pid namespace if it is not our current active pid namespace
or a descendant of it. We can fail to attach to a user namespace because
we are multi-threaded or because our current mount namespace shares
filesystem state with other tasks, or because we're trying to setns()
to the same user namespace, i.e. the target task has the same user
namespace as we do. We can fail to attach to a mount namespace because
it shares filesystem state with other tasks or because we fail to lookup
the new root for the new mount namespace. In most non-pathological
scenarios these issues can be somewhat mitigated. But there are cases where
we're half-attached to some namespace and failing to attach to another one.
I've talked about some of these problems during the hallway track (something
only the pre-COVID-19 generation will remember) of Plumbers in Los Angeles
in 2018(?). Even if all these issues could be avoided with super careful
userspace coding it would be nicer to have this done in-kernel. Pidfds seem
to lend themselves nicely to this.
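
The difference between being half-attached and the all-or-nothing
behaviour can be shown with a toy userspace model. Everything here
(struct ns_state, install_one(), the two attach functions) is invented for
illustration; it only mimics the prepare/validate/commit shape of the
patch and is not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

enum { NS_USER, NS_MNT, NS_PID, NS_NET, NS_MAX };

/* Toy stand-in for a task's namespace set: one id per namespace type. */
struct ns_state {
	int id[NS_MAX];
};

/* Toy install step: fails for any namespace type marked in `broken`. */
static bool install_one(struct ns_state *st, int type, int new_id,
			const bool broken[NS_MAX])
{
	assert(type >= 0 && type < NS_MAX);
	if (broken[type])
		return false;
	st->id[type] = new_id;
	return true;
}

/* One switch at a time on the live state: a late failure leaves the
 * earlier switches in place, i.e. the task ends up half-attached. */
bool attach_sequential(struct ns_state *task, const struct ns_state *target,
		       const bool broken[NS_MAX])
{
	for (int t = 0; t < NS_MAX; t++)
		if (!install_one(task, t, target->id[t], broken))
			return false;
	return true;
}

/* Stage every switch on a copy and publish it only if all succeeded. */
bool attach_atomic(struct ns_state *task, const struct ns_state *target,
		   const bool broken[NS_MAX])
{
	struct ns_state staged = *task; /* prepare: work on a private copy */

	for (int t = 0; t < NS_MAX; t++)
		if (!install_one(&staged, t, target->id[t], broken))
			return false; /* staged copy is dropped, task untouched */

	*task = staged; /* commit: point of no return */
	return true;
}
```

With the pid namespace marked as failing, attach_sequential() leaves the
task with the target's user and mount namespaces but its own pid and net
namespaces, while attach_atomic() leaves the task exactly as it was.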

The other neat thing about this is that setns() becomes an actual
counterpart to the namespace bits of unshare().

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Jann Horn <jannh@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Link: https://lore.kernel.org/r/20200505140432.181565-3-christian.brauner@ubuntu.com

+227 -17
+5
fs/namespace.c
···
 	return container_of(ns, struct mnt_namespace, ns);
 }

+struct ns_common *from_mnt_ns(struct mnt_namespace *mnt)
+{
+	return &mnt->ns;
+}
+
 static bool mnt_ns_loop(struct dentry *dentry)
 {
 	/* Could bind mounting the mount namespace inode cause a
+5
fs/nsfs.c
···
 	return res;
 }

+bool proc_ns_file(const struct file *file)
+{
+	return file->f_op == &ns_file_operations;
+}
+
 struct file *proc_ns_fget(int fd)
 {
 	struct file *file;
+1
include/linux/mnt_namespace.h
···
 extern struct mnt_namespace *copy_mnt_ns(unsigned long, struct mnt_namespace *,
 		struct user_namespace *, struct fs_struct *);
 extern void put_mnt_ns(struct mnt_namespace *ns);
+extern struct ns_common *from_mnt_ns(struct mnt_namespace *);

 extern const struct file_operations proc_mounts_operations;
 extern const struct file_operations proc_mountinfo_operations;
+2
include/linux/proc_fs.h
···
 	return inode->i_sb->s_fs_info;
 }

+bool proc_ns_file(const struct file *file);
+
 #endif /* _LINUX_PROC_FS_H */
+214 -17
kernel/nsproxy.c
···
 #include <linux/ipc_namespace.h>
 #include <linux/time_namespace.h>
 #include <linux/fs_struct.h>
+#include <linux/proc_fs.h>
 #include <linux/proc_ns.h>
 #include <linux/file.h>
 #include <linux/syscalls.h>
···
 	switch_task_namespaces(p, NULL);
 }

+static int check_setns_flags(unsigned long flags)
+{
+	if (!flags || (flags & ~(CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
+				 CLONE_NEWNET | CLONE_NEWUSER | CLONE_NEWPID |
+				 CLONE_NEWCGROUP)))
+		return -EINVAL;
+
+#ifndef CONFIG_USER_NS
+	if (flags & CLONE_NEWUSER)
+		return -EINVAL;
+#endif
+#ifndef CONFIG_PID_NS
+	if (flags & CLONE_NEWPID)
+		return -EINVAL;
+#endif
+#ifndef CONFIG_UTS_NS
+	if (flags & CLONE_NEWUTS)
+		return -EINVAL;
+#endif
+#ifndef CONFIG_IPC_NS
+	if (flags & CLONE_NEWIPC)
+		return -EINVAL;
+#endif
+#ifndef CONFIG_CGROUPS
+	if (flags & CLONE_NEWCGROUP)
+		return -EINVAL;
+#endif
+#ifndef CONFIG_NET_NS
+	if (flags & CLONE_NEWNET)
+		return -EINVAL;
+#endif
+
+	return 0;
+}
+
 static void put_nsset(struct nsset *nsset)
 {
 	unsigned flags = nsset->flags;

 	if (flags & CLONE_NEWUSER)
 		put_cred(nsset_cred(nsset));
+	/*
+	 * We only created a temporary copy if we attached to more than just
+	 * the mount namespace.
+	 */
+	if (nsset->fs && (flags & CLONE_NEWNS) && (flags & ~CLONE_NEWNS))
+		free_fs_struct(nsset->fs);
 	if (nsset->nsproxy)
 		free_nsproxy(nsset->nsproxy);
 }

-static int prepare_nsset(int nstype, struct nsset *nsset)
+static int prepare_nsset(unsigned flags, struct nsset *nsset)
 {
 	struct task_struct *me = current;

···
 	if (IS_ERR(nsset->nsproxy))
 		return PTR_ERR(nsset->nsproxy);

-	if (nstype == CLONE_NEWUSER)
+	if (flags & CLONE_NEWUSER)
 		nsset->cred = prepare_creds();
 	else
 		nsset->cred = current_cred();
 	if (!nsset->cred)
 		goto out;

-	if (nstype == CLONE_NEWNS)
+	/* Only create a temporary copy of fs_struct if we really need to. */
+	if (flags == CLONE_NEWNS) {
 		nsset->fs = me->fs;
+	} else if (flags & CLONE_NEWNS) {
+		nsset->fs = copy_fs_struct(me->fs);
+		if (!nsset->fs)
+			goto out;
+	}

-	nsset->flags = nstype;
+	nsset->flags = flags;
 	return 0;

 out:
 	put_nsset(nsset);
 	return -ENOMEM;
+}
+
+static inline int validate_ns(struct nsset *nsset, struct ns_common *ns)
+{
+	return ns->ops->install(nsset, ns);
+}
+
+/*
+ * This is the inverse operation to unshare().
+ * Ordering is equivalent to the standard ordering used everywhere else
+ * during unshare and process creation. The switch to the new set of
+ * namespaces occurs at the point of no return after installation of
+ * all requested namespaces was successful in commit_nsset().
+ */
+static int validate_nsset(struct nsset *nsset, struct pid *pid)
+{
+	int ret = 0;
+	unsigned flags = nsset->flags;
+	struct user_namespace *user_ns = NULL;
+	struct pid_namespace *pid_ns = NULL;
+	struct nsproxy *nsp;
+	struct task_struct *tsk;
+
+	/* Take a "snapshot" of the target task's namespaces. */
+	rcu_read_lock();
+	tsk = pid_task(pid, PIDTYPE_PID);
+	if (!tsk) {
+		rcu_read_unlock();
+		return -ESRCH;
+	}
+
+	if (!ptrace_may_access(tsk, PTRACE_MODE_READ_REALCREDS)) {
+		rcu_read_unlock();
+		return -EPERM;
+	}
+
+	task_lock(tsk);
+	nsp = tsk->nsproxy;
+	if (nsp)
+		get_nsproxy(nsp);
+	task_unlock(tsk);
+	if (!nsp) {
+		rcu_read_unlock();
+		return -ESRCH;
+	}
+
+#ifdef CONFIG_PID_NS
+	if (flags & CLONE_NEWPID) {
+		pid_ns = task_active_pid_ns(tsk);
+		if (unlikely(!pid_ns)) {
+			rcu_read_unlock();
+			ret = -ESRCH;
+			goto out;
+		}
+		get_pid_ns(pid_ns);
+	}
+#endif
+
+#ifdef CONFIG_USER_NS
+	if (flags & CLONE_NEWUSER)
+		user_ns = get_user_ns(__task_cred(tsk)->user_ns);
+#endif
+	rcu_read_unlock();
+
+	/*
+	 * Install requested namespaces. The caller will have
+	 * verified earlier that the requested namespaces are
+	 * supported on this kernel. We don't report errors here
+	 * if a namespace is requested that isn't supported.
+	 */
+#ifdef CONFIG_USER_NS
+	if (flags & CLONE_NEWUSER) {
+		ret = validate_ns(nsset, &user_ns->ns);
+		if (ret)
+			goto out;
+	}
+#endif
+
+	if (flags & CLONE_NEWNS) {
+		ret = validate_ns(nsset, from_mnt_ns(nsp->mnt_ns));
+		if (ret)
+			goto out;
+	}
+
+#ifdef CONFIG_UTS_NS
+	if (flags & CLONE_NEWUTS) {
+		ret = validate_ns(nsset, &nsp->uts_ns->ns);
+		if (ret)
+			goto out;
+	}
+#endif
+
+#ifdef CONFIG_IPC_NS
+	if (flags & CLONE_NEWIPC) {
+		ret = validate_ns(nsset, &nsp->ipc_ns->ns);
+		if (ret)
+			goto out;
+	}
+#endif
+
+#ifdef CONFIG_PID_NS
+	if (flags & CLONE_NEWPID) {
+		ret = validate_ns(nsset, &pid_ns->ns);
+		if (ret)
+			goto out;
+	}
+#endif
+
+#ifdef CONFIG_CGROUPS
+	if (flags & CLONE_NEWCGROUP) {
+		ret = validate_ns(nsset, &nsp->cgroup_ns->ns);
+		if (ret)
+			goto out;
+	}
+#endif
+
+#ifdef CONFIG_NET_NS
+	if (flags & CLONE_NEWNET) {
+		ret = validate_ns(nsset, &nsp->net_ns->ns);
+		if (ret)
+			goto out;
+	}
+#endif
+
+out:
+	if (pid_ns)
+		put_pid_ns(pid_ns);
+	if (nsp)
+		put_nsproxy(nsp);
+	put_user_ns(user_ns);
+
+	return ret;
 }

 /*
···
 	}
 #endif

+	/* We only need to commit if we have used a temporary fs_struct. */
+	if ((flags & CLONE_NEWNS) && (flags & ~CLONE_NEWNS)) {
+		set_fs_root(me->fs, &nsset->fs->root);
+		set_fs_pwd(me->fs, &nsset->fs->pwd);
+	}
+
 #ifdef CONFIG_IPC_NS
 	if (flags & CLONE_NEWIPC)
 		exit_sem(me);
···
 	nsset->nsproxy = NULL;
 }

-SYSCALL_DEFINE2(setns, int, fd, int, nstype)
+SYSCALL_DEFINE2(setns, int, fd, int, flags)
 {
 	struct file *file;
-	struct ns_common *ns;
+	struct ns_common *ns = NULL;
 	struct nsset nsset = {};
-	int err;
+	int err = 0;

-	file = proc_ns_fget(fd);
-	if (IS_ERR(file))
-		return PTR_ERR(file);
+	file = fget(fd);
+	if (!file)
+		return -EBADF;

-	err = -EINVAL;
-	ns = get_proc_ns(file_inode(file));
-	if (nstype && (ns->ops->type != nstype))
-		goto out;
-
-	err = prepare_nsset(ns->ops->type, &nsset);
+	if (proc_ns_file(file)) {
+		ns = get_proc_ns(file_inode(file));
+		if (flags && (ns->ops->type != flags))
+			err = -EINVAL;
+		flags = ns->ops->type;
+	} else if (!IS_ERR(pidfd_pid(file))) {
+		err = check_setns_flags(flags);
+	} else {
+		err = -EBADF;
+	}
 	if (err)
 		goto out;

-	err = ns->ops->install(&nsset, ns);
+	err = prepare_nsset(flags, &nsset);
+	if (err)
+		goto out;
+
+	if (proc_ns_file(file))
+		err = validate_ns(&nsset, ns);
+	else
+		err = validate_nsset(&nsset, file->private_data);
 	if (!err) {
 		commit_nsset(&nsset);
 		perf_event_namespaces(current);