Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

fork: extend clone3() to support setting a PID

The main motivation to add set_tid to clone3() is CRIU.

To restore a process with the same PID/TID CRIU currently uses
/proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to
ns_last_pid and then (quickly) does a clone(). This works most of the
time, but it is racy. It is also slow as it requires multiple syscalls.
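
For illustration only, the ns_last_pid sequence described above might look roughly like the following user-space sketch. The function names format_last_pid() and fork_with_pid_legacy() are hypothetical, and this is not CRIU's actual code; it only shows where the race window sits and why several syscalls are needed.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* The kernel hands out ns_last_pid + 1 next, so write (desired - 1). */
int format_last_pid(pid_t desired, char *buf, size_t len)
{
	assert(buf != NULL);
	if (desired < 1)
		return -1;
	int n = snprintf(buf, len, "%d", (int)desired - 1);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Returns fork()'s result, or -1 if ns_last_pid could not be written. */
pid_t fork_with_pid_legacy(pid_t desired)
{
	char buf[16];

	if (format_last_pid(desired, buf, sizeof(buf)) < 0)
		return -1;

	int fd = open("/proc/sys/kernel/ns_last_pid", O_WRONLY);
	if (fd < 0)
		return -1;	/* restricted by CAP_SYS_ADMIN, per the text above */
	ssize_t w = write(fd, buf, strlen(buf));
	close(fd);
	if (w != (ssize_t)strlen(buf))
		return -1;

	/*
	 * Race window: another task in this PID namespace may fork here
	 * and take the PID, so the caller must compare the result.
	 */
	return fork();
}
```

In the parent, the caller would still have to check that the returned PID equals the desired one and retry or give up otherwise.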

Extending clone3() to support *set_tid makes it possible to restore a
process using CRIU without accessing /proc/sys/kernel/ns_last_pid and
without races (as long as the desired PID/TID is available).

This clone3() extension places the same restrictions (CAP_SYS_ADMIN)
on clone3() with *set_tid as are currently in place for ns_last_pid.

The original version of this change used a single value for
set_tid. After set_tid was presented at the 2019 LPC, it was decided
to change set_tid to an array so that the PID of a process can be set
in multiple PID namespaces at the same time. If a process is created
in a nested PID namespace, its PID can be chosen both inside and
outside of that PID namespace. Details are also in the corresponding
selftest.

To create a process with the following PIDs:

PID NS level    Requested PID
0 (host)        31496
1               42
2               1

For that example the two newly introduced parameters to struct
clone_args (set_tid and set_tid_size) would need to be:

set_tid[0] = 1;
set_tid[1] = 42;
set_tid[2] = 31496;
set_tid_size = 3;
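
The array is simply the table read from the bottom up. As a small sketch, a caller that keeps its wanted PIDs indexed by namespace level could reverse them as follows; build_set_tid() and by_level are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/*
 * by_level[0] is the PID wanted on the host (level 0) and
 * by_level[levels - 1] the PID wanted in the innermost namespace.
 * set_tid expects the reverse order, innermost namespace first.
 */
size_t build_set_tid(const pid_t *by_level, size_t levels, pid_t *set_tid)
{
	assert(by_level != NULL && set_tid != NULL);
	for (size_t i = 0; i < levels; i++)
		set_tid[i] = by_level[levels - 1 - i];
	return levels;	/* value to store in set_tid_size */
}
```

Called with by_level = {31496, 42, 1}, this fills set_tid with the three values listed above.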

If only the PIDs of the two innermost nested PID namespaces should be
defined it would look like this:

set_tid[0] = 1;
set_tid[1] = 42;
set_tid_size = 2;

The PID of the newly created process would then be the next available
free PID in the PID namespace level 0 (host) and 42 in the PID namespace
at level 1 and the PID of the process in the innermost PID namespace
would be 1.
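
To make the partial case concrete, the following user-space sketch computes which PID is requested at each level. It imitates the set_tid[ns->level - i] indexing used by the patch but is not kernel code; requested_per_level() is a hypothetical helper, and 0 stands for "the kernel picks the next free PID".

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/*
 * For a process created in a namespace at depth ns_level, store the PID
 * requested at each level in out[0..ns_level]. out[level] == 0 means no
 * PID was requested for that level.
 */
int requested_per_level(const pid_t *set_tid, size_t set_tid_size,
			unsigned int ns_level, pid_t *out)
{
	assert(out != NULL);
	if (set_tid_size > (size_t)ns_level + 1)
		return -1;	/* the patch rejects this case with EINVAL */
	for (int i = (int)ns_level; i >= 0; i--) {
		size_t idx = ns_level - (unsigned int)i;

		out[i] = idx < set_tid_size ? set_tid[idx] : 0;
	}
	return 0;
}
```

With set_tid = {1, 42}, set_tid_size = 2 and ns_level = 2, the result is out[2] = 1, out[1] = 42 and out[0] = 0.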

The set_tid array is used to specify the PID of a process starting
from the innermost nested PID namespaces up to set_tid_size PID namespaces.

set_tid_size cannot be larger than the number of currently nested PID
namespaces (the current PID namespace level + 1).
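
From user space, a call using the two new fields might be sketched as below. This is an assumption-laden illustration: struct clone_args_v1 is a local copy of the 80-byte layout so the sketch builds against older headers, fill_clone_args() and clone3_set_tid() are hypothetical helpers, and the fallback syscall number 435 is assumed for headers that do not define __NR_clone3.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef __NR_clone3
#define __NR_clone3 435	/* assumption: headers predate clone3 */
#endif

/* Local copy of the extended layout: ten 64-bit fields, 80 bytes. */
struct clone_args_v1 {
	uint64_t flags, pidfd, child_tid, parent_tid, exit_signal;
	uint64_t stack, stack_size, tls, set_tid, set_tid_size;
};

void fill_clone_args(struct clone_args_v1 *args, const pid_t *set_tid,
		     size_t set_tid_size)
{
	assert(args != NULL);
	memset(args, 0, sizeof(*args));
	args->exit_signal = SIGCHLD;
	args->set_tid = (uint64_t)(uintptr_t)set_tid;
	args->set_tid_size = set_tid_size;
}

/* Returns the child PID in the parent, 0 in the child, -1 on error. */
long clone3_set_tid(const pid_t *set_tid, size_t set_tid_size)
{
	struct clone_args_v1 args;

	fill_clone_args(&args, set_tid, set_tid_size);
	return syscall(__NR_clone3, &args, sizeof(args));
}
```

A caller without CAP_SYS_ADMIN, or one requesting a PID that is already taken, should expect -1 with errno set rather than a child with a different PID.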

Signed-off-by: Adrian Reber <areber@redhat.com>
Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com>
Acked-by: Andrei Vagin <avagin@gmail.com>
Link: https://lore.kernel.org/r/20191115123621.142252-1-areber@redhat.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

authored by Adrian Reber and committed by Christian Brauner
49cb2fc4 17a81069

+121 -36
+2 -1
include/linux/pid.h
···
 extern struct pid *find_get_pid(int nr);
 extern struct pid *find_ge_pid(int nr, struct pid_namespace *);

-extern struct pid *alloc_pid(struct pid_namespace *ns);
+extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
+			     size_t set_tid_size);
 extern void free_pid(struct pid *pid);
 extern void disable_pid_allocation(struct pid_namespace *ns);

+2
include/linux/pid_namespace.h
···
 #include <linux/ns_common.h>
 #include <linux/idr.h>

+/* MAX_PID_NS_LEVEL is needed for limiting size of 'struct pid' */
+#define MAX_PID_NS_LEVEL 32

 struct fs_pin;

+3
include/linux/sched/task.h
···
 	unsigned long stack;
 	unsigned long stack_size;
 	unsigned long tls;
+	pid_t *set_tid;
+	/* Number of elements in *set_tid */
+	size_t set_tid_size;
 };

 /*
+35 -18
include/uapi/linux/sched.h
···
 #ifndef __ASSEMBLY__
 /**
  * struct clone_args - arguments for the clone3 syscall
- * @flags:       Flags for the new process as listed above.
- *               All flags are valid except for CSIGNAL and
- *               CLONE_DETACHED.
- * @pidfd:       If CLONE_PIDFD is set, a pidfd will be
- *               returned in this argument.
- * @child_tid:   If CLONE_CHILD_SETTID is set, the TID of the
- *               child process will be returned in the child's
- *               memory.
- * @parent_tid:  If CLONE_PARENT_SETTID is set, the TID of
- *               the child process will be returned in the
- *               parent's memory.
- * @exit_signal: The exit_signal the parent process will be
- *               sent when the child exits.
- * @stack:       Specify the location of the stack for the
- *               child process.
- * @stack_size:  The size of the stack for the child process.
- * @tls:         If CLONE_SETTLS is set, the tls descriptor
- *               is set to tls.
+ * @flags:        Flags for the new process as listed above.
+ *                All flags are valid except for CSIGNAL and
+ *                CLONE_DETACHED.
+ * @pidfd:        If CLONE_PIDFD is set, a pidfd will be
+ *                returned in this argument.
+ * @child_tid:    If CLONE_CHILD_SETTID is set, the TID of the
+ *                child process will be returned in the child's
+ *                memory.
+ * @parent_tid:   If CLONE_PARENT_SETTID is set, the TID of
+ *                the child process will be returned in the
+ *                parent's memory.
+ * @exit_signal:  The exit_signal the parent process will be
+ *                sent when the child exits.
+ * @stack:        Specify the location of the stack for the
+ *                child process.
+ * @stack_size:   The size of the stack for the child process.
+ * @tls:          If CLONE_SETTLS is set, the tls descriptor
+ *                is set to tls.
+ * @set_tid:      Pointer to an array of type *pid_t. The size
+ *                of the array is defined using @set_tid_size.
+ *                This array is used to select PIDs/TIDs for
+ *                newly created processes. The first element in
+ *                this defines the PID in the most nested PID
+ *                namespace. Each additional element in the array
+ *                defines the PID in the parent PID namespace of
+ *                the original PID namespace. If the array has
+ *                less entries than the number of currently
+ *                nested PID namespaces only the PIDs in the
+ *                corresponding namespaces are set.
+ * @set_tid_size: This defines the size of the array referenced
+ *                in @set_tid. This cannot be larger than the
+ *                kernel's limit of nested PID namespaces.
  *
  * The structure is versioned by size and thus extensible.
  * New struct members must go at the end of the struct and
···
 	__aligned_u64 stack;
 	__aligned_u64 stack_size;
 	__aligned_u64 tls;
+	__aligned_u64 set_tid;
+	__aligned_u64 set_tid_size;
 };
 #endif

 #define CLONE_ARGS_SIZE_VER0 64 /* sizeof first published struct */
+#define CLONE_ARGS_SIZE_VER1 80 /* sizeof second published struct */

 /*
  * Scheduling policies
+23 -1
kernel/fork.c
···
 	stackleak_task_init(p);

 	if (pid != &init_struct_pid) {
-		pid = alloc_pid(p->nsproxy->pid_ns_for_children);
+		pid = alloc_pid(p->nsproxy->pid_ns_for_children, args->set_tid,
+				args->set_tid_size);
 		if (IS_ERR(pid)) {
 			retval = PTR_ERR(pid);
 			goto bad_fork_cleanup_thread;
···
 {
 	int err;
 	struct clone_args args;
+	pid_t *kset_tid = kargs->set_tid;

 	if (unlikely(usize > PAGE_SIZE))
 		return -E2BIG;
···
 	err = copy_struct_from_user(&args, sizeof(args), uargs, usize);
 	if (err)
 		return err;
+
+	if (unlikely(args.set_tid_size > MAX_PID_NS_LEVEL))
+		return -EINVAL;
+
+	if (unlikely(!args.set_tid && args.set_tid_size > 0))
+		return -EINVAL;
+
+	if (unlikely(args.set_tid && args.set_tid_size == 0))
+		return -EINVAL;

 	/*
 	 * Verify that higher 32bits of exit_signal are unset and that
···
 		.stack		= args.stack,
 		.stack_size	= args.stack_size,
 		.tls		= args.tls,
+		.set_tid_size	= args.set_tid_size,
 	};
+
+	if (args.set_tid &&
+		copy_from_user(kset_tid, u64_to_user_ptr(args.set_tid),
+			(kargs->set_tid_size * sizeof(pid_t))))
+		return -EFAULT;
+
+	kargs->set_tid = kset_tid;

 	return 0;
 }
···
 	int err;

 	struct kernel_clone_args kargs;
+	pid_t set_tid[MAX_PID_NS_LEVEL];
+
+	kargs.set_tid = set_tid;

 	err = copy_clone_args_from_user(&kargs, uargs, size);
 	if (err)
+56 -14
kernel/pid.c
···
 	call_rcu(&pid->rcu, delayed_put_pid);
 }

-struct pid *alloc_pid(struct pid_namespace *ns)
+struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
+		      size_t set_tid_size)
 {
 	struct pid *pid;
 	enum pid_type type;
···
 	struct pid_namespace *tmp;
 	struct upid *upid;
 	int retval = -ENOMEM;
+
+	/*
+	 * set_tid_size contains the size of the set_tid array. Starting at
+	 * the most nested currently active PID namespace it tells alloc_pid()
+	 * which PID to set for a process in that most nested PID namespace
+	 * up to set_tid_size PID namespaces. It does not have to set the PID
+	 * for a process in all nested PID namespaces but set_tid_size must
+	 * never be greater than the current ns->level + 1.
+	 */
+	if (set_tid_size > ns->level + 1)
+		return ERR_PTR(-EINVAL);

 	pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
 	if (!pid)
···
 	pid->level = ns->level;

 	for (i = ns->level; i >= 0; i--) {
-		int pid_min = 1;
+		int tid = 0;
+
+		if (set_tid_size) {
+			tid = set_tid[ns->level - i];
+
+			retval = -EINVAL;
+			if (tid < 1 || tid >= pid_max)
+				goto out_free;
+			/*
+			 * Also fail if a PID != 1 is requested and
+			 * no PID 1 exists.
+			 */
+			if (tid != 1 && !tmp->child_reaper)
+				goto out_free;
+			retval = -EPERM;
+			if (!ns_capable(tmp->user_ns, CAP_SYS_ADMIN))
+				goto out_free;
+			set_tid_size--;
+		}

 		idr_preload(GFP_KERNEL);
 		spin_lock_irq(&pidmap_lock);

-		/*
-		 * init really needs pid 1, but after reaching the maximum
-		 * wrap back to RESERVED_PIDS
-		 */
-		if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS)
-			pid_min = RESERVED_PIDS;
+		if (tid) {
+			nr = idr_alloc(&tmp->idr, NULL, tid,
+				       tid + 1, GFP_ATOMIC);
+			/*
+			 * If ENOSPC is returned it means that the PID is
+			 * alreay in use. Return EEXIST in that case.
+			 */
+			if (nr == -ENOSPC)
+				nr = -EEXIST;
+		} else {
+			int pid_min = 1;
+			/*
+			 * init really needs pid 1, but after reaching the
+			 * maximum wrap back to RESERVED_PIDS
+			 */
+			if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS)
+				pid_min = RESERVED_PIDS;

-		/*
-		 * Store a null pointer so find_pid_ns does not find
-		 * a partially initialized PID (see below).
-		 */
-		nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
-				      pid_max, GFP_ATOMIC);
+			/*
+			 * Store a null pointer so find_pid_ns does not find
+			 * a partially initialized PID (see below).
+			 */
+			nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
+					      pid_max, GFP_ATOMIC);
+		}
 		spin_unlock_irq(&pidmap_lock);
 		idr_preload_end();

-2
kernel/pid_namespace.c
···

 static DEFINE_MUTEX(pid_caches_mutex);
 static struct kmem_cache *pid_ns_cachep;
-/* MAX_PID_NS_LEVEL is needed for limiting size of 'struct pid' */
-#define MAX_PID_NS_LEVEL 32
 /* Write once array, filled from the beginning. */
 static struct kmem_cache *pid_cache[MAX_PID_NS_LEVEL];

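
As a reading aid for the kernel/fork.c and kernel/pid.c hunks, the error codes added on the set_tid path could be summarized for a user-space caller roughly as follows. explain_set_tid_errno() is a hypothetical helper, its strings paraphrase the checks in this diff, and it does not cover other reasons clone3() may fail with the same codes.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Takes a positive errno value; returns NULL for codes not added here. */
const char *explain_set_tid_errno(int err)
{
	assert(err >= 0);
	switch (err) {
	case EINVAL:
		return "set_tid and set_tid_size inconsistent, too many "
		       "levels, PID out of range, or no PID 1 in that namespace";
	case EPERM:
		return "no CAP_SYS_ADMIN in the user namespace owning a "
		       "targeted PID namespace";
	case EEXIST:
		return "requested PID is already in use";
	case EFAULT:
		return "set_tid array could not be copied from user space";
	default:
		return NULL;
	}
}
```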