Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

seccomp: introduce writer locking

Normally, task_struct.seccomp.filter is only ever read or modified by
the task that owns it (current). This property aids in fast access
during system call filtering as read access is lockless.

Updating the pointer from another task, however, opens up race
conditions. To allow cross-thread filter pointer updates, writes to the
seccomp fields are now protected by the sighand spinlock (which is shared
by all threads in the thread group). Read access remains lockless because
pointer updates themselves are atomic. However, writes (or cloning)
often entail additional checking (like maximum instruction counts),
which requires locking to perform safely.
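
The split between locked writers and lockless readers can be sketched in
userspace C11. This is only a model, not kernel code: lock_model,
filter_model, task_model, filter_read, filter_attach and MAX_TOTAL_INSNS
are hypothetical names invented for the illustration, and a toy atomic
spinlock stands in for the sighand lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_TOTAL_INSNS 8	/* hypothetical limit for this sketch */

/* Toy spinlock standing in for the lock shared by the thread group. */
struct lock_model { atomic_bool held; };

static void lock_acquire(struct lock_model *l)
{
	while (atomic_exchange_explicit(&l->held, true, memory_order_acquire))
		;	/* spin until the previous holder releases */
}

static void lock_release(struct lock_model *l)
{
	atomic_store_explicit(&l->held, false, memory_order_release);
}

static bool lock_is_held(struct lock_model *l)
{
	return atomic_load_explicit(&l->held, memory_order_relaxed);
}

struct filter_model {
	size_t len;			/* instructions in this filter */
	struct filter_model *prev;	/* older filters in the chain */
};

struct task_model {
	struct lock_model *group_lock;		/* shared by all threads */
	_Atomic(struct filter_model *) filter;	/* read without the lock */
};

/* Reader: one atomic pointer load, no lock taken. */
static struct filter_model *filter_read(struct task_model *t)
{
	return atomic_load_explicit(&t->filter, memory_order_acquire);
}

/* Writer body: the length check and the publish must not interleave. */
static int filter_attach_locked(struct task_model *t, struct filter_model *f)
{
	struct filter_model *head, *walker;
	size_t total = f->len;

	assert(lock_is_held(t->group_lock));

	head = atomic_load_explicit(&t->filter, memory_order_relaxed);
	for (walker = head; walker; walker = walker->prev)
		total += walker->len;
	if (total > MAX_TOTAL_INSNS)
		return -1;

	f->prev = head;
	atomic_store_explicit(&t->filter, f, memory_order_release);
	return 0;
}

static int filter_attach(struct task_model *t, struct filter_model *f)
{
	int ret;

	lock_acquire(t->group_lock);
	ret = filter_attach_locked(t, f);
	lock_release(t->group_lock);
	return ret;
}
```

Without the lock, two writers could each pass the length check against the
same old chain and together exceed the limit; the reader never needs the
lock because it only ever observes the old or the new pointer.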

In the case of cloning threads, the child is invisible to the system
until it enters the task list. To make sure a child can't be cloned from
a thread and left in a prior state, seccomp duplication is additionally
moved under the sighand lock. Then parent and child are certain to have
the same seccomp state when they exit the lock.
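
A rough userspace model of this ordering follows. It is a sketch under
assumptions, not the kernel implementation: lock_model, filter_model,
task_model, dup_task and copy_state_locked are hypothetical names, and a
plain integer stands in for the filter reference count.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

enum { MODE_DISABLED = 0, MODE_STRICT = 1, MODE_FILTER = 2 };

/* Toy spinlock standing in for the lock shared by the thread group. */
struct lock_model { atomic_bool held; };

static void lock_acquire(struct lock_model *l)
{
	while (atomic_exchange_explicit(&l->held, true, memory_order_acquire))
		;	/* spin until the previous holder releases */
}

static void lock_release(struct lock_model *l)
{
	atomic_store_explicit(&l->held, false, memory_order_release);
}

static bool lock_is_held(struct lock_model *l)
{
	return atomic_load_explicit(&l->held, memory_order_relaxed);
}

struct filter_model { int usage; };	/* reference count only */

struct task_model {
	struct lock_model *siglock;	/* shared by the thread group */
	int mode;
	struct filter_model *filter;
	bool no_new_privs;
	bool seccomp_flag;		/* stands in for the thread flag */
};

/*
 * Early duplication, done without the lock. The filter pointer is left
 * NULL so that an error path freeing the child drops no reference.
 */
static void dup_task(struct task_model *child, const struct task_model *parent)
{
	*child = *parent;
	child->filter = NULL;
}

/* Late copy: must run with the group lock held. */
static void copy_state_locked(struct task_model *child,
			      struct task_model *parent)
{
	assert(lock_is_held(parent->siglock));

	/* Take a reference for the new user, then assign. */
	if (parent->filter)
		parent->filter->usage++;
	child->mode = parent->mode;
	child->filter = parent->filter;

	/* Pick up anything set after the early snapshot. */
	if (parent->no_new_privs)
		child->no_new_privs = true;
	if (child->mode != MODE_DISABLED)
		child->seccomp_flag = true;
}
```

If the parent's state changes between dup_task and the locked copy, the
child still leaves the locked region with the parent's current mode,
filter and flags rather than the stale snapshot.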

Based on patches by Will Drewry and David Drysdale.

Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>

Kees Cook dbd95212 c8bee430

+66 -5
+3 -3
include/linux/seccomp.h
···
  *
  * @mode:  indicates one of the valid values above for controlled
  *         system calls available to a process.
- * @filter: The metadata and ruleset for determining what system calls
- *          are allowed for a task.
+ * @filter: must always point to a valid seccomp-filter or NULL as it is
+ *          accessed without locking during system call entry.
  *
  *          @filter must only be accessed from the context of current as there
- *          is no locking.
+ *          is no read locking.
  */
 struct seccomp {
 	int mode;
+48 -1
kernel/fork.c
···
 		goto free_ti;
 
 	tsk->stack = ti;
+#ifdef CONFIG_SECCOMP
+	/*
+	 * We must handle setting up seccomp filters once we're under
+	 * the sighand lock in case orig has changed between now and
+	 * then. Until then, filter must be NULL to avoid messing up
+	 * the usage counts on the error path calling free_task.
+	 */
+	tsk->seccomp.filter = NULL;
+#endif
 
 	setup_thread_stack(tsk, orig);
 	clear_user_return_notifier(tsk);
···
 	return 0;
 }
 
+static void copy_seccomp(struct task_struct *p)
+{
+#ifdef CONFIG_SECCOMP
+	/*
+	 * Must be called with sighand->lock held, which is common to
+	 * all threads in the group. Holding cred_guard_mutex is not
+	 * needed because this new task is not yet running and cannot
+	 * be racing exec.
+	 */
+	BUG_ON(!spin_is_locked(&current->sighand->siglock));
+
+	/* Ref-count the new filter user, and assign it. */
+	get_seccomp_filter(current);
+	p->seccomp = current->seccomp;
+
+	/*
+	 * Explicitly enable no_new_privs here in case it got set
+	 * between the task_struct being duplicated and holding the
+	 * sighand lock. The seccomp state and nnp must be in sync.
+	 */
+	if (task_no_new_privs(current))
+		task_set_no_new_privs(p);
+
+	/*
+	 * If the parent gained a seccomp mode after copying thread
+	 * flags and between before we held the sighand lock, we have
+	 * to manually enable the seccomp thread flag here.
+	 */
+	if (p->seccomp.mode != SECCOMP_MODE_DISABLED)
+		set_tsk_thread_flag(p, TIF_SECCOMP);
+#endif
+}
+
 SYSCALL_DEFINE1(set_tid_address, int __user *, tidptr)
 {
 	current->clear_child_tid = tidptr;
···
 		goto fork_out;
 
 	ftrace_graph_init_task(p);
-	get_seccomp_filter(p);
 
 	rt_mutex_init_task(p);
 
···
 	}
 
 	spin_lock(&current->sighand->siglock);
+
+	/*
+	 * Copy seccomp details explicitly here, in case they were changed
+	 * before holding sighand lock.
+	 */
+	copy_seccomp(p);
 
 	/*
 	 * Process group and session signals need to be delivered to just the
+15 -1
kernel/seccomp.c
···
 
 static inline bool seccomp_may_assign_mode(unsigned long seccomp_mode)
 {
+	BUG_ON(!spin_is_locked(&current->sighand->siglock));
+
 	if (current->seccomp.mode && current->seccomp.mode != seccomp_mode)
 		return false;
 
···
 
 static inline void seccomp_assign_mode(unsigned long seccomp_mode)
 {
+	BUG_ON(!spin_is_locked(&current->sighand->siglock));
+
 	current->seccomp.mode = seccomp_mode;
 	set_tsk_thread_flag(current, TIF_SECCOMP);
 }
···
  * @flags:  flags to change filter behavior
  * @filter: seccomp filter to add to the current process
  *
+ * Caller must be holding current->sighand->siglock lock.
+ *
  * Returns 0 on success, -ve on error.
  */
 static long seccomp_attach_filter(unsigned int flags,
···
 {
 	unsigned long total_insns;
 	struct seccomp_filter *walker;
+
+	BUG_ON(!spin_is_locked(&current->sighand->siglock));
 
 	/* Validate resulting filter length. */
 	total_insns = filter->prog->len;
···
 	const unsigned long seccomp_mode = SECCOMP_MODE_STRICT;
 	long ret = -EINVAL;
 
+	spin_lock_irq(&current->sighand->siglock);
+
 	if (!seccomp_may_assign_mode(seccomp_mode))
 		goto out;
 
···
 	ret = 0;
 
 out:
+	spin_unlock_irq(&current->sighand->siglock);
 
 	return ret;
 }
···
 
 	/* Validate flags. */
 	if (flags != 0)
-		goto out;
+		return -EINVAL;
 
 	/* Prepare the new filter before holding any locks. */
 	prepared = seccomp_prepare_user_filter(filter);
 	if (IS_ERR(prepared))
 		return PTR_ERR(prepared);
+
+	spin_lock_irq(&current->sighand->siglock);
 
 	if (!seccomp_may_assign_mode(seccomp_mode))
 		goto out;
···
 
 	seccomp_assign_mode(seccomp_mode);
 out:
+	spin_unlock_irq(&current->sighand->siglock);
 	seccomp_filter_free(prepared);
 	return ret;
 }
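
The shape of the filter-mode path in the kernel/seccomp.c hunks (prepare
before taking the lock, check and assign under it, clean up after
releasing it) can be modelled in userspace roughly as follows. All names
here (lock_model, task_model, prepare_filter, may_assign_mode,
set_mode_filter) are hypothetical stand-ins, and clearing the local
pointer once ownership moves to the task is an assumption of the sketch
about code the diff does not show.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

enum { MODE_DISABLED = 0, MODE_STRICT = 1, MODE_FILTER = 2 };

/* Toy spinlock standing in for the lock shared by the thread group. */
struct lock_model { atomic_bool held; };

static void lock_acquire(struct lock_model *l)
{
	while (atomic_exchange_explicit(&l->held, true, memory_order_acquire))
		;	/* spin until the previous holder releases */
}

static void lock_release(struct lock_model *l)
{
	atomic_store_explicit(&l->held, false, memory_order_release);
}

static bool lock_is_held(struct lock_model *l)
{
	return atomic_load_explicit(&l->held, memory_order_relaxed);
}

struct filter_model {
	size_t len;
	struct filter_model *prev;
};

struct task_model {
	struct lock_model siglock;
	int mode;
	struct filter_model *filter;
};

/* May allocate, so it runs before any lock is taken. */
static struct filter_model *prepare_filter(size_t len)
{
	struct filter_model *f;

	if (len == 0)
		return NULL;
	f = malloc(sizeof(*f));
	if (f) {
		f->len = len;
		f->prev = NULL;
	}
	return f;
}

/* A mode, once set, may only be set again to the same value. */
static bool may_assign_mode(struct task_model *t, int mode)
{
	assert(lock_is_held(&t->siglock));
	return t->mode == MODE_DISABLED || t->mode == mode;
}

static int set_mode_filter(struct task_model *t, size_t len)
{
	struct filter_model *prepared = prepare_filter(len);
	int ret = -1;

	if (!prepared)
		return -1;	/* nothing locked yet, plain return */

	lock_acquire(&t->siglock);
	if (!may_assign_mode(t, MODE_FILTER))
		goto out;

	prepared->prev = t->filter;
	t->filter = prepared;
	prepared = NULL;	/* assumption: ownership moved to the task */
	t->mode = MODE_FILTER;
	ret = 0;
out:
	lock_release(&t->siglock);
	free(prepared);		/* no-op when the filter was attached */
	return ret;
}
```

Returning directly on early failures and funnelling every later exit
through the single out label keeps the unlock paired with the lock, which
mirrors why the flags check in the diff changes from goto out to a direct
return.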