Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

seccomp: Add wait_killable semantic to seccomp user notifier

This introduces a per-filter flag (SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV)
that makes the notifying process transition to wait killable semantics once
the supervisor has received the notification. Although wait killable isn't a
set of semantics formally exposed to userspace, the concept is searchable.
If the notifying process is signaled before the userspace agent receives the
notification, the signal is handled as normal.
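
A minimal userspace sketch of how a supervisor might request this behavior when installing a filter, assuming a kernel and headers that include this patch. The helper names (notify_flags, install_mount_notifier) and the choice of mount as the trapped syscall are illustrative assumptions, not part of the commit; the fallback #defines mirror the values shown in the uapi header hunk. Per the check this commit adds to kernel/seccomp.c, the flag is rejected with EINVAL unless SECCOMP_FILTER_FLAG_NEW_LISTENER is also set.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Fallbacks for older headers; values as in include/uapi/linux/seccomp.h. */
#ifndef SECCOMP_FILTER_FLAG_NEW_LISTENER
#define SECCOMP_FILTER_FLAG_NEW_LISTENER (1UL << 3)
#endif
#ifndef SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV
#define SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV (1UL << 5)
#endif
#ifndef SECCOMP_RET_USER_NOTIF
#define SECCOMP_RET_USER_NOTIF 0x7fc00000U
#endif

/* Hypothetical helper: flags for a listener, optionally wait-killable. */
unsigned long notify_flags(int wait_killable)
{
	unsigned long flags = SECCOMP_FILTER_FLAG_NEW_LISTENER;

	if (wait_killable)
		flags |= SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV;
	return flags;
}

/*
 * Hypothetical helper: trap mount(2) with a user notification and return the
 * listener fd, or -1 with errno set (EINVAL on kernels without the flag).
 * A real filter should also check seccomp_data.arch before trusting nr.
 */
int install_mount_notifier(int wait_killable)
{
	struct sock_filter insns[] = {
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, nr)),
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_mount, 0, 1),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = {
		.len = sizeof(insns) / sizeof(insns[0]),
		.filter = insns,
	};

	/* Required for unprivileged filter installation. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
		return -1;

	return (int)syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER,
			    notify_flags(wait_killable), &prog);
}
```

A supervisor that performs long-running, hard-to-undo work on behalf of the notifying process would call install_mount_notifier(1); one that prefers the previous interruptible behavior would pass 0.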

One quirk about how this is handled is that the notifying process
only switches to TASK_KILLABLE if it receives a wakeup from either
an addfd or a signal. This is to avoid an unnecessary wakeup of
the notifying task.
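
The quirk can be illustrated with a small userspace model of the decision the wait loop makes after an interrupted wait. This is a simplified sketch, not the kernel code: the enum and function names are hypothetical, and only the "flag set and state is SENT" test is taken from should_sleep_killable() in this commit.

```c
#include <stdbool.h>

/* Simplified notification states (the kernel uses SECCOMP_NOTIFY_* values). */
enum notif_state { NOTIF_INIT, NOTIF_SENT, NOTIF_REPLIED };

enum wait_outcome { WAIT_RETRY_KILLABLE, WAIT_INTERRUPTED };

/* Mirrors should_sleep_killable(): flag set and notification already received. */
bool model_should_sleep_killable(bool wait_killable_recv, enum notif_state state)
{
	return wait_killable_recv && state == NOTIF_SENT;
}

/*
 * Decide what the notifying task does after its wait returned an error
 * (a signal woke it). `was_killable` is the mode chosen before sleeping.
 */
enum wait_outcome model_after_signal(bool wait_killable_recv,
				     enum notif_state state, bool was_killable)
{
	/*
	 * The task slept interruptibly, but the supervisor picked up the
	 * notification in the meantime: loop again, now in killable mode.
	 */
	if (!was_killable &&
	    model_should_sleep_killable(wait_killable_recv, state))
		return WAIT_RETRY_KILLABLE;

	/* Otherwise the signal aborts the notification as before. */
	return WAIT_INTERRUPTED;
}
```

In this model the switch happens lazily: nothing wakes the sleeping task when the supervisor receives the notification, so the mode is only re-evaluated on the next wakeup.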

The reasons behind switching into wait_killable only after userspace
receives the notification are:
* Avoiding unnecessary work - Often, workloads will perform work that they
  may abort (request racing comes to mind). This allows for syscalls to be
  aborted safely prior to the notification being received by the
  supervisor. This way, the supervisor doesn't end up doing work that the
  workload does not want to complete anyway.
* Avoiding side effects - We don't want the syscall to be interruptible
  once the supervisor starts doing work because it may not be trivial
  to reverse the operation. For example, unmounting a file system may
  take a long time, and it's hard to roll back or to treat as
  reentrant.
* Avoid breaking runtimes - Various runtimes do not GC while they are
  in a syscall (or while running native code that subsequently
  calls a syscall). If many notifications are blocked, and not picked
  up by the supervisor, this can get the application into a bad state.

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220503080958.20220-2-sargun@sargun.me

Authored by Sargun Dhillon and committed by Kees Cook
c2aa2dfe 662340ef

+54 -3
+10
Documentation/userspace-api/seccomp_filter.rst
···
 respond atomically by using the ``SECCOMP_ADDFD_FLAG_SEND`` flag and the return
 value will be the injected file descriptor number.

+The notifying process can be preempted, resulting in the notification being
+aborted. This can be problematic when trying to take actions on behalf of the
+notifying process that are long-running and typically retryable (mounting a
+filesytem). Alternatively, at filter installation time, the
+``SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV`` flag can be set. This flag makes it
+such that when a user notification is received by the supervisor, the notifying
+process will ignore non-fatal signals until the response is sent. Signals that
+are sent prior to the notification being received by userspace are handled
+normally.
+
 It is worth noting that ``struct seccomp_data`` contains the values of register
 arguments to the syscall, but does not contain pointers to memory. The task's
 memory is accessible to suitably privileged traces via ``ptrace()`` or
+2 -1
include/linux/seccomp.h
···
 					 SECCOMP_FILTER_FLAG_LOG | \
 					 SECCOMP_FILTER_FLAG_SPEC_ALLOW | \
 					 SECCOMP_FILTER_FLAG_NEW_LISTENER | \
-					 SECCOMP_FILTER_FLAG_TSYNC_ESRCH)
+					 SECCOMP_FILTER_FLAG_TSYNC_ESRCH | \
+					 SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV)

 /* sizeof() the first published struct seccomp_notif_addfd */
 #define SECCOMP_NOTIFY_ADDFD_SIZE_VER0 24
+2
include/uapi/linux/seccomp.h
···
 #define SECCOMP_FILTER_FLAG_SPEC_ALLOW		(1UL << 2)
 #define SECCOMP_FILTER_FLAG_NEW_LISTENER	(1UL << 3)
 #define SECCOMP_FILTER_FLAG_TSYNC_ESRCH		(1UL << 4)
+/* Received notifications wait in killable state (only respond to fatal signals) */
+#define SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV	(1UL << 5)

 /*
  * All BPF programs must return a 32-bit value.
+40 -2
kernel/seccomp.c
···
  *	   the filter can be freed.
  * @cache: cache of arch/syscall mappings to actions
  * @log: true if all actions except for SECCOMP_RET_ALLOW should be logged
+ * @wait_killable_recv: Put notifying process in killable state once the
+ *			notification is received by the userspace listener.
  * @prev: points to a previously installed, or inherited, filter
  * @prog: the BPF program to evaluate
  * @notif: the struct that holds all notification related information
···
 	refcount_t refs;
 	refcount_t users;
 	bool log;
+	bool wait_killable_recv;
 	struct action_cache cache;
 	struct seccomp_filter *prev;
 	struct bpf_prog *prog;
···
 	if (flags & SECCOMP_FILTER_FLAG_LOG)
 		filter->log = true;

+	/* Set wait killable flag, if present. */
+	if (flags & SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV)
+		filter->wait_killable_recv = true;
+
 	/*
 	 * If there is an existing filter, make it the prev and don't drop its
 	 * task reference.
···
 	complete(&addfd->completion);
 }

+static bool should_sleep_killable(struct seccomp_filter *match,
+				  struct seccomp_knotif *n)
+{
+	return match->wait_killable_recv && n->state == SECCOMP_NOTIFY_SENT;
+}
+
 static int seccomp_do_user_notification(int this_syscall,
 					struct seccomp_filter *match,
 					const struct seccomp_data *sd)
···
 	 * This is where we wait for a reply from userspace.
 	 */
 	do {
+		bool wait_killable = should_sleep_killable(match, &n);
+
 		mutex_unlock(&match->notify_lock);
-		err = wait_for_completion_interruptible(&n.ready);
+		if (wait_killable)
+			err = wait_for_completion_killable(&n.ready);
+		else
+			err = wait_for_completion_interruptible(&n.ready);
 		mutex_lock(&match->notify_lock);
-		if (err != 0)
+
+		if (err != 0) {
+			/*
+			 * Check to see if the notifcation got picked up and
+			 * whether we should switch to wait killable.
+			 */
+			if (!wait_killable && should_sleep_killable(match, &n))
+				continue;
+
 			goto interrupted;
+		}

 		addfd = list_first_entry_or_null(&n.addfd,
 						 struct seccomp_kaddfd, list);
···
 	mutex_lock(&filter->notify_lock);
 	knotif = find_notification(filter, unotif.id);
 	if (knotif) {
+		/* Reset the process to make sure it's not stuck */
+		if (should_sleep_killable(filter, knotif))
+			complete(&knotif->ready);
 		knotif->state = SECCOMP_NOTIFY_INIT;
 		up(&filter->notif->request);
 	}
···
 	if ((flags & SECCOMP_FILTER_FLAG_TSYNC) &&
 	    (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) &&
 	    ((flags & SECCOMP_FILTER_FLAG_TSYNC_ESRCH) == 0))
+		return -EINVAL;
+
+	/*
+	 * The SECCOMP_FILTER_FLAG_WAIT_KILLABLE_SENT flag doesn't make sense
+	 * without the SECCOMP_FILTER_FLAG_NEW_LISTENER flag.
+	 */
+	if ((flags & SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV) &&
+	    ((flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) == 0))
 		return -EINVAL;

 	/* Prepare the new filter before holding any locks. */
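
The flag check added in the last kernel/seccomp.c hunk can be mirrored in userspace, for example to validate a flag set before calling seccomp(2). This is a sketch under assumptions: the FLAG_* macros and the function name are hypothetical stand-ins whose values copy the uapi definitions shown in this diff, and only this one rule is modeled (the kernel applies further checks).

```c
#include <errno.h>

/* Stand-ins for the uapi flag values shown in this diff. */
#define FLAG_NEW_LISTENER       (1UL << 3)
#define FLAG_WAIT_KILLABLE_RECV (1UL << 5)

/*
 * Returns 0 if the combination passes the rule added by this commit,
 * -EINVAL if WAIT_KILLABLE_RECV is requested without NEW_LISTENER.
 */
int model_check_wait_killable(unsigned long flags)
{
	if ((flags & FLAG_WAIT_KILLABLE_RECV) &&
	    (flags & FLAG_NEW_LISTENER) == 0)
		return -EINVAL;
	return 0;
}
```

The rule reflects that the flag only affects tasks waiting on a user notification, which requires a listener to exist.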