Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge branch 'next-seccomp' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security

Pull seccomp updates from James Morris:

- Add SECCOMP_RET_USER_NOTIF

- seccomp fixes for sparse warnings and s390 build (Tycho)

* 'next-seccomp' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
seccomp, s390: fix build for syscall type change
seccomp: fix poor type promotion
samples: add an example of seccomp user trap
seccomp: add a return code to trap to userspace
seccomp: switch system call argument type to void *
seccomp: hoist struct seccomp_data recalculation higher

+1411 -24
+1
Documentation/ioctl/ioctl-number.txt
···
 0x1b	all	InfiniBand Subsystem	<http://infiniband.sourceforge.net/>
 0x20	all	drivers/cdrom/cm206.h
 0x22	all	scsi/sg.h
+'!'	00-1F	uapi/linux/seccomp.h
 '#'	00-3F	IEEE 1394 Subsystem	Block for the entire subsystem
 '$'	00-0F	linux/perf_counter.h, linux/perf_event.h
 '%'	00-0F	include/uapi/linux/stm.h
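The new registry row reserves the ioctl magic '!' with command numbers 00-1F for uapi/linux/seccomp.h. A minimal sketch of checking a command against that reservation follows, assuming the Linux decoding macros `_IOC_TYPE` and `_IOC_NR` from `<linux/ioctl.h>`. The helper `in_seccomp_ioctl_range`, the `DEMO_` constants, and `struct demo_payload` are hypothetical names used only for illustration; they are not part of the commit.

```c
#include <assert.h>
#include <linux/ioctl.h>   /* _IO, _IOR, _IOWR, _IOC_TYPE, _IOC_NR (Linux only) */
#include <linux/types.h>   /* __u64 */

#define DEMO_SECCOMP_MAGIC  '!'   /* magic registered by this commit */
#define DEMO_SECCOMP_NR_MAX 0x1f  /* upper bound of the reserved 00-1F range */

/* The reserved range has to fit in the ioctl "nr" bit field. */
static_assert(DEMO_SECCOMP_NR_MAX < (1 << _IOC_NRBITS),
	      "reserved range must fit in the nr field");

/* Hypothetical stand-in payload; the real commands carry seccomp structs. */
struct demo_payload {
	__u64 id;
};

/* Returns 1 if cmd lies in the range registered for seccomp, else 0. */
static int in_seccomp_ioctl_range(unsigned int cmd)
{
	return _IOC_TYPE(cmd) == DEMO_SECCOMP_MAGIC &&
	       _IOC_NR(cmd) <= DEMO_SECCOMP_NR_MAX;
}
```

A command built as `_IOWR('!', 0, struct demo_payload)` falls inside the range, while `_IO('!', 0x20)` or any command with a different magic does not.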
+84
Documentation/userspace-api/seccomp_filter.rst
···
 	Results in the lower 16-bits of the return value being passed
 	to userland as the errno without executing the system call.

+``SECCOMP_RET_USER_NOTIF``:
+	Results in a ``struct seccomp_notif`` message sent on the userspace
+	notification fd, if it is attached, or ``-ENOSYS`` if it is not. See below
+	on discussion of how to handle user notifications.
+
 ``SECCOMP_RET_TRACE``:
 	When returned, this value will cause the kernel to attempt to
 	notify a ``ptrace()``-based tracer prior to executing the system
···
 The ``samples/seccomp/`` directory contains both an x86-specific example
 and a more generic example of a higher level macro interface for BPF
 program generation.
+
+Userspace Notification
+======================
+
+The ``SECCOMP_RET_USER_NOTIF`` return code lets seccomp filters pass a
+particular syscall to userspace to be handled. This may be useful for
+applications like container managers, which wish to intercept particular
+syscalls (``mount()``, ``finit_module()``, etc.) and change their behavior.
+
+To acquire a notification FD, use the ``SECCOMP_FILTER_FLAG_NEW_LISTENER``
+argument to the ``seccomp()`` syscall:
+
+.. code-block:: c
+
+    fd = seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
+
+which (on success) will return a listener fd for the filter, which can then be
+passed around via ``SCM_RIGHTS`` or similar. Note that filter fds correspond to
+a particular filter, and not a particular task. So if this task then forks,
+notifications from both tasks will appear on the same filter fd. Reads and
+writes to/from a filter fd are also synchronized, so a filter fd can safely
+have many readers.
+
+The interface for a seccomp notification fd consists of two structures:
+
+.. code-block:: c
+
+    struct seccomp_notif_sizes {
+        __u16 seccomp_notif;
+        __u16 seccomp_notif_resp;
+        __u16 seccomp_data;
+    };
+
+    struct seccomp_notif {
+        __u64 id;
+        __u32 pid;
+        __u32 flags;
+        struct seccomp_data data;
+    };
+
+    struct seccomp_notif_resp {
+        __u64 id;
+        __s64 val;
+        __s32 error;
+        __u32 flags;
+    };
+
+The ``struct seccomp_notif_sizes`` structure can be used to determine the size
+of the various structures used in seccomp notifications. The size of ``struct
+seccomp_data`` may change in the future, so code should use:
+
+.. code-block:: c
+
+    struct seccomp_notif_sizes sizes;
+    seccomp(SECCOMP_GET_NOTIF_SIZES, 0, &sizes);
+
+to determine the size of the various structures to allocate. See
+samples/seccomp/user-trap.c for an example.
+
+Users can read via ``ioctl(SECCOMP_IOCTL_NOTIF_RECV)`` (or ``poll()``) on a
+seccomp notification fd to receive a ``struct seccomp_notif``, which contains
+five members: the input length of the structure, a unique-per-filter ``id``,
+the ``pid`` of the task which triggered this request (which may be 0 if the
+task is in a pid ns not visible from the listener's pid namespace), a ``flags``
+member which for now only has ``SECCOMP_NOTIF_FLAG_SIGNALED``, representing
+whether or not the notification is a result of a non-fatal signal, and the
+``data`` passed to seccomp. Userspace can then make a decision based on this
+information about what to do, and ``ioctl(SECCOMP_IOCTL_NOTIF_SEND)`` a
+response, indicating what should be returned to userspace. The ``id`` member of
+``struct seccomp_notif_resp`` should be the same ``id`` as in ``struct
+seccomp_notif``.
+
+It is worth noting that ``struct seccomp_data`` contains the values of register
+arguments to the syscall, but does not contain pointers to memory. The task's
+memory is accessible to suitably privileged traces via ``ptrace()`` or
+``/proc/pid/mem``. However, care should be taken to avoid the TOCTOU mentioned
+above in this document: all arguments being read from the tracee's memory
+should be read into the tracer's memory before any policy decisions are made.
+This allows for an atomic decision on syscall arguments.

 Sysctls
 =======
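The receive/reply cycle described in the documentation hunk can be sketched as follows. This is a minimal illustration, not the sample shipped in the commit: `fill_response` and `deny_one` are hypothetical helper names, the code assumes installed uapi headers that already contain this commit's definitions, and it uses compile-time struct sizes rather than the `SECCOMP_GET_NOTIF_SIZES` query the documentation recommends.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/seccomp.h>   /* assumes headers that include this commit */

/* Hypothetical helper: build the reply for one received request. */
static void fill_response(const struct seccomp_notif *req,
			  struct seccomp_notif_resp *resp,
			  int error, long long val)
{
	assert(req != NULL && resp != NULL);
	memset(resp, 0, sizeof(*resp));	/* flags stays 0; the kernel rejects others */
	resp->id = req->id;		/* must echo the id of the request */
	resp->error = error;		/* negative errno, or 0 */
	resp->val = val;		/* syscall return value when error is 0 */
}

/*
 * Hypothetical helper: receive one notification from the listener fd and
 * answer it with -EPERM. Returns 0 on success, -1 with errno set otherwise.
 */
static int deny_one(int listener)
{
	struct seccomp_notif req;
	struct seccomp_notif_resp resp;

	/* Zeroing is harmless here; it is an assumption that some later
	 * kernels expect a zeroed receive buffer. */
	memset(&req, 0, sizeof(req));
	if (ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req) < 0)
		return -1;

	fill_response(&req, &resp, -EPERM, 0);
	if (ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp) < 0)
		return -1;

	return 0;
}
```

A supervisor would typically call `deny_one()` in a loop on the fd returned by `seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog)`, replacing the fixed `-EPERM` with its own policy decision.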
+1 -1
arch/s390/kernel/compat_wrapper.c
···
 COMPAT_SYSCALL_WRAP3(sched_setattr, pid_t, pid, struct sched_attr __user *, attr, unsigned int, flags);
 COMPAT_SYSCALL_WRAP4(sched_getattr, pid_t, pid, struct sched_attr __user *, attr, unsigned int, size, unsigned int, flags);
 COMPAT_SYSCALL_WRAP5(renameat2, int, olddfd, const char __user *, oldname, int, newdfd, const char __user *, newname, unsigned int, flags);
-COMPAT_SYSCALL_WRAP3(seccomp, unsigned int, op, unsigned int, flags, const char __user *, uargs)
+COMPAT_SYSCALL_WRAP3(seccomp, unsigned int, op, unsigned int, flags, void __user *, uargs)
 COMPAT_SYSCALL_WRAP3(getrandom, char __user *, buf, size_t, count, unsigned int, flags)
 COMPAT_SYSCALL_WRAP2(memfd_create, const char __user *, uname, unsigned int, flags)
 COMPAT_SYSCALL_WRAP3(bpf, int, cmd, union bpf_attr *, attr, unsigned int, size);
+5 -4
include/linux/seccomp.h
···

 #include <uapi/linux/seccomp.h>

-#define SECCOMP_FILTER_FLAG_MASK	(SECCOMP_FILTER_FLAG_TSYNC | \
-					 SECCOMP_FILTER_FLAG_LOG | \
-					 SECCOMP_FILTER_FLAG_SPEC_ALLOW)
+#define SECCOMP_FILTER_FLAG_MASK	(SECCOMP_FILTER_FLAG_TSYNC | \
+					 SECCOMP_FILTER_FLAG_LOG | \
+					 SECCOMP_FILTER_FLAG_SPEC_ALLOW | \
+					 SECCOMP_FILTER_FLAG_NEW_LISTENER)

 #ifdef CONFIG_SECCOMP

···
 #endif

 extern long prctl_get_seccomp(void);
-extern long prctl_set_seccomp(unsigned long, char __user *);
+extern long prctl_set_seccomp(unsigned long, void __user *);

 static inline int seccomp_mode(struct seccomp *s)
 {
+1 -1
include/linux/syscalls.h
···
 				int newdfd, const char __user *newname,
 				unsigned int flags);
 asmlinkage long sys_seccomp(unsigned int op, unsigned int flags,
-			    const char __user *uargs);
+			    void __user *uargs);
 asmlinkage long sys_getrandom(char __user *buf, size_t count,
 			      unsigned int flags);
 asmlinkage long sys_memfd_create(const char __user *uname_ptr, unsigned int flags);
+37 -3
include/uapi/linux/seccomp.h
···
 #define SECCOMP_SET_MODE_STRICT		0
 #define SECCOMP_SET_MODE_FILTER		1
 #define SECCOMP_GET_ACTION_AVAIL	2
+#define SECCOMP_GET_NOTIF_SIZES		3

 /* Valid flags for SECCOMP_SET_MODE_FILTER */
-#define SECCOMP_FILTER_FLAG_TSYNC	(1UL << 0)
-#define SECCOMP_FILTER_FLAG_LOG		(1UL << 1)
-#define SECCOMP_FILTER_FLAG_SPEC_ALLOW	(1UL << 2)
+#define SECCOMP_FILTER_FLAG_TSYNC		(1UL << 0)
+#define SECCOMP_FILTER_FLAG_LOG			(1UL << 1)
+#define SECCOMP_FILTER_FLAG_SPEC_ALLOW		(1UL << 2)
+#define SECCOMP_FILTER_FLAG_NEW_LISTENER	(1UL << 3)

 /*
  * All BPF programs must return a 32-bit value.
···
 #define SECCOMP_RET_KILL	 SECCOMP_RET_KILL_THREAD
 #define SECCOMP_RET_TRAP	 0x00030000U /* disallow and force a SIGSYS */
 #define SECCOMP_RET_ERRNO	 0x00050000U /* returns an errno */
+#define SECCOMP_RET_USER_NOTIF	 0x7fc00000U /* notifies userspace */
 #define SECCOMP_RET_TRACE	 0x7ff00000U /* pass to a tracer or disallow */
 #define SECCOMP_RET_LOG		 0x7ffc0000U /* allow after logging */
 #define SECCOMP_RET_ALLOW	 0x7fff0000U /* allow */
···
 	__u64 args[6];
 };

+struct seccomp_notif_sizes {
+	__u16 seccomp_notif;
+	__u16 seccomp_notif_resp;
+	__u16 seccomp_data;
+};
+
+struct seccomp_notif {
+	__u64 id;
+	__u32 pid;
+	__u32 flags;
+	struct seccomp_data data;
+};
+
+struct seccomp_notif_resp {
+	__u64 id;
+	__s64 val;
+	__s32 error;
+	__u32 flags;
+};
+
+#define SECCOMP_IOC_MAGIC		'!'
+#define SECCOMP_IO(nr)			_IO(SECCOMP_IOC_MAGIC, nr)
+#define SECCOMP_IOR(nr, type)		_IOR(SECCOMP_IOC_MAGIC, nr, type)
+#define SECCOMP_IOW(nr, type)		_IOW(SECCOMP_IOC_MAGIC, nr, type)
+#define SECCOMP_IOWR(nr, type)		_IOWR(SECCOMP_IOC_MAGIC, nr, type)
+
+/* Flags for seccomp notification fd ioctl. */
+#define SECCOMP_IOCTL_NOTIF_RECV	SECCOMP_IOWR(0, struct seccomp_notif)
+#define SECCOMP_IOCTL_NOTIF_SEND	SECCOMP_IOWR(1,	\
+						struct seccomp_notif_resp)
+#define SECCOMP_IOCTL_NOTIF_ID_VALID	SECCOMP_IOR(2, __u64)
 #endif /* _UAPI_LINUX_SECCOMP_H */
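The numeric value chosen for `SECCOMP_RET_USER_NOTIF` places it between `SECCOMP_RET_ERRNO` and `SECCOMP_RET_TRACE`, which matters when several stacked filters return different actions. The sketch below models that selection in userspace, assuming the kernel compares the action bits as a signed 32-bit value and keeps the lowest one. The `DEMO_` names and `most_restrictive` are hypothetical; the values for `KILL_PROCESS`, `KILL_THREAD` and the action mask are assumed from the surrounding header rather than shown in this hunk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Action values; USER_NOTIF is the one added by this commit. */
#define DEMO_RET_KILL_PROCESS 0x80000000U /* assumed, not shown in the hunk */
#define DEMO_RET_KILL_THREAD  0x00000000U /* assumed, not shown in the hunk */
#define DEMO_RET_TRAP         0x00030000U
#define DEMO_RET_ERRNO        0x00050000U
#define DEMO_RET_USER_NOTIF   0x7fc00000U
#define DEMO_RET_TRACE        0x7ff00000U
#define DEMO_RET_LOG          0x7ffc0000U
#define DEMO_RET_ALLOW        0x7fff0000U
#define DEMO_RET_ACTION_FULL  0xffff0000U /* assumed action mask */

/* Action bits as a signed value, so KILL_PROCESS sorts below everything.
 * The conversion relies on two's-complement behaviour (gcc and clang). */
static int32_t action_only(uint32_t ret)
{
	return (int32_t)(ret & DEMO_RET_ACTION_FULL);
}

/* Returns the filter result with the lowest (most restrictive) action.
 * With no filters the result is ALLOW. */
static uint32_t most_restrictive(const uint32_t *rets, size_t n)
{
	uint32_t best = DEMO_RET_ALLOW;
	size_t i;

	assert(rets != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (action_only(rets[i]) < action_only(best))
			best = rets[i];
	}
	return best;
}
```

Under these assumptions a `USER_NOTIF` result takes precedence over `TRACE`, `LOG` and `ALLOW` from other filters, but is itself overridden by `ERRNO`, `TRAP` or either kill action.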
+455 -12
kernel/seccomp.c
···
 #endif

 #ifdef CONFIG_SECCOMP_FILTER
+#include <linux/file.h>
 #include <linux/filter.h>
 #include <linux/pid.h>
 #include <linux/ptrace.h>
 #include <linux/security.h>
 #include <linux/tracehook.h>
 #include <linux/uaccess.h>
+#include <linux/anon_inodes.h>
+
+enum notify_state {
+	SECCOMP_NOTIFY_INIT,
+	SECCOMP_NOTIFY_SENT,
+	SECCOMP_NOTIFY_REPLIED,
+};
+
+struct seccomp_knotif {
+	/* The struct pid of the task whose filter triggered the notification */
+	struct task_struct *task;
+
+	/* The "cookie" for this request; this is unique for this filter. */
+	u64 id;
+
+	/*
+	 * The seccomp data. This pointer is valid the entire time this
+	 * notification is active, since it comes from __seccomp_filter which
+	 * eclipses the entire lifecycle here.
+	 */
+	const struct seccomp_data *data;
+
+	/*
+	 * Notification states. When SECCOMP_RET_USER_NOTIF is returned, a
+	 * struct seccomp_knotif is created and starts out in INIT. Once the
+	 * handler reads the notification off of an FD, it transitions to SENT.
+	 * If a signal is received the state transitions back to INIT and
+	 * another message is sent. When the userspace handler replies, state
+	 * transitions to REPLIED.
+	 */
+	enum notify_state state;
+
+	/* The return values, only valid when in SECCOMP_NOTIFY_REPLIED */
+	int error;
+	long val;
+
+	/* Signals when this has entered SECCOMP_NOTIFY_REPLIED */
+	struct completion ready;
+
+	struct list_head list;
+};
+
+/**
+ * struct notification - container for seccomp userspace notifications. Since
+ * most seccomp filters will not have notification listeners attached and this
+ * structure is fairly large, we store the notification-specific stuff in a
+ * separate structure.
+ *
+ * @request: A semaphore that users of this notification can wait on for
+ *           changes. Actual reads and writes are still controlled with
+ *           filter->notify_lock.
+ * @next_id: The id of the next request.
+ * @notifications: A list of struct seccomp_knotif elements.
+ * @wqh: A wait queue for poll.
+ */
+struct notification {
+	struct semaphore request;
+	u64 next_id;
+	struct list_head notifications;
+	wait_queue_head_t wqh;
+};

 /**
  * struct seccomp_filter - container for seccomp BPF programs
···
  * @log: true if all actions except for SECCOMP_RET_ALLOW should be logged
  * @prev: points to a previously installed, or inherited, filter
  * @prog: the BPF program to evaluate
+ * @notif: the struct that holds all notification related information
+ * @notify_lock: A lock for all notification-related accesses.
  *
  * seccomp_filter objects are organized in a tree linked via the @prev
  * pointer. For any task, it appears to be a singly-linked list starting
···
 	bool log;
 	struct seccomp_filter *prev;
 	struct bpf_prog *prog;
+	struct notification *notif;
+	struct mutex notify_lock;
 };

 /* Limit any path through the tree to 256KB worth of instructions. */
···
 static u32 seccomp_run_filters(const struct seccomp_data *sd,
 			       struct seccomp_filter **match)
 {
-	struct seccomp_data sd_local;
 	u32 ret = SECCOMP_RET_ALLOW;
 	/* Make sure cross-thread synced filter points somewhere sane. */
 	struct seccomp_filter *f =
···
 	/* Ensure unexpected behavior doesn't result in failing open. */
 	if (WARN_ON(f == NULL))
 		return SECCOMP_RET_KILL_PROCESS;
-
-	if (!sd) {
-		populate_seccomp_data(&sd_local);
-		sd = &sd_local;
-	}

 	/*
 	 * All filters in the list are evaluated and the lowest BPF return
···
 	if (!sfilter)
 		return ERR_PTR(-ENOMEM);

+	mutex_init(&sfilter->notify_lock);
 	ret = bpf_prog_create_from_user(&sfilter->prog, fprog,
 					seccomp_check_filter, save_orig);
 	if (ret < 0) {
···

 static void __get_seccomp_filter(struct seccomp_filter *filter)
 {
-	/* Reference count is bounded by the number of total processes. */
 	refcount_inc(&filter->usage);
 }

···
 #define SECCOMP_LOG_TRACE		(1 << 4)
 #define SECCOMP_LOG_LOG			(1 << 5)
 #define SECCOMP_LOG_ALLOW		(1 << 6)
+#define SECCOMP_LOG_USER_NOTIF		(1 << 7)

 static u32 seccomp_actions_logged = SECCOMP_LOG_KILL_PROCESS |
 				    SECCOMP_LOG_KILL_THREAD  |
 				    SECCOMP_LOG_TRAP  |
 				    SECCOMP_LOG_ERRNO |
+				    SECCOMP_LOG_USER_NOTIF |
 				    SECCOMP_LOG_TRACE |
 				    SECCOMP_LOG_LOG;

···
 		break;
 	case SECCOMP_RET_TRACE:
 		log = requested && seccomp_actions_logged & SECCOMP_LOG_TRACE;
+		break;
+	case SECCOMP_RET_USER_NOTIF:
+		log = requested && seccomp_actions_logged & SECCOMP_LOG_USER_NOTIF;
 		break;
 	case SECCOMP_RET_LOG:
 		log = seccomp_actions_logged & SECCOMP_LOG_LOG;
···
 #else

 #ifdef CONFIG_SECCOMP_FILTER
+static u64 seccomp_next_notify_id(struct seccomp_filter *filter)
+{
+	/*
+	 * Note: overflow is ok here, the id just needs to be unique per
+	 * filter.
+	 */
+	lockdep_assert_held(&filter->notify_lock);
+	return filter->notif->next_id++;
+}
+
+static void seccomp_do_user_notification(int this_syscall,
+					 struct seccomp_filter *match,
+					 const struct seccomp_data *sd)
+{
+	int err;
+	long ret = 0;
+	struct seccomp_knotif n = {};
+
+	mutex_lock(&match->notify_lock);
+	err = -ENOSYS;
+	if (!match->notif)
+		goto out;
+
+	n.task = current;
+	n.state = SECCOMP_NOTIFY_INIT;
+	n.data = sd;
+	n.id = seccomp_next_notify_id(match);
+	init_completion(&n.ready);
+	list_add(&n.list, &match->notif->notifications);
+
+	up(&match->notif->request);
+	wake_up_poll(&match->notif->wqh, EPOLLIN | EPOLLRDNORM);
+	mutex_unlock(&match->notify_lock);
+
+	/*
+	 * This is where we wait for a reply from userspace.
+	 */
+	err = wait_for_completion_interruptible(&n.ready);
+	mutex_lock(&match->notify_lock);
+	if (err == 0) {
+		ret = n.val;
+		err = n.error;
+	}
+
+	/*
+	 * Note that it's possible the listener died in between the time when
+	 * we were notified of a respons (or a signal) and when we were able to
+	 * re-acquire the lock, so only delete from the list if the
+	 * notification actually exists.
+	 *
+	 * Also note that this test is only valid because there's no way to
+	 * *reattach* to a notifier right now. If one is added, we'll need to
+	 * keep track of the notif itself and make sure they match here.
+	 */
+	if (match->notif)
+		list_del(&n.list);
+out:
+	mutex_unlock(&match->notify_lock);
+	syscall_set_return_value(current, task_pt_regs(current),
+				 err, ret);
+}
+
 static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
 			    const bool recheck_after_trace)
 {
 	u32 filter_ret, action;
 	struct seccomp_filter *match = NULL;
 	int data;
+	struct seccomp_data sd_local;

 	/*
 	 * Make sure that any changes to mode from another thread have
 	 * been seen after TIF_SECCOMP was seen.
 	 */
 	rmb();
+
+	if (!sd) {
+		populate_seccomp_data(&sd_local);
+		sd = &sd_local;
+	}

 	filter_ret = seccomp_run_filters(sd, &match);
 	data = filter_ret & SECCOMP_RET_DATA;
···
 			return -1;

 		return 0;
+
+	case SECCOMP_RET_USER_NOTIF:
+		seccomp_do_user_notification(this_syscall, match, sd);
+		goto skip;

 	case SECCOMP_RET_LOG:
 		seccomp_log(this_syscall, 0, action, true);
···
 }

 #ifdef CONFIG_SECCOMP_FILTER
+static int seccomp_notify_release(struct inode *inode, struct file *file)
+{
+	struct seccomp_filter *filter = file->private_data;
+	struct seccomp_knotif *knotif;
+
+	mutex_lock(&filter->notify_lock);
+
+	/*
+	 * If this file is being closed because e.g. the task who owned it
+	 * died, let's wake everyone up who was waiting on us.
+	 */
+	list_for_each_entry(knotif, &filter->notif->notifications, list) {
+		if (knotif->state == SECCOMP_NOTIFY_REPLIED)
+			continue;
+
+		knotif->state = SECCOMP_NOTIFY_REPLIED;
+		knotif->error = -ENOSYS;
+		knotif->val = 0;
+
+		complete(&knotif->ready);
+	}
+
+	kfree(filter->notif);
+	filter->notif = NULL;
+	mutex_unlock(&filter->notify_lock);
+	__put_seccomp_filter(filter);
+	return 0;
+}
+
+static long seccomp_notify_recv(struct seccomp_filter *filter,
+				void __user *buf)
+{
+	struct seccomp_knotif *knotif = NULL, *cur;
+	struct seccomp_notif unotif;
+	ssize_t ret;
+
+	memset(&unotif, 0, sizeof(unotif));
+
+	ret = down_interruptible(&filter->notif->request);
+	if (ret < 0)
+		return ret;
+
+	mutex_lock(&filter->notify_lock);
+	list_for_each_entry(cur, &filter->notif->notifications, list) {
+		if (cur->state == SECCOMP_NOTIFY_INIT) {
+			knotif = cur;
+			break;
+		}
+	}
+
+	/*
+	 * If we didn't find a notification, it could be that the task was
+	 * interrupted by a fatal signal between the time we were woken and
+	 * when we were able to acquire the rw lock.
+	 */
+	if (!knotif) {
+		ret = -ENOENT;
+		goto out;
+	}
+
+	unotif.id = knotif->id;
+	unotif.pid = task_pid_vnr(knotif->task);
+	unotif.data = *(knotif->data);
+
+	knotif->state = SECCOMP_NOTIFY_SENT;
+	wake_up_poll(&filter->notif->wqh, EPOLLOUT | EPOLLWRNORM);
+	ret = 0;
+out:
+	mutex_unlock(&filter->notify_lock);
+
+	if (ret == 0 && copy_to_user(buf, &unotif, sizeof(unotif))) {
+		ret = -EFAULT;
+
+		/*
+		 * Userspace screwed up. To make sure that we keep this
+		 * notification alive, let's reset it back to INIT. It
+		 * may have died when we released the lock, so we need to make
+		 * sure it's still around.
+		 */
+		knotif = NULL;
+		mutex_lock(&filter->notify_lock);
+		list_for_each_entry(cur, &filter->notif->notifications, list) {
+			if (cur->id == unotif.id) {
+				knotif = cur;
+				break;
+			}
+		}
+
+		if (knotif) {
+			knotif->state = SECCOMP_NOTIFY_INIT;
+			up(&filter->notif->request);
+		}
+		mutex_unlock(&filter->notify_lock);
+	}
+
+	return ret;
+}
+
+static long seccomp_notify_send(struct seccomp_filter *filter,
+				void __user *buf)
+{
+	struct seccomp_notif_resp resp = {};
+	struct seccomp_knotif *knotif = NULL, *cur;
+	long ret;
+
+	if (copy_from_user(&resp, buf, sizeof(resp)))
+		return -EFAULT;
+
+	if (resp.flags)
+		return -EINVAL;
+
+	ret = mutex_lock_interruptible(&filter->notify_lock);
+	if (ret < 0)
+		return ret;
+
+	list_for_each_entry(cur, &filter->notif->notifications, list) {
+		if (cur->id == resp.id) {
+			knotif = cur;
+			break;
+		}
+	}
+
+	if (!knotif) {
+		ret = -ENOENT;
+		goto out;
+	}
+
+	/* Allow exactly one reply. */
+	if (knotif->state != SECCOMP_NOTIFY_SENT) {
+		ret = -EINPROGRESS;
+		goto out;
+	}
+
+	ret = 0;
+	knotif->state = SECCOMP_NOTIFY_REPLIED;
+	knotif->error = resp.error;
+	knotif->val = resp.val;
+	complete(&knotif->ready);
+out:
+	mutex_unlock(&filter->notify_lock);
+	return ret;
+}
+
+static long seccomp_notify_id_valid(struct seccomp_filter *filter,
+				    void __user *buf)
+{
+	struct seccomp_knotif *knotif = NULL;
+	u64 id;
+	long ret;
+
+	if (copy_from_user(&id, buf, sizeof(id)))
+		return -EFAULT;
+
+	ret = mutex_lock_interruptible(&filter->notify_lock);
+	if (ret < 0)
+		return ret;
+
+	ret = -ENOENT;
+	list_for_each_entry(knotif, &filter->notif->notifications, list) {
+		if (knotif->id == id) {
+			if (knotif->state == SECCOMP_NOTIFY_SENT)
+				ret = 0;
+			goto out;
+		}
+	}
+
+out:
+	mutex_unlock(&filter->notify_lock);
+	return ret;
+}
+
+static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,
+				 unsigned long arg)
+{
+	struct seccomp_filter *filter = file->private_data;
+	void __user *buf = (void __user *)arg;
+
+	switch (cmd) {
+	case SECCOMP_IOCTL_NOTIF_RECV:
+		return seccomp_notify_recv(filter, buf);
+	case SECCOMP_IOCTL_NOTIF_SEND:
+		return seccomp_notify_send(filter, buf);
+	case SECCOMP_IOCTL_NOTIF_ID_VALID:
+		return seccomp_notify_id_valid(filter, buf);
+	default:
+		return -EINVAL;
+	}
+}
+
+static __poll_t seccomp_notify_poll(struct file *file,
+				    struct poll_table_struct *poll_tab)
+{
+	struct seccomp_filter *filter = file->private_data;
+	__poll_t ret = 0;
+	struct seccomp_knotif *cur;
+
+	poll_wait(file, &filter->notif->wqh, poll_tab);
+
+	if (mutex_lock_interruptible(&filter->notify_lock) < 0)
+		return EPOLLERR;
+
+	list_for_each_entry(cur, &filter->notif->notifications, list) {
+		if (cur->state == SECCOMP_NOTIFY_INIT)
+			ret |= EPOLLIN | EPOLLRDNORM;
+		if (cur->state == SECCOMP_NOTIFY_SENT)
+			ret |= EPOLLOUT | EPOLLWRNORM;
+		if ((ret & EPOLLIN) && (ret & EPOLLOUT))
+			break;
+	}
+
+	mutex_unlock(&filter->notify_lock);
+
+	return ret;
+}
+
+static const struct file_operations seccomp_notify_ops = {
+	.poll = seccomp_notify_poll,
+	.release = seccomp_notify_release,
+	.unlocked_ioctl = seccomp_notify_ioctl,
+};
+
+static struct file *init_listener(struct seccomp_filter *filter)
+{
+	struct file *ret = ERR_PTR(-EBUSY);
+	struct seccomp_filter *cur;
+
+	for (cur = current->seccomp.filter; cur; cur = cur->prev) {
+		if (cur->notif)
+			goto out;
+	}
+
+	ret = ERR_PTR(-ENOMEM);
+	filter->notif = kzalloc(sizeof(*(filter->notif)), GFP_KERNEL);
+	if (!filter->notif)
+		goto out;
+
+	sema_init(&filter->notif->request, 0);
+	filter->notif->next_id = get_random_u64();
+	INIT_LIST_HEAD(&filter->notif->notifications);
+	init_waitqueue_head(&filter->notif->wqh);
+
+	ret = anon_inode_getfile("seccomp notify", &seccomp_notify_ops,
+				 filter, O_RDWR);
+	if (IS_ERR(ret))
+		goto out_notif;
+
+	/* The file has a reference to it now */
+	__get_seccomp_filter(filter);
+
+out_notif:
+	if (IS_ERR(ret))
+		kfree(filter->notif);
+out:
+	return ret;
+}
+
 /**
  * seccomp_set_mode_filter: internal function for setting seccomp filter
  * @flags:  flags to change filter behavior
···
 	const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
 	struct seccomp_filter *prepared = NULL;
 	long ret = -EINVAL;
+	int listener = -1;
+	struct file *listener_f = NULL;

 	/* Validate flags. */
 	if (flags & ~SECCOMP_FILTER_FLAG_MASK)
···
 	if (IS_ERR(prepared))
 		return PTR_ERR(prepared);

+	if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
+		listener = get_unused_fd_flags(O_CLOEXEC);
+		if (listener < 0) {
+			ret = listener;
+			goto out_free;
+		}
+
+		listener_f = init_listener(prepared);
+		if (IS_ERR(listener_f)) {
+			put_unused_fd(listener);
+			ret = PTR_ERR(listener_f);
+			goto out_free;
+		}
+	}
+
 	/*
 	 * Make sure we cannot change seccomp or nnp state via TSYNC
 	 * while another thread is in the middle of calling exec.
 	 */
 	if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
 	    mutex_lock_killable(&current->signal->cred_guard_mutex))
-		goto out_free;
+		goto out_put_fd;

 	spin_lock_irq(&current->sighand->siglock);

···
 	spin_unlock_irq(&current->sighand->siglock);
 	if (flags & SECCOMP_FILTER_FLAG_TSYNC)
 		mutex_unlock(&current->signal->cred_guard_mutex);
+out_put_fd:
+	if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
+		if (ret < 0) {
+			fput(listener_f);
+			put_unused_fd(listener);
+		} else {
+			fd_install(listener, listener_f);
+			ret = listener;
+		}
+	}
 out_free:
 	seccomp_filter_free(prepared);
 	return ret;
···
 	case SECCOMP_RET_KILL_THREAD:
 	case SECCOMP_RET_TRAP:
 	case SECCOMP_RET_ERRNO:
+	case SECCOMP_RET_USER_NOTIF:
 	case SECCOMP_RET_TRACE:
 	case SECCOMP_RET_LOG:
 	case SECCOMP_RET_ALLOW:
···
 	return 0;
 }

+static long seccomp_get_notif_sizes(void __user *usizes)
+{
+	struct seccomp_notif_sizes sizes = {
+		.seccomp_notif = sizeof(struct seccomp_notif),
+		.seccomp_notif_resp = sizeof(struct seccomp_notif_resp),
+		.seccomp_data = sizeof(struct seccomp_data),
+	};
+
+	if (copy_to_user(usizes, &sizes, sizeof(sizes)))
+		return -EFAULT;
+
+	return 0;
+}
+
 /* Common entry point for both prctl and syscall. */
 static long do_seccomp(unsigned int op, unsigned int flags,
-		       const char __user *uargs)
+		       void __user *uargs)
 {
 	switch (op) {
 	case SECCOMP_SET_MODE_STRICT:
···
 			return -EINVAL;

 		return seccomp_get_action_avail(uargs);
+	case SECCOMP_GET_NOTIF_SIZES:
+		if (flags != 0)
+			return -EINVAL;
+
+		return seccomp_get_notif_sizes(uargs);
 	default:
 		return -EINVAL;
 	}
 }

 SYSCALL_DEFINE3(seccomp, unsigned int, op, unsigned int, flags,
-			 const char __user *, uargs)
+			 void __user *, uargs)
 {
 	return do_seccomp(op, flags, uargs);
 }
···
  *
  * Returns 0 on success or -EINVAL on failure.
  */
-long prctl_set_seccomp(unsigned long seccomp_mode, char __user *filter)
+long prctl_set_seccomp(unsigned long seccomp_mode, void __user *filter)
 {
 	unsigned int op;
-	char __user *uargs;
+	void __user *uargs;

 	switch (seccomp_mode) {
 	case SECCOMP_MODE_STRICT:
···
 #define SECCOMP_RET_KILL_THREAD_NAME	"kill_thread"
 #define SECCOMP_RET_TRAP_NAME		"trap"
 #define SECCOMP_RET_ERRNO_NAME		"errno"
+#define SECCOMP_RET_USER_NOTIF_NAME	"user_notif"
 #define SECCOMP_RET_TRACE_NAME		"trace"
 #define SECCOMP_RET_LOG_NAME		"log"
 #define SECCOMP_RET_ALLOW_NAME		"allow"
···
 				SECCOMP_RET_KILL_THREAD_NAME	" "
 				SECCOMP_RET_TRAP_NAME		" "
 				SECCOMP_RET_ERRNO_NAME		" "
+				SECCOMP_RET_USER_NOTIF_NAME	" "
 				SECCOMP_RET_TRACE_NAME		" "
 				SECCOMP_RET_LOG_NAME		" "
 				SECCOMP_RET_ALLOW_NAME;
···
 	{ SECCOMP_LOG_KILL_THREAD, SECCOMP_RET_KILL_THREAD_NAME },
 	{ SECCOMP_LOG_TRAP, SECCOMP_RET_TRAP_NAME },
 	{ SECCOMP_LOG_ERRNO, SECCOMP_RET_ERRNO_NAME },
+	{ SECCOMP_LOG_USER_NOTIF, SECCOMP_RET_USER_NOTIF_NAME },
 	{ SECCOMP_LOG_TRACE, SECCOMP_RET_TRACE_NAME },
 	{ SECCOMP_LOG_LOG, SECCOMP_RET_LOG_NAME },
 	{ SECCOMP_LOG_ALLOW, SECCOMP_RET_ALLOW_NAME },
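The notification lifecycle implemented in kernel/seccomp.c (INIT on creation, SENT once a listener has received it, REPLIED after exactly one response) can be modelled in plain userspace C. The sketch below is only a model under simplifying assumptions: it uses an array instead of the kernel list, omits locking, blocking, completions and signal handling, and every `demo_` name is hypothetical. The error codes mirror those returned by `seccomp_notify_recv()`, `seccomp_notify_send()` and `seccomp_notify_id_valid()` in the diff.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

enum demo_state { DEMO_INIT, DEMO_SENT, DEMO_REPLIED };

struct demo_knotif {
	uint64_t id;
	enum demo_state state;
	int error;
	long val;
};

/* Models NOTIF_RECV: hand out the first INIT entry and mark it SENT. */
static int demo_recv(struct demo_knotif *q, size_t n, uint64_t *id_out)
{
	size_t i;

	assert(id_out != NULL);
	for (i = 0; i < n; i++) {
		if (q[i].state == DEMO_INIT) {
			q[i].state = DEMO_SENT;
			*id_out = q[i].id;
			return 0;
		}
	}
	return -ENOENT;
}

/* Models NOTIF_SEND: exactly one reply, accepted only while SENT. */
static int demo_send(struct demo_knotif *q, size_t n, uint64_t id,
		     int error, long val)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (q[i].id != id)
			continue;
		if (q[i].state != DEMO_SENT)
			return -EINPROGRESS;
		q[i].state = DEMO_REPLIED;
		q[i].error = error;
		q[i].val = val;
		return 0;
	}
	return -ENOENT;
}

/* Models NOTIF_ID_VALID: 0 only while the request is still SENT. */
static int demo_id_valid(const struct demo_knotif *q, size_t n, uint64_t id)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (q[i].id == id)
			return q[i].state == DEMO_SENT ? 0 : -ENOENT;
	}
	return -ENOENT;
}
```

In this model, replying before the request has been received, or replying a second time, yields `-EINPROGRESS`, and an unknown id yields `-ENOENT`, which is the behaviour a supervisor has to be prepared for when using the real ioctls.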
+1
samples/seccomp/.gitignore
···
 bpf-direct
 bpf-fancy
 dropper
+user-trap
+6 -1
samples/seccomp/Makefile
···
 # SPDX-License-Identifier: GPL-2.0
 ifndef CROSS_COMPILE
-hostprogs-$(CONFIG_SAMPLE_SECCOMP) := bpf-fancy dropper bpf-direct
+hostprogs-$(CONFIG_SAMPLE_SECCOMP) := bpf-fancy dropper bpf-direct user-trap

 HOSTCFLAGS_bpf-fancy.o += -I$(objtree)/usr/include
 HOSTCFLAGS_bpf-fancy.o += -idirafter $(objtree)/include
···
 HOSTCFLAGS_bpf-direct.o += -I$(objtree)/usr/include
 HOSTCFLAGS_bpf-direct.o += -idirafter $(objtree)/include
 bpf-direct-objs := bpf-direct.o
+
+HOSTCFLAGS_user-trap.o += -I$(objtree)/usr/include
+HOSTCFLAGS_user-trap.o += -idirafter $(objtree)/include
+user-trap-objs := user-trap.o

 # Try to match the kernel target.
 ifndef CONFIG_64BIT
···
 HOSTLDLIBS_bpf-direct += $(MFLAG)
 HOSTLDLIBS_bpf-fancy += $(MFLAG)
 HOSTLDLIBS_dropper += $(MFLAG)
+HOSTLDLIBS_user-trap += $(MFLAG)
 endif
 always := $(hostprogs-m)
 endif
+375
samples/seccomp/user-trap.c
···
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <string.h>
+#include <stddef.h>
+#include <sys/sysmacros.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <sys/socket.h>
+#include <sys/stat.h>
+#include <sys/mman.h>
+#include <sys/syscall.h>
+#include <sys/user.h>
+#include <sys/ioctl.h>
+#include <sys/ptrace.h>
+#include <sys/mount.h>
+#include <linux/limits.h>
+#include <linux/filter.h>
+#include <linux/seccomp.h>
+
+#define ARRAY_SIZE(x) (sizeof(x) / sizeof(*(x)))
+
+static int seccomp(unsigned int op, unsigned int flags, void *args)
+{
+	errno = 0;
+	return syscall(__NR_seccomp, op, flags, args);
+}
+
+static int send_fd(int sock, int fd)
+{
+	struct msghdr msg = {};
+	struct cmsghdr *cmsg;
+	char buf[CMSG_SPACE(sizeof(int))] = {0}, c = 'c';
+	struct iovec io = {
+		.iov_base = &c,
+		.iov_len = 1,
+	};
+
+	msg.msg_iov = &io;
+	msg.msg_iovlen = 1;
+	msg.msg_control = buf;
+	msg.msg_controllen = sizeof(buf);
+	cmsg = CMSG_FIRSTHDR(&msg);
+	cmsg->cmsg_level = SOL_SOCKET;
+	cmsg->cmsg_type = SCM_RIGHTS;
+	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
+	*((int *)CMSG_DATA(cmsg)) = fd;
+	msg.msg_controllen = cmsg->cmsg_len;
+
+	if (sendmsg(sock, &msg, 0) < 0) {
+		perror("sendmsg");
+		return -1;
+	}
+
+	return 0;
+}
+
+static int recv_fd(int sock)
+{
+	struct msghdr msg = {};
+	struct cmsghdr *cmsg;
+	char buf[CMSG_SPACE(sizeof(int))] = {0}, c = 'c';
+	struct iovec io = {
+		.iov_base = &c,
+		.iov_len = 1,
+	};
+
+	msg.msg_iov = &io;
+	msg.msg_iovlen = 1;
+	msg.msg_control = buf;
+	msg.msg_controllen = sizeof(buf);
+
+	if (recvmsg(sock, &msg, 0) < 0) {
+		perror("recvmsg");
+		return -1;
+	}
+
+	cmsg = CMSG_FIRSTHDR(&msg);
+
+	return *((int *)CMSG_DATA(cmsg));
+}
+
+static int user_trap_syscall(int nr, unsigned int flags)
+{
+	struct sock_filter filter[] = {
+		BPF_STMT(BPF_LD+BPF_W+BPF_ABS,
+			offsetof(struct seccomp_data, nr)),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, nr, 0, 1),
+		BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_USER_NOTIF),
+		BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
+	};
+
+	struct sock_fprog prog = {
+		.len = (unsigned short)ARRAY_SIZE(filter),
+		.filter = filter,
+	};
+
+	return seccomp(SECCOMP_SET_MODE_FILTER, flags, &prog);
+}
+
+static int handle_req(struct seccomp_notif *req,
+		      struct seccomp_notif_resp *resp, int listener)
+{
+	char path[PATH_MAX], source[PATH_MAX], target[PATH_MAX];
+	int ret = -1, mem;
+
+	resp->id = req->id;
+	resp->error = -EPERM;
+	resp->val = 0;
+
+	if (req->data.nr != __NR_mount) {
+		fprintf(stderr, "huh? trapped something besides mount? %d\n", req->data.nr);
+		return -1;
+	}
+
+	/* Only allow bind mounts. */
+	if (!(req->data.args[3] & MS_BIND))
+		return 0;
+
+	/*
+	 * Ok, let's read the task's memory to see where they wanted their
+	 * mount to go.
+	 */
+	snprintf(path, sizeof(path), "/proc/%d/mem", req->pid);
+	mem = open(path, O_RDONLY);
+	if (mem < 0) {
+		perror("open mem");
+		return -1;
+	}
+
+	/*
+	 * Now we avoid a TOCTOU: we referred to the task by its pid, but since
+	 * the task that made the syscall may have died, we need to confirm that
+	 * the pid is still valid after we open its /proc/pid/mem file. We can
+	 * ask the listener fd this as follows.
+	 *
+	 * Note that this check should occur *after* any task-specific
+	 * resources are opened, to make sure that the task has not died and
+	 * we're not wrongly reading someone else's state in order to make
+	 * decisions.
+	 */
+	if (ioctl(listener, SECCOMP_IOCTL_NOTIF_ID_VALID, &req->id) < 0) {
+		fprintf(stderr, "task died before we could map its memory\n");
+		goto out;
+	}
+
+	/*
+	 * Phew, we've got the right /proc/pid/mem. Now we can read it. Note
+	 * that to avoid another TOCTOU, we should read all of the pointer args
+	 * before we decide to allow the syscall.
+	 */
+	if (lseek(mem, req->data.args[0], SEEK_SET) < 0) {
+		perror("seek");
+		goto out;
+	}
+
+	ret = read(mem, source, sizeof(source));
+	if (ret < 0) {
+		perror("read");
+		goto out;
+	}
+
+	if (lseek(mem, req->data.args[1], SEEK_SET) < 0) {
+		perror("seek");
+		goto out;
+	}
+
+	ret = read(mem, target, sizeof(target));
+	if (ret < 0) {
+		perror("read");
+		goto out;
+	}
+
+	/*
+	 * Our policy is to only allow bind mounts inside /tmp. This isn't very
+	 * interesting, because we could do unprivileged bind mounts with user
+	 * namespaces already, but you get the idea.
+	 */
+	if (!strncmp(source, "/tmp/", 5) && !strncmp(target, "/tmp/", 5)) {
+		if (mount(source, target, NULL, req->data.args[3], NULL) < 0) {
+			ret = -1;
+			perror("actual mount");
+			goto out;
+		}
+		resp->error = 0;
+	}
+
+	/* Even if we didn't allow it because of policy, generating the
+	 * response was a success, because we want to tell the worker EPERM.
+	 */
+	ret = 0;
+
+out:
+	close(mem);
+	return ret;
+}
+
+int main(void)
+{
+	int sk_pair[2], ret = 1, status, listener;
+	pid_t worker = 0, tracer = 0;
+
+	if (socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair) < 0) {
+		perror("socketpair");
+		return 1;
+	}
+
+	worker = fork();
+	if (worker < 0) {
+		perror("fork");
+		goto close_pair;
+	}
+
+	if (worker == 0) {
+		listener = user_trap_syscall(__NR_mount,
+					     SECCOMP_FILTER_FLAG_NEW_LISTENER);
+		if (listener < 0) {
+			perror("seccomp");
+			exit(1);
+		}
+
+		/*
+		 * Drop privileges. We definitely can't mount as uid 1000.
+		 */
+		if (setuid(1000) < 0) {
+			perror("setuid");
+			exit(1);
+		}
+
+		/*
+		 * Send the listener to the parent; also serves as
+		 * synchronization.
+		 */
+		if (send_fd(sk_pair[1], listener) < 0)
+			exit(1);
+		close(listener);
+
+		if (mkdir("/tmp/foo", 0755) < 0) {
+			perror("mkdir");
+			exit(1);
+		}
+
+		/*
+		 * Try a bad mount just for grins.
+		 */
+		if (mount("/dev/sda", "/tmp/foo", NULL, 0, NULL) != -1) {
+			fprintf(stderr, "huh? mounted /dev/sda?\n");
+			exit(1);
+		}
+
+		if (errno != EPERM) {
+			perror("bad error from mount");
+			exit(1);
+		}
+
+		/*
+		 * Ok, we expect this one to succeed.
+		 */
+		if (mount("/tmp/foo", "/tmp/foo", NULL, MS_BIND, NULL) < 0) {
+			perror("mount");
+			exit(1);
+		}
+
+		exit(0);
+	}
+
+	/*
+	 * Get the listener from the child.
+	 */
+	listener = recv_fd(sk_pair[0]);
+	if (listener < 0)
+		goto out_kill;
+
+	/*
+	 * Fork a task to handle the requests. This isn't strictly necessary,
+	 * but it makes the particular writing of this sample easier, since we
+	 * can just wait for the tracee to exit and kill the tracer.
+	 */
+	tracer = fork();
+	if (tracer < 0) {
+		perror("fork");
+		goto out_kill;
+	}
+
+	if (tracer == 0) {
+		struct seccomp_notif *req;
+		struct seccomp_notif_resp *resp;
+		struct seccomp_notif_sizes sizes;
+
+		if (seccomp(SECCOMP_GET_NOTIF_SIZES, 0, &sizes) < 0) {
+			perror("seccomp(GET_NOTIF_SIZES)");
+			goto out_close;
+		}
+
+		/* Zero the full kernel-reported size, not just sizeof(*req). */
+		req = malloc(sizes.seccomp_notif);
+		if (!req)
+			goto out_close;
+		memset(req, 0, sizes.seccomp_notif);
+
+		resp = malloc(sizes.seccomp_notif_resp);
+		if (!resp)
+			goto out_req;
+		memset(resp, 0, sizes.seccomp_notif_resp);
+
+		while (1) {
+			if (ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, req)) {
+				perror("ioctl recv");
+				goto out_resp;
+			}
+
+			if (handle_req(req, resp, listener) < 0)
+				goto out_resp;
+
+			/*
+			 * ENOENT here means that the task may have gotten a
+			 * signal and restarted the syscall. It's up to the
+			 * handler to decide what to do in this case, but for
+			 * the sample code, we just ignore it. Probably
+			 * something better should happen, like undoing the
+			 * mount, or keeping track of the args to make sure we
+			 * don't do it again.
+			 */
+			if (ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, resp) < 0 &&
+			    errno != ENOENT) {
+				perror("ioctl send");
+				goto out_resp;
+			}
+		}
+out_resp:
+		free(resp);
+out_req:
+		free(req);
+out_close:
+		close(listener);
+		exit(1);
+	}
+
+	close(listener);
+
+	if (waitpid(worker, &status, 0) != worker) {
+		perror("waitpid");
+		goto out_kill;
+	}
+
+	if (umount2("/tmp/foo", MNT_DETACH) < 0 && errno != EINVAL) {
+		perror("umount2");
+		goto out_kill;
+	}
+
+	if (remove("/tmp/foo") < 0 && errno != ENOENT) {
+		perror("remove");
+		exit(1);
+	}
+
+	if (!WIFEXITED(status) || WEXITSTATUS(status)) {
+		fprintf(stderr, "worker exited nonzero\n");
+		goto out_kill;
+	}
+
+	ret = 0;
+
+out_kill:
+	if (tracer > 0)
+		kill(tracer, SIGKILL);
+	if (worker > 0)
+		kill(worker, SIGKILL);
+
+close_pair:
+	close(sk_pair[0]);
+	close(sk_pair[1]);
+	return ret;
+}
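The sample's policy check in handle_req() is a plain prefix test on bytes copied out of the target's memory. As a rough sketch only (this is not part of the commit, and `path_allowed_under_tmp` is a hypothetical helper name), a handler that wanted a slightly stricter check might first confirm that the copied bytes contain a NUL terminator and refuse any ".." component before trusting the "/tmp/" prefix. It assumes the caller passes the byte count actually returned by read():

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the commit: decide whether a path copied
 * out of another task's memory is acceptable under a "/tmp only" policy.
 * buf/len describe the raw bytes obtained from read(); the string is only
 * trusted if a NUL terminator appears within those bytes.
 */
static bool path_allowed_under_tmp(const char *buf, size_t len)
{
	const char *p;
	size_t n;

	/* Reject buffers that are not NUL-terminated within len bytes. */
	if (!buf || !memchr(buf, '\0', len))
		return false;

	/* Same prefix test as the sample's policy. */
	if (strncmp(buf, "/tmp/", 5) != 0)
		return false;

	/* Walk the components after "/tmp/" and refuse any "..". */
	p = buf + 5;
	while (*p) {
		n = strcspn(p, "/");
		if (n == 2 && p[0] == '.' && p[1] == '.')
			return false;
		p += n;
		if (*p == '/')
			p++;
	}

	return true;
}
```

This is a lexical check only. It does not resolve symlinks, so a real supervisor would likely need something stronger (for example, resolving the path relative to the target's root) before performing the mount on the task's behalf.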
+445 -2
tools/testing/selftests/seccomp/seccomp_bpf.c
···
  * Test code for seccomp bpf.
  */
 
+#define _GNU_SOURCE
 #include <sys/types.h>
 
 /*
···
 #include <sys/fcntl.h>
 #include <sys/mman.h>
 #include <sys/times.h>
+#include <sys/socket.h>
+#include <sys/ioctl.h>
 
-#define _GNU_SOURCE
 #include <unistd.h>
 #include <sys/syscall.h>
+#include <poll.h>
 
 #include "../kselftest_harness.h"
 
···
 #define SECCOMP_GET_ACTION_AVAIL 2
 #endif
 
+#ifndef SECCOMP_GET_NOTIF_SIZES
+#define SECCOMP_GET_NOTIF_SIZES 3
+#endif
+
 #ifndef SECCOMP_FILTER_FLAG_TSYNC
 #define SECCOMP_FILTER_FLAG_TSYNC (1UL << 0)
 #endif
···
 struct seccomp_metadata {
 	__u64 filter_off;	/* Input: which filter */
 	__u64 flags;		/* Output: filter's flags */
+};
+#endif
+
+#ifndef SECCOMP_FILTER_FLAG_NEW_LISTENER
+#define SECCOMP_FILTER_FLAG_NEW_LISTENER (1UL << 3)
+
+#define SECCOMP_RET_USER_NOTIF 0x7fc00000U
+
+#define SECCOMP_IOC_MAGIC		'!'
+#define SECCOMP_IO(nr)			_IO(SECCOMP_IOC_MAGIC, nr)
+#define SECCOMP_IOR(nr, type)		_IOR(SECCOMP_IOC_MAGIC, nr, type)
+#define SECCOMP_IOW(nr, type)		_IOW(SECCOMP_IOC_MAGIC, nr, type)
+#define SECCOMP_IOWR(nr, type)		_IOWR(SECCOMP_IOC_MAGIC, nr, type)
+
+/* Flags for seccomp notification fd ioctl. */
+#define SECCOMP_IOCTL_NOTIF_RECV	SECCOMP_IOWR(0, struct seccomp_notif)
+#define SECCOMP_IOCTL_NOTIF_SEND	SECCOMP_IOWR(1,	\
+						struct seccomp_notif_resp)
+#define SECCOMP_IOCTL_NOTIF_ID_VALID	SECCOMP_IOR(2, __u64)
+
+struct seccomp_notif {
+	__u64 id;
+	__u32 pid;
+	__u32 flags;
+	struct seccomp_data data;
+};
+
+struct seccomp_notif_resp {
+	__u64 id;
+	__s64 val;
+	__s32 error;
+	__u32 flags;
+};
+
+struct seccomp_notif_sizes {
+	__u16 seccomp_notif;
+	__u16 seccomp_notif_resp;
+	__u16 seccomp_data;
 };
 #endif
 
···
 {
 	unsigned int flags[] = { SECCOMP_FILTER_FLAG_TSYNC,
 				 SECCOMP_FILTER_FLAG_LOG,
-				 SECCOMP_FILTER_FLAG_SPEC_ALLOW };
+				 SECCOMP_FILTER_FLAG_SPEC_ALLOW,
+				 SECCOMP_FILTER_FLAG_NEW_LISTENER };
 	unsigned int flag, all_flags;
 	int i;
 	long ret;
···
 
 skip:
 	ASSERT_EQ(0, kill(pid, SIGKILL));
+}
+
+static int user_trap_syscall(int nr, unsigned int flags)
+{
+	struct sock_filter filter[] = {
+		BPF_STMT(BPF_LD+BPF_W+BPF_ABS,
+			offsetof(struct seccomp_data, nr)),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, nr, 0, 1),
+		BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_USER_NOTIF),
+		BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
+	};
+
+	struct sock_fprog prog = {
+		.len = (unsigned short)ARRAY_SIZE(filter),
+		.filter = filter,
+	};
+
+	return seccomp(SECCOMP_SET_MODE_FILTER, flags, &prog);
+}
+
+#define USER_NOTIF_MAGIC 116983961184613L
+TEST(user_notification_basic)
+{
+	pid_t pid;
+	long ret;
+	int status, listener;
+	struct seccomp_notif req = {};
+	struct seccomp_notif_resp resp = {};
+	struct pollfd pollfd;
+
+	struct sock_filter filter[] = {
+		BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
+	};
+	struct sock_fprog prog = {
+		.len = (unsigned short)ARRAY_SIZE(filter),
+		.filter = filter,
+	};
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	/* Check that we get -ENOSYS with no listener attached */
+	if (pid == 0) {
+		if (user_trap_syscall(__NR_getpid, 0) < 0)
+			exit(1);
+		ret = syscall(__NR_getpid);
+		exit(ret >= 0 || errno != ENOSYS);
+	}
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+
+	/* Add some no-op filters just for grins. */
+	EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
+	EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
+	EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
+	EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
+
+	/* Check that the basic notification machinery works */
+	listener = user_trap_syscall(__NR_getpid,
+				     SECCOMP_FILTER_FLAG_NEW_LISTENER);
+	EXPECT_GE(listener, 0);
+
+	/* Installing a second listener in the chain should EBUSY */
+	EXPECT_EQ(user_trap_syscall(__NR_getpid,
+				    SECCOMP_FILTER_FLAG_NEW_LISTENER),
+		  -1);
+	EXPECT_EQ(errno, EBUSY);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		ret = syscall(__NR_getpid);
+		exit(ret != USER_NOTIF_MAGIC);
+	}
+
+	pollfd.fd = listener;
+	pollfd.events = POLLIN | POLLOUT;
+
+	EXPECT_GT(poll(&pollfd, 1, -1), 0);
+	EXPECT_EQ(pollfd.revents, POLLIN);
+
+	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
+
+	pollfd.fd = listener;
+	pollfd.events = POLLIN | POLLOUT;
+
+	EXPECT_GT(poll(&pollfd, 1, -1), 0);
+	EXPECT_EQ(pollfd.revents, POLLOUT);
+
+	EXPECT_EQ(req.data.nr, __NR_getpid);
+
+	resp.id = req.id;
+	resp.error = 0;
+	resp.val = USER_NOTIF_MAGIC;
+
+	/* check that we make sure flags == 0 */
+	resp.flags = 1;
+	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), -1);
+	EXPECT_EQ(errno, EINVAL);
+
+	resp.flags = 0;
+	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0);
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+}
+
+TEST(user_notification_kill_in_middle)
+{
+	pid_t pid;
+	long ret;
+	int listener;
+	struct seccomp_notif req = {};
+	struct seccomp_notif_resp resp = {};
+
+	listener = user_trap_syscall(__NR_getpid,
+				     SECCOMP_FILTER_FLAG_NEW_LISTENER);
+	EXPECT_GE(listener, 0);
+
+	/*
+	 * Check that nothing bad happens when we kill the task in the middle
+	 * of a syscall.
+	 */
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		ret = syscall(__NR_getpid);
+		exit(ret != USER_NOTIF_MAGIC);
+	}
+
+	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
+	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_ID_VALID, &req.id), 0);
+
+	EXPECT_EQ(kill(pid, SIGKILL), 0);
+	EXPECT_EQ(waitpid(pid, NULL, 0), pid);
+
+	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_ID_VALID, &req.id), -1);
+
+	resp.id = req.id;
+	ret = ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp);
+	EXPECT_EQ(ret, -1);
+	EXPECT_EQ(errno, ENOENT);
+}
+
+static int handled = -1;
+
+static void signal_handler(int signal)
+{
+	if (write(handled, "c", 1) != 1)
+		perror("write from signal");
+}
+
+TEST(user_notification_signal)
+{
+	pid_t pid;
+	long ret;
+	int status, listener, sk_pair[2];
+	struct seccomp_notif req = {};
+	struct seccomp_notif_resp resp = {};
+	char c;
+
+	ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
+
+	listener = user_trap_syscall(__NR_gettid,
+				     SECCOMP_FILTER_FLAG_NEW_LISTENER);
+	EXPECT_GE(listener, 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		close(sk_pair[0]);
+		handled = sk_pair[1];
+		if (signal(SIGUSR1, signal_handler) == SIG_ERR) {
+			perror("signal");
+			exit(1);
+		}
+		/*
+		 * ERESTARTSYS behavior is a bit hard to test, because we need
+		 * to rely on a signal that has not yet been handled. Let's at
+		 * least check that the error code gets propagated through, and
+		 * hope that it doesn't break when there is actually a signal :)
+		 */
+		ret = syscall(__NR_gettid);
+		exit(!(ret == -1 && errno == 512));
+	}
+
+	close(sk_pair[1]);
+
+	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
+
+	EXPECT_EQ(kill(pid, SIGUSR1), 0);
+
+	/*
+	 * Make sure the signal really is delivered, which means we're not
+	 * stuck in the user notification code any more and the notification
+	 * should be dead.
+	 */
+	EXPECT_EQ(read(sk_pair[0], &c, 1), 1);
+
+	resp.id = req.id;
+	resp.error = -EPERM;
+	resp.val = 0;
+
+	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), -1);
+	EXPECT_EQ(errno, ENOENT);
+
+	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
+
+	resp.id = req.id;
+	resp.error = -512; /* -ERESTARTSYS */
+	resp.val = 0;
+
+	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0);
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+}
+
+TEST(user_notification_closed_listener)
+{
+	pid_t pid;
+	long ret;
+	int status, listener;
+
+	listener = user_trap_syscall(__NR_getpid,
+				     SECCOMP_FILTER_FLAG_NEW_LISTENER);
+	EXPECT_GE(listener, 0);
+
+	/*
+	 * Check that we get an ENOSYS when the listener is closed.
+	 */
+	pid = fork();
+	ASSERT_GE(pid, 0);
+	if (pid == 0) {
+		close(listener);
+		ret = syscall(__NR_getpid);
+		/* Fail unless the call returned -1 with errno == ENOSYS. */
+		exit(ret != -1 || errno != ENOSYS);
+	}
+
+	close(listener);
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+}
+
+/*
+ * Check that a pid in a child namespace still shows up as valid in ours.
+ */
+TEST(user_notification_child_pid_ns)
+{
+	pid_t pid;
+	int status, listener;
+	struct seccomp_notif req = {};
+	struct seccomp_notif_resp resp = {};
+
+	ASSERT_EQ(unshare(CLONE_NEWPID), 0);
+
+	listener = user_trap_syscall(__NR_getpid, SECCOMP_FILTER_FLAG_NEW_LISTENER);
+	ASSERT_GE(listener, 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0)
+		exit(syscall(__NR_getpid) != USER_NOTIF_MAGIC);
+
+	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
+	EXPECT_EQ(req.pid, pid);
+
+	resp.id = req.id;
+	resp.error = 0;
+	resp.val = USER_NOTIF_MAGIC;
+
+	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0);
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+	close(listener);
+}
+
+/*
+ * Check that a pid in a sibling (i.e. unrelated) namespace shows up as 0, i.e.
+ * invalid.
+ */
+TEST(user_notification_sibling_pid_ns)
+{
+	pid_t pid, pid2;
+	int status, listener;
+	struct seccomp_notif req = {};
+	struct seccomp_notif_resp resp = {};
+
+	listener = user_trap_syscall(__NR_getpid, SECCOMP_FILTER_FLAG_NEW_LISTENER);
+	ASSERT_GE(listener, 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		ASSERT_EQ(unshare(CLONE_NEWPID), 0);
+
+		pid2 = fork();
+		ASSERT_GE(pid2, 0);
+
+		if (pid2 == 0)
+			exit(syscall(__NR_getpid) != USER_NOTIF_MAGIC);
+
+		EXPECT_EQ(waitpid(pid2, &status, 0), pid2);
+		EXPECT_EQ(true, WIFEXITED(status));
+		EXPECT_EQ(0, WEXITSTATUS(status));
+		exit(WEXITSTATUS(status));
+	}
+
+	/* Create the sibling ns, and sibling in it. */
+	EXPECT_EQ(unshare(CLONE_NEWPID), 0);
+	EXPECT_EQ(errno, 0);
+
+	pid2 = fork();
+	EXPECT_GE(pid2, 0);
+
+	if (pid2 == 0) {
+		ASSERT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
+		/*
+		 * The pid should be 0, i.e. the task is in some namespace that
+		 * we can't "see".
+		 */
+		ASSERT_EQ(req.pid, 0);
+
+		resp.id = req.id;
+		resp.error = 0;
+		resp.val = USER_NOTIF_MAGIC;
+
+		ASSERT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0);
+		exit(0);
+	}
+
+	close(listener);
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+
+	EXPECT_EQ(waitpid(pid2, &status, 0), pid2);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+}
+
+TEST(user_notification_fault_recv)
+{
+	pid_t pid;
+	int status, listener;
+	struct seccomp_notif req = {};
+	struct seccomp_notif_resp resp = {};
+
+	listener = user_trap_syscall(__NR_getpid, SECCOMP_FILTER_FLAG_NEW_LISTENER);
+	ASSERT_GE(listener, 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0)
+		exit(syscall(__NR_getpid) != USER_NOTIF_MAGIC);
+
+	/* Do a bad recv() */
+	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, NULL), -1);
+	EXPECT_EQ(errno, EFAULT);
+
+	/* We should still be able to receive this notification, though. */
+	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
+	EXPECT_EQ(req.pid, pid);
+
+	resp.id = req.id;
+	resp.error = 0;
+	resp.val = USER_NOTIF_MAGIC;
+
+	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0);
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+}
+
+TEST(seccomp_get_notif_sizes)
+{
+	struct seccomp_notif_sizes sizes;
+
+	EXPECT_EQ(seccomp(SECCOMP_GET_NOTIF_SIZES, 0, &sizes), 0);
+	EXPECT_EQ(sizes.seccomp_notif, sizeof(struct seccomp_notif));
+	EXPECT_EQ(sizes.seccomp_notif_resp, sizeof(struct seccomp_notif_resp));
 }
 
 /*