Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

doc: ReSTify seccomp_filter.txt

This updates seccomp_filter.txt for ReST markup, and moves it under the
user-space API index, since it describes how application authors can use
seccomp.

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>

authored by Kees Cook and committed by Jonathan Corbet
c061f33f 5e33994d

+62 -56
+60 -56
Documentation/prctl/seccomp_filter.txt Documentation/userspace-api/seccomp_filter.rst
···
-SECure COMPuting with filters
-=============================
+===========================================
+Seccomp BPF (SECure COMPuting with filters)
+===========================================
 
 Introduction
-------------
+============
 
 A large number of system calls are exposed to every userland process
 with many of them going unused for the entire lifetime of the process.
···
 call arguments directly.
 
 What it isn't
--------------
+=============
 
 System call filtering isn't a sandbox. It provides a clearly defined
 mechanism for minimizing the exposed kernel surface. It is meant to be
···
 construed, incorrectly, as a more complete sandboxing solution.
 
 Usage
------
+=====
 
 An additional seccomp mode is added and is enabled using the same
 prctl(2) call as the strict seccomp. If the architecture has
-CONFIG_HAVE_ARCH_SECCOMP_FILTER, then filters may be added as below:
+``CONFIG_HAVE_ARCH_SECCOMP_FILTER``, then filters may be added as below:
 
-PR_SET_SECCOMP:
+``PR_SET_SECCOMP``:
     Now takes an additional argument which specifies a new filter
     using a BPF program.
     The BPF program will be executed over struct seccomp_data
···
     acceptable values to inform the kernel which action should be
     taken.
 
-    Usage:
+    Usage::
+
         prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, prog);
 
     The 'prog' argument is a pointer to a struct sock_fprog which
     will contain the filter program. If the program is invalid, the
-    call will return -1 and set errno to EINVAL.
+    call will return -1 and set errno to ``EINVAL``.
 
-    If fork/clone and execve are allowed by @prog, any child
+    If ``fork``/``clone`` and ``execve`` are allowed by @prog, any child
     processes will be constrained to the same filters and system
     call ABI as the parent.
 
-    Prior to use, the task must call prctl(PR_SET_NO_NEW_PRIVS, 1) or
-    run with CAP_SYS_ADMIN privileges in its namespace. If these are not
-    true, -EACCES will be returned. This requirement ensures that filter
+    Prior to use, the task must call ``prctl(PR_SET_NO_NEW_PRIVS, 1)`` or
+    run with ``CAP_SYS_ADMIN`` privileges in its namespace. If these are not
+    true, ``-EACCES`` will be returned. This requirement ensures that filter
     programs cannot be applied to child processes with greater privileges
     than the task that installed them.
 
-    Additionally, if prctl(2) is allowed by the attached filter,
+    Additionally, if ``prctl(2)`` is allowed by the attached filter,
     additional filters may be layered on which will increase evaluation
     time, but allow for further decreasing the attack surface during
     execution of a process.
···
 The above call returns 0 on success and non-zero on error.
 
 Return values
--------------
+=============
+
 A seccomp filter may return any of the following values. If multiple
 filters exist, the return value for the evaluation of a given system
 call will always use the highest precedent value. (For example,
-SECCOMP_RET_KILL will always take precedence.)
+``SECCOMP_RET_KILL`` will always take precedence.)
 
 In precedence order, they are:
 
-SECCOMP_RET_KILL:
+``SECCOMP_RET_KILL``:
     Results in the task exiting immediately without executing the
-    system call. The exit status of the task (status & 0x7f) will
-    be SIGSYS, not SIGKILL.
+    system call. The exit status of the task (``status & 0x7f``) will
+    be ``SIGSYS``, not ``SIGKILL``.
 
-SECCOMP_RET_TRAP:
-    Results in the kernel sending a SIGSYS signal to the triggering
-    task without executing the system call. siginfo->si_call_addr
+``SECCOMP_RET_TRAP``:
+    Results in the kernel sending a ``SIGSYS`` signal to the triggering
+    task without executing the system call. ``siginfo->si_call_addr``
     will show the address of the system call instruction, and
-    siginfo->si_syscall and siginfo->si_arch will indicate which
+    ``siginfo->si_syscall`` and ``siginfo->si_arch`` will indicate which
     syscall was attempted. The program counter will be as though
     the syscall happened (i.e. it will not point to the syscall
     instruction). The return value register will contain an arch-
     dependent value -- if resuming execution, set it to something
     sensible. (The architecture dependency is because replacing
-    it with -ENOSYS could overwrite some useful information.)
+    it with ``-ENOSYS`` could overwrite some useful information.)
 
-    The SECCOMP_RET_DATA portion of the return value will be passed
-    as si_errno.
+    The ``SECCOMP_RET_DATA`` portion of the return value will be passed
+    as ``si_errno``.
 
-    SIGSYS triggered by seccomp will have a si_code of SYS_SECCOMP.
+    ``SIGSYS`` triggered by seccomp will have a si_code of ``SYS_SECCOMP``.
 
-SECCOMP_RET_ERRNO:
+``SECCOMP_RET_ERRNO``:
     Results in the lower 16-bits of the return value being passed
     to userland as the errno without executing the system call.
 
-SECCOMP_RET_TRACE:
+``SECCOMP_RET_TRACE``:
     When returned, this value will cause the kernel to attempt to
-    notify a ptrace()-based tracer prior to executing the system
-    call. If there is no tracer present, -ENOSYS is returned to
+    notify a ``ptrace()``-based tracer prior to executing the system
+    call. If there is no tracer present, ``-ENOSYS`` is returned to
     userland and the system call is not executed.
 
-    A tracer will be notified if it requests PTRACE_O_TRACESECCOMP
-    using ptrace(PTRACE_SETOPTIONS). The tracer will be notified
-    of a PTRACE_EVENT_SECCOMP and the SECCOMP_RET_DATA portion of
+    A tracer will be notified if it requests ``PTRACE_O_TRACESECCOMP``
+    using ``ptrace(PTRACE_SETOPTIONS)``. The tracer will be notified
+    of a ``PTRACE_EVENT_SECCOMP`` and the ``SECCOMP_RET_DATA`` portion of
     the BPF program return value will be available to the tracer
-    via PTRACE_GETEVENTMSG.
+    via ``PTRACE_GETEVENTMSG``.
 
     The tracer can skip the system call by changing the syscall number
     to -1. Alternatively, the tracer can change the system call
···
     allow use of ptrace, even of other sandboxed processes, without
     extreme care; ptracers can use this mechanism to escape.)
 
-SECCOMP_RET_ALLOW:
+``SECCOMP_RET_ALLOW``:
     Results in the system call being executed.
 
 If multiple filters exist, the return value for the evaluation of a
 given system call will always use the highest precedent value.
 
-Precedence is only determined using the SECCOMP_RET_ACTION mask. When
+Precedence is only determined using the ``SECCOMP_RET_ACTION`` mask. When
 multiple filters return values of the same precedence, only the
-SECCOMP_RET_DATA from the most recently installed filter will be
+``SECCOMP_RET_DATA`` from the most recently installed filter will be
 returned.
 
 Pitfalls
---------
+========
 
 The biggest pitfall to avoid during use is filtering on system call
 number without checking the architecture value. Why? On any
···
 the filters may be abused. Always check the arch value!
 
 Example
--------
+=======
 
-The samples/seccomp/ directory contains both an x86-specific example
+The ``samples/seccomp/`` directory contains both an x86-specific example
 and a more generic example of a higher level macro interface for BPF
 program generation.
 
 
 
 Adding architecture support
------------------------
+===========================
 
-See arch/Kconfig for the authoritative requirements. In general, if an
+See ``arch/Kconfig`` for the authoritative requirements. In general, if an
 architecture supports both ptrace_event and seccomp, it will be able to
-support seccomp filter with minor fixup: SIGSYS support and seccomp return
-value checking. Then it must just add CONFIG_HAVE_ARCH_SECCOMP_FILTER
+support seccomp filter with minor fixup: ``SIGSYS`` support and seccomp return
+value checking. Then it must just add ``CONFIG_HAVE_ARCH_SECCOMP_FILTER``
 to its arch-specific Kconfig.
 
 
 
 Caveats
--------
+=======
 
 The vDSO can cause some system calls to run entirely in userspace,
 leading to surprises when you run programs on different machines that
 fall back to real syscalls. To minimize these surprises on x86, make
 sure you test with
-/sys/devices/system/clocksource/clocksource0/current_clocksource set to
-something like acpi_pm.
+``/sys/devices/system/clocksource/clocksource0/current_clocksource`` set to
+something like ``acpi_pm``.
 
 On x86-64, vsyscall emulation is enabled by default. (vsyscalls are
-legacy variants on vDSO calls.) Currently, emulated vsyscalls will honor seccomp, with a few oddities:
+legacy variants on vDSO calls.) Currently, emulated vsyscalls will
+honor seccomp, with a few oddities:
 
-- A return value of SECCOMP_RET_TRAP will set a si_call_addr pointing to
+- A return value of ``SECCOMP_RET_TRAP`` will set a ``si_call_addr`` pointing to
   the vsyscall entry for the given call and not the address after the
   'syscall' instruction. Any code which wants to restart the call
   should be aware that (a) a ret instruction has been emulated and (b)
···
   emulation security checks, making resuming the syscall mostly
   pointless.
 
-- A return value of SECCOMP_RET_TRACE will signal the tracer as usual,
+- A return value of ``SECCOMP_RET_TRACE`` will signal the tracer as usual,
   but the syscall may not be changed to another system call using the
   orig_rax register. It may only be changed to -1 in order to skip the
   currently emulated call. Any other change MAY terminate the process.
···
   rip or rsp. (Do not rely on other changes terminating the process.
   They might work. For example, on some kernels, choosing a syscall
   that only exists in future kernels will be correctly emulated (by
-  returning -ENOSYS).
+  returning ``-ENOSYS``).
 
-To detect this quirky behavior, check for addr & ~0x0C00 ==
-0xFFFFFFFFFF600000. (For SECCOMP_RET_TRACE, use rip. For
-SECCOMP_RET_TRAP, use siginfo->si_call_addr.) Do not check any other
+To detect this quirky behavior, check for ``addr & ~0x0C00 ==
+0xFFFFFFFFFF600000``. (For ``SECCOMP_RET_TRACE``, use rip. For
+``SECCOMP_RET_TRAP``, use ``siginfo->si_call_addr``.) Do not check any other
 condition: future kernels may improve vsyscall emulation and current
 kernels in vsyscall=native mode will behave differently, but the
-instructions at 0xF...F600{0,4,8,C}00 will not be system calls in these
+instructions at ``0xF...F600{0,4,8,C}00`` will not be system calls in these
 cases.
 
 Note that modern systems are unlikely to use vsyscalls at all -- they
+1
Documentation/userspace-api/index.rst
···
 .. toctree::
    :maxdepth: 2
 
+   seccomp_filter
    unshare
 
 .. only:: subproject and html
+1
MAINTAINERS
···
 F: include/uapi/linux/seccomp.h
 F: include/linux/seccomp.h
 F: tools/testing/selftests/seccomp/*
+F: Documentation/userspace-api/seccomp_filter.rst
 K: \bsecure_computing
 K: \bTIF_SECCOMP\b
 
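The precedence rule described under "Return values" in the seccomp_filter.rst diff can be modelled in user space roughly as follows. This is a sketch for illustration, not the kernel's implementation. The constants are written out by hand on the assumption that they match include/uapi/linux/seccomp.h as of this commit, where a numerically lower action happens to have higher precedence. The function name ``combine_filter_results`` is hypothetical, and the model assumes ``results[0]`` comes from the most recently installed filter.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed to mirror include/uapi/linux/seccomp.h at the time of this commit. */
#define SECCOMP_RET_KILL   0x00000000U
#define SECCOMP_RET_TRAP   0x00030000U
#define SECCOMP_RET_ERRNO  0x00050000U
#define SECCOMP_RET_TRACE  0x7ff00000U
#define SECCOMP_RET_ALLOW  0x7fff0000U
#define SECCOMP_RET_ACTION 0x7fff0000U	/* mask selecting the action */
#define SECCOMP_RET_DATA   0x0000ffffU	/* mask selecting the 16-bit data */

/*
 * Hypothetical model: results[0] is the value returned by the most
 * recently installed filter, results[n - 1] by the oldest one.
 */
static uint32_t combine_filter_results(const uint32_t *results, size_t n)
{
	uint32_t ret = SECCOMP_RET_ALLOW;

	for (size_t i = 0; i < n; i++) {
		/*
		 * Only the action bits are compared. The strict '<' means a
		 * tie keeps the earlier entry, i.e. the newer filter's data.
		 */
		if ((results[i] & SECCOMP_RET_ACTION) < (ret & SECCOMP_RET_ACTION))
			ret = results[i];
	}
	return ret;
}
```

With two filters that both return ``SECCOMP_RET_ERRNO``, the data bits of the combined value come from the first array entry, which matches the statement that only the ``SECCOMP_RET_DATA`` of the most recently installed filter is returned.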