Documentation/security/self-protection.txt at v4.12-rc4

# Kernel Self-Protection

Kernel self-protection is the design and implementation of systems and
structures within the Linux kernel to protect against security flaws in
the kernel itself. This covers a wide range of issues, including removing
entire classes of bugs, blocking security flaw exploitation methods,
and actively detecting attack attempts. Not all topics are explored in
this document, but it should serve as a reasonable starting point and
answer any frequently asked questions. (Patches welcome, of course!)

In the worst-case scenario, we assume an unprivileged local attacker
has arbitrary read and write access to the kernel's memory. In many
cases, bugs being exploited will not provide this level of access,
but with systems in place that defend against the worst case we'll
cover the more limited cases as well. A higher bar, and one that should
still be kept in mind, is protecting the kernel against a _privileged_
local attacker, since the root user has access to a vastly increased
attack surface. (Especially when they have the ability to load arbitrary
kernel modules.)

The goals for successful self-protection systems would be that they
are effective, on by default, require no opt-in by developers, have no
performance impact, do not impede kernel debugging, and have tests. It
is uncommon that all these goals can be met, but it is worth explicitly
mentioning them, since these aspects need to be explored, dealt with,
and/or accepted.


## Attack Surface Reduction

The most fundamental defense against security exploits is to reduce the
areas of the kernel that can be used to redirect execution. This ranges
from limiting the exposed APIs available to userspace, to making in-kernel
APIs hard to use incorrectly, to minimizing the areas of writable kernel
memory.

### Strict kernel memory permissions

When all of kernel memory is writable, it becomes trivial for attacks
to redirect execution flow. To reduce the availability of these targets
the kernel needs to protect its memory with a tight set of permissions.

#### Executable code and read-only data must not be writable

Any areas of the kernel with executable memory must not be writable.
While this obviously includes the kernel text itself, we must consider
all additional places too: kernel modules, JIT memory, etc. (There are
temporary exceptions to this rule to support things like instruction
alternatives, breakpoints, kprobes, etc. If these must exist in a
kernel, they are implemented in a way where the memory is temporarily
made writable during the update, and then returned to the original
permissions.)

In support of this are CONFIG_STRICT_KERNEL_RWX and
CONFIG_STRICT_MODULE_RWX, which seek to make sure that code is not
writable, data is not executable, and read-only data is neither writable
nor executable.

Most architectures have these options on by default and not
user-selectable. For some architectures like arm that wish to have these
be selectable, the architecture Kconfig can select
ARCH_OPTIONAL_KERNEL_RWX to enable a Kconfig prompt.
CONFIG_ARCH_OPTIONAL_KERNEL_RWX_DEFAULT determines the default setting
when ARCH_OPTIONAL_KERNEL_RWX is enabled.

#### Function pointers and sensitive variables must not be writable

Vast areas of kernel memory contain function pointers that are looked
up by the kernel and used to continue execution (e.g. descriptor/vector
tables, file/network/etc operation structures, etc). The number of these
variables must be reduced to an absolute minimum.

Many such variables can be made read-only by setting them "const"
so that they live in the .rodata section instead of the .data section
of the kernel, gaining the protection of the kernel's strict memory
permissions as described above.
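
As a minimal userspace sketch of the "const" approach, consider a table of
function pointers declared const at file scope. The names demo_ops,
demo_open, demo_read, and demo_get_ops are hypothetical and are not kernel
APIs; the placement in .rodata is what toolchains normally do for such
objects, not something this snippet enforces.

```c
#include <stddef.h>

/* Hypothetical operations table, loosely modeled on kernel "ops" structures. */
struct demo_ops {
    int (*open)(int id);
    size_t (*read)(char *buf, size_t len);
};

static int demo_open(int id)
{
    return id >= 0 ? 0 : -1;
}

static size_t demo_read(char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++)
        buf[i] = 'x';
    return len;
}

/*
 * A const object with static storage duration is normally emitted into
 * .rodata rather than .data, so its pointers cannot be overwritten once
 * that section is mapped read-only.
 */
static const struct demo_ops demo_file_ops = {
    .open = demo_open,
    .read = demo_read,
};

/* Callers only ever receive a pointer-to-const view of the table. */
const struct demo_ops *demo_get_ops(void)
{
    return &demo_file_ops;
}
```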

Variables that are initialized once at __init time can be marked with
the (new and under development) __ro_after_init attribute.

What remains are variables that are updated rarely (e.g. GDT). These
will need another infrastructure (similar to the temporary exceptions
made to kernel code mentioned above) that allows them to spend the rest
of their lifetime read-only. (For example, when being updated, only the
CPU thread performing the update would be given uninterruptible write
access to the memory.)
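
The "read-only except during an update" idea can be modeled in userspace
with mprotect(). This is only a sketch under stated assumptions: the
rare_* names are hypothetical, and unlike the per-CPU write access
described above, mprotect() opens the write window for every thread in
the process.

```c
#define _DEFAULT_SOURCE
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical rarely-written value that lives alone on its own page. */
static long *rare_value;
static size_t rare_len;

/* Map one page, store the initial value, then drop write permission. */
bool rare_init(long initial)
{
    long page = sysconf(_SC_PAGESIZE);
    void *p;

    if (page <= 0)
        return false;
    rare_len = (size_t)page;
    p = mmap(NULL, rare_len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return false;
    rare_value = p;
    *rare_value = initial;
    return mprotect(rare_value, rare_len, PROT_READ) == 0;
}

/* Open a short write window, update, and restore read-only permissions. */
bool rare_update(long v)
{
    if (mprotect(rare_value, rare_len, PROT_READ | PROT_WRITE) != 0)
        return false;
    *rare_value = v;
    return mprotect(rare_value, rare_len, PROT_READ) == 0;
}

long rare_read(void)
{
    return *rare_value;
}

/* Returns true if a write outside rare_update() faults (tried in a child). */
bool rare_stray_write_faults(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return false;
    if (pid == 0) {
        *(volatile long *)rare_value = 0x41414141; /* expected: SIGSEGV */
        _exit(0);
    }
    if (waitpid(pid, &status, 0) != pid)
        return false;
    return WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV;
}
```

A caller would run rare_init() once, use rare_read() freely, and go
through rare_update() for the occasional change; any other write to the
page is expected to fault.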

#### Segregation of kernel memory from userspace memory

The kernel must never execute userspace memory. The kernel must also never
access userspace memory without explicit expectation to do so. These
rules can be enforced either by support of hardware-based restrictions
(x86's SMEP/SMAP, ARM's PXN/PAN) or via emulation (ARM's Memory Domains).
By blocking userspace memory in this way, execution and data parsing
cannot be passed to trivially-controlled userspace memory, forcing
attacks to operate entirely in kernel memory.

### Reduced access to syscalls

One trivial way to eliminate many syscalls for 64-bit systems is building
without CONFIG_COMPAT. However, this is rarely a feasible scenario.

The "seccomp" system provides an opt-in feature made available to
userspace, which provides a way to reduce the number of kernel entry
points available to a running process. This limits the breadth of kernel
code that can be reached, possibly reducing the availability of a given
bug to an attack.
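
To illustrate the opt-in nature of seccomp, the following sketch has a
child process enter seccomp strict mode and then attempt a syscall outside
the small allowed set. The function name seccomp_strict_demo is
hypothetical, and the example assumes a Linux system with seccomp support
and kernel headers installed; real deployments more commonly use BPF
filters rather than strict mode.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>
#include <linux/seccomp.h>

/*
 * Returns 1 if a child that opted in to seccomp strict mode was killed for
 * making a disallowed syscall, 0 if it was not, and -1 if seccomp could not
 * be enabled or the check itself failed.
 */
int seccomp_strict_demo(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Strict mode permits only read, write, exit, and sigreturn. */
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0)
            _exit(2);
        /* getppid is not on the allowed list: the kernel sends SIGKILL. */
        syscall(SYS_getppid);
        /* Only reached if the restriction did not trigger. */
        syscall(SYS_exit, 0);
    }
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (WIFEXITED(status) && WEXITSTATUS(status) == 2)
        return -1;
    return WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL;
}
```

The restriction is applied in a forked child so that the calling process
keeps its full set of syscalls.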

An area of improvement would be creating viable ways to keep access to
things like compat, user namespaces, BPF creation, and perf limited only
to trusted processes. This would keep the scope of kernel entry points
restricted to the more regular set normally available to unprivileged
userspace.

### Restricting access to kernel modules

The kernel should never allow an unprivileged user the ability to
load specific kernel modules, since that would provide a facility to
unexpectedly extend the available attack surface. (The on-demand loading
of modules via their predefined subsystems, e.g. MODULE_ALIAS_*, is
considered "expected" here, though additional consideration should be
given even to these.) For example, loading a filesystem module via an
unprivileged socket API is nonsense: only the root or physically local
user should trigger filesystem module loading. (And even this can be up
for debate in some scenarios.)

To protect against even privileged users, systems may need to either
disable module loading entirely (e.g. monolithic kernel builds or
modules_disabled sysctl), or provide signed modules (e.g.
CONFIG_MODULE_SIG_FORCE, or dm-crypt with LoadPin), to keep from having
root load arbitrary kernel code via the module loader interface.


## Memory integrity

There are many memory structures in the kernel that are regularly abused
to gain execution control during an attack. By far the most commonly
understood is that of the stack buffer overflow in which the return
address stored on the stack is overwritten. Many other examples of this
kind of attack exist, and protections exist to defend against them.

### Stack buffer overflow

The classic stack buffer overflow involves writing past the expected end
of a variable stored on the stack, ultimately writing a controlled value
to the stack frame's stored return address. The most widely used defense
is the presence of a stack canary between the stack variables and the
return address (CONFIG_CC_STACKPROTECTOR), which is verified just before
the function returns. Other defenses include things like shadow stacks.
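
The canary check can be modeled with an explicit structure standing in
for a stack frame. This is a sketch only: struct model_frame, buggy_copy,
and run_with_canary are hypothetical names, and a real compiler-inserted
canary is placed and verified by generated code rather than by hand.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BUF_LEN 16

/* Hypothetical model of a frame: local buffer, canary, then return address. */
struct model_frame {
    unsigned char buf[BUF_LEN];
    uint64_t canary;
    uint64_t return_addr;
};

/*
 * Deliberately buggy copy: it trusts len instead of BUF_LEN, so long inputs
 * spill past buf into the canary and then the return address. The write is
 * capped at the frame size only to keep this model well-defined.
 */
static void buggy_copy(struct model_frame *f, const unsigned char *src,
                       size_t len)
{
    if (len > sizeof(*f))
        len = sizeof(*f);
    memcpy((unsigned char *)f, src, len);
}

/*
 * Models a protected function: plant the canary on entry, run the body, and
 * verify the canary before "returning". Returns true if the frame is intact.
 */
bool run_with_canary(uint64_t secret, const unsigned char *src, size_t len)
{
    struct model_frame f;

    memset(&f, 0, sizeof(f));
    f.canary = secret;
    f.return_addr = 0x1000;     /* placeholder value */
    buggy_copy(&f, src, len);
    return f.canary == secret;  /* a mismatch means the frame was smashed */
}
```

Because the canary sits between the buffer and the return address, a
linear overflow cannot reach the return address without first changing
the canary.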

### Stack depth overflow

A less well understood attack is using a bug that triggers the
kernel to consume stack memory with deep function calls or large stack
allocations. With this attack it is possible to write beyond the end of
the kernel's preallocated stack space and into sensitive structures. Two
important changes need to be made for better protections: moving the
sensitive thread_info structure elsewhere, and adding a faulting memory
hole at the bottom of the stack to catch these overflows.
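
The "faulting memory hole" can be sketched in userspace as a PROT_NONE
guard page placed below a downward-growing region. The struct
guarded_stack type and the guarded_stack_alloc and write_faults helpers
are hypothetical names used only for this illustration.

```c
#define _DEFAULT_SOURCE
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical downward-growing stack region with a guard page below it. */
struct guarded_stack {
    unsigned char *guard;   /* lowest page, mapped PROT_NONE */
    unsigned char *base;    /* lowest usable byte, just above the guard */
    size_t usable;          /* usable bytes above the guard */
};

bool guarded_stack_alloc(struct guarded_stack *s, size_t pages)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t total;
    unsigned char *p;

    if (page <= 0 || pages == 0)
        return false;
    total = (pages + 1) * (size_t)page;
    p = mmap(NULL, total, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return false;
    /* Any access to the guard page faults instead of corrupting a neighbor. */
    if (mprotect(p, (size_t)page, PROT_NONE) != 0) {
        munmap(p, total);
        return false;
    }
    s->guard = p;
    s->base = p + page;
    s->usable = pages * (size_t)page;
    return true;
}

/* Writes one byte at addr in a child process; returns true if that faulted. */
bool write_faults(unsigned char *addr)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return false;
    if (pid == 0) {
        *(volatile unsigned char *)addr = 0;
        _exit(0);
    }
    if (waitpid(pid, &status, 0) != pid)
        return false;
    return WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV;
}
```

A write one byte below base lands in the guard page and is expected to
fault, whereas without the guard it would silently modify whatever was
mapped there.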

### Heap memory integrity

The structures used to track heap free lists can be sanity-checked during
allocation and freeing to make sure they aren't being used to manipulate
other memory areas.
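
One common form of such a sanity check is verifying that a list node's
neighbors still point back at it before unlinking. The following is a
sketch using a hypothetical struct free_node; it is not the allocator's
actual free list code.

```c
#include <stdbool.h>
#include <stddef.h>

struct free_node {
    struct free_node *next;
    struct free_node *prev;
};

/* Initialize an empty circular list whose head points at itself. */
void free_list_init(struct free_node *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert node right after head, checking head's links first. */
bool free_list_add(struct free_node *head, struct free_node *node)
{
    struct free_node *next = head->next;

    if (next == NULL || next->prev != head)
        return false;       /* corrupted list: refuse to touch it */
    node->next = next;
    node->prev = head;
    next->prev = node;
    head->next = node;
    return true;
}

/*
 * Unlink node only if both neighbors still point back at it. A forged
 * next/prev pair fails this check, so the two pointer writes below cannot
 * be steered at unrelated memory.
 */
bool free_list_del(struct free_node *node)
{
    struct free_node *next = node->next;
    struct free_node *prev = node->prev;

    if (next == NULL || prev == NULL)
        return false;
    if (next->prev != node || prev->next != node)
        return false;
    next->prev = prev;
    prev->next = next;
    node->next = NULL;
    node->prev = NULL;
    return true;
}
```

Without the check, unlinking a node with attacker-controlled next and
prev fields performs two writes to addresses of the attacker's choosing.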

### Counter integrity

Many places in the kernel use atomic counters to track object references
or perform similar lifetime management. When these counters can be made
to wrap (over or under) this traditionally exposes a use-after-free
flaw. By trapping atomic wrapping, this class of bug vanishes.
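
One way to trap wrapping is a counter that saturates instead of
overflowing and refuses to go below zero. The sketch below uses C11
atomics; struct checked_ref and its helpers are hypothetical names, and
the policy of pinning a saturated counter (leaking the object rather than
freeing it) is an assumption of this example.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

#define REF_SATURATED UINT_MAX

struct checked_ref {
    atomic_uint count;
};

void checked_ref_init(struct checked_ref *r, unsigned int n)
{
    atomic_init(&r->count, n);
}

/*
 * Take a reference. Fails on 0 (the object is already being freed) and on
 * the saturated value; incrementing UINT_MAX - 1 pins the counter at
 * REF_SATURATED rather than letting a later increment wrap it to 0.
 */
bool checked_ref_get(struct checked_ref *r)
{
    unsigned int old = atomic_load(&r->count);

    do {
        if (old == 0 || old == REF_SATURATED)
            return false;
    } while (!atomic_compare_exchange_weak(&r->count, &old, old + 1));
    return true;
}

/*
 * Drop a reference. Returns true only when the count reaches 0 and the
 * caller should free the object. Underflow from 0 is refused, and a
 * saturated counter stays pinned.
 */
bool checked_ref_put(struct checked_ref *r)
{
    unsigned int old = atomic_load(&r->count);

    do {
        if (old == 0 || old == REF_SATURATED)
            return false;
    } while (!atomic_compare_exchange_weak(&r->count, &old, old - 1));
    return old == 1;
}

unsigned int checked_ref_read(struct checked_ref *r)
{
    return atomic_load(&r->count);
}
```

With this policy an overflow attempt turns a potential use-after-free
into a memory leak, which is far less useful to an attacker.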

### Size calculation overflow detection

Similar to counter overflow, integer overflows (usually size calculations)
need to be detected at runtime to kill this class of bug, which
traditionally leads to being able to write past the end of kernel buffers.
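
A runtime check for the most common case, a count multiplied by an
element size, can be sketched as follows. The checked_mul_size and
alloc_array names are hypothetical; the division-based test is one
portable way to detect the overflow before it happens.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Store n * size in *out and return true, or return false on overflow. */
bool checked_mul_size(size_t n, size_t size, size_t *out)
{
    if (size != 0 && n > SIZE_MAX / size)
        return false;
    *out = n * size;
    return true;
}

/* Hypothetical array allocator that refuses requests whose size wraps. */
void *alloc_array(size_t n, size_t size)
{
    size_t bytes;

    if (!checked_mul_size(n, size, &bytes))
        return NULL;
    return malloc(bytes);
}
```

Without the check, a wrapped product yields a small allocation that later
code, still believing it holds n elements, writes past.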


## Statistical defenses

While many protections can be considered deterministic (e.g. read-only
memory cannot be written to), some protections provide only statistical
defense, in that an attack must gather enough information about a
running system to overcome the defense. While not perfect, these do
provide meaningful defenses.

### Canaries, blinding, and other secrets

It should be noted that things like the stack canary discussed earlier
are technically statistical defenses, since they rely on a secret value,
and such values may become discoverable through an information exposure
flaw.

Blinding literal values for things like JITs, where the executable
contents may be partially under the control of userspace, needs a similar
secret value.
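
Constant blinding can be sketched as splitting one user-chosen immediate
into two values that only recombine at run time. The struct blinded_imm
type and its helpers are hypothetical, and the key is passed in as a
parameter here; a real JIT would draw a fresh key per constant from a
cryptographically strong RNG.

```c
#include <stdint.h>

/* Hypothetical pair of immediates emitted in place of one user constant. */
struct blinded_imm {
    uint32_t masked;    /* first immediate: imm ^ key */
    uint32_t key;       /* second immediate: the random secret */
};

/* Split imm so the raw user-controlled value never appears in the code. */
struct blinded_imm blind_imm(uint32_t imm, uint32_t key)
{
    struct blinded_imm b = { .masked = imm ^ key, .key = key };

    return b;
}

/* What the emitted instruction pair computes at run time. */
uint32_t unblind_imm(struct blinded_imm b)
{
    return b.masked ^ b.key;
}
```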

It is critical that the secret values used be separate (e.g. a
different canary per stack) and high entropy (e.g. is the RNG actually
working?) in order to maximize their success.

### Kernel Address Space Layout Randomization (KASLR)

Since the location of kernel memory is almost always instrumental in
mounting a successful attack, making the location non-deterministic
raises the difficulty of an exploit. (Note that this in turn makes
the value of information exposures higher, since they may be used to
discover desired memory locations.)

#### Text and module base

By relocating the physical and virtual base address of the kernel at
boot-time (CONFIG_RANDOMIZE_BASE), attacks needing kernel code will be
frustrated. Additionally, offsetting the module loading base address
means that even systems that load the same set of modules in the same
order every boot will not share a common base address with the rest of
the kernel text.
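
The core of base randomization is choosing one aligned slot out of the
positions where the image still fits. The following toy sketch shows only
that arithmetic; pick_random_base and its parameters are hypothetical, the
entropy value is assumed to come from a boot-time random source, and real
implementations must also avoid reserved or occupied memory regions.

```c
#include <stdint.h>

/*
 * Pick a load address in [min, max - image_size] that is a multiple of
 * align, using caller-supplied entropy. Returns 0 if the arguments are
 * invalid (unaligned min, zero align, or an image that does not fit).
 */
uint64_t pick_random_base(uint64_t min, uint64_t max, uint64_t image_size,
                          uint64_t align, uint64_t entropy)
{
    uint64_t slots;

    if (align == 0 || min % align != 0 || max <= min ||
        image_size > max - min)
        return 0;
    /* Number of aligned start addresses at which the image still fits. */
    slots = (max - min - image_size) / align + 1;
    return min + (entropy % slots) * align;
}
```

The number of slots bounds the entropy an attacker has to overcome, which
is why a small range or a large alignment weakens the defense.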

#### Stack base

If the base address of the kernel stack is not the same between processes,
or even not the same between syscalls, targets on or beyond the stack
become more difficult to locate.

#### Dynamic memory base

Much of the kernel's dynamic memory (e.g. kmalloc, vmalloc, etc) ends up
being relatively deterministic in layout due to the order of early-boot
initializations. If the base address of these areas is not the same
between boots, targeting them is frustrated, requiring an information
exposure specific to the region.

#### Structure layout

By performing a per-build randomization of the layout of sensitive
structures, attacks must either be tuned to known kernel builds or expose
enough kernel memory to determine structure layouts before manipulating
them.


## Preventing Information Exposures

Since the locations of sensitive structures are the primary target for
attacks, it is important to defend against exposure of both kernel memory
addresses and kernel memory contents (since they may contain kernel
addresses or other sensitive things like canary values).

### Unique identifiers

Kernel memory addresses must never be used as identifiers exposed to
userspace. Instead, use an atomic counter, an idr, or similar unique
identifier.
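
A minimal sketch of the atomic counter option, using C11 atomics in
userspace: struct session and its helpers are hypothetical names chosen
for illustration.

```c
#include <stdatomic.h>
#include <stdint.h>

struct session {
    uint64_t id;    /* safe to report to userspace */
    /* ... private state whose address must stay hidden ... */
};

static atomic_uint_fast64_t next_session_id = 1;

/* Assign the next counter value instead of deriving an ID from an address. */
void session_init(struct session *s)
{
    s->id = atomic_fetch_add(&next_session_id, 1);
}

/* The value handed out is the counter, never the object's address. */
uint64_t session_public_id(const struct session *s)
{
    return s->id;
}
```

The identifier stays unique for the lifetime of the counter but reveals
nothing about where the object lives in memory.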

### Memory initialization

Memory copied to userspace must always be fully initialized. If not
explicitly memset(), this will require changes to the compiler to make
sure structure holes are cleared.
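
The structure hole problem, and the explicit memset() fix, can be sketched
with a structure that has padding between its members. struct reply and
its helpers are hypothetical, and the example relies on the usual compiler
behavior that member stores leave previously cleared padding untouched.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical reply structure; the compiler pads between tag and value. */
struct reply {
    uint8_t tag;
    uint64_t value;
};

/*
 * Fill a reply that will be copied out wholesale. Clearing the whole
 * object first zeroes the padding as well, so stale bytes are not carried
 * along with the two real fields.
 */
void reply_fill(struct reply *r, uint8_t tag, uint64_t value)
{
    memset(r, 0, sizeof(*r));
    r->tag = tag;
    r->value = value;
}

/* Returns 1 if every padding byte between tag and value is zero. */
int reply_padding_is_clear(const struct reply *r)
{
    const unsigned char *bytes = (const unsigned char *)r;
    size_t i;

    for (i = offsetof(struct reply, tag) + sizeof(r->tag);
         i < offsetof(struct reply, value); i++) {
        if (bytes[i] != 0)
            return 0;
    }
    return 1;
}
```

Assigning only the named members would leave the padding bytes holding
whatever was previously in that memory.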

### Memory poisoning

When releasing memory, it is best to poison the contents (clear stack on
syscall return, wipe heap memory on a free), to avoid reuse attacks that
rely on the old contents of memory. This frustrates many uninitialized
variable attacks, stack content exposures, heap content exposures, and
use-after-free attacks.
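
Poison-on-free can be sketched with a tiny fixed-size object pool. The
pool_* names, the pool dimensions, and the 0x6b poison byte are
assumptions made for this illustration, not the kernel allocator's
interface.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define POOL_OBJS   4
#define OBJ_SIZE    32
#define POISON_BYTE 0x6b    /* illustrative poison pattern */

static unsigned char pool[POOL_OBJS][OBJ_SIZE];
static bool in_use[POOL_OBJS];

/* Hand out the first free slot, or NULL if the pool is exhausted. */
void *pool_alloc(void)
{
    for (size_t i = 0; i < POOL_OBJS; i++) {
        if (!in_use[i]) {
            in_use[i] = true;
            return pool[i];
        }
    }
    return NULL;
}

/* Overwrite the object with the poison pattern before releasing it. */
void pool_free(void *obj)
{
    for (size_t i = 0; i < POOL_OBJS; i++) {
        if (obj == (void *)pool[i] && in_use[i]) {
            memset(pool[i], POISON_BYTE, OBJ_SIZE);
            in_use[i] = false;
            return;
        }
    }
}

/* Returns true if every byte of the slot holds the poison pattern. */
bool pool_slot_is_poisoned(size_t slot)
{
    for (size_t j = 0; j < OBJ_SIZE; j++) {
        if (pool[slot][j] != POISON_BYTE)
            return false;
    }
    return true;
}
```

The next owner of the slot sees only the poison pattern, so neither a
stale read nor an uninitialized use can recover the previous contents.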

### Destination tracking

To help kill classes of bugs that result in kernel addresses being
written to userspace, the destination of writes needs to be tracked. If
the buffer is destined for userspace (e.g. seq_file backed /proc files),
it should automatically censor sensitive values.
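
A sketch of the idea, assuming each output buffer carries a flag recording
where its contents will end up: struct out_buf and out_buf_put_addr are
hypothetical names, and printing zeros is just one possible censoring
policy.

```c
#include <inttypes.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical output buffer tagged with its eventual destination. */
struct out_buf {
    char data[64];
    bool to_userspace;
};

/*
 * Format an address into the buffer. If the buffer is destined for
 * userspace, zeros are written in place of the real address. Returns the
 * number of characters produced, as snprintf() does.
 */
int out_buf_put_addr(struct out_buf *b, const void *addr)
{
    uintptr_t shown = b->to_userspace ? 0 : (uintptr_t)addr;

    return snprintf(b->data, sizeof(b->data), "%016" PRIxPTR, shown);
}
```

Because the decision is made inside the formatting helper, individual
callers cannot forget to censor an address.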