Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
at v4.12-rc4
# Kernel Self-Protection

Kernel self-protection is the design and implementation of systems and
structures within the Linux kernel to protect against security flaws in
the kernel itself. This covers a wide range of issues, including removing
entire classes of bugs, blocking security flaw exploitation methods,
and actively detecting attack attempts. Not all topics are explored in
this document, but it should serve as a reasonable starting point and
answer any frequently asked questions. (Patches welcome, of course!)

In the worst-case scenario, we assume an unprivileged local attacker
has arbitrary read and write access to the kernel's memory. In many
cases, bugs being exploited will not provide this level of access,
but with systems in place that defend against the worst case we'll
cover the more limited cases as well. A higher bar, and one that should
still be kept in mind, is protecting the kernel against a _privileged_
local attacker, since the root user has access to a vastly increased
attack surface. (Especially when they have the ability to load arbitrary
kernel modules.)

The goals for successful self-protection systems would be that they
are effective, on by default, require no opt-in by developers, have no
performance impact, do not impede kernel debugging, and have tests. It
is uncommon that all these goals can be met, but it is worth explicitly
mentioning them, since these aspects need to be explored, dealt with,
and/or accepted.

## Attack Surface Reduction

The most fundamental defense against security exploits is to reduce the
areas of the kernel that can be used to redirect execution. This ranges
from limiting the exposed APIs available to userspace, making in-kernel
APIs hard to use incorrectly, minimizing the areas of writable kernel
memory, etc.

### Strict kernel memory permissions

When all of kernel memory is writable, it becomes trivial for attacks
to redirect execution flow. To reduce the availability of these targets
the kernel needs to protect its memory with a tight set of permissions.

#### Executable code and read-only data must not be writable

Any areas of the kernel with executable memory must not be writable.
While this obviously includes the kernel text itself, we must consider
all additional places too: kernel modules, JIT memory, etc. (There are
temporary exceptions to this rule to support things like instruction
alternatives, breakpoints, kprobes, etc. If these must exist in a
kernel, they are implemented in a way where the memory is temporarily
made writable during the update, and then returned to the original
permissions.)

In support of this are CONFIG_STRICT_KERNEL_RWX and
CONFIG_STRICT_MODULE_RWX, which seek to make sure that code is not
writable, data is not executable, and read-only data is neither writable
nor executable.

Most architectures have these options on by default and not user selectable.
For some architectures like arm that wish to have these be selectable,
the architecture Kconfig can select ARCH_OPTIONAL_KERNEL_RWX to enable
a Kconfig prompt. CONFIG_ARCH_OPTIONAL_KERNEL_RWX_DEFAULT determines
the default setting when ARCH_OPTIONAL_KERNEL_RWX is enabled.

#### Function pointers and sensitive variables must not be writable

Vast areas of kernel memory contain function pointers that are looked
up by the kernel and used to continue execution (e.g. descriptor/vector
tables, file/network/etc operation structures, etc). The number of these
variables must be reduced to an absolute minimum.
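
As a rough illustration of the kind of operation structure meant here, the following userspace sketch defines a table of function pointers and declares its single instance const, so the toolchain is free to place it in read-only data. The names demo_ops, demo_file_ops and demo_dispatch_read are hypothetical and are not kernel APIs; only the pattern is the point.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operations table, loosely shaped like the kernel's
 * file/network operation structures. All names are illustrative. */
struct demo_ops {
    int (*open)(int id);
    int (*read)(int id, char *buf, size_t len);
};

static int demo_open(int id)
{
    return id >= 0 ? 0 : -1;
}

static int demo_read(int id, char *buf, size_t len)
{
    (void)id;
    if (len == 0)
        return 0;
    buf[0] = 'x';   /* pretend one byte of data was read */
    return 1;
}

/* Declaring the instance const allows it to be placed in .rodata rather
 * than .data, so the function pointers are not sitting in writable memory
 * once read-only mappings are enforced. */
static const struct demo_ops demo_file_ops = {
    .open = demo_open,
    .read = demo_read,
};

/* Callers only receive a pointer-to-const, so an assignment such as
 * "ops->read = something_else" is rejected at compile time. */
int demo_dispatch_read(const struct demo_ops *ops, int id, char *buf, size_t len)
{
    assert(ops != NULL && ops->read != NULL);
    return ops->read(id, buf, len);
}
```

A caller would pass `&demo_file_ops` to `demo_dispatch_read()`. The table itself is never written after it is built.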

Many such variables can be made read-only by setting them "const"
so that they live in the .rodata section instead of the .data section
of the kernel, gaining the protection of the kernel's strict memory
permissions as described above.

For variables that are initialized once at __init time, these can
be marked with the (new and under development) __ro_after_init
attribute.

What remains are variables that are updated rarely (e.g. GDT). These
will need another infrastructure (similar to the temporary exceptions
made to kernel code mentioned above) that allows them to spend the rest
of their lifetime read-only. (For example, when being updated, only the
CPU thread performing the update would be given uninterruptible write
access to the memory.)

#### Segregation of kernel memory from userspace memory

The kernel must never execute userspace memory. The kernel must also never
access userspace memory without explicit expectation to do so. These
rules can be enforced either by support of hardware-based restrictions
(x86's SMEP/SMAP, ARM's PXN/PAN) or via emulation (ARM's Memory Domains).
By blocking userspace memory in this way, execution and data parsing
cannot be passed to trivially-controlled userspace memory, forcing
attacks to operate entirely in kernel memory.

### Reduced access to syscalls

One trivial way to eliminate many syscalls for 64-bit systems is building
without CONFIG_COMPAT. However, this is rarely a feasible scenario.

The "seccomp" system provides an opt-in feature made available to
userspace, which provides a way to reduce the number of kernel entry
points available to a running process. This limits the breadth of kernel
code that can be reached, possibly reducing the availability of a given
bug to an attack.
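
To make the seccomp opt-in concrete, here is a minimal userspace sketch, assuming a Linux system with seccomp filter support. It installs a small BPF filter in a forked child that makes getppid(2) fail with EPERM and allows every other syscall. The helper names deny_getppid and getppid_is_blocked_in_child are made up for this example, and getppid was chosen only because it is harmless to block. A real filter would normally also validate the architecture field of struct seccomp_data, which is skipped here for brevity.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Install a filter for the calling process: getppid fails with EPERM,
 * everything else is allowed. Returns 0 on success, -1 on failure.
 * Simplification: seccomp_data.arch is not checked. */
static int deny_getppid(void)
{
    struct sock_filter insns[] = {
        /* Load the syscall number. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* If it is getppid, fall through to the ERRNO return;
         * otherwise skip one instruction to the ALLOW return. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_getppid, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    static_assert(sizeof(insns) / sizeof(insns[0]) <= USHRT_MAX,
                  "filter length must fit in sock_fprog.len");
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(insns) / sizeof(insns[0])),
        .filter = insns,
    };

    /* Needed so an unprivileged process may install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

/* Apply the filter in a child so the caller stays unrestricted.
 * Returns 1 if the child saw getppid fail with EPERM, 0 if the call
 * still succeeded, and -1 if the filter could not be set up. */
int getppid_is_blocked_in_child(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (deny_getppid() != 0)
            _exit(2);
        errno = 0;
        long r = syscall(SYS_getppid);
        _exit((r == -1 && errno == EPERM) ? 0 : 1);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    switch (WEXITSTATUS(status)) {
    case 0:
        return 1;
    case 1:
        return 0;
    default:
        return -1;
    }
}
```

Because a filter cannot be removed once installed, the sketch confines it to a short-lived child process. A sandboxed service would typically install its filter once, early in startup, after it has opened whatever resources it needs.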

An area of improvement would be creating viable ways to keep access to
things like compat, user namespaces, BPF creation, and perf limited only
to trusted processes. This would keep the scope of kernel entry points
restricted to the more regular set of entry points normally available to
unprivileged userspace.

### Restricting access to kernel modules

The kernel should never allow an unprivileged user the ability to
load specific kernel modules, since that would provide a facility to
unexpectedly extend the available attack surface. (The on-demand loading
of modules via their predefined subsystems, e.g. MODULE_ALIAS_*, is
considered "expected" here, though additional consideration should be
given even to these.) For example, loading a filesystem module via an
unprivileged socket API is nonsense: only the root or physically local
user should trigger filesystem module loading. (And even this can be up
for debate in some scenarios.)

To protect against even privileged users, systems may need to either
disable module loading entirely (e.g. monolithic kernel builds or
modules_disabled sysctl), or provide signed modules (e.g.
CONFIG_MODULE_SIG_FORCE, or dm-crypt with LoadPin), to keep from having
root load arbitrary kernel code via the module loader interface.

## Memory integrity

There are many memory structures in the kernel that are regularly abused
to gain execution control during an attack. By far the most commonly
understood is that of the stack buffer overflow in which the return
address stored on the stack is overwritten. Many other examples of this
kind of attack exist, and protections exist to defend against them.

### Stack buffer overflow

The classic stack buffer overflow involves writing past the expected end
of a variable stored on the stack, ultimately writing a controlled value
to the stack frame's stored return address.
The most widely used defense
is the presence of a stack canary between the stack variables and the
return address (CONFIG_CC_STACKPROTECTOR), which is verified just before
the function returns. Other defenses include things like shadow stacks.

### Stack depth overflow

A less well understood attack is using a bug that triggers the
kernel to consume stack memory with deep function calls or large stack
allocations. With this attack it is possible to write beyond the end of
the kernel's preallocated stack space and into sensitive structures. Two
important changes need to be made for better protections: moving the
sensitive thread_info structure elsewhere, and adding a faulting memory
hole at the bottom of the stack to catch these overflows.

### Heap memory integrity

The structures used to track heap free lists can be sanity-checked during
allocation and freeing to make sure they aren't being used to manipulate
other memory areas.

### Counter integrity

Many places in the kernel use atomic counters to track object references
or perform similar lifetime management. When these counters can be made
to wrap (over or under) this traditionally exposes a use-after-free
flaw. By trapping atomic wrapping, this class of bug vanishes.

### Size calculation overflow detection

Similar to counter overflow, integer overflows (usually size calculations)
need to be detected at runtime to kill this class of bug, which
traditionally leads to being able to write past the end of kernel buffers.

## Statistical defenses

While many protections can be considered deterministic (e.g. read-only
memory cannot be written to), some protections provide only statistical
defense, in that an attack must gather enough information about a
running system to overcome the defense. While not perfect, these do
provide meaningful defenses.

### Canaries, blinding, and other secrets

It should be noted that things like the stack canary discussed earlier
are technically statistical defenses, since they rely on a secret value,
and such values may become discoverable through an information exposure
flaw.

Blinding literal values for things like JITs, where the executable
contents may be partially under the control of userspace, needs a similar
secret value.

It is critical that the secret values used must be separate (e.g.
different canary per stack) and high entropy (e.g. is the RNG actually
working?) in order to maximize their success.

### Kernel Address Space Layout Randomization (KASLR)

Since the location of kernel memory is almost always instrumental in
mounting a successful attack, making the location non-deterministic
raises the difficulty of an exploit. (Note that this in turn makes
the value of information exposures higher, since they may be used to
discover desired memory locations.)

#### Text and module base

By relocating the physical and virtual base address of the kernel at
boot-time (CONFIG_RANDOMIZE_BASE), attacks needing kernel code will be
frustrated. Additionally, offsetting the module loading base address
means that even systems that load the same set of modules in the same
order every boot will not share a common base address with the rest of
the kernel text.

#### Stack base

If the base address of the kernel stack is not the same between processes,
or even not the same between syscalls, targets on or beyond the stack
become more difficult to locate.

#### Dynamic memory base

Much of the kernel's dynamic memory (e.g. kmalloc, vmalloc, etc) ends up
being relatively deterministic in layout due to the order of early-boot
initializations.
If the base address of these areas is not the same
between boots, targeting them is frustrated, requiring an information
exposure specific to the region.

#### Structure layout

By performing a per-build randomization of the layout of sensitive
structures, attacks must either be tuned to known kernel builds or expose
enough kernel memory to determine structure layouts before manipulating
them.

## Preventing Information Exposures

Since the locations of sensitive structures are the primary target for
attacks, it is important to defend against exposure of both kernel memory
addresses and kernel memory contents (since they may contain kernel
addresses or other sensitive things like canary values).

### Unique identifiers

Kernel memory addresses must never be used as identifiers exposed to
userspace. Instead, use an atomic counter, an idr, or similar unique
identifier.

### Memory initialization

Memory copied to userspace must always be fully initialized. If not
explicitly memset(), this will require changes to the compiler to make
sure structure holes are cleared.

### Memory poisoning

When releasing memory, it is best to poison the contents (clear stack on
syscall return, wipe heap memory on a free), to avoid reuse attacks that
rely on the old contents of memory. This frustrates many uninitialized
variable attacks, stack content exposures, heap content exposures, and
use-after-free attacks.

### Destination tracking

To help kill classes of bugs that result in kernel addresses being
written to userspace, the destination of writes needs to be tracked. If
the buffer is destined for userspace (e.g. seq_file backed /proc files),
it should automatically censor sensitive values.
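
The censoring idea can be sketched in userspace as a formatter that knows whether its output is bound for an untrusted reader. This is only an illustration, loosely modeled on how the %pK format specifier behaves under the kptr_restrict sysctl. The function demo_format_addr and its reader_trusted flag are hypothetical names, not kernel interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: format an address into a buffer that will be
 * handed to a reader. Untrusted readers get an all-zero value of the
 * same width, so the output layout stays stable while the real address
 * is withheld. Returns the number of characters snprintf() would write. */
int demo_format_addr(char *buf, size_t len, const void *ptr, bool reader_trusted)
{
    assert(buf != NULL && len > 0);

    uintptr_t shown = reader_trusted ? (uintptr_t)ptr : 0;

    return snprintf(buf, len, "%016llx", (unsigned long long)shown);
}
```

Putting the decision inside the formatter, instead of at each call site, means a caller cannot forget to censor: any code path that prints an address through the helper inherits the policy.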