Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

x86/KASLR: Build identity mappings on demand

Currently KASLR only supports relocation in a small physical range (from
16M to 1G), due to using the initial kernel page table identity mapping.
To support ranges above this, we need to have an identity mapping for the
desired memory range before we can decompress (and later run) the kernel.

32-bit kernels already have the needed identity mapping. This patch adds
identity mappings for the needed memory ranges on 64-bit kernels. This
happens in two possible boot paths:

If loaded via startup_32(), we need to set up the needed identity map.

If loaded from a 64-bit bootloader, the bootloader will have already
set up an identity mapping, and we'll start via the compressed kernel's
startup_64(). In this case, the bootloader's page tables need to be
avoided while selecting the new uncompressed kernel location. If not,
the decompressor could overwrite them during decompression.

To accomplish this, we could walk the pagetable and find every page
that is used, and add them to mem_avoid, but this needs extra code and
will require increasing the size of the mem_avoid array.

Instead, we can create a new set of page tables for our own identity
mapping. The pages for the new page tables will come from the
_pgtable section of the compressed kernel, which means they are
already contained in the mem_avoid array. To do this, we reuse the code
from the uncompressed kernel's identity mapping routines.
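
The idea of handing out page-table pages from a fixed buffer that is already covered by mem_avoid can be sketched in user space. This is only an illustration: `struct pgt_pool`, `pool_init()` and `pool_alloc_page()` are hypothetical names, and a plain byte array stands in for the reserved buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_BYTES 4096UL	/* assumed size of one page-table page */

/* Hypothetical stand-in for the allocation state a decompressor keeps. */
struct pgt_pool {
	unsigned char *buf;	/* start of the reserved buffer */
	size_t size;		/* total bytes available */
	size_t offset;		/* bytes handed out so far */
};

/* Point the pool at a caller-provided buffer and zero it. */
static void pool_init(struct pgt_pool *pool, unsigned char *buf, size_t size)
{
	assert(size % PAGE_BYTES == 0);
	pool->buf = buf;
	pool->size = size;
	pool->offset = 0;
	memset(buf, 0, size);
}

/* Hand out the next page, or NULL once the buffer is exhausted. */
static void *pool_alloc_page(struct pgt_pool *pool)
{
	unsigned char *page;

	if (pool->offset >= pool->size)
		return NULL;

	page = pool->buf + pool->offset;
	pool->offset += PAGE_BYTES;
	return page;
}
```

Because the pool never grows, every page it returns lies inside a range that is known up front, which is what lets the whole buffer be excluded from the randomized location with a single mem_avoid entry.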

The _pgtable will be shared by both the 32-bit and 64-bit paths to reduce
init_size, as now the compressed kernel's _rodata to _end will contribute
to init_size.

To handle the possible mappings, we need to increase the existing page
table buffer size:

When booting via startup_64(), we need to cover the old VO, params,
cmdline and uncompressed kernel. In an extreme case we could have them
all beyond the 512G boundary, which needs (2+2)*4 = 16 pages with 2M
mappings. We also need 2 pages for the first 2M (VGA RAM), and one more
for level4. This gets us to 19 pages total.

When booting via startup_32(), KASLR could move the uncompressed kernel
above 4G, so we need to create extra identity mappings, which should only
need (2+2) pages at most when it is beyond the 512G boundary. So 19
pages is sufficient for this case as well.
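
The page budget above can be written out as a small calculation. This is a sketch of the arithmetic only: the function names are hypothetical, and splitting the (2+2) term into two level-3 pages plus two level-2 pages is one reading of the commit text, not something the patch states.

```c
#include <assert.h>

#define PAGE_BYTES 4096UL	/* size of one page-table page */

/*
 * Worst case for one region mapped with 2M pages: if it straddles a
 * 512G boundary it touches two level-4 entries, which (on this
 * reading) costs 2 level-3 pages plus 2 level-2 pages.
 */
static unsigned long pages_per_region(void)
{
	return 2 + 2;
}

/* Worst-case page count for `regions` regions, optionally plus VGA RAM. */
static unsigned long boot_pgt_pages(unsigned long regions, int map_vga)
{
	unsigned long pages = 1;	/* the level-4 table itself */

	pages += regions * pages_per_region();
	if (map_vga)
		pages += 2;		/* first 2M: one level-3 + one level-2 page */
	return pages;
}

/* The figure quoted in the commit message, checked at compile time. */
static_assert(1 + 4 * (2 + 2) + 2 == 19, "worst case should be 19 pages");
```

With four regions (old VO, params, cmdline, uncompressed kernel) this gives 19 pages when the first 2M is mapped and 17 when it is not, matching the two BOOT_PGT_SIZE values in the diff.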

The resulting BOOT_*PGT_SIZE defines use the "_SIZE" suffix on their
names to maintain logical consistency with the existing BOOT_HEAP_SIZE
and BOOT_STACK_SIZE defines.

This patch is based on earlier patches from Yinghai Lu and Baoquan He.

Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: kernel-hardening@lists.openwall.com
Cc: lasse.collin@tukaani.org
Link: http://lkml.kernel.org/r/1462572095-11754-4-git-send-email-keescook@chromium.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

Authored by Kees Cook and committed by Ingo Molnar
3a94707d cf4fb15b

+187 -2
+3
arch/x86/boot/compressed/Makefile
···
 vmlinux-objs-$(CONFIG_EARLY_PRINTK) += $(obj)/early_serial_console.o
 vmlinux-objs-$(CONFIG_RANDOMIZE_BASE) += $(obj)/kaslr.o
+ifdef CONFIG_X86_64
+	vmlinux-objs-$(CONFIG_RANDOMIZE_BASE) += $(obj)/pagetable.o
+endif

 $(obj)/eboot.o: KBUILD_CFLAGS += -fshort-wchar -mno-red-zone
+2 -2
arch/x86/boot/compressed/head_64.S
···
 	/* Initialize Page tables to 0 */
 	leal	pgtable(%ebx), %edi
 	xorl	%eax, %eax
-	movl	$((4096*6)/4), %ecx
+	movl	$(BOOT_INIT_PGT_SIZE/4), %ecx
 	rep	stosl

 	/* Build Level 4 */
···
 	.section ".pgtable","a",@nobits
 	.balign 4096
 pgtable:
-	.fill 6*4096, 1, 0
+	.fill BOOT_PGT_SIZE, 1, 0
+17
arch/x86/boot/compressed/kaslr.c
···
 	 */
 	mem_avoid[MEM_AVOID_ZO_RANGE].start = input;
 	mem_avoid[MEM_AVOID_ZO_RANGE].size = (output + init_size) - input;
+	add_identity_map(mem_avoid[MEM_AVOID_ZO_RANGE].start,
+			 mem_avoid[MEM_AVOID_ZO_RANGE].size);

 	/* Avoid initrd. */
 	initrd_start = (u64)boot_params->ext_ramdisk_image << 32;
···
 	initrd_size |= boot_params->hdr.ramdisk_size;
 	mem_avoid[MEM_AVOID_INITRD].start = initrd_start;
 	mem_avoid[MEM_AVOID_INITRD].size = initrd_size;
+	/* No need to set mapping for initrd, it will be handled in VO. */

 	/* Avoid kernel command line. */
 	cmd_line = (u64)boot_params->ext_cmd_line_ptr << 32;
···
 		;
 	mem_avoid[MEM_AVOID_CMDLINE].start = cmd_line;
 	mem_avoid[MEM_AVOID_CMDLINE].size = cmd_line_size;
+	add_identity_map(mem_avoid[MEM_AVOID_CMDLINE].start,
+			 mem_avoid[MEM_AVOID_CMDLINE].size);

 	/* Avoid boot parameters. */
 	mem_avoid[MEM_AVOID_BOOTPARAMS].start = (unsigned long)boot_params;
 	mem_avoid[MEM_AVOID_BOOTPARAMS].size = sizeof(*boot_params);
+	add_identity_map(mem_avoid[MEM_AVOID_BOOTPARAMS].start,
+			 mem_avoid[MEM_AVOID_BOOTPARAMS].size);
+
+	/* We don't need to set a mapping for setup_data. */
+
+#ifdef CONFIG_X86_VERBOSE_BOOTUP
+	/* Make sure video RAM can be used. */
+	add_identity_map(0, PMD_SIZE);
+#endif
 }

 /* Does this memory vector overlap a known avoided area? */
···
 		goto out;

 	choice = random_addr;
+
+	add_identity_map(choice, output_size);
+	finalize_identity_maps();
 out:
 	return (unsigned char *)choice;
 }
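
The ranges recorded in mem_avoid are what a candidate load address is checked against. A minimal user-space sketch of that kind of overlap test follows; `struct range`, `ranges_overlap()` and `overlaps_avoided()` are hypothetical names chosen for illustration and are not taken from kaslr.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical range type: [start, start + size). */
struct range {
	uint64_t start;
	uint64_t size;
};

/* True if the two half-open ranges intersect. */
static bool ranges_overlap(const struct range *a, const struct range *b)
{
	if (a->start + a->size <= b->start)
		return false;	/* a ends before b begins */
	if (a->start >= b->start + b->size)
		return false;	/* a begins after b ends */
	return true;
}

/* True if `candidate` intersects any of the `count` avoided ranges. */
static bool overlaps_avoided(const struct range *candidate,
			     const struct range *avoid, size_t count)
{
	assert(candidate != NULL && avoid != NULL);

	for (size_t i = 0; i < count; i++) {
		if (ranges_overlap(candidate, &avoid[i]))
			return true;
	}
	return false;
}
```

A candidate that passes this test cannot land on the compressed image, the command line, or the boot parameters, and since the page-table buffer lives inside the compressed image it is protected by the same check.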
+11
arch/x86/boot/compressed/misc.h
···
 }
 #endif

+#ifdef CONFIG_X86_64
+void add_identity_map(unsigned long start, unsigned long size);
+void finalize_identity_maps(void);
+extern unsigned char _pgtable[];
+#else
+static inline void add_identity_map(unsigned long start, unsigned long size)
+{ }
+static inline void finalize_identity_maps(void)
+{ }
+#endif
+
 #ifdef CONFIG_EARLY_PRINTK
 /* early_serial_console.c */
 extern int early_serial_base;
+135
arch/x86/boot/compressed/pagetable.c
···
+/*
+ * This code is used on x86_64 to create page table identity mappings on
+ * demand by building up a new set of page tables (or appending to the
+ * existing ones), and then switching over to them when ready.
+ */
+
+/*
+ * Since we're dealing with identity mappings, physical and virtual
+ * addresses are the same, so override these defines which are ultimately
+ * used by the headers in misc.h.
+ */
+#define __pa(x)  ((unsigned long)(x))
+#define __va(x)  ((void *)((unsigned long)(x)))
+
+#include "misc.h"
+
+/* These actually do the work of building the kernel identity maps. */
+#include <asm/init.h>
+#include <asm/pgtable.h>
+#include "../../mm/ident_map.c"
+
+/* Used by pgtable.h asm code to force instruction serialization. */
+unsigned long __force_order;
+
+/* Used to track our page table allocation area. */
+struct alloc_pgt_data {
+	unsigned char *pgt_buf;
+	unsigned long pgt_buf_size;
+	unsigned long pgt_buf_offset;
+};
+
+/*
+ * Allocates space for a page table entry, using struct alloc_pgt_data
+ * above. Besides the local callers, this is used as the allocation
+ * callback in mapping_info below.
+ */
+static void *alloc_pgt_page(void *context)
+{
+	struct alloc_pgt_data *pages = (struct alloc_pgt_data *)context;
+	unsigned char *entry;
+
+	/* Validate there is space available for a new page. */
+	if (pages->pgt_buf_offset >= pages->pgt_buf_size) {
+		debug_putstr("out of pgt_buf in " __FILE__ "!?\n");
+		debug_putaddr(pages->pgt_buf_offset);
+		debug_putaddr(pages->pgt_buf_size);
+		return NULL;
+	}
+
+	entry = pages->pgt_buf + pages->pgt_buf_offset;
+	pages->pgt_buf_offset += PAGE_SIZE;
+
+	return entry;
+}
+
+/* Used to track our allocated page tables. */
+static struct alloc_pgt_data pgt_data;
+
+/* The top level page table entry pointer. */
+static unsigned long level4p;
+
+/* Locates and clears a region for a new top level page table. */
+static void prepare_level4(void)
+{
+	/*
+	 * It should be impossible for this not to already be true,
+	 * but since calling this a second time would rewind the other
+	 * counters, let's just make sure this is reset too.
+	 */
+	pgt_data.pgt_buf_offset = 0;
+
+	/*
+	 * If we came here via startup_32(), cr3 will be _pgtable already
+	 * and we must append to the existing area instead of entirely
+	 * overwriting it.
+	 */
+	level4p = read_cr3();
+	if (level4p == (unsigned long)_pgtable) {
+		debug_putstr("booted via startup_32()\n");
+		pgt_data.pgt_buf = _pgtable + BOOT_INIT_PGT_SIZE;
+		pgt_data.pgt_buf_size = BOOT_PGT_SIZE - BOOT_INIT_PGT_SIZE;
+		memset(pgt_data.pgt_buf, 0, pgt_data.pgt_buf_size);
+	} else {
+		debug_putstr("booted via startup_64()\n");
+		pgt_data.pgt_buf = _pgtable;
+		pgt_data.pgt_buf_size = BOOT_PGT_SIZE;
+		memset(pgt_data.pgt_buf, 0, pgt_data.pgt_buf_size);
+		level4p = (unsigned long)alloc_pgt_page(&pgt_data);
+	}
+}
+
+/*
+ * Mapping information structure passed to kernel_ident_mapping_init().
+ * Since this never changes, there's no reason to repeatedly fill it
+ * in on the stack when calling add_identity_map().
+ */
+static struct x86_mapping_info mapping_info = {
+	.alloc_pgt_page	= alloc_pgt_page,
+	.context	= &pgt_data,
+	.pmd_flag	= __PAGE_KERNEL_LARGE_EXEC,
+};
+
+/*
+ * Adds the specified range to what will become the new identity mappings.
+ * Once all ranges have been added, the new mapping is activated by calling
+ * finalize_identity_maps() below.
+ */
+void add_identity_map(unsigned long start, unsigned long size)
+{
+	unsigned long end = start + size;
+
+	/* Make sure we have a top level page table ready to use. */
+	if (!level4p)
+		prepare_level4();
+
+	/* Align boundary to 2M. */
+	start = round_down(start, PMD_SIZE);
+	end = round_up(end, PMD_SIZE);
+	if (start >= end)
+		return;
+
+	/* Build the mapping. */
+	kernel_ident_mapping_init(&mapping_info, (pgd_t *)level4p,
+				  start, end);
+}
+
+/*
+ * This switches the page tables to the new level4 that has been built
+ * via calls to add_identity_map() above. If booted via startup_32(),
+ * this is effectively a no-op.
+ */
+void finalize_identity_maps(void)
+{
+	write_cr3(level4p);
+}
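
The 2M rounding done in add_identity_map() means a request for even a few bytes maps at least one full 2M page, and a request that straddles a 2M boundary maps two. The user-space sketch below shows that rounding in isolation; `PMD_BYTES`, `struct span`, `mapped_span()` and `mappings_needed()` are hypothetical names, with `PMD_BYTES` assumed to mirror the 2M value of PMD_SIZE.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to mirror PMD_SIZE: the size of one 2M mapping. */
#define PMD_BYTES (UINT64_C(2) * 1024 * 1024)

/* Half-open address range [start, end). */
struct span {
	uint64_t start;
	uint64_t end;
};

/* Round x down to a multiple of a, where a is a power of two. */
static uint64_t align_down(uint64_t x, uint64_t a)
{
	assert(a != 0 && (a & (a - 1)) == 0);
	return x & ~(a - 1);
}

/* Round x up to a multiple of a, where a is a power of two. */
static uint64_t align_up(uint64_t x, uint64_t a)
{
	return align_down(x + a - 1, a);
}

/* The 2M-aligned range that ends up mapped for a (start, size) request. */
static struct span mapped_span(uint64_t start, uint64_t size)
{
	struct span s;

	s.start = align_down(start, PMD_BYTES);
	s.end = align_up(start + size, PMD_BYTES);
	return s;
}

/* Number of 2M mappings the request expands to (0 for an empty range). */
static uint64_t mappings_needed(uint64_t start, uint64_t size)
{
	struct span s = mapped_span(start, size);

	return (s.end - s.start) / PMD_BYTES;
}
```

For example, a 0x2000-byte request starting at 0x1ff000 crosses the 2M boundary at 0x200000 and therefore expands to the range 0 to 0x400000, while an empty request on an aligned address expands to nothing, which corresponds to the early return when start >= end.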
+19
arch/x86/include/asm/boot.h
···
 #ifdef CONFIG_X86_64
 # define BOOT_STACK_SIZE 0x4000
+
+# define BOOT_INIT_PGT_SIZE (6*4096)
+# ifdef CONFIG_RANDOMIZE_BASE
+/*
+ * Assuming all cross the 512GB boundary:
+ * 1 page for level4
+ * (2+2)*4 pages for kernel, param, cmd_line, and randomized kernel
+ * 2 pages for first 2M (video RAM: CONFIG_X86_VERBOSE_BOOTUP).
+ * Total is 19 pages.
+ */
+# ifdef CONFIG_X86_VERBOSE_BOOTUP
+# define BOOT_PGT_SIZE (19*4096)
+# else /* !CONFIG_X86_VERBOSE_BOOTUP */
+# define BOOT_PGT_SIZE (17*4096)
+# endif
+# else /* !CONFIG_RANDOMIZE_BASE */
+# define BOOT_PGT_SIZE BOOT_INIT_PGT_SIZE
+# endif
+
 #else /* !CONFIG_X86_64 */
 # define BOOT_STACK_SIZE 0x1000
 #endif