
x86/exceptions: Split debug IST stack

The debug IST stack is actually two separate debug stacks to handle #DB
recursion. This is required because the CPU always starts at the top of the
stack on exception entry, which means that on #DB recursion the second #DB
would overwrite the stack of the first.

The low level entry code therefore adjusts the top of stack on entry so a
secondary #DB starts from a different stack page. But the stack pages are
adjacent without a guard page between them.

Split the debug stack into 3 stacks which are separated by guard pages. The
3rd stack is never mapped into the cpu_entry_area and is only there to
catch triple #DB nesting:

--- top of DB_stack <- Initial stack
--- end of DB_stack
guard page

--- top of DB1_stack <- Top of stack after entering first #DB
--- end of DB1_stack
guard page

--- top of DB2_stack <- Top of stack after entering second #DB
--- end of DB2_stack
guard page

If DB2 did not act as the final guard hole, a second nested #DB would point
the top of the #DB stack to the stack below #DB1, which would be valid and
would not catch the undesired triple nesting.

The backing store does not allocate any memory for DB2 and its guard page
as it is not going to be mapped into the cpu_entry_area.

- Adjust the low level entry code so it adjusts the top of the #DB stack by
  the offset between the stacks instead of the exception stack size.

- Make the dumpstack code aware of the new stacks.

- Adjust the is_debug_stack() implementation and move it into the NMI code
  where it belongs. As this is NMI hotpath code, it just checks the full
  area between the top of DB_stack and the bottom of DB1_stack, without
  checking for the guard page. That's correct because the NMI cannot hit a
  stack pointer pointing into the guard page between the DB and DB1 stacks.
  Even if it did, the NMI operation would still be unaffected, but the
  resume of the debug exception on the topmost DB stack would crash by
  touching the guard page.

[ bp: Make exception_stack_names static const char * const ]

Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: "Chang S. Bae" <chang.seok.bae@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: linux-doc@vger.kernel.org
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20190414160145.439944544@linutronix.de

Authored by Thomas Gleixner; committed by Borislav Petkov
2a594d4c 1bdb67e5

+52 -31
+6 -1
Documentation/x86/kernel-stacks
···
 middle of switching stacks. Using IST for NMI events avoids making
 assumptions about the previous state of the kernel stack.

-* ESTACK_DB. DEBUG_STKSZ
+* ESTACK_DB. EXCEPTION_STKSZ (PAGE_SIZE).

 Used for hardware debug interrupts (interrupt 1) and for software
 debug interrupts (INT3).
···
 software) can occur at any time. Using IST for these interrupts
 avoids making assumptions about the previous state of the kernel
 stack.
+
+ To handle nested #DB correctly there exist two instances of DB stacks. On
+ #DB entry the IST stackpointer for #DB is switched to the second instance
+ so a nested #DB starts from a clean stack. The nested #DB switches
+ the IST stackpointer to a guard hole to catch triple nesting.

 * ESTACK_MCE. EXCEPTION_STKSZ (PAGE_SIZE).
+4 -4
arch/x86/entry/entry_64.S
···
  * @paranoid == 2 is special: the stub will never switch stacks. This is for
  * #DF: if the thread stack is somehow unusable, we'll still get a useful OOPS.
  */
-.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
+.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 ist_offset=0
 ENTRY(\sym)
 	UNWIND_HINT_IRET_REGS offset=\has_error_code*8
···
 	.endif

 	.if \shift_ist != -1
-	subq	$EXCEPTION_STKSZ, CPU_TSS_IST(\shift_ist)
+	subq	$\ist_offset, CPU_TSS_IST(\shift_ist)
 	.endif

 	call	\do_sym

 	.if \shift_ist != -1
-	addq	$EXCEPTION_STKSZ, CPU_TSS_IST(\shift_ist)
+	addq	$\ist_offset, CPU_TSS_IST(\shift_ist)
 	.endif

 	/* these procedures expect "no swapgs" flag in ebx */
···
 	hv_stimer0_callback_vector hv_stimer0_vector_handler
 #endif /* CONFIG_HYPERV */

-idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=IST_INDEX_DB
+idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=IST_INDEX_DB ist_offset=DB_STACK_OFFSET
 idtentry int3 do_int3 has_error_code=0
 idtentry stack_segment do_stack_segment has_error_code=1
+10 -4
arch/x86/include/asm/cpu_entry_area.h
···
 #ifdef CONFIG_X86_64

 /* Macro to enforce the same ordering and stack sizes */
-#define ESTACKS_MEMBERS(guardsize)		\
+#define ESTACKS_MEMBERS(guardsize, db2_holesize)\
 	char	DF_stack_guard[guardsize];	\
 	char	DF_stack[EXCEPTION_STKSZ];	\
 	char	NMI_stack_guard[guardsize];	\
 	char	NMI_stack[EXCEPTION_STKSZ];	\
+	char	DB2_stack_guard[guardsize];	\
+	char	DB2_stack[db2_holesize];	\
+	char	DB1_stack_guard[guardsize];	\
+	char	DB1_stack[EXCEPTION_STKSZ];	\
 	char	DB_stack_guard[guardsize];	\
-	char	DB_stack[DEBUG_STKSZ];		\
+	char	DB_stack[EXCEPTION_STKSZ];	\
 	char	MCE_stack_guard[guardsize];	\
 	char	MCE_stack[EXCEPTION_STKSZ];	\
 	char	IST_top_guard[guardsize];	\

 /* The exception stacks' physical storage. No guard pages required */
 struct exception_stacks {
-	ESTACKS_MEMBERS(0)
+	ESTACKS_MEMBERS(0, 0)
 };

 /* The effective cpu entry area mapping with guard pages. */
 struct cea_exception_stacks {
-	ESTACKS_MEMBERS(PAGE_SIZE)
+	ESTACKS_MEMBERS(PAGE_SIZE, EXCEPTION_STKSZ)
 };

 /*
···
 enum exception_stack_ordering {
 	ESTACK_DF,
 	ESTACK_NMI,
+	ESTACK_DB2,
+	ESTACK_DB1,
 	ESTACK_DB,
 	ESTACK_MCE,
 	N_EXCEPTION_STACKS
-2
arch/x86/include/asm/debugreg.h
···
 {
 	__this_cpu_dec(debug_stack_usage);
 }
-int is_debug_stack(unsigned long addr);
 void debug_stack_set_zero(void);
 void debug_stack_reset(void);
 #else /* !X86_64 */
-static inline int is_debug_stack(unsigned long addr) { return 0; }
 static inline void debug_stack_set_zero(void) { }
 static inline void debug_stack_reset(void) { }
 static inline void debug_stack_usage_inc(void) { }
-3
arch/x86/include/asm/page_64_types.h
···
 #define EXCEPTION_STACK_ORDER (0 + KASAN_STACK_ORDER)
 #define EXCEPTION_STKSZ (PAGE_SIZE << EXCEPTION_STACK_ORDER)

-#define DEBUG_STACK_ORDER (EXCEPTION_STACK_ORDER + 1)
-#define DEBUG_STKSZ (PAGE_SIZE << DEBUG_STACK_ORDER)
-
 #define IRQ_STACK_ORDER (2 + KASAN_STACK_ORDER)
 #define IRQ_STACK_SIZE (PAGE_SIZE << IRQ_STACK_ORDER)
+2
arch/x86/kernel/asm-offsets_64.c
···
 #undef ENTRY

 	OFFSET(TSS_ist, tss_struct, x86_tss.ist);
+	DEFINE(DB_STACK_OFFSET, offsetof(struct cea_exception_stacks, DB_stack) -
+	       offsetof(struct cea_exception_stacks, DB1_stack));
 	BLANK();

 #ifdef CONFIG_STACKPROTECTOR
-11
arch/x86/kernel/cpu/common.c
···
 		X86_EFLAGS_IOPL|X86_EFLAGS_AC|X86_EFLAGS_NT);
 }

-static DEFINE_PER_CPU(unsigned long, debug_stack_addr);
 DEFINE_PER_CPU(int, debug_stack_usage);
-
-int is_debug_stack(unsigned long addr)
-{
-	return __this_cpu_read(debug_stack_usage) ||
-		(addr <= __this_cpu_read(debug_stack_addr) &&
-		 addr > (__this_cpu_read(debug_stack_addr) - DEBUG_STKSZ));
-}
-NOKPROBE_SYMBOL(is_debug_stack);
-
 DEFINE_PER_CPU(u32, debug_idt_ctr);

 void debug_stack_set_zero(void)
···
 		t->x86_tss.ist[IST_INDEX_NMI] = __this_cpu_ist_top_va(NMI);
 		t->x86_tss.ist[IST_INDEX_DB] = __this_cpu_ist_top_va(DB);
 		t->x86_tss.ist[IST_INDEX_MCE] = __this_cpu_ist_top_va(MCE);
-		per_cpu(debug_stack_addr, cpu) = t->x86_tss.ist[IST_INDEX_DB];
 	}

 	t->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET;
+8 -4
arch/x86/kernel/dumpstack_64.c
···
 #include <asm/cpu_entry_area.h>
 #include <asm/stacktrace.h>

-static const char *exception_stack_names[N_EXCEPTION_STACKS] = {
+static const char * const exception_stack_names[] = {
 		[ ESTACK_DF	]	= "#DF",
 		[ ESTACK_NMI	]	= "NMI",
+		[ ESTACK_DB2	]	= "#DB2",
+		[ ESTACK_DB1	]	= "#DB1",
 		[ ESTACK_DB	]	= "#DB",
 		[ ESTACK_MCE	]	= "#MC",
 };

 const char *stack_type_name(enum stack_type type)
 {
-	BUILD_BUG_ON(N_EXCEPTION_STACKS != 4);
+	BUILD_BUG_ON(N_EXCEPTION_STACKS != 6);

 	if (type == STACK_TYPE_IRQ)
 		return "IRQ";
···
 	.end   = offsetof(struct cea_exception_stacks, x## _stack_guard)	\
 	}

-static const struct estack_layout layout[N_EXCEPTION_STACKS] = {
+static const struct estack_layout layout[] = {
 	[ ESTACK_DF	]	= ESTACK_ENTRY(DF),
 	[ ESTACK_NMI	]	= ESTACK_ENTRY(NMI),
+	[ ESTACK_DB2	]	= { .begin = 0, .end = 0},
+	[ ESTACK_DB1	]	= ESTACK_ENTRY(DB1),
 	[ ESTACK_DB	]	= ESTACK_ENTRY(DB),
 	[ ESTACK_MCE	]	= ESTACK_ENTRY(MCE),
 };
···
 	struct pt_regs *regs;
 	unsigned int k;

-	BUILD_BUG_ON(N_EXCEPTION_STACKS != 4);
+	BUILD_BUG_ON(N_EXCEPTION_STACKS != 6);

 	estacks = (unsigned long)__this_cpu_read(cea_exception_stacks);
+19 -1
arch/x86/kernel/nmi.c
···
 #include <linux/ratelimit.h>
 #include <linux/slab.h>
 #include <linux/export.h>
+#include <linux/atomic.h>
 #include <linux/sched/clock.h>

 #if defined(CONFIG_EDAC)
 #include <linux/edac.h>
 #endif

-#include <linux/atomic.h>
+#include <asm/cpu_entry_area.h>
 #include <asm/traps.h>
 #include <asm/mach_traps.h>
 #include <asm/nmi.h>
···
  * switch back to the original IDT.
  */
 static DEFINE_PER_CPU(int, update_debug_stack);
+
+static bool notrace is_debug_stack(unsigned long addr)
+{
+	struct cea_exception_stacks *cs = __this_cpu_read(cea_exception_stacks);
+	unsigned long top = CEA_ESTACK_TOP(cs, DB);
+	unsigned long bot = CEA_ESTACK_BOT(cs, DB1);
+
+	if (__this_cpu_read(debug_stack_usage))
+		return true;
+	/*
+	 * Note, this covers the guard page between DB and DB1 as well to
+	 * avoid two checks. But by all means @addr can never point into
+	 * the guard page.
+	 */
+	return addr >= bot && addr < top;
+}
+NOKPROBE_SYMBOL(is_debug_stack);
 #endif

 dotraplinkage notrace void
+3 -1
arch/x86/mm/cpu_entry_area.c
···

 	/*
 	 * The exceptions stack mappings in the per cpu area are protected
-	 * by guard pages so each stack must be mapped separately.
+	 * by guard pages so each stack must be mapped separately. DB2 is
+	 * not mapped; it just exists to catch triple nesting of #DB.
 	 */
 	cea_map_stack(DF);
 	cea_map_stack(NMI);
+	cea_map_stack(DB1);
 	cea_map_stack(DB);
 	cea_map_stack(MCE);
 }