Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

x86/entry: Vastly simplify SYSENTER TF (single-step) handling

Due to a blatant design error, SYSENTER doesn't clear TF (single-step).

As a result, if a user does SYSENTER with TF set, we will single-step
through the kernel until something clears TF. There is absolutely
nothing we can do to prevent this short of turning off SYSENTER [1].

Simplify the handling considerably with two changes:

 1. We already sanitize EFLAGS in SYSENTER to clear NT and AC. We can
    add TF to that list of flags to sanitize with no overhead whatsoever.

 2. Teach do_debug() to ignore single-step traps in the SYSENTER prologue.

That's all we need to do.
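
As a rough illustration of the first change, the flag check can be modeled in plain C. This is a user-space sketch, not kernel code: the helper names `needs_flag_fixup()` and `sanitize_live_flags()` and the mask name `SYSENTER_BAD_FLAGS` are hypothetical, while the flag values are the architectural EFLAGS bit positions. It assumes, as the entry code does, that the flags saved at entry are tested and the live flags are reset to the fixed value only when one of the unwanted bits was set.

```c
#include <stdbool.h>

/* Architectural EFLAGS bits (values as in the x86 manuals). */
#define X86_EFLAGS_FIXED 0x00000002UL /* bit 1, always set */
#define X86_EFLAGS_TF    0x00000100UL /* trap flag (single-step) */
#define X86_EFLAGS_NT    0x00004000UL /* nested task */
#define X86_EFLAGS_AC    0x00040000UL /* alignment check */

/* Hypothetical name for the mask used by the testl instruction. */
#define SYSENTER_BAD_FLAGS (X86_EFLAGS_NT | X86_EFLAGS_AC | X86_EFLAGS_TF)

/* Models "testl $mask, saved_flags; jnz fixup": one test covers all bits. */
static bool needs_flag_fixup(unsigned long saved_flags)
{
	return (saved_flags & SYSENTER_BAD_FLAGS) != 0;
}

/*
 * Models the out-of-line fixup (push X86_EFLAGS_FIXED, pop into flags):
 * the live flags are reset, the saved copy is left alone.
 */
static unsigned long sanitize_live_flags(unsigned long saved_flags,
					 unsigned long live_flags)
{
	return needs_flag_fixup(saved_flags) ? X86_EFLAGS_FIXED : live_flags;
}
```

Because TF is simply OR-ed into a mask that is already being tested, the common path (none of the bits set) executes the same instructions as before, which is presumably what "no overhead whatsoever" refers to.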

Don't get too excited -- our handling is still buggy on 32-bit
kernels. There's nothing wrong with the SYSENTER code itself, but
the #DB prologue has a clever fixup for traps on the very first
instruction of entry_SYSENTER_32, and the fixup doesn't work quite
correctly. The next two patches will fix that.

[1] We could probably prevent it by forcing BTF on at all times and
    making sure we clear TF before any branches in the SYSENTER
    code. Needless to say, this is a bad idea.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/a30d2ea06fe4b621fe6a9ef911b02c0f38feb6f2.1457578375.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

Authored by Andy Lutomirski and committed by Ingo Molnar
f2b37575 8bb56436

+94 -24
+30 -12
arch/x86/entry/entry_32.S
···
 END(resume_kernel)
 #endif

-# SYSENTER call handler stub
+GLOBAL(__begin_SYSENTER_singlestep_region)
+/*
+ * All code from here through __end_SYSENTER_singlestep_region is subject
+ * to being single-stepped if a user program sets TF and executes SYSENTER.
+ * There is absolutely nothing that we can do to prevent this from happening
+ * (thanks Intel!). To keep our handling of this situation as simple as
+ * possible, we handle TF just like AC and NT, except that our #DB handler
+ * will ignore all of the single-step traps generated in this range.
+ */
+
+#ifdef CONFIG_XEN
+/*
+ * Xen doesn't set %esp to be precisely what the normal SYSENTER
+ * entry point expects, so fix it up before using the normal path.
+ */
+ENTRY(xen_sysenter_target)
+	addl	$5*4, %esp		/* remove xen-provided frame */
+	jmp	sysenter_past_esp
+#endif
+
 ENTRY(entry_SYSENTER_32)
 	movl	TSS_sysenter_sp0(%esp), %esp
 sysenter_past_esp:
···
 	SAVE_ALL pt_regs_ax=$-ENOSYS	/* save rest */

 	/*
-	 * SYSENTER doesn't filter flags, so we need to clear NT and AC
-	 * ourselves. To save a few cycles, we can check whether
+	 * SYSENTER doesn't filter flags, so we need to clear NT, AC
+	 * and TF ourselves. To save a few cycles, we can check whether
 	 * either was set instead of doing an unconditional popfq.
 	 * This needs to happen before enabling interrupts so that
 	 * we don't get preempted with NT set.
+	 *
+	 * If TF is set, we will single-step all the way to here -- do_debug
+	 * will ignore all the traps. (Yes, this is slow, but so is
+	 * single-stepping in general. This allows us to avoid having
+	 * a more complicated code to handle the case where a user program
+	 * forces us to single-step through the SYSENTER entry code.)
 	 *
 	 * NB.: .Lsysenter_fix_flags is a label with the code under it moved
 	 * out-of-line as an optimization: NT is unlikely to be set in the
···
 	 * we're keeping that code behind a branch which will predict as
 	 * not-taken and therefore its instructions won't be fetched.
 	 */
-	testl	$X86_EFLAGS_NT|X86_EFLAGS_AC, PT_EFLAGS(%esp)
+	testl	$X86_EFLAGS_NT|X86_EFLAGS_AC|X86_EFLAGS_TF, PT_EFLAGS(%esp)
 	jnz	.Lsysenter_fix_flags
 .Lsysenter_flags_fixed:

···
 	pushl	$X86_EFLAGS_FIXED
 	popfl
 	jmp	.Lsysenter_flags_fixed
+GLOBAL(__end_SYSENTER_singlestep_region)
 ENDPROC(entry_SYSENTER_32)

 	# system call handler stub
···
 END(spurious_interrupt_bug)

 #ifdef CONFIG_XEN
-/*
- * Xen doesn't set %esp to be precisely what the normal SYSENTER
- * entry point expects, so fix it up before using the normal path.
- */
-ENTRY(xen_sysenter_target)
-	addl	$5*4, %esp		/* remove xen-provided frame */
-	jmp	sysenter_past_esp
-
 ENTRY(xen_hypervisor_callback)
 	pushl	$-1			/* orig_ax = -1 => not a system call */
 	SAVE_ALL
+8 -1
arch/x86/entry/entry_64_compat.S
···
 	 * This needs to happen before enabling interrupts so that
 	 * we don't get preempted with NT set.
 	 *
+	 * If TF is set, we will single-step all the way to here -- do_debug
+	 * will ignore all the traps. (Yes, this is slow, but so is
+	 * single-stepping in general. This allows us to avoid having
+	 * a more complicated code to handle the case where a user program
+	 * forces us to single-step through the SYSENTER entry code.)
+	 *
 	 * NB.: .Lsysenter_fix_flags is a label with the code under it moved
 	 * out-of-line as an optimization: NT is unlikely to be set in the
 	 * majority of the cases and instead of polluting the I$ unnecessarily,
 	 * we're keeping that code behind a branch which will predict as
 	 * not-taken and therefore its instructions won't be fetched.
 	 */
-	testl	$X86_EFLAGS_NT|X86_EFLAGS_AC, EFLAGS(%rsp)
+	testl	$X86_EFLAGS_NT|X86_EFLAGS_AC|X86_EFLAGS_TF, EFLAGS(%rsp)
 	jnz	.Lsysenter_fix_flags
 .Lsysenter_flags_fixed:

···
 	pushq	$X86_EFLAGS_FIXED
 	popfq
 	jmp	.Lsysenter_flags_fixed
+GLOBAL(__end_entry_SYSENTER_compat)
 ENDPROC(entry_SYSENTER_compat)

 /*
+13 -2
arch/x86/include/asm/proto.h
···

 void syscall_init(void);

+#ifdef CONFIG_X86_64
 void entry_SYSCALL_64(void);
-void entry_SYSCALL_compat(void);
+#endif
+
+#ifdef CONFIG_X86_32
 void entry_INT80_32(void);
-void entry_INT80_compat(void);
 void entry_SYSENTER_32(void);
+void __begin_SYSENTER_singlestep_region(void);
+void __end_SYSENTER_singlestep_region(void);
+#endif
+
+#ifdef CONFIG_IA32_EMULATION
 void entry_SYSENTER_compat(void);
+void __end_entry_SYSENTER_compat(void);
+void entry_SYSCALL_compat(void);
+void entry_INT80_compat(void);
+#endif

 void x86_configure_nx(void);
 void x86_report_nx(void);
+43 -9
arch/x86/kernel/traps.c
···
 NOKPROBE_SYMBOL(fixup_bad_iret);
 #endif

+static bool is_sysenter_singlestep(struct pt_regs *regs)
+{
+	/*
+	 * We don't try for precision here. If we're anywhere in the region of
+	 * code that can be single-stepped in the SYSENTER entry path, then
+	 * assume that this is a useless single-step trap due to SYSENTER
+	 * being invoked with TF set. (We don't know in advance exactly
+	 * which instructions will be hit because BTF could plausibly
+	 * be set.)
+	 */
+#ifdef CONFIG_X86_32
+	return (regs->ip - (unsigned long)__begin_SYSENTER_singlestep_region) <
+		(unsigned long)__end_SYSENTER_singlestep_region -
+		(unsigned long)__begin_SYSENTER_singlestep_region;
+#elif defined(CONFIG_IA32_EMULATION)
+	return (regs->ip - (unsigned long)entry_SYSENTER_compat) <
+		(unsigned long)__end_entry_SYSENTER_compat -
+		(unsigned long)entry_SYSENTER_compat;
+#else
+	return false;
+#endif
+}
+
 /*
  * Our handling of the processor debug registers is non-trivial.
  * We do not clear them on entry and exit from the kernel. Therefore
···
 	 */
 	clear_tsk_thread_flag(tsk, TIF_BLOCKSTEP);

+	if (unlikely(!user_mode(regs) && (dr6 & DR_STEP) &&
+		     is_sysenter_singlestep(regs))) {
+		dr6 &= ~DR_STEP;
+		if (!dr6)
+			goto exit;
+		/*
+		 * else we might have gotten a single-step trap and hit a
+		 * watchpoint at the same time, in which case we should fall
+		 * through and handle the watchpoint.
+		 */
+	}
+
 	/*
 	 * If dr6 has no reason to give us about the origin of this trap,
 	 * then it's very likely the result of an icebp/int01 trap.
···
 	if (!dr6 && user_mode(regs))
 		user_icebp = 1;

-	/* Catch kmemcheck conditions first of all! */
+	/* Catch kmemcheck conditions! */
 	if ((dr6 & DR_STEP) && kmemcheck_trap(regs))
 		goto exit;

···
 		goto exit;
 	}

-	/*
-	 * Single-stepping through system calls: ignore any exceptions in
-	 * kernel space, but re-enable TF when returning to user mode.
-	 *
-	 * We already checked v86 mode above, so we can check for kernel mode
-	 * by just checking the CPL of CS.
-	 */
-	if ((dr6 & DR_STEP) && !user_mode(regs)) {
+	if (WARN_ON_ONCE((dr6 & DR_STEP) && !user_mode(regs))) {
+		/*
+		 * Historical junk that used to handle SYSENTER single-stepping.
+		 * This should be unreachable now. If we survive for a while
+		 * without anyone hitting this warning, we'll turn this into
+		 * an oops.
+		 */
 		tsk->thread.debugreg6 &= ~DR_STEP;
 		set_tsk_thread_flag(tsk, TIF_SINGLESTEP);
 		regs->flags &= ~X86_EFLAGS_TF;
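
The range test in is_sysenter_singlestep() and the DR6 filtering added to do_debug() can be modeled outside the kernel. The following is a user-space sketch under stated assumptions, not the kernel implementation: `in_region()` and `filter_sysenter_step()` are hypothetical names, addresses are plain integers rather than linker symbols, and the function returns the remaining DR6 bits instead of jumping to an exit label. `DR_STEP` and `DR_TRAP0` use the architectural DR6 bit positions (BS and B0).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DR_TRAP0 0x1UL    /* DR6.B0: breakpoint/watchpoint 0 triggered */
#define DR_STEP  0x4000UL /* DR6.BS: single-step trap */

/*
 * True if begin <= ip < end, using a single unsigned comparison.
 * If ip is below begin, the subtraction wraps around to a very large
 * value, which can never be smaller than the region length.
 */
static bool in_region(uintptr_t ip, uintptr_t begin, uintptr_t end)
{
	assert(begin <= end);
	return (ip - begin) < (end - begin);
}

/*
 * Hypothetical stand-in for the new block in do_debug(): drop DR_STEP
 * for kernel-mode traps inside the SYSENTER region and return whatever
 * DR6 bits are left. A result of 0 means there is nothing more to
 * handle; any remaining bit (e.g. a watchpoint hit at the same time)
 * still needs normal processing.
 */
static unsigned long filter_sysenter_step(unsigned long dr6, bool user_mode,
					  uintptr_t ip, uintptr_t begin,
					  uintptr_t end)
{
	if (!user_mode && (dr6 & DR_STEP) && in_region(ip, begin, end))
		dr6 &= ~DR_STEP;
	return dr6;
}
```

With a region of 0x1000 to 0x1040, a kernel-mode single-step trap at 0x1010 is swallowed entirely, while the same trap combined with `DR_TRAP0` leaves the watchpoint bit for the rest of the handler. The half-open comparison also means the end marker itself is outside the region.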