Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

x86/process/64: Move cpu_current_top_of_stack out of TSS

cpu_current_top_of_stack is currently stored in TSS.sp1. TSS is exposed
through the cpu_entry_area which is visible with user CR3 when PTI is
enabled and active.

This makes it a coveted fruit for attackers: the kernel stack top can be
fetched from it and used as the basis for further attacks on the kernel
stack.

But it is actually not necessary to store it in the TSS. It is only
accessed after the entry code has switched to the kernel CR3 and kernel
GS_BASE, which means it can live in any regular percpu variable.

The reason why it is in TSS is historical (pre PTI) because TSS is also
used as scratch space in SYSCALL_64 and therefore cache hot.

A syscall also needs the per-CPU variable current_task and eventually
__preempt_count, so placing cpu_current_top_of_stack next to them makes it
likely that they end up in the same cache line, which should avoid
performance regressions. This is not enforced, as the compiler is free to
place these variables; to make it enforceable, these entry-relevant
variables should move into a dedicated data structure.

The seccomp_benchmark doesn't show any performance loss in the "getpid
native" test. In fact, the result improves from 93ns to 92ns with this
change when KPTI is disabled. The test is very stable, and although it
doesn't offer a higher degree of precision, it gives enough confidence
that moving cpu_current_top_of_stack does not cause a regression.

[ tglx: Removed unneeded export. Massaged changelog ]

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210125173444.22696-2-jiangshanlai@gmail.com

Authored by Lai Jiangshan, committed by Thomas Gleixner
1591584e 800c120e

+8 -33
-10
arch/x86/include/asm/processor.h
···
 314  314  struct x86_hw_tss {
 315  315  	u32 reserved1;
 316  316  	u64 sp0;
 317       -
 318       -	/*
 319       -	 * We store cpu_current_top_of_stack in sp1 so it's always accessible.
 320       -	 * Linux does not use ring 1, so sp1 is not otherwise needed.
 321       -	 */
 322  317  	u64 sp1;
 323  318
 324  319  	/*
···
 421  426  	char stack[IRQ_STACK_SIZE];
 422  427  } __aligned(IRQ_STACK_SIZE);
 423  428
 424       -  #ifdef CONFIG_X86_32
 425  429  DECLARE_PER_CPU(unsigned long, cpu_current_top_of_stack);
 426       -  #else
 427       -  /* The RO copy can't be accessed with this_cpu_xyz(), so use the RW copy. */
 428       -  #define cpu_current_top_of_stack cpu_tss_rw.x86_tss.sp1
 429       -  #endif
 430  430
 431  431  #ifdef CONFIG_X86_64
 432  432  struct fixed_percpu_data {
+1 -6
arch/x86/include/asm/switch_to.h
···
  71   71  	else
  72   72  		this_cpu_write(cpu_tss_rw.x86_tss.sp1, task->thread.sp0);
  73   73  #else
  74        -	/*
  75        -	 * x86-64 updates x86_tss.sp1 via cpu_current_top_of_stack. That
  76        -	 * doesn't work on x86-32 because sp1 and
  77        -	 * cpu_current_top_of_stack have different values (because of
  78        -	 * the non-zero stack-padding on 32bit).
  79        -	 */
       74  +	/* Xen PV enters the kernel on the thread stack. */
  80   75  	if (static_cpu_has(X86_FEATURE_XENPV))
  81   76  		load_sp0(task_top_of_stack(task));
  82   77  #endif
+1 -7
arch/x86/include/asm/thread_info.h
···
 197  197  #endif
 198  198  }
 199  199
 200       -  #else /* !__ASSEMBLY__ */
 201       -
 202       -  #ifdef CONFIG_X86_64
 203       -  # define cpu_current_top_of_stack (cpu_tss_rw + TSS_sp1)
 204       -  #endif
 205       -
 206       -  #endif
      200  +  #endif /* !__ASSEMBLY__ */
 207  201
 208  202  /*
 209  203   * Thread-synchronous status.
+2
arch/x86/kernel/cpu/common.c
···
 1748  1748  DEFINE_PER_CPU(int, __preempt_count) = INIT_PREEMPT_COUNT;
 1749  1749  EXPORT_PER_CPU_SYMBOL(__preempt_count);
 1750  1750
       1751  +  DEFINE_PER_CPU(unsigned long, cpu_current_top_of_stack) = TOP_OF_INIT_STACK;
       1752  +
 1751  1753  /* May not be marked __init: used by software suspend */
 1752  1754  void syscall_init(void)
 1753  1755  {
+1 -6
arch/x86/kernel/process.c
···
  63   63  	 */
  64   64  	.sp0 = (1UL << (BITS_PER_LONG-1)) + 1,
  65   65
  66        -	/*
  67        -	 * .sp1 is cpu_current_top_of_stack. The init task never
  68        -	 * runs user code, but cpu_current_top_of_stack should still
  69        -	 * be well defined before the first context switch.
  70        -	 */
       66  +  #ifdef CONFIG_X86_32
  71   67  	.sp1 = TOP_OF_INIT_STACK,
  72   68
  73        -  #ifdef CONFIG_X86_32
  74   69  	.ss0 = __KERNEL_DS,
  75   70  	.ss1 = __KERNEL_CS,
  76   71  #endif
+3 -4
arch/x86/mm/pti.c
···
 440  440
 441  441  	for_each_possible_cpu(cpu) {
 442  442  		/*
 443       -		 * The SYSCALL64 entry code needs to be able to find the
 444       -		 * thread stack and needs one word of scratch space in which
 445       -		 * to spill a register. All of this lives in the TSS, in
 446       -		 * the sp1 and sp2 slots.
      443  +		 * The SYSCALL64 entry code needs one word of scratch space
      444  +		 * in which to spill a register. It lives in the sp2 slot
      445  +		 * of the CPU's TSS.
 447  446  		 *
 448  447  		 * This is done for all possible CPUs during boot to ensure
 449  448  		 * that it's propagated to all mms.