Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'x86_tdx_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull Intel TDX support from Borislav Petkov:
"Intel Trust Domain Extensions (TDX) support.

This is the Intel version of a confidential computing solution called
Trust Domain Extensions (TDX). This series adds support to run the
kernel as part of a TDX guest. It provides similar guest protections
to AMD's SEV-SNP like guest memory and register state encryption,
memory integrity protection and a lot more.

Design-wise, it differs from AMD's solution considerably: it uses a
software module which runs in a special CPU mode called SEAM (Secure
Arbitration Mode). As the name suggests, this module serves as
sort of an arbiter which the confidential guest calls for services it
needs during its lifetime.

Just like AMD's SNP set, this series reworks and streamlines certain
parts of x86 arch code so that this feature can be properly
accommodated"

* tag 'x86_tdx_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
x86/tdx: Fix RETs in TDX asm
x86/tdx: Annotate a noreturn function
x86/mm: Fix spacing within memory encryption features message
x86/kaslr: Fix build warning in KASLR code in boot stub
Documentation/x86: Document TDX kernel architecture
ACPICA: Avoid cache flush inside virtual machines
x86/tdx/ioapic: Add shared bit for IOAPIC base address
x86/mm: Make DMA memory shared for TD guest
x86/mm/cpa: Add support for TDX shared memory
x86/tdx: Make pages shared in ioremap()
x86/topology: Disable CPU online/offline control for TDX guests
x86/boot: Avoid #VE during boot for TDX platforms
x86/boot: Set CR0.NE early and keep it set during the boot
x86/acpi/x86/boot: Add multiprocessor wake-up support
x86/boot: Add a trampoline for booting APs via firmware handoff
x86/tdx: Wire up KVM hypercalls
x86/tdx: Port I/O: Add early boot support
x86/tdx: Port I/O: Add runtime hypercalls
x86/boot: Port I/O: Add decompression-time support for TDX
x86/boot: Port I/O: Allow to hook up alternative helpers
...

+2071 -120
+1
Documentation/x86/index.rst
··· 26 26 intel_txt 27 27 amd-memory-encryption 28 28 amd_hsmp 29 + tdx 29 30 pti 30 31 mds 31 32 microcode
+218
Documentation/x86/tdx.rst
··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + ===================================== 4 + Intel Trust Domain Extensions (TDX) 5 + ===================================== 6 + 7 + Intel's Trust Domain Extensions (TDX) protect confidential guest VMs from 8 + the host and physical attacks by isolating the guest register state and by 9 + encrypting the guest memory. In TDX, a special module running in a special 10 + mode sits between the host and the guest and manages the guest/host 11 + separation. 12 + 13 + Since the host cannot directly access guest registers or memory, much 14 + normal functionality of a hypervisor must be moved into the guest. This is 15 + implemented using a Virtualization Exception (#VE) that is handled by the 16 + guest kernel. Some #VEs are handled entirely inside the guest kernel, but 17 + others require the hypervisor to be consulted. 18 + 19 + TDX includes new hypercall-like mechanisms for communicating from the 20 + guest to the hypervisor or the TDX module. 21 + 22 + New TDX Exceptions 23 + ================== 24 + 25 + TDX guests behave differently from bare-metal and traditional VMX guests. 26 + In TDX guests, otherwise normal instructions or memory accesses can cause 27 + #VE or #GP exceptions. 28 + 29 + Instructions marked with an '*' conditionally cause exceptions. The 30 + details for these instructions are discussed below. 
31 + 32 + Instruction-based #VE 33 + --------------------- 34 + 35 + - Port I/O (INS, OUTS, IN, OUT) 36 + - HLT 37 + - MONITOR, MWAIT 38 + - WBINVD, INVD 39 + - VMCALL 40 + - RDMSR*,WRMSR* 41 + - CPUID* 42 + 43 + Instruction-based #GP 44 + --------------------- 45 + 46 + - All VMX instructions: INVEPT, INVVPID, VMCLEAR, VMFUNC, VMLAUNCH, 47 + VMPTRLD, VMPTRST, VMREAD, VMRESUME, VMWRITE, VMXOFF, VMXON 48 + - ENCLS, ENCLU 49 + - GETSEC 50 + - RSM 51 + - ENQCMD 52 + - RDMSR*,WRMSR* 53 + 54 + RDMSR/WRMSR Behavior 55 + -------------------- 56 + 57 + MSR access behavior falls into three categories: 58 + 59 + - #GP generated 60 + - #VE generated 61 + - "Just works" 62 + 63 + In general, the #GP MSRs should not be used in guests. Their use likely 64 + indicates a bug in the guest. The guest may try to handle the #GP with a 65 + hypercall but it is unlikely to succeed. 66 + 67 + The #VE MSRs are typically able to be handled by the hypervisor. Guests 68 + can make a hypercall to the hypervisor to handle the #VE. 69 + 70 + The "just works" MSRs do not need any special guest handling. They might 71 + be implemented by directly passing through the MSR to the hardware or by 72 + trapping and handling in the TDX module. Other than possibly being slow, 73 + these MSRs appear to function just as they would on bare metal. 74 + 75 + CPUID Behavior 76 + -------------- 77 + 78 + For some CPUID leaves and sub-leaves, the virtualized bit fields of CPUID 79 + return values (in guest EAX/EBX/ECX/EDX) are configurable by the 80 + hypervisor. For such cases, the Intel TDX module architecture defines two 81 + virtualization types: 82 + 83 + - Bit fields for which the hypervisor controls the value seen by the guest 84 + TD. 85 + 86 + - Bit fields for which the hypervisor configures the value such that the 87 + guest TD either sees their native value or a value of 0. For these bit 88 + fields, the hypervisor can mask off the native values, but it can not 89 + turn *on* values. 
90 + 91 + A #VE is generated for CPUID leaves and sub-leaves that the TDX module does 92 + not know how to handle. The guest kernel may ask the hypervisor for the 93 + value with a hypercall. 94 + 95 + #VE on Memory Accesses 96 + ====================== 97 + 98 + There are essentially two classes of TDX memory: private and shared. 99 + Private memory receives full TDX protections. Its content is protected 100 + against access from the hypervisor. Shared memory is expected to be 101 + shared between guest and hypervisor and does not receive full TDX 102 + protections. 103 + 104 + A TD guest is in control of whether its memory accesses are treated as 105 + private or shared. It selects the behavior with a bit in its page table 106 + entries. This helps ensure that a guest does not place sensitive 107 + information in shared memory, exposing it to the untrusted hypervisor. 108 + 109 + #VE on Shared Memory 110 + -------------------- 111 + 112 + Access to shared mappings can cause a #VE. The hypervisor ultimately 113 + controls whether a shared memory access causes a #VE, so the guest must be 114 + careful to only reference shared pages for which it can safely handle a 115 + #VE. For instance, the guest should be careful not to access shared memory 116 + in the #VE handler before it reads the #VE info structure (TDG.VP.VEINFO.GET). 117 + 118 + Shared mapping content is entirely controlled by the hypervisor. The guest 119 + should only use shared mappings for communicating with the hypervisor. 120 + Shared mappings must never be used for sensitive memory content like kernel 121 + stacks. A good rule of thumb is that hypervisor-shared memory should be 122 + treated the same as memory mapped to userspace. Both the hypervisor and 123 + userspace are completely untrusted. 124 + 125 + MMIO for virtual devices is implemented as shared memory. The guest must 126 + be careful not to access device MMIO regions unless it is also prepared to 127 + handle a #VE. 
128 + 129 + #VE on Private Pages 130 + -------------------- 131 + 132 + An access to private mappings can also cause a #VE. Since all kernel 133 + memory is also private memory, the kernel might theoretically need to 134 + handle a #VE on arbitrary kernel memory accesses. This is not feasible, so 135 + TDX guests ensure that all guest memory has been "accepted" before memory 136 + is used by the kernel. 137 + 138 + A modest amount of memory (typically 512M) is pre-accepted by the firmware 139 + before the kernel runs to ensure that the kernel can start up without 140 + being subjected to a #VE. 141 + 142 + The hypervisor is permitted to unilaterally move accepted pages to a 143 + "blocked" state. However, if it does this, page access will not generate a 144 + #VE. It will, instead, cause a "TD Exit" where the hypervisor is required 145 + to handle the exception. 146 + 147 + Linux #VE handler 148 + ================= 149 + 150 + Just like page faults or #GP's, #VE exceptions can be either handled or be 151 + fatal. Typically, an unhandled userspace #VE results in a SIGSEGV. 152 + An unhandled kernel #VE results in an oops. 153 + 154 + Handling nested exceptions on x86 is typically nasty business. A #VE 155 + could be interrupted by an NMI which triggers another #VE and hilarity 156 + ensues. The TDX #VE architecture anticipated this scenario and includes a 157 + feature to make it slightly less nasty. 158 + 159 + During #VE handling, the TDX module ensures that all interrupts (including 160 + NMIs) are blocked. The block remains in place until the guest makes a 161 + TDG.VP.VEINFO.GET TDCALL. This allows the guest to control when interrupts 162 + or a new #VE can be delivered. 163 + 164 + However, the guest kernel must still be careful to avoid potential 165 + #VE-triggering actions (discussed above) while this block is in place. 166 + While the block is in place, any #VE is elevated to a double fault (#DF) 167 + which is not recoverable. 
168 + 169 + MMIO handling 170 + ============= 171 + 172 + In non-TDX VMs, MMIO is usually implemented by giving a guest access to a 173 + mapping which will cause a VMEXIT on access, and then the hypervisor 174 + emulates the access. That is not possible in TDX guests because VMEXIT 175 + will expose the register state to the host. TDX guests don't trust the host 176 + and can't have their state exposed to the host. 177 + 178 + In TDX, MMIO regions typically trigger a #VE exception in the guest. The 179 + guest #VE handler then emulates the MMIO instruction inside the guest and 180 + converts it into a controlled TDCALL to the host, rather than exposing 181 + guest state to the host. 182 + 183 + MMIO addresses on x86 are just special physical addresses. They can 184 + theoretically be accessed with any instruction that accesses memory. 185 + However, the kernel instruction decoding method is limited. It is only 186 + designed to decode instructions like those generated by io.h macros. 187 + 188 + MMIO access via other means (like structure overlays) may result in an 189 + oops. 190 + 191 + Shared Memory Conversions 192 + ========================= 193 + 194 + All TDX guest memory starts out as private at boot. This memory can not 195 + be accessed by the hypervisor. However, some kernel users like device 196 + drivers might have a need to share data with the hypervisor. To do this, 197 + memory must be converted between shared and private. This can be 198 + accomplished using some existing memory encryption helpers: 199 + 200 + * set_memory_decrypted() converts a range of pages to shared. 201 + * set_memory_encrypted() converts memory back to private. 202 + 203 + Device drivers are the primary user of shared memory, but there's no need 204 + to touch every driver. DMA buffers and ioremap() do the conversions 205 + automatically. 206 + 207 + TDX uses SWIOTLB for most DMA allocations. The SWIOTLB buffer is 208 + converted to shared on boot. 
209 + 210 + For coherent DMA allocation, the DMA buffer gets converted on the 211 + allocation. Check force_dma_unencrypted() for details. 212 + 213 + References 214 + ========== 215 + 216 + TDX reference material is collected here: 217 + 218 + https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
+15
arch/x86/Kconfig
··· 878 878 IOT with small footprint and real-time features. More details can be 879 879 found in https://projectacrn.org/. 880 880 881 + config INTEL_TDX_GUEST 882 + bool "Intel TDX (Trust Domain Extensions) - Guest Support" 883 + depends on X86_64 && CPU_SUP_INTEL 884 + depends on X86_X2APIC 885 + select ARCH_HAS_CC_PLATFORM 886 + select X86_MEM_ENCRYPT 887 + select X86_MCE 888 + help 889 + Support running as a guest under Intel TDX. Without this support, 890 + the guest kernel can not boot or run under TDX. 891 + TDX includes memory encryption and integrity capabilities 892 + which protect the confidentiality and integrity of guest 893 + memory contents and CPU state. TDX guests are protected from 894 + some attacks from the VMM. 895 + 881 896 endif #HYPERVISOR_GUEST 882 897 883 898 source "arch/x86/Kconfig.cpu"
+2 -35
arch/x86/boot/boot.h
··· 26 26 #include "bitops.h" 27 27 #include "ctype.h" 28 28 #include "cpuflags.h" 29 + #include "io.h" 29 30 30 31 /* Useful macros */ 31 32 #define ARRAY_SIZE(x) (sizeof(x) / sizeof(*(x))) ··· 36 35 37 36 #define cpu_relax() asm volatile("rep; nop") 38 37 39 - /* Basic port I/O */ 40 - static inline void outb(u8 v, u16 port) 41 - { 42 - asm volatile("outb %0,%1" : : "a" (v), "dN" (port)); 43 - } 44 - static inline u8 inb(u16 port) 45 - { 46 - u8 v; 47 - asm volatile("inb %1,%0" : "=a" (v) : "dN" (port)); 48 - return v; 49 - } 50 - 51 - static inline void outw(u16 v, u16 port) 52 - { 53 - asm volatile("outw %0,%1" : : "a" (v), "dN" (port)); 54 - } 55 - static inline u16 inw(u16 port) 56 - { 57 - u16 v; 58 - asm volatile("inw %1,%0" : "=a" (v) : "dN" (port)); 59 - return v; 60 - } 61 - 62 - static inline void outl(u32 v, u16 port) 63 - { 64 - asm volatile("outl %0,%1" : : "a" (v), "dN" (port)); 65 - } 66 - static inline u32 inl(u16 port) 67 - { 68 - u32 v; 69 - asm volatile("inl %1,%0" : "=a" (v) : "dN" (port)); 70 - return v; 71 - } 72 - 73 38 static inline void io_delay(void) 74 39 { 75 40 const u16 DELAY_PORT = 0x80; 76 - asm volatile("outb %%al,%0" : : "dN" (DELAY_PORT)); 41 + outb(0, DELAY_PORT); 77 42 } 78 43 79 44 /* These functions are used to reference data in other segments. */
+1
arch/x86/boot/compressed/Makefile
··· 101 101 endif 102 102 103 103 vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o 104 + vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o $(obj)/tdcall.o 104 105 105 106 vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_thunk_$(BITS).o 106 107 vmlinux-objs-$(CONFIG_EFI) += $(obj)/efi.o
+22 -5
arch/x86/boot/compressed/head_64.S
··· 289 289 pushl %eax 290 290 291 291 /* Enter paged protected Mode, activating Long Mode */ 292 - movl $(X86_CR0_PG | X86_CR0_PE), %eax /* Enable Paging and Protected mode */ 292 + movl $CR0_STATE, %eax 293 293 movl %eax, %cr0 294 294 295 295 /* Jump from 32bit compatibility mode into 64bit mode. */ ··· 649 649 movl $MSR_EFER, %ecx 650 650 rdmsr 651 651 btsl $_EFER_LME, %eax 652 + /* Avoid writing EFER if no change was made (for TDX guest) */ 653 + jc 1f 652 654 wrmsr 653 - popl %edx 655 + 1: popl %edx 654 656 popl %ecx 655 657 658 + #ifdef CONFIG_X86_MCE 659 + /* 660 + * Preserve CR4.MCE if the kernel will enable #MC support. 661 + * Clearing MCE may fault in some environments (that also force #MC 662 + * support). Any machine check that occurs before #MC support is fully 663 + * configured will crash the system regardless of the CR4.MCE value set 664 + * here. 665 + */ 666 + movl %cr4, %eax 667 + andl $X86_CR4_MCE, %eax 668 + #else 669 + movl $0, %eax 670 + #endif 671 + 656 672 /* Enable PAE and LA57 (if required) paging modes */ 657 - movl $X86_CR4_PAE, %eax 673 + orl $X86_CR4_PAE, %eax 658 674 testl %edx, %edx 659 675 jz 1f 660 676 orl $X86_CR4_LA57, %eax ··· 684 668 pushl $__KERNEL_CS 685 669 pushl %eax 686 670 687 - /* Enable paging again */ 688 - movl $(X86_CR0_PG | X86_CR0_PE), %eax 671 + /* Enable paging again. */ 672 + movl %cr0, %eax 673 + btsl $X86_CR0_PG_BIT, %eax 689 674 movl %eax, %cr0 690 675 691 676 lret
+12
arch/x86/boot/compressed/misc.c
··· 48 48 */ 49 49 struct boot_params *boot_params; 50 50 51 + struct port_io_ops pio_ops; 52 + 51 53 memptr free_mem_ptr; 52 54 memptr free_mem_end_ptr; 53 55 ··· 375 373 376 374 lines = boot_params->screen_info.orig_video_lines; 377 375 cols = boot_params->screen_info.orig_video_cols; 376 + 377 + init_default_io_ops(); 378 + 379 + /* 380 + * Detect TDX guest environment. 381 + * 382 + * It has to be done before console_init() in order to use 383 + * paravirtualized port I/O operations if needed. 384 + */ 385 + early_tdx_detect(); 378 386 379 387 console_init(); 380 388
+3 -1
arch/x86/boot/compressed/misc.h
··· 22 22 #include <linux/linkage.h> 23 23 #include <linux/screen_info.h> 24 24 #include <linux/elf.h> 25 - #include <linux/io.h> 26 25 #include <asm/page.h> 27 26 #include <asm/boot.h> 28 27 #include <asm/bootparam.h> 29 28 #include <asm/desc_defs.h> 29 + 30 + #include "tdx.h" 30 31 31 32 #define BOOT_CTYPE_H 32 33 #include <linux/acpi.h> 33 34 34 35 #define BOOT_BOOT_H 35 36 #include "../ctype.h" 37 + #include "../io.h" 36 38 37 39 #include "efi.h" 38 40
+1 -1
arch/x86/boot/compressed/pgtable.h
··· 6 6 #define TRAMPOLINE_32BIT_PGTABLE_OFFSET 0 7 7 8 8 #define TRAMPOLINE_32BIT_CODE_OFFSET PAGE_SIZE 9 - #define TRAMPOLINE_32BIT_CODE_SIZE 0x70 9 + #define TRAMPOLINE_32BIT_CODE_SIZE 0x80 10 10 11 11 #define TRAMPOLINE_32BIT_STACK_END TRAMPOLINE_32BIT_SIZE 12 12
+3
arch/x86/boot/compressed/tdcall.S
··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + 3 + #include "../../coco/tdx/tdcall.S"
+77
arch/x86/boot/compressed/tdx.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + 3 + #include "../cpuflags.h" 4 + #include "../string.h" 5 + #include "../io.h" 6 + #include "error.h" 7 + 8 + #include <vdso/limits.h> 9 + #include <uapi/asm/vmx.h> 10 + 11 + #include <asm/shared/tdx.h> 12 + 13 + /* Called from __tdx_hypercall() for unrecoverable failure */ 14 + void __tdx_hypercall_failed(void) 15 + { 16 + error("TDVMCALL failed. TDX module bug?"); 17 + } 18 + 19 + static inline unsigned int tdx_io_in(int size, u16 port) 20 + { 21 + struct tdx_hypercall_args args = { 22 + .r10 = TDX_HYPERCALL_STANDARD, 23 + .r11 = EXIT_REASON_IO_INSTRUCTION, 24 + .r12 = size, 25 + .r13 = 0, 26 + .r14 = port, 27 + }; 28 + 29 + if (__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT)) 30 + return UINT_MAX; 31 + 32 + return args.r11; 33 + } 34 + 35 + static inline void tdx_io_out(int size, u16 port, u32 value) 36 + { 37 + struct tdx_hypercall_args args = { 38 + .r10 = TDX_HYPERCALL_STANDARD, 39 + .r11 = EXIT_REASON_IO_INSTRUCTION, 40 + .r12 = size, 41 + .r13 = 1, 42 + .r14 = port, 43 + .r15 = value, 44 + }; 45 + 46 + __tdx_hypercall(&args, 0); 47 + } 48 + 49 + static inline u8 tdx_inb(u16 port) 50 + { 51 + return tdx_io_in(1, port); 52 + } 53 + 54 + static inline void tdx_outb(u8 value, u16 port) 55 + { 56 + tdx_io_out(1, port, value); 57 + } 58 + 59 + static inline void tdx_outw(u16 value, u16 port) 60 + { 61 + tdx_io_out(2, port, value); 62 + } 63 + 64 + void early_tdx_detect(void) 65 + { 66 + u32 eax, sig[3]; 67 + 68 + cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax, &sig[0], &sig[2], &sig[1]); 69 + 70 + if (memcmp(TDX_IDENT, sig, sizeof(sig))) 71 + return; 72 + 73 + /* Use hypercalls instead of I/O instructions */ 74 + pio_ops.f_inb = tdx_inb; 75 + pio_ops.f_outb = tdx_outb; 76 + pio_ops.f_outw = tdx_outw; 77 + }
+13
arch/x86/boot/compressed/tdx.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + #ifndef BOOT_COMPRESSED_TDX_H 3 + #define BOOT_COMPRESSED_TDX_H 4 + 5 + #include <linux/types.h> 6 + 7 + #ifdef CONFIG_INTEL_TDX_GUEST 8 + void early_tdx_detect(void); 9 + #else 10 + static inline void early_tdx_detect(void) { }; 11 + #endif 12 + 13 + #endif /* BOOT_COMPRESSED_TDX_H */
+1 -2
arch/x86/boot/cpuflags.c
··· 71 71 # define EBX_REG "=b" 72 72 #endif 73 73 74 - static inline void cpuid_count(u32 id, u32 count, 75 - u32 *a, u32 *b, u32 *c, u32 *d) 74 + void cpuid_count(u32 id, u32 count, u32 *a, u32 *b, u32 *c, u32 *d) 76 75 { 77 76 asm volatile(".ifnc %%ebx,%3 ; movl %%ebx,%3 ; .endif \n\t" 78 77 "cpuid \n\t"
+1
arch/x86/boot/cpuflags.h
··· 17 17 18 18 int has_eflag(unsigned long mask); 19 19 void get_cpuflags(void); 20 + void cpuid_count(u32 id, u32 count, u32 *a, u32 *b, u32 *c, u32 *d); 20 21 21 22 #endif
+41
arch/x86/boot/io.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + #ifndef BOOT_IO_H 3 + #define BOOT_IO_H 4 + 5 + #include <asm/shared/io.h> 6 + 7 + #undef inb 8 + #undef inw 9 + #undef inl 10 + #undef outb 11 + #undef outw 12 + #undef outl 13 + 14 + struct port_io_ops { 15 + u8 (*f_inb)(u16 port); 16 + void (*f_outb)(u8 v, u16 port); 17 + void (*f_outw)(u16 v, u16 port); 18 + }; 19 + 20 + extern struct port_io_ops pio_ops; 21 + 22 + /* 23 + * Use the normal I/O instructions by default. 24 + * TDX guests override these to use hypercalls. 25 + */ 26 + static inline void init_default_io_ops(void) 27 + { 28 + pio_ops.f_inb = __inb; 29 + pio_ops.f_outb = __outb; 30 + pio_ops.f_outw = __outw; 31 + } 32 + 33 + /* 34 + * Redirect port I/O operations via pio_ops callbacks. 35 + * TDX guests override these callbacks with TDX-specific helpers. 36 + */ 37 + #define inb pio_ops.f_inb 38 + #define outb pio_ops.f_outb 39 + #define outw pio_ops.f_outw 40 + 41 + #endif
+4
arch/x86/boot/main.c
··· 17 17 18 18 struct boot_params boot_params __attribute__((aligned(16))); 19 19 20 + struct port_io_ops pio_ops; 21 + 20 22 char *HEAP = _end; 21 23 char *heap_end = _end; /* Default end of heap = no heap */ 22 24 ··· 135 133 136 134 void main(void) 137 135 { 136 + init_default_io_ops(); 137 + 138 138 /* First, copy the boot header into the "zeropage" */ 139 139 copy_boot_params(); 140 140
+2
arch/x86/coco/Makefile
··· 4 4 CFLAGS_core.o += -fno-stack-protector 5 5 6 6 obj-y += core.o 7 + 8 + obj-$(CONFIG_INTEL_TDX_GUEST) += tdx/
+21 -1
arch/x86/coco/core.c
··· 18 18 19 19 static bool intel_cc_platform_has(enum cc_attr attr) 20 20 { 21 - return false; 21 + switch (attr) { 22 + case CC_ATTR_GUEST_UNROLL_STRING_IO: 23 + case CC_ATTR_HOTPLUG_DISABLED: 24 + case CC_ATTR_GUEST_MEM_ENCRYPT: 25 + case CC_ATTR_MEM_ENCRYPT: 26 + return true; 27 + default: 28 + return false; 29 + } 22 30 } 23 31 24 32 /* ··· 98 90 99 91 u64 cc_mkenc(u64 val) 100 92 { 93 + /* 94 + * Both AMD and Intel use a bit in the page table to indicate 95 + * encryption status of the page. 96 + * 97 + * - for AMD, bit *set* means the page is encrypted 98 + * - for Intel *clear* means encrypted. 99 + */ 101 100 switch (vendor) { 102 101 case CC_VENDOR_AMD: 103 102 return val | cc_mask; 103 + case CC_VENDOR_INTEL: 104 + return val & ~cc_mask; 104 105 default: 105 106 return val; 106 107 } ··· 117 100 118 101 u64 cc_mkdec(u64 val) 119 102 { 103 + /* See comment in cc_mkenc() */ 120 104 switch (vendor) { 121 105 case CC_VENDOR_AMD: 122 106 return val & ~cc_mask; 107 + case CC_VENDOR_INTEL: 108 + return val | cc_mask; 123 109 default: 124 110 return val; 125 111 }
+3
arch/x86/coco/tdx/Makefile
··· 1 + # SPDX-License-Identifier: GPL-2.0 2 + 3 + obj-y += tdx.o tdcall.o
+205
arch/x86/coco/tdx/tdcall.S
··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + #include <asm/asm-offsets.h> 3 + #include <asm/asm.h> 4 + #include <asm/frame.h> 5 + #include <asm/unwind_hints.h> 6 + 7 + #include <linux/linkage.h> 8 + #include <linux/bits.h> 9 + #include <linux/errno.h> 10 + 11 + #include "../../virt/vmx/tdx/tdxcall.S" 12 + 13 + /* 14 + * Bitmasks of exposed registers (with VMM). 15 + */ 16 + #define TDX_R10 BIT(10) 17 + #define TDX_R11 BIT(11) 18 + #define TDX_R12 BIT(12) 19 + #define TDX_R13 BIT(13) 20 + #define TDX_R14 BIT(14) 21 + #define TDX_R15 BIT(15) 22 + 23 + /* 24 + * These registers are clobbered to hold arguments for each 25 + * TDVMCALL. They are safe to expose to the VMM. 26 + * Each bit in this mask represents a register ID. Bit field 27 + * details can be found in TDX GHCI specification, section 28 + * titled "TDCALL [TDG.VP.VMCALL] leaf". 29 + */ 30 + #define TDVMCALL_EXPOSE_REGS_MASK ( TDX_R10 | TDX_R11 | \ 31 + TDX_R12 | TDX_R13 | \ 32 + TDX_R14 | TDX_R15 ) 33 + 34 + /* 35 + * __tdx_module_call() - Used by TDX guests to request services from 36 + * the TDX module (does not include VMM services) using TDCALL instruction. 37 + * 38 + * Transforms function call register arguments into the TDCALL register ABI. 39 + * After TDCALL operation, TDX module output is saved in @out (if it is 40 + * provided by the user). 41 + * 42 + *------------------------------------------------------------------------- 43 + * TDCALL ABI: 44 + *------------------------------------------------------------------------- 45 + * Input Registers: 46 + * 47 + * RAX - TDCALL Leaf number. 48 + * RCX,RDX,R8-R9 - TDCALL Leaf specific input registers. 49 + * 50 + * Output Registers: 51 + * 52 + * RAX - TDCALL instruction error code. 53 + * RCX,RDX,R8-R11 - TDCALL Leaf specific output registers. 
54 + * 55 + *------------------------------------------------------------------------- 56 + * 57 + * __tdx_module_call() function ABI: 58 + * 59 + * @fn (RDI) - TDCALL Leaf ID, moved to RAX 60 + * @rcx (RSI) - Input parameter 1, moved to RCX 61 + * @rdx (RDX) - Input parameter 2, moved to RDX 62 + * @r8 (RCX) - Input parameter 3, moved to R8 63 + * @r9 (R8) - Input parameter 4, moved to R9 64 + * 65 + * @out (R9) - struct tdx_module_output pointer 66 + * stored temporarily in R12 (not 67 + * shared with the TDX module). It 68 + * can be NULL. 69 + * 70 + * Return status of TDCALL via RAX. 71 + */ 72 + SYM_FUNC_START(__tdx_module_call) 73 + FRAME_BEGIN 74 + TDX_MODULE_CALL host=0 75 + FRAME_END 76 + RET 77 + SYM_FUNC_END(__tdx_module_call) 78 + 79 + /* 80 + * __tdx_hypercall() - Make hypercalls to a TDX VMM using TDVMCALL leaf 81 + * of TDCALL instruction 82 + * 83 + * Transforms values in function call argument struct tdx_hypercall_args @args 84 + * into the TDCALL register ABI. After TDCALL operation, VMM output is saved 85 + * back in @args. 86 + * 87 + *------------------------------------------------------------------------- 88 + * TD VMCALL ABI: 89 + *------------------------------------------------------------------------- 90 + * 91 + * Input Registers: 92 + * 93 + * RAX - TDCALL instruction leaf number (0 - TDG.VP.VMCALL) 94 + * RCX - BITMAP which controls which part of TD Guest GPR 95 + * is passed as-is to the VMM and back. 96 + * R10 - Set 0 to indicate TDCALL follows standard TDX ABI 97 + * specification. Non zero value indicates vendor 98 + * specific ABI. 99 + * R11 - VMCALL sub function number 100 + * RBX, RBP, RDI, RSI - Used to pass VMCALL sub function specific arguments. 101 + * R8-R9, R12-R15 - Same as above. 102 + * 103 + * Output Registers: 104 + * 105 + * RAX - TDCALL instruction status (Not related to hypercall 106 + * output). 107 + * R10 - Hypercall output error code. 108 + * R11-R15 - Hypercall sub function specific output values. 
109 + * 110 + *------------------------------------------------------------------------- 111 + * 112 + * __tdx_hypercall() function ABI: 113 + * 114 + * @args (RDI) - struct tdx_hypercall_args for input and output 115 + * @flags (RSI) - TDX_HCALL_* flags 116 + * 117 + * On successful completion, return the hypercall error code. 118 + */ 119 + SYM_FUNC_START(__tdx_hypercall) 120 + FRAME_BEGIN 121 + 122 + /* Save callee-saved GPRs as mandated by the x86_64 ABI */ 123 + push %r15 124 + push %r14 125 + push %r13 126 + push %r12 127 + 128 + /* Mangle function call ABI into TDCALL ABI: */ 129 + /* Set TDCALL leaf ID (TDVMCALL (0)) in RAX */ 130 + xor %eax, %eax 131 + 132 + /* Copy hypercall registers from arg struct: */ 133 + movq TDX_HYPERCALL_r10(%rdi), %r10 134 + movq TDX_HYPERCALL_r11(%rdi), %r11 135 + movq TDX_HYPERCALL_r12(%rdi), %r12 136 + movq TDX_HYPERCALL_r13(%rdi), %r13 137 + movq TDX_HYPERCALL_r14(%rdi), %r14 138 + movq TDX_HYPERCALL_r15(%rdi), %r15 139 + 140 + movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx 141 + 142 + /* 143 + * For the idle loop STI needs to be called directly before the TDCALL 144 + * that enters idle (EXIT_REASON_HLT case). STI instruction enables 145 + * interrupts only one instruction later. If there is a window between 146 + * STI and the instruction that emulates the HALT state, there is a 147 + * chance for interrupts to happen in this window, which can delay the 148 + * HLT operation indefinitely. Since this is not the desired 149 + * result, conditionally call STI before TDCALL. 150 + */ 151 + testq $TDX_HCALL_ISSUE_STI, %rsi 152 + jz .Lskip_sti 153 + sti 154 + .Lskip_sti: 155 + tdcall 156 + 157 + /* 158 + * RAX!=0 indicates a failure of the TDVMCALL mechanism itself and that 159 + * something has gone horribly wrong with the TDX module. 160 + * 161 + * The return status of the hypercall operation is in a separate 162 + * register (in R10). Hypercall errors are a part of normal operation 163 + * and are handled by callers. 
164 + */ 165 + testq %rax, %rax 166 + jne .Lpanic 167 + 168 + /* TDVMCALL leaf return code is in R10 */ 169 + movq %r10, %rax 170 + 171 + /* Copy hypercall result registers to arg struct if needed */ 172 + testq $TDX_HCALL_HAS_OUTPUT, %rsi 173 + jz .Lout 174 + 175 + movq %r10, TDX_HYPERCALL_r10(%rdi) 176 + movq %r11, TDX_HYPERCALL_r11(%rdi) 177 + movq %r12, TDX_HYPERCALL_r12(%rdi) 178 + movq %r13, TDX_HYPERCALL_r13(%rdi) 179 + movq %r14, TDX_HYPERCALL_r14(%rdi) 180 + movq %r15, TDX_HYPERCALL_r15(%rdi) 181 + .Lout: 182 + /* 183 + * Zero out registers exposed to the VMM to avoid speculative execution 184 + * with VMM-controlled values. This needs to include all registers 185 + * present in TDVMCALL_EXPOSE_REGS_MASK (except R12-R15). R12-R15 186 + * context will be restored. 187 + */ 188 + xor %r10d, %r10d 189 + xor %r11d, %r11d 190 + 191 + /* Restore callee-saved GPRs as mandated by the x86_64 ABI */ 192 + pop %r12 193 + pop %r13 194 + pop %r14 195 + pop %r15 196 + 197 + FRAME_END 198 + 199 + RET 200 + .Lpanic: 201 + call __tdx_hypercall_failed 202 + /* __tdx_hypercall_failed never returns */ 203 + REACHABLE 204 + jmp .Lpanic 205 + SYM_FUNC_END(__tdx_hypercall)
+692
arch/x86/coco/tdx/tdx.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* Copyright (C) 2021-2022 Intel Corporation */ 3 + 4 + #undef pr_fmt 5 + #define pr_fmt(fmt) "tdx: " fmt 6 + 7 + #include <linux/cpufeature.h> 8 + #include <asm/coco.h> 9 + #include <asm/tdx.h> 10 + #include <asm/vmx.h> 11 + #include <asm/insn.h> 12 + #include <asm/insn-eval.h> 13 + #include <asm/pgtable.h> 14 + 15 + /* TDX module Call Leaf IDs */ 16 + #define TDX_GET_INFO 1 17 + #define TDX_GET_VEINFO 3 18 + #define TDX_ACCEPT_PAGE 6 19 + 20 + /* TDX hypercall Leaf IDs */ 21 + #define TDVMCALL_MAP_GPA 0x10001 22 + 23 + /* MMIO direction */ 24 + #define EPT_READ 0 25 + #define EPT_WRITE 1 26 + 27 + /* Port I/O direction */ 28 + #define PORT_READ 0 29 + #define PORT_WRITE 1 30 + 31 + /* See Exit Qualification for I/O Instructions in VMX documentation */ 32 + #define VE_IS_IO_IN(e) ((e) & BIT(3)) 33 + #define VE_GET_IO_SIZE(e) (((e) & GENMASK(2, 0)) + 1) 34 + #define VE_GET_PORT_NUM(e) ((e) >> 16) 35 + #define VE_IS_IO_STRING(e) ((e) & BIT(4)) 36 + 37 + /* 38 + * Wrapper for standard use of __tdx_hypercall with no output aside from 39 + * return code. 40 + */ 41 + static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15) 42 + { 43 + struct tdx_hypercall_args args = { 44 + .r10 = TDX_HYPERCALL_STANDARD, 45 + .r11 = fn, 46 + .r12 = r12, 47 + .r13 = r13, 48 + .r14 = r14, 49 + .r15 = r15, 50 + }; 51 + 52 + return __tdx_hypercall(&args, 0); 53 + } 54 + 55 + /* Called from __tdx_hypercall() for unrecoverable failure */ 56 + void __tdx_hypercall_failed(void) 57 + { 58 + panic("TDVMCALL failed. TDX module bug?"); 59 + } 60 + 61 + /* 62 + * The TDG.VP.VMCALL-Instruction-execution sub-functions are defined 63 + * independently from but are currently matched 1:1 with VMX EXIT_REASONs. 64 + * Reusing the KVM EXIT_REASON macros makes it easier to connect the host and 65 + * guest sides of these calls. 
66 + */ 67 + static u64 hcall_func(u64 exit_reason) 68 + { 69 + return exit_reason; 70 + } 71 + 72 + #ifdef CONFIG_KVM_GUEST 73 + long tdx_kvm_hypercall(unsigned int nr, unsigned long p1, unsigned long p2, 74 + unsigned long p3, unsigned long p4) 75 + { 76 + struct tdx_hypercall_args args = { 77 + .r10 = nr, 78 + .r11 = p1, 79 + .r12 = p2, 80 + .r13 = p3, 81 + .r14 = p4, 82 + }; 83 + 84 + return __tdx_hypercall(&args, 0); 85 + } 86 + EXPORT_SYMBOL_GPL(tdx_kvm_hypercall); 87 + #endif 88 + 89 + /* 90 + * Used for TDX guests to make calls directly to the TD module. This 91 + * should only be used for calls that have no legitimate reason to fail 92 + * or where the kernel can not survive the call failing. 93 + */ 94 + static inline void tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9, 95 + struct tdx_module_output *out) 96 + { 97 + if (__tdx_module_call(fn, rcx, rdx, r8, r9, out)) 98 + panic("TDCALL %lld failed (Buggy TDX module!)\n", fn); 99 + } 100 + 101 + static u64 get_cc_mask(void) 102 + { 103 + struct tdx_module_output out; 104 + unsigned int gpa_width; 105 + 106 + /* 107 + * TDINFO TDX module call is used to get the TD execution environment 108 + * information like GPA width, number of available vcpus, debug mode 109 + * information, etc. More details about the ABI can be found in TDX 110 + * Guest-Host-Communication Interface (GHCI), section 2.4.2 TDCALL 111 + * [TDG.VP.INFO]. 112 + * 113 + * The GPA width that comes out of this call is critical. TDX guests 114 + * can not meaningfully run without it. 115 + */ 116 + tdx_module_call(TDX_GET_INFO, 0, 0, 0, 0, &out); 117 + 118 + gpa_width = out.rcx & GENMASK(5, 0); 119 + 120 + /* 121 + * The highest bit of a guest physical address is the "sharing" bit. 122 + * Set it for shared pages and clear it for private pages. 
123 + */ 124 + return BIT_ULL(gpa_width - 1); 125 + } 126 + 127 + static u64 __cpuidle __halt(const bool irq_disabled, const bool do_sti) 128 + { 129 + struct tdx_hypercall_args args = { 130 + .r10 = TDX_HYPERCALL_STANDARD, 131 + .r11 = hcall_func(EXIT_REASON_HLT), 132 + .r12 = irq_disabled, 133 + }; 134 + 135 + /* 136 + * Emulate HLT operation via hypercall. More info about ABI 137 + * can be found in TDX Guest-Host-Communication Interface 138 + * (GHCI), section 3.8 TDG.VP.VMCALL<Instruction.HLT>. 139 + * 140 + * The VMM uses the "IRQ disabled" param to understand IRQ 141 + * enabled status (RFLAGS.IF) of the TD guest and to determine 142 + * whether or not it should schedule the halted vCPU if an 143 + * IRQ becomes pending. E.g. if IRQs are disabled, the VMM 144 + * can keep the vCPU in virtual HLT, even if an IRQ is 145 + * pending, without hanging/breaking the guest. 146 + */ 147 + return __tdx_hypercall(&args, do_sti ? TDX_HCALL_ISSUE_STI : 0); 148 + } 149 + 150 + static bool handle_halt(void) 151 + { 152 + /* 153 + * Since non safe halt is mainly used in CPU offlining 154 + * and the guest will always stay in the halt state, don't 155 + * call the STI instruction (set do_sti as false). 156 + */ 157 + const bool irq_disabled = irqs_disabled(); 158 + const bool do_sti = false; 159 + 160 + if (__halt(irq_disabled, do_sti)) 161 + return false; 162 + 163 + return true; 164 + } 165 + 166 + void __cpuidle tdx_safe_halt(void) 167 + { 168 + /* 169 + * For do_sti=true case, __tdx_hypercall() function enables 170 + * interrupts using the STI instruction before the TDCALL. So 171 + * set irq_disabled as false. 172 + */ 173 + const bool irq_disabled = false; 174 + const bool do_sti = true; 175 + 176 + /* 177 + * Use WARN_ONCE() to report the failure. 
178 + */ 179 + if (__halt(irq_disabled, do_sti)) 180 + WARN_ONCE(1, "HLT instruction emulation failed\n"); 181 + } 182 + 183 + static bool read_msr(struct pt_regs *regs) 184 + { 185 + struct tdx_hypercall_args args = { 186 + .r10 = TDX_HYPERCALL_STANDARD, 187 + .r11 = hcall_func(EXIT_REASON_MSR_READ), 188 + .r12 = regs->cx, 189 + }; 190 + 191 + /* 192 + * Emulate the MSR read via hypercall. More info about ABI 193 + * can be found in TDX Guest-Host-Communication Interface 194 + * (GHCI), section titled "TDG.VP.VMCALL<Instruction.RDMSR>". 195 + */ 196 + if (__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT)) 197 + return false; 198 + 199 + regs->ax = lower_32_bits(args.r11); 200 + regs->dx = upper_32_bits(args.r11); 201 + return true; 202 + } 203 + 204 + static bool write_msr(struct pt_regs *regs) 205 + { 206 + struct tdx_hypercall_args args = { 207 + .r10 = TDX_HYPERCALL_STANDARD, 208 + .r11 = hcall_func(EXIT_REASON_MSR_WRITE), 209 + .r12 = regs->cx, 210 + .r13 = (u64)regs->dx << 32 | regs->ax, 211 + }; 212 + 213 + /* 214 + * Emulate the MSR write via hypercall. More info about ABI 215 + * can be found in TDX Guest-Host-Communication Interface 216 + * (GHCI) section titled "TDG.VP.VMCALL<Instruction.WRMSR>". 217 + */ 218 + return !__tdx_hypercall(&args, 0); 219 + } 220 + 221 + static bool handle_cpuid(struct pt_regs *regs) 222 + { 223 + struct tdx_hypercall_args args = { 224 + .r10 = TDX_HYPERCALL_STANDARD, 225 + .r11 = hcall_func(EXIT_REASON_CPUID), 226 + .r12 = regs->ax, 227 + .r13 = regs->cx, 228 + }; 229 + 230 + /* 231 + * Only allow VMM to control range reserved for hypervisor 232 + * communication. 233 + * 234 + * Return all-zeros for any CPUID outside the range. It matches CPU 235 + * behaviour for non-supported leaf. 236 + */ 237 + if (regs->ax < 0x40000000 || regs->ax > 0x4FFFFFFF) { 238 + regs->ax = regs->bx = regs->cx = regs->dx = 0; 239 + return true; 240 + } 241 + 242 + /* 243 + * Emulate the CPUID instruction via a hypercall. 
More info about 244 + * ABI can be found in TDX Guest-Host-Communication Interface 245 + * (GHCI), section titled "VP.VMCALL<Instruction.CPUID>". 246 + */ 247 + if (__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT)) 248 + return false; 249 + 250 + /* 251 + * As per TDX GHCI CPUID ABI, r12-r15 registers contain contents of 252 + * EAX, EBX, ECX, EDX registers after the CPUID instruction execution. 253 + * So copy the register contents back to pt_regs. 254 + */ 255 + regs->ax = args.r12; 256 + regs->bx = args.r13; 257 + regs->cx = args.r14; 258 + regs->dx = args.r15; 259 + 260 + return true; 261 + } 262 + 263 + static bool mmio_read(int size, unsigned long addr, unsigned long *val) 264 + { 265 + struct tdx_hypercall_args args = { 266 + .r10 = TDX_HYPERCALL_STANDARD, 267 + .r11 = hcall_func(EXIT_REASON_EPT_VIOLATION), 268 + .r12 = size, 269 + .r13 = EPT_READ, 270 + .r14 = addr, 271 + .r15 = *val, 272 + }; 273 + 274 + if (__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT)) 275 + return false; 276 + *val = args.r11; 277 + return true; 278 + } 279 + 280 + static bool mmio_write(int size, unsigned long addr, unsigned long val) 281 + { 282 + return !_tdx_hypercall(hcall_func(EXIT_REASON_EPT_VIOLATION), size, 283 + EPT_WRITE, addr, val); 284 + } 285 + 286 + static bool handle_mmio(struct pt_regs *regs, struct ve_info *ve) 287 + { 288 + char buffer[MAX_INSN_SIZE]; 289 + unsigned long *reg, val; 290 + struct insn insn = {}; 291 + enum mmio_type mmio; 292 + int size, extend_size; 293 + u8 extend_val = 0; 294 + 295 + /* Only in-kernel MMIO is supported */ 296 + if (WARN_ON_ONCE(user_mode(regs))) 297 + return false; 298 + 299 + if (copy_from_kernel_nofault(buffer, (void *)regs->ip, MAX_INSN_SIZE)) 300 + return false; 301 + 302 + if (insn_decode(&insn, buffer, MAX_INSN_SIZE, INSN_MODE_64)) 303 + return false; 304 + 305 + mmio = insn_decode_mmio(&insn, &size); 306 + if (WARN_ON_ONCE(mmio == MMIO_DECODE_FAILED)) 307 + return false; 308 + 309 + if (mmio != MMIO_WRITE_IMM && mmio != MMIO_MOVS) 
{ 310 + reg = insn_get_modrm_reg_ptr(&insn, regs); 311 + if (!reg) 312 + return false; 313 + } 314 + 315 + ve->instr_len = insn.length; 316 + 317 + /* Handle writes first */ 318 + switch (mmio) { 319 + case MMIO_WRITE: 320 + memcpy(&val, reg, size); 321 + return mmio_write(size, ve->gpa, val); 322 + case MMIO_WRITE_IMM: 323 + val = insn.immediate.value; 324 + return mmio_write(size, ve->gpa, val); 325 + case MMIO_READ: 326 + case MMIO_READ_ZERO_EXTEND: 327 + case MMIO_READ_SIGN_EXTEND: 328 + /* Reads are handled below */ 329 + break; 330 + case MMIO_MOVS: 331 + case MMIO_DECODE_FAILED: 332 + /* 333 + * MMIO was accessed with an instruction that could not be 334 + * decoded or handled properly. It was likely not using io.h 335 + * helpers or accessed MMIO accidentally. 336 + */ 337 + return false; 338 + default: 339 + WARN_ONCE(1, "Unknown insn_decode_mmio() decode value?"); 340 + return false; 341 + } 342 + 343 + /* Handle reads */ 344 + if (!mmio_read(size, ve->gpa, &val)) 345 + return false; 346 + 347 + switch (mmio) { 348 + case MMIO_READ: 349 + /* Zero-extend for 32-bit operation */ 350 + extend_size = size == 4 ? 
sizeof(*reg) : 0; 351 + break; 352 + case MMIO_READ_ZERO_EXTEND: 353 + /* Zero extend based on operand size */ 354 + extend_size = insn.opnd_bytes; 355 + break; 356 + case MMIO_READ_SIGN_EXTEND: 357 + /* Sign extend based on operand size */ 358 + extend_size = insn.opnd_bytes; 359 + if (size == 1 && val & BIT(7)) 360 + extend_val = 0xFF; 361 + else if (size > 1 && val & BIT(15)) 362 + extend_val = 0xFF; 363 + break; 364 + default: 365 + /* All other cases has to be covered with the first switch() */ 366 + WARN_ON_ONCE(1); 367 + return false; 368 + } 369 + 370 + if (extend_size) 371 + memset(reg, extend_val, extend_size); 372 + memcpy(reg, &val, size); 373 + return true; 374 + } 375 + 376 + static bool handle_in(struct pt_regs *regs, int size, int port) 377 + { 378 + struct tdx_hypercall_args args = { 379 + .r10 = TDX_HYPERCALL_STANDARD, 380 + .r11 = hcall_func(EXIT_REASON_IO_INSTRUCTION), 381 + .r12 = size, 382 + .r13 = PORT_READ, 383 + .r14 = port, 384 + }; 385 + u64 mask = GENMASK(BITS_PER_BYTE * size, 0); 386 + bool success; 387 + 388 + /* 389 + * Emulate the I/O read via hypercall. More info about ABI can be found 390 + * in TDX Guest-Host-Communication Interface (GHCI) section titled 391 + * "TDG.VP.VMCALL<Instruction.IO>". 392 + */ 393 + success = !__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT); 394 + 395 + /* Update part of the register affected by the emulated instruction */ 396 + regs->ax &= ~mask; 397 + if (success) 398 + regs->ax |= args.r11 & mask; 399 + 400 + return success; 401 + } 402 + 403 + static bool handle_out(struct pt_regs *regs, int size, int port) 404 + { 405 + u64 mask = GENMASK(BITS_PER_BYTE * size, 0); 406 + 407 + /* 408 + * Emulate the I/O write via hypercall. More info about ABI can be found 409 + * in TDX Guest-Host-Communication Interface (GHCI) section titled 410 + * "TDG.VP.VMCALL<Instruction.IO>". 
411 + */ 412 + return !_tdx_hypercall(hcall_func(EXIT_REASON_IO_INSTRUCTION), size, 413 + PORT_WRITE, port, regs->ax & mask); 414 + } 415 + 416 + /* 417 + * Emulate I/O using hypercall. 418 + * 419 + * Assumes the IO instruction was using ax, which is enforced 420 + * by the standard io.h macros. 421 + * 422 + * Return True on success or False on failure. 423 + */ 424 + static bool handle_io(struct pt_regs *regs, u32 exit_qual) 425 + { 426 + int size, port; 427 + bool in; 428 + 429 + if (VE_IS_IO_STRING(exit_qual)) 430 + return false; 431 + 432 + in = VE_IS_IO_IN(exit_qual); 433 + size = VE_GET_IO_SIZE(exit_qual); 434 + port = VE_GET_PORT_NUM(exit_qual); 435 + 436 + 437 + if (in) 438 + return handle_in(regs, size, port); 439 + else 440 + return handle_out(regs, size, port); 441 + } 442 + 443 + /* 444 + * Early #VE exception handler. Only handles a subset of port I/O. 445 + * Intended only for earlyprintk. If failed, return false. 446 + */ 447 + __init bool tdx_early_handle_ve(struct pt_regs *regs) 448 + { 449 + struct ve_info ve; 450 + 451 + tdx_get_ve_info(&ve); 452 + 453 + if (ve.exit_reason != EXIT_REASON_IO_INSTRUCTION) 454 + return false; 455 + 456 + return handle_io(regs, ve.exit_qual); 457 + } 458 + 459 + void tdx_get_ve_info(struct ve_info *ve) 460 + { 461 + struct tdx_module_output out; 462 + 463 + /* 464 + * Called during #VE handling to retrieve the #VE info from the 465 + * TDX module. 466 + * 467 + * This has to be called early in #VE handling. A "nested" #VE which 468 + * occurs before this will raise a #DF and is not recoverable. 469 + * 470 + * The call retrieves the #VE info from the TDX module, which also 471 + * clears the "#VE valid" flag. This must be done before anything else 472 + * because any #VE that occurs while the valid flag is set will lead to 473 + * #DF. 474 + * 475 + * Note, the TDX module treats virtual NMIs as inhibited if the #VE 476 + * valid flag is set. It means that NMI=>#VE will not result in a #DF. 
477 + */ 478 + tdx_module_call(TDX_GET_VEINFO, 0, 0, 0, 0, &out); 479 + 480 + /* Transfer the output parameters */ 481 + ve->exit_reason = out.rcx; 482 + ve->exit_qual = out.rdx; 483 + ve->gla = out.r8; 484 + ve->gpa = out.r9; 485 + ve->instr_len = lower_32_bits(out.r10); 486 + ve->instr_info = upper_32_bits(out.r10); 487 + } 488 + 489 + /* Handle the user initiated #VE */ 490 + static bool virt_exception_user(struct pt_regs *regs, struct ve_info *ve) 491 + { 492 + switch (ve->exit_reason) { 493 + case EXIT_REASON_CPUID: 494 + return handle_cpuid(regs); 495 + default: 496 + pr_warn("Unexpected #VE: %lld\n", ve->exit_reason); 497 + return false; 498 + } 499 + } 500 + 501 + /* Handle the kernel #VE */ 502 + static bool virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve) 503 + { 504 + switch (ve->exit_reason) { 505 + case EXIT_REASON_HLT: 506 + return handle_halt(); 507 + case EXIT_REASON_MSR_READ: 508 + return read_msr(regs); 509 + case EXIT_REASON_MSR_WRITE: 510 + return write_msr(regs); 511 + case EXIT_REASON_CPUID: 512 + return handle_cpuid(regs); 513 + case EXIT_REASON_EPT_VIOLATION: 514 + return handle_mmio(regs, ve); 515 + case EXIT_REASON_IO_INSTRUCTION: 516 + return handle_io(regs, ve->exit_qual); 517 + default: 518 + pr_warn("Unexpected #VE: %lld\n", ve->exit_reason); 519 + return false; 520 + } 521 + } 522 + 523 + bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve) 524 + { 525 + bool ret; 526 + 527 + if (user_mode(regs)) 528 + ret = virt_exception_user(regs, ve); 529 + else 530 + ret = virt_exception_kernel(regs, ve); 531 + 532 + /* After successful #VE handling, move the IP */ 533 + if (ret) 534 + regs->ip += ve->instr_len; 535 + 536 + return ret; 537 + } 538 + 539 + static bool tdx_tlb_flush_required(bool private) 540 + { 541 + /* 542 + * TDX guest is responsible for flushing TLB on private->shared 543 + * transition. VMM is responsible for flushing on shared->private. 
544 + * 545 + * The VMM _can't_ flush private addresses as it can't generate PAs 546 + * with the guest's HKID. Shared memory isn't subject to integrity 547 + * checking, i.e. the VMM doesn't need to flush for its own protection. 548 + * 549 + * There's no need to flush when converting from shared to private, 550 + * as flushing is the VMM's responsibility in this case, e.g. it must 551 + * flush to avoid integrity failures in the face of a buggy or 552 + * malicious guest. 553 + */ 554 + return !private; 555 + } 556 + 557 + static bool tdx_cache_flush_required(void) 558 + { 559 + /* 560 + * AMD SME/SEV can avoid cache flushing if HW enforces cache coherence. 561 + * TDX doesn't have such capability. 562 + * 563 + * Flush cache unconditionally. 564 + */ 565 + return true; 566 + } 567 + 568 + static bool try_accept_one(phys_addr_t *start, unsigned long len, 569 + enum pg_level pg_level) 570 + { 571 + unsigned long accept_size = page_level_size(pg_level); 572 + u64 tdcall_rcx; 573 + u8 page_size; 574 + 575 + if (!IS_ALIGNED(*start, accept_size)) 576 + return false; 577 + 578 + if (len < accept_size) 579 + return false; 580 + 581 + /* 582 + * Pass the page physical address to the TDX module to accept the 583 + * pending, private page. 584 + * 585 + * Bits 2:0 of RCX encode page size: 0 - 4K, 1 - 2M, 2 - 1G. 586 + */ 587 + switch (pg_level) { 588 + case PG_LEVEL_4K: 589 + page_size = 0; 590 + break; 591 + case PG_LEVEL_2M: 592 + page_size = 1; 593 + break; 594 + case PG_LEVEL_1G: 595 + page_size = 2; 596 + break; 597 + default: 598 + return false; 599 + } 600 + 601 + tdcall_rcx = *start | page_size; 602 + if (__tdx_module_call(TDX_ACCEPT_PAGE, tdcall_rcx, 0, 0, 0, NULL)) 603 + return false; 604 + 605 + *start += accept_size; 606 + return true; 607 + } 608 + 609 + /* 610 + * Inform the VMM of the guest's intent for this physical page: shared with 611 + * the VMM or private to the guest. The VMM is expected to change its mapping 612 + * of the page in response. 
613 + */ 614 + static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc) 615 + { 616 + phys_addr_t start = __pa(vaddr); 617 + phys_addr_t end = __pa(vaddr + numpages * PAGE_SIZE); 618 + 619 + if (!enc) { 620 + /* Set the shared (decrypted) bits: */ 621 + start |= cc_mkdec(0); 622 + end |= cc_mkdec(0); 623 + } 624 + 625 + /* 626 + * Notify the VMM about page mapping conversion. More info about ABI 627 + * can be found in TDX Guest-Host-Communication Interface (GHCI), 628 + * section "TDG.VP.VMCALL<MapGPA>" 629 + */ 630 + if (_tdx_hypercall(TDVMCALL_MAP_GPA, start, end - start, 0, 0)) 631 + return false; 632 + 633 + /* private->shared conversion requires only MapGPA call */ 634 + if (!enc) 635 + return true; 636 + 637 + /* 638 + * For shared->private conversion, accept the page using 639 + * TDX_ACCEPT_PAGE TDX module call. 640 + */ 641 + while (start < end) { 642 + unsigned long len = end - start; 643 + 644 + /* 645 + * Try larger accepts first. It gives chance to VMM to keep 646 + * 1G/2M SEPT entries where possible and speeds up process by 647 + * cutting number of hypercalls (if successful). 648 + */ 649 + 650 + if (try_accept_one(&start, len, PG_LEVEL_1G)) 651 + continue; 652 + 653 + if (try_accept_one(&start, len, PG_LEVEL_2M)) 654 + continue; 655 + 656 + if (!try_accept_one(&start, len, PG_LEVEL_4K)) 657 + return false; 658 + } 659 + 660 + return true; 661 + } 662 + 663 + void __init tdx_early_init(void) 664 + { 665 + u64 cc_mask; 666 + u32 eax, sig[3]; 667 + 668 + cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax, &sig[0], &sig[2], &sig[1]); 669 + 670 + if (memcmp(TDX_IDENT, sig, sizeof(sig))) 671 + return; 672 + 673 + setup_force_cpu_cap(X86_FEATURE_TDX_GUEST); 674 + 675 + cc_set_vendor(CC_VENDOR_INTEL); 676 + cc_mask = get_cc_mask(); 677 + cc_set_mask(cc_mask); 678 + 679 + /* 680 + * All bits above GPA width are reserved and kernel treats shared bit 681 + * as flag, not as part of physical address. 
682 + * 683 + * Adjust physical mask to only cover valid GPA bits. 684 + */ 685 + physical_mask &= cc_mask - 1; 686 + 687 + x86_platform.guest.enc_cache_flush_required = tdx_cache_flush_required; 688 + x86_platform.guest.enc_tlb_flush_required = tdx_tlb_flush_required; 689 + x86_platform.guest.enc_status_change_finish = tdx_enc_status_changed; 690 + 691 + pr_info("Guest detected\n"); 692 + }
+13 -1
arch/x86/include/asm/acenv.h
···
13 13
14 14  /* Asm macros */
15 15
16     -  #define ACPI_FLUSH_CPU_CACHE() wbinvd()
16     +  /*
17     +   * ACPI_FLUSH_CPU_CACHE() flushes caches on entering sleep states.
18     +   * It is required to prevent data loss.
19     +   *
20     +   * While running inside virtual machine, the kernel can bypass cache flushing.
21     +   * Changing sleep state in a virtual machine doesn't affect the host system
22     +   * sleep state and cannot lead to data loss.
23     +   */
24     +  #define ACPI_FLUSH_CPU_CACHE()	\
25     +  do {				\
26     +  	if (!cpu_feature_enabled(X86_FEATURE_HYPERVISOR))	\
27     +  		wbinvd();	\
28     +  } while (0)
17 29
18 30  int __acpi_acquire_global_lock(unsigned int *lock);
19 31  int __acpi_release_global_lock(unsigned int *lock);
+7
arch/x86/include/asm/apic.h
···
328 328
329 329  	/* wakeup_secondary_cpu */
330 330  	int (*wakeup_secondary_cpu)(int apicid, unsigned long start_eip);
331     +  	/* wakeup secondary CPU using 64-bit wakeup point */
332     +  	int (*wakeup_secondary_cpu_64)(int apicid, unsigned long start_eip);
331 333
332 334  	void (*inquire_remote_apic)(int apicid);
333 335
···
489 487
490 488  	return apic->get_apic_id(reg);
491 489  }
490     +
491     +  #ifdef CONFIG_X86_64
492     +  typedef int (*wakeup_cpu_handler)(int apicid, unsigned long start_eip);
493     +  extern void acpi_wake_cpu_handler_update(wakeup_cpu_handler handler);
494     +  #endif
492 495
493 496  extern int default_apic_id_valid(u32 apicid);
494 497  extern int default_acpi_madt_oem_check(char *, char *);
+1
arch/x86/include/asm/cpufeatures.h
···
238 238  #define X86_FEATURE_VMW_VMMCALL	( 8*32+19) /* "" VMware prefers VMMCALL hypercall instruction */
239 239  #define X86_FEATURE_PVUNLOCK	( 8*32+20) /* "" PV unlock function */
240 240  #define X86_FEATURE_VCPUPREEMPT	( 8*32+21) /* "" PV vcpu_is_preempted function */
241     +  #define X86_FEATURE_TDX_GUEST	( 8*32+22) /* Intel Trust Domain Extensions Guest */
241 242
242 243  /* Intel-defined CPU features, CPUID level 0x00000007:0 (EBX), word 9 */
243 244  #define X86_FEATURE_FSGSBASE	( 9*32+ 0) /* RDFSBASE, WRFSBASE, RDGSBASE, WRGSBASE instructions*/
+7 -1
arch/x86/include/asm/disabled-features.h
···
68 68  # define DISABLE_SGX	(1 << (X86_FEATURE_SGX & 31))
69 69  #endif
70 70
71     +  #ifdef CONFIG_INTEL_TDX_GUEST
72     +  # define DISABLE_TDX_GUEST	0
73     +  #else
74     +  # define DISABLE_TDX_GUEST	(1 << (X86_FEATURE_TDX_GUEST & 31))
75     +  #endif
76     +
71 77  /*
72 78   * Make sure to add features to the correct mask
73 79   */
···
85 79  #define DISABLED_MASK5	0
86 80  #define DISABLED_MASK6	0
87 81  #define DISABLED_MASK7	(DISABLE_PTI)
88     -  #define DISABLED_MASK8	0
82     +  #define DISABLED_MASK8	(DISABLE_TDX_GUEST)
89 83  #define DISABLED_MASK9	(DISABLE_SMAP|DISABLE_SGX)
90 84  #define DISABLED_MASK10	0
91 85  #define DISABLED_MASK11	0
+4
arch/x86/include/asm/idtentry.h
···
632 632  DECLARE_IDTENTRY_RAW(X86_TRAP_OTHER, exc_xen_unknown_trap);
633 633  #endif
634 634
635     +  #ifdef CONFIG_INTEL_TDX_GUEST
636     +  DECLARE_IDTENTRY(X86_TRAP_VE, exc_virtualization_exception);
637     +  #endif
638     +
635 639  /* Device interrupts common/spurious */
636 640  DECLARE_IDTENTRY_IRQ(X86_TRAP_OTHER, common_interrupt);
637 641  #ifdef CONFIG_X86_LOCAL_APIC
+12 -30
arch/x86/include/asm/io.h
··· 44 44 #include <asm/page.h> 45 45 #include <asm/early_ioremap.h> 46 46 #include <asm/pgtable_types.h> 47 + #include <asm/shared/io.h> 47 48 48 49 #define build_mmio_read(name, size, type, reg, barrier) \ 49 50 static inline type name(const volatile void __iomem *addr) \ ··· 257 256 #endif 258 257 259 258 #define BUILDIO(bwl, bw, type) \ 260 - static inline void out##bwl(unsigned type value, int port) \ 261 - { \ 262 - asm volatile("out" #bwl " %" #bw "0, %w1" \ 263 - : : "a"(value), "Nd"(port)); \ 264 - } \ 265 - \ 266 - static inline unsigned type in##bwl(int port) \ 267 - { \ 268 - unsigned type value; \ 269 - asm volatile("in" #bwl " %w1, %" #bw "0" \ 270 - : "=a"(value) : "Nd"(port)); \ 271 - return value; \ 272 - } \ 273 - \ 274 - static inline void out##bwl##_p(unsigned type value, int port) \ 259 + static inline void out##bwl##_p(type value, u16 port) \ 275 260 { \ 276 261 out##bwl(value, port); \ 277 262 slow_down_io(); \ 278 263 } \ 279 264 \ 280 - static inline unsigned type in##bwl##_p(int port) \ 265 + static inline type in##bwl##_p(u16 port) \ 281 266 { \ 282 - unsigned type value = in##bwl(port); \ 267 + type value = in##bwl(port); \ 283 268 slow_down_io(); \ 284 269 return value; \ 285 270 } \ 286 271 \ 287 - static inline void outs##bwl(int port, const void *addr, unsigned long count) \ 272 + static inline void outs##bwl(u16 port, const void *addr, unsigned long count) \ 288 273 { \ 289 274 if (cc_platform_has(CC_ATTR_GUEST_UNROLL_STRING_IO)) { \ 290 - unsigned type *value = (unsigned type *)addr; \ 275 + type *value = (type *)addr; \ 291 276 while (count) { \ 292 277 out##bwl(*value, port); \ 293 278 value++; \ ··· 286 299 } \ 287 300 } \ 288 301 \ 289 - static inline void ins##bwl(int port, void *addr, unsigned long count) \ 302 + static inline void ins##bwl(u16 port, void *addr, unsigned long count) \ 290 303 { \ 291 304 if (cc_platform_has(CC_ATTR_GUEST_UNROLL_STRING_IO)) { \ 292 - unsigned type *value = (unsigned type *)addr; \ 305 + type 
*value = (type *)addr; \ 293 306 while (count) { \ 294 307 *value = in##bwl(port); \ 295 308 value++; \ ··· 302 315 } \ 303 316 } 304 317 305 - BUILDIO(b, b, char) 306 - BUILDIO(w, w, short) 307 - BUILDIO(l, , int) 318 + BUILDIO(b, b, u8) 319 + BUILDIO(w, w, u16) 320 + BUILDIO(l, , u32) 321 + #undef BUILDIO 308 322 309 - #define inb inb 310 - #define inw inw 311 - #define inl inl 312 323 #define inb_p inb_p 313 324 #define inw_p inw_p 314 325 #define inl_p inl_p ··· 314 329 #define insw insw 315 330 #define insl insl 316 331 317 - #define outb outb 318 - #define outw outw 319 - #define outl outl 320 332 #define outb_p outb_p 321 333 #define outw_p outw_p 322 334 #define outl_p outl_p
+22
arch/x86/include/asm/kvm_para.h
··· 7 7 #include <linux/interrupt.h> 8 8 #include <uapi/asm/kvm_para.h> 9 9 10 + #include <asm/tdx.h> 11 + 10 12 #ifdef CONFIG_KVM_GUEST 11 13 bool kvm_check_and_clear_guest_paused(void); 12 14 #else ··· 34 32 static inline long kvm_hypercall0(unsigned int nr) 35 33 { 36 34 long ret; 35 + 36 + if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) 37 + return tdx_kvm_hypercall(nr, 0, 0, 0, 0); 38 + 37 39 asm volatile(KVM_HYPERCALL 38 40 : "=a"(ret) 39 41 : "a"(nr) ··· 48 42 static inline long kvm_hypercall1(unsigned int nr, unsigned long p1) 49 43 { 50 44 long ret; 45 + 46 + if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) 47 + return tdx_kvm_hypercall(nr, p1, 0, 0, 0); 48 + 51 49 asm volatile(KVM_HYPERCALL 52 50 : "=a"(ret) 53 51 : "a"(nr), "b"(p1) ··· 63 53 unsigned long p2) 64 54 { 65 55 long ret; 56 + 57 + if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) 58 + return tdx_kvm_hypercall(nr, p1, p2, 0, 0); 59 + 66 60 asm volatile(KVM_HYPERCALL 67 61 : "=a"(ret) 68 62 : "a"(nr), "b"(p1), "c"(p2) ··· 78 64 unsigned long p2, unsigned long p3) 79 65 { 80 66 long ret; 67 + 68 + if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) 69 + return tdx_kvm_hypercall(nr, p1, p2, p3, 0); 70 + 81 71 asm volatile(KVM_HYPERCALL 82 72 : "=a"(ret) 83 73 : "a"(nr), "b"(p1), "c"(p2), "d"(p3) ··· 94 76 unsigned long p4) 95 77 { 96 78 long ret; 79 + 80 + if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) 81 + return tdx_kvm_hypercall(nr, p1, p2, p3, p4); 82 + 97 83 asm volatile(KVM_HYPERCALL 98 84 : "=a"(ret) 99 85 : "a"(nr), "b"(p1), "c"(p2), "d"(p3), "S"(p4)
+3 -3
arch/x86/include/asm/mem_encrypt.h
···
49 49
50 50  void __init mem_encrypt_free_decrypted_mem(void);
51 51
52     -  /* Architecture __weak replacement functions */
53     -  void __init mem_encrypt_init(void);
54     -
55 52  void __init sev_es_init_vc_handling(void);
56 53
57 54  #define __bss_decrypted __section(".bss..decrypted")
···
85 88  #define __bss_decrypted
86 89
87 90  #endif /* CONFIG_AMD_MEM_ENCRYPT */
91     +
92     +  /* Architecture __weak replacement functions */
93     +  void __init mem_encrypt_init(void);
88 94
89 95  /*
90 96   * The __sme_pa() and __sme_pa_nodebug() macros are meant for use when
+1
arch/x86/include/asm/realmode.h
···
25 25  	u32 sev_es_trampoline_start;
26 26  #endif
27 27  #ifdef CONFIG_X86_64
28     +  	u32 trampoline_start64;
28 29  	u32 trampoline_pgd;
29 30  #endif
30 31  	/* ACPI S3 wakeup */
+34
arch/x86/include/asm/shared/io.h
···
1     +  /* SPDX-License-Identifier: GPL-2.0 */
2     +  #ifndef _ASM_X86_SHARED_IO_H
3     +  #define _ASM_X86_SHARED_IO_H
4     +
5     +  #include <linux/types.h>
6     +
7     +  #define BUILDIO(bwl, bw, type) \
8     +  static inline void __out##bwl(type value, u16 port) \
9     +  { \
10     +  	asm volatile("out" #bwl " %" #bw "0, %w1" \
11     +  		     : : "a"(value), "Nd"(port)); \
12     +  } \
13     +  \
14     +  static inline type __in##bwl(u16 port) \
15     +  { \
16     +  	type value; \
17     +  	asm volatile("in" #bwl " %w1, %" #bw "0" \
18     +  		     : "=a"(value) : "Nd"(port)); \
19     +  	return value; \
20     +  }
21     +
22     +  BUILDIO(b, b, u8)
23     +  BUILDIO(w, w, u16)
24     +  BUILDIO(l, , u32)
25     +  #undef BUILDIO
26     +
27     +  #define inb __inb
28     +  #define inw __inw
29     +  #define inl __inl
30     +  #define outb __outb
31     +  #define outw __outw
32     +  #define outl __outl
33     +
34     +  #endif
+40
arch/x86/include/asm/shared/tdx.h
···
1     +  /* SPDX-License-Identifier: GPL-2.0 */
2     +  #ifndef _ASM_X86_SHARED_TDX_H
3     +  #define _ASM_X86_SHARED_TDX_H
4     +
5     +  #include <linux/bits.h>
6     +  #include <linux/types.h>
7     +
8     +  #define TDX_HYPERCALL_STANDARD  0
9     +
10     +  #define TDX_HCALL_HAS_OUTPUT  BIT(0)
11     +  #define TDX_HCALL_ISSUE_STI  BIT(1)
12     +
13     +  #define TDX_CPUID_LEAF_ID  0x21
14     +  #define TDX_IDENT  "IntelTDX    "
15     +
16     +  #ifndef __ASSEMBLY__
17     +
18     +  /*
19     +   * Used in __tdx_hypercall() to pass down and get back registers' values of
20     +   * the TDCALL instruction when requesting services from the VMM.
21     +   *
22     +   * This is a software only structure and not part of the TDX module/VMM ABI.
23     +   */
24     +  struct tdx_hypercall_args {
25     +  	u64 r10;
26     +  	u64 r11;
27     +  	u64 r12;
28     +  	u64 r13;
29     +  	u64 r14;
30     +  	u64 r15;
31     +  };
32     +
33     +  /* Used to request services from the VMM */
34     +  u64 __tdx_hypercall(struct tdx_hypercall_args *args, unsigned long flags);
35     +
36     +  /* Called from __tdx_hypercall() for unrecoverable failure */
37     +  void __tdx_hypercall_failed(void);
38     +
39     +  #endif /* !__ASSEMBLY__ */
40     +  #endif /* _ASM_X86_SHARED_TDX_H */
+91
arch/x86/include/asm/tdx.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + /* Copyright (C) 2021-2022 Intel Corporation */ 3 + #ifndef _ASM_X86_TDX_H 4 + #define _ASM_X86_TDX_H 5 + 6 + #include <linux/init.h> 7 + #include <linux/bits.h> 8 + #include <asm/ptrace.h> 9 + #include <asm/shared/tdx.h> 10 + 11 + /* 12 + * SW-defined error codes. 13 + * 14 + * Bits 47:40 == 0xFF indicate Reserved status code class that never used by 15 + * TDX module. 16 + */ 17 + #define TDX_ERROR _BITUL(63) 18 + #define TDX_SW_ERROR (TDX_ERROR | GENMASK_ULL(47, 40)) 19 + #define TDX_SEAMCALL_VMFAILINVALID (TDX_SW_ERROR | _UL(0xFFFF0000)) 20 + 21 + #ifndef __ASSEMBLY__ 22 + 23 + /* 24 + * Used to gather the output registers values of the TDCALL and SEAMCALL 25 + * instructions when requesting services from the TDX module. 26 + * 27 + * This is a software only structure and not part of the TDX module/VMM ABI. 28 + */ 29 + struct tdx_module_output { 30 + u64 rcx; 31 + u64 rdx; 32 + u64 r8; 33 + u64 r9; 34 + u64 r10; 35 + u64 r11; 36 + }; 37 + 38 + /* 39 + * Used by the #VE exception handler to gather the #VE exception 40 + * info from the TDX module. This is a software only structure 41 + * and not part of the TDX module/VMM ABI. 
42 + */ 43 + struct ve_info { 44 + u64 exit_reason; 45 + u64 exit_qual; 46 + /* Guest Linear (virtual) Address */ 47 + u64 gla; 48 + /* Guest Physical Address */ 49 + u64 gpa; 50 + u32 instr_len; 51 + u32 instr_info; 52 + }; 53 + 54 + #ifdef CONFIG_INTEL_TDX_GUEST 55 + 56 + void __init tdx_early_init(void); 57 + 58 + /* Used to communicate with the TDX module */ 59 + u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9, 60 + struct tdx_module_output *out); 61 + 62 + void tdx_get_ve_info(struct ve_info *ve); 63 + 64 + bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve); 65 + 66 + void tdx_safe_halt(void); 67 + 68 + bool tdx_early_handle_ve(struct pt_regs *regs); 69 + 70 + #else 71 + 72 + static inline void tdx_early_init(void) { }; 73 + static inline void tdx_safe_halt(void) { }; 74 + 75 + static inline bool tdx_early_handle_ve(struct pt_regs *regs) { return false; } 76 + 77 + #endif /* CONFIG_INTEL_TDX_GUEST */ 78 + 79 + #if defined(CONFIG_KVM_GUEST) && defined(CONFIG_INTEL_TDX_GUEST) 80 + long tdx_kvm_hypercall(unsigned int nr, unsigned long p1, unsigned long p2, 81 + unsigned long p3, unsigned long p4); 82 + #else 83 + static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1, 84 + unsigned long p2, unsigned long p3, 85 + unsigned long p4) 86 + { 87 + return -ENODEV; 88 + } 89 + #endif /* CONFIG_INTEL_TDX_GUEST && CONFIG_KVM_GUEST */ 90 + #endif /* !__ASSEMBLY__ */ 91 + #endif /* _ASM_X86_TDX_H */
+92 -1
arch/x86/kernel/acpi/boot.c
··· 65 65 static bool acpi_support_online_capable; 66 66 #endif 67 67 68 + #ifdef CONFIG_X86_64 69 + /* Physical address of the Multiprocessor Wakeup Structure mailbox */ 70 + static u64 acpi_mp_wake_mailbox_paddr; 71 + /* Virtual address of the Multiprocessor Wakeup Structure mailbox */ 72 + static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox; 73 + #endif 74 + 68 75 #ifdef CONFIG_X86_IO_APIC 69 76 /* 70 77 * Locks related to IOAPIC hotplug ··· 343 336 return 0; 344 337 } 345 338 346 - #endif /*CONFIG_X86_LOCAL_APIC */ 339 + #ifdef CONFIG_X86_64 340 + static int acpi_wakeup_cpu(int apicid, unsigned long start_ip) 341 + { 342 + /* 343 + * Remap mailbox memory only for the first call to acpi_wakeup_cpu(). 344 + * 345 + * Wakeup of secondary CPUs is fully serialized in the core code. 346 + * No need to protect acpi_mp_wake_mailbox from concurrent accesses. 347 + */ 348 + if (!acpi_mp_wake_mailbox) { 349 + acpi_mp_wake_mailbox = memremap(acpi_mp_wake_mailbox_paddr, 350 + sizeof(*acpi_mp_wake_mailbox), 351 + MEMREMAP_WB); 352 + } 353 + 354 + /* 355 + * Mailbox memory is shared between the firmware and OS. Firmware will 356 + * listen on mailbox command address, and once it receives the wakeup 357 + * command, the CPU associated with the given apicid will be booted. 358 + * 359 + * The value of 'apic_id' and 'wakeup_vector' must be visible to the 360 + * firmware before the wakeup command is visible. smp_store_release() 361 + * ensures ordering and visibility. 362 + */ 363 + acpi_mp_wake_mailbox->apic_id = apicid; 364 + acpi_mp_wake_mailbox->wakeup_vector = start_ip; 365 + smp_store_release(&acpi_mp_wake_mailbox->command, 366 + ACPI_MP_WAKE_COMMAND_WAKEUP); 367 + 368 + /* 369 + * Wait for the CPU to wake up. 370 + * 371 + * The CPU being woken up is essentially in a spin loop waiting to be 372 + * woken up. It should not take long for it wake up and acknowledge by 373 + * zeroing out ->command. 
374 + * 375 + * ACPI specification doesn't provide any guidance on how long kernel 376 + * has to wait for a wake up acknowledgement. It also doesn't provide 377 + * a way to cancel a wake up request if it takes too long. 378 + * 379 + * In TDX environment, the VMM has control over how long it takes to 380 + * wake up secondary. It can postpone scheduling secondary vCPU 381 + * indefinitely. Giving up on wake up request and reporting error opens 382 + * possible attack vector for VMM: it can wake up a secondary CPU when 383 + * kernel doesn't expect it. Wait until positive result of the wake up 384 + * request. 385 + */ 386 + while (READ_ONCE(acpi_mp_wake_mailbox->command)) 387 + cpu_relax(); 388 + 389 + return 0; 390 + } 391 + #endif /* CONFIG_X86_64 */ 392 + #endif /* CONFIG_X86_LOCAL_APIC */ 347 393 348 394 #ifdef CONFIG_X86_IO_APIC 349 395 #define MP_ISA_BUS 0 ··· 1143 1083 } 1144 1084 return 0; 1145 1085 } 1086 + 1087 + #ifdef CONFIG_X86_64 1088 + static int __init acpi_parse_mp_wake(union acpi_subtable_headers *header, 1089 + const unsigned long end) 1090 + { 1091 + struct acpi_madt_multiproc_wakeup *mp_wake; 1092 + 1093 + if (!IS_ENABLED(CONFIG_SMP)) 1094 + return -ENODEV; 1095 + 1096 + mp_wake = (struct acpi_madt_multiproc_wakeup *)header; 1097 + if (BAD_MADT_ENTRY(mp_wake, end)) 1098 + return -EINVAL; 1099 + 1100 + acpi_table_print_madt_entry(&header->common); 1101 + 1102 + acpi_mp_wake_mailbox_paddr = mp_wake->base_address; 1103 + 1104 + acpi_wake_cpu_handler_update(acpi_wakeup_cpu); 1105 + 1106 + return 0; 1107 + } 1108 + #endif /* CONFIG_X86_64 */ 1146 1109 #endif /* CONFIG_X86_LOCAL_APIC */ 1147 1110 1148 1111 #ifdef CONFIG_X86_IO_APIC ··· 1361 1278 1362 1279 smp_found_config = 1; 1363 1280 } 1281 + 1282 + #ifdef CONFIG_X86_64 1283 + /* 1284 + * Parse MADT MP Wake entry. 1285 + */ 1286 + acpi_table_parse_madt(ACPI_MADT_TYPE_MULTIPROC_WAKEUP, 1287 + acpi_parse_mp_wake, 1); 1288 + #endif 1364 1289 } 1365 1290 if (error == -EINVAL) { 1366 1291 /*
+10
arch/x86/kernel/apic/apic.c
···
 }
 EXPORT_SYMBOL_GPL(x86_msi_msg_get_destid);

+#ifdef CONFIG_X86_64
+void __init acpi_wake_cpu_handler_update(wakeup_cpu_handler handler)
+{
+	struct apic **drv;
+
+	for (drv = __apicdrivers; drv < __apicdrivers_end; drv++)
+		(*drv)->wakeup_secondary_cpu_64 = handler;
+}
+#endif
+
 /*
  * Override the generic EOI implementation with an optimized version.
  * Only called during early boot when only one CPU is active and with
+16 -2
arch/x86/kernel/apic/io_apic.c
···
 #include <asm/irq_remapping.h>
 #include <asm/hw_irq.h>
 #include <asm/apic.h>
+#include <asm/pgtable.h>

 #define for_each_ioapic(idx)	\
 	for ((idx) = 0; (idx) < nr_ioapics; (idx)++)
···
 	return res;
 }

+static void io_apic_set_fixmap(enum fixed_addresses idx, phys_addr_t phys)
+{
+	pgprot_t flags = FIXMAP_PAGE_NOCACHE;
+
+	/*
+	 * Ensure fixmaps for IOAPIC MMIO respect memory encryption pgprot
+	 * bits, just like normal ioremap():
+	 */
+	flags = pgprot_decrypted(flags);
+
+	__set_fixmap(idx, phys, flags);
+}
+
 void __init io_apic_init_mappings(void)
 {
 	unsigned long ioapic_phys, idx = FIX_IO_APIC_BASE_0;
···
 			__func__, PAGE_SIZE, PAGE_SIZE);
 		ioapic_phys = __pa(ioapic_phys);
 	}
-	set_fixmap_nocache(idx, ioapic_phys);
+	io_apic_set_fixmap(idx, ioapic_phys);
 	apic_printk(APIC_VERBOSE, "mapped IOAPIC to %08lx (%08lx)\n",
 		    __fix_to_virt(idx) + (ioapic_phys & ~PAGE_MASK),
 		    ioapic_phys);
···
 	ioapics[idx].mp_config.flags = MPC_APIC_USABLE;
 	ioapics[idx].mp_config.apicaddr = address;

-	set_fixmap_nocache(FIX_IO_APIC_BASE_0 + idx, address);
+	io_apic_set_fixmap(FIX_IO_APIC_BASE_0 + idx, address);
 	if (bad_ioapic_register(idx)) {
 		clear_fixmap(FIX_IO_APIC_BASE_0 + idx);
 		return -ENODEV;
+17
arch/x86/kernel/asm-offsets.c
···
 #include <asm/bootparam.h>
 #include <asm/suspend.h>
 #include <asm/tlbflush.h>
+#include <asm/tdx.h>

 #ifdef CONFIG_XEN
 #include <xen/interface/xen.h>
···
 	OFFSET(XEN_vcpu_info_pending, vcpu_info, evtchn_upcall_pending);
 	OFFSET(XEN_vcpu_info_arch_cr2, vcpu_info, arch.cr2);
 #endif
+
+	BLANK();
+	OFFSET(TDX_MODULE_rcx, tdx_module_output, rcx);
+	OFFSET(TDX_MODULE_rdx, tdx_module_output, rdx);
+	OFFSET(TDX_MODULE_r8,  tdx_module_output, r8);
+	OFFSET(TDX_MODULE_r9,  tdx_module_output, r9);
+	OFFSET(TDX_MODULE_r10, tdx_module_output, r10);
+	OFFSET(TDX_MODULE_r11, tdx_module_output, r11);
+
+	BLANK();
+	OFFSET(TDX_HYPERCALL_r10, tdx_hypercall_args, r10);
+	OFFSET(TDX_HYPERCALL_r11, tdx_hypercall_args, r11);
+	OFFSET(TDX_HYPERCALL_r12, tdx_hypercall_args, r12);
+	OFFSET(TDX_HYPERCALL_r13, tdx_hypercall_args, r13);
+	OFFSET(TDX_HYPERCALL_r14, tdx_hypercall_args, r14);
+	OFFSET(TDX_HYPERCALL_r15, tdx_hypercall_args, r15);

 	BLANK();
 	OFFSET(BP_scratch, boot_params, scratch);
+7
arch/x86/kernel/head64.c
···
 #include <asm/extable.h>
 #include <asm/trapnr.h>
 #include <asm/sev.h>
+#include <asm/tdx.h>

 /*
  * Manage page tables very early on.
···
 	    trapnr == X86_TRAP_VC && handle_vc_boot_ghcb(regs))
 		return;

+	if (trapnr == X86_TRAP_VE && tdx_early_handle_ve(regs))
+		return;
+
 	early_fixup_exception(regs, trapnr);
 }
···
 	__native_tlb_flush_global(this_cpu_read(cpu_tlbstate.cr4));

 	idt_setup_early_handler();
+
+	/* Needed before cc_platform_has() can be used for TDX */
+	tdx_early_init();

 	copy_bootdata(__va(real_mode_data));
+26 -2
arch/x86/kernel/head_64.S
···
 	addq	$(init_top_pgt - __START_KERNEL_map), %rax
 1:

+#ifdef CONFIG_X86_MCE
+	/*
+	 * Preserve CR4.MCE if the kernel will enable #MC support.
+	 * Clearing MCE may fault in some environments (that also force #MC
+	 * support). Any machine check that occurs before #MC support is fully
+	 * configured will crash the system regardless of the CR4.MCE value set
+	 * here.
+	 */
+	movq	%cr4, %rcx
+	andl	$X86_CR4_MCE, %ecx
+#else
+	movl	$0, %ecx
+#endif
+
 	/* Enable PAE mode, PGE and LA57 */
-	movl	$(X86_CR4_PAE | X86_CR4_PGE), %ecx
+	orl	$(X86_CR4_PAE | X86_CR4_PGE), %ecx
 #ifdef CONFIG_X86_5LEVEL
 	testl	$1, __pgtable_l5_enabled(%rip)
 	jz	1f
···
 	/* Setup EFER (Extended Feature Enable Register) */
 	movl	$MSR_EFER, %ecx
 	rdmsr
+	/*
+	 * Preserve current value of EFER for comparison and to skip
+	 * EFER writes if no change was made (for TDX guest)
+	 */
+	movl	%eax, %edx
 	btsl	$_EFER_SCE, %eax	/* Enable System Call */
 	btl	$20,%edi		/* No Execute supported? */
 	jnc	1f
 	btsl	$_EFER_NX, %eax
 	btsq	$_PAGE_BIT_NX,early_pmd_flags(%rip)
-1:	wrmsr				/* Make changes effective */

+	/* Avoid writing EFER if no change was made (for TDX guest) */
+1:	cmpl	%edx, %eax
+	je	1f
+	xor	%edx, %edx
+	wrmsr				/* Make changes effective */
+1:
 	/* Setup cr0 */
 	movl	$CR0_STATE, %eax
 	/* Make changes effective */
+3
arch/x86/kernel/idt.c
···
 	 */
 	INTG(X86_TRAP_PF,		asm_exc_page_fault),
 #endif
+#ifdef CONFIG_INTEL_TDX_GUEST
+	INTG(X86_TRAP_VE,		asm_exc_virtualization_exception),
+#endif
 };

 /*
+4
arch/x86/kernel/process.c
···
 #include <asm/proto.h>
 #include <asm/frame.h>
 #include <asm/unwind.h>
+#include <asm/tdx.h>

 #include "process.h"
···
 	} else if (prefer_mwait_c1_over_halt(c)) {
 		pr_info("using mwait in idle threads\n");
 		x86_idle = mwait_idle;
+	} else if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
+		pr_info("using TDX aware idle routine\n");
+		x86_idle = tdx_safe_halt;
 	} else
 		x86_idle = default_idle;
 }
+10 -2
arch/x86/kernel/smpboot.c
···
 	unsigned long boot_error = 0;
 	unsigned long timeout;

+#ifdef CONFIG_X86_64
+	/* If a 64-bit wakeup method exists, use the 64-bit mode trampoline IP */
+	if (apic->wakeup_secondary_cpu_64)
+		start_ip = real_mode_header->trampoline_start64;
+#endif
 	idle->thread.sp = (unsigned long)task_pt_regs(idle);
 	early_gdt_descr.address = (unsigned long)get_cpu_gdt_rw(cpu);
 	initial_code = (unsigned long)start_secondary;
···
 	/*
 	 * Wake up a CPU in different cases:
-	 * - Use the method in the APIC driver if it's defined
+	 * - Use a method from the APIC driver if one is defined, with wakeup
+	 *   straight to 64-bit mode preferred over wakeup to RM.
 	 * Otherwise,
 	 * - Use an INIT boot APIC message for APs or NMI for BSP.
 	 */
-	if (apic->wakeup_secondary_cpu)
+	if (apic->wakeup_secondary_cpu_64)
+		boot_error = apic->wakeup_secondary_cpu_64(apicid, start_ip);
+	else if (apic->wakeup_secondary_cpu)
 		boot_error = apic->wakeup_secondary_cpu(apicid, start_ip);
 	else
 		boot_error = wakeup_cpu_via_init_nmi(cpu, start_ip, apicid,
+117 -26
arch/x86/kernel/traps.c
···
 #include <asm/insn.h>
 #include <asm/insn-eval.h>
 #include <asm/vdso.h>
+#include <asm/tdx.h>

 #ifdef CONFIG_X86_64
 #include <asm/x86_init.h>
···
 #endif
 }

+static bool gp_try_fixup_and_notify(struct pt_regs *regs, int trapnr,
+				    unsigned long error_code, const char *str)
+{
+	if (fixup_exception(regs, trapnr, error_code, 0))
+		return true;
+
+	current->thread.error_code = error_code;
+	current->thread.trap_nr = trapnr;
+
+	/*
+	 * To be potentially processing a kprobe fault and to trust the result
+	 * from kprobe_running(), we have to be non-preemptible.
+	 */
+	if (!preemptible() && kprobe_running() &&
+	    kprobe_fault_handler(regs, trapnr))
+		return true;
+
+	return notify_die(DIE_GPF, str, regs, error_code, trapnr, SIGSEGV) == NOTIFY_STOP;
+}
+
+static void gp_user_force_sig_segv(struct pt_regs *regs, int trapnr,
+				   unsigned long error_code, const char *str)
+{
+	current->thread.error_code = error_code;
+	current->thread.trap_nr = trapnr;
+	show_signal(current, SIGSEGV, "", str, regs, error_code);
+	force_sig(SIGSEGV);
+}
+
 DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
 {
 	char desc[sizeof(GPFSTR) + 50 + 2*sizeof(unsigned long) + 1] = GPFSTR;
 	enum kernel_gp_hint hint = GP_NO_HINT;
-	struct task_struct *tsk;
 	unsigned long gp_addr;
-	int ret;

 	if (user_mode(regs) && try_fixup_enqcmd_gp())
 		return;
···
 		return;
 	}

-	tsk = current;
-
 	if (user_mode(regs)) {
 		if (fixup_iopl_exception(regs))
 			goto exit;

-		tsk->thread.error_code = error_code;
-		tsk->thread.trap_nr = X86_TRAP_GP;
-
 		if (fixup_vdso_exception(regs, X86_TRAP_GP, error_code, 0))
 			goto exit;

-		show_signal(tsk, SIGSEGV, "", desc, regs, error_code);
-		force_sig(SIGSEGV);
+		gp_user_force_sig_segv(regs, X86_TRAP_GP, error_code, desc);
 		goto exit;
 	}

-	if (fixup_exception(regs, X86_TRAP_GP, error_code, 0))
-		goto exit;
-
-	tsk->thread.error_code = error_code;
-	tsk->thread.trap_nr = X86_TRAP_GP;
-
-	/*
-	 * To be potentially processing a kprobe fault and to trust the result
-	 * from kprobe_running(), we have to be non-preemptible.
-	 */
-	if (!preemptible() &&
-	    kprobe_running() &&
-	    kprobe_fault_handler(regs, X86_TRAP_GP))
-		goto exit;
-
-	ret = notify_die(DIE_GPF, desc, regs, error_code, X86_TRAP_GP, SIGSEGV);
-	if (ret == NOTIFY_STOP)
+	if (gp_try_fixup_and_notify(regs, X86_TRAP_GP, error_code, desc))
 		goto exit;

 	if (error_code)
···
 		die("unexpected #NM exception", regs, 0);
 	}
 }
+
+#ifdef CONFIG_INTEL_TDX_GUEST
+
+#define VE_FAULT_STR "VE fault"
+
+static void ve_raise_fault(struct pt_regs *regs, long error_code)
+{
+	if (user_mode(regs)) {
+		gp_user_force_sig_segv(regs, X86_TRAP_VE, error_code, VE_FAULT_STR);
+		return;
+	}
+
+	if (gp_try_fixup_and_notify(regs, X86_TRAP_VE, error_code, VE_FAULT_STR))
+		return;
+
+	die_addr(VE_FAULT_STR, regs, error_code, 0);
+}
+
+/*
+ * Virtualization Exceptions (#VE) are delivered to TDX guests due to
+ * specific guest actions which may happen in either user space or the
+ * kernel:
+ *
+ *  * Specific instructions (WBINVD, for example)
+ *  * Specific MSR accesses
+ *  * Specific CPUID leaf accesses
+ *  * Access to specific guest physical addresses
+ *
+ * In the settings that Linux will run in, virtualization exceptions are
+ * never generated on accesses to normal, TD-private memory that has been
+ * accepted (by BIOS or with tdx_enc_status_changed()).
+ *
+ * Syscall entry code has a critical window where the kernel stack is not
+ * yet set up. Any exception in this window leads to hard to debug issues
+ * and can be exploited for privilege escalation. Exceptions in the NMI
+ * entry code also cause issues. Returning from the exception handler with
+ * IRET will re-enable NMIs and a nested NMI will corrupt the NMI stack.
+ *
+ * For these reasons, the kernel avoids #VEs during the syscall gap and
+ * the NMI entry code. Entry code paths do not access TD-shared memory,
+ * MMIO regions, or use #VE triggering MSRs, instructions, or CPUID leaves
+ * that might generate #VE. The VMM can remove memory from the TD at any
+ * point, but access to unaccepted (or missing) private memory leads to VM
+ * termination, not to #VE.
+ *
+ * Similarly to page faults and breakpoints, #VEs are allowed in NMI
+ * handlers once the kernel is ready to deal with nested NMIs.
+ *
+ * During #VE delivery, all interrupts, including NMIs, are blocked until
+ * TDGETVEINFO is called. It prevents #VE nesting until the kernel reads
+ * the VE info.
+ *
+ * If a guest kernel action which would normally cause a #VE occurs in
+ * the interrupt-disabled region before TDGETVEINFO, a #DF (fault
+ * exception) is delivered to the guest which will result in an oops.
+ *
+ * The entry code has been audited carefully for following these
+ * expectations. Changes in the entry code have to be audited for
+ * correctness vs. this aspect. Similarly to #PF, #VE in these places
+ * will expose the kernel to privilege escalation or may lead to random
+ * crashes.
+ */
+DEFINE_IDTENTRY(exc_virtualization_exception)
+{
+	struct ve_info ve;
+
+	/*
+	 * NMIs/Machine-checks/Interrupts will be in a disabled state
+	 * till the TDGETVEINFO TDCALL is executed. This ensures that VE
+	 * info cannot be overwritten by a nested #VE.
+	 */
+	tdx_get_ve_info(&ve);
+
+	cond_local_irq_enable(regs);
+
+	/*
+	 * If tdx_handle_virt_exception() could not process
+	 * it successfully, treat it as #GP(0) and handle it.
+	 */
+	if (!tdx_handle_virt_exception(regs, &ve))
+		ve_raise_fault(regs, 0);
+
+	cond_local_irq_disable(regs);
+}
+
+#endif

 #ifdef CONFIG_X86_32
 DEFINE_IDTENTRY_SW(iret_error)
+1 -1
arch/x86/lib/kaslr.c
···
 #include <asm/msr.h>
 #include <asm/archrandom.h>
 #include <asm/e820/api.h>
-#include <asm/io.h>
+#include <asm/shared/io.h>

 /*
  * When built for the regular kernel, several functions need to be stubbed out
+5
arch/x86/mm/ioremap.c
···
 	 * If the page being mapped is in memory and SEV is active then
 	 * make sure the memory encryption attribute is enabled in the
 	 * resulting mapping.
+	 * In TDX guests, memory is marked private by default. If encryption
+	 * is not requested (via the 'encrypted' argument), explicitly set
+	 * the decrypted attribute on all ioremapped memory.
 	 */
 	prot = PAGE_KERNEL_IO;
 	if ((io_desc.flags & IORES_MAP_ENCRYPTED) || encrypted)
 		prot = pgprot_encrypted(prot);
+	else
+		prot = pgprot_decrypted(prot);

 	switch (pcm) {
 	case _PAGE_CACHE_MODE_UC:
+8 -1
arch/x86/mm/mem_encrypt.c
···
 static void print_mem_encrypt_feature_info(void)
 {
-	pr_info("AMD Memory Encryption Features active:");
+	pr_info("Memory Encryption Features active:");
+
+	if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
+		pr_cont(" Intel TDX\n");
+		return;
+	}
+
+	pr_cont(" AMD");

 	/* Secure Memory Encryption */
 	if (cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT)) {
+1
arch/x86/realmode/rm/header.S
···
 	.long	pa_sev_es_trampoline_start
 #endif
 #ifdef CONFIG_X86_64
+	.long	pa_trampoline_start64
 	.long	pa_trampoline_pgd;
 #endif
 	/* ACPI S3 wakeup */
+53 -4
arch/x86/realmode/rm/trampoline_64.S
···
 	movw	$__KERNEL_DS, %dx	# Data segment descriptor

 	# Enable protected mode
-	movl	$X86_CR0_PE, %eax	# protected mode (PE) bit
+	movl	$(CR0_STATE & ~X86_CR0_PG), %eax
 	movl	%eax, %cr0		# into protected mode

 	# flush prefetch and jump to startup_32
···
 	movl	%eax, %cr3

 	# Set up EFER
+	movl	$MSR_EFER, %ecx
+	rdmsr
+	/*
+	 * Skip writing to EFER if the register already has the desired
+	 * value (to avoid #VE for the TDX guest).
+	 */
+	cmp	pa_tr_efer, %eax
+	jne	.Lwrite_efer
+	cmp	pa_tr_efer + 4, %edx
+	je	.Ldone_efer
+.Lwrite_efer:
 	movl	pa_tr_efer, %eax
 	movl	pa_tr_efer + 4, %edx
-	movl	$MSR_EFER, %ecx
 	wrmsr

-	# Enable paging and in turn activate Long Mode
-	movl	$(X86_CR0_PG | X86_CR0_WP | X86_CR0_PE), %eax
+.Ldone_efer:
+	# Enable paging and in turn activate Long Mode.
+	movl	$CR0_STATE, %eax
 	movl	%eax, %cr0

 	/*
···
 	ljmpl	$__KERNEL_CS, $pa_startup_64
 SYM_CODE_END(startup_32)

+SYM_CODE_START(pa_trampoline_compat)
+	/*
+	 * In compatibility mode. Prep ESP and DX for startup_32, then disable
+	 * paging and complete the switch to legacy 32-bit mode.
+	 */
+	movl	$rm_stack_end, %esp
+	movw	$__KERNEL_DS, %dx
+
+	movl	$(CR0_STATE & ~X86_CR0_PG), %eax
+	movl	%eax, %cr0
+	ljmpl	$__KERNEL32_CS, $pa_startup_32
+SYM_CODE_END(pa_trampoline_compat)
+
 	.section ".text64","ax"
 	.code64
 	.balign 4
···
 	# Now jump into the kernel using virtual addresses
 	jmpq	*tr_start(%rip)
 SYM_CODE_END(startup_64)
+
+SYM_CODE_START(trampoline_start64)
+	/*
+	 * APs start here on a direct transfer from 64-bit BIOS with identity
+	 * mapped page tables. Load the kernel's GDT in order to gear down to
+	 * 32-bit mode (to handle 4-level vs. 5-level paging), and to (re)load
+	 * segment registers. Load the zero IDT so any fault triggers a
+	 * shutdown instead of jumping back into BIOS.
+	 */
+	lidt	tr_idt(%rip)
+	lgdt	tr_gdt64(%rip)
+
+	ljmpl	*tr_compat(%rip)
+SYM_CODE_END(trampoline_start64)

 	.section ".rodata","a"
 	# Duplicate the global descriptor table
···
 	.quad	0x00af9b000000ffff	# __KERNEL_CS
 	.quad	0x00cf93000000ffff	# __KERNEL_DS
 SYM_DATA_END_LABEL(tr_gdt, SYM_L_LOCAL, tr_gdt_end)
+
+SYM_DATA_START(tr_gdt64)
+	.short	tr_gdt_end - tr_gdt - 1	# gdt limit
+	.long	pa_tr_gdt
+	.long	0
+SYM_DATA_END(tr_gdt64)
+
+SYM_DATA_START(tr_compat)
+	.long	pa_trampoline_compat
+	.short	__KERNEL32_CS
+SYM_DATA_END(tr_compat)

 	.bss
 	.balign	PAGE_SIZE
+11 -1
arch/x86/realmode/rm/trampoline_common.S
···
 /* SPDX-License-Identifier: GPL-2.0 */
 	.section ".rodata","a"
 	.balign	16
-SYM_DATA_LOCAL(tr_idt, .fill 1, 6, 0)
+
+/*
+ * When a bootloader hands off to the kernel in 32-bit mode, an
+ * IDT with a 2-byte limit and a 4-byte base is needed. When a
+ * bootloader hands off to the kernel in 64-bit mode, the base
+ * address extends to 8 bytes. Reserve enough space for either
+ * scenario.
+ */
+SYM_DATA_START_LOCAL(tr_idt)
+	.short	0
+	.quad	0
+SYM_DATA_END(tr_idt)
+4
arch/x86/realmode/rm/wakemain.c
···
 	}
 }

+struct port_io_ops pio_ops;
+
 void main(void)
 {
+	init_default_io_ops();
+
 	/* Kill machine if structures are wrong */
 	if (wakeup_header.real_magic != 0x12345678)
 		while (1)
+96
arch/x86/virt/vmx/tdx/tdxcall.S
···
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <asm/asm-offsets.h>
+#include <asm/tdx.h>
+
+/*
+ * TDCALL and SEAMCALL are supported in Binutils >= 2.36.
+ */
+#define tdcall		.byte 0x66,0x0f,0x01,0xcc
+#define seamcall	.byte 0x66,0x0f,0x01,0xcf
+
+/*
+ * TDX_MODULE_CALL - common helper macro for both
+ *		     TDCALL and SEAMCALL instructions.
+ *
+ * TDCALL   - used by TDX guests to make requests to the
+ *	      TDX module and hypercalls to the VMM.
+ * SEAMCALL - used by TDX hosts to make requests to the
+ *	      TDX module.
+ */
+.macro TDX_MODULE_CALL host:req
+	/*
+	 * R12 will be used as temporary storage for the struct
+	 * tdx_module_output pointer. Since the R12-R15 registers are not
+	 * used by the TDCALL/SEAMCALL leaves supported by this function,
+	 * it can be reused.
+	 */
+
+	/* Callee saved, so preserve it */
+	push %r12
+
+	/*
+	 * Push output pointer to stack.
+	 * After the operation, it will be fetched into the R12 register.
+	 */
+	push %r9
+
+	/* Mangle function call ABI into TDCALL/SEAMCALL ABI: */
+	/* Move Leaf ID to RAX */
+	mov %rdi, %rax
+	/* Move input 4 to R9 */
+	mov %r8,  %r9
+	/* Move input 3 to R8 */
+	mov %rcx, %r8
+	/* Move input 1 to RCX */
+	mov %rsi, %rcx
+	/* Leave input param 2 in RDX */
+
+	.if \host
+	seamcall
+	/*
+	 * The SEAMCALL instruction is essentially a VMExit from VMX root
+	 * mode to SEAM VMX root mode. VMfailInvalid (CF=1) indicates
+	 * that the targeted SEAM firmware is not loaded or disabled,
+	 * or P-SEAMLDR is busy with another SEAMCALL. %rax is not
+	 * changed in this case.
+	 *
+	 * Set %rax to TDX_SEAMCALL_VMFAILINVALID for VMfailInvalid.
+	 * This value will never be used as an actual SEAMCALL error code,
+	 * as it is from the Reserved status code class.
+	 */
+	jnc .Lno_vmfailinvalid
+	mov $TDX_SEAMCALL_VMFAILINVALID, %rax
+.Lno_vmfailinvalid:
+
+	.else
+	tdcall
+	.endif
+
+	/*
+	 * Fetch output pointer from stack to R12 (it is used
+	 * as temporary storage)
+	 */
+	pop %r12
+
+	/*
+	 * Since this macro can be invoked with NULL as an output pointer,
+	 * check if the caller provided an output struct before storing
+	 * output registers.
+	 *
+	 * Update output registers, even if the call failed (RAX != 0).
+	 * Other registers may contain details of the failure.
+	 */
+	test %r12, %r12
+	jz .Lno_output_struct
+
+	/* Copy result registers to output struct: */
+	movq %rcx, TDX_MODULE_rcx(%r12)
+	movq %rdx, TDX_MODULE_rdx(%r12)
+	movq %r8,  TDX_MODULE_r8(%r12)
+	movq %r9,  TDX_MODULE_r9(%r12)
+	movq %r10, TDX_MODULE_r10(%r12)
+	movq %r11, TDX_MODULE_r11(%r12)
+
+.Lno_output_struct:
+	/* Restore the state of the R12 register */
+	pop %r12
+.endm
+10
include/linux/cc_platform.h
···
 	 * using AMD SEV-SNP features.
 	 */
 	CC_ATTR_GUEST_SEV_SNP,
+
+	/**
+	 * @CC_ATTR_HOTPLUG_DISABLED: Hotplug is not supported or disabled.
+	 *
+	 * The platform/OS is running as a guest/virtual machine that does
+	 * not support the CPU hotplug feature.
+	 *
+	 * Examples include TDX guests.
+	 */
+	CC_ATTR_HOTPLUG_DISABLED,
 };

 #ifdef CONFIG_ARCH_HAS_CC_PLATFORM
+7
kernel/cpu.c
···
 #include <linux/percpu-rwsem.h>
 #include <linux/cpuset.h>
 #include <linux/random.h>
+#include <linux/cc_platform.h>

 #include <trace/events/power.h>
 #define CREATE_TRACE_POINTS
···
 static int cpu_down_maps_locked(unsigned int cpu, enum cpuhp_state target)
 {
+	/*
+	 * If the platform does not support hotplug, report it explicitly to
+	 * differentiate it from a transient offlining failure.
+	 */
+	if (cc_platform_has(CC_ATTR_HOTPLUG_DISABLED))
+		return -EOPNOTSUPP;
 	if (cpu_hotplug_disabled)
 		return -EBUSY;
 	return _cpu_down(cpu, 0, target);