Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'x86_tdx_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull Intel TDX support from Borislav Petkov:
"Intel Trust Domain Extensions (TDX) support.

This is the Intel version of a confidential computing solution called
Trust Domain Extensions (TDX). This series adds support to run the
kernel as part of a TDX guest. It provides similar guest protections
to AMD's SEV-SNP like guest memory and register state encryption,
memory integrity protection and a lot more.

Design-wise, it differs from AMD's solution considerably: it uses a
software module which runs in a special CPU mode called SEAM (Secure
Arbitration Mode). As the name suggests, this module serves as
sort of an arbiter which the confidential guest calls for services it
needs during its lifetime.

Just like AMD's SNP set, this series reworks and streamlines certain
parts of x86 arch code so that this feature can be properly
accommodated"

* tag 'x86_tdx_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
x86/tdx: Fix RETs in TDX asm
x86/tdx: Annotate a noreturn function
x86/mm: Fix spacing within memory encryption features message
x86/kaslr: Fix build warning in KASLR code in boot stub
Documentation/x86: Document TDX kernel architecture
ACPICA: Avoid cache flush inside virtual machines
x86/tdx/ioapic: Add shared bit for IOAPIC base address
x86/mm: Make DMA memory shared for TD guest
x86/mm/cpa: Add support for TDX shared memory
x86/tdx: Make pages shared in ioremap()
x86/topology: Disable CPU online/offline control for TDX guests
x86/boot: Avoid #VE during boot for TDX platforms
x86/boot: Set CR0.NE early and keep it set during the boot
x86/acpi/x86/boot: Add multiprocessor wake-up support
x86/boot: Add a trampoline for booting APs via firmware handoff
x86/tdx: Wire up KVM hypercalls
x86/tdx: Port I/O: Add early boot support
x86/tdx: Port I/O: Add runtime hypercalls
x86/boot: Port I/O: Add decompression-time support for TDX
x86/boot: Port I/O: Allow to hook up alternative helpers
...

+2071 -120
+1
Documentation/x86/index.rst
··· 26 26 intel_txt 27 27 amd-memory-encryption 28 28 amd_hsmp 29 + tdx 29 30 pti 30 31 mds 31 32 microcode
+218
Documentation/x86/tdx.rst
··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + ===================================== 4 + Intel Trust Domain Extensions (TDX) 5 + ===================================== 6 + 7 + Intel's Trust Domain Extensions (TDX) protect confidential guest VMs from 8 + the host and physical attacks by isolating the guest register state and by 9 + encrypting the guest memory. In TDX, a special module running in a special 10 + mode sits between the host and the guest and manages the guest/host 11 + separation. 12 + 13 + Since the host cannot directly access guest registers or memory, much 14 + normal functionality of a hypervisor must be moved into the guest. This is 15 + implemented using a Virtualization Exception (#VE) that is handled by the 16 + guest kernel. Some #VEs are handled entirely inside the guest kernel, but 17 + others require the hypervisor to be consulted. 18 + 19 + TDX includes new hypercall-like mechanisms for communicating from the 20 + guest to the hypervisor or the TDX module. 21 + 22 + New TDX Exceptions 23 + ================== 24 + 25 + TDX guests behave differently from bare-metal and traditional VMX guests. 26 + In TDX guests, otherwise normal instructions or memory accesses can cause 27 + #VE or #GP exceptions. 28 + 29 + Instructions marked with an '*' conditionally cause exceptions. The 30 + details for these instructions are discussed below. 
31 + 32 + Instruction-based #VE 33 + --------------------- 34 + 35 + - Port I/O (INS, OUTS, IN, OUT) 36 + - HLT 37 + - MONITOR, MWAIT 38 + - WBINVD, INVD 39 + - VMCALL 40 + - RDMSR*,WRMSR* 41 + - CPUID* 42 + 43 + Instruction-based #GP 44 + --------------------- 45 + 46 + - All VMX instructions: INVEPT, INVVPID, VMCLEAR, VMFUNC, VMLAUNCH, 47 + VMPTRLD, VMPTRST, VMREAD, VMRESUME, VMWRITE, VMXOFF, VMXON 48 + - ENCLS, ENCLU 49 + - GETSEC 50 + - RSM 51 + - ENQCMD 52 + - RDMSR*,WRMSR* 53 + 54 + RDMSR/WRMSR Behavior 55 + -------------------- 56 + 57 + MSR access behavior falls into three categories: 58 + 59 + - #GP generated 60 + - #VE generated 61 + - "Just works" 62 + 63 + In general, the #GP MSRs should not be used in guests. Their use likely 64 + indicates a bug in the guest. The guest may try to handle the #GP with a 65 + hypercall but it is unlikely to succeed. 66 + 67 + The #VE MSRs are typically able to be handled by the hypervisor. Guests 68 + can make a hypercall to the hypervisor to handle the #VE. 69 + 70 + The "just works" MSRs do not need any special guest handling. They might 71 + be implemented by directly passing through the MSR to the hardware or by 72 + trapping and handling in the TDX module. Other than possibly being slow, 73 + these MSRs appear to function just as they would on bare metal. 74 + 75 + CPUID Behavior 76 + -------------- 77 + 78 + For some CPUID leaves and sub-leaves, the virtualized bit fields of CPUID 79 + return values (in guest EAX/EBX/ECX/EDX) are configurable by the 80 + hypervisor. For such cases, the Intel TDX module architecture defines two 81 + virtualization types: 82 + 83 + - Bit fields for which the hypervisor controls the value seen by the guest 84 + TD. 85 + 86 + - Bit fields for which the hypervisor configures the value such that the 87 + guest TD either sees their native value or a value of 0. For these bit 88 + fields, the hypervisor can mask off the native values, but it can not 89 + turn *on* values. 
90 + 91 + A #VE is generated for CPUID leaves and sub-leaves that the TDX module does 92 + not know how to handle. The guest kernel may ask the hypervisor for the 93 + value with a hypercall. 94 + 95 + #VE on Memory Accesses 96 + ====================== 97 + 98 + There are essentially two classes of TDX memory: private and shared. 99 + Private memory receives full TDX protections. Its content is protected 100 + against access from the hypervisor. Shared memory is expected to be 101 + shared between guest and hypervisor and does not receive full TDX 102 + protections. 103 + 104 + A TD guest is in control of whether its memory accesses are treated as 105 + private or shared. It selects the behavior with a bit in its page table 106 + entries. This helps ensure that a guest does not place sensitive 107 + information in shared memory, exposing it to the untrusted hypervisor. 108 + 109 + #VE on Shared Memory 110 + -------------------- 111 + 112 + Access to shared mappings can cause a #VE. The hypervisor ultimately 113 + controls whether a shared memory access causes a #VE, so the guest must be 114 + careful to only reference shared pages for which it can safely handle a 115 + #VE. For instance, the guest should be careful not to access shared memory 116 + in the #VE handler before it reads the #VE info structure (TDG.VP.VEINFO.GET). 117 + 118 + Shared mapping content is entirely controlled by the hypervisor. The guest 119 + should only use shared mappings for communicating with the hypervisor. 120 + Shared mappings must never be used for sensitive memory content like kernel 121 + stacks. A good rule of thumb is that hypervisor-shared memory should be 122 + treated the same as memory mapped to userspace. Both the hypervisor and 123 + userspace are completely untrusted. 124 + 125 + MMIO for virtual devices is implemented as shared memory. The guest must 126 + be careful not to access device MMIO regions unless it is also prepared to 127 + handle a #VE. 
128 + 129 + #VE on Private Pages 130 + -------------------- 131 + 132 + An access to private mappings can also cause a #VE. Since all kernel 133 + memory is also private memory, the kernel might theoretically need to 134 + handle a #VE on arbitrary kernel memory accesses. This is not feasible, so 135 + TDX guests ensure that all guest memory has been "accepted" before memory 136 + is used by the kernel. 137 + 138 + A modest amount of memory (typically 512M) is pre-accepted by the firmware 139 + before the kernel runs to ensure that the kernel can start up without 140 + being subjected to a #VE. 141 + 142 + The hypervisor is permitted to unilaterally move accepted pages to a 143 + "blocked" state. However, if it does this, page access will not generate a 144 + #VE. It will, instead, cause a "TD Exit" where the hypervisor is required 145 + to handle the exception. 146 + 147 + Linux #VE handler 148 + ================= 149 + 150 + Just like page faults or #GP's, #VE exceptions can be either handled or be 151 + fatal. Typically, an unhandled userspace #VE results in a SIGSEGV. 152 + An unhandled kernel #VE results in an oops. 153 + 154 + Handling nested exceptions on x86 is typically nasty business. A #VE 155 + could be interrupted by an NMI which triggers another #VE and hilarity 156 + ensues. The TDX #VE architecture anticipated this scenario and includes a 157 + feature to make it slightly less nasty. 158 + 159 + During #VE handling, the TDX module ensures that all interrupts (including 160 + NMIs) are blocked. The block remains in place until the guest makes a 161 + TDG.VP.VEINFO.GET TDCALL. This allows the guest to control when interrupts 162 + or a new #VE can be delivered. 163 + 164 + However, the guest kernel must still be careful to avoid potential 165 + #VE-triggering actions (discussed above) while this block is in place. 166 + While the block is in place, any #VE is elevated to a double fault (#DF) 167 + which is not recoverable. 
168 + 169 + MMIO handling 170 + ============= 171 + 172 + In non-TDX VMs, MMIO is usually implemented by giving a guest access to a 173 + mapping which will cause a VMEXIT on access, and then the hypervisor 174 + emulates the access. That is not possible in TDX guests because VMEXIT 175 + will expose the register state to the host. TDX guests don't trust the host 176 + and can't have their state exposed to the host. 177 + 178 + In TDX, MMIO regions typically trigger a #VE exception in the guest. The 179 + guest #VE handler then emulates the MMIO instruction inside the guest and 180 + converts it into a controlled TDCALL to the host, rather than exposing 181 + guest state to the host. 182 + 183 + MMIO addresses on x86 are just special physical addresses. They can 184 + theoretically be accessed with any instruction that accesses memory. 185 + However, the kernel instruction decoding method is limited. It is only 186 + designed to decode instructions like those generated by io.h macros. 187 + 188 + MMIO access via other means (like structure overlays) may result in an 189 + oops. 190 + 191 + Shared Memory Conversions 192 + ========================= 193 + 194 + All TDX guest memory starts out as private at boot. This memory can not 195 + be accessed by the hypervisor. However, some kernel users like device 196 + drivers might have a need to share data with the hypervisor. To do this, 197 + memory must be converted between shared and private. This can be 198 + accomplished using some existing memory encryption helpers: 199 + 200 + * set_memory_decrypted() converts a range of pages to shared. 201 + * set_memory_encrypted() converts memory back to private. 202 + 203 + Device drivers are the primary user of shared memory, but there's no need 204 + to touch every driver. DMA buffers and ioremap() do the conversions 205 + automatically. 206 + 207 + TDX uses SWIOTLB for most DMA allocations. The SWIOTLB buffer is 208 + converted to shared on boot. 
209 + 210 + For coherent DMA allocation, the DMA buffer gets converted on the 211 + allocation. Check force_dma_unencrypted() for details. 212 + 213 + References 214 + ========== 215 + 216 + TDX reference material is collected here: 217 + 218 + https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
+15
arch/x86/Kconfig
··· 878 878 IOT with small footprint and real-time features. More details can be 879 879 found in https://projectacrn.org/. 880 880 881 + config INTEL_TDX_GUEST 882 + bool "Intel TDX (Trust Domain Extensions) - Guest Support" 883 + depends on X86_64 && CPU_SUP_INTEL 884 + depends on X86_X2APIC 885 + select ARCH_HAS_CC_PLATFORM 886 + select X86_MEM_ENCRYPT 887 + select X86_MCE 888 + help 889 + Support running as a guest under Intel TDX. Without this support, 890 + the guest kernel can not boot or run under TDX. 891 + TDX includes memory encryption and integrity capabilities 892 + which protect the confidentiality and integrity of guest 893 + memory contents and CPU state. TDX guests are protected from 894 + some attacks from the VMM. 895 + 881 896 endif #HYPERVISOR_GUEST 882 897 883 898 source "arch/x86/Kconfig.cpu"
+2 -35
arch/x86/boot/boot.h
··· 26 26 #include "bitops.h" 27 27 #include "ctype.h" 28 28 #include "cpuflags.h" 29 + #include "io.h" 29 30 30 31 /* Useful macros */ 31 32 #define ARRAY_SIZE(x) (sizeof(x) / sizeof(*(x))) ··· 36 35 37 36 #define cpu_relax() asm volatile("rep; nop") 38 37 39 - /* Basic port I/O */ 40 - static inline void outb(u8 v, u16 port) 41 - { 42 - asm volatile("outb %0,%1" : : "a" (v), "dN" (port)); 43 - } 44 - static inline u8 inb(u16 port) 45 - { 46 - u8 v; 47 - asm volatile("inb %1,%0" : "=a" (v) : "dN" (port)); 48 - return v; 49 - } 50 - 51 - static inline void outw(u16 v, u16 port) 52 - { 53 - asm volatile("outw %0,%1" : : "a" (v), "dN" (port)); 54 - } 55 - static inline u16 inw(u16 port) 56 - { 57 - u16 v; 58 - asm volatile("inw %1,%0" : "=a" (v) : "dN" (port)); 59 - return v; 60 - } 61 - 62 - static inline void outl(u32 v, u16 port) 63 - { 64 - asm volatile("outl %0,%1" : : "a" (v), "dN" (port)); 65 - } 66 - static inline u32 inl(u16 port) 67 - { 68 - u32 v; 69 - asm volatile("inl %1,%0" : "=a" (v) : "dN" (port)); 70 - return v; 71 - } 72 - 73 38 static inline void io_delay(void) 74 39 { 75 40 const u16 DELAY_PORT = 0x80; 76 - asm volatile("outb %%al,%0" : : "dN" (DELAY_PORT)); 41 + outb(0, DELAY_PORT); 77 42 } 78 43 79 44 /* These functions are used to reference data in other segments. */
+1
arch/x86/boot/compressed/Makefile
··· 101 101 endif 102 102 103 103 vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o 104 + vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o $(obj)/tdcall.o 104 105 105 106 vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_thunk_$(BITS).o 106 107 vmlinux-objs-$(CONFIG_EFI) += $(obj)/efi.o
+22 -5
arch/x86/boot/compressed/head_64.S
··· 289 289 pushl %eax 290 290 291 291 /* Enter paged protected Mode, activating Long Mode */ 292 - movl $(X86_CR0_PG | X86_CR0_PE), %eax /* Enable Paging and Protected mode */ 292 + movl $CR0_STATE, %eax 293 293 movl %eax, %cr0 294 294 295 295 /* Jump from 32bit compatibility mode into 64bit mode. */ ··· 649 649 movl $MSR_EFER, %ecx 650 650 rdmsr 651 651 btsl $_EFER_LME, %eax 652 + /* Avoid writing EFER if no change was made (for TDX guest) */ 653 + jc 1f 652 654 wrmsr 653 - popl %edx 655 + 1: popl %edx 654 656 popl %ecx 655 657 658 + #ifdef CONFIG_X86_MCE 659 + /* 660 + * Preserve CR4.MCE if the kernel will enable #MC support. 661 + * Clearing MCE may fault in some environments (that also force #MC 662 + * support). Any machine check that occurs before #MC support is fully 663 + * configured will crash the system regardless of the CR4.MCE value set 664 + * here. 665 + */ 666 + movl %cr4, %eax 667 + andl $X86_CR4_MCE, %eax 668 + #else 669 + movl $0, %eax 670 + #endif 671 + 656 672 /* Enable PAE and LA57 (if required) paging modes */ 657 - movl $X86_CR4_PAE, %eax 673 + orl $X86_CR4_PAE, %eax 658 674 testl %edx, %edx 659 675 jz 1f 660 676 orl $X86_CR4_LA57, %eax ··· 684 668 pushl $__KERNEL_CS 685 669 pushl %eax 686 670 687 - /* Enable paging again */ 688 - movl $(X86_CR0_PG | X86_CR0_PE), %eax 671 + /* Enable paging again. */ 672 + movl %cr0, %eax 673 + btsl $X86_CR0_PG_BIT, %eax 689 674 movl %eax, %cr0 690 675 691 676 lret
+12
arch/x86/boot/compressed/misc.c
··· 48 48 */ 49 49 struct boot_params *boot_params; 50 50 51 + struct port_io_ops pio_ops; 52 + 51 53 memptr free_mem_ptr; 52 54 memptr free_mem_end_ptr; 53 55 ··· 375 373 376 374 lines = boot_params->screen_info.orig_video_lines; 377 375 cols = boot_params->screen_info.orig_video_cols; 376 + 377 + init_default_io_ops(); 378 + 379 + /* 380 + * Detect TDX guest environment. 381 + * 382 + * It has to be done before console_init() in order to use 383 + * paravirtualized port I/O operations if needed. 384 + */ 385 + early_tdx_detect(); 378 386 379 387 console_init(); 380 388
+3 -1
arch/x86/boot/compressed/misc.h
··· 22 22 #include <linux/linkage.h> 23 23 #include <linux/screen_info.h> 24 24 #include <linux/elf.h> 25 - #include <linux/io.h> 26 25 #include <asm/page.h> 27 26 #include <asm/boot.h> 28 27 #include <asm/bootparam.h> 29 28 #include <asm/desc_defs.h> 29 + 30 + #include "tdx.h" 30 31 31 32 #define BOOT_CTYPE_H 32 33 #include <linux/acpi.h> 33 34 34 35 #define BOOT_BOOT_H 35 36 #include "../ctype.h" 37 + #include "../io.h" 36 38 37 39 #include "efi.h" 38 40
+1 -1
arch/x86/boot/compressed/pgtable.h
··· 6 6 #define TRAMPOLINE_32BIT_PGTABLE_OFFSET 0 7 7 8 8 #define TRAMPOLINE_32BIT_CODE_OFFSET PAGE_SIZE 9 - #define TRAMPOLINE_32BIT_CODE_SIZE 0x70 9 + #define TRAMPOLINE_32BIT_CODE_SIZE 0x80 10 10 11 11 #define TRAMPOLINE_32BIT_STACK_END TRAMPOLINE_32BIT_SIZE 12 12
+3
arch/x86/boot/compressed/tdcall.S
··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + 3 + #include "../../coco/tdx/tdcall.S"
+77
arch/x86/boot/compressed/tdx.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + 3 + #include "../cpuflags.h" 4 + #include "../string.h" 5 + #include "../io.h" 6 + #include "error.h" 7 + 8 + #include <vdso/limits.h> 9 + #include <uapi/asm/vmx.h> 10 + 11 + #include <asm/shared/tdx.h> 12 + 13 + /* Called from __tdx_hypercall() for unrecoverable failure */ 14 + void __tdx_hypercall_failed(void) 15 + { 16 + error("TDVMCALL failed. TDX module bug?"); 17 + } 18 + 19 + static inline unsigned int tdx_io_in(int size, u16 port) 20 + { 21 + struct tdx_hypercall_args args = { 22 + .r10 = TDX_HYPERCALL_STANDARD, 23 + .r11 = EXIT_REASON_IO_INSTRUCTION, 24 + .r12 = size, 25 + .r13 = 0, 26 + .r14 = port, 27 + }; 28 + 29 + if (__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT)) 30 + return UINT_MAX; 31 + 32 + return args.r11; 33 + } 34 + 35 + static inline void tdx_io_out(int size, u16 port, u32 value) 36 + { 37 + struct tdx_hypercall_args args = { 38 + .r10 = TDX_HYPERCALL_STANDARD, 39 + .r11 = EXIT_REASON_IO_INSTRUCTION, 40 + .r12 = size, 41 + .r13 = 1, 42 + .r14 = port, 43 + .r15 = value, 44 + }; 45 + 46 + __tdx_hypercall(&args, 0); 47 + } 48 + 49 + static inline u8 tdx_inb(u16 port) 50 + { 51 + return tdx_io_in(1, port); 52 + } 53 + 54 + static inline void tdx_outb(u8 value, u16 port) 55 + { 56 + tdx_io_out(1, port, value); 57 + } 58 + 59 + static inline void tdx_outw(u16 value, u16 port) 60 + { 61 + tdx_io_out(2, port, value); 62 + } 63 + 64 + void early_tdx_detect(void) 65 + { 66 + u32 eax, sig[3]; 67 + 68 + cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax, &sig[0], &sig[2], &sig[1]); 69 + 70 + if (memcmp(TDX_IDENT, sig, sizeof(sig))) 71 + return; 72 + 73 + /* Use hypercalls instead of I/O instructions */ 74 + pio_ops.f_inb = tdx_inb; 75 + pio_ops.f_outb = tdx_outb; 76 + pio_ops.f_outw = tdx_outw; 77 + }
+13
arch/x86/boot/compressed/tdx.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + #ifndef BOOT_COMPRESSED_TDX_H 3 + #define BOOT_COMPRESSED_TDX_H 4 + 5 + #include <linux/types.h> 6 + 7 + #ifdef CONFIG_INTEL_TDX_GUEST 8 + void early_tdx_detect(void); 9 + #else 10 + static inline void early_tdx_detect(void) { }; 11 + #endif 12 + 13 + #endif /* BOOT_COMPRESSED_TDX_H */
+1 -2
arch/x86/boot/cpuflags.c
··· 71 71 # define EBX_REG "=b" 72 72 #endif 73 73 74 - static inline void cpuid_count(u32 id, u32 count, 75 - u32 *a, u32 *b, u32 *c, u32 *d) 74 + void cpuid_count(u32 id, u32 count, u32 *a, u32 *b, u32 *c, u32 *d) 76 75 { 77 76 asm volatile(".ifnc %%ebx,%3 ; movl %%ebx,%3 ; .endif \n\t" 78 77 "cpuid \n\t"
+1
arch/x86/boot/cpuflags.h
··· 17 17 18 18 int has_eflag(unsigned long mask); 19 19 void get_cpuflags(void); 20 + void cpuid_count(u32 id, u32 count, u32 *a, u32 *b, u32 *c, u32 *d); 20 21 21 22 #endif
+41
arch/x86/boot/io.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + #ifndef BOOT_IO_H 3 + #define BOOT_IO_H 4 + 5 + #include <asm/shared/io.h> 6 + 7 + #undef inb 8 + #undef inw 9 + #undef inl 10 + #undef outb 11 + #undef outw 12 + #undef outl 13 + 14 + struct port_io_ops { 15 + u8 (*f_inb)(u16 port); 16 + void (*f_outb)(u8 v, u16 port); 17 + void (*f_outw)(u16 v, u16 port); 18 + }; 19 + 20 + extern struct port_io_ops pio_ops; 21 + 22 + /* 23 + * Use the normal I/O instructions by default. 24 + * TDX guests override these to use hypercalls. 25 + */ 26 + static inline void init_default_io_ops(void) 27 + { 28 + pio_ops.f_inb = __inb; 29 + pio_ops.f_outb = __outb; 30 + pio_ops.f_outw = __outw; 31 + } 32 + 33 + /* 34 + * Redirect port I/O operations via pio_ops callbacks. 35 + * TDX guests override these callbacks with TDX-specific helpers. 36 + */ 37 + #define inb pio_ops.f_inb 38 + #define outb pio_ops.f_outb 39 + #define outw pio_ops.f_outw 40 + 41 + #endif
+4
arch/x86/boot/main.c
··· 17 17 18 18 struct boot_params boot_params __attribute__((aligned(16))); 19 19 20 + struct port_io_ops pio_ops; 21 + 20 22 char *HEAP = _end; 21 23 char *heap_end = _end; /* Default end of heap = no heap */ 22 24 ··· 135 133 136 134 void main(void) 137 135 { 136 + init_default_io_ops(); 137 + 138 138 /* First, copy the boot header into the "zeropage" */ 139 139 copy_boot_params(); 140 140
+2
arch/x86/coco/Makefile
··· 4 4 CFLAGS_core.o += -fno-stack-protector 5 5 6 6 obj-y += core.o 7 + 8 + obj-$(CONFIG_INTEL_TDX_GUEST) += tdx/
+21 -1
arch/x86/coco/core.c
··· 18 18 19 19 static bool intel_cc_platform_has(enum cc_attr attr) 20 20 { 21 - return false; 21 + switch (attr) { 22 + case CC_ATTR_GUEST_UNROLL_STRING_IO: 23 + case CC_ATTR_HOTPLUG_DISABLED: 24 + case CC_ATTR_GUEST_MEM_ENCRYPT: 25 + case CC_ATTR_MEM_ENCRYPT: 26 + return true; 27 + default: 28 + return false; 29 + } 22 30 } 23 31 24 32 /* ··· 98 90 99 91 u64 cc_mkenc(u64 val) 100 92 { 93 + /* 94 + * Both AMD and Intel use a bit in the page table to indicate 95 + * encryption status of the page. 96 + * 97 + * - for AMD, bit *set* means the page is encrypted 98 + * - for Intel *clear* means encrypted. 99 + */ 101 100 switch (vendor) { 102 101 case CC_VENDOR_AMD: 103 102 return val | cc_mask; 103 + case CC_VENDOR_INTEL: 104 + return val & ~cc_mask; 104 105 default: 105 106 return val; 106 107 } ··· 117 100 118 101 u64 cc_mkdec(u64 val) 119 102 { 103 + /* See comment in cc_mkenc() */ 120 104 switch (vendor) { 121 105 case CC_VENDOR_AMD: 122 106 return val & ~cc_mask; 107 + case CC_VENDOR_INTEL: 108 + return val | cc_mask; 123 109 default: 124 110 return val; 125 111 }
+3
arch/x86/coco/tdx/Makefile
··· 1 + # SPDX-License-Identifier: GPL-2.0 2 + 3 + obj-y += tdx.o tdcall.o
+205
arch/x86/coco/tdx/tdcall.S
··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + #include <asm/asm-offsets.h> 3 + #include <asm/asm.h> 4 + #include <asm/frame.h> 5 + #include <asm/unwind_hints.h> 6 + 7 + #include <linux/linkage.h> 8 + #include <linux/bits.h> 9 + #include <linux/errno.h> 10 + 11 + #include "../../virt/vmx/tdx/tdxcall.S" 12 + 13 + /* 14 + * Bitmasks of exposed registers (with VMM). 15 + */ 16 + #define TDX_R10 BIT(10) 17 + #define TDX_R11 BIT(11) 18 + #define TDX_R12 BIT(12) 19 + #define TDX_R13 BIT(13) 20 + #define TDX_R14 BIT(14) 21 + #define TDX_R15 BIT(15) 22 + 23 + /* 24 + * These registers are clobbered to hold arguments for each 25 + * TDVMCALL. They are safe to expose to the VMM. 26 + * Each bit in this mask represents a register ID. Bit field 27 + * details can be found in TDX GHCI specification, section 28 + * titled "TDCALL [TDG.VP.VMCALL] leaf". 29 + */ 30 + #define TDVMCALL_EXPOSE_REGS_MASK ( TDX_R10 | TDX_R11 | \ 31 + TDX_R12 | TDX_R13 | \ 32 + TDX_R14 | TDX_R15 ) 33 + 34 + /* 35 + * __tdx_module_call() - Used by TDX guests to request services from 36 + * the TDX module (does not include VMM services) using TDCALL instruction. 37 + * 38 + * Transforms function call register arguments into the TDCALL register ABI. 39 + * After TDCALL operation, TDX module output is saved in @out (if it is 40 + * provided by the user). 41 + * 42 + *------------------------------------------------------------------------- 43 + * TDCALL ABI: 44 + *------------------------------------------------------------------------- 45 + * Input Registers: 46 + * 47 + * RAX - TDCALL Leaf number. 48 + * RCX,RDX,R8-R9 - TDCALL Leaf specific input registers. 49 + * 50 + * Output Registers: 51 + * 52 + * RAX - TDCALL instruction error code. 53 + * RCX,RDX,R8-R11 - TDCALL Leaf specific output registers. 
54 + * 55 + *------------------------------------------------------------------------- 56 + * 57 + * __tdx_module_call() function ABI: 58 + * 59 + * @fn (RDI) - TDCALL Leaf ID, moved to RAX 60 + * @rcx (RSI) - Input parameter 1, moved to RCX 61 + * @rdx (RDX) - Input parameter 2, moved to RDX 62 + * @r8 (RCX) - Input parameter 3, moved to R8 63 + * @r9 (R8) - Input parameter 4, moved to R9 64 + * 65 + * @out (R9) - struct tdx_module_output pointer 66 + * stored temporarily in R12 (not 67 + * shared with the TDX module). It 68 + * can be NULL. 69 + * 70 + * Return status of TDCALL via RAX. 71 + */ 72 + SYM_FUNC_START(__tdx_module_call) 73 + FRAME_BEGIN 74 + TDX_MODULE_CALL host=0 75 + FRAME_END 76 + RET 77 + SYM_FUNC_END(__tdx_module_call) 78 + 79 + /* 80 + * __tdx_hypercall() - Make hypercalls to a TDX VMM using TDVMCALL leaf 81 + * of TDCALL instruction 82 + * 83 + * Transforms values in function call argument struct tdx_hypercall_args @args 84 + * into the TDCALL register ABI. After TDCALL operation, VMM output is saved 85 + * back in @args. 86 + * 87 + *------------------------------------------------------------------------- 88 + * TD VMCALL ABI: 89 + *------------------------------------------------------------------------- 90 + * 91 + * Input Registers: 92 + * 93 + * RAX - TDCALL instruction leaf number (0 - TDG.VP.VMCALL) 94 + * RCX - BITMAP which controls which part of TD Guest GPR 95 + * is passed as-is to the VMM and back. 96 + * R10 - Set 0 to indicate TDCALL follows standard TDX ABI 97 + * specification. Non zero value indicates vendor 98 + * specific ABI. 99 + * R11 - VMCALL sub function number 100 + * RBX, RBP, RDI, RSI - Used to pass VMCALL sub function specific arguments. 101 + * R8-R9, R12-R15 - Same as above. 102 + * 103 + * Output Registers: 104 + * 105 + * RAX - TDCALL instruction status (Not related to hypercall 106 + * output). 107 + * R10 - Hypercall output error code. 108 + * R11-R15 - Hypercall sub function specific output values. 
109 + * 110 + *------------------------------------------------------------------------- 111 + * 112 + * __tdx_hypercall() function ABI: 113 + * 114 + * @args (RDI) - struct tdx_hypercall_args for input and output 115 + * @flags (RSI) - TDX_HCALL_* flags 116 + * 117 + * On successful completion, return the hypercall error code. 118 + */ 119 + SYM_FUNC_START(__tdx_hypercall) 120 + FRAME_BEGIN 121 + 122 + /* Save callee-saved GPRs as mandated by the x86_64 ABI */ 123 + push %r15 124 + push %r14 125 + push %r13 126 + push %r12 127 + 128 + /* Mangle function call ABI into TDCALL ABI: */ 129 + /* Set TDCALL leaf ID (TDVMCALL (0)) in RAX */ 130 + xor %eax, %eax 131 + 132 + /* Copy hypercall registers from arg struct: */ 133 + movq TDX_HYPERCALL_r10(%rdi), %r10 134 + movq TDX_HYPERCALL_r11(%rdi), %r11 135 + movq TDX_HYPERCALL_r12(%rdi), %r12 136 + movq TDX_HYPERCALL_r13(%rdi), %r13 137 + movq TDX_HYPERCALL_r14(%rdi), %r14 138 + movq TDX_HYPERCALL_r15(%rdi), %r15 139 + 140 + movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx 141 + 142 + /* 143 + * For the idle loop STI needs to be called directly before the TDCALL 144 + * that enters idle (EXIT_REASON_HLT case). STI instruction enables 145 + * interrupts only one instruction later. If there is a window between 146 + * STI and the instruction that emulates the HALT state, there is a 147 + * chance for interrupts to happen in this window, which can delay the 148 + * HLT operation indefinitely. Since this is not the desired 149 + * result, conditionally call STI before TDCALL. 150 + */ 151 + testq $TDX_HCALL_ISSUE_STI, %rsi 152 + jz .Lskip_sti 153 + sti 154 + .Lskip_sti: 155 + tdcall 156 + 157 + /* 158 + * RAX!=0 indicates a failure of the TDVMCALL mechanism itself and that 159 + * something has gone horribly wrong with the TDX module. 160 + * 161 + * The return status of the hypercall operation is in a separate 162 + * register (in R10). Hypercall errors are a part of normal operation 163 + * and are handled by callers. 
164 + */ 165 + testq %rax, %rax 166 + jne .Lpanic 167 + 168 + /* TDVMCALL leaf return code is in R10 */ 169 + movq %r10, %rax 170 + 171 + /* Copy hypercall result registers to arg struct if needed */ 172 + testq $TDX_HCALL_HAS_OUTPUT, %rsi 173 + jz .Lout 174 + 175 + movq %r10, TDX_HYPERCALL_r10(%rdi) 176 + movq %r11, TDX_HYPERCALL_r11(%rdi) 177 + movq %r12, TDX_HYPERCALL_r12(%rdi) 178 + movq %r13, TDX_HYPERCALL_r13(%rdi) 179 + movq %r14, TDX_HYPERCALL_r14(%rdi) 180 + movq %r15, TDX_HYPERCALL_r15(%rdi) 181 + .Lout: 182 + /* 183 + * Zero out registers exposed to the VMM to avoid speculative execution 184 + * with VMM-controlled values. This needs to include all registers 185 + * present in TDVMCALL_EXPOSE_REGS_MASK (except R12-R15). R12-R15 186 + * context will be restored. 187 + */ 188 + xor %r10d, %r10d 189 + xor %r11d, %r11d 190 + 191 + /* Restore callee-saved GPRs as mandated by the x86_64 ABI */ 192 + pop %r12 193 + pop %r13 194 + pop %r14 195 + pop %r15 196 + 197 + FRAME_END 198 + 199 + RET 200 + .Lpanic: 201 + call __tdx_hypercall_failed 202 + /* __tdx_hypercall_failed never returns */ 203 + REACHABLE 204 + jmp .Lpanic 205 + SYM_FUNC_END(__tdx_hypercall)
+692
arch/x86/coco/tdx/tdx.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* Copyright (C) 2021-2022 Intel Corporation */ 3 + 4 + #undef pr_fmt 5 + #define pr_fmt(fmt) "tdx: " fmt 6 + 7 + #include <linux/cpufeature.h> 8 + #include <asm/coco.h> 9 + #include <asm/tdx.h> 10 + #include <asm/vmx.h> 11 + #include <asm/insn.h> 12 + #include <asm/insn-eval.h> 13 + #include <asm/pgtable.h> 14 + 15 + /* TDX module Call Leaf IDs */ 16 + #define TDX_GET_INFO 1 17 + #define TDX_GET_VEINFO 3 18 + #define TDX_ACCEPT_PAGE 6 19 + 20 + /* TDX hypercall Leaf IDs */ 21 + #define TDVMCALL_MAP_GPA 0x10001 22 + 23 + /* MMIO direction */ 24 + #define EPT_READ 0 25 + #define EPT_WRITE 1 26 + 27 + /* Port I/O direction */ 28 + #define PORT_READ 0 29 + #define PORT_WRITE 1 30 + 31 + /* See Exit Qualification for I/O Instructions in VMX documentation */ 32 + #define VE_IS_IO_IN(e) ((e) & BIT(3)) 33 + #define VE_GET_IO_SIZE(e) (((e) & GENMASK(2, 0)) + 1) 34 + #define VE_GET_PORT_NUM(e) ((e) >> 16) 35 + #define VE_IS_IO_STRING(e) ((e) & BIT(4)) 36 + 37 + /* 38 + * Wrapper for standard use of __tdx_hypercall with no output aside from 39 + * return code. 40 + */ 41 + static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15) 42 + { 43 + struct tdx_hypercall_args args = { 44 + .r10 = TDX_HYPERCALL_STANDARD, 45 + .r11 = fn, 46 + .r12 = r12, 47 + .r13 = r13, 48 + .r14 = r14, 49 + .r15 = r15, 50 + }; 51 + 52 + return __tdx_hypercall(&args, 0); 53 + } 54 + 55 + /* Called from __tdx_hypercall() for unrecoverable failure */ 56 + void __tdx_hypercall_failed(void) 57 + { 58 + panic("TDVMCALL failed. TDX module bug?"); 59 + } 60 + 61 + /* 62 + * The TDG.VP.VMCALL-Instruction-execution sub-functions are defined 63 + * independently from but are currently matched 1:1 with VMX EXIT_REASONs. 64 + * Reusing the KVM EXIT_REASON macros makes it easier to connect the host and 65 + * guest sides of these calls. 
66 + */ 67 + static u64 hcall_func(u64 exit_reason) 68 + { 69 + return exit_reason; 70 + } 71 + 72 + #ifdef CONFIG_KVM_GUEST 73 + long tdx_kvm_hypercall(unsigned int nr, unsigned long p1, unsigned long p2, 74 + unsigned long p3, unsigned long p4) 75 + { 76 + struct tdx_hypercall_args args = { 77 + .r10 = nr, 78 + .r11 = p1, 79 + .r12 = p2, 80 + .r13 = p3, 81 + .r14 = p4, 82 + }; 83 + 84 + return __tdx_hypercall(&args, 0); 85 + } 86 + EXPORT_SYMBOL_GPL(tdx_kvm_hypercall); 87 + #endif 88 + 89 + /* 90 + * Used for TDX guests to make calls directly to the TD module. This 91 + * should only be used for calls that have no legitimate reason to fail 92 + * or where the kernel can not survive the call failing. 93 + */ 94 + static inline void tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9, 95 + struct tdx_module_output *out) 96 + { 97 + if (__tdx_module_call(fn, rcx, rdx, r8, r9, out)) 98 + panic("TDCALL %lld failed (Buggy TDX module!)\n", fn); 99 + } 100 + 101 + static u64 get_cc_mask(void) 102 + { 103 + struct tdx_module_output out; 104 + unsigned int gpa_width; 105 + 106 + /* 107 + * TDINFO TDX module call is used to get the TD execution environment 108 + * information like GPA width, number of available vcpus, debug mode 109 + * information, etc. More details about the ABI can be found in TDX 110 + * Guest-Host-Communication Interface (GHCI), section 2.4.2 TDCALL 111 + * [TDG.VP.INFO]. 112 + * 113 + * The GPA width that comes out of this call is critical. TDX guests 114 + * can not meaningfully run without it. 115 + */ 116 + tdx_module_call(TDX_GET_INFO, 0, 0, 0, 0, &out); 117 + 118 + gpa_width = out.rcx & GENMASK(5, 0); 119 + 120 + /* 121 + * The highest bit of a guest physical address is the "sharing" bit. 122 + * Set it for shared pages and clear it for private pages. 
123 + */ 124 + return BIT_ULL(gpa_width - 1); 125 + } 126 + 127 + static u64 __cpuidle __halt(const bool irq_disabled, const bool do_sti) 128 + { 129 + struct tdx_hypercall_args args = { 130 + .r10 = TDX_HYPERCALL_STANDARD, 131 + .r11 = hcall_func(EXIT_REASON_HLT), 132 + .r12 = irq_disabled, 133 + }; 134 + 135 + /* 136 + * Emulate HLT operation via hypercall. More info about ABI 137 + * can be found in TDX Guest-Host-Communication Interface 138 + * (GHCI), section 3.8 TDG.VP.VMCALL<Instruction.HLT>. 139 + * 140 + * The VMM uses the "IRQ disabled" param to understand IRQ 141 + * enabled status (RFLAGS.IF) of the TD guest and to determine 142 + * whether or not it should schedule the halted vCPU if an 143 + * IRQ becomes pending. E.g. if IRQs are disabled, the VMM 144 + * can keep the vCPU in virtual HLT, even if an IRQ is 145 + * pending, without hanging/breaking the guest. 146 + */ 147 + return __tdx_hypercall(&args, do_sti ? TDX_HCALL_ISSUE_STI : 0); 148 + } 149 + 150 + static bool handle_halt(void) 151 + { 152 + /* 153 + * Since non safe halt is mainly used in CPU offlining 154 + * and the guest will always stay in the halt state, don't 155 + * call the STI instruction (set do_sti as false). 156 + */ 157 + const bool irq_disabled = irqs_disabled(); 158 + const bool do_sti = false; 159 + 160 + if (__halt(irq_disabled, do_sti)) 161 + return false; 162 + 163 + return true; 164 + } 165 + 166 + void __cpuidle tdx_safe_halt(void) 167 + { 168 + /* 169 + * For do_sti=true case, __tdx_hypercall() function enables 170 + * interrupts using the STI instruction before the TDCALL. So 171 + * set irq_disabled as false. 172 + */ 173 + const bool irq_disabled = false; 174 + const bool do_sti = true; 175 + 176 + /* 177 + * Use WARN_ONCE() to report the failure. 
178 + */ 179 + if (__halt(irq_disabled, do_sti)) 180 + WARN_ONCE(1, "HLT instruction emulation failed\n"); 181 + } 182 + 183 + static bool read_msr(struct pt_regs *regs) 184 + { 185 + struct tdx_hypercall_args args = { 186 + .r10 = TDX_HYPERCALL_STANDARD, 187 + .r11 = hcall_func(EXIT_REASON_MSR_READ), 188 + .r12 = regs->cx, 189 + }; 190 + 191 + /* 192 + * Emulate the MSR read via hypercall. More info about ABI 193 + * can be found in TDX Guest-Host-Communication Interface 194 + * (GHCI), section titled "TDG.VP.VMCALL<Instruction.RDMSR>". 195 + */ 196 + if (__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT)) 197 + return false; 198 + 199 + regs->ax = lower_32_bits(args.r11); 200 + regs->dx = upper_32_bits(args.r11); 201 + return true; 202 + } 203 + 204 + static bool write_msr(struct pt_regs *regs) 205 + { 206 + struct tdx_hypercall_args args = { 207 + .r10 = TDX_HYPERCALL_STANDARD, 208 + .r11 = hcall_func(EXIT_REASON_MSR_WRITE), 209 + .r12 = regs->cx, 210 + .r13 = (u64)regs->dx << 32 | regs->ax, 211 + }; 212 + 213 + /* 214 + * Emulate the MSR write via hypercall. More info about ABI 215 + * can be found in TDX Guest-Host-Communication Interface 216 + * (GHCI) section titled "TDG.VP.VMCALL<Instruction.WRMSR>". 217 + */ 218 + return !__tdx_hypercall(&args, 0); 219 + } 220 + 221 + static bool handle_cpuid(struct pt_regs *regs) 222 + { 223 + struct tdx_hypercall_args args = { 224 + .r10 = TDX_HYPERCALL_STANDARD, 225 + .r11 = hcall_func(EXIT_REASON_CPUID), 226 + .r12 = regs->ax, 227 + .r13 = regs->cx, 228 + }; 229 + 230 + /* 231 + * Only allow VMM to control range reserved for hypervisor 232 + * communication. 233 + * 234 + * Return all-zeros for any CPUID outside the range. It matches CPU 235 + * behaviour for non-supported leaf. 236 + */ 237 + if (regs->ax < 0x40000000 || regs->ax > 0x4FFFFFFF) { 238 + regs->ax = regs->bx = regs->cx = regs->dx = 0; 239 + return true; 240 + } 241 + 242 + /* 243 + * Emulate the CPUID instruction via a hypercall. 
More info about 244 + * ABI can be found in TDX Guest-Host-Communication Interface 245 + * (GHCI), section titled "VP.VMCALL<Instruction.CPUID>". 246 + */ 247 + if (__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT)) 248 + return false; 249 + 250 + /* 251 + * As per TDX GHCI CPUID ABI, r12-r15 registers contain contents of 252 + * EAX, EBX, ECX, EDX registers after the CPUID instruction execution. 253 + * So copy the register contents back to pt_regs. 254 + */ 255 + regs->ax = args.r12; 256 + regs->bx = args.r13; 257 + regs->cx = args.r14; 258 + regs->dx = args.r15; 259 + 260 + return true; 261 + } 262 + 263 + static bool mmio_read(int size, unsigned long addr, unsigned long *val) 264 + { 265 + struct tdx_hypercall_args args = { 266 + .r10 = TDX_HYPERCALL_STANDARD, 267 + .r11 = hcall_func(EXIT_REASON_EPT_VIOLATION), 268 + .r12 = size, 269 + .r13 = EPT_READ, 270 + .r14 = addr, 271 + .r15 = *val, 272 + }; 273 + 274 + if (__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT)) 275 + return false; 276 + *val = args.r11; 277 + return true; 278 + } 279 + 280 + static bool mmio_write(int size, unsigned long addr, unsigned long val) 281 + { 282 + return !_tdx_hypercall(hcall_func(EXIT_REASON_EPT_VIOLATION), size, 283 + EPT_WRITE, addr, val); 284 + } 285 + 286 + static bool handle_mmio(struct pt_regs *regs, struct ve_info *ve) 287 + { 288 + char buffer[MAX_INSN_SIZE]; 289 + unsigned long *reg, val; 290 + struct insn insn = {}; 291 + enum mmio_type mmio; 292 + int size, extend_size; 293 + u8 extend_val = 0; 294 + 295 + /* Only in-kernel MMIO is supported */ 296 + if (WARN_ON_ONCE(user_mode(regs))) 297 + return false; 298 + 299 + if (copy_from_kernel_nofault(buffer, (void *)regs->ip, MAX_INSN_SIZE)) 300 + return false; 301 + 302 + if (insn_decode(&insn, buffer, MAX_INSN_SIZE, INSN_MODE_64)) 303 + return false; 304 + 305 + mmio = insn_decode_mmio(&insn, &size); 306 + if (WARN_ON_ONCE(mmio == MMIO_DECODE_FAILED)) 307 + return false; 308 + 309 + if (mmio != MMIO_WRITE_IMM && mmio != MMIO_MOVS) 
{ 310 + reg = insn_get_modrm_reg_ptr(&insn, regs); 311 + if (!reg) 312 + return false; 313 + } 314 + 315 + ve->instr_len = insn.length; 316 + 317 + /* Handle writes first */ 318 + switch (mmio) { 319 + case MMIO_WRITE: 320 + memcpy(&val, reg, size); 321 + return mmio_write(size, ve->gpa, val); 322 + case MMIO_WRITE_IMM: 323 + val = insn.immediate.value; 324 + return mmio_write(size, ve->gpa, val); 325 + case MMIO_READ: 326 + case MMIO_READ_ZERO_EXTEND: 327 + case MMIO_READ_SIGN_EXTEND: 328 + /* Reads are handled below */ 329 + break; 330 + case MMIO_MOVS: 331 + case MMIO_DECODE_FAILED: 332 + /* 333 + * MMIO was accessed with an instruction that could not be 334 + * decoded or handled properly. It was likely not using io.h 335 + * helpers or accessed MMIO accidentally. 336 + */ 337 + return false; 338 + default: 339 + WARN_ONCE(1, "Unknown insn_decode_mmio() decode value?"); 340 + return false; 341 + } 342 + 343 + /* Handle reads */ 344 + if (!mmio_read(size, ve->gpa, &val)) 345 + return false; 346 + 347 + switch (mmio) { 348 + case MMIO_READ: 349 + /* Zero-extend for 32-bit operation */ 350 + extend_size = size == 4 ? 
sizeof(*reg) : 0; 351 + break; 352 + case MMIO_READ_ZERO_EXTEND: 353 + /* Zero extend based on operand size */ 354 + extend_size = insn.opnd_bytes; 355 + break; 356 + case MMIO_READ_SIGN_EXTEND: 357 + /* Sign extend based on operand size */ 358 + extend_size = insn.opnd_bytes; 359 + if (size == 1 && val & BIT(7)) 360 + extend_val = 0xFF; 361 + else if (size > 1 && val & BIT(15)) 362 + extend_val = 0xFF; 363 + break; 364 + default: 365 + /* All other cases has to be covered with the first switch() */ 366 + WARN_ON_ONCE(1); 367 + return false; 368 + } 369 + 370 + if (extend_size) 371 + memset(reg, extend_val, extend_size); 372 + memcpy(reg, &val, size); 373 + return true; 374 + } 375 + 376 + static bool handle_in(struct pt_regs *regs, int size, int port) 377 + { 378 + struct tdx_hypercall_args args = { 379 + .r10 = TDX_HYPERCALL_STANDARD, 380 + .r11 = hcall_func(EXIT_REASON_IO_INSTRUCTION), 381 + .r12 = size, 382 + .r13 = PORT_READ, 383 + .r14 = port, 384 + }; 385 + u64 mask = GENMASK(BITS_PER_BYTE * size, 0); 386 + bool success; 387 + 388 + /* 389 + * Emulate the I/O read via hypercall. More info about ABI can be found 390 + * in TDX Guest-Host-Communication Interface (GHCI) section titled 391 + * "TDG.VP.VMCALL<Instruction.IO>". 392 + */ 393 + success = !__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT); 394 + 395 + /* Update part of the register affected by the emulated instruction */ 396 + regs->ax &= ~mask; 397 + if (success) 398 + regs->ax |= args.r11 & mask; 399 + 400 + return success; 401 + } 402 + 403 + static bool handle_out(struct pt_regs *regs, int size, int port) 404 + { 405 + u64 mask = GENMASK(BITS_PER_BYTE * size, 0); 406 + 407 + /* 408 + * Emulate the I/O write via hypercall. More info about ABI can be found 409 + * in TDX Guest-Host-Communication Interface (GHCI) section titled 410 + * "TDG.VP.VMCALL<Instruction.IO>". 
411 + */ 412 + return !_tdx_hypercall(hcall_func(EXIT_REASON_IO_INSTRUCTION), size, 413 + PORT_WRITE, port, regs->ax & mask); 414 + } 415 + 416 + /* 417 + * Emulate I/O using hypercall. 418 + * 419 + * Assumes the IO instruction was using ax, which is enforced 420 + * by the standard io.h macros. 421 + * 422 + * Return True on success or False on failure. 423 + */ 424 + static bool handle_io(struct pt_regs *regs, u32 exit_qual) 425 + { 426 + int size, port; 427 + bool in; 428 + 429 + if (VE_IS_IO_STRING(exit_qual)) 430 + return false; 431 + 432 + in = VE_IS_IO_IN(exit_qual); 433 + size = VE_GET_IO_SIZE(exit_qual); 434 + port = VE_GET_PORT_NUM(exit_qual); 435 + 436 + 437 + if (in) 438 + return handle_in(regs, size, port); 439 + else 440 + return handle_out(regs, size, port); 441 + } 442 + 443 + /* 444 + * Early #VE exception handler. Only handles a subset of port I/O. 445 + * Intended only for earlyprintk. If failed, return false. 446 + */ 447 + __init bool tdx_early_handle_ve(struct pt_regs *regs) 448 + { 449 + struct ve_info ve; 450 + 451 + tdx_get_ve_info(&ve); 452 + 453 + if (ve.exit_reason != EXIT_REASON_IO_INSTRUCTION) 454 + return false; 455 + 456 + return handle_io(regs, ve.exit_qual); 457 + } 458 + 459 + void tdx_get_ve_info(struct ve_info *ve) 460 + { 461 + struct tdx_module_output out; 462 + 463 + /* 464 + * Called during #VE handling to retrieve the #VE info from the 465 + * TDX module. 466 + * 467 + * This has to be called early in #VE handling. A "nested" #VE which 468 + * occurs before this will raise a #DF and is not recoverable. 469 + * 470 + * The call retrieves the #VE info from the TDX module, which also 471 + * clears the "#VE valid" flag. This must be done before anything else 472 + * because any #VE that occurs while the valid flag is set will lead to 473 + * #DF. 474 + * 475 + * Note, the TDX module treats virtual NMIs as inhibited if the #VE 476 + * valid flag is set. It means that NMI=>#VE will not result in a #DF. 
477 + */ 478 + tdx_module_call(TDX_GET_VEINFO, 0, 0, 0, 0, &out); 479 + 480 + /* Transfer the output parameters */ 481 + ve->exit_reason = out.rcx; 482 + ve->exit_qual = out.rdx; 483 + ve->gla = out.r8; 484 + ve->gpa = out.r9; 485 + ve->instr_len = lower_32_bits(out.r10); 486 + ve->instr_info = upper_32_bits(out.r10); 487 + } 488 + 489 + /* Handle the user initiated #VE */ 490 + static bool virt_exception_user(struct pt_regs *regs, struct ve_info *ve) 491 + { 492 + switch (ve->exit_reason) { 493 + case EXIT_REASON_CPUID: 494 + return handle_cpuid(regs); 495 + default: 496 + pr_warn("Unexpected #VE: %lld\n", ve->exit_reason); 497 + return false; 498 + } 499 + } 500 + 501 + /* Handle the kernel #VE */ 502 + static bool virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve) 503 + { 504 + switch (ve->exit_reason) { 505 + case EXIT_REASON_HLT: 506 + return handle_halt(); 507 + case EXIT_REASON_MSR_READ: 508 + return read_msr(regs); 509 + case EXIT_REASON_MSR_WRITE: 510 + return write_msr(regs); 511 + case EXIT_REASON_CPUID: 512 + return handle_cpuid(regs); 513 + case EXIT_REASON_EPT_VIOLATION: 514 + return handle_mmio(regs, ve); 515 + case EXIT_REASON_IO_INSTRUCTION: 516 + return handle_io(regs, ve->exit_qual); 517 + default: 518 + pr_warn("Unexpected #VE: %lld\n", ve->exit_reason); 519 + return false; 520 + } 521 + } 522 + 523 + bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve) 524 + { 525 + bool ret; 526 + 527 + if (user_mode(regs)) 528 + ret = virt_exception_user(regs, ve); 529 + else 530 + ret = virt_exception_kernel(regs, ve); 531 + 532 + /* After successful #VE handling, move the IP */ 533 + if (ret) 534 + regs->ip += ve->instr_len; 535 + 536 + return ret; 537 + } 538 + 539 + static bool tdx_tlb_flush_required(bool private) 540 + { 541 + /* 542 + * TDX guest is responsible for flushing TLB on private->shared 543 + * transition. VMM is responsible for flushing on shared->private. 
544 + * 545 + * The VMM _can't_ flush private addresses as it can't generate PAs 546 + * with the guest's HKID. Shared memory isn't subject to integrity 547 + * checking, i.e. the VMM doesn't need to flush for its own protection. 548 + * 549 + * There's no need to flush when converting from shared to private, 550 + * as flushing is the VMM's responsibility in this case, e.g. it must 551 + * flush to avoid integrity failures in the face of a buggy or 552 + * malicious guest. 553 + */ 554 + return !private; 555 + } 556 + 557 + static bool tdx_cache_flush_required(void) 558 + { 559 + /* 560 + * AMD SME/SEV can avoid cache flushing if HW enforces cache coherence. 561 + * TDX doesn't have such capability. 562 + * 563 + * Flush cache unconditionally. 564 + */ 565 + return true; 566 + } 567 + 568 + static bool try_accept_one(phys_addr_t *start, unsigned long len, 569 + enum pg_level pg_level) 570 + { 571 + unsigned long accept_size = page_level_size(pg_level); 572 + u64 tdcall_rcx; 573 + u8 page_size; 574 + 575 + if (!IS_ALIGNED(*start, accept_size)) 576 + return false; 577 + 578 + if (len < accept_size) 579 + return false; 580 + 581 + /* 582 + * Pass the page physical address to the TDX module to accept the 583 + * pending, private page. 584 + * 585 + * Bits 2:0 of RCX encode page size: 0 - 4K, 1 - 2M, 2 - 1G. 586 + */ 587 + switch (pg_level) { 588 + case PG_LEVEL_4K: 589 + page_size = 0; 590 + break; 591 + case PG_LEVEL_2M: 592 + page_size = 1; 593 + break; 594 + case PG_LEVEL_1G: 595 + page_size = 2; 596 + break; 597 + default: 598 + return false; 599 + } 600 + 601 + tdcall_rcx = *start | page_size; 602 + if (__tdx_module_call(TDX_ACCEPT_PAGE, tdcall_rcx, 0, 0, 0, NULL)) 603 + return false; 604 + 605 + *start += accept_size; 606 + return true; 607 + } 608 + 609 + /* 610 + * Inform the VMM of the guest's intent for this physical page: shared with 611 + * the VMM or private to the guest. The VMM is expected to change its mapping 612 + * of the page in response. 
613 + */ 614 + static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc) 615 + { 616 + phys_addr_t start = __pa(vaddr); 617 + phys_addr_t end = __pa(vaddr + numpages * PAGE_SIZE); 618 + 619 + if (!enc) { 620 + /* Set the shared (decrypted) bits: */ 621 + start |= cc_mkdec(0); 622 + end |= cc_mkdec(0); 623 + } 624 + 625 + /* 626 + * Notify the VMM about page mapping conversion. More info about ABI 627 + * can be found in TDX Guest-Host-Communication Interface (GHCI), 628 + * section "TDG.VP.VMCALL<MapGPA>" 629 + */ 630 + if (_tdx_hypercall(TDVMCALL_MAP_GPA, start, end - start, 0, 0)) 631 + return false; 632 + 633 + /* private->shared conversion requires only MapGPA call */ 634 + if (!enc) 635 + return true; 636 + 637 + /* 638 + * For shared->private conversion, accept the page using 639 + * TDX_ACCEPT_PAGE TDX module call. 640 + */ 641 + while (start < end) { 642 + unsigned long len = end - start; 643 + 644 + /* 645 + * Try larger accepts first. It gives chance to VMM to keep 646 + * 1G/2M SEPT entries where possible and speeds up process by 647 + * cutting number of hypercalls (if successful). 648 + */ 649 + 650 + if (try_accept_one(&start, len, PG_LEVEL_1G)) 651 + continue; 652 + 653 + if (try_accept_one(&start, len, PG_LEVEL_2M)) 654 + continue; 655 + 656 + if (!try_accept_one(&start, len, PG_LEVEL_4K)) 657 + return false; 658 + } 659 + 660 + return true; 661 + } 662 + 663 + void __init tdx_early_init(void) 664 + { 665 + u64 cc_mask; 666 + u32 eax, sig[3]; 667 + 668 + cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax, &sig[0], &sig[2], &sig[1]); 669 + 670 + if (memcmp(TDX_IDENT, sig, sizeof(sig))) 671 + return; 672 + 673 + setup_force_cpu_cap(X86_FEATURE_TDX_GUEST); 674 + 675 + cc_set_vendor(CC_VENDOR_INTEL); 676 + cc_mask = get_cc_mask(); 677 + cc_set_mask(cc_mask); 678 + 679 + /* 680 + * All bits above GPA width are reserved and kernel treats shared bit 681 + * as flag, not as part of physical address. 
682 + * 683 + * Adjust physical mask to only cover valid GPA bits. 684 + */ 685 + physical_mask &= cc_mask - 1; 686 + 687 + x86_platform.guest.enc_cache_flush_required = tdx_cache_flush_required; 688 + x86_platform.guest.enc_tlb_flush_required = tdx_tlb_flush_required; 689 + x86_platform.guest.enc_status_change_finish = tdx_enc_status_changed; 690 + 691 + pr_info("Guest detected\n"); 692 + }
+13 -1
arch/x86/include/asm/acenv.h
···
13 13
14 14  /* Asm macros */
15 15
16     -  #define ACPI_FLUSH_CPU_CACHE() wbinvd()
16     +  /*
17     +   * ACPI_FLUSH_CPU_CACHE() flushes caches on entering sleep states.
18     +   * It is required to prevent data loss.
19     +   *
20     +   * While running inside virtual machine, the kernel can bypass cache flushing.
21     +   * Changing sleep state in a virtual machine doesn't affect the host system
22     +   * sleep state and cannot lead to data loss.
23     +   */
24     +  #define ACPI_FLUSH_CPU_CACHE()	\
25     +  do {				\
26     +  	if (!cpu_feature_enabled(X86_FEATURE_HYPERVISOR))	\
27     +  		wbinvd();	\
28     +  } while (0)
17 29
18 30  int __acpi_acquire_global_lock(unsigned int *lock);
19 31  int __acpi_release_global_lock(unsigned int *lock);
+7
arch/x86/include/asm/apic.h
···
328 328
329 329  	/* wakeup_secondary_cpu */
330 330  	int (*wakeup_secondary_cpu)(int apicid, unsigned long start_eip);
331     +  	/* wakeup secondary CPU using 64-bit wakeup point */
332     +  	int (*wakeup_secondary_cpu_64)(int apicid, unsigned long start_eip);
331 333
332 334  	void (*inquire_remote_apic)(int apicid);
333 335
···
489 487
490 488  	return apic->get_apic_id(reg);
491 489  }
490     +
491     +  #ifdef CONFIG_X86_64
492     +  typedef int (*wakeup_cpu_handler)(int apicid, unsigned long start_eip);
493     +  extern void acpi_wake_cpu_handler_update(wakeup_cpu_handler handler);
494     +  #endif
492 495
493 496  extern int default_apic_id_valid(u32 apicid);
494 497  extern int default_acpi_madt_oem_check(char *, char *);
+1
arch/x86/include/asm/cpufeatures.h
···
238 238  #define X86_FEATURE_VMW_VMMCALL	( 8*32+19) /* "" VMware prefers VMMCALL hypercall instruction */
239 239  #define X86_FEATURE_PVUNLOCK	( 8*32+20) /* "" PV unlock function */
240 240  #define X86_FEATURE_VCPUPREEMPT	( 8*32+21) /* "" PV vcpu_is_preempted function */
241     +  #define X86_FEATURE_TDX_GUEST	( 8*32+22) /* Intel Trust Domain Extensions Guest */
241 242
242 243  /* Intel-defined CPU features, CPUID level 0x00000007:0 (EBX), word 9 */
243 244  #define X86_FEATURE_FSGSBASE	( 9*32+ 0) /* RDFSBASE, WRFSBASE, RDGSBASE, WRGSBASE instructions*/
+7 -1
arch/x86/include/asm/disabled-features.h
···
68 68  # define DISABLE_SGX	(1 << (X86_FEATURE_SGX & 31))
69 69  #endif
70 70
71     +  #ifdef CONFIG_INTEL_TDX_GUEST
72     +  # define DISABLE_TDX_GUEST	0
73     +  #else
74     +  # define DISABLE_TDX_GUEST	(1 << (X86_FEATURE_TDX_GUEST & 31))
75     +  #endif
76     +
71 77  /*
72 78   * Make sure to add features to the correct mask
73 79   */
···
85 79  #define DISABLED_MASK5	0
86 80  #define DISABLED_MASK6	0
87 81  #define DISABLED_MASK7	(DISABLE_PTI)
88     -  #define DISABLED_MASK8	0
82     +  #define DISABLED_MASK8	(DISABLE_TDX_GUEST)
89 83  #define DISABLED_MASK9	(DISABLE_SMAP|DISABLE_SGX)
90 84  #define DISABLED_MASK10	0
91 85  #define DISABLED_MASK11	0
+4
arch/x86/include/asm/idtentry.h
···
632 632  DECLARE_IDTENTRY_RAW(X86_TRAP_OTHER, exc_xen_unknown_trap);
633 633  #endif
634 634
635     +  #ifdef CONFIG_INTEL_TDX_GUEST
636     +  DECLARE_IDTENTRY(X86_TRAP_VE, exc_virtualization_exception);
637     +  #endif
638     +
635 639  /* Device interrupts common/spurious */
636 640  DECLARE_IDTENTRY_IRQ(X86_TRAP_OTHER, common_interrupt);
637 641  #ifdef CONFIG_X86_LOCAL_APIC
+12 -30
arch/x86/include/asm/io.h
··· 44 44 #include <asm/page.h> 45 45 #include <asm/early_ioremap.h> 46 46 #include <asm/pgtable_types.h> 47 + #include <asm/shared/io.h> 47 48 48 49 #define build_mmio_read(name, size, type, reg, barrier) \ 49 50 static inline type name(const volatile void __iomem *addr) \ ··· 257 256 #endif 258 257 259 258 #define BUILDIO(bwl, bw, type) \ 260 - static inline void out##bwl(unsigned type value, int port) \ 261 - { \ 262 - asm volatile("out" #bwl " %" #bw "0, %w1" \ 263 - : : "a"(value), "Nd"(port)); \ 264 - } \ 265 - \ 266 - static inline unsigned type in##bwl(int port) \ 267 - { \ 268 - unsigned type value; \ 269 - asm volatile("in" #bwl " %w1, %" #bw "0" \ 270 - : "=a"(value) : "Nd"(port)); \ 271 - return value; \ 272 - } \ 273 - \ 274 - static inline void out##bwl##_p(unsigned type value, int port) \ 259 + static inline void out##bwl##_p(type value, u16 port) \ 275 260 { \ 276 261 out##bwl(value, port); \ 277 262 slow_down_io(); \ 278 263 } \ 279 264 \ 280 - static inline unsigned type in##bwl##_p(int port) \ 265 + static inline type in##bwl##_p(u16 port) \ 281 266 { \ 282 - unsigned type value = in##bwl(port); \ 267 + type value = in##bwl(port); \ 283 268 slow_down_io(); \ 284 269 return value; \ 285 270 } \ 286 271 \ 287 - static inline void outs##bwl(int port, const void *addr, unsigned long count) \ 272 + static inline void outs##bwl(u16 port, const void *addr, unsigned long count) \ 288 273 { \ 289 274 if (cc_platform_has(CC_ATTR_GUEST_UNROLL_STRING_IO)) { \ 290 - unsigned type *value = (unsigned type *)addr; \ 275 + type *value = (type *)addr; \ 291 276 while (count) { \ 292 277 out##bwl(*value, port); \ 293 278 value++; \ ··· 286 299 } \ 287 300 } \ 288 301 \ 289 - static inline void ins##bwl(int port, void *addr, unsigned long count) \ 302 + static inline void ins##bwl(u16 port, void *addr, unsigned long count) \ 290 303 { \ 291 304 if (cc_platform_has(CC_ATTR_GUEST_UNROLL_STRING_IO)) { \ 292 - unsigned type *value = (unsigned type *)addr; \ 305 + type 
*value = (type *)addr; \ 293 306 while (count) { \ 294 307 *value = in##bwl(port); \ 295 308 value++; \ ··· 302 315 } \ 303 316 } 304 317 305 - BUILDIO(b, b, char) 306 - BUILDIO(w, w, short) 307 - BUILDIO(l, , int) 318 + BUILDIO(b, b, u8) 319 + BUILDIO(w, w, u16) 320 + BUILDIO(l, , u32) 321 + #undef BUILDIO 308 322 309 - #define inb inb 310 - #define inw inw 311 - #define inl inl 312 323 #define inb_p inb_p 313 324 #define inw_p inw_p 314 325 #define inl_p inl_p ··· 314 329 #define insw insw 315 330 #define insl insl 316 331 317 - #define outb outb 318 - #define outw outw 319 - #define outl outl 320 332 #define outb_p outb_p 321 333 #define outw_p outw_p 322 334 #define outl_p outl_p
+22
arch/x86/include/asm/kvm_para.h
··· 7 7 #include <linux/interrupt.h> 8 8 #include <uapi/asm/kvm_para.h> 9 9 10 + #include <asm/tdx.h> 11 + 10 12 #ifdef CONFIG_KVM_GUEST 11 13 bool kvm_check_and_clear_guest_paused(void); 12 14 #else ··· 34 32 static inline long kvm_hypercall0(unsigned int nr) 35 33 { 36 34 long ret; 35 + 36 + if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) 37 + return tdx_kvm_hypercall(nr, 0, 0, 0, 0); 38 + 37 39 asm volatile(KVM_HYPERCALL 38 40 : "=a"(ret) 39 41 : "a"(nr) ··· 48 42 static inline long kvm_hypercall1(unsigned int nr, unsigned long p1) 49 43 { 50 44 long ret; 45 + 46 + if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) 47 + return tdx_kvm_hypercall(nr, p1, 0, 0, 0); 48 + 51 49 asm volatile(KVM_HYPERCALL 52 50 : "=a"(ret) 53 51 : "a"(nr), "b"(p1) ··· 63 53 unsigned long p2) 64 54 { 65 55 long ret; 56 + 57 + if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) 58 + return tdx_kvm_hypercall(nr, p1, p2, 0, 0); 59 + 66 60 asm volatile(KVM_HYPERCALL 67 61 : "=a"(ret) 68 62 : "a"(nr), "b"(p1), "c"(p2) ··· 78 64 unsigned long p2, unsigned long p3) 79 65 { 80 66 long ret; 67 + 68 + if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) 69 + return tdx_kvm_hypercall(nr, p1, p2, p3, 0); 70 + 81 71 asm volatile(KVM_HYPERCALL 82 72 : "=a"(ret) 83 73 : "a"(nr), "b"(p1), "c"(p2), "d"(p3) ··· 94 76 unsigned long p4) 95 77 { 96 78 long ret; 79 + 80 + if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) 81 + return tdx_kvm_hypercall(nr, p1, p2, p3, p4); 82 + 97 83 asm volatile(KVM_HYPERCALL 98 84 : "=a"(ret) 99 85 : "a"(nr), "b"(p1), "c"(p2), "d"(p3), "S"(p4)
+3 -3
arch/x86/include/asm/mem_encrypt.h
···
49 49
50 50  void __init mem_encrypt_free_decrypted_mem(void);
51 51
52     -  /* Architecture __weak replacement functions */
53     -  void __init mem_encrypt_init(void);
54     -
55 52  void __init sev_es_init_vc_handling(void);
56 53
57 54  #define __bss_decrypted __section(".bss..decrypted")
···
85 88  #define __bss_decrypted
86 89
87 90  #endif /* CONFIG_AMD_MEM_ENCRYPT */
91     +
92     +  /* Architecture __weak replacement functions */
93     +  void __init mem_encrypt_init(void);
88 94
89 95  /*
90 96   * The __sme_pa() and __sme_pa_nodebug() macros are meant for use when
+1
arch/x86/include/asm/realmode.h
···
25 25  	u32 sev_es_trampoline_start;
26 26  #endif
27 27  #ifdef CONFIG_X86_64
28     +  	u32 trampoline_start64;
28 29  	u32 trampoline_pgd;
29 30  #endif
30 31  	/* ACPI S3 wakeup */
+34
arch/x86/include/asm/shared/io.h
···
1     +  /* SPDX-License-Identifier: GPL-2.0 */
2     +  #ifndef _ASM_X86_SHARED_IO_H
3     +  #define _ASM_X86_SHARED_IO_H
4     +
5     +  #include <linux/types.h>
6     +
7     +  #define BUILDIO(bwl, bw, type) \
8     +  static inline void __out##bwl(type value, u16 port) \
9     +  { \
10     +  	asm volatile("out" #bwl " %" #bw "0, %w1" \
11     +  		     : : "a"(value), "Nd"(port)); \
12     +  } \
13     +  \
14     +  static inline type __in##bwl(u16 port) \
15     +  { \
16     +  	type value; \
17     +  	asm volatile("in" #bwl " %w1, %" #bw "0" \
18     +  		     : "=a"(value) : "Nd"(port)); \
19     +  	return value; \
20     +  }
21     +
22     +  BUILDIO(b, b, u8)
23     +  BUILDIO(w, w, u16)
24     +  BUILDIO(l, , u32)
25     +  #undef BUILDIO
26     +
27     +  #define inb __inb
28     +  #define inw __inw
29     +  #define inl __inl
30     +  #define outb __outb
31     +  #define outw __outw
32     +  #define outl __outl
33     +
34     +  #endif
+40
arch/x86/include/asm/shared/tdx.h
···
1     +  /* SPDX-License-Identifier: GPL-2.0 */
2     +  #ifndef _ASM_X86_SHARED_TDX_H
3     +  #define _ASM_X86_SHARED_TDX_H
4     +
5     +  #include <linux/bits.h>
6     +  #include <linux/types.h>
7     +
8     +  #define TDX_HYPERCALL_STANDARD  0
9     +
10     +  #define TDX_HCALL_HAS_OUTPUT  BIT(0)
11     +  #define TDX_HCALL_ISSUE_STI  BIT(1)
12     +
13     +  #define TDX_CPUID_LEAF_ID  0x21
14     +  #define TDX_IDENT  "IntelTDX    "
15     +
16     +  #ifndef __ASSEMBLY__
17     +
18     +  /*
19     +   * Used in __tdx_hypercall() to pass down and get back registers' values of
20     +   * the TDCALL instruction when requesting services from the VMM.
21     +   *
22     +   * This is a software only structure and not part of the TDX module/VMM ABI.
23     +   */
24     +  struct tdx_hypercall_args {
25     +  	u64 r10;
26     +  	u64 r11;
27     +  	u64 r12;
28     +  	u64 r13;
29     +  	u64 r14;
30     +  	u64 r15;
31     +  };
32     +
33     +  /* Used to request services from the VMM */
34     +  u64 __tdx_hypercall(struct tdx_hypercall_args *args, unsigned long flags);
35     +
36     +  /* Called from __tdx_hypercall() for unrecoverable failure */
37     +  void __tdx_hypercall_failed(void);
38     +
39     +  #endif /* !__ASSEMBLY__ */
40     +  #endif /* _ASM_X86_SHARED_TDX_H */
+91
arch/x86/include/asm/tdx.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + /* Copyright (C) 2021-2022 Intel Corporation */ 3 + #ifndef _ASM_X86_TDX_H 4 + #define _ASM_X86_TDX_H 5 + 6 + #include <linux/init.h> 7 + #include <linux/bits.h> 8 + #include <asm/ptrace.h> 9 + #include <asm/shared/tdx.h> 10 + 11 + /* 12 + * SW-defined error codes. 13 + * 14 + * Bits 47:40 == 0xFF indicate Reserved status code class that never used by 15 + * TDX module. 16 + */ 17 + #define TDX_ERROR _BITUL(63) 18 + #define TDX_SW_ERROR (TDX_ERROR | GENMASK_ULL(47, 40)) 19 + #define TDX_SEAMCALL_VMFAILINVALID (TDX_SW_ERROR | _UL(0xFFFF0000)) 20 + 21 + #ifndef __ASSEMBLY__ 22 + 23 + /* 24 + * Used to gather the output registers values of the TDCALL and SEAMCALL 25 + * instructions when requesting services from the TDX module. 26 + * 27 + * This is a software only structure and not part of the TDX module/VMM ABI. 28 + */ 29 + struct tdx_module_output { 30 + u64 rcx; 31 + u64 rdx; 32 + u64 r8; 33 + u64 r9; 34 + u64 r10; 35 + u64 r11; 36 + }; 37 + 38 + /* 39 + * Used by the #VE exception handler to gather the #VE exception 40 + * info from the TDX module. This is a software only structure 41 + * and not part of the TDX module/VMM ABI. 
42 + */ 43 + struct ve_info { 44 + u64 exit_reason; 45 + u64 exit_qual; 46 + /* Guest Linear (virtual) Address */ 47 + u64 gla; 48 + /* Guest Physical Address */ 49 + u64 gpa; 50 + u32 instr_len; 51 + u32 instr_info; 52 + }; 53 + 54 + #ifdef CONFIG_INTEL_TDX_GUEST 55 + 56 + void __init tdx_early_init(void); 57 + 58 + /* Used to communicate with the TDX module */ 59 + u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9, 60 + struct tdx_module_output *out); 61 + 62 + void tdx_get_ve_info(struct ve_info *ve); 63 + 64 + bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve); 65 + 66 + void tdx_safe_halt(void); 67 + 68 + bool tdx_early_handle_ve(struct pt_regs *regs); 69 + 70 + #else 71 + 72 + static inline void tdx_early_init(void) { }; 73 + static inline void tdx_safe_halt(void) { }; 74 + 75 + static inline bool tdx_early_handle_ve(struct pt_regs *regs) { return false; } 76 + 77 + #endif /* CONFIG_INTEL_TDX_GUEST */ 78 + 79 + #if defined(CONFIG_KVM_GUEST) && defined(CONFIG_INTEL_TDX_GUEST) 80 + long tdx_kvm_hypercall(unsigned int nr, unsigned long p1, unsigned long p2, 81 + unsigned long p3, unsigned long p4); 82 + #else 83 + static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1, 84 + unsigned long p2, unsigned long p3, 85 + unsigned long p4) 86 + { 87 + return -ENODEV; 88 + } 89 + #endif /* CONFIG_INTEL_TDX_GUEST && CONFIG_KVM_GUEST */ 90 + #endif /* !__ASSEMBLY__ */ 91 + #endif /* _ASM_X86_TDX_H */
+92 -1
arch/x86/kernel/acpi/boot.c
··· 65 65 static bool acpi_support_online_capable; 66 66 #endif 67 67 68 + #ifdef CONFIG_X86_64 69 + /* Physical address of the Multiprocessor Wakeup Structure mailbox */ 70 + static u64 acpi_mp_wake_mailbox_paddr; 71 + /* Virtual address of the Multiprocessor Wakeup Structure mailbox */ 72 + static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox; 73 + #endif 74 + 68 75 #ifdef CONFIG_X86_IO_APIC 69 76 /* 70 77 * Locks related to IOAPIC hotplug ··· 343 336 return 0; 344 337 } 345 338 346 - #endif /*CONFIG_X86_LOCAL_APIC */ 339 + #ifdef CONFIG_X86_64 340 + static int acpi_wakeup_cpu(int apicid, unsigned long start_ip) 341 + { 342 + /* 343 + * Remap mailbox memory only for the first call to acpi_wakeup_cpu(). 344 + * 345 + * Wakeup of secondary CPUs is fully serialized in the core code. 346 + * No need to protect acpi_mp_wake_mailbox from concurrent accesses. 347 + */ 348 + if (!acpi_mp_wake_mailbox) { 349 + acpi_mp_wake_mailbox = memremap(acpi_mp_wake_mailbox_paddr, 350 + sizeof(*acpi_mp_wake_mailbox), 351 + MEMREMAP_WB); 352 + } 353 + 354 + /* 355 + * Mailbox memory is shared between the firmware and OS. Firmware will 356 + * listen on mailbox command address, and once it receives the wakeup 357 + * command, the CPU associated with the given apicid will be booted. 358 + * 359 + * The value of 'apic_id' and 'wakeup_vector' must be visible to the 360 + * firmware before the wakeup command is visible. smp_store_release() 361 + * ensures ordering and visibility. 362 + */ 363 + acpi_mp_wake_mailbox->apic_id = apicid; 364 + acpi_mp_wake_mailbox->wakeup_vector = start_ip; 365 + smp_store_release(&acpi_mp_wake_mailbox->command, 366 + ACPI_MP_WAKE_COMMAND_WAKEUP); 367 + 368 + /* 369 + * Wait for the CPU to wake up. 370 + * 371 + * The CPU being woken up is essentially in a spin loop waiting to be 372 + * woken up. It should not take long for it wake up and acknowledge by 373 + * zeroing out ->command. 
374 + * 375 + * ACPI specification doesn't provide any guidance on how long kernel 376 + * has to wait for a wake up acknowledgement. It also doesn't provide 377 + * a way to cancel a wake up request if it takes too long. 378 + * 379 + * In TDX environment, the VMM has control over how long it takes to 380 + * wake up secondary. It can postpone scheduling secondary vCPU 381 + * indefinitely. Giving up on wake up request and reporting error opens 382 + * possible attack vector for VMM: it can wake up a secondary CPU when 383 + * kernel doesn't expect it. Wait until positive result of the wake up 384 + * request. 385 + */ 386 + while (READ_ONCE(acpi_mp_wake_mailbox->command)) 387 + cpu_relax(); 388 + 389 + return 0; 390 + } 391 + #endif /* CONFIG_X86_64 */ 392 + #endif /* CONFIG_X86_LOCAL_APIC */ 347 393 348 394 #ifdef CONFIG_X86_IO_APIC 349 395 #define MP_ISA_BUS 0 ··· 1143 1083 } 1144 1084 return 0; 1145 1085 } 1086 + 1087 + #ifdef CONFIG_X86_64 1088 + static int __init acpi_parse_mp_wake(union acpi_subtable_headers *header, 1089 + const unsigned long end) 1090 + { 1091 + struct acpi_madt_multiproc_wakeup *mp_wake; 1092 + 1093 + if (!IS_ENABLED(CONFIG_SMP)) 1094 + return -ENODEV; 1095 + 1096 + mp_wake = (struct acpi_madt_multiproc_wakeup *)header; 1097 + if (BAD_MADT_ENTRY(mp_wake, end)) 1098 + return -EINVAL; 1099 + 1100 + acpi_table_print_madt_entry(&header->common); 1101 + 1102 + acpi_mp_wake_mailbox_paddr = mp_wake->base_address; 1103 + 1104 + acpi_wake_cpu_handler_update(acpi_wakeup_cpu); 1105 + 1106 + return 0; 1107 + } 1108 + #endif /* CONFIG_X86_64 */ 1146 1109 #endif /* CONFIG_X86_LOCAL_APIC */ 1147 1110 1148 1111 #ifdef CONFIG_X86_IO_APIC ··· 1361 1278 1362 1279 smp_found_config = 1; 1363 1280 } 1281 + 1282 + #ifdef CONFIG_X86_64 1283 + /* 1284 + * Parse MADT MP Wake entry. 1285 + */ 1286 + acpi_table_parse_madt(ACPI_MADT_TYPE_MULTIPROC_WAKEUP, 1287 + acpi_parse_mp_wake, 1); 1288 + #endif 1364 1289 } 1365 1290 if (error == -EINVAL) { 1366 1291 /*
+10
arch/x86/kernel/apic/apic.c
···
 }
 EXPORT_SYMBOL_GPL(x86_msi_msg_get_destid);

+#ifdef CONFIG_X86_64
+void __init acpi_wake_cpu_handler_update(wakeup_cpu_handler handler)
+{
+	struct apic **drv;
+
+	for (drv = __apicdrivers; drv < __apicdrivers_end; drv++)
+		(*drv)->wakeup_secondary_cpu_64 = handler;
+}
+#endif
+
 /*
  * Override the generic EOI implementation with an optimized version.
  * Only called during early boot when only one CPU is active and with
+16 -2
arch/x86/kernel/apic/io_apic.c
···
 #include <asm/irq_remapping.h>
 #include <asm/hw_irq.h>
 #include <asm/apic.h>
+#include <asm/pgtable.h>

 #define for_each_ioapic(idx)	\
 	for ((idx) = 0; (idx) < nr_ioapics; (idx)++)
···
 	return res;
 }

+static void io_apic_set_fixmap(enum fixed_addresses idx, phys_addr_t phys)
+{
+	pgprot_t flags = FIXMAP_PAGE_NOCACHE;
+
+	/*
+	 * Ensure fixmaps for IOAPIC MMIO respect memory encryption pgprot
+	 * bits, just like normal ioremap():
+	 */
+	flags = pgprot_decrypted(flags);
+
+	__set_fixmap(idx, phys, flags);
+}
+
 void __init io_apic_init_mappings(void)
 {
 	unsigned long ioapic_phys, idx = FIX_IO_APIC_BASE_0;
···
 			__func__, PAGE_SIZE, PAGE_SIZE);
 		ioapic_phys = __pa(ioapic_phys);
 	}
-	set_fixmap_nocache(idx, ioapic_phys);
+	io_apic_set_fixmap(idx, ioapic_phys);
 	apic_printk(APIC_VERBOSE, "mapped IOAPIC to %08lx (%08lx)\n",
 		    __fix_to_virt(idx) + (ioapic_phys & ~PAGE_MASK),
 		    ioapic_phys);
···
 	ioapics[idx].mp_config.flags = MPC_APIC_USABLE;
 	ioapics[idx].mp_config.apicaddr = address;

-	set_fixmap_nocache(FIX_IO_APIC_BASE_0 + idx, address);
+	io_apic_set_fixmap(FIX_IO_APIC_BASE_0 + idx, address);
 	if (bad_ioapic_register(idx)) {
 		clear_fixmap(FIX_IO_APIC_BASE_0 + idx);
 		return -ENODEV;
+17
arch/x86/kernel/asm-offsets.c
···
 #include <asm/bootparam.h>
 #include <asm/suspend.h>
 #include <asm/tlbflush.h>
+#include <asm/tdx.h>

 #ifdef CONFIG_XEN
 #include <xen/interface/xen.h>
···
 	OFFSET(XEN_vcpu_info_pending, vcpu_info, evtchn_upcall_pending);
 	OFFSET(XEN_vcpu_info_arch_cr2, vcpu_info, arch.cr2);
 #endif
+
+	BLANK();
+	OFFSET(TDX_MODULE_rcx, tdx_module_output, rcx);
+	OFFSET(TDX_MODULE_rdx, tdx_module_output, rdx);
+	OFFSET(TDX_MODULE_r8,  tdx_module_output, r8);
+	OFFSET(TDX_MODULE_r9,  tdx_module_output, r9);
+	OFFSET(TDX_MODULE_r10, tdx_module_output, r10);
+	OFFSET(TDX_MODULE_r11, tdx_module_output, r11);
+
+	BLANK();
+	OFFSET(TDX_HYPERCALL_r10, tdx_hypercall_args, r10);
+	OFFSET(TDX_HYPERCALL_r11, tdx_hypercall_args, r11);
+	OFFSET(TDX_HYPERCALL_r12, tdx_hypercall_args, r12);
+	OFFSET(TDX_HYPERCALL_r13, tdx_hypercall_args, r13);
+	OFFSET(TDX_HYPERCALL_r14, tdx_hypercall_args, r14);
+	OFFSET(TDX_HYPERCALL_r15, tdx_hypercall_args, r15);

 	BLANK();
 	OFFSET(BP_scratch, boot_params, scratch);
+7
arch/x86/kernel/head64.c
···
 #include <asm/extable.h>
 #include <asm/trapnr.h>
 #include <asm/sev.h>
+#include <asm/tdx.h>

 /*
  * Manage page tables very early on.
···
 	    trapnr == X86_TRAP_VC && handle_vc_boot_ghcb(regs))
 		return;

+	if (trapnr == X86_TRAP_VE && tdx_early_handle_ve(regs))
+		return;
+
 	early_fixup_exception(regs, trapnr);
 }
···
 	__native_tlb_flush_global(this_cpu_read(cpu_tlbstate.cr4));

 	idt_setup_early_handler();
+
+	/* Needed before cc_platform_has() can be used for TDX */
+	tdx_early_init();

 	copy_bootdata(__va(real_mode_data));
+26 -2
arch/x86/kernel/head_64.S
···
 	addq	$(init_top_pgt - __START_KERNEL_map), %rax
 1:

+#ifdef CONFIG_X86_MCE
+	/*
+	 * Preserve CR4.MCE if the kernel will enable #MC support.
+	 * Clearing MCE may fault in some environments (that also force #MC
+	 * support). Any machine check that occurs before #MC support is fully
+	 * configured will crash the system regardless of the CR4.MCE value set
+	 * here.
+	 */
+	movq	%cr4, %rcx
+	andl	$X86_CR4_MCE, %ecx
+#else
+	movl	$0, %ecx
+#endif
+
 	/* Enable PAE mode, PGE and LA57 */
-	movl	$(X86_CR4_PAE | X86_CR4_PGE), %ecx
+	orl	$(X86_CR4_PAE | X86_CR4_PGE), %ecx
 #ifdef CONFIG_X86_5LEVEL
 	testl	$1, __pgtable_l5_enabled(%rip)
 	jz	1f
···
 	/* Setup EFER (Extended Feature Enable Register) */
 	movl	$MSR_EFER, %ecx
 	rdmsr
+	/*
+	 * Preserve current value of EFER for comparison and to skip
+	 * EFER writes if no change was made (for TDX guest)
+	 */
+	movl	%eax, %edx
 	btsl	$_EFER_SCE, %eax	/* Enable System Call */
 	btl	$20,%edi		/* No Execute supported? */
 	jnc	1f
 	btsl	$_EFER_NX, %eax
 	btsq	$_PAGE_BIT_NX,early_pmd_flags(%rip)
-1:	wrmsr				/* Make changes effective */

+	/* Avoid writing EFER if no change was made (for TDX guest) */
+1:	cmpl	%edx, %eax
+	je	1f
+	xor	%edx, %edx
+	wrmsr				/* Make changes effective */
+1:
 	/* Setup cr0 */
 	movl	$CR0_STATE, %eax
 	/* Make changes effective */
+3
arch/x86/kernel/idt.c
···
 	 */
 	INTG(X86_TRAP_PF,		asm_exc_page_fault),
 #endif
+#ifdef CONFIG_INTEL_TDX_GUEST
+	INTG(X86_TRAP_VE,		asm_exc_virtualization_exception),
+#endif
 };

 /*
+4
arch/x86/kernel/process.c
···
 #include <asm/proto.h>
 #include <asm/frame.h>
 #include <asm/unwind.h>
+#include <asm/tdx.h>

 #include "process.h"
···
 	} else if (prefer_mwait_c1_over_halt(c)) {
 		pr_info("using mwait in idle threads\n");
 		x86_idle = mwait_idle;
+	} else if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
+		pr_info("using TDX aware idle routine\n");
+		x86_idle = tdx_safe_halt;
 	} else
 		x86_idle = default_idle;
 }
+10 -2
arch/x86/kernel/smpboot.c
···
 	unsigned long boot_error = 0;
 	unsigned long timeout;

+#ifdef CONFIG_X86_64
+	/* If a 64-bit wakeup method exists, use the 64-bit mode trampoline IP */
+	if (apic->wakeup_secondary_cpu_64)
+		start_ip = real_mode_header->trampoline_start64;
+#endif
 	idle->thread.sp = (unsigned long)task_pt_regs(idle);
 	early_gdt_descr.address = (unsigned long)get_cpu_gdt_rw(cpu);
 	initial_code = (unsigned long)start_secondary;
···
 	/*
 	 * Wake up a CPU in different cases:
-	 * - Use the method in the APIC driver if it's defined
+	 * - Use a method from the APIC driver if one is defined, with wakeup
+	 *   straight to 64-bit mode preferred over wakeup to RM.
 	 * Otherwise,
 	 * - Use an INIT boot APIC message for APs or NMI for BSP.
 	 */
-	if (apic->wakeup_secondary_cpu)
+	if (apic->wakeup_secondary_cpu_64)
+		boot_error = apic->wakeup_secondary_cpu_64(apicid, start_ip);
+	else if (apic->wakeup_secondary_cpu)
 		boot_error = apic->wakeup_secondary_cpu(apicid, start_ip);
 	else
 		boot_error = wakeup_cpu_via_init_nmi(cpu, start_ip, apicid,
+117 -26
arch/x86/kernel/traps.c
···
 #include <asm/insn.h>
 #include <asm/insn-eval.h>
 #include <asm/vdso.h>
+#include <asm/tdx.h>

 #ifdef CONFIG_X86_64
 #include <asm/x86_init.h>
···
 #endif
 }

+static bool gp_try_fixup_and_notify(struct pt_regs *regs, int trapnr,
+				    unsigned long error_code, const char *str)
+{
+	if (fixup_exception(regs, trapnr, error_code, 0))
+		return true;
+
+	current->thread.error_code = error_code;
+	current->thread.trap_nr = trapnr;
+
+	/*
+	 * To be potentially processing a kprobe fault and to trust the result
+	 * from kprobe_running(), we have to be non-preemptible.
+	 */
+	if (!preemptible() && kprobe_running() &&
+	    kprobe_fault_handler(regs, trapnr))
+		return true;
+
+	return notify_die(DIE_GPF, str, regs, error_code, trapnr, SIGSEGV) == NOTIFY_STOP;
+}
+
+static void gp_user_force_sig_segv(struct pt_regs *regs, int trapnr,
+				   unsigned long error_code, const char *str)
+{
+	current->thread.error_code = error_code;
+	current->thread.trap_nr = trapnr;
+	show_signal(current, SIGSEGV, "", str, regs, error_code);
+	force_sig(SIGSEGV);
+}
+
 DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
 {
 	char desc[sizeof(GPFSTR) + 50 + 2*sizeof(unsigned long) + 1] = GPFSTR;
 	enum kernel_gp_hint hint = GP_NO_HINT;
-	struct task_struct *tsk;
 	unsigned long gp_addr;
-	int ret;

 	if (user_mode(regs) && try_fixup_enqcmd_gp())
 		return;
···
 		return;
 	}

-	tsk = current;
-
 	if (user_mode(regs)) {
 		if (fixup_iopl_exception(regs))
 			goto exit;

-		tsk->thread.error_code = error_code;
-		tsk->thread.trap_nr = X86_TRAP_GP;
-
 		if (fixup_vdso_exception(regs, X86_TRAP_GP, error_code, 0))
 			goto exit;

-		show_signal(tsk, SIGSEGV, "", desc, regs, error_code);
-		force_sig(SIGSEGV);
+		gp_user_force_sig_segv(regs, X86_TRAP_GP, error_code, desc);
 		goto exit;
 	}

-	if (fixup_exception(regs, X86_TRAP_GP, error_code, 0))
-		goto exit;
-
-	tsk->thread.error_code = error_code;
-	tsk->thread.trap_nr = X86_TRAP_GP;
-
-	/*
-	 * To be potentially processing a kprobe fault and to trust the result
-	 * from kprobe_running(), we have to be non-preemptible.
-	 */
-	if (!preemptible() &&
-	    kprobe_running() &&
-	    kprobe_fault_handler(regs, X86_TRAP_GP))
-		goto exit;
-
-	ret = notify_die(DIE_GPF, desc, regs, error_code, X86_TRAP_GP, SIGSEGV);
-	if (ret == NOTIFY_STOP)
+	if (gp_try_fixup_and_notify(regs, X86_TRAP_GP, error_code, desc))
 		goto exit;

 	if (error_code)
···
 		die("unexpected #NM exception", regs, 0);
 	}
 }
+
+#ifdef CONFIG_INTEL_TDX_GUEST
+
+#define VE_FAULT_STR "VE fault"
+
+static void ve_raise_fault(struct pt_regs *regs, long error_code)
+{
+	if (user_mode(regs)) {
+		gp_user_force_sig_segv(regs, X86_TRAP_VE, error_code, VE_FAULT_STR);
+		return;
+	}
+
+	if (gp_try_fixup_and_notify(regs, X86_TRAP_VE, error_code, VE_FAULT_STR))
+		return;
+
+	die_addr(VE_FAULT_STR, regs, error_code, 0);
+}
+
+/*
+ * Virtualization Exceptions (#VE) are delivered to TDX guests due to
+ * specific guest actions which may happen in either user space or the
+ * kernel:
+ *
+ *  * Specific instructions (WBINVD, for example)
+ *  * Specific MSR accesses
+ *  * Specific CPUID leaf accesses
+ *  * Access to specific guest physical addresses
+ *
+ * In the settings that Linux will run in, virtualization exceptions are
+ * never generated on accesses to normal, TD-private memory that has been
+ * accepted (by BIOS or with tdx_enc_status_changed()).
+ *
+ * Syscall entry code has a critical window where the kernel stack is not
+ * yet set up. Any exception in this window leads to hard to debug issues
+ * and can be exploited for privilege escalation. Exceptions in the NMI
+ * entry code also cause issues. Returning from the exception handler with
+ * IRET will re-enable NMIs and a nested NMI will corrupt the NMI stack.
+ *
+ * For these reasons, the kernel avoids #VEs during the syscall gap and
+ * the NMI entry code. Entry code paths do not access TD-shared memory,
+ * MMIO regions, or use #VE triggering MSRs, instructions, or CPUID leaves
+ * that might generate #VE. The VMM can remove memory from the TD at any
+ * point, but access to unaccepted (or missing) private memory leads to VM
+ * termination, not to #VE.
+ *
+ * Similarly to page faults and breakpoints, #VEs are allowed in NMI
+ * handlers once the kernel is ready to deal with nested NMIs.
+ *
+ * During #VE delivery, all interrupts, including NMIs, are blocked until
+ * TDGETVEINFO is called. It prevents #VE nesting until the kernel reads
+ * the VE info.
+ *
+ * If a guest kernel action which would normally cause a #VE occurs in
+ * the interrupt-disabled region before TDGETVEINFO, a #DF (fault
+ * exception) is delivered to the guest which will result in an oops.
+ *
+ * The entry code has been audited carefully for following these
+ * expectations. Changes in the entry code have to be audited for
+ * correctness vs. this aspect. Similarly to #PF, #VE in these places
+ * will expose the kernel to privilege escalation or may lead to random
+ * crashes.
+ */
+DEFINE_IDTENTRY(exc_virtualization_exception)
+{
+	struct ve_info ve;
+
+	/*
+	 * NMIs/Machine-checks/Interrupts will be in a disabled state
+	 * till the TDGETVEINFO TDCALL is executed. This ensures that VE
+	 * info cannot be overwritten by a nested #VE.
+	 */
+	tdx_get_ve_info(&ve);
+
+	cond_local_irq_enable(regs);
+
+	/*
+	 * If tdx_handle_virt_exception() could not process
+	 * it successfully, treat it as #GP(0) and handle it.
+	 */
+	if (!tdx_handle_virt_exception(regs, &ve))
+		ve_raise_fault(regs, 0);
+
+	cond_local_irq_disable(regs);
+}
+
+#endif

 #ifdef CONFIG_X86_32
 DEFINE_IDTENTRY_SW(iret_error)
+1 -1
arch/x86/lib/kaslr.c
···
 #include <asm/msr.h>
 #include <asm/archrandom.h>
 #include <asm/e820/api.h>
-#include <asm/io.h>
+#include <asm/shared/io.h>

 /*
  * When built for the regular kernel, several functions need to be stubbed out
+5
arch/x86/mm/ioremap.c
···
 	 * If the page being mapped is in memory and SEV is active then
 	 * make sure the memory encryption attribute is enabled in the
 	 * resulting mapping.
+	 * In TDX guests, memory is marked private by default. If encryption
+	 * is not requested (via the 'encrypted' argument), explicitly set
+	 * the decrypted attribute on all ioremapped memory.
 	 */
 	prot = PAGE_KERNEL_IO;
 	if ((io_desc.flags & IORES_MAP_ENCRYPTED) || encrypted)
 		prot = pgprot_encrypted(prot);
+	else
+		prot = pgprot_decrypted(prot);

 	switch (pcm) {
 	case _PAGE_CACHE_MODE_UC:
+8 -1
arch/x86/mm/mem_encrypt.c
···
 static void print_mem_encrypt_feature_info(void)
 {
-	pr_info("AMD Memory Encryption Features active:");
+	pr_info("Memory Encryption Features active:");
+
+	if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
+		pr_cont(" Intel TDX\n");
+		return;
+	}
+
+	pr_cont(" AMD");

 	/* Secure Memory Encryption */
 	if (cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT)) {
+1
arch/x86/realmode/rm/header.S
···
 	.long	pa_sev_es_trampoline_start
 #endif
 #ifdef CONFIG_X86_64
+	.long	pa_trampoline_start64
 	.long	pa_trampoline_pgd;
 #endif
 	/* ACPI S3 wakeup */
+53 -4
arch/x86/realmode/rm/trampoline_64.S
···
 	movw	$__KERNEL_DS, %dx	# Data segment descriptor

 	# Enable protected mode
-	movl	$X86_CR0_PE, %eax	# protected mode (PE) bit
+	movl	$(CR0_STATE & ~X86_CR0_PG), %eax
 	movl	%eax, %cr0		# into protected mode

 	# flush prefetch and jump to startup_32
···
 	movl	%eax, %cr3

 	# Set up EFER
+	movl	$MSR_EFER, %ecx
+	rdmsr
+	/*
+	 * Skip writing to EFER if the register already has the desired
+	 * value (to avoid #VE for the TDX guest).
+	 */
+	cmp	pa_tr_efer, %eax
+	jne	.Lwrite_efer
+	cmp	pa_tr_efer + 4, %edx
+	je	.Ldone_efer
+.Lwrite_efer:
 	movl	pa_tr_efer, %eax
 	movl	pa_tr_efer + 4, %edx
-	movl	$MSR_EFER, %ecx
 	wrmsr

-	# Enable paging and in turn activate Long Mode
-	movl	$(X86_CR0_PG | X86_CR0_WP | X86_CR0_PE), %eax
+.Ldone_efer:
+	# Enable paging and in turn activate Long Mode.
+	movl	$CR0_STATE, %eax
 	movl	%eax, %cr0

 	/*
···
 	ljmpl	$__KERNEL_CS, $pa_startup_64
 SYM_CODE_END(startup_32)

+SYM_CODE_START(pa_trampoline_compat)
+	/*
+	 * In compatibility mode. Prep ESP and DX for startup_32, then disable
+	 * paging and complete the switch to legacy 32-bit mode.
+	 */
+	movl	$rm_stack_end, %esp
+	movw	$__KERNEL_DS, %dx
+
+	movl	$(CR0_STATE & ~X86_CR0_PG), %eax
+	movl	%eax, %cr0
+	ljmpl	$__KERNEL32_CS, $pa_startup_32
+SYM_CODE_END(pa_trampoline_compat)
+
 	.section ".text64","ax"
 	.code64
 	.balign 4
···
 	# Now jump into the kernel using virtual addresses
 	jmpq	*tr_start(%rip)
 SYM_CODE_END(startup_64)
+
+SYM_CODE_START(trampoline_start64)
+	/*
+	 * APs start here on a direct transfer from 64-bit BIOS with identity
+	 * mapped page tables. Load the kernel's GDT in order to gear down to
+	 * 32-bit mode (to handle 4-level vs. 5-level paging), and to (re)load
+	 * segment registers. Load the zero IDT so any fault triggers a
+	 * shutdown instead of jumping back into BIOS.
+	 */
+	lidt	tr_idt(%rip)
+	lgdt	tr_gdt64(%rip)
+
+	ljmpl	*tr_compat(%rip)
+SYM_CODE_END(trampoline_start64)

 	.section ".rodata","a"
 	# Duplicate the global descriptor table
···
 	.quad	0x00af9b000000ffff	# __KERNEL_CS
 	.quad	0x00cf93000000ffff	# __KERNEL_DS
 SYM_DATA_END_LABEL(tr_gdt, SYM_L_LOCAL, tr_gdt_end)
+
+SYM_DATA_START(tr_gdt64)
+	.short	tr_gdt_end - tr_gdt - 1	# gdt limit
+	.long	pa_tr_gdt
+	.long	0
+SYM_DATA_END(tr_gdt64)
+
+SYM_DATA_START(tr_compat)
+	.long	pa_trampoline_compat
+	.short	__KERNEL32_CS
+SYM_DATA_END(tr_compat)

 	.bss
 	.balign	PAGE_SIZE
+11 -1
arch/x86/realmode/rm/trampoline_common.S
···
 /* SPDX-License-Identifier: GPL-2.0 */
 	.section ".rodata","a"
 	.balign	16
-SYM_DATA_LOCAL(tr_idt, .fill 1, 6, 0)
+
+/*
+ * When a bootloader hands off to the kernel in 32-bit mode, an
+ * IDT with a 2-byte limit and a 4-byte base is needed. When a
+ * bootloader hands off to the kernel in 64-bit mode, the base
+ * address extends to 8 bytes. Reserve enough space for either
+ * scenario.
+ */
+SYM_DATA_START_LOCAL(tr_idt)
+	.short	0
+	.quad	0
+SYM_DATA_END(tr_idt)
+4
arch/x86/realmode/rm/wakemain.c
···
 	}
 }

+struct port_io_ops pio_ops;
+
 void main(void)
 {
+	init_default_io_ops();
+
 	/* Kill machine if structures are wrong */
 	if (wakeup_header.real_magic != 0x12345678)
 		while (1)
+96
arch/x86/virt/vmx/tdx/tdxcall.S
···
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <asm/asm-offsets.h>
+#include <asm/tdx.h>
+
+/*
+ * TDCALL and SEAMCALL are supported in Binutils >= 2.36.
+ */
+#define tdcall		.byte 0x66,0x0f,0x01,0xcc
+#define seamcall	.byte 0x66,0x0f,0x01,0xcf
+
+/*
+ * TDX_MODULE_CALL - common helper macro for both
+ *		     TDCALL and SEAMCALL instructions.
+ *
+ * TDCALL   - used by TDX guests to make requests to the
+ *	      TDX module and hypercalls to the VMM.
+ * SEAMCALL - used by TDX hosts to make requests to the
+ *	      TDX module.
+ */
+.macro TDX_MODULE_CALL host:req
+	/*
+	 * R12 will be used as temporary storage for the struct
+	 * tdx_module_output pointer. Since the R12-R15 registers are not
+	 * used by the TDCALL/SEAMCALL leaves supported by this function,
+	 * it can be reused.
+	 */
+
+	/* Callee saved, so preserve it */
+	push %r12
+
+	/*
+	 * Push output pointer to stack.
+	 * After the operation, it will be fetched into the R12 register.
+	 */
+	push %r9
+
+	/* Mangle function call ABI into TDCALL/SEAMCALL ABI: */
+	/* Move Leaf ID to RAX */
+	mov %rdi, %rax
+	/* Move input 4 to R9 */
+	mov %r8,  %r9
+	/* Move input 3 to R8 */
+	mov %rcx, %r8
+	/* Move input 1 to RCX */
+	mov %rsi, %rcx
+	/* Leave input param 2 in RDX */
+
+	.if \host
+	seamcall
+	/*
+	 * The SEAMCALL instruction is essentially a VMExit from VMX root
+	 * mode to SEAM VMX root mode. VMfailInvalid (CF=1) indicates
+	 * that the targeted SEAM firmware is not loaded or disabled,
+	 * or P-SEAMLDR is busy with another SEAMCALL. %rax is not
+	 * changed in this case.
+	 *
+	 * Set %rax to TDX_SEAMCALL_VMFAILINVALID for VMfailInvalid.
+	 * This value will never be used as an actual SEAMCALL error code,
+	 * as it is from the Reserved status code class.
+	 */
+	jnc .Lno_vmfailinvalid
+	mov $TDX_SEAMCALL_VMFAILINVALID, %rax
+.Lno_vmfailinvalid:
+
+	.else
+	tdcall
+	.endif
+
+	/*
+	 * Fetch output pointer from stack to R12 (it is used
+	 * as temporary storage)
+	 */
+	pop %r12
+
+	/*
+	 * Since this macro can be invoked with NULL as an output pointer,
+	 * check if the caller provided an output struct before storing
+	 * output registers.
+	 *
+	 * Update output registers, even if the call failed (RAX != 0).
+	 * Other registers may contain details of the failure.
+	 */
+	test %r12, %r12
+	jz .Lno_output_struct
+
+	/* Copy result registers to output struct: */
+	movq %rcx, TDX_MODULE_rcx(%r12)
+	movq %rdx, TDX_MODULE_rdx(%r12)
+	movq %r8,  TDX_MODULE_r8(%r12)
+	movq %r9,  TDX_MODULE_r9(%r12)
+	movq %r10, TDX_MODULE_r10(%r12)
+	movq %r11, TDX_MODULE_r11(%r12)
+
+.Lno_output_struct:
+	/* Restore the state of the R12 register */
+	pop %r12
+.endm
+10
include/linux/cc_platform.h
···
 	 * using AMD SEV-SNP features.
 	 */
 	CC_ATTR_GUEST_SEV_SNP,
+
+	/**
+	 * @CC_ATTR_HOTPLUG_DISABLED: Hotplug is not supported or disabled.
+	 *
+	 * The platform/OS is running as a guest/virtual machine that does
+	 * not support the CPU hotplug feature.
+	 *
+	 * Examples include TDX guests.
+	 */
+	CC_ATTR_HOTPLUG_DISABLED,
 };

 #ifdef CONFIG_ARCH_HAS_CC_PLATFORM
+7
kernel/cpu.c
···
 #include <linux/percpu-rwsem.h>
 #include <linux/cpuset.h>
 #include <linux/random.h>
+#include <linux/cc_platform.h>

 #include <trace/events/power.h>
 #define CREATE_TRACE_POINTS
···
 static int cpu_down_maps_locked(unsigned int cpu, enum cpuhp_state target)
 {
+	/*
+	 * If the platform does not support hotplug, report it explicitly to
+	 * differentiate it from a transient offlining failure.
+	 */
+	if (cc_platform_has(CC_ATTR_HOTPLUG_DISABLED))
+		return -EOPNOTSUPP;
 	if (cpu_hotplug_disabled)
 		return -EBUSY;
 	return _cpu_down(cpu, 0, target);