Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge branch 'kvm-tdx-initial' into HEAD

This large commit contains the initial support for TDX in KVM. All x86
parts enable the host-side hypercalls that KVM uses to talk to the TDX
module, a software component that runs in a special CPU mode called SEAM
(Secure Arbitration Mode).

The series is in turn split into multiple sub-series, each with a separate
merge commit:

- Initialization: basic setup for using the TDX module from KVM, plus
ioctls to create TDX VMs and vCPUs.

- MMU: in TDX, private and shared halves of the address space are mapped by
different EPT roots, and the private half is managed by the TDX module.
Using the support that was added to the generic MMU code in 6.14,
add support for TDX's secure page tables to the Intel side of KVM.
Generic KVM code takes care of maintaining a mirror of the secure page
tables so that they can be queried efficiently, and ensuring that changes
are applied to both the mirror and the secure EPT.

- vCPU enter/exit: implement the callbacks that handle the entry of a TDX
vCPU (via the SEAMCALL TDH.VP.ENTER) and the corresponding save/restore
of host state.

- Userspace exits: introduce support for guest TDVMCALLs that KVM forwards to
userspace. These correspond to the usual KVM_EXIT_* "heavyweight vmexits"
but are triggered through a different mechanism, similar to VMGEXIT for
SEV-ES and SEV-SNP.

- Interrupt handling: support for virtual interrupt injection as well as
handling VM-Exits that are caused by vectored events. Exclusive to
TDX are machine-check SMIs, which the kernel already knows how to
handle through the kernel machine check handler (commit 7911f145de5f,
"x86/mce: Implement recovery for errors in TDX/SEAM non-root mode").

- Loose ends: handling of the remaining exits from the TDX module, including
EPT violation/misconfig and several TDVMCALL leaves that are handled in
the kernel (CPUID, HLT, RDMSR/WRMSR, GetTdVmCallInfo); plus returning
an error or ignoring operations that are not supported by TDX guests.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

+6820 -581
+37 -4
Documentation/virt/kvm/api.rst
··· 1411 1411 mmap() that affects the region will be made visible immediately. Another 1412 1412 example is madvise(MADV_DROP). 1413 1413 1414 + For TDX guest, deleting/moving memory region loses guest memory contents. 1415 + Read only region isn't supported. Only as-id 0 is supported. 1416 + 1414 1417 Note: On arm64, a write generated by the page-table walker (to update 1415 1418 the Access and Dirty flags, for example) never results in a 1416 1419 KVM_EXIT_MMIO exit when the slot has the KVM_MEM_READONLY flag. This ··· 4771 4768 4772 4769 :Capability: basic 4773 4770 :Architectures: x86 4774 - :Type: vm 4771 + :Type: vm ioctl, vcpu ioctl 4775 4772 :Parameters: an opaque platform specific structure (in/out) 4776 4773 :Returns: 0 on success; -1 on error 4777 4774 ··· 4779 4776 for issuing platform-specific memory encryption commands to manage those 4780 4777 encrypted VMs. 4781 4778 4782 - Currently, this ioctl is used for issuing Secure Encrypted Virtualization 4783 - (SEV) commands on AMD Processors. The SEV commands are defined in 4784 - Documentation/virt/kvm/x86/amd-memory-encryption.rst. 4779 + Currently, this ioctl is used for issuing both Secure Encrypted Virtualization 4780 + (SEV) commands on AMD Processors and Trusted Domain Extensions (TDX) commands 4781 + on Intel Processors. The detailed commands are defined in 4782 + Documentation/virt/kvm/x86/amd-memory-encryption.rst and 4783 + Documentation/virt/kvm/x86/intel-tdx.rst. 4785 4784 4786 4785 4.111 KVM_MEMORY_ENCRYPT_REG_REGION 4787 4786 ----------------------------------- ··· 6832 6827 #define KVM_SYSTEM_EVENT_WAKEUP 4 6833 6828 #define KVM_SYSTEM_EVENT_SUSPEND 5 6834 6829 #define KVM_SYSTEM_EVENT_SEV_TERM 6 6830 + #define KVM_SYSTEM_EVENT_TDX_FATAL 7 6835 6831 __u32 type; 6836 6832 __u32 ndata; 6837 6833 __u64 data[16]; ··· 6859 6853 reset/shutdown of the VM. 6860 6854 - KVM_SYSTEM_EVENT_SEV_TERM -- an AMD SEV guest requested termination. 
6861 6855 The guest physical address of the guest's GHCB is stored in `data[0]`. 6856 + - KVM_SYSTEM_EVENT_TDX_FATAL -- a TDX guest reported a fatal error state. 6857 + KVM doesn't do any parsing or conversion, it just dumps 16 general-purpose 6858 + registers to userspace, in ascending order of the 4-bit indices for x86-64 6859 + general-purpose registers in instruction encoding, as defined in the Intel 6860 + SDM. 6862 6861 - KVM_SYSTEM_EVENT_WAKEUP -- the exiting vCPU is in a suspended state and 6863 6862 KVM has recognized a wakeup event. Userspace may honor this event by 6864 6863 marking the exiting vCPU as runnable, or deny it and call KVM_RUN again. ··· 8205 8194 and 0x489), as KVM does now allow them to 8206 8195 be set by userspace (KVM sets them based on 8207 8196 guest CPUID, for safety purposes). 8197 + 8198 + KVM_X86_QUIRK_IGNORE_GUEST_PAT By default, on Intel platforms, KVM ignores 8199 + guest PAT and forces the effective memory 8200 + type to WB in EPT. The quirk is not available 8201 + on Intel platforms which are incapable of 8202 + safely honoring guest PAT (i.e., without CPU 8203 + self-snoop, KVM always ignores guest PAT and 8204 + forces effective memory type to WB). It is 8205 + also ignored on AMD platforms or, on Intel, 8206 + when a VM has non-coherent DMA devices 8207 + assigned; KVM always honors guest PAT in 8208 + such case. The quirk is needed to avoid 8209 + slowdowns on certain Intel Xeon platforms 8210 + (e.g. ICX, SPR) where self-snoop feature is 8211 + supported but UC is slow enough to cause 8212 + issues with some older guests that use 8213 + UC instead of WC to map the video RAM. 8214 + Userspace can disable the quirk to honor 8215 + guest PAT if it knows that there is no such 8216 + guest software, for example if it does not 8217 + expose a bochs graphics device (which is 8218 + known to have had a buggy driver). 
8208 8219 =================================== ============================================ 8209 8220 8210 8221 7.32 KVM_CAP_MAX_VCPU_ID
+1
Documentation/virt/kvm/x86/index.rst
··· 11 11 cpuid 12 12 errata 13 13 hypercalls 14 + intel-tdx 14 15 mmu 15 16 msr 16 17 nested-vmx
+255
Documentation/virt/kvm/x86/intel-tdx.rst
··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + =================================== 4 + Intel Trust Domain Extensions (TDX) 5 + =================================== 6 + 7 + Overview 8 + ======== 9 + Intel's Trust Domain Extensions (TDX) protect confidential guest VMs from the 10 + host and physical attacks. A CPU-attested software module called 'the TDX 11 + module' runs inside a new CPU isolated range to provide the functionalities to 12 + manage and run protected VMs, a.k.a, TDX guests or TDs. 13 + 14 + Please refer to [1] for the whitepaper, specifications and other resources. 15 + 16 + This documentation describes TDX-specific KVM ABIs. The TDX module needs to be 17 + initialized before it can be used by KVM to run any TDX guests. The host 18 + core-kernel provides the support of initializing the TDX module, which is 19 + described in the Documentation/arch/x86/tdx.rst. 20 + 21 + API description 22 + =============== 23 + 24 + KVM_MEMORY_ENCRYPT_OP 25 + --------------------- 26 + :Type: vm ioctl, vcpu ioctl 27 + 28 + For TDX operations, KVM_MEMORY_ENCRYPT_OP is re-purposed to be generic 29 + ioctl with TDX specific sub-ioctl() commands. 30 + 31 + :: 32 + 33 + /* Trust Domain Extensions sub-ioctl() commands. */ 34 + enum kvm_tdx_cmd_id { 35 + KVM_TDX_CAPABILITIES = 0, 36 + KVM_TDX_INIT_VM, 37 + KVM_TDX_INIT_VCPU, 38 + KVM_TDX_INIT_MEM_REGION, 39 + KVM_TDX_FINALIZE_VM, 40 + KVM_TDX_GET_CPUID, 41 + 42 + KVM_TDX_CMD_NR_MAX, 43 + }; 44 + 45 + struct kvm_tdx_cmd { 46 + /* enum kvm_tdx_cmd_id */ 47 + __u32 id; 48 + /* flags for sub-command. If sub-command doesn't use this, set zero. */ 49 + __u32 flags; 50 + /* 51 + * data for each sub-command. An immediate or a pointer to the actual 52 + * data in process virtual address. If sub-command doesn't use it, 53 + * set zero. 54 + */ 55 + __u64 data; 56 + /* 57 + * Auxiliary error code. The sub-command may return TDX SEAMCALL 58 + * status code in addition to -Exxx. 
59 + */ 60 + __u64 hw_error; 61 + }; 62 + 63 + KVM_TDX_CAPABILITIES 64 + -------------------- 65 + :Type: vm ioctl 66 + :Returns: 0 on success, <0 on error 67 + 68 + Return the TDX capabilities that current KVM supports with the specific TDX 69 + module loaded in the system. It reports what features/capabilities are allowed 70 + to be configured to the TDX guest. 71 + 72 + - id: KVM_TDX_CAPABILITIES 73 + - flags: must be 0 74 + - data: pointer to struct kvm_tdx_capabilities 75 + - hw_error: must be 0 76 + 77 + :: 78 + 79 + struct kvm_tdx_capabilities { 80 + __u64 supported_attrs; 81 + __u64 supported_xfam; 82 + __u64 reserved[254]; 83 + 84 + /* Configurable CPUID bits for userspace */ 85 + struct kvm_cpuid2 cpuid; 86 + }; 87 + 88 + 89 + KVM_TDX_INIT_VM 90 + --------------- 91 + :Type: vm ioctl 92 + :Returns: 0 on success, <0 on error 93 + 94 + Perform TDX specific VM initialization. This needs to be called after 95 + KVM_CREATE_VM and before creating any VCPUs. 96 + 97 + - id: KVM_TDX_INIT_VM 98 + - flags: must be 0 99 + - data: pointer to struct kvm_tdx_init_vm 100 + - hw_error: must be 0 101 + 102 + :: 103 + 104 + struct kvm_tdx_init_vm { 105 + __u64 attributes; 106 + __u64 xfam; 107 + __u64 mrconfigid[6]; /* sha384 digest */ 108 + __u64 mrowner[6]; /* sha384 digest */ 109 + __u64 mrownerconfig[6]; /* sha384 digest */ 110 + 111 + /* The total space for TD_PARAMS before the CPUIDs is 256 bytes */ 112 + __u64 reserved[12]; 113 + 114 + /* 115 + * Call KVM_TDX_INIT_VM before vcpu creation, thus before 116 + * KVM_SET_CPUID2. 117 + * This configuration supersedes KVM_SET_CPUID2s for VCPUs because the 118 + * TDX module directly virtualizes those CPUIDs without VMM. The user 119 + * space VMM, e.g. qemu, should make KVM_SET_CPUID2 consistent with 120 + * those values. If it doesn't, KVM may have wrong idea of vCPUIDs of 121 + * the guest, and KVM may wrongly emulate CPUIDs or MSRs that the TDX 122 + * module doesn't virtualize. 
123 + */ 124 + struct kvm_cpuid2 cpuid; 125 + }; 126 + 127 + 128 + KVM_TDX_INIT_VCPU 129 + ----------------- 130 + :Type: vcpu ioctl 131 + :Returns: 0 on success, <0 on error 132 + 133 + Perform TDX specific VCPU initialization. 134 + 135 + - id: KVM_TDX_INIT_VCPU 136 + - flags: must be 0 137 + - data: initial value of the guest TD VCPU RCX 138 + - hw_error: must be 0 139 + 140 + KVM_TDX_INIT_MEM_REGION 141 + ----------------------- 142 + :Type: vcpu ioctl 143 + :Returns: 0 on success, <0 on error 144 + 145 + Initialize @nr_pages TDX guest private memory starting from @gpa with userspace 146 + provided data from @source_addr. 147 + 148 + Note, before calling this sub command, memory attribute of the range 149 + [gpa, gpa + nr_pages] needs to be private. Userspace can use 150 + KVM_SET_MEMORY_ATTRIBUTES to set the attribute. 151 + 152 + If KVM_TDX_MEASURE_MEMORY_REGION flag is specified, it also extends measurement. 153 + 154 + - id: KVM_TDX_INIT_MEM_REGION 155 + - flags: currently only KVM_TDX_MEASURE_MEMORY_REGION is defined 156 + - data: pointer to struct kvm_tdx_init_mem_region 157 + - hw_error: must be 0 158 + 159 + :: 160 + 161 + #define KVM_TDX_MEASURE_MEMORY_REGION (1UL << 0) 162 + 163 + struct kvm_tdx_init_mem_region { 164 + __u64 source_addr; 165 + __u64 gpa; 166 + __u64 nr_pages; 167 + }; 168 + 169 + 170 + KVM_TDX_FINALIZE_VM 171 + ------------------- 172 + :Type: vm ioctl 173 + :Returns: 0 on success, <0 on error 174 + 175 + Complete measurement of the initial TD contents and mark it ready to run. 176 + 177 + - id: KVM_TDX_FINALIZE_VM 178 + - flags: must be 0 179 + - data: must be 0 180 + - hw_error: must be 0 181 + 182 + 183 + KVM_TDX_GET_CPUID 184 + ----------------- 185 + :Type: vcpu ioctl 186 + :Returns: 0 on success, <0 on error 187 + 188 + Get the CPUID values that the TDX module virtualizes for the TD guest. 189 + When it returns -E2BIG, the user space should allocate a larger buffer and 190 + retry. 
The minimum buffer size is updated in the nent field of the 191 + struct kvm_cpuid2. 192 + 193 + - id: KVM_TDX_GET_CPUID 194 + - flags: must be 0 195 + - data: pointer to struct kvm_cpuid2 (in/out) 196 + - hw_error: must be 0 (out) 197 + 198 + :: 199 + 200 + struct kvm_cpuid2 { 201 + __u32 nent; 202 + __u32 padding; 203 + struct kvm_cpuid_entry2 entries[0]; 204 + }; 205 + 206 + struct kvm_cpuid_entry2 { 207 + __u32 function; 208 + __u32 index; 209 + __u32 flags; 210 + __u32 eax; 211 + __u32 ebx; 212 + __u32 ecx; 213 + __u32 edx; 214 + __u32 padding[3]; 215 + }; 216 + 217 + KVM TDX creation flow 218 + ===================== 219 + In addition to the standard KVM flow, new TDX ioctls need to be called. The 220 + control flow is as follows: 221 + 222 + #. Check system wide capability 223 + 224 + * KVM_CAP_VM_TYPES: Check if VM type is supported and if KVM_X86_TDX_VM 225 + is supported. 226 + 227 + #. Create VM 228 + 229 + * KVM_CREATE_VM 230 + * KVM_TDX_CAPABILITIES: Query TDX capabilities for creating TDX guests. 231 + * KVM_CHECK_EXTENSION(KVM_CAP_MAX_VCPUS): Query maximum VCPUs the TD can 232 + support at VM level (TDX has its own limitation on this). 233 + * KVM_SET_TSC_KHZ: Configure TD's TSC frequency if a different TSC frequency 234 + than host is desired. This is Optional. 235 + * KVM_TDX_INIT_VM: Pass TDX specific VM parameters. 236 + 237 + #. Create VCPU 238 + 239 + * KVM_CREATE_VCPU 240 + * KVM_TDX_INIT_VCPU: Pass TDX specific VCPU parameters. 241 + * KVM_SET_CPUID2: Configure TD's CPUIDs. 242 + * KVM_SET_MSRS: Configure TD's MSRs. 243 + 244 + #. Initialize initial guest memory 245 + 246 + * Prepare content of initial guest memory. 247 + * KVM_TDX_INIT_MEM_REGION: Add initial guest memory. 248 + * KVM_TDX_FINALIZE_VM: Finalize the measurement of the TDX guest. 249 + 250 + #. Run VCPU 251 + 252 + References 253 + ========== 254 + 255 + .. [1] https://www.intel.com/content/www/us/en/developer/tools/trust-domain-extensions/documentation.html
+4 -1
arch/x86/include/asm/kvm-x86-ops.h
··· 21 21 KVM_X86_OP(vcpu_after_set_cpuid) 22 22 KVM_X86_OP(vm_init) 23 23 KVM_X86_OP_OPTIONAL(vm_destroy) 24 + KVM_X86_OP_OPTIONAL(vm_pre_destroy) 24 25 KVM_X86_OP_OPTIONAL_RET0(vcpu_precreate) 25 26 KVM_X86_OP(vcpu_create) 26 27 KVM_X86_OP(vcpu_free) ··· 116 115 KVM_X86_OP_OPTIONAL(apicv_pre_state_restore) 117 116 KVM_X86_OP_OPTIONAL(apicv_post_state_restore) 118 117 KVM_X86_OP_OPTIONAL_RET0(dy_apicv_has_pending_interrupt) 118 + KVM_X86_OP_OPTIONAL(protected_apic_has_interrupt) 119 119 KVM_X86_OP_OPTIONAL(set_hv_timer) 120 120 KVM_X86_OP_OPTIONAL(cancel_hv_timer) 121 121 KVM_X86_OP(setup_mce) ··· 127 125 KVM_X86_OP(enable_smi_window) 128 126 #endif 129 127 KVM_X86_OP_OPTIONAL(dev_get_attr) 130 - KVM_X86_OP_OPTIONAL(mem_enc_ioctl) 128 + KVM_X86_OP(mem_enc_ioctl) 129 + KVM_X86_OP_OPTIONAL(vcpu_mem_enc_ioctl) 131 130 KVM_X86_OP_OPTIONAL(mem_enc_register_region) 132 131 KVM_X86_OP_OPTIONAL(mem_enc_unregister_region) 133 132 KVM_X86_OP_OPTIONAL(vm_copy_enc_context_from)
+26 -8
arch/x86/include/asm/kvm_host.h
··· 607 607 struct kvm_pmu_ops; 608 608 609 609 enum { 610 - KVM_DEBUGREG_BP_ENABLED = 1, 611 - KVM_DEBUGREG_WONT_EXIT = 2, 610 + KVM_DEBUGREG_BP_ENABLED = BIT(0), 611 + KVM_DEBUGREG_WONT_EXIT = BIT(1), 612 + /* 613 + * Guest debug registers (DR0-3, DR6 and DR7) are saved/restored by 614 + * hardware on exit from or enter to guest. KVM needn't switch them. 615 + * DR0-3, DR6 and DR7 are set to their architectural INIT value on VM 616 + * exit, host values need to be restored. 617 + */ 618 + KVM_DEBUGREG_AUTO_SWITCH = BIT(2), 612 619 }; 613 620 614 621 struct kvm_mtrr { ··· 1576 1569 struct kvm_mmu_memory_cache split_desc_cache; 1577 1570 1578 1571 gfn_t gfn_direct_bits; 1572 + 1573 + /* 1574 + * Size of the CPU's dirty log buffer, i.e. VMX's PML buffer. A Zero 1575 + * value indicates CPU dirty logging is unsupported or disabled in 1576 + * current VM. 1577 + */ 1578 + int cpu_dirty_log_size; 1579 1579 }; 1580 1580 1581 1581 struct kvm_vm_stat { ··· 1686 1672 unsigned int vm_size; 1687 1673 int (*vm_init)(struct kvm *kvm); 1688 1674 void (*vm_destroy)(struct kvm *kvm); 1675 + void (*vm_pre_destroy)(struct kvm *kvm); 1689 1676 1690 1677 /* Create, but do not attach this VCPU */ 1691 1678 int (*vcpu_precreate)(struct kvm *kvm); ··· 1836 1821 struct x86_exception *exception); 1837 1822 void (*handle_exit_irqoff)(struct kvm_vcpu *vcpu); 1838 1823 1839 - /* 1840 - * Size of the CPU's dirty log buffer, i.e. VMX's PML buffer. A zero 1841 - * value indicates CPU dirty logging is unsupported or disabled. 
1842 - */ 1843 - int cpu_dirty_log_size; 1844 1824 void (*update_cpu_dirty_logging)(struct kvm_vcpu *vcpu); 1845 1825 1846 1826 const struct kvm_x86_nested_ops *nested_ops; ··· 1849 1839 void (*apicv_pre_state_restore)(struct kvm_vcpu *vcpu); 1850 1840 void (*apicv_post_state_restore)(struct kvm_vcpu *vcpu); 1851 1841 bool (*dy_apicv_has_pending_interrupt)(struct kvm_vcpu *vcpu); 1842 + bool (*protected_apic_has_interrupt)(struct kvm_vcpu *vcpu); 1852 1843 1853 1844 int (*set_hv_timer)(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc, 1854 1845 bool *expired); ··· 1866 1855 1867 1856 int (*dev_get_attr)(u32 group, u64 attr, u64 *val); 1868 1857 int (*mem_enc_ioctl)(struct kvm *kvm, void __user *argp); 1858 + int (*vcpu_mem_enc_ioctl)(struct kvm_vcpu *vcpu, void __user *argp); 1869 1859 int (*mem_enc_register_region)(struct kvm *kvm, struct kvm_enc_region *argp); 1870 1860 int (*mem_enc_unregister_region)(struct kvm *kvm, struct kvm_enc_region *argp); 1871 1861 int (*vm_copy_enc_context_from)(struct kvm *kvm, unsigned int source_fd); ··· 2343 2331 int kvm_add_user_return_msr(u32 msr); 2344 2332 int kvm_find_user_return_msr(u32 msr); 2345 2333 int kvm_set_user_return_msr(unsigned index, u64 val, u64 mask); 2334 + void kvm_user_return_msr_update_cache(unsigned int index, u64 val); 2346 2335 2347 2336 static inline bool kvm_is_supported_user_return_msr(u32 msr) 2348 2337 { ··· 2427 2414 KVM_X86_QUIRK_FIX_HYPERCALL_INSN | \ 2428 2415 KVM_X86_QUIRK_MWAIT_NEVER_UD_FAULTS | \ 2429 2416 KVM_X86_QUIRK_SLOT_ZAP_ALL | \ 2430 - KVM_X86_QUIRK_STUFF_FEATURE_MSRS) 2417 + KVM_X86_QUIRK_STUFF_FEATURE_MSRS | \ 2418 + KVM_X86_QUIRK_IGNORE_GUEST_PAT) 2419 + 2420 + #define KVM_X86_CONDITIONAL_QUIRKS \ 2421 + (KVM_X86_QUIRK_CD_NW_CLEARED | \ 2422 + KVM_X86_QUIRK_IGNORE_GUEST_PAT) 2431 2423 2432 2424 /* 2433 2425 * KVM previously used a u32 field in kvm_run to indicate the hypercall was
+5
arch/x86/include/asm/posted_intr.h
··· 81 81 return test_bit(POSTED_INTR_SN, (unsigned long *)&pi_desc->control); 82 82 } 83 83 84 + static inline bool pi_test_pir(int vector, struct pi_desc *pi_desc) 85 + { 86 + return test_bit(vector, (unsigned long *)pi_desc->pir); 87 + } 88 + 84 89 /* Non-atomic helpers */ 85 90 static inline void __pi_set_sn(struct pi_desc *pi_desc) 86 91 {
+8 -1
arch/x86/include/asm/shared/tdx.h
··· 67 67 #define TD_CTLS_LOCK BIT_ULL(TD_CTLS_LOCK_BIT) 68 68 69 69 /* TDX hypercall Leaf IDs */ 70 + #define TDVMCALL_GET_TD_VM_CALL_INFO 0x10000 70 71 #define TDVMCALL_MAP_GPA 0x10001 71 72 #define TDVMCALL_GET_QUOTE 0x10002 72 73 #define TDVMCALL_REPORT_FATAL_ERROR 0x10003 73 74 74 - #define TDVMCALL_STATUS_RETRY 1 75 + /* 76 + * TDG.VP.VMCALL Status Codes (returned in R10) 77 + */ 78 + #define TDVMCALL_STATUS_SUCCESS 0x0000000000000000ULL 79 + #define TDVMCALL_STATUS_RETRY 0x0000000000000001ULL 80 + #define TDVMCALL_STATUS_INVALID_OPERAND 0x8000000000000000ULL 81 + #define TDVMCALL_STATUS_ALIGN_ERROR 0x8000000000000002ULL 75 82 76 83 /* 77 84 * Bitmasks of exposed registers (with VMM).
+75
arch/x86/include/asm/tdx.h
··· 5 5 6 6 #include <linux/init.h> 7 7 #include <linux/bits.h> 8 + #include <linux/mmzone.h> 8 9 9 10 #include <asm/errno.h> 10 11 #include <asm/ptrace.h> ··· 19 18 * TDX module. 20 19 */ 21 20 #define TDX_ERROR _BITUL(63) 21 + #define TDX_NON_RECOVERABLE _BITUL(62) 22 22 #define TDX_SW_ERROR (TDX_ERROR | GENMASK_ULL(47, 40)) 23 23 #define TDX_SEAMCALL_VMFAILINVALID (TDX_SW_ERROR | _UL(0xFFFF0000)) 24 24 ··· 35 33 #ifndef __ASSEMBLER__ 36 34 37 35 #include <uapi/asm/mce.h> 36 + #include <asm/tdx_global_metadata.h> 37 + #include <linux/pgtable.h> 38 38 39 39 /* 40 40 * Used by the #VE exception handler to gather the #VE exception ··· 123 119 int tdx_cpu_enable(void); 124 120 int tdx_enable(void); 125 121 const char *tdx_dump_mce_info(struct mce *m); 122 + const struct tdx_sys_info *tdx_get_sysinfo(void); 123 + 124 + int tdx_guest_keyid_alloc(void); 125 + u32 tdx_get_nr_guest_keyids(void); 126 + void tdx_guest_keyid_free(unsigned int keyid); 127 + 128 + struct tdx_td { 129 + /* TD root structure: */ 130 + struct page *tdr_page; 131 + 132 + int tdcs_nr_pages; 133 + /* TD control structure: */ 134 + struct page **tdcs_pages; 135 + 136 + /* Size of `tdcx_pages` in struct tdx_vp */ 137 + int tdcx_nr_pages; 138 + }; 139 + 140 + struct tdx_vp { 141 + /* TDVP root page */ 142 + struct page *tdvpr_page; 143 + 144 + /* TD vCPU control structure: */ 145 + struct page **tdcx_pages; 146 + }; 147 + 148 + static inline u64 mk_keyed_paddr(u16 hkid, struct page *page) 149 + { 150 + u64 ret; 151 + 152 + ret = page_to_phys(page); 153 + /* KeyID bits are just above the physical address bits: */ 154 + ret |= (u64)hkid << boot_cpu_data.x86_phys_bits; 155 + 156 + return ret; 157 + } 158 + 159 + static inline int pg_level_to_tdx_sept_level(enum pg_level level) 160 + { 161 + WARN_ON_ONCE(level == PG_LEVEL_NONE); 162 + return level - 1; 163 + } 164 + 165 + u64 tdh_vp_enter(struct tdx_vp *vp, struct tdx_module_args *args); 166 + u64 tdh_mng_addcx(struct tdx_td *td, struct page *tdcs_page); 
167 + u64 tdh_mem_page_add(struct tdx_td *td, u64 gpa, struct page *page, struct page *source, u64 *ext_err1, u64 *ext_err2); 168 + u64 tdh_mem_sept_add(struct tdx_td *td, u64 gpa, int level, struct page *page, u64 *ext_err1, u64 *ext_err2); 169 + u64 tdh_vp_addcx(struct tdx_vp *vp, struct page *tdcx_page); 170 + u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct page *page, u64 *ext_err1, u64 *ext_err2); 171 + u64 tdh_mem_range_block(struct tdx_td *td, u64 gpa, int level, u64 *ext_err1, u64 *ext_err2); 172 + u64 tdh_mng_key_config(struct tdx_td *td); 173 + u64 tdh_mng_create(struct tdx_td *td, u16 hkid); 174 + u64 tdh_vp_create(struct tdx_td *td, struct tdx_vp *vp); 175 + u64 tdh_mng_rd(struct tdx_td *td, u64 field, u64 *data); 176 + u64 tdh_mr_extend(struct tdx_td *td, u64 gpa, u64 *ext_err1, u64 *ext_err2); 177 + u64 tdh_mr_finalize(struct tdx_td *td); 178 + u64 tdh_vp_flush(struct tdx_vp *vp); 179 + u64 tdh_mng_vpflushdone(struct tdx_td *td); 180 + u64 tdh_mng_key_freeid(struct tdx_td *td); 181 + u64 tdh_mng_init(struct tdx_td *td, u64 td_params, u64 *extended_err); 182 + u64 tdh_vp_init(struct tdx_vp *vp, u64 initial_rcx, u32 x2apicid); 183 + u64 tdh_vp_rd(struct tdx_vp *vp, u64 field, u64 *data); 184 + u64 tdh_vp_wr(struct tdx_vp *vp, u64 field, u64 data, u64 mask); 185 + u64 tdh_phymem_page_reclaim(struct page *page, u64 *tdx_pt, u64 *tdx_owner, u64 *tdx_size); 186 + u64 tdh_mem_track(struct tdx_td *tdr); 187 + u64 tdh_mem_page_remove(struct tdx_td *td, u64 gpa, u64 level, u64 *ext_err1, u64 *ext_err2); 188 + u64 tdh_phymem_cache_wb(bool resume); 189 + u64 tdh_phymem_page_wbinvd_tdr(struct tdx_td *td); 190 + u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page); 126 191 #else 127 192 static inline void tdx_init(void) { } 128 193 static inline int tdx_cpu_enable(void) { return -ENODEV; } 129 194 static inline int tdx_enable(void) { return -ENODEV; } 195 + static inline u32 tdx_get_nr_guest_keyids(void) { return 0; } 130 196 static 
inline const char *tdx_dump_mce_info(struct mce *m) { return NULL; } 197 + static inline const struct tdx_sys_info *tdx_get_sysinfo(void) { return NULL; } 131 198 #endif /* CONFIG_INTEL_TDX_HOST */ 132 199 133 200 #endif /* !__ASSEMBLER__ */
+2
arch/x86/include/asm/vmx.h
··· 256 256 TSC_MULTIPLIER_HIGH = 0x00002033, 257 257 TERTIARY_VM_EXEC_CONTROL = 0x00002034, 258 258 TERTIARY_VM_EXEC_CONTROL_HIGH = 0x00002035, 259 + SHARED_EPT_POINTER = 0x0000203C, 259 260 PID_POINTER_TABLE = 0x00002042, 260 261 PID_POINTER_TABLE_HIGH = 0x00002043, 261 262 GUEST_PHYSICAL_ADDRESS = 0x00002400, ··· 587 586 #define EPT_VIOLATION_PROT_READ BIT(3) 588 587 #define EPT_VIOLATION_PROT_WRITE BIT(4) 589 588 #define EPT_VIOLATION_PROT_EXEC BIT(5) 589 + #define EPT_VIOLATION_EXEC_FOR_RING3_LIN BIT(6) 590 590 #define EPT_VIOLATION_PROT_MASK (EPT_VIOLATION_PROT_READ | \ 591 591 EPT_VIOLATION_PROT_WRITE | \ 592 592 EPT_VIOLATION_PROT_EXEC)
+71
arch/x86/include/uapi/asm/kvm.h
··· 441 441 #define KVM_X86_QUIRK_MWAIT_NEVER_UD_FAULTS (1 << 6) 442 442 #define KVM_X86_QUIRK_SLOT_ZAP_ALL (1 << 7) 443 443 #define KVM_X86_QUIRK_STUFF_FEATURE_MSRS (1 << 8) 444 + #define KVM_X86_QUIRK_IGNORE_GUEST_PAT (1 << 9) 444 445 445 446 #define KVM_STATE_NESTED_FORMAT_VMX 0 446 447 #define KVM_STATE_NESTED_FORMAT_SVM 1 ··· 930 929 #define KVM_X86_SEV_ES_VM 3 931 930 #define KVM_X86_SNP_VM 4 932 931 #define KVM_X86_TDX_VM 5 932 + 933 + /* Trust Domain eXtension sub-ioctl() commands. */ 934 + enum kvm_tdx_cmd_id { 935 + KVM_TDX_CAPABILITIES = 0, 936 + KVM_TDX_INIT_VM, 937 + KVM_TDX_INIT_VCPU, 938 + KVM_TDX_INIT_MEM_REGION, 939 + KVM_TDX_FINALIZE_VM, 940 + KVM_TDX_GET_CPUID, 941 + 942 + KVM_TDX_CMD_NR_MAX, 943 + }; 944 + 945 + struct kvm_tdx_cmd { 946 + /* enum kvm_tdx_cmd_id */ 947 + __u32 id; 948 + /* flags for sub-commend. If sub-command doesn't use this, set zero. */ 949 + __u32 flags; 950 + /* 951 + * data for each sub-command. An immediate or a pointer to the actual 952 + * data in process virtual address. If sub-command doesn't use it, 953 + * set zero. 954 + */ 955 + __u64 data; 956 + /* 957 + * Auxiliary error code. The sub-command may return TDX SEAMCALL 958 + * status code in addition to -Exxx. 959 + */ 960 + __u64 hw_error; 961 + }; 962 + 963 + struct kvm_tdx_capabilities { 964 + __u64 supported_attrs; 965 + __u64 supported_xfam; 966 + __u64 reserved[254]; 967 + 968 + /* Configurable CPUID bits for userspace */ 969 + struct kvm_cpuid2 cpuid; 970 + }; 971 + 972 + struct kvm_tdx_init_vm { 973 + __u64 attributes; 974 + __u64 xfam; 975 + __u64 mrconfigid[6]; /* sha384 digest */ 976 + __u64 mrowner[6]; /* sha384 digest */ 977 + __u64 mrownerconfig[6]; /* sha384 digest */ 978 + 979 + /* The total space for TD_PARAMS before the CPUIDs is 256 bytes */ 980 + __u64 reserved[12]; 981 + 982 + /* 983 + * Call KVM_TDX_INIT_VM before vcpu creation, thus before 984 + * KVM_SET_CPUID2. 
985 + * This configuration supersedes KVM_SET_CPUID2s for VCPUs because the 986 + * TDX module directly virtualizes those CPUIDs without VMM. The user 987 + * space VMM, e.g. qemu, should make KVM_SET_CPUID2 consistent with 988 + * those values. If it doesn't, KVM may have wrong idea of vCPUIDs of 989 + * the guest, and KVM may wrongly emulate CPUIDs or MSRs that the TDX 990 + * module doesn't virtualize. 991 + */ 992 + struct kvm_cpuid2 cpuid; 993 + }; 994 + 995 + #define KVM_TDX_MEASURE_MEMORY_REGION _BITULL(0) 996 + 997 + struct kvm_tdx_init_mem_region { 998 + __u64 source_addr; 999 + __u64 gpa; 1000 + __u64 nr_pages; 1001 + }; 933 1002 934 1003 #endif /* _ASM_X86_KVM_H */
+4 -1
arch/x86/include/uapi/asm/vmx.h
··· 34 34 #define EXIT_REASON_TRIPLE_FAULT 2 35 35 #define EXIT_REASON_INIT_SIGNAL 3 36 36 #define EXIT_REASON_SIPI_SIGNAL 4 37 + #define EXIT_REASON_OTHER_SMI 6 37 38 38 39 #define EXIT_REASON_INTERRUPT_WINDOW 7 39 40 #define EXIT_REASON_NMI_WINDOW 8 ··· 93 92 #define EXIT_REASON_TPAUSE 68 94 93 #define EXIT_REASON_BUS_LOCK 74 95 94 #define EXIT_REASON_NOTIFY 75 95 + #define EXIT_REASON_TDCALL 77 96 96 97 97 #define VMX_EXIT_REASONS \ 98 98 { EXIT_REASON_EXCEPTION_NMI, "EXCEPTION_NMI" }, \ ··· 157 155 { EXIT_REASON_UMWAIT, "UMWAIT" }, \ 158 156 { EXIT_REASON_TPAUSE, "TPAUSE" }, \ 159 157 { EXIT_REASON_BUS_LOCK, "BUS_LOCK" }, \ 160 - { EXIT_REASON_NOTIFY, "NOTIFY" } 158 + { EXIT_REASON_NOTIFY, "NOTIFY" }, \ 159 + { EXIT_REASON_TDCALL, "TDCALL" } 161 160 162 161 #define VMX_EXIT_REASON_FLAGS \ 163 162 { VMX_EXIT_REASONS_FAILED_VMENTRY, "FAILED_VMENTRY" }
+12
arch/x86/kvm/Kconfig
··· 95 95 config KVM_INTEL 96 96 tristate "KVM for Intel (and compatible) processors support" 97 97 depends on KVM && IA32_FEAT_CTL 98 + select KVM_GENERIC_PRIVATE_MEM if INTEL_TDX_HOST 99 + select KVM_GENERIC_MEMORY_ATTRIBUTES if INTEL_TDX_HOST 98 100 help 99 101 Provides support for KVM on processors equipped with Intel's VT 100 102 extensions, a.k.a. Virtual Machine Extensions (VMX). ··· 128 126 129 127 This includes support to expose "raw" unreclaimable enclave memory to 130 128 guests via a device node, e.g. /dev/sgx_vepc. 129 + 130 + If unsure, say N. 131 + 132 + config KVM_INTEL_TDX 133 + bool "Intel Trust Domain Extensions (TDX) support" 134 + default y 135 + depends on INTEL_TDX_HOST 136 + help 137 + Provides support for launching Intel Trust Domain Extensions (TDX) 138 + confidential VMs on Intel processors. 131 139 132 140 If unsure, say N. 133 141
+1
arch/x86/kvm/Makefile
··· 20 20 21 21 kvm-intel-$(CONFIG_X86_SGX_KVM) += vmx/sgx.o 22 22 kvm-intel-$(CONFIG_KVM_HYPERV) += vmx/hyperv.o vmx/hyperv_evmcs.o 23 + kvm-intel-$(CONFIG_KVM_INTEL_TDX) += vmx/tdx.o 23 24 24 25 kvm-amd-y += svm/svm.o svm/vmenter.o svm/pmu.o svm/nested.o svm/avic.o 25 26
+19 -33
arch/x86/kvm/cpuid.c
··· 81 81 return ret; 82 82 } 83 83 84 - /* 85 - * Magic value used by KVM when querying userspace-provided CPUID entries and 86 - * doesn't care about the CPIUD index because the index of the function in 87 - * question is not significant. Note, this magic value must have at least one 88 - * bit set in bits[63:32] and must be consumed as a u64 by cpuid_entry2_find() 89 - * to avoid false positives when processing guest CPUID input. 90 - */ 91 - #define KVM_CPUID_INDEX_NOT_SIGNIFICANT -1ull 92 - 93 - static struct kvm_cpuid_entry2 *cpuid_entry2_find(struct kvm_vcpu *vcpu, 94 - u32 function, u64 index) 84 + struct kvm_cpuid_entry2 *kvm_find_cpuid_entry2( 85 + struct kvm_cpuid_entry2 *entries, int nent, u32 function, u64 index) 95 86 { 96 87 struct kvm_cpuid_entry2 *e; 97 88 int i; ··· 99 108 */ 100 109 lockdep_assert_irqs_enabled(); 101 110 102 - for (i = 0; i < vcpu->arch.cpuid_nent; i++) { 103 - e = &vcpu->arch.cpuid_entries[i]; 111 + for (i = 0; i < nent; i++) { 112 + e = &entries[i]; 104 113 105 114 if (e->function != function) 106 115 continue; ··· 131 140 132 141 return NULL; 133 142 } 134 - 135 - struct kvm_cpuid_entry2 *kvm_find_cpuid_entry_index(struct kvm_vcpu *vcpu, 136 - u32 function, u32 index) 137 - { 138 - return cpuid_entry2_find(vcpu, function, index); 139 - } 140 - EXPORT_SYMBOL_GPL(kvm_find_cpuid_entry_index); 141 - 142 - struct kvm_cpuid_entry2 *kvm_find_cpuid_entry(struct kvm_vcpu *vcpu, 143 - u32 function) 144 - { 145 - return cpuid_entry2_find(vcpu, function, KVM_CPUID_INDEX_NOT_SIGNIFICANT); 146 - } 147 - EXPORT_SYMBOL_GPL(kvm_find_cpuid_entry); 148 - 149 - /* 150 - * cpuid_entry2_find() and KVM_CPUID_INDEX_NOT_SIGNIFICANT should never be used 151 - * directly outside of kvm_find_cpuid_entry() and kvm_find_cpuid_entry_index(). 
152 - */ 153 - #undef KVM_CPUID_INDEX_NOT_SIGNIFICANT 143 + EXPORT_SYMBOL_GPL(kvm_find_cpuid_entry2); 154 144 155 145 static int kvm_check_cpuid(struct kvm_vcpu *vcpu) 156 146 { ··· 462 490 return best->eax & 0xff; 463 491 not_found: 464 492 return 36; 493 + } 494 + 495 + int cpuid_query_maxguestphyaddr(struct kvm_vcpu *vcpu) 496 + { 497 + struct kvm_cpuid_entry2 *best; 498 + 499 + best = kvm_find_cpuid_entry(vcpu, 0x80000000); 500 + if (!best || best->eax < 0x80000008) 501 + goto not_found; 502 + best = kvm_find_cpuid_entry(vcpu, 0x80000008); 503 + if (best) 504 + return (best->eax >> 16) & 0xff; 505 + not_found: 506 + return 0; 465 507 } 466 508 467 509 /*
+29 -4
arch/x86/kvm/cpuid.h
···
 void kvm_set_cpu_caps(void);
 
 void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu);
-struct kvm_cpuid_entry2 *kvm_find_cpuid_entry_index(struct kvm_vcpu *vcpu,
-						    u32 function, u32 index);
-struct kvm_cpuid_entry2 *kvm_find_cpuid_entry(struct kvm_vcpu *vcpu,
-					      u32 function);
+struct kvm_cpuid_entry2 *kvm_find_cpuid_entry2(struct kvm_cpuid_entry2 *entries,
+					       int nent, u32 function, u64 index);
+/*
+ * Magic value used by KVM when querying userspace-provided CPUID entries and
+ * doesn't care about the CPUID index because the index of the function in
+ * question is not significant.  Note, this magic value must have at least one
+ * bit set in bits[63:32] and must be consumed as a u64 by kvm_find_cpuid_entry2()
+ * to avoid false positives when processing guest CPUID input.
+ *
+ * KVM_CPUID_INDEX_NOT_SIGNIFICANT should never be used directly outside of
+ * kvm_find_cpuid_entry2() and kvm_find_cpuid_entry().
+ */
+#define KVM_CPUID_INDEX_NOT_SIGNIFICANT -1ull
+
+static inline struct kvm_cpuid_entry2 *kvm_find_cpuid_entry_index(struct kvm_vcpu *vcpu,
+								  u32 function, u32 index)
+{
+	return kvm_find_cpuid_entry2(vcpu->arch.cpuid_entries, vcpu->arch.cpuid_nent,
+				     function, index);
+}
+
+static inline struct kvm_cpuid_entry2 *kvm_find_cpuid_entry(struct kvm_vcpu *vcpu,
+							     u32 function)
+{
+	return kvm_find_cpuid_entry2(vcpu->arch.cpuid_entries, vcpu->arch.cpuid_nent,
+				     function, KVM_CPUID_INDEX_NOT_SIGNIFICANT);
+}
+
 int kvm_dev_ioctl_get_cpuid(struct kvm_cpuid2 *cpuid,
			     struct kvm_cpuid_entry2 __user *entries,
			     unsigned int type);
···
 u32 xstate_required_size(u64 xstate_bv, bool compacted);
 
 int cpuid_query_maxphyaddr(struct kvm_vcpu *vcpu);
+int cpuid_query_maxguestphyaddr(struct kvm_vcpu *vcpu);
 u64 kvm_vcpu_reserved_gpa_bits_raw(struct kvm_vcpu *vcpu);
 
 static inline int cpuid_maxphyaddr(struct kvm_vcpu *vcpu)
arch/x86/kvm/irq.c (+3)
···
 	if (kvm_cpu_has_extint(v))
 		return 1;
 
+	if (lapic_in_kernel(v) && v->arch.apic->guest_apic_protected)
+		return kvm_x86_call(protected_apic_has_interrupt)(v);
+
 	return kvm_apic_has_interrupt(v) != -1;	/* LAPIC */
 }
 EXPORT_SYMBOL_GPL(kvm_cpu_has_interrupt);
arch/x86/kvm/lapic.c (+14 -1)
···
 static bool lapic_timer_int_injected(struct kvm_vcpu *vcpu)
 {
 	struct kvm_lapic *apic = vcpu->arch.apic;
-	u32 reg = kvm_lapic_get_reg(apic, APIC_LVTT);
+	u32 reg;
 
+	/*
+	 * Assume a timer IRQ was "injected" if the APIC is protected.  KVM's
+	 * copy of the vIRR is bogus, it's the responsibility of the caller to
+	 * precisely check whether or not a timer IRQ is pending.
+	 */
+	if (apic->guest_apic_protected)
+		return true;
+
+	reg = kvm_lapic_get_reg(apic, APIC_LVTT);
 	if (kvm_apic_hw_enabled(apic)) {
 		int vec = reg & APIC_VECTOR_MASK;
 		void *bitmap = apic->regs + APIC_ISR;
···
 	kvm_recalculate_apic_map(vcpu->kvm);
 	return 0;
 }
+EXPORT_SYMBOL_GPL(kvm_apic_set_base);
 
 void kvm_apic_update_apicv(struct kvm_vcpu *vcpu)
 {
···
 	u32 ppr;
 
 	if (!kvm_apic_present(vcpu))
+		return -1;
+
+	if (apic->guest_apic_protected)
 		return -1;
 
 	__apic_update_ppr(apic, &ppr);
arch/x86/kvm/lapic.h (+2)
···
 	bool sw_enabled;
 	bool irr_pending;
 	bool lvt0_in_nmi_mode;
+	/* Select registers in the vAPIC cannot be read/written. */
+	bool guest_apic_protected;
 	/* Number of bits set in ISR. */
 	s16 isr_count;
 	/* The highest vector set in ISR; if -1 - invalid, must scan ISR. */
arch/x86/kvm/mmu.h (+5 -1)
···
 u8 kvm_mmu_get_max_tdp_level(void);
 
 void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
+void kvm_mmu_set_mmio_spte_value(struct kvm *kvm, u64 mmio_value);
 void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask);
 void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);
···
 	return -(u32)fault & errcode;
 }
 
-bool kvm_mmu_may_ignore_guest_pat(void);
+bool kvm_mmu_may_ignore_guest_pat(struct kvm *kvm);
 
 int kvm_mmu_post_init_vm(struct kvm *kvm);
 void kvm_mmu_pre_destroy_vm(struct kvm *kvm);
···
 #else
 #define tdp_mmu_enabled false
 #endif
+
+bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa);
+int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level);
 
 static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
 {
arch/x86/kvm/mmu/mmu.c (+20 -19)
···
 #ifdef CONFIG_X86_64
 bool __read_mostly tdp_mmu_enabled = true;
 module_param_named(tdp_mmu, tdp_mmu_enabled, bool, 0444);
+EXPORT_SYMBOL_GPL(tdp_mmu_enabled);
 #endif
 
 static int max_huge_page_level __read_mostly;
···
 	 * enabled but it chooses between clearing the Dirty bit and Writeable
 	 * bit based on the context.
 	 */
-	if (kvm_x86_ops.cpu_dirty_log_size)
+	if (kvm->arch.cpu_dirty_log_size)
 		kvm_mmu_clear_dirty_pt_masked(kvm, slot, gfn_offset, mask);
 	else
 		kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
 }
 
-int kvm_cpu_dirty_log_size(void)
+int kvm_cpu_dirty_log_size(struct kvm *kvm)
 {
-	return kvm_x86_ops.cpu_dirty_log_size;
+	return kvm->arch.cpu_dirty_log_size;
 }
 
 bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
···
 }
 #endif
 
-bool kvm_mmu_may_ignore_guest_pat(void)
-{
-	/*
-	 * When EPT is enabled (shadow_memtype_mask is non-zero), and the VM
-	 * has non-coherent DMA (DMA doesn't snoop CPU caches), KVM's ABI is to
-	 * honor the memtype from the guest's PAT so that guest accesses to
-	 * memory that is DMA'd aren't cached against the guest's wishes.  As a
-	 * result, KVM _may_ ignore guest PAT, whereas without non-coherent DMA,
-	 * KVM _always_ ignores guest PAT (when EPT is enabled).
-	 */
-	return shadow_memtype_mask;
-}
-
 int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
 #ifdef CONFIG_X86_64
···
 	return direct_page_fault(vcpu, fault);
 }
 
-static int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
-			    u8 *level)
+int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level)
 {
 	int r;
···
 	do {
 		if (signal_pending(current))
 			return -EINTR;
+
+		if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu))
+			return -EIO;
+
 		cond_resched();
 		r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level);
 	} while (r == RET_PF_RETRY);
···
 		return -EIO;
 	}
 }
+EXPORT_SYMBOL_GPL(kvm_tdp_map_page);
 
 long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
 				    struct kvm_pre_fault_memory *range)
···
 
 static inline int kvm_mmu_get_tdp_level(struct kvm_vcpu *vcpu)
 {
+	int maxpa;
+
+	if (vcpu->kvm->arch.vm_type == KVM_X86_TDX_VM)
+		maxpa = cpuid_query_maxguestphyaddr(vcpu);
+	else
+		maxpa = cpuid_maxphyaddr(vcpu);
+
 	/* tdp_root_level is architecture forced level, use it if nonzero */
 	if (tdp_root_level)
 		return tdp_root_level;
 
 	/* Use 5-level TDP if and only if it's useful/necessary. */
-	if (max_tdp_level == 5 && cpuid_maxphyaddr(vcpu) <= 48)
+	if (max_tdp_level == 5 && maxpa <= 48)
 		return 4;
 
 	return max_tdp_level;
···
 out:
 	return r;
 }
+EXPORT_SYMBOL_GPL(kvm_mmu_load);
 
 void kvm_mmu_unload(struct kvm_vcpu *vcpu)
 {
···
 		.start = slot->base_gfn,
 		.end = slot->base_gfn + slot->npages,
 		.may_block = true,
+		.attr_filter = KVM_FILTER_PRIVATE | KVM_FILTER_SHARED,
 	};
 	bool flush;
arch/x86/kvm/mmu/mmu_internal.h (+3 -2)
···
 	return kvm_gfn_direct_bits(kvm);
 }
 
-static inline bool kvm_mmu_page_ad_need_write_protect(struct kvm_mmu_page *sp)
+static inline bool kvm_mmu_page_ad_need_write_protect(struct kvm *kvm,
+						      struct kvm_mmu_page *sp)
 {
 	/*
 	 * When using the EPT page-modification log, the GPAs in the CPU dirty
···
 	 * being enabled is mandatory as the bits used to denote WP-only SPTEs
 	 * are reserved for PAE paging (32-bit KVM).
 	 */
-	return kvm_x86_ops.cpu_dirty_log_size && sp->role.guest_mode;
+	return kvm->arch.cpu_dirty_log_size && sp->role.guest_mode;
 }
 
 static inline gfn_t gfn_round_for_level(gfn_t gfn, int level)
arch/x86/kvm/mmu/page_track.c (+3)
···
 	struct kvm_memory_slot *slot;
 	int r = 0, i, bkt;
 
+	if (kvm->arch.vm_type == KVM_X86_TDX_VM)
+		return -EOPNOTSUPP;
+
 	mutex_lock(&kvm->slots_arch_lock);
 
 	/*
arch/x86/kvm/mmu/spte.c (+9 -20)
···
 u64 __read_mostly shadow_mmio_mask;
 u64 __read_mostly shadow_mmio_access_mask;
 u64 __read_mostly shadow_present_mask;
-u64 __read_mostly shadow_memtype_mask;
 u64 __read_mostly shadow_me_value;
 u64 __read_mostly shadow_me_mask;
 u64 __read_mostly shadow_acc_track_mask;
···
 	u64 gen = kvm_vcpu_memslots(vcpu)->generation & MMIO_SPTE_GEN_MASK;
 	u64 spte = generation_mmio_spte_mask(gen);
 	u64 gpa = gfn << PAGE_SHIFT;
-
-	WARN_ON_ONCE(!vcpu->kvm->arch.shadow_mmio_value);
 
 	access &= shadow_mmio_access_mask;
 	spte |= vcpu->kvm->arch.shadow_mmio_value | access;
···
 
 	if (sp->role.ad_disabled)
 		spte |= SPTE_TDP_AD_DISABLED;
-	else if (kvm_mmu_page_ad_need_write_protect(sp))
+	else if (kvm_mmu_page_ad_need_write_protect(vcpu->kvm, sp))
 		spte |= SPTE_TDP_AD_WRPROT_ONLY;
 
 	spte |= shadow_present_mask;
···
 	if (level > PG_LEVEL_4K)
 		spte |= PT_PAGE_SIZE_MASK;
 
-	if (shadow_memtype_mask)
-		spte |= kvm_x86_call(get_mt_mask)(vcpu, gfn,
-						  kvm_is_mmio_pfn(pfn));
+	spte |= kvm_x86_call(get_mt_mask)(vcpu, gfn, kvm_is_mmio_pfn(pfn));
 	if (host_writable)
 		spte |= shadow_host_writable_mask;
 	else
···
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask);
 
+void kvm_mmu_set_mmio_spte_value(struct kvm *kvm, u64 mmio_value)
+{
+	kvm->arch.shadow_mmio_value = mmio_value;
+}
+EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_value);
+
 void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask)
 {
 	/* shadow_me_value must be a subset of shadow_me_mask */
···
 	/* VMX_EPT_SUPPRESS_VE_BIT is needed for W or X violation. */
 	shadow_present_mask =
 		(has_exec_only ? 0ull : VMX_EPT_READABLE_MASK) | VMX_EPT_SUPPRESS_VE_BIT;
-	/*
-	 * EPT overrides the host MTRRs, and so KVM must program the desired
-	 * memtype directly into the SPTEs.  Note, this mask is just the mask
-	 * of all bits that factor into the memtype, the actual memtype must be
-	 * dynamically calculated, e.g. to ensure host MMIO is mapped UC.
-	 */
-	shadow_memtype_mask = VMX_EPT_MT_MASK | VMX_EPT_IPAT_BIT;
+
 	shadow_acc_track_mask = VMX_EPT_RWX_MASK;
 	shadow_host_writable_mask = EPT_SPTE_HOST_WRITABLE;
 	shadow_mmu_writable_mask = EPT_SPTE_MMU_WRITABLE;
···
 	shadow_x_mask = 0;
 	shadow_present_mask = PT_PRESENT_MASK;
 
-	/*
-	 * For shadow paging and NPT, KVM uses PAT entry '0' to encode WB
-	 * memtype in the SPTEs, i.e. relies on host MTRRs to provide the
-	 * correct memtype (WB is the "weakest" memtype).
-	 */
-	shadow_memtype_mask = 0;
 	shadow_acc_track_mask = 0;
 	shadow_me_mask = 0;
 	shadow_me_value = 0;
arch/x86/kvm/mmu/spte.h (-1)
···
 extern u64 __read_mostly shadow_mmio_mask;
 extern u64 __read_mostly shadow_mmio_access_mask;
 extern u64 __read_mostly shadow_present_mask;
-extern u64 __read_mostly shadow_memtype_mask;
 extern u64 __read_mostly shadow_me_value;
 extern u64 __read_mostly shadow_me_mask;
arch/x86/kvm/mmu/tdp_mmu.c (+38 -11)
···
 	}
 }
 
-static bool tdp_mmu_need_write_protect(struct kvm_mmu_page *sp)
+static bool tdp_mmu_need_write_protect(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
 	/*
 	 * All TDP MMU shadow pages share the same role as their root, aside
 	 * from level, so it is valid to key off any shadow page to determine if
 	 * write protection is needed for an entire tree.
 	 */
-	return kvm_mmu_page_ad_need_write_protect(sp) || !kvm_ad_enabled;
+	return kvm_mmu_page_ad_need_write_protect(kvm, sp) || !kvm_ad_enabled;
 }
 
 static void clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 				  gfn_t start, gfn_t end)
 {
-	const u64 dbit = tdp_mmu_need_write_protect(root) ? PT_WRITABLE_MASK :
-							    shadow_dirty_mask;
+	const u64 dbit = tdp_mmu_need_write_protect(kvm, root) ?
+			 PT_WRITABLE_MASK : shadow_dirty_mask;
 	struct tdp_iter iter;
 
 	rcu_read_lock();
···
 static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root,
 				  gfn_t gfn, unsigned long mask, bool wrprot)
 {
-	const u64 dbit = (wrprot || tdp_mmu_need_write_protect(root)) ? PT_WRITABLE_MASK :
-									shadow_dirty_mask;
+	const u64 dbit = (wrprot || tdp_mmu_need_write_protect(kvm, root)) ?
+			 PT_WRITABLE_MASK : shadow_dirty_mask;
 	struct tdp_iter iter;
 
 	lockdep_assert_held_write(&kvm->mmu_lock);
···
  *
  * Must be called between kvm_tdp_mmu_walk_lockless_{begin,end}.
  */
-int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
-			 int *root_level)
+static int __kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
+				  struct kvm_mmu_page *root)
 {
-	struct kvm_mmu_page *root = root_to_sp(vcpu->arch.mmu->root.hpa);
 	struct tdp_iter iter;
 	gfn_t gfn = addr >> PAGE_SHIFT;
 	int leaf = -1;
-
-	*root_level = vcpu->arch.mmu->root_role.level;
 
 	for_each_tdp_pte(iter, vcpu->kvm, root, gfn, gfn + 1) {
 		leaf = iter.level;
···
 
 	return leaf;
 }
+
+int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
+			 int *root_level)
+{
+	struct kvm_mmu_page *root = root_to_sp(vcpu->arch.mmu->root.hpa);
+	*root_level = vcpu->arch.mmu->root_role.level;
+
+	return __kvm_tdp_mmu_get_walk(vcpu, addr, sptes, root);
+}
+
+bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa)
+{
+	struct kvm *kvm = vcpu->kvm;
+	bool is_direct = kvm_is_addr_direct(kvm, gpa);
+	hpa_t root = is_direct ? vcpu->arch.mmu->root.hpa :
+				 vcpu->arch.mmu->mirror_root_hpa;
+	u64 sptes[PT64_ROOT_MAX_LEVEL + 1], spte;
+	int leaf;
+
+	lockdep_assert_held(&kvm->mmu_lock);
+	rcu_read_lock();
+	leaf = __kvm_tdp_mmu_get_walk(vcpu, gpa, sptes, root_to_sp(root));
+	rcu_read_unlock();
+	if (leaf < 0)
+		return false;
+
+	spte = sptes[leaf];
+	return is_shadow_present_pte(spte) && is_last_spte(spte, leaf);
+}
+EXPORT_SYMBOL_GPL(kvm_tdp_mmu_gpa_is_mapped);
 
 /*
  * Returns the last level spte pointer of the shadow page walk for the given
arch/x86/kvm/smm.h (+3)
···
 
 static inline int kvm_inject_smi(struct kvm_vcpu *vcpu)
 {
+	if (!kvm_x86_call(has_emulated_msr)(vcpu->kvm, MSR_IA32_SMBASE))
+		return -ENOTTY;
+
 	kvm_make_request(KVM_REQ_SMI, vcpu);
 	return 0;
 }
arch/x86/kvm/svm/svm.c (+1)
···
 	 */
 	allow_smaller_maxphyaddr = !npt_enabled;
 
+	kvm_caps.inapplicable_quirks &= ~KVM_X86_QUIRK_CD_NW_CLEARED;
 	return 0;
 
 err:
arch/x86/kvm/vmx/common.h (+182, new file)
···
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef __KVM_X86_VMX_COMMON_H
+#define __KVM_X86_VMX_COMMON_H
+
+#include <linux/kvm_host.h>
+#include <asm/posted_intr.h>
+
+#include "mmu.h"
+
+union vmx_exit_reason {
+	struct {
+		u32	basic			: 16;
+		u32	reserved16		: 1;
+		u32	reserved17		: 1;
+		u32	reserved18		: 1;
+		u32	reserved19		: 1;
+		u32	reserved20		: 1;
+		u32	reserved21		: 1;
+		u32	reserved22		: 1;
+		u32	reserved23		: 1;
+		u32	reserved24		: 1;
+		u32	reserved25		: 1;
+		u32	bus_lock_detected	: 1;
+		u32	enclave_mode		: 1;
+		u32	smi_pending_mtf		: 1;
+		u32	smi_from_vmx_root	: 1;
+		u32	reserved30		: 1;
+		u32	failed_vmentry		: 1;
+	};
+	u32 full;
+};
+
+struct vcpu_vt {
+	/* Posted interrupt descriptor */
+	struct pi_desc pi_desc;
+
+	/* Used if this vCPU is waiting for PI notification wakeup. */
+	struct list_head pi_wakeup_list;
+
+	union vmx_exit_reason exit_reason;
+
+	unsigned long	exit_qualification;
+	u32		exit_intr_info;
+
+	/*
+	 * If true, guest state has been loaded into hardware, and host state
+	 * saved into vcpu_{vt,vmx,tdx}.  If false, host state is loaded into
+	 * hardware.
+	 */
+	bool		guest_state_loaded;
+	bool		emulation_required;
+
+#ifdef CONFIG_X86_64
+	u64		msr_host_kernel_gs_base;
+#endif
+
+	unsigned long	host_debugctlmsr;
+};
+
+#ifdef CONFIG_KVM_INTEL_TDX
+
+static __always_inline bool is_td(struct kvm *kvm)
+{
+	return kvm->arch.vm_type == KVM_X86_TDX_VM;
+}
+
+static __always_inline bool is_td_vcpu(struct kvm_vcpu *vcpu)
+{
+	return is_td(vcpu->kvm);
+}
+
+#else
+
+static inline bool is_td(struct kvm *kvm) { return false; }
+static inline bool is_td_vcpu(struct kvm_vcpu *vcpu) { return false; }
+
+#endif
+
+static inline bool vt_is_tdx_private_gpa(struct kvm *kvm, gpa_t gpa)
+{
+	/* For TDX the direct mask is the shared mask. */
+	return !kvm_is_addr_direct(kvm, gpa);
+}
+
+static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
+					     unsigned long exit_qualification)
+{
+	u64 error_code;
+
+	/* Is it a read fault? */
+	error_code = (exit_qualification & EPT_VIOLATION_ACC_READ)
+		     ? PFERR_USER_MASK : 0;
+	/* Is it a write fault? */
+	error_code |= (exit_qualification & EPT_VIOLATION_ACC_WRITE)
+		      ? PFERR_WRITE_MASK : 0;
+	/* Is it a fetch fault? */
+	error_code |= (exit_qualification & EPT_VIOLATION_ACC_INSTR)
+		      ? PFERR_FETCH_MASK : 0;
+	/* ept page table entry is present? */
+	error_code |= (exit_qualification & EPT_VIOLATION_PROT_MASK)
+		      ? PFERR_PRESENT_MASK : 0;
+
+	if (error_code & EPT_VIOLATION_GVA_IS_VALID)
+		error_code |= (exit_qualification & EPT_VIOLATION_GVA_TRANSLATED) ?
+			      PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
+
+	if (vt_is_tdx_private_gpa(vcpu->kvm, gpa))
+		error_code |= PFERR_PRIVATE_ACCESS;
+
+	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
+}
+
+static inline void kvm_vcpu_trigger_posted_interrupt(struct kvm_vcpu *vcpu,
+						     int pi_vec)
+{
+#ifdef CONFIG_SMP
+	if (vcpu->mode == IN_GUEST_MODE) {
+		/*
+		 * The vector of the virtual has already been set in the PIR.
+		 * Send a notification event to deliver the virtual interrupt
+		 * unless the vCPU is the currently running vCPU, i.e. the
+		 * event is being sent from a fastpath VM-Exit handler, in
+		 * which case the PIR will be synced to the vIRR before
+		 * re-entering the guest.
+		 *
+		 * When the target is not the running vCPU, the following
+		 * possibilities emerge:
+		 *
+		 * Case 1: vCPU stays in non-root mode. Sending a notification
+		 * event posts the interrupt to the vCPU.
+		 *
+		 * Case 2: vCPU exits to root mode and is still runnable. The
+		 * PIR will be synced to the vIRR before re-entering the guest.
+		 * Sending a notification event is ok as the host IRQ handler
+		 * will ignore the spurious event.
+		 *
+		 * Case 3: vCPU exits to root mode and is blocked. vcpu_block()
+		 * has already synced PIR to vIRR and never blocks the vCPU if
+		 * the vIRR is not empty. Therefore, a blocked vCPU here does
+		 * not wait for any requested interrupts in PIR, and sending a
+		 * notification event also results in a benign, spurious event.
+		 */
+
+		if (vcpu != kvm_get_running_vcpu())
+			__apic_send_IPI_mask(get_cpu_mask(vcpu->cpu), pi_vec);
+		return;
+	}
+#endif
+	/*
+	 * The vCPU isn't in the guest; wake the vCPU in case it is blocking,
+	 * otherwise do nothing as KVM will grab the highest priority pending
+	 * IRQ via ->sync_pir_to_irr() in vcpu_enter_guest().
+	 */
+	kvm_vcpu_wake_up(vcpu);
+}
+
+/*
+ * Post an interrupt to a vCPU's PIR and trigger the vCPU to process the
+ * interrupt if necessary.
+ */
+static inline void __vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu,
+						  struct pi_desc *pi_desc, int vector)
+{
+	if (pi_test_and_set_pir(vector, pi_desc))
+		return;
+
+	/* If a previous notification has sent the IPI, nothing to do. */
+	if (pi_test_and_set_on(pi_desc))
+		return;
+
+	/*
+	 * The implied barrier in pi_test_and_set_on() pairs with the smp_mb_*()
+	 * after setting vcpu->mode in vcpu_enter_guest(), thus the vCPU is
+	 * guaranteed to see PID.ON=1 and sync the PIR to IRR if triggering a
+	 * posted interrupt "fails" because vcpu->mode != IN_GUEST_MODE.
+	 */
+	kvm_vcpu_trigger_posted_interrupt(vcpu, POSTED_INTR_VECTOR);
+}
+
+noinstr void vmx_handle_nmi(struct kvm_vcpu *vcpu);
+
+#endif /* __KVM_X86_VMX_COMMON_H */
arch/x86/kvm/vmx/main.c (+1032 -89)
···
 
 #include "x86_ops.h"
 #include "vmx.h"
+#include "mmu.h"
 #include "nested.h"
 #include "pmu.h"
 #include "posted_intr.h"
+#include "tdx.h"
+#include "tdx_arch.h"
+
+#ifdef CONFIG_KVM_INTEL_TDX
+static_assert(offsetof(struct vcpu_vmx, vt) == offsetof(struct vcpu_tdx, vt));
+#endif
+
+static void vt_disable_virtualization_cpu(void)
+{
+	/* Note, TDX *and* VMX need to be disabled if TDX is enabled. */
+	if (enable_tdx)
+		tdx_disable_virtualization_cpu();
+	vmx_disable_virtualization_cpu();
+}
+
+static __init int vt_hardware_setup(void)
+{
+	int ret;
+
+	ret = vmx_hardware_setup();
+	if (ret)
+		return ret;
+
+	/*
+	 * Update vt_x86_ops::vm_size here so it is ready before
+	 * kvm_ops_update() is called in kvm_x86_vendor_init().
+	 *
+	 * Note, the actual bringing up of TDX must be done after
+	 * kvm_ops_update() because enabling TDX requires enabling
+	 * hardware virtualization first, i.e., all online CPUs must
+	 * be in post-VMXON state.  This means the @vm_size here
+	 * may be updated to TDX's size but TDX may fail to enable
+	 * at later time.
+	 *
+	 * The VMX/VT code could update kvm_x86_ops::vm_size again
+	 * after bringing up TDX, but this would require exporting
+	 * either kvm_x86_ops or kvm_ops_update() from the base KVM
+	 * module, which looks overkill.  Anyway, the worst case here
+	 * is KVM may allocate couple of more bytes than needed for
+	 * each VM.
+	 */
+	if (enable_tdx) {
+		vt_x86_ops.vm_size = max_t(unsigned int, vt_x86_ops.vm_size,
+					   sizeof(struct kvm_tdx));
+		/*
+		 * Note, TDX may fail to initialize in a later time in
+		 * vt_init(), in which case it is not necessary to setup
+		 * those callbacks.  But making them valid here even
+		 * when TDX fails to init later is fine because those
+		 * callbacks won't be called if the VM isn't TDX guest.
+		 */
+		vt_x86_ops.link_external_spt = tdx_sept_link_private_spt;
+		vt_x86_ops.set_external_spte = tdx_sept_set_private_spte;
+		vt_x86_ops.free_external_spt = tdx_sept_free_private_spt;
+		vt_x86_ops.remove_external_spte = tdx_sept_remove_private_spte;
+		vt_x86_ops.protected_apic_has_interrupt = tdx_protected_apic_has_interrupt;
+	}
+
+	return 0;
+}
+
+static int vt_vm_init(struct kvm *kvm)
+{
+	if (is_td(kvm))
+		return tdx_vm_init(kvm);
+
+	return vmx_vm_init(kvm);
+}
+
+static void vt_vm_pre_destroy(struct kvm *kvm)
+{
+	if (is_td(kvm))
+		return tdx_mmu_release_hkid(kvm);
+}
+
+static void vt_vm_destroy(struct kvm *kvm)
+{
+	if (is_td(kvm))
+		return tdx_vm_destroy(kvm);
+
+	vmx_vm_destroy(kvm);
+}
+
+static int vt_vcpu_precreate(struct kvm *kvm)
+{
+	if (is_td(kvm))
+		return 0;
+
+	return vmx_vcpu_precreate(kvm);
+}
+
+static int vt_vcpu_create(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return tdx_vcpu_create(vcpu);
+
+	return vmx_vcpu_create(vcpu);
+}
+
+static void vt_vcpu_free(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu)) {
+		tdx_vcpu_free(vcpu);
+		return;
+	}
+
+	vmx_vcpu_free(vcpu);
+}
+
+static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
+{
+	if (is_td_vcpu(vcpu)) {
+		tdx_vcpu_reset(vcpu, init_event);
+		return;
+	}
+
+	vmx_vcpu_reset(vcpu, init_event);
+}
+
+static void vt_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
+{
+	if (is_td_vcpu(vcpu)) {
+		tdx_vcpu_load(vcpu, cpu);
+		return;
+	}
+
+	vmx_vcpu_load(vcpu, cpu);
+}
+
+static void vt_update_cpu_dirty_logging(struct kvm_vcpu *vcpu)
+{
+	/*
+	 * Basic TDX does not support feature PML.  KVM does not enable PML in
+	 * TD's VMCS, nor does it allocate or flush PML buffer for TDX.
+	 */
+	if (WARN_ON_ONCE(is_td_vcpu(vcpu)))
+		return;
+
+	vmx_update_cpu_dirty_logging(vcpu);
+}
+
+static void vt_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu)) {
+		tdx_prepare_switch_to_guest(vcpu);
+		return;
+	}
+
+	vmx_prepare_switch_to_guest(vcpu);
+}
+
+static void vt_vcpu_put(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu)) {
+		tdx_vcpu_put(vcpu);
+		return;
+	}
+
+	vmx_vcpu_put(vcpu);
+}
+
+static int vt_vcpu_pre_run(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return tdx_vcpu_pre_run(vcpu);
+
+	return vmx_vcpu_pre_run(vcpu);
+}
+
+static fastpath_t vt_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
+{
+	if (is_td_vcpu(vcpu))
+		return tdx_vcpu_run(vcpu, force_immediate_exit);
+
+	return vmx_vcpu_run(vcpu, force_immediate_exit);
+}
+
+static int vt_handle_exit(struct kvm_vcpu *vcpu,
+			  enum exit_fastpath_completion fastpath)
+{
+	if (is_td_vcpu(vcpu))
+		return tdx_handle_exit(vcpu, fastpath);
+
+	return vmx_handle_exit(vcpu, fastpath);
+}
+
+static int vt_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
+{
+	if (unlikely(is_td_vcpu(vcpu)))
+		return tdx_set_msr(vcpu, msr_info);
+
+	return vmx_set_msr(vcpu, msr_info);
+}
+
+/*
+ * The kvm parameter can be NULL (module initialization, or invocation before
+ * VM creation).  Be sure to check the kvm parameter before using it.
+ */
+static bool vt_has_emulated_msr(struct kvm *kvm, u32 index)
+{
+	if (kvm && is_td(kvm))
+		return tdx_has_emulated_msr(index);
+
+	return vmx_has_emulated_msr(kvm, index);
+}
+
+static int vt_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
+{
+	if (unlikely(is_td_vcpu(vcpu)))
+		return tdx_get_msr(vcpu, msr_info);
+
+	return vmx_get_msr(vcpu, msr_info);
+}
+
+static void vt_msr_filter_changed(struct kvm_vcpu *vcpu)
+{
+	/*
+	 * TDX doesn't allow VMM to configure interception of MSR accesses.
+	 * TDX guest requests MSR accesses by calling TDVMCALL.  The MSR
+	 * filters will be applied when handling the TDVMCALL for RDMSR/WRMSR
+	 * if the userspace has set any.
+	 */
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_msr_filter_changed(vcpu);
+}
+
+static int vt_complete_emulated_msr(struct kvm_vcpu *vcpu, int err)
+{
+	if (is_td_vcpu(vcpu))
+		return tdx_complete_emulated_msr(vcpu, err);
+
+	return kvm_complete_insn_gp(vcpu, err);
+}
+
+#ifdef CONFIG_KVM_SMM
+static int vt_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+{
+	if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+		return 0;
+
+	return vmx_smi_allowed(vcpu, for_injection);
+}
+
+static int vt_enter_smm(struct kvm_vcpu *vcpu, union kvm_smram *smram)
+{
+	if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+		return 0;
+
+	return vmx_enter_smm(vcpu, smram);
+}
+
+static int vt_leave_smm(struct kvm_vcpu *vcpu, const union kvm_smram *smram)
+{
+	if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+		return 0;
+
+	return vmx_leave_smm(vcpu, smram);
+}
+
+static void vt_enable_smi_window(struct kvm_vcpu *vcpu)
+{
+	if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+		return;
+
+	/* RSM will cause a vmexit anyway. */
+	vmx_enable_smi_window(vcpu);
+}
+#endif
+
+static int vt_check_emulate_instruction(struct kvm_vcpu *vcpu, int emul_type,
+					void *insn, int insn_len)
+{
+	/*
+	 * For TDX, this can only be triggered for MMIO emulation.  Let the
+	 * guest retry after installing the SPTE with suppress #VE bit cleared,
+	 * so that the guest will receive #VE when retry.  The guest is expected
+	 * to call TDG.VP.VMCALL<MMIO> to request VMM to do MMIO emulation on
+	 * #VE.
+	 */
+	if (is_td_vcpu(vcpu))
+		return X86EMUL_RETRY_INSTR;
+
+	return vmx_check_emulate_instruction(vcpu, emul_type, insn, insn_len);
+}
+
+static bool vt_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
+{
+	/*
+	 * INIT and SIPI are always blocked for TDX, i.e., INIT handling and
+	 * the OP vcpu_deliver_sipi_vector() won't be called.
+	 */
+	if (is_td_vcpu(vcpu))
+		return true;
+
+	return vmx_apic_init_signal_blocked(vcpu);
+}
+
+static void vt_set_virtual_apic_mode(struct kvm_vcpu *vcpu)
+{
+	/* Only x2APIC mode is supported for TD. */
+	if (is_td_vcpu(vcpu))
+		return;
+
+	return vmx_set_virtual_apic_mode(vcpu);
+}
+
+static void vt_apicv_pre_state_restore(struct kvm_vcpu *vcpu)
+{
+	struct pi_desc *pi = vcpu_to_pi_desc(vcpu);
+
+	pi_clear_on(pi);
+	memset(pi->pir, 0, sizeof(pi->pir));
+}
+
+static void vt_hwapic_isr_update(struct kvm_vcpu *vcpu, int max_isr)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	return vmx_hwapic_isr_update(vcpu, max_isr);
+}
+
+static int vt_sync_pir_to_irr(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return -1;
+
+	return vmx_sync_pir_to_irr(vcpu);
+}
+
+static void vt_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
+				 int trig_mode, int vector)
+{
+	if (is_td_vcpu(apic->vcpu)) {
+		tdx_deliver_interrupt(apic, delivery_mode, trig_mode,
+				      vector);
+		return;
+	}
+
+	vmx_deliver_interrupt(apic, delivery_mode, trig_mode, vector);
+}
+
+static void vt_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_vcpu_after_set_cpuid(vcpu);
+}
+
+static void vt_update_exception_bitmap(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_update_exception_bitmap(vcpu);
+}
+
+static u64 vt_get_segment_base(struct kvm_vcpu *vcpu, int seg)
+{
+	if (is_td_vcpu(vcpu))
+		return 0;
+
+	return vmx_get_segment_base(vcpu, seg);
+}
+
+static void vt_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var,
+			   int seg)
+{
+	if (is_td_vcpu(vcpu)) {
+		memset(var, 0, sizeof(*var));
+		return;
+	}
+
+	vmx_get_segment(vcpu, var, seg);
+}
+
+static void vt_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var,
+			   int seg)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_set_segment(vcpu, var, seg);
+}
+
+static int vt_get_cpl(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return 0;
+
+	return vmx_get_cpl(vcpu);
+}
+
+static int vt_get_cpl_no_cache(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return 0;
+
+	return vmx_get_cpl_no_cache(vcpu);
+}
+
+static void vt_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
+{
+	if (is_td_vcpu(vcpu)) {
+		*db = 0;
+		*l = 0;
+		return;
+	}
+
+	vmx_get_cs_db_l_bits(vcpu, db, l);
+}
+
+static bool vt_is_valid_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
+{
+	if (is_td_vcpu(vcpu))
+		return true;
+
+	return vmx_is_valid_cr0(vcpu, cr0);
+}
+
+static void vt_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_set_cr0(vcpu, cr0);
+}
+
+static bool vt_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
+{
+	if (is_td_vcpu(vcpu))
+		return true;
+
+	return vmx_is_valid_cr4(vcpu, cr4);
+}
+
+static void vt_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_set_cr4(vcpu, cr4);
+}
+
+static int vt_set_efer(struct kvm_vcpu *vcpu, u64 efer)
+{
+	if (is_td_vcpu(vcpu))
+		return 0;
+
+	return vmx_set_efer(vcpu, efer);
+}
+
+static void vt_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+{
+	if (is_td_vcpu(vcpu)) {
+		memset(dt, 0, sizeof(*dt));
+		return;
+	}
+
+	vmx_get_idt(vcpu, dt);
+}
+
+static void vt_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_set_idt(vcpu, dt);
+}
+
+static void vt_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+{
+	if (is_td_vcpu(vcpu)) {
+		memset(dt, 0, sizeof(*dt));
+		return;
+	}
+
+	vmx_get_gdt(vcpu, dt);
+}
+
+static void vt_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_set_gdt(vcpu, dt);
+}
+
+static void vt_set_dr6(struct kvm_vcpu *vcpu, unsigned long val)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_set_dr6(vcpu, val);
+}
+
+static void vt_set_dr7(struct kvm_vcpu *vcpu, unsigned long val)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_set_dr7(vcpu, val);
+}
+
+static void vt_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
+{
+	/*
+	 * MOV-DR exiting is always cleared for TD guest, even in debug mode.
+	 * Thus KVM_DEBUGREG_WONT_EXIT can never be set and it should never
+	 * reach here for TD vcpu.
+	 */
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_sync_dirty_debug_regs(vcpu);
+}
+
+static void vt_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
+{
+	if (WARN_ON_ONCE(is_td_vcpu(vcpu)))
+		return;
+
+	vmx_cache_reg(vcpu, reg);
+}
+
+static unsigned long vt_get_rflags(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return 0;
+
+	return vmx_get_rflags(vcpu);
+}
+
+static void vt_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_set_rflags(vcpu, rflags);
+}
+
+static bool vt_get_if_flag(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return false;
+
+	return vmx_get_if_flag(vcpu);
+}
+
+static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu)) {
+		tdx_flush_tlb_all(vcpu);
+		return;
+	}
+
+	vmx_flush_tlb_all(vcpu);
+}
+
+static void vt_flush_tlb_current(struct kvm_vcpu *vcpu)
+{
+
(is_td_vcpu(vcpu)) { 486 + memset(dt, 0, sizeof(*dt)); 487 + return; 488 + } 489 + 490 + vmx_get_gdt(vcpu, dt); 491 + } 492 + 493 + static void vt_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt) 494 + { 495 + if (is_td_vcpu(vcpu)) 496 + return; 497 + 498 + vmx_set_gdt(vcpu, dt); 499 + } 500 + 501 + static void vt_set_dr6(struct kvm_vcpu *vcpu, unsigned long val) 502 + { 503 + if (is_td_vcpu(vcpu)) 504 + return; 505 + 506 + vmx_set_dr6(vcpu, val); 507 + } 508 + 509 + static void vt_set_dr7(struct kvm_vcpu *vcpu, unsigned long val) 510 + { 511 + if (is_td_vcpu(vcpu)) 512 + return; 513 + 514 + vmx_set_dr7(vcpu, val); 515 + } 516 + 517 + static void vt_sync_dirty_debug_regs(struct kvm_vcpu *vcpu) 518 + { 519 + /* 520 + * MOV-DR exiting is always cleared for TD guest, even in debug mode. 521 + * Thus KVM_DEBUGREG_WONT_EXIT can never be set and it should never 522 + * reach here for TD vcpu. 523 + */ 524 + if (is_td_vcpu(vcpu)) 525 + return; 526 + 527 + vmx_sync_dirty_debug_regs(vcpu); 528 + } 529 + 530 + static void vt_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg) 531 + { 532 + if (WARN_ON_ONCE(is_td_vcpu(vcpu))) 533 + return; 534 + 535 + vmx_cache_reg(vcpu, reg); 536 + } 537 + 538 + static unsigned long vt_get_rflags(struct kvm_vcpu *vcpu) 539 + { 540 + if (is_td_vcpu(vcpu)) 541 + return 0; 542 + 543 + return vmx_get_rflags(vcpu); 544 + } 545 + 546 + static void vt_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags) 547 + { 548 + if (is_td_vcpu(vcpu)) 549 + return; 550 + 551 + vmx_set_rflags(vcpu, rflags); 552 + } 553 + 554 + static bool vt_get_if_flag(struct kvm_vcpu *vcpu) 555 + { 556 + if (is_td_vcpu(vcpu)) 557 + return false; 558 + 559 + return vmx_get_if_flag(vcpu); 560 + } 561 + 562 + static void vt_flush_tlb_all(struct kvm_vcpu *vcpu) 563 + { 564 + if (is_td_vcpu(vcpu)) { 565 + tdx_flush_tlb_all(vcpu); 566 + return; 567 + } 568 + 569 + vmx_flush_tlb_all(vcpu); 570 + } 571 + 572 + static void vt_flush_tlb_current(struct kvm_vcpu *vcpu) 573 + { 574 + 
if (is_td_vcpu(vcpu)) { 575 + tdx_flush_tlb_current(vcpu); 576 + return; 577 + } 578 + 579 + vmx_flush_tlb_current(vcpu); 580 + } 581 + 582 + static void vt_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr) 583 + { 584 + if (is_td_vcpu(vcpu)) 585 + return; 586 + 587 + vmx_flush_tlb_gva(vcpu, addr); 588 + } 589 + 590 + static void vt_flush_tlb_guest(struct kvm_vcpu *vcpu) 591 + { 592 + if (is_td_vcpu(vcpu)) 593 + return; 594 + 595 + vmx_flush_tlb_guest(vcpu); 596 + } 597 + 598 + static void vt_inject_nmi(struct kvm_vcpu *vcpu) 599 + { 600 + if (is_td_vcpu(vcpu)) { 601 + tdx_inject_nmi(vcpu); 602 + return; 603 + } 604 + 605 + vmx_inject_nmi(vcpu); 606 + } 607 + 608 + static int vt_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection) 609 + { 610 + /* 611 + * The TDX module manages NMI windows and NMI reinjection, and hides NMI 612 + * blocking, all KVM can do is throw an NMI over the wall. 613 + */ 614 + if (is_td_vcpu(vcpu)) 615 + return true; 616 + 617 + return vmx_nmi_allowed(vcpu, for_injection); 618 + } 619 + 620 + static bool vt_get_nmi_mask(struct kvm_vcpu *vcpu) 621 + { 622 + /* 623 + * KVM can't get NMI blocking status for TDX guest, assume NMIs are 624 + * always unmasked. 625 + */ 626 + if (is_td_vcpu(vcpu)) 627 + return false; 628 + 629 + return vmx_get_nmi_mask(vcpu); 630 + } 631 + 632 + static void vt_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked) 633 + { 634 + if (is_td_vcpu(vcpu)) 635 + return; 636 + 637 + vmx_set_nmi_mask(vcpu, masked); 638 + } 639 + 640 + static void vt_enable_nmi_window(struct kvm_vcpu *vcpu) 641 + { 642 + /* Refer to the comments in tdx_inject_nmi(). 
*/ 643 + if (is_td_vcpu(vcpu)) 644 + return; 645 + 646 + vmx_enable_nmi_window(vcpu); 647 + } 648 + 649 + static void vt_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, 650 + int pgd_level) 651 + { 652 + if (is_td_vcpu(vcpu)) { 653 + tdx_load_mmu_pgd(vcpu, root_hpa, pgd_level); 654 + return; 655 + } 656 + 657 + vmx_load_mmu_pgd(vcpu, root_hpa, pgd_level); 658 + } 659 + 660 + static void vt_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask) 661 + { 662 + if (is_td_vcpu(vcpu)) 663 + return; 664 + 665 + vmx_set_interrupt_shadow(vcpu, mask); 666 + } 667 + 668 + static u32 vt_get_interrupt_shadow(struct kvm_vcpu *vcpu) 669 + { 670 + if (is_td_vcpu(vcpu)) 671 + return 0; 672 + 673 + return vmx_get_interrupt_shadow(vcpu); 674 + } 675 + 676 + static void vt_patch_hypercall(struct kvm_vcpu *vcpu, 677 + unsigned char *hypercall) 678 + { 679 + /* 680 + * Because guest memory is protected, guest can't be patched. TD kernel 681 + * is modified to use TDG.VP.VMCALL for hypercall. 682 + */ 683 + if (is_td_vcpu(vcpu)) 684 + return; 685 + 686 + vmx_patch_hypercall(vcpu, hypercall); 687 + } 688 + 689 + static void vt_inject_irq(struct kvm_vcpu *vcpu, bool reinjected) 690 + { 691 + if (is_td_vcpu(vcpu)) 692 + return; 693 + 694 + vmx_inject_irq(vcpu, reinjected); 695 + } 696 + 697 + static void vt_inject_exception(struct kvm_vcpu *vcpu) 698 + { 699 + if (is_td_vcpu(vcpu)) 700 + return; 701 + 702 + vmx_inject_exception(vcpu); 703 + } 704 + 705 + static void vt_cancel_injection(struct kvm_vcpu *vcpu) 706 + { 707 + if (is_td_vcpu(vcpu)) 708 + return; 709 + 710 + vmx_cancel_injection(vcpu); 711 + } 712 + 713 + static int vt_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection) 714 + { 715 + if (is_td_vcpu(vcpu)) 716 + return tdx_interrupt_allowed(vcpu); 717 + 718 + return vmx_interrupt_allowed(vcpu, for_injection); 719 + } 720 + 721 + static void vt_enable_irq_window(struct kvm_vcpu *vcpu) 722 + { 723 + if (is_td_vcpu(vcpu)) 724 + return; 725 + 726 + 
vmx_enable_irq_window(vcpu); 727 + } 728 + 729 + static void vt_get_entry_info(struct kvm_vcpu *vcpu, u32 *intr_info, u32 *error_code) 730 + { 731 + *intr_info = 0; 732 + *error_code = 0; 733 + 734 + if (is_td_vcpu(vcpu)) 735 + return; 736 + 737 + vmx_get_entry_info(vcpu, intr_info, error_code); 738 + } 739 + 740 + static void vt_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason, 741 + u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code) 742 + { 743 + if (is_td_vcpu(vcpu)) { 744 + tdx_get_exit_info(vcpu, reason, info1, info2, intr_info, 745 + error_code); 746 + return; 747 + } 748 + 749 + vmx_get_exit_info(vcpu, reason, info1, info2, intr_info, error_code); 750 + } 751 + 752 + static void vt_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr) 753 + { 754 + if (is_td_vcpu(vcpu)) 755 + return; 756 + 757 + vmx_update_cr8_intercept(vcpu, tpr, irr); 758 + } 759 + 760 + static void vt_set_apic_access_page_addr(struct kvm_vcpu *vcpu) 761 + { 762 + if (is_td_vcpu(vcpu)) 763 + return; 764 + 765 + vmx_set_apic_access_page_addr(vcpu); 766 + } 767 + 768 + static void vt_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu) 769 + { 770 + if (is_td_vcpu(vcpu)) { 771 + KVM_BUG_ON(!kvm_vcpu_apicv_active(vcpu), vcpu->kvm); 772 + return; 773 + } 774 + 775 + vmx_refresh_apicv_exec_ctrl(vcpu); 776 + } 777 + 778 + static void vt_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap) 779 + { 780 + if (is_td_vcpu(vcpu)) 781 + return; 782 + 783 + vmx_load_eoi_exitmap(vcpu, eoi_exit_bitmap); 784 + } 785 + 786 + static int vt_set_tss_addr(struct kvm *kvm, unsigned int addr) 787 + { 788 + if (is_td(kvm)) 789 + return 0; 790 + 791 + return vmx_set_tss_addr(kvm, addr); 792 + } 793 + 794 + static int vt_set_identity_map_addr(struct kvm *kvm, u64 ident_addr) 795 + { 796 + if (is_td(kvm)) 797 + return 0; 798 + 799 + return vmx_set_identity_map_addr(kvm, ident_addr); 800 + } 801 + 802 + static u64 vt_get_l2_tsc_offset(struct kvm_vcpu *vcpu) 803 + { 804 + /* TDX doesn't support L2 
guest at the moment. */ 805 + if (is_td_vcpu(vcpu)) 806 + return 0; 807 + 808 + return vmx_get_l2_tsc_offset(vcpu); 809 + } 810 + 811 + static u64 vt_get_l2_tsc_multiplier(struct kvm_vcpu *vcpu) 812 + { 813 + /* TDX doesn't support L2 guest at the moment. */ 814 + if (is_td_vcpu(vcpu)) 815 + return 0; 816 + 817 + return vmx_get_l2_tsc_multiplier(vcpu); 818 + } 819 + 820 + static void vt_write_tsc_offset(struct kvm_vcpu *vcpu) 821 + { 822 + /* In TDX, tsc offset can't be changed. */ 823 + if (is_td_vcpu(vcpu)) 824 + return; 825 + 826 + vmx_write_tsc_offset(vcpu); 827 + } 828 + 829 + static void vt_write_tsc_multiplier(struct kvm_vcpu *vcpu) 830 + { 831 + /* In TDX, tsc multiplier can't be changed. */ 832 + if (is_td_vcpu(vcpu)) 833 + return; 834 + 835 + vmx_write_tsc_multiplier(vcpu); 836 + } 837 + 838 + #ifdef CONFIG_X86_64 839 + static int vt_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc, 840 + bool *expired) 841 + { 842 + /* VMX-preemption timer isn't available for TDX. */ 843 + if (is_td_vcpu(vcpu)) 844 + return -EINVAL; 845 + 846 + return vmx_set_hv_timer(vcpu, guest_deadline_tsc, expired); 847 + } 848 + 849 + static void vt_cancel_hv_timer(struct kvm_vcpu *vcpu) 850 + { 851 + /* VMX-preemption timer can't be set. See vt_set_hv_timer(). 
*/ 852 + if (is_td_vcpu(vcpu)) 853 + return; 854 + 855 + vmx_cancel_hv_timer(vcpu); 856 + } 857 + #endif 858 + 859 + static void vt_setup_mce(struct kvm_vcpu *vcpu) 860 + { 861 + if (is_td_vcpu(vcpu)) 862 + return; 863 + 864 + vmx_setup_mce(vcpu); 865 + } 866 + 867 + static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp) 868 + { 869 + if (!is_td(kvm)) 870 + return -ENOTTY; 871 + 872 + return tdx_vm_ioctl(kvm, argp); 873 + } 874 + 875 + static int vt_vcpu_mem_enc_ioctl(struct kvm_vcpu *vcpu, void __user *argp) 876 + { 877 + if (!is_td_vcpu(vcpu)) 878 + return -EINVAL; 879 + 880 + return tdx_vcpu_ioctl(vcpu, argp); 881 + } 882 + 883 + static int vt_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn) 884 + { 885 + if (is_td(kvm)) 886 + return tdx_gmem_private_max_mapping_level(kvm, pfn); 887 + 888 + return 0; 889 + } 9 890 10 891 #define VMX_REQUIRED_APICV_INHIBITS \ 11 892 (BIT(APICV_INHIBIT_REASON_DISABLED) | \ ··· 905 24 .hardware_unsetup = vmx_hardware_unsetup, 906 25 907 26 .enable_virtualization_cpu = vmx_enable_virtualization_cpu, 908 - .disable_virtualization_cpu = vmx_disable_virtualization_cpu, 27 + .disable_virtualization_cpu = vt_disable_virtualization_cpu, 909 28 .emergency_disable_virtualization_cpu = vmx_emergency_disable_virtualization_cpu, 910 29 911 - .has_emulated_msr = vmx_has_emulated_msr, 30 + .has_emulated_msr = vt_has_emulated_msr, 912 31 913 32 .vm_size = sizeof(struct kvm_vmx), 914 - .vm_init = vmx_vm_init, 915 - .vm_destroy = vmx_vm_destroy, 916 33 917 - .vcpu_precreate = vmx_vcpu_precreate, 918 - .vcpu_create = vmx_vcpu_create, 919 - .vcpu_free = vmx_vcpu_free, 920 - .vcpu_reset = vmx_vcpu_reset, 34 + .vm_init = vt_vm_init, 35 + .vm_pre_destroy = vt_vm_pre_destroy, 36 + .vm_destroy = vt_vm_destroy, 921 37 922 - .prepare_switch_to_guest = vmx_prepare_switch_to_guest, 923 - .vcpu_load = vmx_vcpu_load, 924 - .vcpu_put = vmx_vcpu_put, 38 + .vcpu_precreate = vt_vcpu_precreate, 39 + .vcpu_create = vt_vcpu_create, 40 + 
.vcpu_free = vt_vcpu_free, 41 + .vcpu_reset = vt_vcpu_reset, 925 42 926 - .update_exception_bitmap = vmx_update_exception_bitmap, 43 + .prepare_switch_to_guest = vt_prepare_switch_to_guest, 44 + .vcpu_load = vt_vcpu_load, 45 + .vcpu_put = vt_vcpu_put, 46 + 47 + .update_exception_bitmap = vt_update_exception_bitmap, 927 48 .get_feature_msr = vmx_get_feature_msr, 928 - .get_msr = vmx_get_msr, 929 - .set_msr = vmx_set_msr, 930 - .get_segment_base = vmx_get_segment_base, 931 - .get_segment = vmx_get_segment, 932 - .set_segment = vmx_set_segment, 933 - .get_cpl = vmx_get_cpl, 934 - .get_cpl_no_cache = vmx_get_cpl_no_cache, 935 - .get_cs_db_l_bits = vmx_get_cs_db_l_bits, 936 - .is_valid_cr0 = vmx_is_valid_cr0, 937 - .set_cr0 = vmx_set_cr0, 938 - .is_valid_cr4 = vmx_is_valid_cr4, 939 - .set_cr4 = vmx_set_cr4, 940 - .set_efer = vmx_set_efer, 941 - .get_idt = vmx_get_idt, 942 - .set_idt = vmx_set_idt, 943 - .get_gdt = vmx_get_gdt, 944 - .set_gdt = vmx_set_gdt, 945 - .set_dr6 = vmx_set_dr6, 946 - .set_dr7 = vmx_set_dr7, 947 - .sync_dirty_debug_regs = vmx_sync_dirty_debug_regs, 948 - .cache_reg = vmx_cache_reg, 949 - .get_rflags = vmx_get_rflags, 950 - .set_rflags = vmx_set_rflags, 951 - .get_if_flag = vmx_get_if_flag, 49 + .get_msr = vt_get_msr, 50 + .set_msr = vt_set_msr, 952 51 953 - .flush_tlb_all = vmx_flush_tlb_all, 954 - .flush_tlb_current = vmx_flush_tlb_current, 955 - .flush_tlb_gva = vmx_flush_tlb_gva, 956 - .flush_tlb_guest = vmx_flush_tlb_guest, 52 + .get_segment_base = vt_get_segment_base, 53 + .get_segment = vt_get_segment, 54 + .set_segment = vt_set_segment, 55 + .get_cpl = vt_get_cpl, 56 + .get_cpl_no_cache = vt_get_cpl_no_cache, 57 + .get_cs_db_l_bits = vt_get_cs_db_l_bits, 58 + .is_valid_cr0 = vt_is_valid_cr0, 59 + .set_cr0 = vt_set_cr0, 60 + .is_valid_cr4 = vt_is_valid_cr4, 61 + .set_cr4 = vt_set_cr4, 62 + .set_efer = vt_set_efer, 63 + .get_idt = vt_get_idt, 64 + .set_idt = vt_set_idt, 65 + .get_gdt = vt_get_gdt, 66 + .set_gdt = vt_set_gdt, 67 + .set_dr6 = 
vt_set_dr6, 68 + .set_dr7 = vt_set_dr7, 69 + .sync_dirty_debug_regs = vt_sync_dirty_debug_regs, 70 + .cache_reg = vt_cache_reg, 71 + .get_rflags = vt_get_rflags, 72 + .set_rflags = vt_set_rflags, 73 + .get_if_flag = vt_get_if_flag, 957 74 958 - .vcpu_pre_run = vmx_vcpu_pre_run, 959 - .vcpu_run = vmx_vcpu_run, 960 - .handle_exit = vmx_handle_exit, 75 + .flush_tlb_all = vt_flush_tlb_all, 76 + .flush_tlb_current = vt_flush_tlb_current, 77 + .flush_tlb_gva = vt_flush_tlb_gva, 78 + .flush_tlb_guest = vt_flush_tlb_guest, 79 + 80 + .vcpu_pre_run = vt_vcpu_pre_run, 81 + .vcpu_run = vt_vcpu_run, 82 + .handle_exit = vt_handle_exit, 961 83 .skip_emulated_instruction = vmx_skip_emulated_instruction, 962 84 .update_emulated_instruction = vmx_update_emulated_instruction, 963 - .set_interrupt_shadow = vmx_set_interrupt_shadow, 964 - .get_interrupt_shadow = vmx_get_interrupt_shadow, 965 - .patch_hypercall = vmx_patch_hypercall, 966 - .inject_irq = vmx_inject_irq, 967 - .inject_nmi = vmx_inject_nmi, 968 - .inject_exception = vmx_inject_exception, 969 - .cancel_injection = vmx_cancel_injection, 970 - .interrupt_allowed = vmx_interrupt_allowed, 971 - .nmi_allowed = vmx_nmi_allowed, 972 - .get_nmi_mask = vmx_get_nmi_mask, 973 - .set_nmi_mask = vmx_set_nmi_mask, 974 - .enable_nmi_window = vmx_enable_nmi_window, 975 - .enable_irq_window = vmx_enable_irq_window, 976 - .update_cr8_intercept = vmx_update_cr8_intercept, 85 + .set_interrupt_shadow = vt_set_interrupt_shadow, 86 + .get_interrupt_shadow = vt_get_interrupt_shadow, 87 + .patch_hypercall = vt_patch_hypercall, 88 + .inject_irq = vt_inject_irq, 89 + .inject_nmi = vt_inject_nmi, 90 + .inject_exception = vt_inject_exception, 91 + .cancel_injection = vt_cancel_injection, 92 + .interrupt_allowed = vt_interrupt_allowed, 93 + .nmi_allowed = vt_nmi_allowed, 94 + .get_nmi_mask = vt_get_nmi_mask, 95 + .set_nmi_mask = vt_set_nmi_mask, 96 + .enable_nmi_window = vt_enable_nmi_window, 97 + .enable_irq_window = vt_enable_irq_window, 98 + 
.update_cr8_intercept = vt_update_cr8_intercept, 977 99 978 100 .x2apic_icr_is_split = false, 979 - .set_virtual_apic_mode = vmx_set_virtual_apic_mode, 980 - .set_apic_access_page_addr = vmx_set_apic_access_page_addr, 981 - .refresh_apicv_exec_ctrl = vmx_refresh_apicv_exec_ctrl, 982 - .load_eoi_exitmap = vmx_load_eoi_exitmap, 983 - .apicv_pre_state_restore = vmx_apicv_pre_state_restore, 101 + .set_virtual_apic_mode = vt_set_virtual_apic_mode, 102 + .set_apic_access_page_addr = vt_set_apic_access_page_addr, 103 + .refresh_apicv_exec_ctrl = vt_refresh_apicv_exec_ctrl, 104 + .load_eoi_exitmap = vt_load_eoi_exitmap, 105 + .apicv_pre_state_restore = vt_apicv_pre_state_restore, 984 106 .required_apicv_inhibits = VMX_REQUIRED_APICV_INHIBITS, 985 - .hwapic_isr_update = vmx_hwapic_isr_update, 986 - .sync_pir_to_irr = vmx_sync_pir_to_irr, 987 - .deliver_interrupt = vmx_deliver_interrupt, 107 + .hwapic_isr_update = vt_hwapic_isr_update, 108 + .sync_pir_to_irr = vt_sync_pir_to_irr, 109 + .deliver_interrupt = vt_deliver_interrupt, 988 110 .dy_apicv_has_pending_interrupt = pi_has_pending_interrupt, 989 111 990 - .set_tss_addr = vmx_set_tss_addr, 991 - .set_identity_map_addr = vmx_set_identity_map_addr, 112 + .set_tss_addr = vt_set_tss_addr, 113 + .set_identity_map_addr = vt_set_identity_map_addr, 992 114 .get_mt_mask = vmx_get_mt_mask, 993 115 994 - .get_exit_info = vmx_get_exit_info, 995 - .get_entry_info = vmx_get_entry_info, 116 + .get_exit_info = vt_get_exit_info, 117 + .get_entry_info = vt_get_entry_info, 996 118 997 - .vcpu_after_set_cpuid = vmx_vcpu_after_set_cpuid, 119 + .vcpu_after_set_cpuid = vt_vcpu_after_set_cpuid, 998 120 999 121 .has_wbinvd_exit = cpu_has_vmx_wbinvd_exit, 1000 122 1001 - .get_l2_tsc_offset = vmx_get_l2_tsc_offset, 1002 - .get_l2_tsc_multiplier = vmx_get_l2_tsc_multiplier, 1003 - .write_tsc_offset = vmx_write_tsc_offset, 1004 - .write_tsc_multiplier = vmx_write_tsc_multiplier, 123 + .get_l2_tsc_offset = vt_get_l2_tsc_offset, 124 + 
.get_l2_tsc_multiplier = vt_get_l2_tsc_multiplier, 125 + .write_tsc_offset = vt_write_tsc_offset, 126 + .write_tsc_multiplier = vt_write_tsc_multiplier, 1005 127 1006 - .load_mmu_pgd = vmx_load_mmu_pgd, 128 + .load_mmu_pgd = vt_load_mmu_pgd, 1007 129 1008 130 .check_intercept = vmx_check_intercept, 1009 131 .handle_exit_irqoff = vmx_handle_exit_irqoff, 1010 132 1011 - .cpu_dirty_log_size = PML_LOG_NR_ENTRIES, 1012 - .update_cpu_dirty_logging = vmx_update_cpu_dirty_logging, 133 + .update_cpu_dirty_logging = vt_update_cpu_dirty_logging, 1013 134 1014 135 .nested_ops = &vmx_nested_ops, 1015 136 ··· 1019 136 .pi_start_assignment = vmx_pi_start_assignment, 1020 137 1021 138 #ifdef CONFIG_X86_64 1022 - .set_hv_timer = vmx_set_hv_timer, 1023 - .cancel_hv_timer = vmx_cancel_hv_timer, 139 + .set_hv_timer = vt_set_hv_timer, 140 + .cancel_hv_timer = vt_cancel_hv_timer, 1024 141 #endif 1025 142 1026 - .setup_mce = vmx_setup_mce, 143 + .setup_mce = vt_setup_mce, 1027 144 1028 145 #ifdef CONFIG_KVM_SMM 1029 - .smi_allowed = vmx_smi_allowed, 1030 - .enter_smm = vmx_enter_smm, 1031 - .leave_smm = vmx_leave_smm, 1032 - .enable_smi_window = vmx_enable_smi_window, 146 + .smi_allowed = vt_smi_allowed, 147 + .enter_smm = vt_enter_smm, 148 + .leave_smm = vt_leave_smm, 149 + .enable_smi_window = vt_enable_smi_window, 1033 150 #endif 1034 151 1035 - .check_emulate_instruction = vmx_check_emulate_instruction, 1036 - .apic_init_signal_blocked = vmx_apic_init_signal_blocked, 152 + .check_emulate_instruction = vt_check_emulate_instruction, 153 + .apic_init_signal_blocked = vt_apic_init_signal_blocked, 1037 154 .migrate_timers = vmx_migrate_timers, 1038 155 1039 - .msr_filter_changed = vmx_msr_filter_changed, 1040 - .complete_emulated_msr = kvm_complete_insn_gp, 156 + .msr_filter_changed = vt_msr_filter_changed, 157 + .complete_emulated_msr = vt_complete_emulated_msr, 1041 158 1042 159 .vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector, 1043 160 1044 161 .get_untagged_addr = 
vmx_get_untagged_addr, 162 + 163 + .mem_enc_ioctl = vt_mem_enc_ioctl, 164 + .vcpu_mem_enc_ioctl = vt_vcpu_mem_enc_ioctl, 165 + 166 + .private_max_mapping_level = vt_gmem_private_max_mapping_level 1045 167 }; 1046 168 1047 169 struct kvm_x86_init_ops vt_init_ops __initdata = { 1048 - .hardware_setup = vmx_hardware_setup, 170 + .hardware_setup = vt_hardware_setup, 1049 171 .handle_intel_pt_intr = NULL, 1050 172 1051 173 .runtime_ops = &vt_x86_ops, 1052 174 .pmu_ops = &intel_pmu_ops, 1053 175 }; 176 + 177 + static void __exit vt_exit(void) 178 + { 179 + kvm_exit(); 180 + tdx_cleanup(); 181 + vmx_exit(); 182 + } 183 + module_exit(vt_exit); 184 + 185 + static int __init vt_init(void) 186 + { 187 + unsigned vcpu_size, vcpu_align; 188 + int r; 189 + 190 + r = vmx_init(); 191 + if (r) 192 + return r; 193 + 194 + /* tdx_init() has been taken */ 195 + r = tdx_bringup(); 196 + if (r) 197 + goto err_tdx_bringup; 198 + 199 + /* 200 + * TDX and VMX have different vCPU structures. Calculate the 201 + * maximum size/align so that kvm_init() can use the larger 202 + * values to create the kmem_vcpu_cache. 203 + */ 204 + vcpu_size = sizeof(struct vcpu_vmx); 205 + vcpu_align = __alignof__(struct vcpu_vmx); 206 + if (enable_tdx) { 207 + vcpu_size = max_t(unsigned, vcpu_size, 208 + sizeof(struct vcpu_tdx)); 209 + vcpu_align = max_t(unsigned, vcpu_align, 210 + __alignof__(struct vcpu_tdx)); 211 + kvm_caps.supported_vm_types |= BIT(KVM_X86_TDX_VM); 212 + } 213 + 214 + /* 215 + * Common KVM initialization _must_ come last, after this, /dev/kvm is 216 + * exposed to userspace! 217 + */ 218 + r = kvm_init(vcpu_size, vcpu_align, THIS_MODULE); 219 + if (r) 220 + goto err_kvm_init; 221 + 222 + return 0; 223 + 224 + err_kvm_init: 225 + tdx_cleanup(); 226 + err_tdx_bringup: 227 + vmx_exit(); 228 + return r; 229 + } 230 + module_init(vt_init);
+6 -6
arch/x86/kvm/vmx/nested.c
··· 275 275 { 276 276 struct vmcs_host_state *dest, *src; 277 277 278 - if (unlikely(!vmx->guest_state_loaded)) 278 + if (unlikely(!vmx->vt.guest_state_loaded)) 279 279 return; 280 280 281 281 src = &prev->host_state; ··· 425 425 * tables also changed, but KVM should not treat EPT Misconfig 426 426 * VM-Exits as writes. 427 427 */ 428 - WARN_ON_ONCE(vmx->exit_reason.basic != EXIT_REASON_EPT_VIOLATION); 428 + WARN_ON_ONCE(vmx->vt.exit_reason.basic != EXIT_REASON_EPT_VIOLATION); 429 429 430 430 /* 431 431 * PML Full and EPT Violation VM-Exits both use bit 12 to report ··· 4622 4622 { 4623 4623 /* update exit information fields: */ 4624 4624 vmcs12->vm_exit_reason = vm_exit_reason; 4625 - if (to_vmx(vcpu)->exit_reason.enclave_mode) 4625 + if (vmx_get_exit_reason(vcpu).enclave_mode) 4626 4626 vmcs12->vm_exit_reason |= VMX_EXIT_REASONS_SGX_ENCLAVE_MODE; 4627 4627 vmcs12->exit_qualification = exit_qualification; 4628 4628 ··· 4794 4794 vmcs12->vm_exit_msr_load_count)) 4795 4795 nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_MSR_FAIL); 4796 4796 4797 - to_vmx(vcpu)->emulation_required = vmx_emulation_required(vcpu); 4797 + to_vt(vcpu)->emulation_required = vmx_emulation_required(vcpu); 4798 4798 } 4799 4799 4800 4800 static inline u64 nested_vmx_get_vmcs01_guest_efer(struct vcpu_vmx *vmx) ··· 6127 6127 * nested VM-Exit. Pass the original exit reason, i.e. don't hardcode 6128 6128 * EXIT_REASON_VMFUNC as the exit reason. 6129 6129 */ 6130 - nested_vmx_vmexit(vcpu, vmx->exit_reason.full, 6130 + nested_vmx_vmexit(vcpu, vmx->vt.exit_reason.full, 6131 6131 vmx_get_intr_info(vcpu), 6132 6132 vmx_get_exit_qual(vcpu)); 6133 6133 return 1; ··· 6572 6572 bool nested_vmx_reflect_vmexit(struct kvm_vcpu *vcpu) 6573 6573 { 6574 6574 struct vcpu_vmx *vmx = to_vmx(vcpu); 6575 - union vmx_exit_reason exit_reason = vmx->exit_reason; 6575 + union vmx_exit_reason exit_reason = vmx->vt.exit_reason; 6576 6576 unsigned long exit_qual; 6577 6577 u32 exit_intr_info; 6578 6578
+51 -1
arch/x86/kvm/vmx/pmu_intel.c
··· 19 19 #include "lapic.h" 20 20 #include "nested.h" 21 21 #include "pmu.h" 22 + #include "tdx.h" 22 23 23 24 /* 24 25 * Perf's "BASE" is wildly misleading, architectural PMUs use bits 31:16 of ECX ··· 34 33 #define INTEL_RDPMC_INDEX_MASK GENMASK(15, 0) 35 34 36 35 #define MSR_PMC_FULL_WIDTH_BIT (MSR_IA32_PMC0 - MSR_IA32_PERFCTR0) 36 + 37 + static struct lbr_desc *vcpu_to_lbr_desc(struct kvm_vcpu *vcpu) 38 + { 39 + if (is_td_vcpu(vcpu)) 40 + return NULL; 41 + 42 + return &to_vmx(vcpu)->lbr_desc; 43 + } 44 + 45 + static struct x86_pmu_lbr *vcpu_to_lbr_records(struct kvm_vcpu *vcpu) 46 + { 47 + if (is_td_vcpu(vcpu)) 48 + return NULL; 49 + 50 + return &to_vmx(vcpu)->lbr_desc.records; 51 + } 52 + 53 + #pragma GCC poison to_vmx 37 54 38 55 static void reprogram_fixed_counters(struct kvm_pmu *pmu, u64 data) 39 56 { ··· 148 129 return get_gp_pmc(pmu, msr, MSR_IA32_PMC0); 149 130 } 150 131 132 + static bool intel_pmu_lbr_is_compatible(struct kvm_vcpu *vcpu) 133 + { 134 + if (is_td_vcpu(vcpu)) 135 + return false; 136 + 137 + return cpuid_model_is_consistent(vcpu); 138 + } 139 + 140 + bool intel_pmu_lbr_is_enabled(struct kvm_vcpu *vcpu) 141 + { 142 + if (is_td_vcpu(vcpu)) 143 + return false; 144 + 145 + return !!vcpu_to_lbr_records(vcpu)->nr; 146 + } 147 + 151 148 static bool intel_pmu_is_valid_lbr_msr(struct kvm_vcpu *vcpu, u32 index) 152 149 { 153 150 struct x86_pmu_lbr *records = vcpu_to_lbr_records(vcpu); ··· 229 194 { 230 195 struct lbr_desc *lbr_desc = vcpu_to_lbr_desc(vcpu); 231 196 197 + if (!lbr_desc) 198 + return; 199 + 232 200 if (lbr_desc->event) { 233 201 perf_event_release_kernel(lbr_desc->event); 234 202 lbr_desc->event = NULL; ··· 272 234 .branch_sample_type = PERF_SAMPLE_BRANCH_CALL_STACK | 273 235 PERF_SAMPLE_BRANCH_USER, 274 236 }; 237 + 238 + if (WARN_ON_ONCE(!lbr_desc)) 239 + return 0; 275 240 276 241 if (unlikely(lbr_desc->event)) { 277 242 __set_bit(INTEL_PMC_IDX_FIXED_VLBR, pmu->pmc_in_use); ··· 507 466 u64 perf_capabilities; 508 467 u64 
counter_rsvd; 509 468 469 + if (!lbr_desc) 470 + return; 471 + 510 472 memset(&lbr_desc->records, 0, sizeof(lbr_desc->records)); 511 473 512 474 /* ··· 586 542 INTEL_PMC_MAX_GENERIC, pmu->nr_arch_fixed_counters); 587 543 588 544 perf_capabilities = vcpu_get_perf_capabilities(vcpu); 589 - if (cpuid_model_is_consistent(vcpu) && 545 + if (intel_pmu_lbr_is_compatible(vcpu) && 590 546 (perf_capabilities & PMU_CAP_LBR_FMT)) 591 547 memcpy(&lbr_desc->records, &vmx_lbr_caps, sizeof(vmx_lbr_caps)); 592 548 else ··· 613 569 int i; 614 570 struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); 615 571 struct lbr_desc *lbr_desc = vcpu_to_lbr_desc(vcpu); 572 + 573 + if (!lbr_desc) 574 + return; 616 575 617 576 for (i = 0; i < KVM_MAX_NR_INTEL_GP_COUNTERS; i++) { 618 577 pmu->gp_counters[i].type = KVM_PMC_GP; ··· 723 676 { 724 677 struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); 725 678 struct lbr_desc *lbr_desc = vcpu_to_lbr_desc(vcpu); 679 + 680 + if (WARN_ON_ONCE(!lbr_desc)) 681 + return; 726 682 727 683 if (!lbr_desc->event) { 728 684 vmx_disable_lbr_msrs_passthrough(vcpu);
+28
arch/x86/kvm/vmx/pmu_intel.h
···
+ /* SPDX-License-Identifier: GPL-2.0 */
+ #ifndef __KVM_X86_VMX_PMU_INTEL_H
+ #define __KVM_X86_VMX_PMU_INTEL_H
+
+ #include <linux/kvm_host.h>
+
+ bool intel_pmu_lbr_is_enabled(struct kvm_vcpu *vcpu);
+ int intel_pmu_create_guest_lbr_event(struct kvm_vcpu *vcpu);
+
+ struct lbr_desc {
+ 	/* Basic info about guest LBR records. */
+ 	struct x86_pmu_lbr records;
+
+ 	/*
+ 	 * Emulate LBR feature via passthrough LBR registers when the
+ 	 * per-vcpu guest LBR event is scheduled on the current pcpu.
+ 	 *
+ 	 * The records may be inaccurate if the host reclaims the LBR.
+ 	 */
+ 	struct perf_event *event;
+
+ 	/* True if LBRs are marked as not intercepted in the MSR bitmap */
+ 	bool msr_passthrough;
+ };
+
+ extern struct x86_pmu_lbr vmx_lbr_caps;
+
+ #endif /* __KVM_X86_VMX_PMU_INTEL_H */
+16 -12
arch/x86/kvm/vmx/posted_intr.c
··· 11 11 #include "posted_intr.h" 12 12 #include "trace.h" 13 13 #include "vmx.h" 14 + #include "tdx.h" 14 15 15 16 /* 16 17 * Maintain a per-CPU list of vCPUs that need to be awakened by wakeup_handler() ··· 34 33 35 34 #define PI_LOCK_SCHED_OUT SINGLE_DEPTH_NESTING 36 35 37 - static inline struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu) 36 + struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu) 38 37 { 39 - return &(to_vmx(vcpu)->pi_desc); 38 + return &(to_vt(vcpu)->pi_desc); 40 39 } 41 40 42 41 static int pi_try_set_control(struct pi_desc *pi_desc, u64 *pold, u64 new) ··· 56 55 void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu) 57 56 { 58 57 struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu); 59 - struct vcpu_vmx *vmx = to_vmx(vcpu); 58 + struct vcpu_vt *vt = to_vt(vcpu); 60 59 struct pi_desc old, new; 61 60 unsigned long flags; 62 61 unsigned int dest; ··· 103 102 */ 104 103 raw_spin_lock(spinlock); 105 104 spin_acquire(&spinlock->dep_map, PI_LOCK_SCHED_OUT, 0, _RET_IP_); 106 - list_del(&vmx->pi_wakeup_list); 105 + list_del(&vt->pi_wakeup_list); 107 106 spin_release(&spinlock->dep_map, _RET_IP_); 108 107 raw_spin_unlock(spinlock); 109 108 } ··· 160 159 static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu) 161 160 { 162 161 struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu); 163 - struct vcpu_vmx *vmx = to_vmx(vcpu); 162 + struct vcpu_vt *vt = to_vt(vcpu); 164 163 struct pi_desc old, new; 165 164 166 165 lockdep_assert_irqs_disabled(); ··· 179 178 */ 180 179 raw_spin_lock_nested(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu), 181 180 PI_LOCK_SCHED_OUT); 182 - list_add_tail(&vmx->pi_wakeup_list, 181 + list_add_tail(&vt->pi_wakeup_list, 183 182 &per_cpu(wakeup_vcpus_on_cpu, vcpu->cpu)); 184 183 raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu)); 185 184 ··· 214 213 * notification vector is switched to the one that calls 215 214 * back to the pi_wakeup_handler() function. 
216 215 */ 217 - return vmx_can_use_ipiv(vcpu) || vmx_can_use_vtd_pi(vcpu->kvm); 216 + return (vmx_can_use_ipiv(vcpu) && !is_td_vcpu(vcpu)) || 217 + vmx_can_use_vtd_pi(vcpu->kvm); 218 218 } 219 219 220 220 void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu) ··· 225 223 if (!vmx_needs_pi_wakeup(vcpu)) 226 224 return; 227 225 228 - if (kvm_vcpu_is_blocking(vcpu) && !vmx_interrupt_blocked(vcpu)) 226 + if (kvm_vcpu_is_blocking(vcpu) && 227 + ((is_td_vcpu(vcpu) && tdx_interrupt_allowed(vcpu)) || 228 + (!is_td_vcpu(vcpu) && !vmx_interrupt_blocked(vcpu)))) 229 229 pi_enable_wakeup_handler(vcpu); 230 230 231 231 /* ··· 247 243 int cpu = smp_processor_id(); 248 244 struct list_head *wakeup_list = &per_cpu(wakeup_vcpus_on_cpu, cpu); 249 245 raw_spinlock_t *spinlock = &per_cpu(wakeup_vcpus_on_cpu_lock, cpu); 250 - struct vcpu_vmx *vmx; 246 + struct vcpu_vt *vt; 251 247 252 248 raw_spin_lock(spinlock); 253 - list_for_each_entry(vmx, wakeup_list, pi_wakeup_list) { 249 + list_for_each_entry(vt, wakeup_list, pi_wakeup_list) { 254 250 255 - if (pi_test_on(&vmx->pi_desc)) 256 - kvm_vcpu_wake_up(&vmx->vcpu); 251 + if (pi_test_on(&vt->pi_desc)) 252 + kvm_vcpu_wake_up(vt_to_vcpu(vt)); 257 253 } 258 254 raw_spin_unlock(spinlock); 259 255 }
+2
arch/x86/kvm/vmx/posted_intr.h
···
  #include <linux/bitmap.h>
  #include <asm/posted_intr.h>
 
+ struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu);
+
  void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu);
  void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu);
  void pi_wakeup_handler(void);
+3526
arch/x86/kvm/vmx/tdx.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + #include <linux/cleanup.h> 3 + #include <linux/cpu.h> 4 + #include <asm/cpufeature.h> 5 + #include <asm/fpu/xcr.h> 6 + #include <linux/misc_cgroup.h> 7 + #include <linux/mmu_context.h> 8 + #include <asm/tdx.h> 9 + #include "capabilities.h" 10 + #include "mmu.h" 11 + #include "x86_ops.h" 12 + #include "lapic.h" 13 + #include "tdx.h" 14 + #include "vmx.h" 15 + #include "mmu/spte.h" 16 + #include "common.h" 17 + #include "posted_intr.h" 18 + #include "irq.h" 19 + #include <trace/events/kvm.h> 20 + #include "trace.h" 21 + 22 + #pragma GCC poison to_vmx 23 + 24 + #undef pr_fmt 25 + #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt 26 + 27 + #define pr_tdx_error(__fn, __err) \ 28 + pr_err_ratelimited("SEAMCALL %s failed: 0x%llx\n", #__fn, __err) 29 + 30 + #define __pr_tdx_error_N(__fn_str, __err, __fmt, ...) \ 31 + pr_err_ratelimited("SEAMCALL " __fn_str " failed: 0x%llx, " __fmt, __err, __VA_ARGS__) 32 + 33 + #define pr_tdx_error_1(__fn, __err, __rcx) \ 34 + __pr_tdx_error_N(#__fn, __err, "rcx 0x%llx\n", __rcx) 35 + 36 + #define pr_tdx_error_2(__fn, __err, __rcx, __rdx) \ 37 + __pr_tdx_error_N(#__fn, __err, "rcx 0x%llx, rdx 0x%llx\n", __rcx, __rdx) 38 + 39 + #define pr_tdx_error_3(__fn, __err, __rcx, __rdx, __r8) \ 40 + __pr_tdx_error_N(#__fn, __err, "rcx 0x%llx, rdx 0x%llx, r8 0x%llx\n", __rcx, __rdx, __r8) 41 + 42 + bool enable_tdx __ro_after_init; 43 + module_param_named(tdx, enable_tdx, bool, 0444); 44 + 45 + #define TDX_SHARED_BIT_PWL_5 gpa_to_gfn(BIT_ULL(51)) 46 + #define TDX_SHARED_BIT_PWL_4 gpa_to_gfn(BIT_ULL(47)) 47 + 48 + static enum cpuhp_state tdx_cpuhp_state; 49 + 50 + static const struct tdx_sys_info *tdx_sysinfo; 51 + 52 + void tdh_vp_rd_failed(struct vcpu_tdx *tdx, char *uclass, u32 field, u64 err) 53 + { 54 + KVM_BUG_ON(1, tdx->vcpu.kvm); 55 + pr_err("TDH_VP_RD[%s.0x%x] failed 0x%llx\n", uclass, field, err); 56 + } 57 + 58 + void tdh_vp_wr_failed(struct vcpu_tdx *tdx, char *uclass, char *op, u32 field, 59 + u64 
val, u64 err) 60 + { 61 + KVM_BUG_ON(1, tdx->vcpu.kvm); 62 + pr_err("TDH_VP_WR[%s.0x%x]%s0x%llx failed: 0x%llx\n", uclass, field, op, val, err); 63 + } 64 + 65 + #define KVM_SUPPORTED_TD_ATTRS (TDX_TD_ATTR_SEPT_VE_DISABLE) 66 + 67 + static __always_inline struct kvm_tdx *to_kvm_tdx(struct kvm *kvm) 68 + { 69 + return container_of(kvm, struct kvm_tdx, kvm); 70 + } 71 + 72 + static __always_inline struct vcpu_tdx *to_tdx(struct kvm_vcpu *vcpu) 73 + { 74 + return container_of(vcpu, struct vcpu_tdx, vcpu); 75 + } 76 + 77 + static u64 tdx_get_supported_attrs(const struct tdx_sys_info_td_conf *td_conf) 78 + { 79 + u64 val = KVM_SUPPORTED_TD_ATTRS; 80 + 81 + if ((val & td_conf->attributes_fixed1) != td_conf->attributes_fixed1) 82 + return 0; 83 + 84 + val &= td_conf->attributes_fixed0; 85 + 86 + return val; 87 + } 88 + 89 + static u64 tdx_get_supported_xfam(const struct tdx_sys_info_td_conf *td_conf) 90 + { 91 + u64 val = kvm_caps.supported_xcr0 | kvm_caps.supported_xss; 92 + 93 + if ((val & td_conf->xfam_fixed1) != td_conf->xfam_fixed1) 94 + return 0; 95 + 96 + val &= td_conf->xfam_fixed0; 97 + 98 + return val; 99 + } 100 + 101 + static int tdx_get_guest_phys_addr_bits(const u32 eax) 102 + { 103 + return (eax & GENMASK(23, 16)) >> 16; 104 + } 105 + 106 + static u32 tdx_set_guest_phys_addr_bits(const u32 eax, int addr_bits) 107 + { 108 + return (eax & ~GENMASK(23, 16)) | (addr_bits & 0xff) << 16; 109 + } 110 + 111 + #define TDX_FEATURE_TSX (__feature_bit(X86_FEATURE_HLE) | __feature_bit(X86_FEATURE_RTM)) 112 + 113 + static bool has_tsx(const struct kvm_cpuid_entry2 *entry) 114 + { 115 + return entry->function == 7 && entry->index == 0 && 116 + (entry->ebx & TDX_FEATURE_TSX); 117 + } 118 + 119 + static void clear_tsx(struct kvm_cpuid_entry2 *entry) 120 + { 121 + entry->ebx &= ~TDX_FEATURE_TSX; 122 + } 123 + 124 + static bool has_waitpkg(const struct kvm_cpuid_entry2 *entry) 125 + { 126 + return entry->function == 7 && entry->index == 0 && 127 + (entry->ecx & 
__feature_bit(X86_FEATURE_WAITPKG)); 128 + } 129 + 130 + static void clear_waitpkg(struct kvm_cpuid_entry2 *entry) 131 + { 132 + entry->ecx &= ~__feature_bit(X86_FEATURE_WAITPKG); 133 + } 134 + 135 + static void tdx_clear_unsupported_cpuid(struct kvm_cpuid_entry2 *entry) 136 + { 137 + if (has_tsx(entry)) 138 + clear_tsx(entry); 139 + 140 + if (has_waitpkg(entry)) 141 + clear_waitpkg(entry); 142 + } 143 + 144 + static bool tdx_unsupported_cpuid(const struct kvm_cpuid_entry2 *entry) 145 + { 146 + return has_tsx(entry) || has_waitpkg(entry); 147 + } 148 + 149 + #define KVM_TDX_CPUID_NO_SUBLEAF ((__u32)-1) 150 + 151 + static void td_init_cpuid_entry2(struct kvm_cpuid_entry2 *entry, unsigned char idx) 152 + { 153 + const struct tdx_sys_info_td_conf *td_conf = &tdx_sysinfo->td_conf; 154 + 155 + entry->function = (u32)td_conf->cpuid_config_leaves[idx]; 156 + entry->index = td_conf->cpuid_config_leaves[idx] >> 32; 157 + entry->eax = (u32)td_conf->cpuid_config_values[idx][0]; 158 + entry->ebx = td_conf->cpuid_config_values[idx][0] >> 32; 159 + entry->ecx = (u32)td_conf->cpuid_config_values[idx][1]; 160 + entry->edx = td_conf->cpuid_config_values[idx][1] >> 32; 161 + 162 + if (entry->index == KVM_TDX_CPUID_NO_SUBLEAF) 163 + entry->index = 0; 164 + 165 + /* 166 + * The TDX module doesn't allow configuring the guest phys addr bits 167 + * (EAX[23:16]). However, KVM uses it as an interface for userspace 168 + * to configure the GPAW. Report these bits as configurable. 
169 + */ 170 + if (entry->function == 0x80000008) 171 + entry->eax = tdx_set_guest_phys_addr_bits(entry->eax, 0xff); 172 + 173 + tdx_clear_unsupported_cpuid(entry); 174 + } 175 + 176 + static int init_kvm_tdx_caps(const struct tdx_sys_info_td_conf *td_conf, 177 + struct kvm_tdx_capabilities *caps) 178 + { 179 + int i; 180 + 181 + caps->supported_attrs = tdx_get_supported_attrs(td_conf); 182 + if (!caps->supported_attrs) 183 + return -EIO; 184 + 185 + caps->supported_xfam = tdx_get_supported_xfam(td_conf); 186 + if (!caps->supported_xfam) 187 + return -EIO; 188 + 189 + caps->cpuid.nent = td_conf->num_cpuid_config; 190 + 191 + for (i = 0; i < td_conf->num_cpuid_config; i++) 192 + td_init_cpuid_entry2(&caps->cpuid.entries[i], i); 193 + 194 + return 0; 195 + } 196 + 197 + /* 198 + * Some SEAMCALLs acquire the TDX module globally, and can fail with 199 + * TDX_OPERAND_BUSY. Use a global mutex to serialize these SEAMCALLs. 200 + */ 201 + static DEFINE_MUTEX(tdx_lock); 202 + 203 + static atomic_t nr_configured_hkid; 204 + 205 + static bool tdx_operand_busy(u64 err) 206 + { 207 + return (err & TDX_SEAMCALL_STATUS_MASK) == TDX_OPERAND_BUSY; 208 + } 209 + 210 + 211 + /* 212 + * A per-CPU list of TD vCPUs associated with a given CPU. 213 + * Protected by interrupt mask. Only manipulated by the CPU owning this per-CPU 214 + * list. 215 + * - When a vCPU is loaded onto a CPU, it is removed from the per-CPU list of 216 + * the old CPU during the IPI callback running on the old CPU, and then added 217 + * to the per-CPU list of the new CPU. 218 + * - When a TD is tearing down, all vCPUs are disassociated from their current 219 + * running CPUs and removed from the per-CPU list during the IPI callback 220 + * running on those CPUs. 221 + * - When a CPU is brought down, traverse the per-CPU list to disassociate all 222 + * associated TD vCPUs and remove them from the per-CPU list. 
223 + */ 224 + static DEFINE_PER_CPU(struct list_head, associated_tdvcpus); 225 + 226 + static __always_inline unsigned long tdvmcall_exit_type(struct kvm_vcpu *vcpu) 227 + { 228 + return to_tdx(vcpu)->vp_enter_args.r10; 229 + } 230 + 231 + static __always_inline unsigned long tdvmcall_leaf(struct kvm_vcpu *vcpu) 232 + { 233 + return to_tdx(vcpu)->vp_enter_args.r11; 234 + } 235 + 236 + static __always_inline void tdvmcall_set_return_code(struct kvm_vcpu *vcpu, 237 + long val) 238 + { 239 + to_tdx(vcpu)->vp_enter_args.r10 = val; 240 + } 241 + 242 + static __always_inline void tdvmcall_set_return_val(struct kvm_vcpu *vcpu, 243 + unsigned long val) 244 + { 245 + to_tdx(vcpu)->vp_enter_args.r11 = val; 246 + } 247 + 248 + static inline void tdx_hkid_free(struct kvm_tdx *kvm_tdx) 249 + { 250 + tdx_guest_keyid_free(kvm_tdx->hkid); 251 + kvm_tdx->hkid = -1; 252 + atomic_dec(&nr_configured_hkid); 253 + misc_cg_uncharge(MISC_CG_RES_TDX, kvm_tdx->misc_cg, 1); 254 + put_misc_cg(kvm_tdx->misc_cg); 255 + kvm_tdx->misc_cg = NULL; 256 + } 257 + 258 + static inline bool is_hkid_assigned(struct kvm_tdx *kvm_tdx) 259 + { 260 + return kvm_tdx->hkid > 0; 261 + } 262 + 263 + static inline void tdx_disassociate_vp(struct kvm_vcpu *vcpu) 264 + { 265 + lockdep_assert_irqs_disabled(); 266 + 267 + list_del(&to_tdx(vcpu)->cpu_list); 268 + 269 + /* 270 + * Ensure tdx->cpu_list is updated before setting vcpu->cpu to -1, 271 + * otherwise, a different CPU can see vcpu->cpu = -1 and add the vCPU 272 + * to its list before it's deleted from this CPU's list. 273 + */ 274 + smp_wmb(); 275 + 276 + vcpu->cpu = -1; 277 + } 278 + 279 + static void tdx_clear_page(struct page *page) 280 + { 281 + const void *zero_page = (const void *) page_to_virt(ZERO_PAGE(0)); 282 + void *dest = page_to_virt(page); 283 + unsigned long i; 284 + 285 + /* 286 + * The page could have been poisoned. MOVDIR64B also clears 287 + * the poison bit so the kernel can safely use the page again. 
288 + */ 289 + for (i = 0; i < PAGE_SIZE; i += 64) 290 + movdir64b(dest + i, zero_page); 291 + /* 292 + * MOVDIR64B store uses WC buffer. Prevent following memory reads 293 + * from seeing potentially poisoned cache. 294 + */ 295 + __mb(); 296 + } 297 + 298 + static void tdx_no_vcpus_enter_start(struct kvm *kvm) 299 + { 300 + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); 301 + 302 + lockdep_assert_held_write(&kvm->mmu_lock); 303 + 304 + WRITE_ONCE(kvm_tdx->wait_for_sept_zap, true); 305 + 306 + kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE); 307 + } 308 + 309 + static void tdx_no_vcpus_enter_stop(struct kvm *kvm) 310 + { 311 + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); 312 + 313 + lockdep_assert_held_write(&kvm->mmu_lock); 314 + 315 + WRITE_ONCE(kvm_tdx->wait_for_sept_zap, false); 316 + } 317 + 318 + /* TDH.PHYMEM.PAGE.RECLAIM is allowed only when destroying the TD. */ 319 + static int __tdx_reclaim_page(struct page *page) 320 + { 321 + u64 err, rcx, rdx, r8; 322 + 323 + err = tdh_phymem_page_reclaim(page, &rcx, &rdx, &r8); 324 + 325 + /* 326 + * No need to check for TDX_OPERAND_BUSY; all TD pages are freed 327 + * before the HKID is released and control pages have also been 328 + * released at this point, so there is no possibility of contention. 329 + */ 330 + if (WARN_ON_ONCE(err)) { 331 + pr_tdx_error_3(TDH_PHYMEM_PAGE_RECLAIM, err, rcx, rdx, r8); 332 + return -EIO; 333 + } 334 + return 0; 335 + } 336 + 337 + static int tdx_reclaim_page(struct page *page) 338 + { 339 + int r; 340 + 341 + r = __tdx_reclaim_page(page); 342 + if (!r) 343 + tdx_clear_page(page); 344 + return r; 345 + } 346 + 347 + 348 + /* 349 + * Reclaim the TD control page(s) which are crypto-protected by TDX guest's 350 + * private KeyID. Assume the cache associated with the TDX private KeyID has 351 + * been flushed. 352 + */ 353 + static void tdx_reclaim_control_page(struct page *ctrl_page) 354 + { 355 + /* 356 + * Leak the page if the kernel failed to reclaim the page. 
357 + * The kernel cannot use it safely anymore. 358 + */ 359 + if (tdx_reclaim_page(ctrl_page)) 360 + return; 361 + 362 + __free_page(ctrl_page); 363 + } 364 + 365 + struct tdx_flush_vp_arg { 366 + struct kvm_vcpu *vcpu; 367 + u64 err; 368 + }; 369 + 370 + static void tdx_flush_vp(void *_arg) 371 + { 372 + struct tdx_flush_vp_arg *arg = _arg; 373 + struct kvm_vcpu *vcpu = arg->vcpu; 374 + u64 err; 375 + 376 + arg->err = 0; 377 + lockdep_assert_irqs_disabled(); 378 + 379 + /* Task migration can race with CPU offlining. */ 380 + if (unlikely(vcpu->cpu != raw_smp_processor_id())) 381 + return; 382 + 383 + /* 384 + * No need to do TDH_VP_FLUSH if the vCPU hasn't been initialized. The 385 + * list tracking still needs to be updated so that it's correct if/when 386 + * the vCPU does get initialized. 387 + */ 388 + if (to_tdx(vcpu)->state != VCPU_TD_STATE_UNINITIALIZED) { 389 + /* 390 + * No need to retry. TDX resources needed for TDH.VP.FLUSH are: 391 + * TDVPR as exclusive, TDR as shared, and TDCS as shared. This 392 + * flush function is called when destroying a vCPU/TD or during 393 + * vCPU migration. No other thread uses TDVPR in those cases. 394 + */ 395 + err = tdh_vp_flush(&to_tdx(vcpu)->vp); 396 + if (unlikely(err && err != TDX_VCPU_NOT_ASSOCIATED)) { 397 + /* 398 + * This function is called in IPI context. Do not use 399 + * printk, to avoid taking the console semaphore. 400 + * The caller prints out the error message instead. 
401 + */ 402 + if (err) 403 + arg->err = err; 404 + } 405 + } 406 + 407 + tdx_disassociate_vp(vcpu); 408 + } 409 + 410 + static void tdx_flush_vp_on_cpu(struct kvm_vcpu *vcpu) 411 + { 412 + struct tdx_flush_vp_arg arg = { 413 + .vcpu = vcpu, 414 + }; 415 + int cpu = vcpu->cpu; 416 + 417 + if (unlikely(cpu == -1)) 418 + return; 419 + 420 + smp_call_function_single(cpu, tdx_flush_vp, &arg, 1); 421 + if (KVM_BUG_ON(arg.err, vcpu->kvm)) 422 + pr_tdx_error(TDH_VP_FLUSH, arg.err); 423 + } 424 + 425 + void tdx_disable_virtualization_cpu(void) 426 + { 427 + int cpu = raw_smp_processor_id(); 428 + struct list_head *tdvcpus = &per_cpu(associated_tdvcpus, cpu); 429 + struct tdx_flush_vp_arg arg; 430 + struct vcpu_tdx *tdx, *tmp; 431 + unsigned long flags; 432 + 433 + local_irq_save(flags); 434 + /* Safe variant needed as tdx_disassociate_vp() deletes the entry. */ 435 + list_for_each_entry_safe(tdx, tmp, tdvcpus, cpu_list) { 436 + arg.vcpu = &tdx->vcpu; 437 + tdx_flush_vp(&arg); 438 + } 439 + local_irq_restore(flags); 440 + } 441 + 442 + #define TDX_SEAMCALL_RETRIES 10000 443 + 444 + static void smp_func_do_phymem_cache_wb(void *unused) 445 + { 446 + u64 err = 0; 447 + bool resume; 448 + int i; 449 + 450 + /* 451 + * TDH.PHYMEM.CACHE.WB flushes caches associated with any TDX private 452 + * KeyID on the package or core. The TDX module may not finish the 453 + * cache flush but return TDX_INTERRUPTED_RESUMABLE instead. The 454 + * kernel should retry it until it returns success w/o rescheduling. 
455 + */ 456 + for (i = TDX_SEAMCALL_RETRIES; i > 0; i--) { 457 + resume = !!err; 458 + err = tdh_phymem_cache_wb(resume); 459 + switch (err) { 460 + case TDX_INTERRUPTED_RESUMABLE: 461 + continue; 462 + case TDX_NO_HKID_READY_TO_WBCACHE: 463 + err = TDX_SUCCESS; /* Already done by other thread */ 464 + fallthrough; 465 + default: 466 + goto out; 467 + } 468 + } 469 + 470 + out: 471 + if (WARN_ON_ONCE(err)) 472 + pr_tdx_error(TDH_PHYMEM_CACHE_WB, err); 473 + } 474 + 475 + void tdx_mmu_release_hkid(struct kvm *kvm) 476 + { 477 + bool packages_allocated, targets_allocated; 478 + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); 479 + cpumask_var_t packages, targets; 480 + struct kvm_vcpu *vcpu; 481 + unsigned long j; 482 + int i; 483 + u64 err; 484 + 485 + if (!is_hkid_assigned(kvm_tdx)) 486 + return; 487 + 488 + packages_allocated = zalloc_cpumask_var(&packages, GFP_KERNEL); 489 + targets_allocated = zalloc_cpumask_var(&targets, GFP_KERNEL); 490 + cpus_read_lock(); 491 + 492 + kvm_for_each_vcpu(j, vcpu, kvm) 493 + tdx_flush_vp_on_cpu(vcpu); 494 + 495 + /* 496 + * TDH.PHYMEM.CACHE.WB tries to acquire the TDX module global lock 497 + * and can fail with TDX_OPERAND_BUSY when it fails to get the lock. 498 + * Multiple TDX guests can be destroyed simultaneously. Take the 499 + * mutex to avoid hitting that error. 500 + */ 501 + mutex_lock(&tdx_lock); 502 + 503 + /* 504 + * The HKID is released in vm_destroy(). 505 + * After flushing the vCPUs above, there should be no more vCPU 506 + * associations, as all vCPU fds have been released at this stage. 507 + */ 508 + err = tdh_mng_vpflushdone(&kvm_tdx->td); 509 + if (err == TDX_FLUSHVP_NOT_DONE) 510 + goto out; 511 + if (KVM_BUG_ON(err, kvm)) { 512 + pr_tdx_error(TDH_MNG_VPFLUSHDONE, err); 513 + pr_err("tdh_mng_vpflushdone() failed. 
HKID %d is leaked.\n", 514 + kvm_tdx->hkid); 515 + goto out; 516 + } 517 + 518 + for_each_online_cpu(i) { 519 + if (packages_allocated && 520 + cpumask_test_and_set_cpu(topology_physical_package_id(i), 521 + packages)) 522 + continue; 523 + if (targets_allocated) 524 + cpumask_set_cpu(i, targets); 525 + } 526 + if (targets_allocated) 527 + on_each_cpu_mask(targets, smp_func_do_phymem_cache_wb, NULL, true); 528 + else 529 + on_each_cpu(smp_func_do_phymem_cache_wb, NULL, true); 530 + /* 531 + * In the case of error in smp_func_do_phymem_cache_wb(), the following 532 + * tdh_mng_key_freeid() will fail. 533 + */ 534 + err = tdh_mng_key_freeid(&kvm_tdx->td); 535 + if (KVM_BUG_ON(err, kvm)) { 536 + pr_tdx_error(TDH_MNG_KEY_FREEID, err); 537 + pr_err("tdh_mng_key_freeid() failed. HKID %d is leaked.\n", 538 + kvm_tdx->hkid); 539 + } else { 540 + tdx_hkid_free(kvm_tdx); 541 + } 542 + 543 + out: 544 + mutex_unlock(&tdx_lock); 545 + cpus_read_unlock(); 546 + free_cpumask_var(targets); 547 + free_cpumask_var(packages); 548 + } 549 + 550 + static void tdx_reclaim_td_control_pages(struct kvm *kvm) 551 + { 552 + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); 553 + u64 err; 554 + int i; 555 + 556 + /* 557 + * tdx_mmu_release_hkid() failed to reclaim the HKID. Something went 558 + * badly wrong with the TDX module. Give up freeing the TD pages. The 559 + * function already warned, so don't warn again. 560 + */ 561 + if (is_hkid_assigned(kvm_tdx)) 562 + return; 563 + 564 + if (kvm_tdx->td.tdcs_pages) { 565 + for (i = 0; i < kvm_tdx->td.tdcs_nr_pages; i++) { 566 + if (!kvm_tdx->td.tdcs_pages[i]) 567 + continue; 568 + 569 + tdx_reclaim_control_page(kvm_tdx->td.tdcs_pages[i]); 570 + } 571 + kfree(kvm_tdx->td.tdcs_pages); 572 + kvm_tdx->td.tdcs_pages = NULL; 573 + } 574 + 575 + if (!kvm_tdx->td.tdr_page) 576 + return; 577 + 578 + if (__tdx_reclaim_page(kvm_tdx->td.tdr_page)) 579 + return; 580 + 581 + /* 582 + * Use a SEAMCALL to ask the TDX module to flush the cache based on the 583 + * KeyID. 
The TDX module may access the TDR while operating on the TD (especially 584 + * when it is reclaiming the TDCS). 585 + */ 586 + err = tdh_phymem_page_wbinvd_tdr(&kvm_tdx->td); 587 + if (KVM_BUG_ON(err, kvm)) { 588 + pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err); 589 + return; 590 + } 591 + tdx_clear_page(kvm_tdx->td.tdr_page); 592 + 593 + __free_page(kvm_tdx->td.tdr_page); 594 + kvm_tdx->td.tdr_page = NULL; 595 + } 596 + 597 + void tdx_vm_destroy(struct kvm *kvm) 598 + { 599 + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); 600 + 601 + tdx_reclaim_td_control_pages(kvm); 602 + 603 + kvm_tdx->state = TD_STATE_UNINITIALIZED; 604 + } 605 + 606 + static int tdx_do_tdh_mng_key_config(void *param) 607 + { 608 + struct kvm_tdx *kvm_tdx = param; 609 + u64 err; 610 + 611 + /* TDX_RND_NO_ENTROPY related retries are handled by sc_retry() */ 612 + err = tdh_mng_key_config(&kvm_tdx->td); 613 + 614 + if (KVM_BUG_ON(err, &kvm_tdx->kvm)) { 615 + pr_tdx_error(TDH_MNG_KEY_CONFIG, err); 616 + return -EIO; 617 + } 618 + 619 + return 0; 620 + } 621 + 622 + int tdx_vm_init(struct kvm *kvm) 623 + { 624 + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); 625 + 626 + kvm->arch.has_protected_state = true; 627 + kvm->arch.has_private_mem = true; 628 + kvm->arch.disabled_quirks |= KVM_X86_QUIRK_IGNORE_GUEST_PAT; 629 + 630 + /* 631 + * Because the guest TD is protected, the VMM can't parse instructions 632 + * in the TD. Instead, the guest uses the MMIO hypercall. For unmodified 633 + * device drivers, #VE needs to be injected for MMIO, and the #VE handler 634 + * in the TD converts the MMIO instruction into an MMIO hypercall. 635 + * 636 + * The SPTE value for MMIO needs to be set up so that #VE is injected 637 + * into the TD instead of triggering EPT MISCONFIG. 638 + * - RWX=0 so that an EPT violation is triggered. 639 + * - The suppress-#VE bit is cleared to inject #VE. 640 + */ 641 + kvm_mmu_set_mmio_spte_value(kvm, 0); 642 + 643 + /* 644 + * TDX has its own limit on the maximum number of vCPUs it can support 645 + * for all TDX guests, in addition to KVM_MAX_VCPUS. The 
TDX module reports 646 + * such limit via the MAX_VCPU_PER_TD global metadata. In 647 + * practice, it reflects the number of logical CPUs that ALL 648 + * platforms that the TDX module supports can possibly have. 649 + * 650 + * Limit TDX guest's maximum vCPUs to the number of logical CPUs 651 + * the platform has. Simply forwarding the MAX_VCPU_PER_TD to 652 + * userspace would result in an unpredictable ABI. 653 + */ 654 + kvm->max_vcpus = min_t(int, kvm->max_vcpus, num_present_cpus()); 655 + 656 + kvm_tdx->state = TD_STATE_UNINITIALIZED; 657 + 658 + return 0; 659 + } 660 + 661 + int tdx_vcpu_create(struct kvm_vcpu *vcpu) 662 + { 663 + struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm); 664 + struct vcpu_tdx *tdx = to_tdx(vcpu); 665 + 666 + if (kvm_tdx->state != TD_STATE_INITIALIZED) 667 + return -EIO; 668 + 669 + /* 670 + * TDX module mandates APICv, which requires an in-kernel local APIC. 671 + * Disallow an in-kernel I/O APIC, because level-triggered interrupts 672 + * and thus the I/O APIC as a whole can't be faithfully emulated in KVM. 673 + */ 674 + if (!irqchip_split(vcpu->kvm)) 675 + return -EINVAL; 676 + 677 + fpstate_set_confidential(&vcpu->arch.guest_fpu); 678 + vcpu->arch.apic->guest_apic_protected = true; 679 + INIT_LIST_HEAD(&tdx->vt.pi_wakeup_list); 680 + 681 + vcpu->arch.efer = EFER_SCE | EFER_LME | EFER_LMA | EFER_NX; 682 + 683 + vcpu->arch.switch_db_regs = KVM_DEBUGREG_AUTO_SWITCH; 684 + vcpu->arch.cr0_guest_owned_bits = -1ul; 685 + vcpu->arch.cr4_guest_owned_bits = -1ul; 686 + 687 + /* KVM can't change TSC offset/multiplier as TDX module manages them. 
*/ 688 + vcpu->arch.guest_tsc_protected = true; 689 + vcpu->arch.tsc_offset = kvm_tdx->tsc_offset; 690 + vcpu->arch.l1_tsc_offset = vcpu->arch.tsc_offset; 691 + vcpu->arch.tsc_scaling_ratio = kvm_tdx->tsc_multiplier; 692 + vcpu->arch.l1_tsc_scaling_ratio = kvm_tdx->tsc_multiplier; 693 + 694 + vcpu->arch.guest_state_protected = 695 + !(to_kvm_tdx(vcpu->kvm)->attributes & TDX_TD_ATTR_DEBUG); 696 + 697 + if ((kvm_tdx->xfam & XFEATURE_MASK_XTILE) == XFEATURE_MASK_XTILE) 698 + vcpu->arch.xfd_no_write_intercept = true; 699 + 700 + tdx->vt.pi_desc.nv = POSTED_INTR_VECTOR; 701 + __pi_set_sn(&tdx->vt.pi_desc); 702 + 703 + tdx->state = VCPU_TD_STATE_UNINITIALIZED; 704 + 705 + return 0; 706 + } 707 + 708 + void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu) 709 + { 710 + struct vcpu_tdx *tdx = to_tdx(vcpu); 711 + 712 + vmx_vcpu_pi_load(vcpu, cpu); 713 + if (vcpu->cpu == cpu || !is_hkid_assigned(to_kvm_tdx(vcpu->kvm))) 714 + return; 715 + 716 + tdx_flush_vp_on_cpu(vcpu); 717 + 718 + KVM_BUG_ON(cpu != raw_smp_processor_id(), vcpu->kvm); 719 + local_irq_disable(); 720 + /* 721 + * Pairs with the smp_wmb() in tdx_disassociate_vp() to ensure 722 + * vcpu->cpu is read before tdx->cpu_list. 723 + */ 724 + smp_rmb(); 725 + 726 + list_add(&tdx->cpu_list, &per_cpu(associated_tdvcpus, cpu)); 727 + local_irq_enable(); 728 + } 729 + 730 + bool tdx_interrupt_allowed(struct kvm_vcpu *vcpu) 731 + { 732 + /* 733 + * KVM can't get the interrupt status of TDX guest and it assumes 734 + * interrupt is always allowed unless TDX guest calls TDVMCALL with HLT, 735 + * which passes the interrupt blocked flag. 736 + */ 737 + return vmx_get_exit_reason(vcpu).basic != EXIT_REASON_HLT || 738 + !to_tdx(vcpu)->vp_enter_args.r12; 739 + } 740 + 741 + bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu) 742 + { 743 + u64 vcpu_state_details; 744 + 745 + if (pi_has_pending_interrupt(vcpu)) 746 + return true; 747 + 748 + /* 749 + * Only check RVI pending for HALTED case with IRQ enabled. 
750 + * For non-HLT cases, KVM doesn't care about STI/SS shadows. And if the 751 + * interrupt was pending before TD exit, then it _must_ be blocked, 752 + * otherwise the interrupt would have been serviced at the instruction 753 + * boundary. 754 + */ 755 + if (vmx_get_exit_reason(vcpu).basic != EXIT_REASON_HLT || 756 + to_tdx(vcpu)->vp_enter_args.r12) 757 + return false; 758 + 759 + vcpu_state_details = 760 + td_state_non_arch_read64(to_tdx(vcpu), TD_VCPU_STATE_DETAILS_NON_ARCH); 761 + 762 + return tdx_vcpu_state_details_intr_pending(vcpu_state_details); 763 + } 764 + 765 + /* 766 + * Compared to vmx_prepare_switch_to_guest(), there is not much to do 767 + * as SEAMCALL/SEAMRET calls take care of most of save and restore. 768 + */ 769 + void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu) 770 + { 771 + struct vcpu_vt *vt = to_vt(vcpu); 772 + 773 + if (vt->guest_state_loaded) 774 + return; 775 + 776 + if (likely(is_64bit_mm(current->mm))) 777 + vt->msr_host_kernel_gs_base = current->thread.gsbase; 778 + else 779 + vt->msr_host_kernel_gs_base = read_msr(MSR_KERNEL_GS_BASE); 780 + 781 + vt->host_debugctlmsr = get_debugctlmsr(); 782 + 783 + vt->guest_state_loaded = true; 784 + } 785 + 786 + struct tdx_uret_msr { 787 + u32 msr; 788 + unsigned int slot; 789 + u64 defval; 790 + }; 791 + 792 + static struct tdx_uret_msr tdx_uret_msrs[] = { 793 + {.msr = MSR_SYSCALL_MASK, .defval = 0x20200 }, 794 + {.msr = MSR_STAR,}, 795 + {.msr = MSR_LSTAR,}, 796 + {.msr = MSR_TSC_AUX,}, 797 + }; 798 + 799 + static void tdx_user_return_msr_update_cache(void) 800 + { 801 + int i; 802 + 803 + for (i = 0; i < ARRAY_SIZE(tdx_uret_msrs); i++) 804 + kvm_user_return_msr_update_cache(tdx_uret_msrs[i].slot, 805 + tdx_uret_msrs[i].defval); 806 + } 807 + 808 + static void tdx_prepare_switch_to_host(struct kvm_vcpu *vcpu) 809 + { 810 + struct vcpu_vt *vt = to_vt(vcpu); 811 + struct vcpu_tdx *tdx = to_tdx(vcpu); 812 + 813 + if (!vt->guest_state_loaded) 814 + return; 815 + 816 + 
++vcpu->stat.host_state_reload; 817 + wrmsrl(MSR_KERNEL_GS_BASE, vt->msr_host_kernel_gs_base); 818 + 819 + if (tdx->guest_entered) { 820 + tdx_user_return_msr_update_cache(); 821 + tdx->guest_entered = false; 822 + } 823 + 824 + vt->guest_state_loaded = false; 825 + } 826 + 827 + void tdx_vcpu_put(struct kvm_vcpu *vcpu) 828 + { 829 + vmx_vcpu_pi_put(vcpu); 830 + tdx_prepare_switch_to_host(vcpu); 831 + } 832 + 833 + void tdx_vcpu_free(struct kvm_vcpu *vcpu) 834 + { 835 + struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm); 836 + struct vcpu_tdx *tdx = to_tdx(vcpu); 837 + int i; 838 + 839 + /* 840 + * It is not possible to reclaim pages while hkid is assigned. It might 841 + * be assigned if: 842 + * 1. the TD VM is being destroyed but freeing hkid failed, in which 843 + * case the pages are leaked 844 + * 2. TD VCPU creation failed and this is on the error path, in which 845 + * case there is nothing to do anyway 846 + */ 847 + if (is_hkid_assigned(kvm_tdx)) 848 + return; 849 + 850 + if (tdx->vp.tdcx_pages) { 851 + for (i = 0; i < kvm_tdx->td.tdcx_nr_pages; i++) { 852 + if (tdx->vp.tdcx_pages[i]) 853 + tdx_reclaim_control_page(tdx->vp.tdcx_pages[i]); 854 + } 855 + kfree(tdx->vp.tdcx_pages); 856 + tdx->vp.tdcx_pages = NULL; 857 + } 858 + if (tdx->vp.tdvpr_page) { 859 + tdx_reclaim_control_page(tdx->vp.tdvpr_page); 860 + tdx->vp.tdvpr_page = 0; 861 + } 862 + 863 + tdx->state = VCPU_TD_STATE_UNINITIALIZED; 864 + } 865 + 866 + int tdx_vcpu_pre_run(struct kvm_vcpu *vcpu) 867 + { 868 + if (unlikely(to_tdx(vcpu)->state != VCPU_TD_STATE_INITIALIZED || 869 + to_kvm_tdx(vcpu->kvm)->state != TD_STATE_RUNNABLE)) 870 + return -EINVAL; 871 + 872 + return 1; 873 + } 874 + 875 + static __always_inline u32 tdcall_to_vmx_exit_reason(struct kvm_vcpu *vcpu) 876 + { 877 + switch (tdvmcall_leaf(vcpu)) { 878 + case EXIT_REASON_CPUID: 879 + case EXIT_REASON_HLT: 880 + case EXIT_REASON_IO_INSTRUCTION: 881 + case EXIT_REASON_MSR_READ: 882 + case EXIT_REASON_MSR_WRITE: 883 + return 
tdvmcall_leaf(vcpu); 884 + case EXIT_REASON_EPT_VIOLATION: 885 + return EXIT_REASON_EPT_MISCONFIG; 886 + default: 887 + break; 888 + } 889 + 890 + return EXIT_REASON_TDCALL; 891 + } 892 + 893 + static __always_inline u32 tdx_to_vmx_exit_reason(struct kvm_vcpu *vcpu) 894 + { 895 + struct vcpu_tdx *tdx = to_tdx(vcpu); 896 + u32 exit_reason; 897 + 898 + switch (tdx->vp_enter_ret & TDX_SEAMCALL_STATUS_MASK) { 899 + case TDX_SUCCESS: 900 + case TDX_NON_RECOVERABLE_VCPU: 901 + case TDX_NON_RECOVERABLE_TD: 902 + case TDX_NON_RECOVERABLE_TD_NON_ACCESSIBLE: 903 + case TDX_NON_RECOVERABLE_TD_WRONG_APIC_MODE: 904 + break; 905 + default: 906 + return -1u; 907 + } 908 + 909 + exit_reason = tdx->vp_enter_ret; 910 + 911 + switch (exit_reason) { 912 + case EXIT_REASON_TDCALL: 913 + if (tdvmcall_exit_type(vcpu)) 914 + return EXIT_REASON_VMCALL; 915 + 916 + return tdcall_to_vmx_exit_reason(vcpu); 917 + case EXIT_REASON_EPT_MISCONFIG: 918 + /* 919 + * Defer KVM_BUG_ON() until tdx_handle_exit() because this is in 920 + * non-instrumentable code with interrupts disabled. 
921 + */ 922 + return -1u; 923 + default: 924 + break; 925 + } 926 + 927 + return exit_reason; 928 + } 929 + 930 + static noinstr void tdx_vcpu_enter_exit(struct kvm_vcpu *vcpu) 931 + { 932 + struct vcpu_tdx *tdx = to_tdx(vcpu); 933 + struct vcpu_vt *vt = to_vt(vcpu); 934 + 935 + guest_state_enter_irqoff(); 936 + 937 + tdx->vp_enter_ret = tdh_vp_enter(&tdx->vp, &tdx->vp_enter_args); 938 + 939 + vt->exit_reason.full = tdx_to_vmx_exit_reason(vcpu); 940 + 941 + vt->exit_qualification = tdx->vp_enter_args.rcx; 942 + tdx->ext_exit_qualification = tdx->vp_enter_args.rdx; 943 + tdx->exit_gpa = tdx->vp_enter_args.r8; 944 + vt->exit_intr_info = tdx->vp_enter_args.r9; 945 + 946 + vmx_handle_nmi(vcpu); 947 + 948 + guest_state_exit_irqoff(); 949 + } 950 + 951 + static bool tdx_failed_vmentry(struct kvm_vcpu *vcpu) 952 + { 953 + return vmx_get_exit_reason(vcpu).failed_vmentry && 954 + vmx_get_exit_reason(vcpu).full != -1u; 955 + } 956 + 957 + static fastpath_t tdx_exit_handlers_fastpath(struct kvm_vcpu *vcpu) 958 + { 959 + u64 vp_enter_ret = to_tdx(vcpu)->vp_enter_ret; 960 + 961 + /* 962 + * TDX_OPERAND_BUSY could be returned for SEPT due to 0-step mitigation 963 + * or for TD EPOCH due to contention with TDH.MEM.TRACK on TDH.VP.ENTER. 964 + * 965 + * When KVM requests KVM_REQ_OUTSIDE_GUEST_MODE, which has both 966 + * KVM_REQUEST_WAIT and KVM_REQUEST_NO_ACTION set, it requires target 967 + * vCPUs to leave the fastpath so that interrupts can be enabled and the 968 + * IPIs can be delivered. Return EXIT_FASTPATH_EXIT_HANDLED instead of 969 + * EXIT_FASTPATH_REENTER_GUEST to exit the fastpath; otherwise, the 970 + * requester may be blocked endlessly. 
971 + */ 972 + if (unlikely(tdx_operand_busy(vp_enter_ret))) 973 + return EXIT_FASTPATH_EXIT_HANDLED; 974 + 975 + return EXIT_FASTPATH_NONE; 976 + } 977 + 978 + #define TDX_REGS_AVAIL_SET (BIT_ULL(VCPU_EXREG_EXIT_INFO_1) | \ 979 + BIT_ULL(VCPU_EXREG_EXIT_INFO_2) | \ 980 + BIT_ULL(VCPU_REGS_RAX) | \ 981 + BIT_ULL(VCPU_REGS_RBX) | \ 982 + BIT_ULL(VCPU_REGS_RCX) | \ 983 + BIT_ULL(VCPU_REGS_RDX) | \ 984 + BIT_ULL(VCPU_REGS_RBP) | \ 985 + BIT_ULL(VCPU_REGS_RSI) | \ 986 + BIT_ULL(VCPU_REGS_RDI) | \ 987 + BIT_ULL(VCPU_REGS_R8) | \ 988 + BIT_ULL(VCPU_REGS_R9) | \ 989 + BIT_ULL(VCPU_REGS_R10) | \ 990 + BIT_ULL(VCPU_REGS_R11) | \ 991 + BIT_ULL(VCPU_REGS_R12) | \ 992 + BIT_ULL(VCPU_REGS_R13) | \ 993 + BIT_ULL(VCPU_REGS_R14) | \ 994 + BIT_ULL(VCPU_REGS_R15)) 995 + 996 + static void tdx_load_host_xsave_state(struct kvm_vcpu *vcpu) 997 + { 998 + struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm); 999 + 1000 + /* 1001 + * All TDX hosts support PKRU; but even if they didn't, 1002 + * vcpu->arch.host_pkru would be 0 and the wrpkru would be 1003 + * skipped. 1004 + */ 1005 + if (vcpu->arch.host_pkru != 0) 1006 + wrpkru(vcpu->arch.host_pkru); 1007 + 1008 + if (kvm_host.xcr0 != (kvm_tdx->xfam & kvm_caps.supported_xcr0)) 1009 + xsetbv(XCR_XFEATURE_ENABLED_MASK, kvm_host.xcr0); 1010 + 1011 + /* 1012 + * Likewise, even if a TDX host didn't support XSS, both arms of 1013 + * the comparison would be 0 and the wrmsrl would be skipped. 
1014 + */ 1015 + if (kvm_host.xss != (kvm_tdx->xfam & kvm_caps.supported_xss)) 1016 + wrmsrl(MSR_IA32_XSS, kvm_host.xss); 1017 + } 1018 + 1019 + #define TDX_DEBUGCTL_PRESERVED (DEBUGCTLMSR_BTF | \ 1020 + DEBUGCTLMSR_FREEZE_PERFMON_ON_PMI | \ 1021 + DEBUGCTLMSR_FREEZE_IN_SMM) 1022 + 1023 + fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit) 1024 + { 1025 + struct vcpu_tdx *tdx = to_tdx(vcpu); 1026 + struct vcpu_vt *vt = to_vt(vcpu); 1027 + 1028 + /* 1029 + * force_immediate_exit requires entering the vCPU for event injection, 1030 + * with an immediate exit to follow. But the TDX module doesn't guarantee 1031 + * entry; it's already possible for KVM to _think_ it completely entered 1032 + * the guest without actually having done so. 1033 + * Since KVM never needs to force an immediate exit for TDX, and can't 1034 + * do direct injection, just warn on force_immediate_exit. 1035 + */ 1036 + WARN_ON_ONCE(force_immediate_exit); 1037 + 1038 + /* 1039 + * Wait until retry of SEPT-zap-related SEAMCALL completes before 1040 + * allowing vCPU entry to avoid contention with tdh_vp_enter() and 1041 + * TDCALLs. 
1042 + */ 1043 + if (unlikely(READ_ONCE(to_kvm_tdx(vcpu->kvm)->wait_for_sept_zap))) 1044 + return EXIT_FASTPATH_EXIT_HANDLED; 1045 + 1046 + trace_kvm_entry(vcpu, force_immediate_exit); 1047 + 1048 + if (pi_test_on(&vt->pi_desc)) { 1049 + apic->send_IPI_self(POSTED_INTR_VECTOR); 1050 + 1051 + if (pi_test_pir(kvm_lapic_get_reg(vcpu->arch.apic, APIC_LVTT) & 1052 + APIC_VECTOR_MASK, &vt->pi_desc)) 1053 + kvm_wait_lapic_expire(vcpu); 1054 + } 1055 + 1056 + tdx_vcpu_enter_exit(vcpu); 1057 + 1058 + if (vt->host_debugctlmsr & ~TDX_DEBUGCTL_PRESERVED) 1059 + update_debugctlmsr(vt->host_debugctlmsr); 1060 + 1061 + tdx_load_host_xsave_state(vcpu); 1062 + tdx->guest_entered = true; 1063 + 1064 + vcpu->arch.regs_avail &= TDX_REGS_AVAIL_SET; 1065 + 1066 + if (unlikely(tdx->vp_enter_ret == EXIT_REASON_EPT_MISCONFIG)) 1067 + return EXIT_FASTPATH_NONE; 1068 + 1069 + if (unlikely((tdx->vp_enter_ret & TDX_SW_ERROR) == TDX_SW_ERROR)) 1070 + return EXIT_FASTPATH_NONE; 1071 + 1072 + if (unlikely(vmx_get_exit_reason(vcpu).basic == EXIT_REASON_MCE_DURING_VMENTRY)) 1073 + kvm_machine_check(); 1074 + 1075 + trace_kvm_exit(vcpu, KVM_ISA_VMX); 1076 + 1077 + if (unlikely(tdx_failed_vmentry(vcpu))) 1078 + return EXIT_FASTPATH_NONE; 1079 + 1080 + return tdx_exit_handlers_fastpath(vcpu); 1081 + } 1082 + 1083 + void tdx_inject_nmi(struct kvm_vcpu *vcpu) 1084 + { 1085 + ++vcpu->stat.nmi_injections; 1086 + td_management_write8(to_tdx(vcpu), TD_VCPU_PEND_NMI, 1); 1087 + /* 1088 + * From KVM's perspective, NMI injection is completed right after 1089 + * writing to PEND_NMI. KVM doesn't care whether an NMI is injected by 1090 + * the TDX module or not. 1091 + */ 1092 + vcpu->arch.nmi_injected = false; 1093 + /* 1094 + * TDX doesn't allow KVM to request an NMI-window exit. If there is 1095 + * still a pending vNMI, KVM cannot inject it back-to-back with the 1096 + * one already pending in the TDX module. Since the previous 1097 + * vNMI is still pending in the TDX module, i.e. 
it has not been delivered 1098 + * to TDX guest yet, it's OK to collapse the pending vNMI into the 1099 + * previous one. The guest is expected to handle all the NMI sources 1100 + * when handling the first vNMI. 1101 + */ 1102 + vcpu->arch.nmi_pending = 0; 1103 + } 1104 + 1105 + static int tdx_handle_exception_nmi(struct kvm_vcpu *vcpu) 1106 + { 1107 + u32 intr_info = vmx_get_intr_info(vcpu); 1108 + 1109 + /* 1110 + * Machine checks are handled by handle_exception_irqoff(), or by 1111 + * tdx_handle_exit() with TDX_NON_RECOVERABLE set if a #MC occurs on 1112 + * VM-Entry. NMIs are handled by tdx_vcpu_enter_exit(). 1113 + */ 1114 + if (is_nmi(intr_info) || is_machine_check(intr_info)) 1115 + return 1; 1116 + 1117 + vcpu->run->exit_reason = KVM_EXIT_EXCEPTION; 1118 + vcpu->run->ex.exception = intr_info & INTR_INFO_VECTOR_MASK; 1119 + vcpu->run->ex.error_code = 0; 1120 + 1121 + return 0; 1122 + } 1123 + 1124 + static int complete_hypercall_exit(struct kvm_vcpu *vcpu) 1125 + { 1126 + tdvmcall_set_return_code(vcpu, vcpu->run->hypercall.ret); 1127 + return 1; 1128 + } 1129 + 1130 + static int tdx_emulate_vmcall(struct kvm_vcpu *vcpu) 1131 + { 1132 + kvm_rax_write(vcpu, to_tdx(vcpu)->vp_enter_args.r10); 1133 + kvm_rbx_write(vcpu, to_tdx(vcpu)->vp_enter_args.r11); 1134 + kvm_rcx_write(vcpu, to_tdx(vcpu)->vp_enter_args.r12); 1135 + kvm_rdx_write(vcpu, to_tdx(vcpu)->vp_enter_args.r13); 1136 + kvm_rsi_write(vcpu, to_tdx(vcpu)->vp_enter_args.r14); 1137 + 1138 + return __kvm_emulate_hypercall(vcpu, 0, complete_hypercall_exit); 1139 + } 1140 + 1141 + /* 1142 + * Split into chunks and check interrupt pending between chunks. This allows 1143 + * for timely injection of interrupts to prevent issues with guest lockup 1144 + * detection. 
1145 + */ 1146 + #define TDX_MAP_GPA_MAX_LEN (2 * 1024 * 1024) 1147 + static void __tdx_map_gpa(struct vcpu_tdx *tdx); 1148 + 1149 + static int tdx_complete_vmcall_map_gpa(struct kvm_vcpu *vcpu) 1150 + { 1151 + struct vcpu_tdx *tdx = to_tdx(vcpu); 1152 + 1153 + if (vcpu->run->hypercall.ret) { 1154 + tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_INVALID_OPERAND); 1155 + tdx->vp_enter_args.r11 = tdx->map_gpa_next; 1156 + return 1; 1157 + } 1158 + 1159 + tdx->map_gpa_next += TDX_MAP_GPA_MAX_LEN; 1160 + if (tdx->map_gpa_next >= tdx->map_gpa_end) 1161 + return 1; 1162 + 1163 + /* 1164 + * Stop processing the remaining part if there is a pending interrupt 1165 + * that could qualify for delivery. Skip checking pending RVI for 1166 + * TDVMCALL_MAP_GPA, see comments in tdx_protected_apic_has_interrupt(). 1167 + */ 1168 + if (kvm_vcpu_has_events(vcpu)) { 1169 + tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_RETRY); 1170 + tdx->vp_enter_args.r11 = tdx->map_gpa_next; 1171 + return 1; 1172 + } 1173 + 1174 + __tdx_map_gpa(tdx); 1175 + return 0; 1176 + } 1177 + 1178 + static void __tdx_map_gpa(struct vcpu_tdx *tdx) 1179 + { 1180 + u64 gpa = tdx->map_gpa_next; 1181 + u64 size = tdx->map_gpa_end - tdx->map_gpa_next; 1182 + 1183 + if (size > TDX_MAP_GPA_MAX_LEN) 1184 + size = TDX_MAP_GPA_MAX_LEN; 1185 + 1186 + tdx->vcpu.run->exit_reason = KVM_EXIT_HYPERCALL; 1187 + tdx->vcpu.run->hypercall.nr = KVM_HC_MAP_GPA_RANGE; 1188 + /* 1189 + * In principle this should have been -KVM_ENOSYS, but userspace (QEMU <=9.2) 1190 + * assumed that vcpu->run->hypercall.ret is never changed by KVM and thus that 1191 + * it was always zero on KVM_EXIT_HYPERCALL. Since KVM is now overwriting 1192 + * vcpu->run->hypercall.ret, ensure that it is zero to not break QEMU.
1193 + */ 1194 + tdx->vcpu.run->hypercall.ret = 0; 1195 + tdx->vcpu.run->hypercall.args[0] = gpa & ~gfn_to_gpa(kvm_gfn_direct_bits(tdx->vcpu.kvm)); 1196 + tdx->vcpu.run->hypercall.args[1] = size / PAGE_SIZE; 1197 + tdx->vcpu.run->hypercall.args[2] = vt_is_tdx_private_gpa(tdx->vcpu.kvm, gpa) ? 1198 + KVM_MAP_GPA_RANGE_ENCRYPTED : 1199 + KVM_MAP_GPA_RANGE_DECRYPTED; 1200 + tdx->vcpu.run->hypercall.flags = KVM_EXIT_HYPERCALL_LONG_MODE; 1201 + 1202 + tdx->vcpu.arch.complete_userspace_io = tdx_complete_vmcall_map_gpa; 1203 + } 1204 + 1205 + static int tdx_map_gpa(struct kvm_vcpu *vcpu) 1206 + { 1207 + struct vcpu_tdx *tdx = to_tdx(vcpu); 1208 + u64 gpa = tdx->vp_enter_args.r12; 1209 + u64 size = tdx->vp_enter_args.r13; 1210 + u64 ret; 1211 + 1212 + /* 1213 + * Converting TDVMCALL_MAP_GPA to KVM_HC_MAP_GPA_RANGE requires 1214 + * userspace to enable KVM_CAP_EXIT_HYPERCALL with KVM_HC_MAP_GPA_RANGE 1215 + * bit set. If not, the error code is not defined in GHCI for TDX, use 1216 + * TDVMCALL_STATUS_INVALID_OPERAND for this case. 
1217 + */ 1218 + if (!user_exit_on_hypercall(vcpu->kvm, KVM_HC_MAP_GPA_RANGE)) { 1219 + ret = TDVMCALL_STATUS_INVALID_OPERAND; 1220 + goto error; 1221 + } 1222 + 1223 + if (gpa + size <= gpa || !kvm_vcpu_is_legal_gpa(vcpu, gpa) || 1224 + !kvm_vcpu_is_legal_gpa(vcpu, gpa + size - 1) || 1225 + (vt_is_tdx_private_gpa(vcpu->kvm, gpa) != 1226 + vt_is_tdx_private_gpa(vcpu->kvm, gpa + size - 1))) { 1227 + ret = TDVMCALL_STATUS_INVALID_OPERAND; 1228 + goto error; 1229 + } 1230 + 1231 + if (!PAGE_ALIGNED(gpa) || !PAGE_ALIGNED(size)) { 1232 + ret = TDVMCALL_STATUS_ALIGN_ERROR; 1233 + goto error; 1234 + } 1235 + 1236 + tdx->map_gpa_end = gpa + size; 1237 + tdx->map_gpa_next = gpa; 1238 + 1239 + __tdx_map_gpa(tdx); 1240 + return 0; 1241 + 1242 + error: 1243 + tdvmcall_set_return_code(vcpu, ret); 1244 + tdx->vp_enter_args.r11 = gpa; 1245 + return 1; 1246 + } 1247 + 1248 + static int tdx_report_fatal_error(struct kvm_vcpu *vcpu) 1249 + { 1250 + struct vcpu_tdx *tdx = to_tdx(vcpu); 1251 + u64 *regs = vcpu->run->system_event.data; 1252 + u64 *module_regs = &tdx->vp_enter_args.r8; 1253 + int index = VCPU_REGS_RAX; 1254 + 1255 + vcpu->run->exit_reason = KVM_EXIT_SYSTEM_EVENT; 1256 + vcpu->run->system_event.type = KVM_SYSTEM_EVENT_TDX_FATAL; 1257 + vcpu->run->system_event.ndata = 16; 1258 + 1259 + /* Dump 16 general-purpose registers to userspace in ascending order. 
*/ 1260 + regs[index++] = tdx->vp_enter_ret; 1261 + regs[index++] = tdx->vp_enter_args.rcx; 1262 + regs[index++] = tdx->vp_enter_args.rdx; 1263 + regs[index++] = tdx->vp_enter_args.rbx; 1264 + regs[index++] = 0; 1265 + regs[index++] = 0; 1266 + regs[index++] = tdx->vp_enter_args.rsi; 1267 + regs[index] = tdx->vp_enter_args.rdi; 1268 + for (index = 0; index < 8; index++) 1269 + regs[VCPU_REGS_R8 + index] = module_regs[index]; 1270 + 1271 + return 0; 1272 + } 1273 + 1274 + static int tdx_emulate_cpuid(struct kvm_vcpu *vcpu) 1275 + { 1276 + u32 eax, ebx, ecx, edx; 1277 + struct vcpu_tdx *tdx = to_tdx(vcpu); 1278 + 1279 + /* EAX and ECX for cpuid is stored in R12 and R13. */ 1280 + eax = tdx->vp_enter_args.r12; 1281 + ecx = tdx->vp_enter_args.r13; 1282 + 1283 + kvm_cpuid(vcpu, &eax, &ebx, &ecx, &edx, false); 1284 + 1285 + tdx->vp_enter_args.r12 = eax; 1286 + tdx->vp_enter_args.r13 = ebx; 1287 + tdx->vp_enter_args.r14 = ecx; 1288 + tdx->vp_enter_args.r15 = edx; 1289 + 1290 + return 1; 1291 + } 1292 + 1293 + static int tdx_complete_pio_out(struct kvm_vcpu *vcpu) 1294 + { 1295 + vcpu->arch.pio.count = 0; 1296 + return 1; 1297 + } 1298 + 1299 + static int tdx_complete_pio_in(struct kvm_vcpu *vcpu) 1300 + { 1301 + struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt; 1302 + unsigned long val = 0; 1303 + int ret; 1304 + 1305 + ret = ctxt->ops->pio_in_emulated(ctxt, vcpu->arch.pio.size, 1306 + vcpu->arch.pio.port, &val, 1); 1307 + 1308 + WARN_ON_ONCE(!ret); 1309 + 1310 + tdvmcall_set_return_val(vcpu, val); 1311 + 1312 + return 1; 1313 + } 1314 + 1315 + static int tdx_emulate_io(struct kvm_vcpu *vcpu) 1316 + { 1317 + struct vcpu_tdx *tdx = to_tdx(vcpu); 1318 + struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt; 1319 + unsigned long val = 0; 1320 + unsigned int port; 1321 + u64 size, write; 1322 + int ret; 1323 + 1324 + ++vcpu->stat.io_exits; 1325 + 1326 + size = tdx->vp_enter_args.r12; 1327 + write = tdx->vp_enter_args.r13; 1328 + port = tdx->vp_enter_args.r14; 1329 + 
1330 + if ((write != 0 && write != 1) || (size != 1 && size != 2 && size != 4)) { 1331 + tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_INVALID_OPERAND); 1332 + return 1; 1333 + } 1334 + 1335 + if (write) { 1336 + val = tdx->vp_enter_args.r15; 1337 + ret = ctxt->ops->pio_out_emulated(ctxt, size, port, &val, 1); 1338 + } else { 1339 + ret = ctxt->ops->pio_in_emulated(ctxt, size, port, &val, 1); 1340 + } 1341 + 1342 + if (!ret) 1343 + vcpu->arch.complete_userspace_io = write ? tdx_complete_pio_out : 1344 + tdx_complete_pio_in; 1345 + else if (!write) 1346 + tdvmcall_set_return_val(vcpu, val); 1347 + 1348 + return ret; 1349 + } 1350 + 1351 + static int tdx_complete_mmio_read(struct kvm_vcpu *vcpu) 1352 + { 1353 + unsigned long val = 0; 1354 + gpa_t gpa; 1355 + int size; 1356 + 1357 + gpa = vcpu->mmio_fragments[0].gpa; 1358 + size = vcpu->mmio_fragments[0].len; 1359 + 1360 + memcpy(&val, vcpu->run->mmio.data, size); 1361 + tdvmcall_set_return_val(vcpu, val); 1362 + trace_kvm_mmio(KVM_TRACE_MMIO_READ, size, gpa, &val); 1363 + return 1; 1364 + } 1365 + 1366 + static inline int tdx_mmio_write(struct kvm_vcpu *vcpu, gpa_t gpa, int size, 1367 + unsigned long val) 1368 + { 1369 + if (!kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) { 1370 + trace_kvm_fast_mmio(gpa); 1371 + return 0; 1372 + } 1373 + 1374 + trace_kvm_mmio(KVM_TRACE_MMIO_WRITE, size, gpa, &val); 1375 + if (kvm_io_bus_write(vcpu, KVM_MMIO_BUS, gpa, size, &val)) 1376 + return -EOPNOTSUPP; 1377 + 1378 + return 0; 1379 + } 1380 + 1381 + static inline int tdx_mmio_read(struct kvm_vcpu *vcpu, gpa_t gpa, int size) 1382 + { 1383 + unsigned long val; 1384 + 1385 + if (kvm_io_bus_read(vcpu, KVM_MMIO_BUS, gpa, size, &val)) 1386 + return -EOPNOTSUPP; 1387 + 1388 + tdvmcall_set_return_val(vcpu, val); 1389 + trace_kvm_mmio(KVM_TRACE_MMIO_READ, size, gpa, &val); 1390 + return 0; 1391 + } 1392 + 1393 + static int tdx_emulate_mmio(struct kvm_vcpu *vcpu) 1394 + { 1395 + struct vcpu_tdx *tdx = to_tdx(vcpu); 1396 + int 
size, write, r; 1397 + unsigned long val; 1398 + gpa_t gpa; 1399 + 1400 + size = tdx->vp_enter_args.r12; 1401 + write = tdx->vp_enter_args.r13; 1402 + gpa = tdx->vp_enter_args.r14; 1403 + val = write ? tdx->vp_enter_args.r15 : 0; 1404 + 1405 + if (size != 1 && size != 2 && size != 4 && size != 8) 1406 + goto error; 1407 + if (write != 0 && write != 1) 1408 + goto error; 1409 + 1410 + /* 1411 + * TDG.VP.VMCALL<MMIO> allows only shared GPAs; it makes no sense to 1412 + * do MMIO emulation for a private GPA. 1413 + */ 1414 + if (vt_is_tdx_private_gpa(vcpu->kvm, gpa) || 1415 + vt_is_tdx_private_gpa(vcpu->kvm, gpa + size - 1)) 1416 + goto error; 1417 + 1418 + gpa = gpa & ~gfn_to_gpa(kvm_gfn_direct_bits(vcpu->kvm)); 1419 + 1420 + if (write) 1421 + r = tdx_mmio_write(vcpu, gpa, size, val); 1422 + else 1423 + r = tdx_mmio_read(vcpu, gpa, size); 1424 + if (!r) 1425 + /* Kernel completed device emulation. */ 1426 + return 1; 1427 + 1428 + /* Forward the device emulation request to the userspace device model. */ 1429 + vcpu->mmio_is_write = write; 1430 + if (!write) 1431 + vcpu->arch.complete_userspace_io = tdx_complete_mmio_read; 1432 + 1433 + vcpu->run->mmio.phys_addr = gpa; 1434 + vcpu->run->mmio.len = size; 1435 + vcpu->run->mmio.is_write = write; 1436 + vcpu->run->exit_reason = KVM_EXIT_MMIO; 1437 + 1438 + if (write) { 1439 + memcpy(vcpu->run->mmio.data, &val, size); 1440 + } else { 1441 + vcpu->mmio_fragments[0].gpa = gpa; 1442 + vcpu->mmio_fragments[0].len = size; 1443 + trace_kvm_mmio(KVM_TRACE_MMIO_READ_UNSATISFIED, size, gpa, NULL); 1444 + } 1445 + return 0; 1446 + 1447 + error: 1448 + tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_INVALID_OPERAND); 1449 + return 1; 1450 + } 1451 + 1452 + static int tdx_get_td_vm_call_info(struct kvm_vcpu *vcpu) 1453 + { 1454 + struct vcpu_tdx *tdx = to_tdx(vcpu); 1455 + 1456 + if (tdx->vp_enter_args.r12) 1457 + tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_INVALID_OPERAND); 1458 + else { 1459 + tdx->vp_enter_args.r11 = 0; 1460 +
tdx->vp_enter_args.r13 = 0; 1461 + tdx->vp_enter_args.r14 = 0; 1462 + } 1463 + return 1; 1464 + } 1465 + 1466 + static int handle_tdvmcall(struct kvm_vcpu *vcpu) 1467 + { 1468 + switch (tdvmcall_leaf(vcpu)) { 1469 + case TDVMCALL_MAP_GPA: 1470 + return tdx_map_gpa(vcpu); 1471 + case TDVMCALL_REPORT_FATAL_ERROR: 1472 + return tdx_report_fatal_error(vcpu); 1473 + case TDVMCALL_GET_TD_VM_CALL_INFO: 1474 + return tdx_get_td_vm_call_info(vcpu); 1475 + default: 1476 + break; 1477 + } 1478 + 1479 + tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_INVALID_OPERAND); 1480 + return 1; 1481 + } 1482 + 1483 + void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level) 1484 + { 1485 + u64 shared_bit = (pgd_level == 5) ? TDX_SHARED_BIT_PWL_5 : 1486 + TDX_SHARED_BIT_PWL_4; 1487 + 1488 + if (KVM_BUG_ON(shared_bit != kvm_gfn_direct_bits(vcpu->kvm), vcpu->kvm)) 1489 + return; 1490 + 1491 + td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa); 1492 + } 1493 + 1494 + static void tdx_unpin(struct kvm *kvm, struct page *page) 1495 + { 1496 + put_page(page); 1497 + } 1498 + 1499 + static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn, 1500 + enum pg_level level, struct page *page) 1501 + { 1502 + int tdx_level = pg_level_to_tdx_sept_level(level); 1503 + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); 1504 + gpa_t gpa = gfn_to_gpa(gfn); 1505 + u64 entry, level_state; 1506 + u64 err; 1507 + 1508 + err = tdh_mem_page_aug(&kvm_tdx->td, gpa, tdx_level, page, &entry, &level_state); 1509 + if (unlikely(tdx_operand_busy(err))) { 1510 + tdx_unpin(kvm, page); 1511 + return -EBUSY; 1512 + } 1513 + 1514 + if (KVM_BUG_ON(err, kvm)) { 1515 + pr_tdx_error_2(TDH_MEM_PAGE_AUG, err, entry, level_state); 1516 + tdx_unpin(kvm, page); 1517 + return -EIO; 1518 + } 1519 + 1520 + return 0; 1521 + } 1522 + 1523 + /* 1524 + * KVM_TDX_INIT_MEM_REGION calls kvm_gmem_populate() to map guest pages; the 1525 + * callback tdx_gmem_post_populate() then maps pages into private memory. 
1526 + * The mapping is done via the SEAMCALL TDH.MEM.PAGE.ADD(). The SEAMCALL also requires the 1527 + * private EPT structures for the page to have been built beforehand, which is 1528 + * done via kvm_tdp_map_page(). nr_premapped counts the number of pages that 1529 + * were added to the EPT structures but not added with TDH.MEM.PAGE.ADD(). 1530 + * The counter has to be zero on KVM_TDX_FINALIZE_VM, to ensure that there 1531 + * are no half-initialized shared EPT pages. 1532 + */ 1533 + static int tdx_mem_page_record_premap_cnt(struct kvm *kvm, gfn_t gfn, 1534 + enum pg_level level, kvm_pfn_t pfn) 1535 + { 1536 + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); 1537 + 1538 + if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm)) 1539 + return -EINVAL; 1540 + 1541 + /* nr_premapped will be decreased when tdh_mem_page_add() is called. */ 1542 + atomic64_inc(&kvm_tdx->nr_premapped); 1543 + return 0; 1544 + } 1545 + 1546 + int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn, 1547 + enum pg_level level, kvm_pfn_t pfn) 1548 + { 1549 + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); 1550 + struct page *page = pfn_to_page(pfn); 1551 + 1552 + /* TODO: handle large pages. */ 1553 + if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm)) 1554 + return -EINVAL; 1555 + 1556 + /* 1557 + * Because guest_memfd doesn't support page migration with 1558 + * a_ops->migrate_folio (yet), no callback is triggered for KVM on page 1559 + * migration. Until guest_memfd supports page migration, prevent page 1560 + * migration. 1561 + * TODO: Once guest_memfd introduces a callback on page migration, 1562 + * implement it and remove get_page/put_page(). 1563 + */ 1564 + get_page(page); 1565 + 1566 + /* 1567 + * Read 'pre_fault_allowed' before 'kvm_tdx->state'; see matching 1568 + * barrier in tdx_td_finalize().
1569 + */ 1570 + smp_rmb(); 1571 + if (likely(kvm_tdx->state == TD_STATE_RUNNABLE)) 1572 + return tdx_mem_page_aug(kvm, gfn, level, page); 1573 + 1574 + return tdx_mem_page_record_premap_cnt(kvm, gfn, level, pfn); 1575 + } 1576 + 1577 + static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn, 1578 + enum pg_level level, struct page *page) 1579 + { 1580 + int tdx_level = pg_level_to_tdx_sept_level(level); 1581 + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); 1582 + gpa_t gpa = gfn_to_gpa(gfn); 1583 + u64 err, entry, level_state; 1584 + 1585 + /* TODO: handle large pages. */ 1586 + if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm)) 1587 + return -EINVAL; 1588 + 1589 + if (KVM_BUG_ON(!is_hkid_assigned(kvm_tdx), kvm)) 1590 + return -EINVAL; 1591 + 1592 + /* 1593 + * When zapping a private page, the write lock is held, so there is no 1594 + * race with SEPT operations on other vCPUs. Races with TDH.VP.ENTER 1595 + * (due to 0-step mitigation) and guest TDCALLs are still possible. 1596 + */ 1597 + err = tdh_mem_page_remove(&kvm_tdx->td, gpa, tdx_level, &entry, 1598 + &level_state); 1599 + 1600 + if (unlikely(tdx_operand_busy(err))) { 1601 + /* 1602 + * The second retry is expected to succeed after kicking off all 1603 + * other vCPUs and preventing them from invoking TDH.VP.ENTER.
1604 + */ 1605 + tdx_no_vcpus_enter_start(kvm); 1606 + err = tdh_mem_page_remove(&kvm_tdx->td, gpa, tdx_level, &entry, 1607 + &level_state); 1608 + tdx_no_vcpus_enter_stop(kvm); 1609 + } 1610 + 1611 + if (KVM_BUG_ON(err, kvm)) { 1612 + pr_tdx_error_2(TDH_MEM_PAGE_REMOVE, err, entry, level_state); 1613 + return -EIO; 1614 + } 1615 + 1616 + err = tdh_phymem_page_wbinvd_hkid((u16)kvm_tdx->hkid, page); 1617 + 1618 + if (KVM_BUG_ON(err, kvm)) { 1619 + pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err); 1620 + return -EIO; 1621 + } 1622 + tdx_clear_page(page); 1623 + tdx_unpin(kvm, page); 1624 + return 0; 1625 + } 1626 + 1627 + int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn, 1628 + enum pg_level level, void *private_spt) 1629 + { 1630 + int tdx_level = pg_level_to_tdx_sept_level(level); 1631 + gpa_t gpa = gfn_to_gpa(gfn); 1632 + struct page *page = virt_to_page(private_spt); 1633 + u64 err, entry, level_state; 1634 + 1635 + err = tdh_mem_sept_add(&to_kvm_tdx(kvm)->td, gpa, tdx_level, page, &entry, 1636 + &level_state); 1637 + if (unlikely(tdx_operand_busy(err))) 1638 + return -EBUSY; 1639 + 1640 + if (KVM_BUG_ON(err, kvm)) { 1641 + pr_tdx_error_2(TDH_MEM_SEPT_ADD, err, entry, level_state); 1642 + return -EIO; 1643 + } 1644 + 1645 + return 0; 1646 + } 1647 + 1648 + /* 1649 + * Check if the error returned from a SEPT zap SEAMCALL is due to a page being 1650 + * mapped by KVM_TDX_INIT_MEM_REGION without tdh_mem_page_add() having been 1651 + * called successfully. 1652 + * 1653 + * Since tdh_mem_sept_add() must have been invoked successfully before a 1654 + * non-leaf entry is present in the mirrored page table, the SEPT-zap-related 1655 + * SEAMCALLs should not encounter the error TDX_EPT_WALK_FAILED. They should 1656 + * instead find TDX_EPT_ENTRY_STATE_INCORRECT due to an empty leaf entry found 1657 + * in the SEPT. 1658 + * 1659 + * Further check whether the entry returned by the SEPT walk has RWX 1660 + * permissions, to filter out anything unexpected.
1661 + * 1662 + * Note: @level is pg_level, not the tdx_level. The tdx_level extracted from 1663 + * level_state returned from a SEAMCALL error is the same as that passed into 1664 + * the SEAMCALL. 1665 + */ 1666 + static int tdx_is_sept_zap_err_due_to_premap(struct kvm_tdx *kvm_tdx, u64 err, 1667 + u64 entry, int level) 1668 + { 1669 + if (!err || kvm_tdx->state == TD_STATE_RUNNABLE) 1670 + return false; 1671 + 1672 + if (err != (TDX_EPT_ENTRY_STATE_INCORRECT | TDX_OPERAND_ID_RCX)) 1673 + return false; 1674 + 1675 + if ((is_last_spte(entry, level) && (entry & VMX_EPT_RWX_MASK))) 1676 + return false; 1677 + 1678 + return true; 1679 + } 1680 + 1681 + static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn, 1682 + enum pg_level level, struct page *page) 1683 + { 1684 + int tdx_level = pg_level_to_tdx_sept_level(level); 1685 + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); 1686 + gpa_t gpa = gfn_to_gpa(gfn) & KVM_HPAGE_MASK(level); 1687 + u64 err, entry, level_state; 1688 + 1689 + /* For now large page isn't supported yet. */ 1690 + WARN_ON_ONCE(level != PG_LEVEL_4K); 1691 + 1692 + err = tdh_mem_range_block(&kvm_tdx->td, gpa, tdx_level, &entry, &level_state); 1693 + 1694 + if (unlikely(tdx_operand_busy(err))) { 1695 + /* After no vCPUs enter, the second retry is expected to succeed */ 1696 + tdx_no_vcpus_enter_start(kvm); 1697 + err = tdh_mem_range_block(&kvm_tdx->td, gpa, tdx_level, &entry, &level_state); 1698 + tdx_no_vcpus_enter_stop(kvm); 1699 + } 1700 + if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level) && 1701 + !KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm)) { 1702 + atomic64_dec(&kvm_tdx->nr_premapped); 1703 + tdx_unpin(kvm, page); 1704 + return 0; 1705 + } 1706 + 1707 + if (KVM_BUG_ON(err, kvm)) { 1708 + pr_tdx_error_2(TDH_MEM_RANGE_BLOCK, err, entry, level_state); 1709 + return -EIO; 1710 + } 1711 + return 1; 1712 + } 1713 + 1714 + /* 1715 + * Ensure shared and private EPTs to be flushed on all vCPUs. 
1716 + * tdh_mem_track() is the only caller that increases the TD epoch. An increase 1717 + * in the TD epoch (e.g., to value "N + 1") is successful only if no vCPUs are 1718 + * running in guest mode with the value "N - 1". 1719 + * 1720 + * A successful execution of tdh_mem_track() ensures that vCPUs can only run in 1721 + * guest mode with TD epoch value "N" if no TD exit occurs after the TD epoch 1722 + * is increased to "N + 1". 1723 + * 1724 + * Kicking off all vCPUs after that further ensures that no vCPUs can run in 1725 + * guest mode with TD epoch value "N", which unblocks the next tdh_mem_track() 1726 + * (e.g. to increase the TD epoch to "N + 2"). 1727 + * 1728 + * The TDX module will flush the EPT on the next TD enter and make vCPUs run in 1729 + * guest mode with TD epoch value "N + 1". 1730 + * 1731 + * kvm_make_all_cpus_request() guarantees all vCPUs are out of guest mode by 1732 + * waiting for the empty IPI handler ack_kick(). 1733 + * 1734 + * No action is required for the vCPUs being kicked off, since the kick-off 1735 + * is guaranteed to occur after the TD epoch increment and before the next 1736 + * tdh_mem_track(). 1737 + */ 1738 + static void tdx_track(struct kvm *kvm) 1739 + { 1740 + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); 1741 + u64 err; 1742 + 1743 + /* If the TD isn't finalized, no vCPU has run yet.
*/ 1744 + if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE)) 1745 + return; 1746 + 1747 + lockdep_assert_held_write(&kvm->mmu_lock); 1748 + 1749 + err = tdh_mem_track(&kvm_tdx->td); 1750 + if (unlikely(tdx_operand_busy(err))) { 1751 + /* After no vCPUs enter, the second retry is expected to succeed */ 1752 + tdx_no_vcpus_enter_start(kvm); 1753 + err = tdh_mem_track(&kvm_tdx->td); 1754 + tdx_no_vcpus_enter_stop(kvm); 1755 + } 1756 + 1757 + if (KVM_BUG_ON(err, kvm)) 1758 + pr_tdx_error(TDH_MEM_TRACK, err); 1759 + 1760 + kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE); 1761 + } 1762 + 1763 + int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn, 1764 + enum pg_level level, void *private_spt) 1765 + { 1766 + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); 1767 + 1768 + /* 1769 + * free_external_spt() is only called after the HKID is freed, when the 1770 + * TD is being torn down. 1771 + * KVM doesn't (yet) zap page table pages in the mirror page table while 1772 + * the TD is active, though guest pages mapped in the mirror page table 1773 + * could be zapped while the TD is active, e.g. for shared <-> private 1774 + * conversion and slot move/deletion. 1775 + */ 1776 + if (KVM_BUG_ON(is_hkid_assigned(kvm_tdx), kvm)) 1777 + return -EINVAL; 1778 + 1779 + /* 1780 + * The HKID assigned to this TD was already freed and the cache was 1781 + * already flushed. We don't have to flush again. 1782 + */ 1783 + return tdx_reclaim_page(virt_to_page(private_spt)); 1784 + } 1785 + 1786 + int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn, 1787 + enum pg_level level, kvm_pfn_t pfn) 1788 + { 1789 + struct page *page = pfn_to_page(pfn); 1790 + int ret; 1791 + 1792 + /* 1793 + * HKID is released after all private pages have been removed, and set 1794 + * before any might be populated. Warn if zapping is attempted when 1795 + * there can't be anything populated in the private EPT.
1796 + */ 1797 + if (KVM_BUG_ON(!is_hkid_assigned(to_kvm_tdx(kvm)), kvm)) 1798 + return -EINVAL; 1799 + 1800 + ret = tdx_sept_zap_private_spte(kvm, gfn, level, page); 1801 + if (ret <= 0) 1802 + return ret; 1803 + 1804 + /* 1805 + * TDX requires TLB tracking before dropping private page. Do 1806 + * it here, although it is also done later. 1807 + */ 1808 + tdx_track(kvm); 1809 + 1810 + return tdx_sept_drop_private_spte(kvm, gfn, level, page); 1811 + } 1812 + 1813 + void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode, 1814 + int trig_mode, int vector) 1815 + { 1816 + struct kvm_vcpu *vcpu = apic->vcpu; 1817 + struct vcpu_tdx *tdx = to_tdx(vcpu); 1818 + 1819 + /* TDX supports only posted interrupt. No lapic emulation. */ 1820 + __vmx_deliver_posted_interrupt(vcpu, &tdx->vt.pi_desc, vector); 1821 + 1822 + trace_kvm_apicv_accept_irq(vcpu->vcpu_id, delivery_mode, trig_mode, vector); 1823 + } 1824 + 1825 + static inline bool tdx_is_sept_violation_unexpected_pending(struct kvm_vcpu *vcpu) 1826 + { 1827 + u64 eeq_type = to_tdx(vcpu)->ext_exit_qualification & TDX_EXT_EXIT_QUAL_TYPE_MASK; 1828 + u64 eq = vmx_get_exit_qual(vcpu); 1829 + 1830 + if (eeq_type != TDX_EXT_EXIT_QUAL_TYPE_PENDING_EPT_VIOLATION) 1831 + return false; 1832 + 1833 + return !(eq & EPT_VIOLATION_PROT_MASK) && !(eq & EPT_VIOLATION_EXEC_FOR_RING3_LIN); 1834 + } 1835 + 1836 + static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu) 1837 + { 1838 + unsigned long exit_qual; 1839 + gpa_t gpa = to_tdx(vcpu)->exit_gpa; 1840 + bool local_retry = false; 1841 + int ret; 1842 + 1843 + if (vt_is_tdx_private_gpa(vcpu->kvm, gpa)) { 1844 + if (tdx_is_sept_violation_unexpected_pending(vcpu)) { 1845 + pr_warn("Guest access before accepting 0x%llx on vCPU %d\n", 1846 + gpa, vcpu->vcpu_id); 1847 + kvm_vm_dead(vcpu->kvm); 1848 + return -EIO; 1849 + } 1850 + /* 1851 + * Always treat SEPT violations as write faults. Ignore the 1852 + * EXIT_QUALIFICATION reported by TDX-SEAM for SEPT violations. 
1853 + * TD private pages are always RWX in the SEPT tables, 1854 + * i.e. they're always mapped writable. Just as importantly, 1855 + * treating SEPT violations as write faults is necessary to 1856 + * avoid COW allocations, which will cause TDAUGPAGE failures 1857 + * due to aliasing a single HPA to multiple GPAs. 1858 + */ 1859 + exit_qual = EPT_VIOLATION_ACC_WRITE; 1860 + 1861 + /* Only private GPA triggers zero-step mitigation */ 1862 + local_retry = true; 1863 + } else { 1864 + exit_qual = vmx_get_exit_qual(vcpu); 1865 + /* 1866 + * EPT violation due to instruction fetch should never be 1867 + * triggered from shared memory in TDX guest. If such EPT 1868 + * violation occurs, treat it as broken hardware. 1869 + */ 1870 + if (KVM_BUG_ON(exit_qual & EPT_VIOLATION_ACC_INSTR, vcpu->kvm)) 1871 + return -EIO; 1872 + } 1873 + 1874 + trace_kvm_page_fault(vcpu, gpa, exit_qual); 1875 + 1876 + /* 1877 + * To minimize TDH.VP.ENTER invocations, retry locally for private GPA 1878 + * mapping in TDX. 1879 + * 1880 + * KVM may return RET_PF_RETRY for private GPA due to 1881 + * - contentions when atomically updating SPTEs of the mirror page table 1882 + * - in-progress GFN invalidation or memslot removal. 1883 + * - TDX_OPERAND_BUSY error from TDH.MEM.PAGE.AUG or TDH.MEM.SEPT.ADD, 1884 + * caused by contentions with TDH.VP.ENTER (with zero-step mitigation) 1885 + * or certain TDCALLs. 1886 + * 1887 + * If TDH.VP.ENTER is invoked more times than the threshold set by the 1888 + * TDX module before KVM resolves the private GPA mapping, the TDX 1889 + * module will activate zero-step mitigation during TDH.VP.ENTER. This 1890 + * process acquires an SEPT tree lock in the TDX module, leading to 1891 + * further contentions with TDH.MEM.PAGE.AUG or TDH.MEM.SEPT.ADD 1892 + * operations on other vCPUs. 1893 + * 1894 + * Breaking out of local retries for kvm_vcpu_has_events() is for 1895 + * interrupt injection. kvm_vcpu_has_events() should not see pending 1896 + * events for TDX. 
	 * Since KVM can't determine if IRQs (or NMIs) are
	 * blocked by TDs, false positives are inevitable, i.e., KVM may
	 * re-enter the guest even if the IRQ/NMI can't be delivered.
	 *
	 * Note: even without breaking out of local retries, zero-step
	 * mitigation may still occur due to
	 * - invoking of TDH.VP.ENTER after KVM_EXIT_MEMORY_FAULT,
	 * - a single RIP causing EPT violations for more GFNs than the
	 *   threshold count.
	 * This is safe, as triggering zero-step mitigation only introduces
	 * contention on the page installation SEAMCALLs on other vCPUs, which
	 * will handle retries locally in their EPT violation handlers.
	 */
	while (1) {
		ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);

		if (ret != RET_PF_RETRY || !local_retry)
			break;

		if (kvm_vcpu_has_events(vcpu) || signal_pending(current))
			break;

		if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu)) {
			ret = -EIO;
			break;
		}

		cond_resched();
	}
	return ret;
}

int tdx_complete_emulated_msr(struct kvm_vcpu *vcpu, int err)
{
	if (err) {
		tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_INVALID_OPERAND);
		return 1;
	}

	if (vmx_get_exit_reason(vcpu).basic == EXIT_REASON_MSR_READ)
		tdvmcall_set_return_val(vcpu, kvm_read_edx_eax(vcpu));

	return 1;
}

int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
{
	struct vcpu_tdx *tdx = to_tdx(vcpu);
	u64 vp_enter_ret = tdx->vp_enter_ret;
	union vmx_exit_reason exit_reason = vmx_get_exit_reason(vcpu);

	if (fastpath != EXIT_FASTPATH_NONE)
		return 1;

	if (unlikely(vp_enter_ret == EXIT_REASON_EPT_MISCONFIG)) {
		KVM_BUG_ON(1, vcpu->kvm);
		return -EIO;
	}

	/*
	 * Handle TDX SW errors, including TDX_SEAMCALL_UD, TDX_SEAMCALL_GP and
	 * TDX_SEAMCALL_VMFAILINVALID.
	 */
	if (unlikely((vp_enter_ret & TDX_SW_ERROR) == TDX_SW_ERROR)) {
		KVM_BUG_ON(!kvm_rebooting, vcpu->kvm);
		goto unhandled_exit;
	}

	if (unlikely(tdx_failed_vmentry(vcpu))) {
		/*
		 * If the guest state is protected, that means off-TD debug is
		 * not enabled, and TDX_NON_RECOVERABLE must be set.
		 */
		WARN_ON_ONCE(vcpu->arch.guest_state_protected &&
			     !(vp_enter_ret & TDX_NON_RECOVERABLE));
		vcpu->run->exit_reason = KVM_EXIT_FAIL_ENTRY;
		vcpu->run->fail_entry.hardware_entry_failure_reason = exit_reason.full;
		vcpu->run->fail_entry.cpu = vcpu->arch.last_vmentry_cpu;
		return 0;
	}

	if (unlikely(vp_enter_ret & (TDX_ERROR | TDX_NON_RECOVERABLE)) &&
	    exit_reason.basic != EXIT_REASON_TRIPLE_FAULT) {
		kvm_pr_unimpl("TD vp_enter_ret 0x%llx\n", vp_enter_ret);
		goto unhandled_exit;
	}

	WARN_ON_ONCE(exit_reason.basic != EXIT_REASON_TRIPLE_FAULT &&
		     (vp_enter_ret & TDX_SEAMCALL_STATUS_MASK) != TDX_SUCCESS);

	switch (exit_reason.basic) {
	case EXIT_REASON_TRIPLE_FAULT:
		vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
		vcpu->mmio_needed = 0;
		return 0;
	case EXIT_REASON_EXCEPTION_NMI:
		return tdx_handle_exception_nmi(vcpu);
	case EXIT_REASON_EXTERNAL_INTERRUPT:
		++vcpu->stat.irq_exits;
		return 1;
	case EXIT_REASON_CPUID:
		return tdx_emulate_cpuid(vcpu);
	case EXIT_REASON_HLT:
		return kvm_emulate_halt_noskip(vcpu);
	case EXIT_REASON_TDCALL:
		return handle_tdvmcall(vcpu);
	case EXIT_REASON_VMCALL:
		return tdx_emulate_vmcall(vcpu);
	case EXIT_REASON_IO_INSTRUCTION:
		return tdx_emulate_io(vcpu);
	case EXIT_REASON_MSR_READ:
		kvm_rcx_write(vcpu, tdx->vp_enter_args.r12);
		return kvm_emulate_rdmsr(vcpu);
	case EXIT_REASON_MSR_WRITE:
		kvm_rcx_write(vcpu, tdx->vp_enter_args.r12);
		kvm_rax_write(vcpu, tdx->vp_enter_args.r13 & -1u);
		kvm_rdx_write(vcpu, tdx->vp_enter_args.r13 >> 32);
		return kvm_emulate_wrmsr(vcpu);
	case EXIT_REASON_EPT_MISCONFIG:
		return tdx_emulate_mmio(vcpu);
	case EXIT_REASON_EPT_VIOLATION:
		return tdx_handle_ept_violation(vcpu);
	case EXIT_REASON_OTHER_SMI:
		/*
		 * Unlike VMX, an SMI in SEAM non-root mode (i.e. when a TD
		 * guest vCPU is running) causes a VM exit to the TDX module,
		 * then a SEAMRET to KVM. Once it exits to KVM, the SMI is
		 * delivered and handled by the kernel handler right away.
		 *
		 * The Other SMI exit can also be caused by the SEAM non-root
		 * machine check delivered via Machine Check System Management
		 * Interrupt (MSMI), but it has already been handled by the
		 * kernel machine check handler, i.e., the memory page has been
		 * marked as poisoned and it won't be freed to the free list
		 * when the TDX guest is terminated (the TDX module marks the
		 * guest as dead and prevents it from running further when a
		 * machine check happens in SEAM non-root).
		 *
		 * - An MSMI will not reach here; it's handled as the
		 *   non-recoverable case above.
		 * - If it's not an MSMI, no need to do anything here.
		 */
		return 1;
	default:
		break;
	}

unhandled_exit:
	vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
	vcpu->run->internal.suberror = KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON;
	vcpu->run->internal.ndata = 2;
	vcpu->run->internal.data[0] = vp_enter_ret;
	vcpu->run->internal.data[1] = vcpu->arch.last_vmentry_cpu;
	return 0;
}

void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
		       u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code)
{
	struct vcpu_tdx *tdx = to_tdx(vcpu);

	*reason = tdx->vt.exit_reason.full;
	if (*reason != -1u) {
		*info1 = vmx_get_exit_qual(vcpu);
		*info2 = tdx->ext_exit_qualification;
		*intr_info = vmx_get_intr_info(vcpu);
	} else {
		*info1 = 0;
		*info2 = 0;
		*intr_info = 0;
	}

	*error_code = 0;
}

bool tdx_has_emulated_msr(u32 index)
{
	switch (index) {
	case MSR_IA32_UCODE_REV:
	case MSR_IA32_ARCH_CAPABILITIES:
	case MSR_IA32_POWER_CTL:
	case MSR_IA32_CR_PAT:
	case MSR_MTRRcap:
	case MTRRphysBase_MSR(0) ... MSR_MTRRfix4K_F8000:
	case MSR_MTRRdefType:
	case MSR_IA32_TSC_DEADLINE:
	case MSR_IA32_MISC_ENABLE:
	case MSR_PLATFORM_INFO:
	case MSR_MISC_FEATURES_ENABLES:
	case MSR_IA32_APICBASE:
	case MSR_EFER:
	case MSR_IA32_FEAT_CTL:
	case MSR_IA32_MCG_CAP:
	case MSR_IA32_MCG_STATUS:
	case MSR_IA32_MCG_CTL:
	case MSR_IA32_MCG_EXT_CTL:
	case MSR_IA32_MC0_CTL ... MSR_IA32_MCx_CTL(KVM_MAX_MCE_BANKS) - 1:
	case MSR_IA32_MC0_CTL2 ... MSR_IA32_MCx_CTL2(KVM_MAX_MCE_BANKS) - 1:
		/* MSR_IA32_MCx_{CTL, STATUS, ADDR, MISC, CTL2} */
	case MSR_KVM_POLL_CONTROL:
		return true;
	case APIC_BASE_MSR ... APIC_BASE_MSR + 0xff:
		/*
		 * x2APIC registers that are virtualized by the CPU can't be
		 * emulated, KVM doesn't have access to the virtual APIC page.
		 */
		switch (index) {
		case X2APIC_MSR(APIC_TASKPRI):
		case X2APIC_MSR(APIC_PROCPRI):
		case X2APIC_MSR(APIC_EOI):
		case X2APIC_MSR(APIC_ISR) ... X2APIC_MSR(APIC_ISR + APIC_ISR_NR):
		case X2APIC_MSR(APIC_TMR) ... X2APIC_MSR(APIC_TMR + APIC_ISR_NR):
		case X2APIC_MSR(APIC_IRR) ... X2APIC_MSR(APIC_IRR + APIC_ISR_NR):
			return false;
		default:
			return true;
		}
	default:
		return false;
	}
}

static bool tdx_is_read_only_msr(u32 index)
{
	return index == MSR_IA32_APICBASE || index == MSR_EFER ||
	       index == MSR_IA32_FEAT_CTL;
}

int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
{
	switch (msr->index) {
	case MSR_IA32_FEAT_CTL:
		/*
		 * MCE and MCA are advertised via cpuid. The guest kernel can
		 * check whether LMCE is enabled.
		 */
		msr->data = FEAT_CTL_LOCKED;
		if (vcpu->arch.mcg_cap & MCG_LMCE_P)
			msr->data |= FEAT_CTL_LMCE_ENABLED;
		return 0;
	case MSR_IA32_MCG_EXT_CTL:
		if (!msr->host_initiated && !(vcpu->arch.mcg_cap & MCG_LMCE_P))
			return 1;
		msr->data = vcpu->arch.mcg_ext_ctl;
		return 0;
	default:
		if (!tdx_has_emulated_msr(msr->index))
			return 1;

		return kvm_get_msr_common(vcpu, msr);
	}
}

int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
{
	switch (msr->index) {
	case MSR_IA32_MCG_EXT_CTL:
		if ((!msr->host_initiated && !(vcpu->arch.mcg_cap & MCG_LMCE_P)) ||
		    (msr->data & ~MCG_EXT_CTL_LMCE_EN))
			return 1;
		vcpu->arch.mcg_ext_ctl = msr->data;
		return 0;
	default:
		if (tdx_is_read_only_msr(msr->index))
			return 1;

		if (!tdx_has_emulated_msr(msr->index))
			return 1;

		return kvm_set_msr_common(vcpu, msr);
	}
}

static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
{
	const struct tdx_sys_info_td_conf *td_conf = &tdx_sysinfo->td_conf;
	struct kvm_tdx_capabilities __user *user_caps;
	struct kvm_tdx_capabilities *caps = NULL;
	int ret = 0;

	/* flags is reserved for future use */
	if (cmd->flags)
		return -EINVAL;

	caps = kmalloc(sizeof(*caps) +
		       sizeof(struct kvm_cpuid_entry2) * td_conf->num_cpuid_config,
		       GFP_KERNEL);
	if (!caps)
		return -ENOMEM;

	user_caps = u64_to_user_ptr(cmd->data);
	if (copy_from_user(caps, user_caps, sizeof(*caps))) {
		ret = -EFAULT;
		goto out;
	}

	if (caps->cpuid.nent < td_conf->num_cpuid_config) {
		ret = -E2BIG;
		goto out;
	}

	ret = init_kvm_tdx_caps(td_conf, caps);
	if (ret)
		goto out;

	if (copy_to_user(user_caps, caps, sizeof(*caps))) {
		ret = -EFAULT;
		goto out;
	}

	if (copy_to_user(user_caps->cpuid.entries, caps->cpuid.entries,
			 caps->cpuid.nent *
			 sizeof(caps->cpuid.entries[0])))
		ret = -EFAULT;

out:
	/* kfree() accepts NULL. */
	kfree(caps);
	return ret;
}

/*
 * KVM reports the guest physical address width in CPUID.0x80000008.EAX[23:16],
 * which is similar to TDX's GPAW. Use this field as the interface for
 * userspace to configure the GPAW and EPT level for TDs.
 *
 * Only values 48 and 52 are supported. Value 52 means GPAW-52 and EPT level 5;
 * value 48 means GPAW-48 and EPT level 4. For value 48, GPAW-48 is always
 * supported. Value 52 is only supported when the platform supports 5-level
 * EPT.
 */
static int setup_tdparams_eptp_controls(struct kvm_cpuid2 *cpuid,
					struct td_params *td_params)
{
	const struct kvm_cpuid_entry2 *entry;
	int guest_pa;

	entry = kvm_find_cpuid_entry2(cpuid->entries, cpuid->nent, 0x80000008, 0);
	if (!entry)
		return -EINVAL;

	guest_pa = tdx_get_guest_phys_addr_bits(entry->eax);

	if (guest_pa != 48 && guest_pa != 52)
		return -EINVAL;

	if (guest_pa == 52 && !cpu_has_vmx_ept_5levels())
		return -EINVAL;

	td_params->eptp_controls = VMX_EPTP_MT_WB;
	if (guest_pa == 52) {
		td_params->eptp_controls |= VMX_EPTP_PWL_5;
		td_params->config_flags |= TDX_CONFIG_FLAGS_MAX_GPAW;
	} else {
		td_params->eptp_controls |= VMX_EPTP_PWL_4;
	}

	return 0;
}

static int setup_tdparams_cpuids(struct kvm_cpuid2 *cpuid,
				 struct td_params *td_params)
{
	const struct tdx_sys_info_td_conf *td_conf = &tdx_sysinfo->td_conf;
	const struct kvm_cpuid_entry2 *entry;
	struct tdx_cpuid_value *value;
	int i, copy_cnt = 0;

	/*
	 * td_params.cpuid_values: the number and the order of cpuid_value
	 * entries must match those of struct tdsysinfo.{num_cpuid_config,
	 * cpuid_configs}. It's assumed that td_params was zeroed.
	 */
	for (i = 0; i < td_conf->num_cpuid_config; i++) {
		struct kvm_cpuid_entry2 tmp;

		td_init_cpuid_entry2(&tmp, i);

		entry = kvm_find_cpuid_entry2(cpuid->entries, cpuid->nent,
					      tmp.function, tmp.index);
		if (!entry)
			continue;

		if (tdx_unsupported_cpuid(entry))
			return -EINVAL;

		copy_cnt++;

		value = &td_params->cpuid_values[i];
		value->eax = entry->eax;
		value->ebx = entry->ebx;
		value->ecx = entry->ecx;
		value->edx = entry->edx;

		/*
		 * The TDX module does not accept nonzero bits 16..23 for
		 * CPUID[0x80000008].EAX, see setup_tdparams_eptp_controls().
		 */
		if (tmp.function == 0x80000008)
			value->eax = tdx_set_guest_phys_addr_bits(value->eax, 0);
	}

	/*
	 * Rely on the TDX module to reject invalid configuration, but it can't
	 * check leaves that don't have a proper slot in td_params->cpuid_values
	 * to stick them in. So fail if there were entries that didn't get
	 * copied to td_params.
	 */
	if (copy_cnt != cpuid->nent)
		return -EINVAL;

	return 0;
}

static int setup_tdparams(struct kvm *kvm, struct td_params *td_params,
			  struct kvm_tdx_init_vm *init_vm)
{
	const struct tdx_sys_info_td_conf *td_conf = &tdx_sysinfo->td_conf;
	struct kvm_cpuid2 *cpuid = &init_vm->cpuid;
	int ret;

	if (kvm->created_vcpus)
		return -EBUSY;

	if (init_vm->attributes & ~tdx_get_supported_attrs(td_conf))
		return -EINVAL;

	if (init_vm->xfam & ~tdx_get_supported_xfam(td_conf))
		return -EINVAL;

	td_params->max_vcpus = kvm->max_vcpus;
	td_params->attributes = init_vm->attributes | td_conf->attributes_fixed1;
	td_params->xfam = init_vm->xfam | td_conf->xfam_fixed1;

	td_params->config_flags = TDX_CONFIG_FLAGS_NO_RBP_MOD;
	td_params->tsc_frequency = TDX_TSC_KHZ_TO_25MHZ(kvm->arch.default_tsc_khz);

	ret = setup_tdparams_eptp_controls(cpuid, td_params);
	if (ret)
		return ret;

	ret = setup_tdparams_cpuids(cpuid, td_params);
	if (ret)
		return ret;

#define MEMCPY_SAME_SIZE(dst, src)				\
	do {							\
		BUILD_BUG_ON(sizeof(dst) != sizeof(src));	\
		memcpy((dst), (src), sizeof(dst));		\
	} while (0)

	MEMCPY_SAME_SIZE(td_params->mrconfigid, init_vm->mrconfigid);
	MEMCPY_SAME_SIZE(td_params->mrowner, init_vm->mrowner);
	MEMCPY_SAME_SIZE(td_params->mrownerconfig, init_vm->mrownerconfig);

	return 0;
}

static int __tdx_td_init(struct kvm *kvm, struct td_params *td_params,
			 u64 *seamcall_err)
{
	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
	cpumask_var_t packages;
	struct page **tdcs_pages = NULL;
	struct page *tdr_page;
	int ret, i;
	u64 err, rcx;

	*seamcall_err = 0;
	ret = tdx_guest_keyid_alloc();
	if (ret < 0)
		return ret;
	kvm_tdx->hkid = ret;
	kvm_tdx->misc_cg = get_current_misc_cg();
	ret = misc_cg_try_charge(MISC_CG_RES_TDX, kvm_tdx->misc_cg, 1);
	if (ret)
		goto free_hkid;

	ret = -ENOMEM;

	atomic_inc(&nr_configured_hkid);

	tdr_page = alloc_page(GFP_KERNEL);
	if (!tdr_page)
		goto free_hkid;

	kvm_tdx->td.tdcs_nr_pages = tdx_sysinfo->td_ctrl.tdcs_base_size / PAGE_SIZE;
	/* TDVPS = TDVPR(4K page) + TDCX(multiple 4K pages), -1 for TDVPR. */
	kvm_tdx->td.tdcx_nr_pages = tdx_sysinfo->td_ctrl.tdvps_base_size / PAGE_SIZE - 1;
	tdcs_pages = kcalloc(kvm_tdx->td.tdcs_nr_pages, sizeof(*kvm_tdx->td.tdcs_pages),
			     GFP_KERNEL | __GFP_ZERO);
	if (!tdcs_pages)
		goto free_tdr;

	for (i = 0; i < kvm_tdx->td.tdcs_nr_pages; i++) {
		tdcs_pages[i] = alloc_page(GFP_KERNEL);
		if (!tdcs_pages[i])
			goto free_tdcs;
	}

	if (!zalloc_cpumask_var(&packages, GFP_KERNEL))
		goto free_tdcs;

	cpus_read_lock();

	/*
	 * At least one CPU of each package must be online in order to
	 * program all packages for the host key id. Check it.
	 */
	for_each_present_cpu(i)
		cpumask_set_cpu(topology_physical_package_id(i), packages);
	for_each_online_cpu(i)
		cpumask_clear_cpu(topology_physical_package_id(i), packages);
	if (!cpumask_empty(packages)) {
		ret = -EIO;
		/*
		 * Warn, because it's hard for a human operator to figure out
		 * the reason otherwise.
		 */
#define MSG_ALLPKG "All packages need to have online CPU to create TD. Online CPU and retry.\n"
		pr_warn_ratelimited(MSG_ALLPKG);
		goto free_packages;
	}

	/*
	 * TDH.MNG.CREATE tries to grab the global TDX module lock and fails
	 * with TDX_OPERAND_BUSY when it can't grab it. Take the global lock
	 * to prevent such failures.
	 */
	mutex_lock(&tdx_lock);
	kvm_tdx->td.tdr_page = tdr_page;
	err = tdh_mng_create(&kvm_tdx->td, kvm_tdx->hkid);
	mutex_unlock(&tdx_lock);

	if (err == TDX_RND_NO_ENTROPY) {
		ret = -EAGAIN;
		goto free_packages;
	}

	if (WARN_ON_ONCE(err)) {
		pr_tdx_error(TDH_MNG_CREATE, err);
		ret = -EIO;
		goto free_packages;
	}

	for_each_online_cpu(i) {
		int pkg = topology_physical_package_id(i);

		if (cpumask_test_and_set_cpu(pkg, packages))
			continue;

		/*
		 * Program the memory controller in the package with an
		 * encryption key associated with the TDX private host key id
		 * assigned to this TDR. Concurrent operations on the same
		 * memory controller result in TDX_OPERAND_BUSY. No locking is
		 * needed beyond the cpus_read_lock() above, as it serializes
		 * against hotplug and the first online CPU of the package is
		 * always used. We never have two CPUs in the same socket
		 * trying to program the key.
		 */
		ret = smp_call_on_cpu(i, tdx_do_tdh_mng_key_config,
				      kvm_tdx, true);
		if (ret)
			break;
	}
	cpus_read_unlock();
	free_cpumask_var(packages);
	if (ret) {
		i = 0;
		goto teardown;
	}

	kvm_tdx->td.tdcs_pages = tdcs_pages;
	for (i = 0; i < kvm_tdx->td.tdcs_nr_pages; i++) {
		err = tdh_mng_addcx(&kvm_tdx->td, tdcs_pages[i]);
		if (err == TDX_RND_NO_ENTROPY) {
			/* Here it's hard to allow userspace to retry. */
			ret = -EAGAIN;
			goto teardown;
		}
		if (WARN_ON_ONCE(err)) {
			pr_tdx_error(TDH_MNG_ADDCX, err);
			ret = -EIO;
			goto teardown;
		}
	}

	err = tdh_mng_init(&kvm_tdx->td, __pa(td_params), &rcx);
	if ((err & TDX_SEAMCALL_STATUS_MASK) == TDX_OPERAND_INVALID) {
		/*
		 * Because the user supplies the operands, don't warn.
		 * Return a hint to the user because it's sometimes hard for
		 * the user to figure out which operand is invalid. The
		 * SEAMCALL status code includes which operand caused the
		 * invalid operand error.
		 */
		*seamcall_err = err;
		ret = -EINVAL;
		goto teardown;
	} else if (WARN_ON_ONCE(err)) {
		pr_tdx_error_1(TDH_MNG_INIT, err, rcx);
		ret = -EIO;
		goto teardown;
	}

	return 0;

	/*
	 * The sequence for freeing resources from a partially initialized TD
	 * varies based on where in the initialization flow failure occurred.
	 * Simply use the full teardown and destroy, which naturally play nice
	 * with partial initialization.
	 */
teardown:
	/* Only free pages not yet added, so start at 'i' */
	for (; i < kvm_tdx->td.tdcs_nr_pages; i++) {
		if (tdcs_pages[i]) {
			__free_page(tdcs_pages[i]);
			tdcs_pages[i] = NULL;
		}
	}
	if (!kvm_tdx->td.tdcs_pages)
		kfree(tdcs_pages);

	tdx_mmu_release_hkid(kvm);
	tdx_reclaim_td_control_pages(kvm);

	return ret;

free_packages:
	cpus_read_unlock();
	free_cpumask_var(packages);

free_tdcs:
	for (i = 0; i < kvm_tdx->td.tdcs_nr_pages; i++) {
		if (tdcs_pages[i])
			__free_page(tdcs_pages[i]);
	}
	kfree(tdcs_pages);
	kvm_tdx->td.tdcs_pages = NULL;

free_tdr:
	if (tdr_page)
		__free_page(tdr_page);
	kvm_tdx->td.tdr_page = 0;

free_hkid:
	tdx_hkid_free(kvm_tdx);

	return ret;
}

static u64 tdx_td_metadata_field_read(struct kvm_tdx *tdx, u64 field_id,
				      u64 *data)
{
	u64 err;

	err = tdh_mng_rd(&tdx->td, field_id, data);

	return err;
}

#define TDX_MD_UNREADABLE_LEAF_MASK	GENMASK(30, 7)
#define TDX_MD_UNREADABLE_SUBLEAF_MASK	GENMASK(31, 7)

static int tdx_read_cpuid(struct kvm_vcpu *vcpu, u32 leaf, u32 sub_leaf,
			  bool sub_leaf_set, int *entry_index,
			  struct kvm_cpuid_entry2 *out)
{
	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
	u64 field_id = TD_MD_FIELD_ID_CPUID_VALUES;
	u64 ebx_eax, edx_ecx;
	u64 err = 0;

	if (sub_leaf > 0b1111111)
		return -EINVAL;

	if (*entry_index >= KVM_MAX_CPUID_ENTRIES)
		return -EINVAL;

	if (leaf & TDX_MD_UNREADABLE_LEAF_MASK ||
	    sub_leaf & TDX_MD_UNREADABLE_SUBLEAF_MASK)
		return -EINVAL;

	/*
	 * bits 23:17, RESERVED: reserved, must be 0;
	 * bit 16, LEAF_31: leaf number bit 31;
	 * bits 15:9, LEAF_6_0: leaf number bits 6:0, leaf bits 30:7 are
	 *   implicitly 0;
	 * bit 8, SUBLEAF_NA: sub-leaf not applicable flag;
	 * bits 7:1, SUBLEAF_6_0: sub-leaf number bits 6:0. If SUBLEAF_NA is 1,
	 *   SUBLEAF_6_0 is all-ones. Sub-leaf bits 31:7 are implicitly 0;
	 * bit 0, ELEMENT_I: element index within the field.
	 */
	field_id |= ((leaf & 0x80000000) ? 1 : 0) << 16;
	field_id |= (leaf & 0x7f) << 9;
	if (sub_leaf_set)
		field_id |= (sub_leaf & 0x7f) << 1;
	else
		field_id |= 0x1fe;

	err = tdx_td_metadata_field_read(kvm_tdx, field_id, &ebx_eax);
	if (err) /* TODO: check for specific errors */
		goto err_out;

	out->eax = (u32) ebx_eax;
	out->ebx = (u32) (ebx_eax >> 32);

	field_id++;
	err = tdx_td_metadata_field_read(kvm_tdx, field_id, &edx_ecx);
	/*
	 * It's weird that reading edx_ecx fails while reading ebx_eax
	 * succeeded.
	 */
	if (WARN_ON_ONCE(err))
		goto err_out;

	out->ecx = (u32) edx_ecx;
	out->edx = (u32) (edx_ecx >> 32);

	out->function = leaf;
	out->index = sub_leaf;
	out->flags |= sub_leaf_set ? KVM_CPUID_FLAG_SIGNIFCANT_INDEX : 0;

	/*
	 * Work around missing support on old TDX modules: fetch the guest
	 * maxpa from gfn_direct_bits.
	 */
	if (leaf == 0x80000008) {
		gpa_t gpa_bits = gfn_to_gpa(kvm_gfn_direct_bits(vcpu->kvm));
		unsigned int g_maxpa = __ffs(gpa_bits) + 1;

		out->eax = tdx_set_guest_phys_addr_bits(out->eax, g_maxpa);
	}

	(*entry_index)++;

	return 0;

err_out:
	out->eax = 0;
	out->ebx = 0;
	out->ecx = 0;
	out->edx = 0;

	return -EIO;
}

static int tdx_td_init(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
{
	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
	struct kvm_tdx_init_vm *init_vm;
	struct td_params *td_params = NULL;
	int ret;

	BUILD_BUG_ON(sizeof(*init_vm) != 256 + sizeof_field(struct kvm_tdx_init_vm, cpuid));
	BUILD_BUG_ON(sizeof(struct td_params) != 1024);

	if (kvm_tdx->state != TD_STATE_UNINITIALIZED)
		return -EINVAL;

	if (cmd->flags)
		return -EINVAL;

	init_vm = kmalloc(sizeof(*init_vm) +
			  sizeof(init_vm->cpuid.entries[0]) * KVM_MAX_CPUID_ENTRIES,
			  GFP_KERNEL);
	if (!init_vm)
		return -ENOMEM;

	if (copy_from_user(init_vm, u64_to_user_ptr(cmd->data), sizeof(*init_vm))) {
		ret = -EFAULT;
		goto out;
	}

	if (init_vm->cpuid.nent > KVM_MAX_CPUID_ENTRIES) {
		ret = -E2BIG;
		goto out;
	}

	if (copy_from_user(init_vm->cpuid.entries,
			   u64_to_user_ptr(cmd->data) + sizeof(*init_vm),
			   flex_array_size(init_vm, cpuid.entries, init_vm->cpuid.nent))) {
		ret = -EFAULT;
		goto out;
	}

	if (memchr_inv(init_vm->reserved, 0, sizeof(init_vm->reserved))) {
		ret = -EINVAL;
		goto out;
	}

	if (init_vm->cpuid.padding) {
		ret = -EINVAL;
		goto out;
	}

	td_params = kzalloc(sizeof(struct td_params), GFP_KERNEL);
	if (!td_params) {
		ret = -ENOMEM;
		goto out;
	}

	ret = setup_tdparams(kvm, td_params, init_vm);
	if (ret)
		goto out;

	ret = __tdx_td_init(kvm, td_params, &cmd->hw_error);
	if (ret)
		goto out;

	kvm_tdx->tsc_offset = td_tdcs_exec_read64(kvm_tdx, TD_TDCS_EXEC_TSC_OFFSET);
	kvm_tdx->tsc_multiplier = td_tdcs_exec_read64(kvm_tdx, TD_TDCS_EXEC_TSC_MULTIPLIER);
	kvm_tdx->attributes = td_params->attributes;
	kvm_tdx->xfam = td_params->xfam;

	if (td_params->config_flags & TDX_CONFIG_FLAGS_MAX_GPAW)
		kvm->arch.gfn_direct_bits = TDX_SHARED_BIT_PWL_5;
	else
		kvm->arch.gfn_direct_bits = TDX_SHARED_BIT_PWL_4;

	kvm_tdx->state = TD_STATE_INITIALIZED;
out:
	/* kfree() accepts NULL. */
	kfree(init_vm);
	kfree(td_params);

	return ret;
}

void tdx_flush_tlb_current(struct kvm_vcpu *vcpu)
{
	/*
	 * flush_tlb_current() is invoked the first time the vCPU runs or when
	 * the root of the shared EPT is invalidated. KVM only needs to flush
	 * the shared EPT because the TDX module handles TLB invalidation for
	 * the private EPT in tdh_vp_enter().
	 *
	 * A single-context invalidation for the shared EPT can be performed
	 * here. However, this single-context invalidation requires the
	 * private EPTP rather than the shared EPTP to flush the shared EPT,
	 * as the shared EPT uses the private EPTP as its ASID for TLB
	 * invalidation.
	 *
	 * To avoid reading back the private EPTP, perform a global
	 * invalidation for the shared EPT instead, to keep this function
	 * simple.
	 */
	ept_sync_global();
}

void tdx_flush_tlb_all(struct kvm_vcpu *vcpu)
{
	/*
	 * TDX has called tdx_track() in tdx_sept_remove_private_spte() to
	 * ensure that the private EPT will be flushed on the next TD enter.
	 * There is no need to call tdx_track() here again, even when this
	 * callback is a result of zapping private EPT.
	 *
	 * Due to the lack of context to determine which EPT has been affected
	 * by zapping, invoke invept() directly here for both the shared EPT
	 * and the private EPT for simplicity, though it's not necessary for
	 * the private EPT.
	 */
	ept_sync_global();
}

static int tdx_td_finalize(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
{
	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);

	guard(mutex)(&kvm->slots_lock);

	if (!is_hkid_assigned(kvm_tdx) || kvm_tdx->state == TD_STATE_RUNNABLE)
		return -EINVAL;
	/*
	 * Pages are pending for KVM_TDX_INIT_MEM_REGION to issue
	 * TDH.MEM.PAGE.ADD().
	 */
	if (atomic64_read(&kvm_tdx->nr_premapped))
		return -EINVAL;

	cmd->hw_error = tdh_mr_finalize(&kvm_tdx->td);
	if (tdx_operand_busy(cmd->hw_error))
		return -EBUSY;
	if (KVM_BUG_ON(cmd->hw_error, kvm)) {
		pr_tdx_error(TDH_MR_FINALIZE, cmd->hw_error);
		return -EIO;
	}

	kvm_tdx->state = TD_STATE_RUNNABLE;
	/* TD_STATE_RUNNABLE must be set before 'pre_fault_allowed' */
	smp_wmb();
	kvm->arch.pre_fault_allowed = true;
	return 0;
}

int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
{
	struct kvm_tdx_cmd tdx_cmd;
	int r;

	if (copy_from_user(&tdx_cmd, argp, sizeof(struct kvm_tdx_cmd)))
		return -EFAULT;

	/*
	 * Userspace should never set hw_error. It is used by the kernel to
	 * report a hardware-defined error.
	 */
	if (tdx_cmd.hw_error)
		return -EINVAL;

	mutex_lock(&kvm->lock);

	switch (tdx_cmd.id) {
	case KVM_TDX_CAPABILITIES:
		r = tdx_get_capabilities(&tdx_cmd);
		break;
	case KVM_TDX_INIT_VM:
		r = tdx_td_init(kvm, &tdx_cmd);
		break;
	case KVM_TDX_FINALIZE_VM:
		r = tdx_td_finalize(kvm, &tdx_cmd);
		break;
	default:
		r = -EINVAL;
		goto out;
	}

	if (copy_to_user(argp, &tdx_cmd, sizeof(struct kvm_tdx_cmd)))
		r = -EFAULT;

out:
	mutex_unlock(&kvm->lock);
	return r;
}

/* The VMM can pass one 64-bit auxiliary datum to the vCPU via RCX for the guest BIOS. */
static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
{
	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
	struct vcpu_tdx *tdx = to_tdx(vcpu);
	struct page *page;
	int ret, i;
	u64 err;

	page = alloc_page(GFP_KERNEL);
	if (!page)
		return -ENOMEM;
	tdx->vp.tdvpr_page = page;

	tdx->vp.tdcx_pages = kcalloc(kvm_tdx->td.tdcx_nr_pages, sizeof(*tdx->vp.tdcx_pages),
				     GFP_KERNEL);
	if (!tdx->vp.tdcx_pages) {
		ret = -ENOMEM;
		goto free_tdvpr;
	}

	for (i = 0; i < kvm_tdx->td.tdcx_nr_pages; i++) {
		page = alloc_page(GFP_KERNEL);
		if (!page) {
			ret = -ENOMEM;
			goto free_tdcx;
		}
		tdx->vp.tdcx_pages[i] = page;
	}

	err = tdh_vp_create(&kvm_tdx->td, &tdx->vp);
	if (KVM_BUG_ON(err, vcpu->kvm)) {
		ret = -EIO;
		pr_tdx_error(TDH_VP_CREATE, err);
		goto free_tdcx;
	}

	for (i = 0; i < kvm_tdx->td.tdcx_nr_pages; i++) {
		err = tdh_vp_addcx(&tdx->vp, tdx->vp.tdcx_pages[i]);
		if (KVM_BUG_ON(err, vcpu->kvm)) {
			pr_tdx_error(TDH_VP_ADDCX, err);
			/*
			 * Pages already added are reclaimed by the vcpu_free
			 * method, but the rest are freed here.
			 */
			for (; i < kvm_tdx->td.tdcx_nr_pages; i++) {
				__free_page(tdx->vp.tdcx_pages[i]);
				tdx->vp.tdcx_pages[i] = NULL;
			}
			return -EIO;
		}
	}

	err = tdh_vp_init(&tdx->vp, vcpu_rcx, vcpu->vcpu_id);
	if (KVM_BUG_ON(err, vcpu->kvm)) {
		pr_tdx_error(TDH_VP_INIT, err);
		return -EIO;
	}

	vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;

	return 0;

free_tdcx:
	for (i = 0; i < kvm_tdx->td.tdcx_nr_pages; i++) {
		if (tdx->vp.tdcx_pages[i])
			__free_page(tdx->vp.tdcx_pages[i]);
		tdx->vp.tdcx_pages[i] = NULL;
	}
	kfree(tdx->vp.tdcx_pages);
	tdx->vp.tdcx_pages = NULL;

free_tdvpr:
	if (tdx->vp.tdvpr_page)
		__free_page(tdx->vp.tdvpr_page);
	tdx->vp.tdvpr_page = 0;

	return ret;
}

/* Sometimes reads multiple subleafs. Returns how many entries were written. */
static int tdx_vcpu_get_cpuid_leaf(struct kvm_vcpu *vcpu, u32 leaf, int *entry_index,
				   struct kvm_cpuid_entry2 *output_e)
{
	int sub_leaf = 0;
	int ret;

	/* First try without a subleaf */
	ret = tdx_read_cpuid(vcpu, leaf, 0, false, entry_index, output_e);

	/* If success, or invalid leaf, just give up */
	if (ret != -EIO)
		return ret;

	/*
	 * If the try without a subleaf failed, try reading subleafs until
	 * failure. The TDX module only supports 6 bits of subleaf index.
	 */
	while (1) {
		/* Keep reading subleafs until there is a failure. */
		if (tdx_read_cpuid(vcpu, leaf, sub_leaf, true, entry_index, output_e))
			return !sub_leaf;

		sub_leaf++;
		output_e++;
	}

	return 0;
}

static int tdx_vcpu_get_cpuid(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
{
	struct kvm_cpuid2 __user *output, *td_cpuid;
	int r = 0, i = 0, leaf;
	u32 level;

	output = u64_to_user_ptr(cmd->data);
	td_cpuid = kzalloc(sizeof(*td_cpuid) +
			   sizeof(output->entries[0]) * KVM_MAX_CPUID_ENTRIES,
			   GFP_KERNEL);
	if (!td_cpuid)
		return -ENOMEM;

	if (copy_from_user(td_cpuid, output, sizeof(*output))) {
		r = -EFAULT;
		goto out;
	}

	/* Read max CPUID for normal range */
	if (tdx_vcpu_get_cpuid_leaf(vcpu, 0, &i, &td_cpuid->entries[i])) {
		r = -EIO;
		goto out;
	}
	level = td_cpuid->entries[0].eax;

	for (leaf = 1; leaf <= level; leaf++)
		tdx_vcpu_get_cpuid_leaf(vcpu, leaf, &i, &td_cpuid->entries[i]);

	/* Read max CPUID for extended range */
	if (tdx_vcpu_get_cpuid_leaf(vcpu, 0x80000000, &i, &td_cpuid->entries[i])) {
		r = -EIO;
		goto out;
	}
	level = td_cpuid->entries[i - 1].eax;

	for (leaf = 0x80000001; leaf <= level; leaf++)
		tdx_vcpu_get_cpuid_leaf(vcpu, leaf, &i, &td_cpuid->entries[i]);

	if (td_cpuid->nent < i)
		r = -E2BIG;
	td_cpuid->nent = i;

	if (copy_to_user(output, td_cpuid, sizeof(*output))) {
		r = -EFAULT;
		goto out;
	}

	if (r == -E2BIG)
		goto out;

	if (copy_to_user(output->entries, td_cpuid->entries,
			 td_cpuid->nent * sizeof(struct kvm_cpuid_entry2)))
		r = -EFAULT;

out:
	kfree(td_cpuid);

	return r;
}

static int tdx_vcpu_init(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
{
	u64 apic_base;
	struct vcpu_tdx *tdx = to_tdx(vcpu);
	int ret;

	if (cmd->flags)
		return -EINVAL;

	if (tdx->state != VCPU_TD_STATE_UNINITIALIZED)
		return -EINVAL;

	/*
	 * TDX requires x2APIC; userspace is responsible for configuring guest
	 * CPUID accordingly.
	 */
	apic_base = APIC_DEFAULT_PHYS_BASE | LAPIC_MODE_X2APIC |
		    (kvm_vcpu_is_reset_bsp(vcpu) ? MSR_IA32_APICBASE_BSP : 0);
	if (kvm_apic_set_base(vcpu, apic_base, true))
		return -EINVAL;

	ret = tdx_td_vcpu_init(vcpu, (u64)cmd->data);
	if (ret)
		return ret;

	td_vmcs_write16(tdx, POSTED_INTR_NV, POSTED_INTR_VECTOR);
	td_vmcs_write64(tdx, POSTED_INTR_DESC_ADDR, __pa(&tdx->vt.pi_desc));
	td_vmcs_setbit32(tdx, PIN_BASED_VM_EXEC_CONTROL, PIN_BASED_POSTED_INTR);

	tdx->state = VCPU_TD_STATE_INITIALIZED;

	return 0;
}

void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
{
	/*
	 * Yell on INIT, as TDX doesn't support INIT, i.e. KVM should drop all
	 * INIT events.
	 *
	 * Defer initializing the vCPU for RESET state until KVM_TDX_INIT_VCPU,
	 * as userspace needs to define the vCPU model before KVM can
	 * initialize vCPU state, e.g. to enable x2APIC.
3046 + */ 3047 + WARN_ON_ONCE(init_event); 3048 + } 3049 + 3050 + struct tdx_gmem_post_populate_arg { 3051 + struct kvm_vcpu *vcpu; 3052 + __u32 flags; 3053 + }; 3054 + 3055 + static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, 3056 + void __user *src, int order, void *_arg) 3057 + { 3058 + u64 error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS; 3059 + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); 3060 + struct tdx_gmem_post_populate_arg *arg = _arg; 3061 + struct kvm_vcpu *vcpu = arg->vcpu; 3062 + gpa_t gpa = gfn_to_gpa(gfn); 3063 + u8 level = PG_LEVEL_4K; 3064 + struct page *src_page; 3065 + int ret, i; 3066 + u64 err, entry, level_state; 3067 + 3068 + /* 3069 + * Get the source page if it has been faulted in. Return failure if the 3070 + * source page has been swapped out or unmapped in primary memory. 3071 + */ 3072 + ret = get_user_pages_fast((unsigned long)src, 1, 0, &src_page); 3073 + if (ret < 0) 3074 + return ret; 3075 + if (ret != 1) 3076 + return -ENOMEM; 3077 + 3078 + ret = kvm_tdp_map_page(vcpu, gpa, error_code, &level); 3079 + if (ret < 0) 3080 + goto out; 3081 + 3082 + /* 3083 + * The private mem cannot be zapped after kvm_tdp_map_page() 3084 + * because all paths are covered by slots_lock and the 3085 + * filemap invalidate lock. Check that they are indeed enough. 3086 + */ 3087 + if (IS_ENABLED(CONFIG_KVM_PROVE_MMU)) { 3088 + scoped_guard(read_lock, &kvm->mmu_lock) { 3089 + if (KVM_BUG_ON(!kvm_tdp_mmu_gpa_is_mapped(vcpu, gpa), kvm)) { 3090 + ret = -EIO; 3091 + goto out; 3092 + } 3093 + } 3094 + } 3095 + 3096 + ret = 0; 3097 + err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn), 3098 + src_page, &entry, &level_state); 3099 + if (err) { 3100 + ret = unlikely(tdx_operand_busy(err)) ? 
-EBUSY : -EIO; 3101 + goto out; 3102 + } 3103 + 3104 + if (!KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm)) 3105 + atomic64_dec(&kvm_tdx->nr_premapped); 3106 + 3107 + if (arg->flags & KVM_TDX_MEASURE_MEMORY_REGION) { 3108 + for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) { 3109 + err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry, 3110 + &level_state); 3111 + if (err) { 3112 + ret = -EIO; 3113 + break; 3114 + } 3115 + } 3116 + } 3117 + 3118 + out: 3119 + put_page(src_page); 3120 + return ret; 3121 + } 3122 + 3123 + static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd) 3124 + { 3125 + struct vcpu_tdx *tdx = to_tdx(vcpu); 3126 + struct kvm *kvm = vcpu->kvm; 3127 + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); 3128 + struct kvm_tdx_init_mem_region region; 3129 + struct tdx_gmem_post_populate_arg arg; 3130 + long gmem_ret; 3131 + int ret; 3132 + 3133 + if (tdx->state != VCPU_TD_STATE_INITIALIZED) 3134 + return -EINVAL; 3135 + 3136 + guard(mutex)(&kvm->slots_lock); 3137 + 3138 + /* Once TD is finalized, the initial guest memory is fixed. 
*/ 3139 + if (kvm_tdx->state == TD_STATE_RUNNABLE) 3140 + return -EINVAL; 3141 + 3142 + if (cmd->flags & ~KVM_TDX_MEASURE_MEMORY_REGION) 3143 + return -EINVAL; 3144 + 3145 + if (copy_from_user(&region, u64_to_user_ptr(cmd->data), sizeof(region))) 3146 + return -EFAULT; 3147 + 3148 + if (!PAGE_ALIGNED(region.source_addr) || !PAGE_ALIGNED(region.gpa) || 3149 + !region.nr_pages || 3150 + region.gpa + (region.nr_pages << PAGE_SHIFT) <= region.gpa || 3151 + !vt_is_tdx_private_gpa(kvm, region.gpa) || 3152 + !vt_is_tdx_private_gpa(kvm, region.gpa + (region.nr_pages << PAGE_SHIFT) - 1)) 3153 + return -EINVAL; 3154 + 3155 + kvm_mmu_reload(vcpu); 3156 + ret = 0; 3157 + while (region.nr_pages) { 3158 + if (signal_pending(current)) { 3159 + ret = -EINTR; 3160 + break; 3161 + } 3162 + 3163 + arg = (struct tdx_gmem_post_populate_arg) { 3164 + .vcpu = vcpu, 3165 + .flags = cmd->flags, 3166 + }; 3167 + gmem_ret = kvm_gmem_populate(kvm, gpa_to_gfn(region.gpa), 3168 + u64_to_user_ptr(region.source_addr), 3169 + 1, tdx_gmem_post_populate, &arg); 3170 + if (gmem_ret < 0) { 3171 + ret = gmem_ret; 3172 + break; 3173 + } 3174 + 3175 + if (gmem_ret != 1) { 3176 + ret = -EIO; 3177 + break; 3178 + } 3179 + 3180 + region.source_addr += PAGE_SIZE; 3181 + region.gpa += PAGE_SIZE; 3182 + region.nr_pages--; 3183 + 3184 + cond_resched(); 3185 + } 3186 + 3187 + if (copy_to_user(u64_to_user_ptr(cmd->data), &region, sizeof(region))) 3188 + ret = -EFAULT; 3189 + return ret; 3190 + } 3191 + 3192 + int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) 3193 + { 3194 + struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm); 3195 + struct kvm_tdx_cmd cmd; 3196 + int ret; 3197 + 3198 + if (!is_hkid_assigned(kvm_tdx) || kvm_tdx->state == TD_STATE_RUNNABLE) 3199 + return -EINVAL; 3200 + 3201 + if (copy_from_user(&cmd, argp, sizeof(cmd))) 3202 + return -EFAULT; 3203 + 3204 + if (cmd.hw_error) 3205 + return -EINVAL; 3206 + 3207 + switch (cmd.id) { 3208 + case KVM_TDX_INIT_VCPU: 3209 + ret = tdx_vcpu_init(vcpu, 
&cmd); 3210 + break; 3211 + case KVM_TDX_INIT_MEM_REGION: 3212 + ret = tdx_vcpu_init_mem_region(vcpu, &cmd); 3213 + break; 3214 + case KVM_TDX_GET_CPUID: 3215 + ret = tdx_vcpu_get_cpuid(vcpu, &cmd); 3216 + break; 3217 + default: 3218 + ret = -EINVAL; 3219 + break; 3220 + } 3221 + 3222 + return ret; 3223 + } 3224 + 3225 + int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn) 3226 + { 3227 + return PG_LEVEL_4K; 3228 + } 3229 + 3230 + static int tdx_online_cpu(unsigned int cpu) 3231 + { 3232 + unsigned long flags; 3233 + int r; 3234 + 3235 + /* Sanity check that the CPU is already in post-VMXON */ 3236 + WARN_ON_ONCE(!(cr4_read_shadow() & X86_CR4_VMXE)); 3237 + 3238 + local_irq_save(flags); 3239 + r = tdx_cpu_enable(); 3240 + local_irq_restore(flags); 3241 + 3242 + return r; 3243 + } 3244 + 3245 + static int tdx_offline_cpu(unsigned int cpu) 3246 + { 3247 + int i; 3248 + 3249 + /* No TD is running. Allow any cpu to be offlined. */ 3250 + if (!atomic_read(&nr_configured_hkid)) 3251 + return 0; 3252 + 3253 + /* 3254 + * Reclaiming a TDX HKID (i.e. when deleting a guest TD) requires 3255 + * calling TDH.PHYMEM.PAGE.WBINVD on all packages to program all memory 3256 + * controllers with PCONFIG. If there are active TDX HKIDs, refuse to 3257 + * offline the last online cpu of a package. 3258 + */ 3259 + for_each_online_cpu(i) { 3260 + /* 3261 + * Found another online cpu on the same package. 3262 + * Offlining is allowed. 3263 + */ 3264 + if (i != cpu && topology_physical_package_id(i) == 3265 + topology_physical_package_id(cpu)) 3266 + return 0; 3267 + } 3268 + 3269 + /* 3270 + * This is the last cpu of this package. Don't offline it. 3271 + * 3272 + * Because it's hard for a human operator to understand the 3273 + * reason, print a warning. 3274 + */ 3275 + #define MSG_ALLPKG_ONLINE \ 3276 + "TDX requires all packages to have an online CPU. 
Delete all TDs in order to offline all CPUs of a package.\n" 3277 + pr_warn_ratelimited(MSG_ALLPKG_ONLINE); 3278 + return -EBUSY; 3279 + } 3280 + 3281 + static void __do_tdx_cleanup(void) 3282 + { 3283 + /* 3284 + * Once TDX module is initialized, it cannot be disabled and 3285 + * re-initialized again w/o runtime update (which isn't 3286 + * supported by kernel). Only need to remove the cpuhp here. 3287 + * The TDX host core code tracks TDX status and can handle 3288 + * 'multiple enabling' scenario. 3289 + */ 3290 + WARN_ON_ONCE(!tdx_cpuhp_state); 3291 + cpuhp_remove_state_nocalls_cpuslocked(tdx_cpuhp_state); 3292 + tdx_cpuhp_state = 0; 3293 + } 3294 + 3295 + static void __tdx_cleanup(void) 3296 + { 3297 + cpus_read_lock(); 3298 + __do_tdx_cleanup(); 3299 + cpus_read_unlock(); 3300 + } 3301 + 3302 + static int __init __do_tdx_bringup(void) 3303 + { 3304 + int r; 3305 + 3306 + /* 3307 + * TDX-specific cpuhp callback to call tdx_cpu_enable() on all 3308 + * online CPUs before calling tdx_enable(), and on any new 3309 + * going-online CPU to make sure it is ready for TDX guest. 3310 + */ 3311 + r = cpuhp_setup_state_cpuslocked(CPUHP_AP_ONLINE_DYN, 3312 + "kvm/cpu/tdx:online", 3313 + tdx_online_cpu, tdx_offline_cpu); 3314 + if (r < 0) 3315 + return r; 3316 + 3317 + tdx_cpuhp_state = r; 3318 + 3319 + r = tdx_enable(); 3320 + if (r) 3321 + __do_tdx_cleanup(); 3322 + 3323 + return r; 3324 + } 3325 + 3326 + static int __init __tdx_bringup(void) 3327 + { 3328 + const struct tdx_sys_info_td_conf *td_conf; 3329 + int r, i; 3330 + 3331 + for (i = 0; i < ARRAY_SIZE(tdx_uret_msrs); i++) { 3332 + /* 3333 + * Check if MSRs (tdx_uret_msrs) can be saved/restored 3334 + * before returning to user space. 3335 + * 3336 + * this_cpu_ptr(user_return_msrs)->registered isn't checked 3337 + * because the registration is done at vcpu runtime by 3338 + * tdx_user_return_msr_update_cache(). 
3339 + */ 3340 + tdx_uret_msrs[i].slot = kvm_find_user_return_msr(tdx_uret_msrs[i].msr); 3341 + if (tdx_uret_msrs[i].slot == -1) { 3342 + /* If any MSR isn't supported, it is a KVM bug */ 3343 + pr_err("MSR %x isn't included by kvm_find_user_return_msr\n", 3344 + tdx_uret_msrs[i].msr); 3345 + return -EIO; 3346 + } 3347 + } 3348 + 3349 + /* 3350 + * Enabling TDX requires enabling hardware virtualization first, 3351 + * as making SEAMCALLs requires CPU being in post-VMXON state. 3352 + */ 3353 + r = kvm_enable_virtualization(); 3354 + if (r) 3355 + return r; 3356 + 3357 + cpus_read_lock(); 3358 + r = __do_tdx_bringup(); 3359 + cpus_read_unlock(); 3360 + 3361 + if (r) 3362 + goto tdx_bringup_err; 3363 + 3364 + /* Get TDX global information for later use */ 3365 + tdx_sysinfo = tdx_get_sysinfo(); 3366 + if (WARN_ON_ONCE(!tdx_sysinfo)) { 3367 + r = -EINVAL; 3368 + goto get_sysinfo_err; 3369 + } 3370 + 3371 + /* Check TDX module and KVM capabilities */ 3372 + if (!tdx_get_supported_attrs(&tdx_sysinfo->td_conf) || 3373 + !tdx_get_supported_xfam(&tdx_sysinfo->td_conf)) 3374 + goto get_sysinfo_err; 3375 + 3376 + if (!(tdx_sysinfo->features.tdx_features0 & MD_FIELD_ID_FEATURES0_TOPOLOGY_ENUM)) 3377 + goto get_sysinfo_err; 3378 + 3379 + /* 3380 + * TDX has its own limit of maximum vCPUs it can support for all 3381 + * TDX guests in addition to KVM_MAX_VCPUS. Userspace needs to 3382 + * query TDX guest's maximum vCPUs by checking KVM_CAP_MAX_VCPU 3383 + * extension on per-VM basis. 3384 + * 3385 + * TDX module reports such limit via the MAX_VCPU_PER_TD global 3386 + * metadata. Different modules may report different values. 3387 + * Some old module may also not support this metadata (in which 3388 + * case this limit is U16_MAX). 3389 + * 3390 + * In practice, the reported value reflects the maximum logical 3391 + * CPUs that ALL the platforms that the module supports can 3392 + * possibly have. 
3393 + * 3394 + * Simply forwarding the MAX_VCPU_PER_TD to userspace could 3395 + * result in an unpredictable ABI. KVM instead always advertise 3396 + * the number of logical CPUs the platform has as the maximum 3397 + * vCPUs for TDX guests. 3398 + * 3399 + * Make sure MAX_VCPU_PER_TD reported by TDX module is not 3400 + * smaller than the number of logical CPUs, otherwise KVM will 3401 + * report an unsupported value to userspace. 3402 + * 3403 + * Note, a platform with TDX enabled in the BIOS cannot support 3404 + * physical CPU hotplug, and TDX requires the BIOS has marked 3405 + * all logical CPUs in MADT table as enabled. Just use 3406 + * num_present_cpus() for the number of logical CPUs. 3407 + */ 3408 + td_conf = &tdx_sysinfo->td_conf; 3409 + if (td_conf->max_vcpus_per_td < num_present_cpus()) { 3410 + pr_err("Disable TDX: MAX_VCPU_PER_TD (%u) smaller than number of logical CPUs (%u).\n", 3411 + td_conf->max_vcpus_per_td, num_present_cpus()); 3412 + r = -EINVAL; 3413 + goto get_sysinfo_err; 3414 + } 3415 + 3416 + if (misc_cg_set_capacity(MISC_CG_RES_TDX, tdx_get_nr_guest_keyids())) { 3417 + r = -EINVAL; 3418 + goto get_sysinfo_err; 3419 + } 3420 + 3421 + /* 3422 + * Leave hardware virtualization enabled after TDX is enabled 3423 + * successfully. TDX CPU hotplug depends on this. 3424 + */ 3425 + return 0; 3426 + 3427 + get_sysinfo_err: 3428 + __tdx_cleanup(); 3429 + tdx_bringup_err: 3430 + kvm_disable_virtualization(); 3431 + return r; 3432 + } 3433 + 3434 + void tdx_cleanup(void) 3435 + { 3436 + if (enable_tdx) { 3437 + misc_cg_set_capacity(MISC_CG_RES_TDX, 0); 3438 + __tdx_cleanup(); 3439 + kvm_disable_virtualization(); 3440 + } 3441 + } 3442 + 3443 + int __init tdx_bringup(void) 3444 + { 3445 + int r, i; 3446 + 3447 + /* tdx_disable_virtualization_cpu() uses associated_tdvcpus. 
*/ 3448 + for_each_possible_cpu(i) 3449 + INIT_LIST_HEAD(&per_cpu(associated_tdvcpus, i)); 3450 + 3451 + if (!enable_tdx) 3452 + return 0; 3453 + 3454 + if (!enable_ept) { 3455 + pr_err("EPT is required for TDX\n"); 3456 + goto success_disable_tdx; 3457 + } 3458 + 3459 + if (!tdp_mmu_enabled || !enable_mmio_caching || !enable_ept_ad_bits) { 3460 + pr_err("TDP MMU and MMIO caching and EPT A/D bit is required for TDX\n"); 3461 + goto success_disable_tdx; 3462 + } 3463 + 3464 + if (!enable_apicv) { 3465 + pr_err("APICv is required for TDX\n"); 3466 + goto success_disable_tdx; 3467 + } 3468 + 3469 + if (!cpu_feature_enabled(X86_FEATURE_OSXSAVE)) { 3470 + pr_err("tdx: OSXSAVE is required for TDX\n"); 3471 + goto success_disable_tdx; 3472 + } 3473 + 3474 + if (!cpu_feature_enabled(X86_FEATURE_MOVDIR64B)) { 3475 + pr_err("tdx: MOVDIR64B is required for TDX\n"); 3476 + goto success_disable_tdx; 3477 + } 3478 + 3479 + if (!cpu_feature_enabled(X86_FEATURE_SELFSNOOP)) { 3480 + pr_err("Self-snoop is required for TDX\n"); 3481 + goto success_disable_tdx; 3482 + } 3483 + 3484 + if (!cpu_feature_enabled(X86_FEATURE_TDX_HOST_PLATFORM)) { 3485 + pr_err("tdx: no TDX private KeyIDs available\n"); 3486 + goto success_disable_tdx; 3487 + } 3488 + 3489 + if (!enable_virt_at_load) { 3490 + pr_err("tdx: tdx requires kvm.enable_virt_at_load=1\n"); 3491 + goto success_disable_tdx; 3492 + } 3493 + 3494 + /* 3495 + * Ideally KVM should probe whether TDX module has been loaded 3496 + * first and then try to bring it up. But TDX needs to use SEAMCALL 3497 + * to probe whether the module is loaded (there is no CPUID or MSR 3498 + * for that), and making SEAMCALL requires enabling virtualization 3499 + * first, just like the rest steps of bringing up TDX module. 3500 + * 3501 + * So, for simplicity do everything in __tdx_bringup(); the first 3502 + * SEAMCALL will return -ENODEV when the module is not loaded. 
The 3503 + * only complication is having to make sure that initialization 3504 + * SEAMCALLs don't return TDX_SEAMCALL_VMFAILINVALID in other 3505 + * cases. 3506 + */ 3507 + r = __tdx_bringup(); 3508 + if (r) { 3509 + /* 3510 + * Disable TDX only but don't fail to load module if 3511 + * the TDX module could not be loaded. No need to print 3512 + * message saying "module is not loaded" because it was 3513 + * printed when the first SEAMCALL failed. 3514 + */ 3515 + if (r == -ENODEV) 3516 + goto success_disable_tdx; 3517 + 3518 + enable_tdx = 0; 3519 + } 3520 + 3521 + return r; 3522 + 3523 + success_disable_tdx: 3524 + enable_tdx = 0; 3525 + return 0; 3526 + }
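The tdx_offline_cpu() policy in the hunk above keeps at least one online CPU per physical package while any TDX private key ID (HKID) is configured, so TDH.PHYMEM.PAGE.WBINVD can still be issued on every package when a TD is torn down. A minimal standalone model of that decision, with hypothetical arrays standing in for the kernel's cpumask and topology APIs:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Decide whether @cpu may go offline. Mirrors the logic of
 * tdx_offline_cpu(): always allowed when no HKID is configured;
 * otherwise allowed only if another online CPU shares @cpu's package.
 * package_of[] and online[] are illustrative stand-ins, not kernel APIs.
 */
static bool may_offline_cpu(int cpu, int nr_cpus,
                            const int *package_of, const bool *online,
                            int nr_configured_hkid)
{
    /* No TD is running: any CPU may go offline. */
    if (nr_configured_hkid == 0)
        return true;

    /* Look for another online CPU on the same package. */
    for (int i = 0; i < nr_cpus; i++) {
        if (i != cpu && online[i] &&
            package_of[i] == package_of[cpu])
            return true;
    }

    /* Last online CPU of its package: refuse, as the kernel warns. */
    return false;
}
```

This is only a sketch of the policy; the real code additionally relies on cpus_read_lock() serialization via the cpuhp state machine.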
+204
arch/x86/kvm/vmx/tdx.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + #ifndef __KVM_X86_VMX_TDX_H 3 + #define __KVM_X86_VMX_TDX_H 4 + 5 + #include "tdx_arch.h" 6 + #include "tdx_errno.h" 7 + 8 + #ifdef CONFIG_KVM_INTEL_TDX 9 + #include "common.h" 10 + 11 + int tdx_bringup(void); 12 + void tdx_cleanup(void); 13 + 14 + extern bool enable_tdx; 15 + 16 + /* TDX module hardware states. These follow the TDX module OP_STATEs. */ 17 + enum kvm_tdx_state { 18 + TD_STATE_UNINITIALIZED = 0, 19 + TD_STATE_INITIALIZED, 20 + TD_STATE_RUNNABLE, 21 + }; 22 + 23 + struct kvm_tdx { 24 + struct kvm kvm; 25 + 26 + struct misc_cg *misc_cg; 27 + int hkid; 28 + enum kvm_tdx_state state; 29 + 30 + u64 attributes; 31 + u64 xfam; 32 + 33 + u64 tsc_offset; 34 + u64 tsc_multiplier; 35 + 36 + struct tdx_td td; 37 + 38 + /* For KVM_TDX_INIT_MEM_REGION. */ 39 + atomic64_t nr_premapped; 40 + 41 + /* 42 + * Prevent vCPUs from TD entry to ensure SEPT zap related SEAMCALLs do 43 + * not contend with tdh_vp_enter() and TDCALLs. 44 + * Set/unset is protected with kvm->mmu_lock. 
45 + */ 46 + bool wait_for_sept_zap; 47 + }; 48 + 49 + /* TDX module vCPU states */ 50 + enum vcpu_tdx_state { 51 + VCPU_TD_STATE_UNINITIALIZED = 0, 52 + VCPU_TD_STATE_INITIALIZED, 53 + }; 54 + 55 + struct vcpu_tdx { 56 + struct kvm_vcpu vcpu; 57 + struct vcpu_vt vt; 58 + u64 ext_exit_qualification; 59 + gpa_t exit_gpa; 60 + struct tdx_module_args vp_enter_args; 61 + 62 + struct tdx_vp vp; 63 + 64 + struct list_head cpu_list; 65 + 66 + u64 vp_enter_ret; 67 + 68 + enum vcpu_tdx_state state; 69 + bool guest_entered; 70 + 71 + u64 map_gpa_next; 72 + u64 map_gpa_end; 73 + }; 74 + 75 + void tdh_vp_rd_failed(struct vcpu_tdx *tdx, char *uclass, u32 field, u64 err); 76 + void tdh_vp_wr_failed(struct vcpu_tdx *tdx, char *uclass, char *op, u32 field, 77 + u64 val, u64 err); 78 + 79 + static __always_inline u64 td_tdcs_exec_read64(struct kvm_tdx *kvm_tdx, u32 field) 80 + { 81 + u64 err, data; 82 + 83 + err = tdh_mng_rd(&kvm_tdx->td, TDCS_EXEC(field), &data); 84 + if (unlikely(err)) { 85 + pr_err("TDH_MNG_RD[EXEC.0x%x] failed: 0x%llx\n", field, err); 86 + return 0; 87 + } 88 + return data; 89 + } 90 + 91 + static __always_inline void tdvps_vmcs_check(u32 field, u8 bits) 92 + { 93 + #define VMCS_ENC_ACCESS_TYPE_MASK 0x1UL 94 + #define VMCS_ENC_ACCESS_TYPE_FULL 0x0UL 95 + #define VMCS_ENC_ACCESS_TYPE_HIGH 0x1UL 96 + #define VMCS_ENC_ACCESS_TYPE(field) ((field) & VMCS_ENC_ACCESS_TYPE_MASK) 97 + 98 + /* TDX is 64bit only. HIGH field isn't supported. 
*/ 99 + BUILD_BUG_ON_MSG(__builtin_constant_p(field) && 100 + VMCS_ENC_ACCESS_TYPE(field) == VMCS_ENC_ACCESS_TYPE_HIGH, 101 + "Read/Write to TD VMCS *_HIGH fields not supported"); 102 + 103 + BUILD_BUG_ON(bits != 16 && bits != 32 && bits != 64); 104 + 105 + #define VMCS_ENC_WIDTH_MASK GENMASK(14, 13) 106 + #define VMCS_ENC_WIDTH_16BIT (0UL << 13) 107 + #define VMCS_ENC_WIDTH_64BIT (1UL << 13) 108 + #define VMCS_ENC_WIDTH_32BIT (2UL << 13) 109 + #define VMCS_ENC_WIDTH_NATURAL (3UL << 13) 110 + #define VMCS_ENC_WIDTH(field) ((field) & VMCS_ENC_WIDTH_MASK) 111 + 112 + /* TDX is 64bit only. i.e. natural width = 64bit. */ 113 + BUILD_BUG_ON_MSG(bits != 64 && __builtin_constant_p(field) && 114 + (VMCS_ENC_WIDTH(field) == VMCS_ENC_WIDTH_64BIT || 115 + VMCS_ENC_WIDTH(field) == VMCS_ENC_WIDTH_NATURAL), 116 + "Invalid TD VMCS access for 64-bit field"); 117 + BUILD_BUG_ON_MSG(bits != 32 && __builtin_constant_p(field) && 118 + VMCS_ENC_WIDTH(field) == VMCS_ENC_WIDTH_32BIT, 119 + "Invalid TD VMCS access for 32-bit field"); 120 + BUILD_BUG_ON_MSG(bits != 16 && __builtin_constant_p(field) && 121 + VMCS_ENC_WIDTH(field) == VMCS_ENC_WIDTH_16BIT, 122 + "Invalid TD VMCS access for 16-bit field"); 123 + } 124 + 125 + static __always_inline void tdvps_management_check(u64 field, u8 bits) {} 126 + static __always_inline void tdvps_state_non_arch_check(u64 field, u8 bits) {} 127 + 128 + #define TDX_BUILD_TDVPS_ACCESSORS(bits, uclass, lclass) \ 129 + static __always_inline u##bits td_##lclass##_read##bits(struct vcpu_tdx *tdx, \ 130 + u32 field) \ 131 + { \ 132 + u64 err, data; \ 133 + \ 134 + tdvps_##lclass##_check(field, bits); \ 135 + err = tdh_vp_rd(&tdx->vp, TDVPS_##uclass(field), &data); \ 136 + if (unlikely(err)) { \ 137 + tdh_vp_rd_failed(tdx, #uclass, field, err); \ 138 + return 0; \ 139 + } \ 140 + return (u##bits)data; \ 141 + } \ 142 + static __always_inline void td_##lclass##_write##bits(struct vcpu_tdx *tdx, \ 143 + u32 field, u##bits val) \ 144 + { \ 145 + u64 err; \ 146 + 
\ 147 + tdvps_##lclass##_check(field, bits); \ 148 + err = tdh_vp_wr(&tdx->vp, TDVPS_##uclass(field), val, \ 149 + GENMASK_ULL(bits - 1, 0)); \ 150 + if (unlikely(err)) \ 151 + tdh_vp_wr_failed(tdx, #uclass, " = ", field, (u64)val, err); \ 152 + } \ 153 + static __always_inline void td_##lclass##_setbit##bits(struct vcpu_tdx *tdx, \ 154 + u32 field, u64 bit) \ 155 + { \ 156 + u64 err; \ 157 + \ 158 + tdvps_##lclass##_check(field, bits); \ 159 + err = tdh_vp_wr(&tdx->vp, TDVPS_##uclass(field), bit, bit); \ 160 + if (unlikely(err)) \ 161 + tdh_vp_wr_failed(tdx, #uclass, " |= ", field, bit, err); \ 162 + } \ 163 + static __always_inline void td_##lclass##_clearbit##bits(struct vcpu_tdx *tdx, \ 164 + u32 field, u64 bit) \ 165 + { \ 166 + u64 err; \ 167 + \ 168 + tdvps_##lclass##_check(field, bits); \ 169 + err = tdh_vp_wr(&tdx->vp, TDVPS_##uclass(field), 0, bit); \ 170 + if (unlikely(err)) \ 171 + tdh_vp_wr_failed(tdx, #uclass, " &= ~", field, bit, err);\ 172 + } 173 + 174 + 175 + bool tdx_interrupt_allowed(struct kvm_vcpu *vcpu); 176 + int tdx_complete_emulated_msr(struct kvm_vcpu *vcpu, int err); 177 + 178 + TDX_BUILD_TDVPS_ACCESSORS(16, VMCS, vmcs); 179 + TDX_BUILD_TDVPS_ACCESSORS(32, VMCS, vmcs); 180 + TDX_BUILD_TDVPS_ACCESSORS(64, VMCS, vmcs); 181 + 182 + TDX_BUILD_TDVPS_ACCESSORS(8, MANAGEMENT, management); 183 + TDX_BUILD_TDVPS_ACCESSORS(64, STATE_NON_ARCH, state_non_arch); 184 + 185 + #else 186 + static inline int tdx_bringup(void) { return 0; } 187 + static inline void tdx_cleanup(void) {} 188 + 189 + #define enable_tdx 0 190 + 191 + struct kvm_tdx { 192 + struct kvm kvm; 193 + }; 194 + 195 + struct vcpu_tdx { 196 + struct kvm_vcpu vcpu; 197 + }; 198 + 199 + static inline bool tdx_interrupt_allowed(struct kvm_vcpu *vcpu) { return false; } 200 + static inline int tdx_complete_emulated_msr(struct kvm_vcpu *vcpu, int err) { return 0; } 201 + 202 + #endif 203 + 204 + #endif
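tdvps_vmcs_check() above rejects, at build time, accessors whose size does not match the width encoded in bits 14:13 of the VMCS field encoding (bit 0 selects the legacy "high" access type, which TDX never uses). A runtime sketch of that decoding, reusing the same masks the header defines; the helper name is illustrative, not KVM's:

```c
#include <assert.h>
#include <stdint.h>

/* Same encoding layout as tdvps_vmcs_check() in tdx.h. */
#define VMCS_ENC_ACCESS_TYPE_MASK  0x1UL
#define VMCS_ENC_ACCESS_TYPE_HIGH  0x1UL
#define VMCS_ENC_WIDTH_MASK        (3UL << 13)
#define VMCS_ENC_WIDTH_16BIT       (0UL << 13)
#define VMCS_ENC_WIDTH_64BIT       (1UL << 13)
#define VMCS_ENC_WIDTH_32BIT       (2UL << 13)
#define VMCS_ENC_WIDTH_NATURAL     (3UL << 13)

/*
 * Return the access size in bits implied by a VMCS field encoding.
 * TDX is 64-bit only, so natural-width fields are treated as 64-bit,
 * and "high" halves of 64-bit fields are rejected (0 returned),
 * mirroring the BUILD_BUG_ON_MSG() checks above.
 */
static unsigned int vmcs_field_width(uint64_t field)
{
    if ((field & VMCS_ENC_ACCESS_TYPE_MASK) == VMCS_ENC_ACCESS_TYPE_HIGH)
        return 0;

    switch (field & VMCS_ENC_WIDTH_MASK) {
    case VMCS_ENC_WIDTH_16BIT:   return 16;
    case VMCS_ENC_WIDTH_32BIT:   return 32;
    case VMCS_ENC_WIDTH_64BIT:
    case VMCS_ENC_WIDTH_NATURAL: return 64;
    }
    return 0; /* unreachable */
}
```

For example, POSTED_INTR_NV (encoding 0x0002) decodes as a 16-bit field and POSTED_INTR_DESC_ADDR (0x2016) as 64-bit, matching the td_vmcs_write16()/td_vmcs_write64() calls in tdx_vcpu_init().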
+167
arch/x86/kvm/vmx/tdx_arch.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + /* architectural constants/data definitions for TDX SEAMCALLs */ 3 + 4 + #ifndef __KVM_X86_TDX_ARCH_H 5 + #define __KVM_X86_TDX_ARCH_H 6 + 7 + #include <linux/types.h> 8 + 9 + /* TDX control structure (TDR/TDCS/TDVPS) field access codes */ 10 + #define TDX_NON_ARCH BIT_ULL(63) 11 + #define TDX_CLASS_SHIFT 56 12 + #define TDX_FIELD_MASK GENMASK_ULL(31, 0) 13 + 14 + #define __BUILD_TDX_FIELD(non_arch, class, field) \ 15 + (((non_arch) ? TDX_NON_ARCH : 0) | \ 16 + ((u64)(class) << TDX_CLASS_SHIFT) | \ 17 + ((u64)(field) & TDX_FIELD_MASK)) 18 + 19 + #define BUILD_TDX_FIELD(class, field) \ 20 + __BUILD_TDX_FIELD(false, (class), (field)) 21 + 22 + #define BUILD_TDX_FIELD_NON_ARCH(class, field) \ 23 + __BUILD_TDX_FIELD(true, (class), (field)) 24 + 25 + 26 + /* Class code for TD */ 27 + #define TD_CLASS_EXECUTION_CONTROLS 17ULL 28 + 29 + /* Class code for TDVPS */ 30 + #define TDVPS_CLASS_VMCS 0ULL 31 + #define TDVPS_CLASS_GUEST_GPR 16ULL 32 + #define TDVPS_CLASS_OTHER_GUEST 17ULL 33 + #define TDVPS_CLASS_MANAGEMENT 32ULL 34 + 35 + enum tdx_tdcs_execution_control { 36 + TD_TDCS_EXEC_TSC_OFFSET = 10, 37 + TD_TDCS_EXEC_TSC_MULTIPLIER = 11, 38 + }; 39 + 40 + enum tdx_vcpu_guest_other_state { 41 + TD_VCPU_STATE_DETAILS_NON_ARCH = 0x100, 42 + }; 43 + 44 + #define TDX_VCPU_STATE_DETAILS_INTR_PENDING BIT_ULL(0) 45 + 46 + static inline bool tdx_vcpu_state_details_intr_pending(u64 vcpu_state_details) 47 + { 48 + return !!(vcpu_state_details & TDX_VCPU_STATE_DETAILS_INTR_PENDING); 49 + } 50 + 51 + /* @field is any of enum tdx_tdcs_execution_control */ 52 + #define TDCS_EXEC(field) BUILD_TDX_FIELD(TD_CLASS_EXECUTION_CONTROLS, (field)) 53 + 54 + /* @field is the VMCS field encoding */ 55 + #define TDVPS_VMCS(field) BUILD_TDX_FIELD(TDVPS_CLASS_VMCS, (field)) 56 + 57 + /* @field is any of enum tdx_guest_other_state */ 58 + #define TDVPS_STATE(field) BUILD_TDX_FIELD(TDVPS_CLASS_OTHER_GUEST, (field)) 59 + #define 
TDVPS_STATE_NON_ARCH(field) BUILD_TDX_FIELD_NON_ARCH(TDVPS_CLASS_OTHER_GUEST, (field)) 60 + 61 + /* Management class fields */ 62 + enum tdx_vcpu_guest_management { 63 + TD_VCPU_PEND_NMI = 11, 64 + }; 65 + 66 + /* @field is any of enum tdx_vcpu_guest_management */ 67 + #define TDVPS_MANAGEMENT(field) BUILD_TDX_FIELD(TDVPS_CLASS_MANAGEMENT, (field)) 68 + 69 + #define TDX_EXTENDMR_CHUNKSIZE 256 70 + 71 + struct tdx_cpuid_value { 72 + u32 eax; 73 + u32 ebx; 74 + u32 ecx; 75 + u32 edx; 76 + } __packed; 77 + 78 + #define TDX_TD_ATTR_DEBUG BIT_ULL(0) 79 + #define TDX_TD_ATTR_SEPT_VE_DISABLE BIT_ULL(28) 80 + #define TDX_TD_ATTR_PKS BIT_ULL(30) 81 + #define TDX_TD_ATTR_KL BIT_ULL(31) 82 + #define TDX_TD_ATTR_PERFMON BIT_ULL(63) 83 + 84 + #define TDX_EXT_EXIT_QUAL_TYPE_MASK GENMASK(3, 0) 85 + #define TDX_EXT_EXIT_QUAL_TYPE_PENDING_EPT_VIOLATION 6 86 + /* 87 + * TD_PARAMS is provided as an input to TDH_MNG_INIT, the size of which is 1024B. 88 + */ 89 + struct td_params { 90 + u64 attributes; 91 + u64 xfam; 92 + u16 max_vcpus; 93 + u8 reserved0[6]; 94 + 95 + u64 eptp_controls; 96 + u64 config_flags; 97 + u16 tsc_frequency; 98 + u8 reserved1[38]; 99 + 100 + u64 mrconfigid[6]; 101 + u64 mrowner[6]; 102 + u64 mrownerconfig[6]; 103 + u64 reserved2[4]; 104 + 105 + union { 106 + DECLARE_FLEX_ARRAY(struct tdx_cpuid_value, cpuid_values); 107 + u8 reserved3[768]; 108 + }; 109 + } __packed __aligned(1024); 110 + 111 + /* 112 + * Guest uses MAX_PA for GPAW when set. 113 + * 0: GPA.SHARED bit is GPA[47] 114 + * 1: GPA.SHARED bit is GPA[51] 115 + */ 116 + #define TDX_CONFIG_FLAGS_MAX_GPAW BIT_ULL(0) 117 + 118 + /* 119 + * TDH.VP.ENTER, TDG.VP.VMCALL preserves RBP 120 + * 0: RBP can be used for TDG.VP.VMCALL input. RBP is clobbered. 121 + * 1: RBP can't be used for TDG.VP.VMCALL input. RBP is preserved. 
122 + */ 123 + #define TDX_CONFIG_FLAGS_NO_RBP_MOD BIT_ULL(2) 124 + 125 + 126 + /* 127 + * TDX requires the frequency to be defined in units of 25MHz, which is the 128 + * frequency of the core crystal clock on TDX-capable platforms, i.e. the TDX 129 + * module can only program frequencies that are multiples of 25MHz. The 130 + * frequency must be between 100mhz and 10ghz (inclusive). 131 + */ 132 + #define TDX_TSC_KHZ_TO_25MHZ(tsc_in_khz) ((tsc_in_khz) / (25 * 1000)) 133 + #define TDX_TSC_25MHZ_TO_KHZ(tsc_in_25mhz) ((tsc_in_25mhz) * (25 * 1000)) 134 + #define TDX_MIN_TSC_FREQUENCY_KHZ (100 * 1000) 135 + #define TDX_MAX_TSC_FREQUENCY_KHZ (10 * 1000 * 1000) 136 + 137 + /* Additional Secure EPT entry information */ 138 + #define TDX_SEPT_LEVEL_MASK GENMASK_ULL(2, 0) 139 + #define TDX_SEPT_STATE_MASK GENMASK_ULL(15, 8) 140 + #define TDX_SEPT_STATE_SHIFT 8 141 + 142 + enum tdx_sept_entry_state { 143 + TDX_SEPT_FREE = 0, 144 + TDX_SEPT_BLOCKED = 1, 145 + TDX_SEPT_PENDING = 2, 146 + TDX_SEPT_PENDING_BLOCKED = 3, 147 + TDX_SEPT_PRESENT = 4, 148 + }; 149 + 150 + static inline u8 tdx_get_sept_level(u64 sept_entry_info) 151 + { 152 + return sept_entry_info & TDX_SEPT_LEVEL_MASK; 153 + } 154 + 155 + static inline u8 tdx_get_sept_state(u64 sept_entry_info) 156 + { 157 + return (sept_entry_info & TDX_SEPT_STATE_MASK) >> TDX_SEPT_STATE_SHIFT; 158 + } 159 + 160 + #define MD_FIELD_ID_FEATURES0_TOPOLOGY_ENUM BIT_ULL(20) 161 + 162 + /* 163 + * TD scope metadata field ID. 164 + */ 165 + #define TD_MD_FIELD_ID_CPUID_VALUES 0x9410000300000000ULL 166 + 167 + #endif /* __KVM_X86_TDX_ARCH_H */
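td_params.tsc_frequency is expressed in the 25 MHz units defined above. A small sketch, under the assumption that callers validate against the kHz bounds before converting; the integer division silently rounds non-multiples of 25 MHz down, so a round-trip comparison detects inexact requests. The helper name is hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Same conversions and bounds as tdx_arch.h above. */
#define TDX_TSC_KHZ_TO_25MHZ(khz)    ((khz) / (25 * 1000))
#define TDX_TSC_25MHZ_TO_KHZ(units)  ((units) * (25 * 1000))
#define TDX_MIN_TSC_FREQUENCY_KHZ    (100 * 1000)
#define TDX_MAX_TSC_FREQUENCY_KHZ    (10 * 1000 * 1000)

/*
 * Convert a guest TSC frequency in kHz to the 25 MHz units that
 * TDH.MNG.INIT expects in td_params.tsc_frequency; 0 signals an
 * out-of-range request. Frequencies that are not multiples of
 * 25 MHz are rounded down by the division.
 */
static uint16_t tsc_khz_to_td_units(uint32_t khz)
{
    if (khz < TDX_MIN_TSC_FREQUENCY_KHZ || khz > TDX_MAX_TSC_FREQUENCY_KHZ)
        return 0;
    return (uint16_t)TDX_TSC_KHZ_TO_25MHZ(khz);
}
```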
+40
arch/x86/kvm/vmx/tdx_errno.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + /* architectural status code for SEAMCALL */ 3 + 4 + #ifndef __KVM_X86_TDX_ERRNO_H 5 + #define __KVM_X86_TDX_ERRNO_H 6 + 7 + #define TDX_SEAMCALL_STATUS_MASK 0xFFFFFFFF00000000ULL 8 + 9 + /* 10 + * TDX SEAMCALL Status Codes (returned in RAX) 11 + */ 12 + #define TDX_NON_RECOVERABLE_VCPU 0x4000000100000000ULL 13 + #define TDX_NON_RECOVERABLE_TD 0x4000000200000000ULL 14 + #define TDX_NON_RECOVERABLE_TD_NON_ACCESSIBLE 0x6000000500000000ULL 15 + #define TDX_NON_RECOVERABLE_TD_WRONG_APIC_MODE 0x6000000700000000ULL 16 + #define TDX_INTERRUPTED_RESUMABLE 0x8000000300000000ULL 17 + #define TDX_OPERAND_INVALID 0xC000010000000000ULL 18 + #define TDX_OPERAND_BUSY 0x8000020000000000ULL 19 + #define TDX_PREVIOUS_TLB_EPOCH_BUSY 0x8000020100000000ULL 20 + #define TDX_PAGE_METADATA_INCORRECT 0xC000030000000000ULL 21 + #define TDX_VCPU_NOT_ASSOCIATED 0x8000070200000000ULL 22 + #define TDX_KEY_GENERATION_FAILED 0x8000080000000000ULL 23 + #define TDX_KEY_STATE_INCORRECT 0xC000081100000000ULL 24 + #define TDX_KEY_CONFIGURED 0x0000081500000000ULL 25 + #define TDX_NO_HKID_READY_TO_WBCACHE 0x0000082100000000ULL 26 + #define TDX_FLUSHVP_NOT_DONE 0x8000082400000000ULL 27 + #define TDX_EPT_WALK_FAILED 0xC0000B0000000000ULL 28 + #define TDX_EPT_ENTRY_STATE_INCORRECT 0xC0000B0D00000000ULL 29 + #define TDX_METADATA_FIELD_NOT_READABLE 0xC0000C0200000000ULL 30 + 31 + /* 32 + * TDX module operand ID, appears in 31:0 part of error code as 33 + * detail information 34 + */ 35 + #define TDX_OPERAND_ID_RCX 0x01 36 + #define TDX_OPERAND_ID_TDR 0x80 37 + #define TDX_OPERAND_ID_SEPT 0x92 38 + #define TDX_OPERAND_ID_TD_EPOCH 0xa9 39 + 40 + #endif /* __KVM_X86_TDX_ERRNO_H */
+111 -180
arch/x86/kvm/vmx/vmx.c
··· 53 53 #include <trace/events/ipi.h> 54 54 55 55 #include "capabilities.h" 56 + #include "common.h" 56 57 #include "cpuid.h" 57 58 #include "hyperv.h" 58 59 #include "kvm_onhyperv.h" ··· 1282 1281 void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu) 1283 1282 { 1284 1283 struct vcpu_vmx *vmx = to_vmx(vcpu); 1284 + struct vcpu_vt *vt = to_vt(vcpu); 1285 1285 struct vmcs_host_state *host_state; 1286 1286 #ifdef CONFIG_X86_64 1287 1287 int cpu = raw_smp_processor_id(); ··· 1311 1309 if (vmx->nested.need_vmcs12_to_shadow_sync) 1312 1310 nested_sync_vmcs12_to_shadow(vcpu); 1313 1311 1314 - if (vmx->guest_state_loaded) 1312 + if (vt->guest_state_loaded) 1315 1313 return; 1316 1314 1317 1315 host_state = &vmx->loaded_vmcs->host_state; ··· 1332 1330 fs_sel = current->thread.fsindex; 1333 1331 gs_sel = current->thread.gsindex; 1334 1332 fs_base = current->thread.fsbase; 1335 - vmx->msr_host_kernel_gs_base = current->thread.gsbase; 1333 + vt->msr_host_kernel_gs_base = current->thread.gsbase; 1336 1334 } else { 1337 1335 savesegment(fs, fs_sel); 1338 1336 savesegment(gs, gs_sel); 1339 1337 fs_base = read_msr(MSR_FS_BASE); 1340 - vmx->msr_host_kernel_gs_base = read_msr(MSR_KERNEL_GS_BASE); 1338 + vt->msr_host_kernel_gs_base = read_msr(MSR_KERNEL_GS_BASE); 1341 1339 } 1342 1340 1343 1341 wrmsrl(MSR_KERNEL_GS_BASE, vmx->msr_guest_kernel_gs_base); ··· 1349 1347 #endif 1350 1348 1351 1349 vmx_set_host_fs_gs(host_state, fs_sel, gs_sel, fs_base, gs_base); 1352 - vmx->guest_state_loaded = true; 1350 + vt->guest_state_loaded = true; 1353 1351 } 1354 1352 1355 1353 static void vmx_prepare_switch_to_host(struct vcpu_vmx *vmx) 1356 1354 { 1357 1355 struct vmcs_host_state *host_state; 1358 1356 1359 - if (!vmx->guest_state_loaded) 1357 + if (!vmx->vt.guest_state_loaded) 1360 1358 return; 1361 1359 1362 1360 host_state = &vmx->loaded_vmcs->host_state; ··· 1384 1382 #endif 1385 1383 invalidate_tss_limit(); 1386 1384 #ifdef CONFIG_X86_64 1387 - wrmsrl(MSR_KERNEL_GS_BASE, 
vmx->msr_host_kernel_gs_base); 1385 + wrmsrl(MSR_KERNEL_GS_BASE, vmx->vt.msr_host_kernel_gs_base); 1388 1386 #endif 1389 1387 load_fixmap_gdt(raw_smp_processor_id()); 1390 - vmx->guest_state_loaded = false; 1388 + vmx->vt.guest_state_loaded = false; 1391 1389 vmx->guest_uret_msrs_loaded = false; 1392 1390 } 1393 1391 ··· 1395 1393 static u64 vmx_read_guest_kernel_gs_base(struct vcpu_vmx *vmx) 1396 1394 { 1397 1395 preempt_disable(); 1398 - if (vmx->guest_state_loaded) 1396 + if (vmx->vt.guest_state_loaded) 1399 1397 rdmsrl(MSR_KERNEL_GS_BASE, vmx->msr_guest_kernel_gs_base); 1400 1398 preempt_enable(); 1401 1399 return vmx->msr_guest_kernel_gs_base; ··· 1404 1402 static void vmx_write_guest_kernel_gs_base(struct vcpu_vmx *vmx, u64 data) 1405 1403 { 1406 1404 preempt_disable(); 1407 - if (vmx->guest_state_loaded) 1405 + if (vmx->vt.guest_state_loaded) 1408 1406 wrmsrl(MSR_KERNEL_GS_BASE, data); 1409 1407 preempt_enable(); 1410 1408 vmx->msr_guest_kernel_gs_base = data; ··· 1581 1579 vmcs_writel(GUEST_RFLAGS, rflags); 1582 1580 1583 1581 if ((old_rflags ^ vmx->rflags) & X86_EFLAGS_VM) 1584 - vmx->emulation_required = vmx_emulation_required(vcpu); 1582 + vmx->vt.emulation_required = vmx_emulation_required(vcpu); 1585 1583 } 1586 1584 1587 1585 bool vmx_get_if_flag(struct kvm_vcpu *vcpu) ··· 1701 1699 * so that guest userspace can't DoS the guest simply by triggering 1702 1700 * emulation (enclaves are CPL3 only). 
1703 1701 */ 1704 - if (to_vmx(vcpu)->exit_reason.enclave_mode) { 1702 + if (vmx_get_exit_reason(vcpu).enclave_mode) { 1705 1703 kvm_queue_exception(vcpu, UD_VECTOR); 1706 1704 return X86EMUL_PROPAGATE_FAULT; 1707 1705 } ··· 1716 1714 1717 1715 static int skip_emulated_instruction(struct kvm_vcpu *vcpu) 1718 1716 { 1719 - union vmx_exit_reason exit_reason = to_vmx(vcpu)->exit_reason; 1717 + union vmx_exit_reason exit_reason = vmx_get_exit_reason(vcpu); 1720 1718 unsigned long rip, orig_rip; 1721 1719 u32 instr_len; 1722 1720 ··· 1863 1861 return; 1864 1862 } 1865 1863 1866 - WARN_ON_ONCE(vmx->emulation_required); 1864 + WARN_ON_ONCE(vmx->vt.emulation_required); 1867 1865 1868 1866 if (kvm_exception_is_soft(ex->vector)) { 1869 1867 vmcs_write32(VM_ENTRY_INSTRUCTION_LEN, ··· 3406 3404 } 3407 3405 3408 3406 /* depends on vcpu->arch.cr0 to be set to a new value */ 3409 - vmx->emulation_required = vmx_emulation_required(vcpu); 3407 + vmx->vt.emulation_required = vmx_emulation_required(vcpu); 3410 3408 } 3411 3409 3412 3410 static int vmx_get_max_ept_level(void) ··· 3669 3667 { 3670 3668 __vmx_set_segment(vcpu, var, seg); 3671 3669 3672 - to_vmx(vcpu)->emulation_required = vmx_emulation_required(vcpu); 3670 + to_vmx(vcpu)->vt.emulation_required = vmx_emulation_required(vcpu); 3673 3671 } 3674 3672 3675 3673 void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l) ··· 4197 4195 pt_update_intercept_for_msr(vcpu); 4198 4196 } 4199 4197 4200 - static inline void kvm_vcpu_trigger_posted_interrupt(struct kvm_vcpu *vcpu, 4201 - int pi_vec) 4202 - { 4203 - #ifdef CONFIG_SMP 4204 - if (vcpu->mode == IN_GUEST_MODE) { 4205 - /* 4206 - * The vector of the virtual has already been set in the PIR. 4207 - * Send a notification event to deliver the virtual interrupt 4208 - * unless the vCPU is the currently running vCPU, i.e. 
the 4209 - * event is being sent from a fastpath VM-Exit handler, in 4210 - * which case the PIR will be synced to the vIRR before 4211 - * re-entering the guest. 4212 - * 4213 - * When the target is not the running vCPU, the following 4214 - * possibilities emerge: 4215 - * 4216 - * Case 1: vCPU stays in non-root mode. Sending a notification 4217 - * event posts the interrupt to the vCPU. 4218 - * 4219 - * Case 2: vCPU exits to root mode and is still runnable. The 4220 - * PIR will be synced to the vIRR before re-entering the guest. 4221 - * Sending a notification event is ok as the host IRQ handler 4222 - * will ignore the spurious event. 4223 - * 4224 - * Case 3: vCPU exits to root mode and is blocked. vcpu_block() 4225 - * has already synced PIR to vIRR and never blocks the vCPU if 4226 - * the vIRR is not empty. Therefore, a blocked vCPU here does 4227 - * not wait for any requested interrupts in PIR, and sending a 4228 - * notification event also results in a benign, spurious event. 4229 - */ 4230 - 4231 - if (vcpu != kvm_get_running_vcpu()) 4232 - __apic_send_IPI_mask(get_cpu_mask(vcpu->cpu), pi_vec); 4233 - return; 4234 - } 4235 - #endif 4236 - /* 4237 - * The vCPU isn't in the guest; wake the vCPU in case it is blocking, 4238 - * otherwise do nothing as KVM will grab the highest priority pending 4239 - * IRQ via ->sync_pir_to_irr() in vcpu_enter_guest(). 
4240 - */ 4241 - kvm_vcpu_wake_up(vcpu); 4242 - } 4243 - 4244 4198 static int vmx_deliver_nested_posted_interrupt(struct kvm_vcpu *vcpu, 4245 4199 int vector) 4246 4200 { ··· 4245 4287 */ 4246 4288 static int vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector) 4247 4289 { 4248 - struct vcpu_vmx *vmx = to_vmx(vcpu); 4290 + struct vcpu_vt *vt = to_vt(vcpu); 4249 4291 int r; 4250 4292 4251 4293 r = vmx_deliver_nested_posted_interrupt(vcpu, vector); ··· 4256 4298 if (!vcpu->arch.apic->apicv_active) 4257 4299 return -1; 4258 4300 4259 - if (pi_test_and_set_pir(vector, &vmx->pi_desc)) 4260 - return 0; 4261 - 4262 - /* If a previous notification has sent the IPI, nothing to do. */ 4263 - if (pi_test_and_set_on(&vmx->pi_desc)) 4264 - return 0; 4265 - 4266 - /* 4267 - * The implied barrier in pi_test_and_set_on() pairs with the smp_mb_*() 4268 - * after setting vcpu->mode in vcpu_enter_guest(), thus the vCPU is 4269 - * guaranteed to see PID.ON=1 and sync the PIR to IRR if triggering a 4270 - * posted interrupt "fails" because vcpu->mode != IN_GUEST_MODE. 4271 - */ 4272 - kvm_vcpu_trigger_posted_interrupt(vcpu, POSTED_INTR_VECTOR); 4301 + __vmx_deliver_posted_interrupt(vcpu, &vt->pi_desc, vector); 4273 4302 return 0; 4274 4303 } 4275 4304 ··· 4723 4778 vmcs_write16(GUEST_INTR_STATUS, 0); 4724 4779 4725 4780 vmcs_write16(POSTED_INTR_NV, POSTED_INTR_VECTOR); 4726 - vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((&vmx->pi_desc))); 4781 + vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((&vmx->vt.pi_desc))); 4727 4782 } 4728 4783 4729 4784 if (vmx_can_use_ipiv(&vmx->vcpu)) { ··· 4836 4891 * Enforce invariant: pi_desc.nv is always either POSTED_INTR_VECTOR 4837 4892 * or POSTED_INTR_WAKEUP_VECTOR. 
4838 4893 */ 4839 - vmx->pi_desc.nv = POSTED_INTR_VECTOR; 4840 - __pi_set_sn(&vmx->pi_desc); 4894 + vmx->vt.pi_desc.nv = POSTED_INTR_VECTOR; 4895 + __pi_set_sn(&vmx->vt.pi_desc); 4841 4896 } 4842 4897 4843 4898 void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) ··· 5754 5809 5755 5810 static int handle_ept_violation(struct kvm_vcpu *vcpu) 5756 5811 { 5757 - unsigned long exit_qualification; 5812 + unsigned long exit_qualification = vmx_get_exit_qual(vcpu); 5758 5813 gpa_t gpa; 5759 - u64 error_code; 5760 - 5761 - exit_qualification = vmx_get_exit_qual(vcpu); 5762 5814 5763 5815 /* 5764 5816 * EPT violation happened while executing iret from NMI, ··· 5771 5829 gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS); 5772 5830 trace_kvm_page_fault(vcpu, gpa, exit_qualification); 5773 5831 5774 - /* Is it a read fault? */ 5775 - error_code = (exit_qualification & EPT_VIOLATION_ACC_READ) 5776 - ? PFERR_USER_MASK : 0; 5777 - /* Is it a write fault? */ 5778 - error_code |= (exit_qualification & EPT_VIOLATION_ACC_WRITE) 5779 - ? PFERR_WRITE_MASK : 0; 5780 - /* Is it a fetch fault? */ 5781 - error_code |= (exit_qualification & EPT_VIOLATION_ACC_INSTR) 5782 - ? PFERR_FETCH_MASK : 0; 5783 - /* ept page table entry is present? */ 5784 - error_code |= (exit_qualification & EPT_VIOLATION_PROT_MASK) 5785 - ? PFERR_PRESENT_MASK : 0; 5786 - 5787 - if (error_code & EPT_VIOLATION_GVA_IS_VALID) 5788 - error_code |= (exit_qualification & EPT_VIOLATION_GVA_TRANSLATED) ? 5789 - PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK; 5790 - 5791 5832 /* 5792 5833 * Check that the GPA doesn't exceed physical memory limits, as that is 5793 5834 * a guest page fault. 
We have to emulate the instruction here, because ··· 5782 5857 if (unlikely(allow_smaller_maxphyaddr && !kvm_vcpu_is_legal_gpa(vcpu, gpa))) 5783 5858 return kvm_emulate_instruction(vcpu, 0); 5784 5859 5785 - return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0); 5860 + return __vmx_handle_ept_violation(vcpu, gpa, exit_qualification); 5786 5861 } 5787 5862 5788 5863 static int handle_ept_misconfig(struct kvm_vcpu *vcpu) ··· 5827 5902 { 5828 5903 struct vcpu_vmx *vmx = to_vmx(vcpu); 5829 5904 5830 - if (!vmx->emulation_required) 5905 + if (!vmx->vt.emulation_required) 5831 5906 return false; 5832 5907 5833 5908 /* ··· 5859 5934 intr_window_requested = exec_controls_get(vmx) & 5860 5935 CPU_BASED_INTR_WINDOW_EXITING; 5861 5936 5862 - while (vmx->emulation_required && count-- != 0) { 5937 + while (vmx->vt.emulation_required && count-- != 0) { 5863 5938 if (intr_window_requested && !vmx_interrupt_blocked(vcpu)) 5864 5939 return handle_interrupt_window(&vmx->vcpu); 5865 5940 ··· 6054 6129 * VM-Exits. Unconditionally set the flag here and leave the handling to 6055 6130 * vmx_handle_exit(). 
6056 6131 */ 6057 - to_vmx(vcpu)->exit_reason.bus_lock_detected = true; 6132 + to_vt(vcpu)->exit_reason.bus_lock_detected = true; 6058 6133 return 1; 6059 6134 } 6060 6135 ··· 6152 6227 { 6153 6228 struct vcpu_vmx *vmx = to_vmx(vcpu); 6154 6229 6155 - *reason = vmx->exit_reason.full; 6230 + *reason = vmx->vt.exit_reason.full; 6156 6231 *info1 = vmx_get_exit_qual(vcpu); 6157 - if (!(vmx->exit_reason.failed_vmentry)) { 6232 + if (!(vmx->vt.exit_reason.failed_vmentry)) { 6158 6233 *info2 = vmx->idt_vectoring_info; 6159 6234 *intr_info = vmx_get_intr_info(vcpu); 6160 6235 if (is_exception_with_error_code(*intr_info)) ··· 6450 6525 static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath) 6451 6526 { 6452 6527 struct vcpu_vmx *vmx = to_vmx(vcpu); 6453 - union vmx_exit_reason exit_reason = vmx->exit_reason; 6528 + union vmx_exit_reason exit_reason = vmx_get_exit_reason(vcpu); 6454 6529 u32 vectoring_info = vmx->idt_vectoring_info; 6455 6530 u16 exit_handler_index; 6456 6531 ··· 6506 6581 * the least awful solution for the userspace case without 6507 6582 * risking false positives. 6508 6583 */ 6509 - if (vmx->emulation_required) { 6584 + if (vmx->vt.emulation_required) { 6510 6585 nested_vmx_vmexit(vcpu, EXIT_REASON_TRIPLE_FAULT, 0, 0); 6511 6586 return 1; 6512 6587 } ··· 6516 6591 } 6517 6592 6518 6593 /* If guest state is invalid, start emulating. L2 is handled above. */ 6519 - if (vmx->emulation_required) 6594 + if (vmx->vt.emulation_required) 6520 6595 return handle_invalid_guest_state(vcpu); 6521 6596 6522 6597 if (exit_reason.failed_vmentry) { ··· 6616 6691 * Exit to user space when bus lock detected to inform that there is 6617 6692 * a bus lock in guest. 
6618 6693 */ 6619 - if (to_vmx(vcpu)->exit_reason.bus_lock_detected) { 6694 + if (vmx_get_exit_reason(vcpu).bus_lock_detected) { 6620 6695 if (ret > 0) 6621 6696 vcpu->run->exit_reason = KVM_EXIT_X86_BUS_LOCK; 6622 6697 ··· 6895 6970 6896 6971 int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu) 6897 6972 { 6898 - struct vcpu_vmx *vmx = to_vmx(vcpu); 6973 + struct vcpu_vt *vt = to_vt(vcpu); 6899 6974 int max_irr; 6900 6975 bool got_posted_interrupt; 6901 6976 6902 6977 if (KVM_BUG_ON(!enable_apicv, vcpu->kvm)) 6903 6978 return -EIO; 6904 6979 6905 - if (pi_test_on(&vmx->pi_desc)) { 6906 - pi_clear_on(&vmx->pi_desc); 6980 + if (pi_test_on(&vt->pi_desc)) { 6981 + pi_clear_on(&vt->pi_desc); 6907 6982 /* 6908 6983 * IOMMU can write to PID.ON, so the barrier matters even on UP. 6909 6984 * But on x86 this is just a compiler barrier anyway. 6910 6985 */ 6911 6986 smp_mb__after_atomic(); 6912 6987 got_posted_interrupt = 6913 - kvm_apic_update_irr(vcpu, vmx->pi_desc.pir, &max_irr); 6988 + kvm_apic_update_irr(vcpu, vt->pi_desc.pir, &max_irr); 6914 6989 } else { 6915 6990 max_irr = kvm_lapic_find_highest_irr(vcpu); 6916 6991 got_posted_interrupt = false; ··· 6948 7023 vmcs_write64(EOI_EXIT_BITMAP1, eoi_exit_bitmap[1]); 6949 7024 vmcs_write64(EOI_EXIT_BITMAP2, eoi_exit_bitmap[2]); 6950 7025 vmcs_write64(EOI_EXIT_BITMAP3, eoi_exit_bitmap[3]); 6951 - } 6952 - 6953 - void vmx_apicv_pre_state_restore(struct kvm_vcpu *vcpu) 6954 - { 6955 - struct vcpu_vmx *vmx = to_vmx(vcpu); 6956 - 6957 - pi_clear_on(&vmx->pi_desc); 6958 - memset(vmx->pi_desc.pir, 0, sizeof(vmx->pi_desc.pir)); 6959 7026 } 6960 7027 6961 7028 void vmx_do_interrupt_irqoff(unsigned long entry); ··· 7006 7089 7007 7090 void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu) 7008 7091 { 7009 - struct vcpu_vmx *vmx = to_vmx(vcpu); 7010 - 7011 - if (vmx->emulation_required) 7092 + if (to_vt(vcpu)->emulation_required) 7012 7093 return; 7013 7094 7014 - if (vmx->exit_reason.basic == EXIT_REASON_EXTERNAL_INTERRUPT) 7095 + if 
(vmx_get_exit_reason(vcpu).basic == EXIT_REASON_EXTERNAL_INTERRUPT) 7015 7096 handle_external_interrupt_irqoff(vcpu, vmx_get_intr_info(vcpu)); 7016 - else if (vmx->exit_reason.basic == EXIT_REASON_EXCEPTION_NMI) 7097 + else if (vmx_get_exit_reason(vcpu).basic == EXIT_REASON_EXCEPTION_NMI) 7017 7098 handle_exception_irqoff(vcpu, vmx_get_intr_info(vcpu)); 7018 7099 } 7019 7100 ··· 7246 7331 * the fastpath even, all other exits must use the slow path. 7247 7332 */ 7248 7333 if (is_guest_mode(vcpu) && 7249 - to_vmx(vcpu)->exit_reason.basic != EXIT_REASON_PREEMPTION_TIMER) 7334 + vmx_get_exit_reason(vcpu).basic != EXIT_REASON_PREEMPTION_TIMER) 7250 7335 return EXIT_FASTPATH_NONE; 7251 7336 7252 - switch (to_vmx(vcpu)->exit_reason.basic) { 7337 + switch (vmx_get_exit_reason(vcpu).basic) { 7253 7338 case EXIT_REASON_MSR_WRITE: 7254 7339 return handle_fastpath_set_msr_irqoff(vcpu); 7255 7340 case EXIT_REASON_PREEMPTION_TIMER: ··· 7259 7344 default: 7260 7345 return EXIT_FASTPATH_NONE; 7261 7346 } 7347 + } 7348 + 7349 + noinstr void vmx_handle_nmi(struct kvm_vcpu *vcpu) 7350 + { 7351 + if ((u16)vmx_get_exit_reason(vcpu).basic != EXIT_REASON_EXCEPTION_NMI || 7352 + !is_nmi(vmx_get_intr_info(vcpu))) 7353 + return; 7354 + 7355 + kvm_before_interrupt(vcpu, KVM_HANDLING_NMI); 7356 + if (cpu_feature_enabled(X86_FEATURE_FRED)) 7357 + fred_entry_from_kvm(EVENT_TYPE_NMI, NMI_VECTOR); 7358 + else 7359 + vmx_do_nmi_irqoff(); 7360 + kvm_after_interrupt(vcpu); 7262 7361 } 7263 7362 7264 7363 static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu, ··· 7310 7381 vmx_enable_fb_clear(vmx); 7311 7382 7312 7383 if (unlikely(vmx->fail)) { 7313 - vmx->exit_reason.full = 0xdead; 7384 + vmx->vt.exit_reason.full = 0xdead; 7314 7385 goto out; 7315 7386 } 7316 7387 7317 - vmx->exit_reason.full = vmcs_read32(VM_EXIT_REASON); 7318 - if (likely(!vmx->exit_reason.failed_vmentry)) 7388 + vmx->vt.exit_reason.full = vmcs_read32(VM_EXIT_REASON); 7389 + if 
(likely(!vmx_get_exit_reason(vcpu).failed_vmentry)) 7319 7390 vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD); 7320 7391 7321 - if ((u16)vmx->exit_reason.basic == EXIT_REASON_EXCEPTION_NMI && 7322 - is_nmi(vmx_get_intr_info(vcpu))) { 7323 - kvm_before_interrupt(vcpu, KVM_HANDLING_NMI); 7324 - if (cpu_feature_enabled(X86_FEATURE_FRED)) 7325 - fred_entry_from_kvm(EVENT_TYPE_NMI, NMI_VECTOR); 7326 - else 7327 - vmx_do_nmi_irqoff(); 7328 - kvm_after_interrupt(vcpu); 7329 - } 7392 + vmx_handle_nmi(vcpu); 7330 7393 7331 7394 out: 7332 7395 guest_state_exit_irqoff(); ··· 7339 7418 * start emulation until we arrive back to a valid state. Synthesize a 7340 7419 * consistency check VM-Exit due to invalid guest state and bail. 7341 7420 */ 7342 - if (unlikely(vmx->emulation_required)) { 7421 + if (unlikely(vmx->vt.emulation_required)) { 7343 7422 vmx->fail = 0; 7344 7423 7345 - vmx->exit_reason.full = EXIT_REASON_INVALID_STATE; 7346 - vmx->exit_reason.failed_vmentry = 1; 7424 + vmx->vt.exit_reason.full = EXIT_REASON_INVALID_STATE; 7425 + vmx->vt.exit_reason.failed_vmentry = 1; 7347 7426 kvm_register_mark_available(vcpu, VCPU_EXREG_EXIT_INFO_1); 7348 - vmx->exit_qualification = ENTRY_FAIL_DEFAULT; 7427 + vmx->vt.exit_qualification = ENTRY_FAIL_DEFAULT; 7349 7428 kvm_register_mark_available(vcpu, VCPU_EXREG_EXIT_INFO_2); 7350 - vmx->exit_intr_info = 0; 7429 + vmx->vt.exit_intr_info = 0; 7351 7430 return EXIT_FASTPATH_NONE; 7352 7431 } 7353 7432 ··· 7450 7529 * checking. 
7451 7530 */ 7452 7531 if (vmx->nested.nested_run_pending && 7453 - !vmx->exit_reason.failed_vmentry) 7532 + !vmx_get_exit_reason(vcpu).failed_vmentry) 7454 7533 ++vcpu->stat.nested_run; 7455 7534 7456 7535 vmx->nested.nested_run_pending = 0; ··· 7459 7538 if (unlikely(vmx->fail)) 7460 7539 return EXIT_FASTPATH_NONE; 7461 7540 7462 - if (unlikely((u16)vmx->exit_reason.basic == EXIT_REASON_MCE_DURING_VMENTRY)) 7541 + if (unlikely((u16)vmx_get_exit_reason(vcpu).basic == EXIT_REASON_MCE_DURING_VMENTRY)) 7463 7542 kvm_machine_check(); 7464 7543 7465 7544 trace_kvm_exit(vcpu, KVM_ISA_VMX); 7466 7545 7467 - if (unlikely(vmx->exit_reason.failed_vmentry)) 7546 + if (unlikely(vmx_get_exit_reason(vcpu).failed_vmentry)) 7468 7547 return EXIT_FASTPATH_NONE; 7469 7548 7470 7549 vmx->loaded_vmcs->launched = 1; ··· 7496 7575 BUILD_BUG_ON(offsetof(struct vcpu_vmx, vcpu) != 0); 7497 7576 vmx = to_vmx(vcpu); 7498 7577 7499 - INIT_LIST_HEAD(&vmx->pi_wakeup_list); 7578 + INIT_LIST_HEAD(&vmx->vt.pi_wakeup_list); 7500 7579 7501 7580 err = -ENOMEM; 7502 7581 ··· 7594 7673 7595 7674 if (vmx_can_use_ipiv(vcpu)) 7596 7675 WRITE_ONCE(to_kvm_vmx(vcpu->kvm)->pid_table[vcpu->vcpu_id], 7597 - __pa(&vmx->pi_desc) | PID_TABLE_ENTRY_VALID); 7676 + __pa(&vmx->vt.pi_desc) | PID_TABLE_ENTRY_VALID); 7598 7677 7599 7678 return 0; 7600 7679 ··· 7638 7717 break; 7639 7718 } 7640 7719 } 7720 + 7721 + if (enable_pml) 7722 + kvm->arch.cpu_dirty_log_size = PML_LOG_NR_ENTRIES; 7641 7723 return 0; 7724 + } 7725 + 7726 + static inline bool vmx_ignore_guest_pat(struct kvm *kvm) 7727 + { 7728 + /* 7729 + * Non-coherent DMA devices need the guest to flush CPU properly. 7730 + * In that case it is not possible to map all guest RAM as WB, so 7731 + * always trust guest PAT. 
7732 + */ 7733 + return !kvm_arch_has_noncoherent_dma(kvm) && 7734 + kvm_check_has_quirk(kvm, KVM_X86_QUIRK_IGNORE_GUEST_PAT); 7642 7735 } 7643 7736 7644 7737 u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) ··· 7664 7729 if (is_mmio) 7665 7730 return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT; 7666 7731 7667 - /* 7668 - * Force WB and ignore guest PAT if the VM does NOT have a non-coherent 7669 - * device attached. Letting the guest control memory types on Intel 7670 - * CPUs may result in unexpected behavior, and so KVM's ABI is to trust 7671 - * the guest to behave only as a last resort. 7672 - */ 7673 - if (!kvm_arch_has_noncoherent_dma(vcpu->kvm)) 7732 + /* Force WB if ignoring guest PAT */ 7733 + if (vmx_ignore_guest_pat(vcpu->kvm)) 7674 7734 return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT; 7675 7735 7676 7736 return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT); ··· 8527 8597 if (enable_ept) 8528 8598 kvm_mmu_set_ept_masks(enable_ept_ad_bits, 8529 8599 cpu_has_vmx_ept_execute_only()); 8600 + else 8601 + vt_x86_ops.get_mt_mask = NULL; 8530 8602 8531 8603 /* 8532 8604 * Setup shadow_me_value/shadow_me_mask to include MKTME KeyID ··· 8545 8613 */ 8546 8614 if (!enable_ept || !enable_ept_ad_bits || !cpu_has_vmx_pml()) 8547 8615 enable_pml = 0; 8548 - 8549 - if (!enable_pml) 8550 - vt_x86_ops.cpu_dirty_log_size = 0; 8551 8616 8552 8617 if (!cpu_has_vmx_preemption_timer()) 8553 8618 enable_preemption_timer = false; ··· 8603 8674 8604 8675 kvm_set_posted_intr_wakeup_handler(pi_wakeup_handler); 8605 8676 8677 + /* 8678 + * On Intel CPUs that lack self-snoop feature, letting the guest control 8679 + * memory types may result in unexpected behavior. So always ignore guest 8680 + * PAT on those CPUs and map VM as writeback, not allowing userspace to 8681 + * disable the quirk. 8682 + * 8683 + * On certain Intel CPUs (e.g. 
SPR, ICX), though self-snoop feature is 8684 + * supported, UC is slow enough to cause issues with some older guests (e.g. 8685 + * an old version of bochs driver uses ioremap() instead of ioremap_wc() to 8686 + * map the video RAM, causing wayland desktop to fail to get started 8687 + * correctly). To avoid breaking those older guests that rely on KVM to force 8688 + * memory type to WB, provide KVM_X86_QUIRK_IGNORE_GUEST_PAT to preserve the 8689 + * safer (for performance) default behavior. 8690 + * 8691 + * On top of this, non-coherent DMA devices need the guest to flush CPU 8692 + * caches properly. This also requires honoring guest PAT, and is forced 8693 + * independent of the quirk in vmx_ignore_guest_pat(). 8694 + */ 8695 + if (!static_cpu_has(X86_FEATURE_SELFSNOOP)) 8696 + kvm_caps.supported_quirks &= ~KVM_X86_QUIRK_IGNORE_GUEST_PAT; 8697 + kvm_caps.inapplicable_quirks &= ~KVM_X86_QUIRK_IGNORE_GUEST_PAT; 8606 8698 return r; 8607 8699 } 8608 8700 ··· 8637 8687 l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_AUTO; 8638 8688 } 8639 8689 8640 - static void __vmx_exit(void) 8690 + void vmx_exit(void) 8641 8691 { 8642 8692 allow_smaller_maxphyaddr = false; 8643 8693 8644 8694 vmx_cleanup_l1d_flush(); 8645 - } 8646 8695 8647 - static void __exit vmx_exit(void) 8648 - { 8649 - kvm_exit(); 8650 - __vmx_exit(); 8651 8696 kvm_x86_vendor_exit(); 8652 - 8653 8697 } 8654 - module_exit(vmx_exit); 8655 8698 8656 - static int __init vmx_init(void) 8699 + int __init vmx_init(void) 8657 8700 { 8658 8701 int r, cpu; 8659 8702 ··· 8690 8747 if (!enable_ept) 8691 8748 allow_smaller_maxphyaddr = true; 8692 8749 8693 - /* 8694 - * Common KVM initialization _must_ come last, after this, /dev/kvm is 8695 - * exposed to userspace! 
8696 - */ 8697 - r = kvm_init(sizeof(struct vcpu_vmx), __alignof__(struct vcpu_vmx), 8698 - THIS_MODULE); 8699 - if (r) 8700 - goto err_kvm_init; 8701 - 8702 8750 return 0; 8703 8751 8704 - err_kvm_init: 8705 - __vmx_exit(); 8706 8752 err_l1d_flush: 8707 8753 kvm_x86_vendor_exit(); 8708 8754 return r; 8709 8755 } 8710 - module_init(vmx_init);
+43 -97
arch/x86/kvm/vmx/vmx.h
··· 11 11 12 12 #include "capabilities.h" 13 13 #include "../kvm_cache_regs.h" 14 + #include "pmu_intel.h" 14 15 #include "vmcs.h" 15 16 #include "vmx_ops.h" 16 17 #include "../cpuid.h" 17 18 #include "run_flags.h" 18 19 #include "../mmu.h" 20 + #include "common.h" 19 21 20 22 #define X2APIC_MSR(r) (APIC_BASE_MSR + ((r) >> 4)) 21 23 ··· 68 66 struct pt_ctx host; 69 67 struct pt_ctx guest; 70 68 }; 71 - 72 - union vmx_exit_reason { 73 - struct { 74 - u32 basic : 16; 75 - u32 reserved16 : 1; 76 - u32 reserved17 : 1; 77 - u32 reserved18 : 1; 78 - u32 reserved19 : 1; 79 - u32 reserved20 : 1; 80 - u32 reserved21 : 1; 81 - u32 reserved22 : 1; 82 - u32 reserved23 : 1; 83 - u32 reserved24 : 1; 84 - u32 reserved25 : 1; 85 - u32 bus_lock_detected : 1; 86 - u32 enclave_mode : 1; 87 - u32 smi_pending_mtf : 1; 88 - u32 smi_from_vmx_root : 1; 89 - u32 reserved30 : 1; 90 - u32 failed_vmentry : 1; 91 - }; 92 - u32 full; 93 - }; 94 - 95 - struct lbr_desc { 96 - /* Basic info about guest LBR records. */ 97 - struct x86_pmu_lbr records; 98 - 99 - /* 100 - * Emulate LBR feature via passthrough LBR registers when the 101 - * per-vcpu guest LBR event is scheduled on the current pcpu. 102 - * 103 - * The records may be inaccurate if the host reclaims the LBR. 104 - */ 105 - struct perf_event *event; 106 - 107 - /* True if LBRs are marked as not intercepted in the MSR bitmap */ 108 - bool msr_passthrough; 109 - }; 110 - 111 - extern struct x86_pmu_lbr vmx_lbr_caps; 112 69 113 70 /* 114 71 * The nested_vmx structure is part of vcpu_vmx, and holds information we need ··· 209 248 210 249 struct vcpu_vmx { 211 250 struct kvm_vcpu vcpu; 251 + struct vcpu_vt vt; 212 252 u8 fail; 213 253 u8 x2apic_msr_bitmap_mode; 214 254 215 - /* 216 - * If true, host state has been stored in vmx->loaded_vmcs for 217 - * the CPU registers that only need to be switched when transitioning 218 - * to/from the kernel, and the registers have been loaded with guest 219 - * values. 
If false, host state is loaded in the CPU registers 220 - * and vmx->loaded_vmcs->host_state is invalid. 221 - */ 222 - bool guest_state_loaded; 223 - 224 - unsigned long exit_qualification; 225 - u32 exit_intr_info; 226 255 u32 idt_vectoring_info; 227 256 ulong rflags; 228 257 ··· 225 274 struct vmx_uret_msr guest_uret_msrs[MAX_NR_USER_RETURN_MSRS]; 226 275 bool guest_uret_msrs_loaded; 227 276 #ifdef CONFIG_X86_64 228 - u64 msr_host_kernel_gs_base; 229 277 u64 msr_guest_kernel_gs_base; 230 278 #endif 231 279 ··· 263 313 } seg[8]; 264 314 } segment_cache; 265 315 int vpid; 266 - bool emulation_required; 267 - 268 - union vmx_exit_reason exit_reason; 269 - 270 - /* Posted interrupt descriptor */ 271 - struct pi_desc pi_desc; 272 - 273 - /* Used if this vCPU is waiting for PI notification wakeup. */ 274 - struct list_head pi_wakeup_list; 275 316 276 317 /* Support for a guest hypervisor (nested VMX) */ 277 318 struct nested_vmx nested; ··· 316 375 /* Posted Interrupt Descriptor (PID) table for IPI virtualization */ 317 376 u64 *pid_table; 318 377 }; 378 + 379 + static __always_inline struct vcpu_vt *to_vt(struct kvm_vcpu *vcpu) 380 + { 381 + return &(container_of(vcpu, struct vcpu_vmx, vcpu)->vt); 382 + } 383 + 384 + static __always_inline struct kvm_vcpu *vt_to_vcpu(struct vcpu_vt *vt) 385 + { 386 + return &(container_of(vt, struct vcpu_vmx, vt)->vcpu); 387 + } 388 + 389 + static __always_inline union vmx_exit_reason vmx_get_exit_reason(struct kvm_vcpu *vcpu) 390 + { 391 + return to_vt(vcpu)->exit_reason; 392 + } 393 + 394 + static __always_inline unsigned long vmx_get_exit_qual(struct kvm_vcpu *vcpu) 395 + { 396 + struct vcpu_vt *vt = to_vt(vcpu); 397 + 398 + if (!kvm_register_test_and_mark_available(vcpu, VCPU_EXREG_EXIT_INFO_1) && 399 + !WARN_ON_ONCE(is_td_vcpu(vcpu))) 400 + vt->exit_qualification = vmcs_readl(EXIT_QUALIFICATION); 401 + 402 + return vt->exit_qualification; 403 + } 404 + 405 + static __always_inline u32 vmx_get_intr_info(struct kvm_vcpu *vcpu) 406 
+ { 407 + struct vcpu_vt *vt = to_vt(vcpu); 408 + 409 + if (!kvm_register_test_and_mark_available(vcpu, VCPU_EXREG_EXIT_INFO_2) && 410 + !WARN_ON_ONCE(is_td_vcpu(vcpu))) 411 + vt->exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO); 412 + 413 + return vt->exit_intr_info; 414 + } 319 415 320 416 void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu, 321 417 struct loaded_vmcs *buddy); ··· 640 662 return container_of(vcpu, struct vcpu_vmx, vcpu); 641 663 } 642 664 643 - static inline struct lbr_desc *vcpu_to_lbr_desc(struct kvm_vcpu *vcpu) 644 - { 645 - return &to_vmx(vcpu)->lbr_desc; 646 - } 647 - 648 - static inline struct x86_pmu_lbr *vcpu_to_lbr_records(struct kvm_vcpu *vcpu) 649 - { 650 - return &vcpu_to_lbr_desc(vcpu)->records; 651 - } 652 - 653 - static inline bool intel_pmu_lbr_is_enabled(struct kvm_vcpu *vcpu) 654 - { 655 - return !!vcpu_to_lbr_records(vcpu)->nr; 656 - } 657 - 658 665 void intel_pmu_cross_mapped_check(struct kvm_pmu *pmu); 659 666 int intel_pmu_create_guest_lbr_event(struct kvm_vcpu *vcpu); 660 667 void vmx_passthrough_lbr_msrs(struct kvm_vcpu *vcpu); 661 - 662 - static __always_inline unsigned long vmx_get_exit_qual(struct kvm_vcpu *vcpu) 663 - { 664 - struct vcpu_vmx *vmx = to_vmx(vcpu); 665 - 666 - if (!kvm_register_test_and_mark_available(vcpu, VCPU_EXREG_EXIT_INFO_1)) 667 - vmx->exit_qualification = vmcs_readl(EXIT_QUALIFICATION); 668 - 669 - return vmx->exit_qualification; 670 - } 671 - 672 - static __always_inline u32 vmx_get_intr_info(struct kvm_vcpu *vcpu) 673 - { 674 - struct vcpu_vmx *vmx = to_vmx(vcpu); 675 - 676 - if (!kvm_register_test_and_mark_available(vcpu, VCPU_EXREG_EXIT_INFO_2)) 677 - vmx->exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO); 678 - 679 - return vmx->exit_intr_info; 680 - } 681 668 682 669 struct vmcs *alloc_vmcs_cpu(bool shadow, int cpu, gfp_t flags); 683 670 void free_vmcs(struct vmcs *vmcs); ··· 700 757 { 701 758 vmx->segment_cache.bitmask = 0; 702 759 } 760 + 761 + int vmx_init(void); 762 + void 
vmx_exit(void); 703 763 704 764 #endif /* __KVM_X86_VMX_H */
+110 -1
arch/x86/kvm/vmx/x86_ops.h
··· 46 46 bool vmx_apic_init_signal_blocked(struct kvm_vcpu *vcpu); 47 47 void vmx_migrate_timers(struct kvm_vcpu *vcpu); 48 48 void vmx_set_virtual_apic_mode(struct kvm_vcpu *vcpu); 49 - void vmx_apicv_pre_state_restore(struct kvm_vcpu *vcpu); 50 49 void vmx_hwapic_isr_update(struct kvm_vcpu *vcpu, int max_isr); 51 50 int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu); 52 51 void vmx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode, ··· 119 120 void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu); 120 121 #endif 121 122 void vmx_setup_mce(struct kvm_vcpu *vcpu); 123 + 124 + #ifdef CONFIG_KVM_INTEL_TDX 125 + void tdx_disable_virtualization_cpu(void); 126 + int tdx_vm_init(struct kvm *kvm); 127 + void tdx_mmu_release_hkid(struct kvm *kvm); 128 + void tdx_vm_destroy(struct kvm *kvm); 129 + int tdx_vm_ioctl(struct kvm *kvm, void __user *argp); 130 + 131 + int tdx_vcpu_create(struct kvm_vcpu *vcpu); 132 + void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event); 133 + void tdx_vcpu_free(struct kvm_vcpu *vcpu); 134 + void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu); 135 + int tdx_vcpu_pre_run(struct kvm_vcpu *vcpu); 136 + fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit); 137 + void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu); 138 + void tdx_vcpu_put(struct kvm_vcpu *vcpu); 139 + bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu); 140 + int tdx_handle_exit(struct kvm_vcpu *vcpu, 141 + enum exit_fastpath_completion fastpath); 142 + 143 + void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode, 144 + int trig_mode, int vector); 145 + void tdx_inject_nmi(struct kvm_vcpu *vcpu); 146 + void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason, 147 + u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code); 148 + bool tdx_has_emulated_msr(u32 index); 149 + int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr); 150 + int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr); 151 + 152 + int 
tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp); 153 + 154 + int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn, 155 + enum pg_level level, void *private_spt); 156 + int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn, 157 + enum pg_level level, void *private_spt); 158 + int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn, 159 + enum pg_level level, kvm_pfn_t pfn); 160 + int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn, 161 + enum pg_level level, kvm_pfn_t pfn); 162 + 163 + void tdx_flush_tlb_current(struct kvm_vcpu *vcpu); 164 + void tdx_flush_tlb_all(struct kvm_vcpu *vcpu); 165 + void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level); 166 + int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn); 167 + #else 168 + static inline void tdx_disable_virtualization_cpu(void) {} 169 + static inline int tdx_vm_init(struct kvm *kvm) { return -EOPNOTSUPP; } 170 + static inline void tdx_mmu_release_hkid(struct kvm *kvm) {} 171 + static inline void tdx_vm_destroy(struct kvm *kvm) {} 172 + static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; } 173 + 174 + static inline int tdx_vcpu_create(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; } 175 + static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {} 176 + static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {} 177 + static inline void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu) {} 178 + static inline int tdx_vcpu_pre_run(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; } 179 + static inline fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit) 180 + { 181 + return EXIT_FASTPATH_NONE; 182 + } 183 + static inline void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu) {} 184 + static inline void tdx_vcpu_put(struct kvm_vcpu *vcpu) {} 185 + static inline bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu) { return false; } 186 + static inline int tdx_handle_exit(struct 
kvm_vcpu *vcpu, 187 + enum exit_fastpath_completion fastpath) { return 0; } 188 + 189 + static inline void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode, 190 + int trig_mode, int vector) {} 191 + static inline void tdx_inject_nmi(struct kvm_vcpu *vcpu) {} 192 + static inline void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason, u64 *info1, 193 + u64 *info2, u32 *intr_info, u32 *error_code) {} 194 + static inline bool tdx_has_emulated_msr(u32 index) { return false; } 195 + static inline int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { return 1; } 196 + static inline int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { return 1; } 197 + 198 + static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; } 199 + 200 + static inline int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn, 201 + enum pg_level level, 202 + void *private_spt) 203 + { 204 + return -EOPNOTSUPP; 205 + } 206 + 207 + static inline int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn, 208 + enum pg_level level, 209 + void *private_spt) 210 + { 211 + return -EOPNOTSUPP; 212 + } 213 + 214 + static inline int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn, 215 + enum pg_level level, 216 + kvm_pfn_t pfn) 217 + { 218 + return -EOPNOTSUPP; 219 + } 220 + 221 + static inline int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn, 222 + enum pg_level level, 223 + kvm_pfn_t pfn) 224 + { 225 + return -EOPNOTSUPP; 226 + } 227 + 228 + static inline void tdx_flush_tlb_current(struct kvm_vcpu *vcpu) {} 229 + static inline void tdx_flush_tlb_all(struct kvm_vcpu *vcpu) {} 230 + static inline void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level) {} 231 + static inline int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn) { return 0; } 232 + #endif 122 233 123 234 #endif /* __KVM_X86_VMX_X86_OPS_H */
+68 -31
arch/x86/kvm/x86.c
··· 90 90 #include "trace.h" 91 91 92 92 #define MAX_IO_MSRS 256 93 - #define KVM_MAX_MCE_BANKS 32 94 93 95 94 /* 96 95 * Note, kvm_caps fields should *never* have default values, all fields must be ··· 635 636 } 636 637 } 637 638 639 + static void kvm_user_return_register_notifier(struct kvm_user_return_msrs *msrs) 640 + { 641 + if (!msrs->registered) { 642 + msrs->urn.on_user_return = kvm_on_user_return; 643 + user_return_notifier_register(&msrs->urn); 644 + msrs->registered = true; 645 + } 646 + } 647 + 638 648 int kvm_set_user_return_msr(unsigned slot, u64 value, u64 mask) 639 649 { 640 650 struct kvm_user_return_msrs *msrs = this_cpu_ptr(user_return_msrs); ··· 657 649 return 1; 658 650 659 651 msrs->values[slot].curr = value; 660 - if (!msrs->registered) { 661 - msrs->urn.on_user_return = kvm_on_user_return; 662 - user_return_notifier_register(&msrs->urn); 663 - msrs->registered = true; 664 - } 652 + kvm_user_return_register_notifier(msrs); 665 653 return 0; 666 654 } 667 655 EXPORT_SYMBOL_GPL(kvm_set_user_return_msr); 656 + 657 + void kvm_user_return_msr_update_cache(unsigned int slot, u64 value) 658 + { 659 + struct kvm_user_return_msrs *msrs = this_cpu_ptr(user_return_msrs); 660 + 661 + msrs->values[slot].curr = value; 662 + kvm_user_return_register_notifier(msrs); 663 + } 664 + EXPORT_SYMBOL_GPL(kvm_user_return_msr_update_cache); 668 665 669 666 static void drop_user_return_notifiers(void) 670 667 { ··· 4750 4737 break; 4751 4738 case KVM_CAP_MAX_VCPUS: 4752 4739 r = KVM_MAX_VCPUS; 4740 + if (kvm) 4741 + r = kvm->max_vcpus; 4753 4742 break; 4754 4743 case KVM_CAP_MAX_VCPU_ID: 4755 4744 r = KVM_MAX_VCPU_IDS; ··· 4807 4792 r = enable_pmu ? 
KVM_CAP_PMU_VALID_MASK : 0; 4808 4793 break; 4809 4794 case KVM_CAP_DISABLE_QUIRKS2: 4810 - r = KVM_X86_VALID_QUIRKS; 4795 + r = kvm_caps.supported_quirks; 4811 4796 break; 4812 4797 case KVM_CAP_X86_NOTIFY_VMEXIT: 4813 4798 r = kvm_caps.has_notify_vmexit; ··· 5130 5115 static int kvm_vcpu_ioctl_get_lapic(struct kvm_vcpu *vcpu, 5131 5116 struct kvm_lapic_state *s) 5132 5117 { 5118 + if (vcpu->arch.apic->guest_apic_protected) 5119 + return -EINVAL; 5120 + 5133 5121 kvm_x86_call(sync_pir_to_irr)(vcpu); 5134 5122 5135 5123 return kvm_apic_get_state(vcpu, s); ··· 5142 5124 struct kvm_lapic_state *s) 5143 5125 { 5144 5126 int r; 5127 + 5128 + if (vcpu->arch.apic->guest_apic_protected) 5129 + return -EINVAL; 5145 5130 5146 5131 r = kvm_apic_set_state(vcpu, s); 5147 5132 if (r) ··· 6323 6302 case KVM_SET_DEVICE_ATTR: 6324 6303 r = kvm_vcpu_ioctl_device_attr(vcpu, ioctl, argp); 6325 6304 break; 6305 + case KVM_MEMORY_ENCRYPT_OP: 6306 + r = -ENOTTY; 6307 + if (!kvm_x86_ops.vcpu_mem_enc_ioctl) 6308 + goto out; 6309 + r = kvm_x86_ops.vcpu_mem_enc_ioctl(vcpu, argp); 6310 + break; 6326 6311 default: 6327 6312 r = -EINVAL; 6328 6313 } ··· 6516 6489 struct kvm_vcpu *vcpu; 6517 6490 unsigned long i; 6518 6491 6519 - if (!kvm_x86_ops.cpu_dirty_log_size) 6492 + if (!kvm->arch.cpu_dirty_log_size) 6520 6493 return; 6521 6494 6522 6495 kvm_for_each_vcpu(i, vcpu, kvm) ··· 6546 6519 switch (cap->cap) { 6547 6520 case KVM_CAP_DISABLE_QUIRKS2: 6548 6521 r = -EINVAL; 6549 - if (cap->args[0] & ~KVM_X86_VALID_QUIRKS) 6522 + if (cap->args[0] & ~kvm_caps.supported_quirks) 6550 6523 break; 6551 6524 fallthrough; 6552 6525 case KVM_CAP_DISABLE_QUIRKS: 6553 - kvm->arch.disabled_quirks = cap->args[0]; 6526 + kvm->arch.disabled_quirks |= cap->args[0] & kvm_caps.supported_quirks; 6554 6527 r = 0; 6555 6528 break; 6556 6529 case KVM_CAP_SPLIT_IRQCHIP: { ··· 7325 7298 goto out; 7326 7299 } 7327 7300 case KVM_MEMORY_ENCRYPT_OP: { 7328 - r = -ENOTTY; 7329 - if (!kvm_x86_ops.mem_enc_ioctl) 7330 - goto 
out; 7331 - 7332 7301 r = kvm_x86_call(mem_enc_ioctl)(kvm, argp); 7333 7302 break; 7334 7303 } ··· 9792 9769 kvm_host.xcr0 = xgetbv(XCR_XFEATURE_ENABLED_MASK); 9793 9770 kvm_caps.supported_xcr0 = kvm_host.xcr0 & KVM_SUPPORTED_XCR0; 9794 9771 } 9772 + kvm_caps.supported_quirks = KVM_X86_VALID_QUIRKS; 9773 + kvm_caps.inapplicable_quirks = KVM_X86_CONDITIONAL_QUIRKS; 9795 9774 9796 9775 rdmsrl_safe(MSR_EFER, &kvm_host.efer); 9797 9776 ··· 9837 9812 9838 9813 if (IS_ENABLED(CONFIG_KVM_SW_PROTECTED_VM) && tdp_mmu_enabled) 9839 9814 kvm_caps.supported_vm_types |= BIT(KVM_X86_SW_PROTECTED_VM); 9815 + 9816 + /* KVM always ignores guest PAT for shadow paging. */ 9817 + if (!tdp_enabled) 9818 + kvm_caps.supported_quirks &= ~KVM_X86_QUIRK_IGNORE_GUEST_PAT; 9840 9819 9841 9820 if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES)) 9842 9821 kvm_caps.supported_xss = 0; ··· 10050 10021 return kvm_skip_emulated_instruction(vcpu); 10051 10022 } 10052 10023 10053 - int ____kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr, 10054 - unsigned long a0, unsigned long a1, 10055 - unsigned long a2, unsigned long a3, 10056 - int op_64_bit, int cpl, 10024 + int ____kvm_emulate_hypercall(struct kvm_vcpu *vcpu, int cpl, 10057 10025 int (*complete_hypercall)(struct kvm_vcpu *)) 10058 10026 { 10059 10027 unsigned long ret; 10028 + unsigned long nr = kvm_rax_read(vcpu); 10029 + unsigned long a0 = kvm_rbx_read(vcpu); 10030 + unsigned long a1 = kvm_rcx_read(vcpu); 10031 + unsigned long a2 = kvm_rdx_read(vcpu); 10032 + unsigned long a3 = kvm_rsi_read(vcpu); 10033 + int op_64_bit = is_64_bit_hypercall(vcpu); 10060 10034 10061 10035 ++vcpu->stat.hypercalls; 10062 10036 ··· 10162 10130 if (kvm_hv_hypercall_enabled(vcpu)) 10163 10131 return kvm_hv_hypercall(vcpu); 10164 10132 10165 - return __kvm_emulate_hypercall(vcpu, rax, rbx, rcx, rdx, rsi, 10166 - is_64_bit_hypercall(vcpu), 10167 - kvm_x86_call(get_cpl)(vcpu), 10133 + return __kvm_emulate_hypercall(vcpu, kvm_x86_call(get_cpl)(vcpu), 10168 10134 
complete_hypercall_exit); 10169 10135 } 10170 10136 EXPORT_SYMBOL_GPL(kvm_emulate_hypercall); ··· 11006 10976 if (vcpu->arch.guest_fpu.xfd_err) 11007 10977 wrmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err); 11008 10978 11009 - if (unlikely(vcpu->arch.switch_db_regs)) { 10979 + if (unlikely(vcpu->arch.switch_db_regs && 10980 + !(vcpu->arch.switch_db_regs & KVM_DEBUGREG_AUTO_SWITCH))) { 11010 10981 set_debugreg(0, 7); 11011 10982 set_debugreg(vcpu->arch.eff_db[0], 0); 11012 10983 set_debugreg(vcpu->arch.eff_db[1], 1); ··· 11059 11028 */ 11060 11029 if (unlikely(vcpu->arch.switch_db_regs & KVM_DEBUGREG_WONT_EXIT)) { 11061 11030 WARN_ON(vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP); 11031 + WARN_ON(vcpu->arch.switch_db_regs & KVM_DEBUGREG_AUTO_SWITCH); 11062 11032 kvm_x86_call(sync_dirty_debug_regs)(vcpu); 11063 11033 kvm_update_dr0123(vcpu); 11064 11034 kvm_update_dr7(vcpu); ··· 11163 11131 !vcpu->arch.apf.halted); 11164 11132 } 11165 11133 11166 - static bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu) 11134 + bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu) 11167 11135 { 11168 11136 if (!list_empty_careful(&vcpu->async_pf.done)) 11169 11137 return true; 11170 11138 11171 11139 if (kvm_apic_has_pending_init_or_sipi(vcpu) && 11172 11140 kvm_apic_init_sipi_allowed(vcpu)) 11173 - return true; 11174 - 11175 - if (vcpu->arch.pv.pv_unhalted) 11176 11141 return true; 11177 11142 11178 11143 if (kvm_is_exception_pending(vcpu)) ··· 11209 11180 11210 11181 return false; 11211 11182 } 11183 + EXPORT_SYMBOL_GPL(kvm_vcpu_has_events); 11212 11184 11213 11185 int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu) 11214 11186 { 11215 - return kvm_vcpu_running(vcpu) || kvm_vcpu_has_events(vcpu); 11187 + return kvm_vcpu_running(vcpu) || vcpu->arch.pv.pv_unhalted || 11188 + kvm_vcpu_has_events(vcpu); 11216 11189 } 11217 11190 11218 11191 /* Called within kvm->srcu read side. 
*/ ··· 11348 11317 */ 11349 11318 ++vcpu->stat.halt_exits; 11350 11319 if (lapic_in_kernel(vcpu)) { 11351 - if (kvm_vcpu_has_events(vcpu)) 11320 + if (kvm_vcpu_has_events(vcpu) || vcpu->arch.pv.pv_unhalted) 11352 11321 state = KVM_MP_STATE_RUNNABLE; 11353 11322 kvm_set_mp_state(vcpu, state); 11354 11323 return 1; ··· 12722 12691 { 12723 12692 return vcpu->kvm->arch.bsp_vcpu_id == vcpu->vcpu_id; 12724 12693 } 12694 + EXPORT_SYMBOL_GPL(kvm_vcpu_is_reset_bsp); 12725 12695 12726 12696 bool kvm_vcpu_is_bsp(struct kvm_vcpu *vcpu) 12727 12697 { ··· 12752 12720 /* Decided by the vendor code for other VM types. */ 12753 12721 kvm->arch.pre_fault_allowed = 12754 12722 type == KVM_X86_DEFAULT_VM || type == KVM_X86_SW_PROTECTED_VM; 12723 + kvm->arch.disabled_quirks = kvm_caps.inapplicable_quirks & kvm_caps.supported_quirks; 12755 12724 12756 12725 ret = kvm_page_track_init(kvm); 12757 12726 if (ret) ··· 12906 12873 kvm_free_pit(kvm); 12907 12874 12908 12875 kvm_mmu_pre_destroy_vm(kvm); 12876 + static_call_cond(kvm_x86_vm_pre_destroy)(kvm); 12909 12877 } 12910 12878 12911 12879 void kvm_arch_destroy_vm(struct kvm *kvm) ··· 13104 13070 { 13105 13071 int nr_slots; 13106 13072 13107 - if (!kvm_x86_ops.cpu_dirty_log_size) 13073 + if (!kvm->arch.cpu_dirty_log_size) 13108 13074 return; 13109 13075 13110 13076 nr_slots = atomic_read(&kvm->nr_memslots_dirty_logging); ··· 13176 13142 if (READ_ONCE(eager_page_split)) 13177 13143 kvm_mmu_slot_try_split_huge_pages(kvm, new, PG_LEVEL_4K); 13178 13144 13179 - if (kvm_x86_ops.cpu_dirty_log_size) { 13145 + if (kvm->arch.cpu_dirty_log_size) { 13180 13146 kvm_mmu_slot_leaf_clear_dirty(kvm, new); 13181 13147 kvm_mmu_slot_remove_write_access(kvm, new, PG_LEVEL_2M); 13182 13148 } else { ··· 13565 13531 * due to toggling the "ignore PAT" bit. Zap all SPTEs when the first 13566 13532 * (or last) non-coherent device is (un)registered to so that new SPTEs 13567 13533 * with the correct "ignore guest PAT" setting are created. 
13534 + * 13535 + * If KVM always honors guest PAT, however, there is nothing to do. 13568 13536 */ 13569 - if (kvm_mmu_may_ignore_guest_pat()) 13537 + if (kvm_check_has_quirk(kvm, KVM_X86_QUIRK_IGNORE_GUEST_PAT)) 13570 13538 kvm_zap_gfn_range(kvm, gpa_to_gfn(0), gpa_to_gfn(~0ULL)); 13571 13539 } 13572 13540 ··· 14036 14000 14037 14001 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_entry); 14038 14002 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit); 14003 + EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_mmio); 14039 14004 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio); 14040 14005 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq); 14041 14006 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_page_fault);
+14 -17
arch/x86/kvm/x86.h
··· 10 10 #include "kvm_emulate.h" 11 11 #include "cpuid.h" 12 12 13 + #define KVM_MAX_MCE_BANKS 32 14 + 13 15 struct kvm_caps { 14 16 /* control of guest tsc rate supported? */ 15 17 bool has_tsc_control; ··· 34 32 u64 supported_xcr0; 35 33 u64 supported_xss; 36 34 u64 supported_perf_cap; 35 + 36 + u64 supported_quirks; 37 + u64 inapplicable_quirks; 37 38 }; 38 39 39 40 struct kvm_host_values { ··· 634 629 return kvm->arch.hypercall_exit_enabled & BIT(hc_nr); 635 630 } 636 631 637 - int ____kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr, 638 - unsigned long a0, unsigned long a1, 639 - unsigned long a2, unsigned long a3, 640 - int op_64_bit, int cpl, 632 + int ____kvm_emulate_hypercall(struct kvm_vcpu *vcpu, int cpl, 641 633 int (*complete_hypercall)(struct kvm_vcpu *)); 642 634 643 - #define __kvm_emulate_hypercall(_vcpu, nr, a0, a1, a2, a3, op_64_bit, cpl, complete_hypercall) \ 644 - ({ \ 645 - int __ret; \ 646 - \ 647 - __ret = ____kvm_emulate_hypercall(_vcpu, \ 648 - kvm_##nr##_read(_vcpu), kvm_##a0##_read(_vcpu), \ 649 - kvm_##a1##_read(_vcpu), kvm_##a2##_read(_vcpu), \ 650 - kvm_##a3##_read(_vcpu), op_64_bit, cpl, \ 651 - complete_hypercall); \ 652 - \ 653 - if (__ret > 0) \ 654 - __ret = complete_hypercall(_vcpu); \ 655 - __ret; \ 635 + #define __kvm_emulate_hypercall(_vcpu, cpl, complete_hypercall) \ 636 + ({ \ 637 + int __ret; \ 638 + __ret = ____kvm_emulate_hypercall(_vcpu, cpl, complete_hypercall); \ 639 + \ 640 + if (__ret > 0) \ 641 + __ret = complete_hypercall(_vcpu); \ 642 + __ret; \ 656 643 }) 657 644 658 645 int kvm_emulate_hypercall(struct kvm_vcpu *vcpu);
+3
arch/x86/virt/vmx/tdx/seamcall.S
··· 41 41 TDX_MODULE_CALL host=1 ret=1 42 42 SYM_FUNC_END(__seamcall_ret) 43 43 44 + /* KVM requires non-instrumentable __seamcall_saved_ret() for TDH.VP.ENTER */ 45 + .section .noinstr.text, "ax" 46 + 44 47 /* 45 48 * __seamcall_saved_ret() - Host-side interface functions to SEAM software 46 49 * (the P-SEAMLDR or the TDX module), with saving output registers to the
+418 -5
arch/x86/virt/vmx/tdx/tdx.c
··· 5 5 * Intel Trusted Domain Extensions (TDX) support 6 6 */ 7 7 8 + #include "asm/page_types.h" 8 9 #define pr_fmt(fmt) "virt/tdx: " fmt 9 10 10 11 #include <linux/types.h> ··· 28 27 #include <linux/log2.h> 29 28 #include <linux/acpi.h> 30 29 #include <linux/suspend.h> 30 + #include <linux/idr.h> 31 31 #include <asm/page.h> 32 32 #include <asm/special_insns.h> 33 33 #include <asm/msr-index.h> ··· 44 42 static u32 tdx_guest_keyid_start __ro_after_init; 45 43 static u32 tdx_nr_guest_keyids __ro_after_init; 46 44 45 + static DEFINE_IDA(tdx_guest_keyid_pool); 46 + 47 47 static DEFINE_PER_CPU(bool, tdx_lp_initialized); 48 48 49 49 static struct tdmr_info_list tdx_tdmr_list; ··· 55 51 56 52 /* All TDX-usable memory regions. Protected by mem_hotplug_lock. */ 57 53 static LIST_HEAD(tdx_memlist); 54 + 55 + static struct tdx_sys_info tdx_sysinfo; 58 56 59 57 typedef void (*sc_err_func_t)(u64 fn, u64 err, struct tdx_module_args *args); 60 58 ··· 1066 1060 1067 1061 static int init_tdx_module(void) 1068 1062 { 1069 - struct tdx_sys_info sysinfo; 1070 1063 int ret; 1071 1064 1072 - ret = get_tdx_sys_info(&sysinfo); 1065 + ret = get_tdx_sys_info(&tdx_sysinfo); 1073 1066 if (ret) 1074 1067 return ret; 1075 1068 1076 1069 /* Check whether the kernel can support this module */ 1077 - ret = check_features(&sysinfo); 1070 + ret = check_features(&tdx_sysinfo); 1078 1071 if (ret) 1079 1072 return ret; 1080 1073 ··· 1094 1089 goto out_put_tdxmem; 1095 1090 1096 1091 /* Allocate enough space for constructing TDMRs */ 1097 - ret = alloc_tdmr_list(&tdx_tdmr_list, &sysinfo.tdmr); 1092 + ret = alloc_tdmr_list(&tdx_tdmr_list, &tdx_sysinfo.tdmr); 1098 1093 if (ret) 1099 1094 goto err_free_tdxmem; 1100 1095 1101 1096 /* Cover all TDX-usable memory regions in TDMRs */ 1102 - ret = construct_tdmrs(&tdx_memlist, &tdx_tdmr_list, &sysinfo.tdmr); 1097 + ret = construct_tdmrs(&tdx_memlist, &tdx_tdmr_list, &tdx_sysinfo.tdmr); 1103 1098 if (ret) 1104 1099 goto err_free_tdmrs; 1105 1100 ··· 1461 1456 
1462 1457 check_tdx_erratum(); 1463 1458 } 1459 + 1460 + const struct tdx_sys_info *tdx_get_sysinfo(void) 1461 + { 1462 + const struct tdx_sys_info *p = NULL; 1463 + 1464 + /* Make sure all fields in @tdx_sysinfo have been populated */ 1465 + mutex_lock(&tdx_module_lock); 1466 + if (tdx_module_status == TDX_MODULE_INITIALIZED) 1467 + p = (const struct tdx_sys_info *)&tdx_sysinfo; 1468 + mutex_unlock(&tdx_module_lock); 1469 + 1470 + return p; 1471 + } 1472 + EXPORT_SYMBOL_GPL(tdx_get_sysinfo); 1473 + 1474 + u32 tdx_get_nr_guest_keyids(void) 1475 + { 1476 + return tdx_nr_guest_keyids; 1477 + } 1478 + EXPORT_SYMBOL_GPL(tdx_get_nr_guest_keyids); 1479 + 1480 + int tdx_guest_keyid_alloc(void) 1481 + { 1482 + return ida_alloc_range(&tdx_guest_keyid_pool, tdx_guest_keyid_start, 1483 + tdx_guest_keyid_start + tdx_nr_guest_keyids - 1, 1484 + GFP_KERNEL); 1485 + } 1486 + EXPORT_SYMBOL_GPL(tdx_guest_keyid_alloc); 1487 + 1488 + void tdx_guest_keyid_free(unsigned int keyid) 1489 + { 1490 + ida_free(&tdx_guest_keyid_pool, keyid); 1491 + } 1492 + EXPORT_SYMBOL_GPL(tdx_guest_keyid_free); 1493 + 1494 + static inline u64 tdx_tdr_pa(struct tdx_td *td) 1495 + { 1496 + return page_to_phys(td->tdr_page); 1497 + } 1498 + 1499 + static inline u64 tdx_tdvpr_pa(struct tdx_vp *td) 1500 + { 1501 + return page_to_phys(td->tdvpr_page); 1502 + } 1503 + 1504 + /* 1505 + * The TDX module exposes a CLFLUSH_BEFORE_ALLOC bit to specify whether 1506 + * a CLFLUSH of pages is required before handing them to the TDX module. 1507 + * Be conservative and make the code simpler by doing the CLFLUSH 1508 + * unconditionally. 
1509 + */ 1510 + static void tdx_clflush_page(struct page *page) 1511 + { 1512 + clflush_cache_range(page_to_virt(page), PAGE_SIZE); 1513 + } 1514 + 1515 + noinstr u64 tdh_vp_enter(struct tdx_vp *td, struct tdx_module_args *args) 1516 + { 1517 + args->rcx = tdx_tdvpr_pa(td); 1518 + 1519 + return __seamcall_saved_ret(TDH_VP_ENTER, args); 1520 + } 1521 + EXPORT_SYMBOL_GPL(tdh_vp_enter); 1522 + 1523 + u64 tdh_mng_addcx(struct tdx_td *td, struct page *tdcs_page) 1524 + { 1525 + struct tdx_module_args args = { 1526 + .rcx = page_to_phys(tdcs_page), 1527 + .rdx = tdx_tdr_pa(td), 1528 + }; 1529 + 1530 + tdx_clflush_page(tdcs_page); 1531 + return seamcall(TDH_MNG_ADDCX, &args); 1532 + } 1533 + EXPORT_SYMBOL_GPL(tdh_mng_addcx); 1534 + 1535 + u64 tdh_mem_page_add(struct tdx_td *td, u64 gpa, struct page *page, struct page *source, u64 *ext_err1, u64 *ext_err2) 1536 + { 1537 + struct tdx_module_args args = { 1538 + .rcx = gpa, 1539 + .rdx = tdx_tdr_pa(td), 1540 + .r8 = page_to_phys(page), 1541 + .r9 = page_to_phys(source), 1542 + }; 1543 + u64 ret; 1544 + 1545 + tdx_clflush_page(page); 1546 + ret = seamcall_ret(TDH_MEM_PAGE_ADD, &args); 1547 + 1548 + *ext_err1 = args.rcx; 1549 + *ext_err2 = args.rdx; 1550 + 1551 + return ret; 1552 + } 1553 + EXPORT_SYMBOL_GPL(tdh_mem_page_add); 1554 + 1555 + u64 tdh_mem_sept_add(struct tdx_td *td, u64 gpa, int level, struct page *page, u64 *ext_err1, u64 *ext_err2) 1556 + { 1557 + struct tdx_module_args args = { 1558 + .rcx = gpa | level, 1559 + .rdx = tdx_tdr_pa(td), 1560 + .r8 = page_to_phys(page), 1561 + }; 1562 + u64 ret; 1563 + 1564 + tdx_clflush_page(page); 1565 + ret = seamcall_ret(TDH_MEM_SEPT_ADD, &args); 1566 + 1567 + *ext_err1 = args.rcx; 1568 + *ext_err2 = args.rdx; 1569 + 1570 + return ret; 1571 + } 1572 + EXPORT_SYMBOL_GPL(tdh_mem_sept_add); 1573 + 1574 + u64 tdh_vp_addcx(struct tdx_vp *vp, struct page *tdcx_page) 1575 + { 1576 + struct tdx_module_args args = { 1577 + .rcx = page_to_phys(tdcx_page), 1578 + .rdx = 
tdx_tdvpr_pa(vp), 1579 + }; 1580 + 1581 + tdx_clflush_page(tdcx_page); 1582 + return seamcall(TDH_VP_ADDCX, &args); 1583 + } 1584 + EXPORT_SYMBOL_GPL(tdh_vp_addcx); 1585 + 1586 + u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct page *page, u64 *ext_err1, u64 *ext_err2) 1587 + { 1588 + struct tdx_module_args args = { 1589 + .rcx = gpa | level, 1590 + .rdx = tdx_tdr_pa(td), 1591 + .r8 = page_to_phys(page), 1592 + }; 1593 + u64 ret; 1594 + 1595 + tdx_clflush_page(page); 1596 + ret = seamcall_ret(TDH_MEM_PAGE_AUG, &args); 1597 + 1598 + *ext_err1 = args.rcx; 1599 + *ext_err2 = args.rdx; 1600 + 1601 + return ret; 1602 + } 1603 + EXPORT_SYMBOL_GPL(tdh_mem_page_aug); 1604 + 1605 + u64 tdh_mem_range_block(struct tdx_td *td, u64 gpa, int level, u64 *ext_err1, u64 *ext_err2) 1606 + { 1607 + struct tdx_module_args args = { 1608 + .rcx = gpa | level, 1609 + .rdx = tdx_tdr_pa(td), 1610 + }; 1611 + u64 ret; 1612 + 1613 + ret = seamcall_ret(TDH_MEM_RANGE_BLOCK, &args); 1614 + 1615 + *ext_err1 = args.rcx; 1616 + *ext_err2 = args.rdx; 1617 + 1618 + return ret; 1619 + } 1620 + EXPORT_SYMBOL_GPL(tdh_mem_range_block); 1621 + 1622 + u64 tdh_mng_key_config(struct tdx_td *td) 1623 + { 1624 + struct tdx_module_args args = { 1625 + .rcx = tdx_tdr_pa(td), 1626 + }; 1627 + 1628 + return seamcall(TDH_MNG_KEY_CONFIG, &args); 1629 + } 1630 + EXPORT_SYMBOL_GPL(tdh_mng_key_config); 1631 + 1632 + u64 tdh_mng_create(struct tdx_td *td, u16 hkid) 1633 + { 1634 + struct tdx_module_args args = { 1635 + .rcx = tdx_tdr_pa(td), 1636 + .rdx = hkid, 1637 + }; 1638 + 1639 + tdx_clflush_page(td->tdr_page); 1640 + return seamcall(TDH_MNG_CREATE, &args); 1641 + } 1642 + EXPORT_SYMBOL_GPL(tdh_mng_create); 1643 + 1644 + u64 tdh_vp_create(struct tdx_td *td, struct tdx_vp *vp) 1645 + { 1646 + struct tdx_module_args args = { 1647 + .rcx = tdx_tdvpr_pa(vp), 1648 + .rdx = tdx_tdr_pa(td), 1649 + }; 1650 + 1651 + tdx_clflush_page(vp->tdvpr_page); 1652 + return seamcall(TDH_VP_CREATE, &args); 1653 + } 
1654 + EXPORT_SYMBOL_GPL(tdh_vp_create); 1655 + 1656 + u64 tdh_mng_rd(struct tdx_td *td, u64 field, u64 *data) 1657 + { 1658 + struct tdx_module_args args = { 1659 + .rcx = tdx_tdr_pa(td), 1660 + .rdx = field, 1661 + }; 1662 + u64 ret; 1663 + 1664 + ret = seamcall_ret(TDH_MNG_RD, &args); 1665 + 1666 + /* R8: Content of the field, or 0 in case of error. */ 1667 + *data = args.r8; 1668 + 1669 + return ret; 1670 + } 1671 + EXPORT_SYMBOL_GPL(tdh_mng_rd); 1672 + 1673 + u64 tdh_mr_extend(struct tdx_td *td, u64 gpa, u64 *ext_err1, u64 *ext_err2) 1674 + { 1675 + struct tdx_module_args args = { 1676 + .rcx = gpa, 1677 + .rdx = tdx_tdr_pa(td), 1678 + }; 1679 + u64 ret; 1680 + 1681 + ret = seamcall_ret(TDH_MR_EXTEND, &args); 1682 + 1683 + *ext_err1 = args.rcx; 1684 + *ext_err2 = args.rdx; 1685 + 1686 + return ret; 1687 + } 1688 + EXPORT_SYMBOL_GPL(tdh_mr_extend); 1689 + 1690 + u64 tdh_mr_finalize(struct tdx_td *td) 1691 + { 1692 + struct tdx_module_args args = { 1693 + .rcx = tdx_tdr_pa(td), 1694 + }; 1695 + 1696 + return seamcall(TDH_MR_FINALIZE, &args); 1697 + } 1698 + EXPORT_SYMBOL_GPL(tdh_mr_finalize); 1699 + 1700 + u64 tdh_vp_flush(struct tdx_vp *vp) 1701 + { 1702 + struct tdx_module_args args = { 1703 + .rcx = tdx_tdvpr_pa(vp), 1704 + }; 1705 + 1706 + return seamcall(TDH_VP_FLUSH, &args); 1707 + } 1708 + EXPORT_SYMBOL_GPL(tdh_vp_flush); 1709 + 1710 + u64 tdh_mng_vpflushdone(struct tdx_td *td) 1711 + { 1712 + struct tdx_module_args args = { 1713 + .rcx = tdx_tdr_pa(td), 1714 + }; 1715 + 1716 + return seamcall(TDH_MNG_VPFLUSHDONE, &args); 1717 + } 1718 + EXPORT_SYMBOL_GPL(tdh_mng_vpflushdone); 1719 + 1720 + u64 tdh_mng_key_freeid(struct tdx_td *td) 1721 + { 1722 + struct tdx_module_args args = { 1723 + .rcx = tdx_tdr_pa(td), 1724 + }; 1725 + 1726 + return seamcall(TDH_MNG_KEY_FREEID, &args); 1727 + } 1728 + EXPORT_SYMBOL_GPL(tdh_mng_key_freeid); 1729 + 1730 + u64 tdh_mng_init(struct tdx_td *td, u64 td_params, u64 *extended_err) 1731 + { 1732 + struct tdx_module_args args 
= { 1733 + .rcx = tdx_tdr_pa(td), 1734 + .rdx = td_params, 1735 + }; 1736 + u64 ret; 1737 + 1738 + ret = seamcall_ret(TDH_MNG_INIT, &args); 1739 + 1740 + *extended_err = args.rcx; 1741 + 1742 + return ret; 1743 + } 1744 + EXPORT_SYMBOL_GPL(tdh_mng_init); 1745 + 1746 + u64 tdh_vp_rd(struct tdx_vp *vp, u64 field, u64 *data) 1747 + { 1748 + struct tdx_module_args args = { 1749 + .rcx = tdx_tdvpr_pa(vp), 1750 + .rdx = field, 1751 + }; 1752 + u64 ret; 1753 + 1754 + ret = seamcall_ret(TDH_VP_RD, &args); 1755 + 1756 + /* R8: Content of the field, or 0 in case of error. */ 1757 + *data = args.r8; 1758 + 1759 + return ret; 1760 + } 1761 + EXPORT_SYMBOL_GPL(tdh_vp_rd); 1762 + 1763 + u64 tdh_vp_wr(struct tdx_vp *vp, u64 field, u64 data, u64 mask) 1764 + { 1765 + struct tdx_module_args args = { 1766 + .rcx = tdx_tdvpr_pa(vp), 1767 + .rdx = field, 1768 + .r8 = data, 1769 + .r9 = mask, 1770 + }; 1771 + 1772 + return seamcall(TDH_VP_WR, &args); 1773 + } 1774 + EXPORT_SYMBOL_GPL(tdh_vp_wr); 1775 + 1776 + u64 tdh_vp_init(struct tdx_vp *vp, u64 initial_rcx, u32 x2apicid) 1777 + { 1778 + struct tdx_module_args args = { 1779 + .rcx = tdx_tdvpr_pa(vp), 1780 + .rdx = initial_rcx, 1781 + .r8 = x2apicid, 1782 + }; 1783 + 1784 + /* apicid requires version == 1. */ 1785 + return seamcall(TDH_VP_INIT | (1ULL << TDX_VERSION_SHIFT), &args); 1786 + } 1787 + EXPORT_SYMBOL_GPL(tdh_vp_init); 1788 + 1789 + /* 1790 + * TDX ABI defines output operands as PT, OWNER and SIZE. These are TDX defined formats. 1791 + * So despite the names, they must be interpreted specially as described by the spec. Return 1792 + * them only for error reporting purposes.
1793 + */ 1794 + u64 tdh_phymem_page_reclaim(struct page *page, u64 *tdx_pt, u64 *tdx_owner, u64 *tdx_size) 1795 + { 1796 + struct tdx_module_args args = { 1797 + .rcx = page_to_phys(page), 1798 + }; 1799 + u64 ret; 1800 + 1801 + ret = seamcall_ret(TDH_PHYMEM_PAGE_RECLAIM, &args); 1802 + 1803 + *tdx_pt = args.rcx; 1804 + *tdx_owner = args.rdx; 1805 + *tdx_size = args.r8; 1806 + 1807 + return ret; 1808 + } 1809 + EXPORT_SYMBOL_GPL(tdh_phymem_page_reclaim); 1810 + 1811 + u64 tdh_mem_track(struct tdx_td *td) 1812 + { 1813 + struct tdx_module_args args = { 1814 + .rcx = tdx_tdr_pa(td), 1815 + }; 1816 + 1817 + return seamcall(TDH_MEM_TRACK, &args); 1818 + } 1819 + EXPORT_SYMBOL_GPL(tdh_mem_track); 1820 + 1821 + u64 tdh_mem_page_remove(struct tdx_td *td, u64 gpa, u64 level, u64 *ext_err1, u64 *ext_err2) 1822 + { 1823 + struct tdx_module_args args = { 1824 + .rcx = gpa | level, 1825 + .rdx = tdx_tdr_pa(td), 1826 + }; 1827 + u64 ret; 1828 + 1829 + ret = seamcall_ret(TDH_MEM_PAGE_REMOVE, &args); 1830 + 1831 + *ext_err1 = args.rcx; 1832 + *ext_err2 = args.rdx; 1833 + 1834 + return ret; 1835 + } 1836 + EXPORT_SYMBOL_GPL(tdh_mem_page_remove); 1837 + 1838 + u64 tdh_phymem_cache_wb(bool resume) 1839 + { 1840 + struct tdx_module_args args = { 1841 + .rcx = resume ? 
1 : 0, 1842 + }; 1843 + 1844 + return seamcall(TDH_PHYMEM_CACHE_WB, &args); 1845 + } 1846 + EXPORT_SYMBOL_GPL(tdh_phymem_cache_wb); 1847 + 1848 + u64 tdh_phymem_page_wbinvd_tdr(struct tdx_td *td) 1849 + { 1850 + struct tdx_module_args args = {}; 1851 + 1852 + args.rcx = mk_keyed_paddr(tdx_global_keyid, td->tdr_page); 1853 + 1854 + return seamcall(TDH_PHYMEM_PAGE_WBINVD, &args); 1855 + } 1856 + EXPORT_SYMBOL_GPL(tdh_phymem_page_wbinvd_tdr); 1857 + 1858 + u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page) 1859 + { 1860 + struct tdx_module_args args = {}; 1861 + 1862 + args.rcx = mk_keyed_paddr(hkid, page); 1863 + 1864 + return seamcall(TDH_PHYMEM_PAGE_WBINVD, &args); 1865 + } 1866 + EXPORT_SYMBOL_GPL(tdh_phymem_page_wbinvd_hkid);
+40 -8
arch/x86/virt/vmx/tdx/tdx.h
··· 3 3 #define _X86_VIRT_TDX_H 4 4 5 5 #include <linux/bits.h> 6 - #include "tdx_global_metadata.h" 7 6 8 7 /* 9 8 * This file contains both macros and data structures defined by the TDX ··· 14 15 /* 15 16 * TDX module SEAMCALL leaf functions 16 17 */ 17 - #define TDH_PHYMEM_PAGE_RDMD 24 18 - #define TDH_SYS_KEY_CONFIG 31 19 - #define TDH_SYS_INIT 33 20 - #define TDH_SYS_RD 34 21 - #define TDH_SYS_LP_INIT 35 22 - #define TDH_SYS_TDMR_INIT 36 23 - #define TDH_SYS_CONFIG 45 18 + #define TDH_VP_ENTER 0 19 + #define TDH_MNG_ADDCX 1 20 + #define TDH_MEM_PAGE_ADD 2 21 + #define TDH_MEM_SEPT_ADD 3 22 + #define TDH_VP_ADDCX 4 23 + #define TDH_MEM_PAGE_AUG 6 24 + #define TDH_MEM_RANGE_BLOCK 7 25 + #define TDH_MNG_KEY_CONFIG 8 26 + #define TDH_MNG_CREATE 9 27 + #define TDH_VP_CREATE 10 28 + #define TDH_MNG_RD 11 29 + #define TDH_MR_EXTEND 16 30 + #define TDH_MR_FINALIZE 17 31 + #define TDH_VP_FLUSH 18 32 + #define TDH_MNG_VPFLUSHDONE 19 33 + #define TDH_MNG_KEY_FREEID 20 34 + #define TDH_MNG_INIT 21 35 + #define TDH_VP_INIT 22 36 + #define TDH_PHYMEM_PAGE_RDMD 24 37 + #define TDH_VP_RD 26 38 + #define TDH_PHYMEM_PAGE_RECLAIM 28 39 + #define TDH_MEM_PAGE_REMOVE 29 40 + #define TDH_SYS_KEY_CONFIG 31 41 + #define TDH_SYS_INIT 33 42 + #define TDH_SYS_RD 34 43 + #define TDH_SYS_LP_INIT 35 44 + #define TDH_SYS_TDMR_INIT 36 45 + #define TDH_MEM_TRACK 38 46 + #define TDH_PHYMEM_CACHE_WB 40 47 + #define TDH_PHYMEM_PAGE_WBINVD 41 48 + #define TDH_VP_WR 43 49 + #define TDH_SYS_CONFIG 45 50 + 51 + /* 52 + * SEAMCALL leaf: 53 + * 54 + * Bit 15:0 Leaf number 55 + * Bit 23:16 Version number 56 + */ 57 + #define TDX_VERSION_SHIFT 16 24 58 25 59 /* TDX page types */ 26 60 #define PT_NDA 0x0
+50
arch/x86/virt/vmx/tdx/tdx_global_metadata.c
··· 37 37 return ret; 38 38 } 39 39 40 + static int get_tdx_sys_info_td_ctrl(struct tdx_sys_info_td_ctrl *sysinfo_td_ctrl) 41 + { 42 + int ret = 0; 43 + u64 val; 44 + 45 + if (!ret && !(ret = read_sys_metadata_field(0x9800000100000000, &val))) 46 + sysinfo_td_ctrl->tdr_base_size = val; 47 + if (!ret && !(ret = read_sys_metadata_field(0x9800000100000100, &val))) 48 + sysinfo_td_ctrl->tdcs_base_size = val; 49 + if (!ret && !(ret = read_sys_metadata_field(0x9800000100000200, &val))) 50 + sysinfo_td_ctrl->tdvps_base_size = val; 51 + 52 + return ret; 53 + } 54 + 55 + static int get_tdx_sys_info_td_conf(struct tdx_sys_info_td_conf *sysinfo_td_conf) 56 + { 57 + int ret = 0; 58 + u64 val; 59 + int i, j; 60 + 61 + if (!ret && !(ret = read_sys_metadata_field(0x1900000300000000, &val))) 62 + sysinfo_td_conf->attributes_fixed0 = val; 63 + if (!ret && !(ret = read_sys_metadata_field(0x1900000300000001, &val))) 64 + sysinfo_td_conf->attributes_fixed1 = val; 65 + if (!ret && !(ret = read_sys_metadata_field(0x1900000300000002, &val))) 66 + sysinfo_td_conf->xfam_fixed0 = val; 67 + if (!ret && !(ret = read_sys_metadata_field(0x1900000300000003, &val))) 68 + sysinfo_td_conf->xfam_fixed1 = val; 69 + if (!ret && !(ret = read_sys_metadata_field(0x9900000100000004, &val))) 70 + sysinfo_td_conf->num_cpuid_config = val; 71 + if (!ret && !(ret = read_sys_metadata_field(0x9900000100000008, &val))) 72 + sysinfo_td_conf->max_vcpus_per_td = val; 73 + if (sysinfo_td_conf->num_cpuid_config > ARRAY_SIZE(sysinfo_td_conf->cpuid_config_leaves)) 74 + return -EINVAL; 75 + for (i = 0; i < sysinfo_td_conf->num_cpuid_config; i++) 76 + if (!ret && !(ret = read_sys_metadata_field(0x9900000300000400 + i, &val))) 77 + sysinfo_td_conf->cpuid_config_leaves[i] = val; 78 + if (sysinfo_td_conf->num_cpuid_config > ARRAY_SIZE(sysinfo_td_conf->cpuid_config_values)) 79 + return -EINVAL; 80 + for (i = 0; i < sysinfo_td_conf->num_cpuid_config; i++) 81 + for (j = 0; j < 2; j++) 82 + if (!ret && !(ret = 
read_sys_metadata_field(0x9900000300000500 + i * 2 + j, &val))) 83 + sysinfo_td_conf->cpuid_config_values[i][j] = val; 84 + 85 + return ret; 86 + } 87 + 40 88 static int get_tdx_sys_info(struct tdx_sys_info *sysinfo) 41 89 { 42 90 int ret = 0; 43 91 44 92 ret = ret ?: get_tdx_sys_info_features(&sysinfo->features); 45 93 ret = ret ?: get_tdx_sys_info_tdmr(&sysinfo->tdmr); 94 + ret = ret ?: get_tdx_sys_info_td_ctrl(&sysinfo->td_ctrl); 95 + ret = ret ?: get_tdx_sys_info_td_conf(&sysinfo->td_conf); 46 96 47 97 return ret; 48 98 }
+19
arch/x86/virt/vmx/tdx/tdx_global_metadata.h arch/x86/include/asm/tdx_global_metadata.h
··· 17 17 u16 pamt_1g_entry_size; 18 18 }; 19 19 20 + struct tdx_sys_info_td_ctrl { 21 + u16 tdr_base_size; 22 + u16 tdcs_base_size; 23 + u16 tdvps_base_size; 24 + }; 25 + 26 + struct tdx_sys_info_td_conf { 27 + u64 attributes_fixed0; 28 + u64 attributes_fixed1; 29 + u64 xfam_fixed0; 30 + u64 xfam_fixed1; 31 + u16 num_cpuid_config; 32 + u16 max_vcpus_per_td; 33 + u64 cpuid_config_leaves[128]; 34 + u64 cpuid_config_values[128][2]; 35 + }; 36 + 20 37 struct tdx_sys_info { 21 38 struct tdx_sys_info_features features; 22 39 struct tdx_sys_info_tdmr tdmr; 40 + struct tdx_sys_info_td_ctrl td_ctrl; 41 + struct tdx_sys_info_td_conf td_conf; 23 42 }; 24 43 25 44 #endif
+6 -5
include/linux/kvm_dirty_ring.h
··· 32 32 * If CONFIG_HAVE_KVM_DIRTY_RING not defined, kvm_dirty_ring.o should 33 33 * not be included as well, so define these nop functions for the arch. 34 34 */ 35 - static inline u32 kvm_dirty_ring_get_rsvd_entries(void) 35 + static inline u32 kvm_dirty_ring_get_rsvd_entries(struct kvm *kvm) 36 36 { 37 37 return 0; 38 38 } ··· 42 42 return true; 43 43 } 44 44 45 - static inline int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring, 45 + static inline int kvm_dirty_ring_alloc(struct kvm *kvm, struct kvm_dirty_ring *ring, 46 46 int index, u32 size) 47 47 { 48 48 return 0; ··· 71 71 72 72 #else /* CONFIG_HAVE_KVM_DIRTY_RING */ 73 73 74 - int kvm_cpu_dirty_log_size(void); 74 + int kvm_cpu_dirty_log_size(struct kvm *kvm); 75 75 bool kvm_use_dirty_bitmap(struct kvm *kvm); 76 76 bool kvm_arch_allow_write_without_running_vcpu(struct kvm *kvm); 77 - u32 kvm_dirty_ring_get_rsvd_entries(void); 78 - int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring, int index, u32 size); 77 + u32 kvm_dirty_ring_get_rsvd_entries(struct kvm *kvm); 78 + int kvm_dirty_ring_alloc(struct kvm *kvm, struct kvm_dirty_ring *ring, 79 + int index, u32 size); 79 80 80 81 /* 81 82 * called with kvm->slots_lock held, returns the number of
+10
include/linux/kvm_host.h
··· 1610 1610 int kvm_arch_enable_virtualization_cpu(void); 1611 1611 void kvm_arch_disable_virtualization_cpu(void); 1612 1612 #endif 1613 + bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu); 1613 1614 int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu); 1614 1615 bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu); 1615 1616 int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu); ··· 2285 2284 } 2286 2285 2287 2286 #ifdef CONFIG_KVM_GENERIC_HARDWARE_ENABLING 2287 + extern bool enable_virt_at_load; 2288 2288 extern bool kvm_rebooting; 2289 2289 #endif 2290 2290 ··· 2571 2569 #ifdef CONFIG_KVM_GENERIC_PRE_FAULT_MEMORY 2572 2570 long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu, 2573 2571 struct kvm_pre_fault_memory *range); 2572 + #endif 2573 + 2574 + #ifdef CONFIG_KVM_GENERIC_HARDWARE_ENABLING 2575 + int kvm_enable_virtualization(void); 2576 + void kvm_disable_virtualization(void); 2577 + #else 2578 + static inline int kvm_enable_virtualization(void) { return 0; } 2579 + static inline void kvm_disable_virtualization(void) { } 2574 2580 #endif 2575 2581 2576 2582 #endif
+4
include/linux/misc_cgroup.h
··· 18 18 /** @MISC_CG_RES_SEV_ES: AMD SEV-ES ASIDs resource */ 19 19 MISC_CG_RES_SEV_ES, 20 20 #endif 21 + #ifdef CONFIG_INTEL_TDX_HOST 22 + /* Intel TDX HKIDs resource */ 23 + MISC_CG_RES_TDX, 24 + #endif 21 25 /** @MISC_CG_RES_TYPES: count of enum misc_res_type constants */ 22 26 MISC_CG_RES_TYPES 23 27 };
+1
include/uapi/linux/kvm.h
··· 375 375 #define KVM_SYSTEM_EVENT_WAKEUP 4 376 376 #define KVM_SYSTEM_EVENT_SUSPEND 5 377 377 #define KVM_SYSTEM_EVENT_SEV_TERM 6 378 + #define KVM_SYSTEM_EVENT_TDX_FATAL 7 378 379 __u32 type; 379 380 __u32 ndata; 380 381 union {
+4
kernel/cgroup/misc.c
··· 24 24 /* AMD SEV-ES ASIDs resource */ 25 25 "sev_es", 26 26 #endif 27 + #ifdef CONFIG_INTEL_TDX_HOST 28 + /* Intel TDX HKIDs resource */ 29 + "tdx", 30 + #endif 27 31 }; 28 32 29 33 /* Root misc cgroup */
virt/kvm/dirty_ring.c (+6 -5)
···
 #include <trace/events/kvm.h>
 #include "kvm_mm.h"
 
-int __weak kvm_cpu_dirty_log_size(void)
+int __weak kvm_cpu_dirty_log_size(struct kvm *kvm)
 {
 	return 0;
 }
 
-u32 kvm_dirty_ring_get_rsvd_entries(void)
+u32 kvm_dirty_ring_get_rsvd_entries(struct kvm *kvm)
 {
-	return KVM_DIRTY_RING_RSVD_ENTRIES + kvm_cpu_dirty_log_size();
+	return KVM_DIRTY_RING_RSVD_ENTRIES + kvm_cpu_dirty_log_size(kvm);
 }
 
 bool kvm_use_dirty_bitmap(struct kvm *kvm)
···
 	KVM_MMU_UNLOCK(kvm);
 }
 
-int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring, int index, u32 size)
+int kvm_dirty_ring_alloc(struct kvm *kvm, struct kvm_dirty_ring *ring,
+			 int index, u32 size)
 {
 	ring->dirty_gfns = vzalloc(size);
 	if (!ring->dirty_gfns)
 		return -ENOMEM;
 
 	ring->size = size / sizeof(struct kvm_dirty_gfn);
-	ring->soft_limit = ring->size - kvm_dirty_ring_get_rsvd_entries();
+	ring->soft_limit = ring->size - kvm_dirty_ring_get_rsvd_entries(kvm);
 	ring->dirty_index = 0;
 	ring->reset_index = 0;
 	ring->index = index;
virt/kvm/kvm_main.c (+9 -17)
···
 #define KVM_COMPAT(c)	.compat_ioctl	= kvm_no_compat_ioctl,	\
 			.open		= kvm_no_compat_open
 #endif
-static int kvm_enable_virtualization(void);
-static void kvm_disable_virtualization(void);
 
 static void kvm_io_bus_destroy(struct kvm_io_bus *bus);
···
 		goto vcpu_free_run_page;
 
 	if (kvm->dirty_ring_size) {
-		r = kvm_dirty_ring_alloc(&vcpu->dirty_ring,
+		r = kvm_dirty_ring_alloc(kvm, &vcpu->dirty_ring,
 					 id, kvm->dirty_ring_size);
 		if (r)
 			goto arch_vcpu_destroy;
···
 		return -EINVAL;
 
 	/* Should be bigger to keep the reserved entries, or a page */
-	if (size < kvm_dirty_ring_get_rsvd_entries() *
+	if (size < kvm_dirty_ring_get_rsvd_entries(kvm) *
 	    sizeof(struct kvm_dirty_gfn) || size < PAGE_SIZE)
 		return -EINVAL;
···
 };
 
 #ifdef CONFIG_KVM_GENERIC_HARDWARE_ENABLING
-static bool enable_virt_at_load = true;
+bool enable_virt_at_load = true;
 module_param(enable_virt_at_load, bool, 0444);
+EXPORT_SYMBOL_GPL(enable_virt_at_load);
 
 __visible bool kvm_rebooting;
 EXPORT_SYMBOL_GPL(kvm_rebooting);
···
 	.shutdown = kvm_shutdown,
 };
 
-static int kvm_enable_virtualization(void)
+int kvm_enable_virtualization(void)
 {
 	int r;
···
 	--kvm_usage_count;
 	return r;
 }
+EXPORT_SYMBOL_GPL(kvm_enable_virtualization);
 
-static void kvm_disable_virtualization(void)
+void kvm_disable_virtualization(void)
 {
 	guard(mutex)(&kvm_usage_lock);
···
 	cpuhp_remove_state(CPUHP_AP_KVM_ONLINE);
 	kvm_arch_disable_virtualization();
 }
+EXPORT_SYMBOL_GPL(kvm_disable_virtualization);
 
 static int kvm_init_virtualization(void)
 {
···
 	kvm_disable_virtualization();
 }
 #else /* CONFIG_KVM_GENERIC_HARDWARE_ENABLING */
-static int kvm_enable_virtualization(void)
-{
-	return 0;
-}
-
 static int kvm_init_virtualization(void)
 {
 	return 0;
-}
-
-static void kvm_disable_virtualization(void)
-{
-
 }
 
 static void kvm_uninit_virtualization(void)
···
 	r = __kvm_io_bus_read(vcpu, bus, &range, val);
 	return r < 0 ? r : 0;
 }
+EXPORT_SYMBOL_GPL(kvm_io_bus_read);
 
 int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
 			    int len, struct kvm_io_device *dev)