Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'hyperv-fixes-signed-20240908' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux

Pull hyperv fixes from Wei Liu:

- Add a documentation overview of Confidential Computing VM support
(Michael Kelley)

- Use lapic timer in a TDX VM without paravisor (Dexuan Cui)

- Set X86_FEATURE_TSC_KNOWN_FREQ when Hyper-V provides frequency
(Michael Kelley)

- Fix a kexec crash due to VP assist page corruption (Anirudh
Rayabharam)

- Python3 compatibility fix for lsvmbus (Anthony Nandaa)

- Misc fixes (Rachel Menge, Roman Kisel, zhang jiao, Hongbo Li)

* tag 'hyperv-fixes-signed-20240908' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
hv: vmbus: Constify struct kobj_type and struct attribute_group
tools: hv: rm .*.cmd when make clean
x86/hyperv: fix kexec crash due to VP assist page corruption
Drivers: hv: vmbus: Fix the misplaced function description
tools: hv: lsvmbus: change shebang to use python3
x86/hyperv: Set X86_FEATURE_TSC_KNOWN_FREQ when Hyper-V provides frequency
Documentation: hyperv: Add overview of Confidential Computing VM support
clocksource: hyper-v: Use lapic timer in a TDX VM without paravisor
Drivers: hv: Remove deprecated hv_fcopy declarations

+302 -22
+260
Documentation/virt/hyperv/coco.rst
.. SPDX-License-Identifier: GPL-2.0

Confidential Computing VMs
==========================
Hyper-V can create and run Linux guests that are Confidential Computing
(CoCo) VMs. Such VMs cooperate with the physical processor to better protect
the confidentiality and integrity of data in the VM's memory, even in the
face of a hypervisor/VMM that has been compromised and may behave maliciously.
CoCo VMs on Hyper-V share the generic CoCo VM threat model and security
objectives described in Documentation/security/snp-tdx-threat-model.rst. Note
that Hyper-V specific code in Linux refers to CoCo VMs as "isolated VMs" or
"isolation VMs".

A Linux CoCo VM on Hyper-V requires the cooperation and interaction of the
following:

* Physical hardware with a processor that supports CoCo VMs

* The hardware runs a version of Windows/Hyper-V with support for CoCo VMs

* The VM runs a version of Linux that supports being a CoCo VM

The physical hardware requirements are as follows:

* AMD processor with SEV-SNP. Hyper-V does not run guest VMs with AMD SME,
  SEV, or SEV-ES encryption, and such encryption is not sufficient for a CoCo
  VM on Hyper-V.

* Intel processor with TDX

To create a CoCo VM, the "Isolated VM" attribute must be specified to Hyper-V
when the VM is created. A VM cannot be changed from a CoCo VM to a normal VM,
or vice versa, after it is created.

Operational Modes
-----------------
Hyper-V CoCo VMs can run in two modes. The mode is selected when the VM is
created and cannot be changed during the life of the VM.

* Fully-enlightened mode. In this mode, the guest operating system is
  enlightened to understand and manage all aspects of running as a CoCo VM.

* Paravisor mode.
  In this mode, a paravisor layer between the guest and the
  host provides some operations needed to run as a CoCo VM. The guest
  operating system can have fewer CoCo enlightenments than is required in the
  fully-enlightened case.

Conceptually, fully-enlightened mode and paravisor mode may be treated as
points on a spectrum spanning the degree of guest enlightenment needed to run
as a CoCo VM. Fully-enlightened mode is one end of the spectrum. A full
implementation of paravisor mode is the other end of the spectrum, where all
aspects of running as a CoCo VM are handled by the paravisor, and a normal
guest OS with no knowledge of memory encryption or other aspects of CoCo VMs
can run successfully. However, the Hyper-V implementation of paravisor mode
does not go this far, and is somewhere in the middle of the spectrum. Some
aspects of CoCo VMs are handled by the Hyper-V paravisor while the guest OS
must be enlightened for other aspects. Unfortunately, there is no
standardized enumeration of feature/functions that might be provided in the
paravisor, and there is no standardized mechanism for a guest OS to query the
paravisor for the feature/functions it provides. The understanding of what
the paravisor provides is hard-coded in the guest OS.

Paravisor mode has similarities to the `Coconut project`_, which aims to provide
a limited paravisor to provide services to the guest such as a virtual TPM.
However, the Hyper-V paravisor generally handles more aspects of CoCo VMs
than is currently envisioned for Coconut, and so is further toward the "no
guest enlightenments required" end of the spectrum.

.. _Coconut project: https://github.com/coconut-svsm/svsm

In the CoCo VM threat model, the paravisor is in the guest security domain
and must be trusted by the guest OS.
By implication, the hypervisor/VMM must
protect itself against a potentially malicious paravisor just like it
protects against a potentially malicious guest.

The hardware architectural approach to fully-enlightened vs. paravisor mode
varies depending on the underlying processor.

* With AMD SEV-SNP processors, in fully-enlightened mode the guest OS runs in
  VMPL 0 and has full control of the guest context. In paravisor mode, the
  guest OS runs in VMPL 2 and the paravisor runs in VMPL 0. The paravisor
  running in VMPL 0 has privileges that the guest OS in VMPL 2 does not have.
  Certain operations require the guest to invoke the paravisor. Furthermore, in
  paravisor mode the guest OS operates in "virtual Top Of Memory" (vTOM) mode
  as defined by the SEV-SNP architecture. This mode simplifies guest management
  of memory encryption when a paravisor is used.

* With Intel TDX processors, in fully-enlightened mode the guest OS runs in an
  L1 VM. In paravisor mode, TD partitioning is used. The paravisor runs in the
  L1 VM, and the guest OS runs in a nested L2 VM.

Hyper-V exposes a synthetic MSR to guests that describes the CoCo mode. This
MSR indicates if the underlying processor uses AMD SEV-SNP or Intel TDX, and
whether a paravisor is being used. It is straightforward to build a single
kernel image that can boot and run properly on either architecture, and in
either mode.

Paravisor Effects
-----------------
Running in paravisor mode affects the following areas of generic Linux kernel
CoCo VM functionality:

* Initial guest memory setup. When a new VM is created in paravisor mode, the
  paravisor runs first and sets up the guest physical memory as encrypted. The
  guest Linux does normal memory initialization, except for explicitly marking
  appropriate ranges as decrypted (shared).
  In paravisor mode, Linux does not
  perform the early boot memory setup steps that are particularly tricky with
  AMD SEV-SNP in fully-enlightened mode.

* #VC/#VE exception handling. In paravisor mode, Hyper-V configures the guest
  CoCo VM to route #VC and #VE exceptions to VMPL 0 and the L1 VM,
  respectively, and not the guest Linux. Consequently, these exception handlers
  do not run in the guest Linux and are not a required enlightenment for a
  Linux guest in paravisor mode.

* CPUID flags. Both AMD SEV-SNP and Intel TDX provide a CPUID flag in the
  guest indicating that the VM is operating with the respective hardware
  support. While these CPUID flags are visible in fully-enlightened CoCo VMs,
  the paravisor filters out these flags and the guest Linux does not see them.
  Throughout the Linux kernel, explicitly testing these flags has mostly been
  eliminated in favor of the cc_platform_has() function, with the goal of
  abstracting the differences between SEV-SNP and TDX. But the
  cc_platform_has() abstraction also allows the Hyper-V paravisor configuration
  to selectively enable aspects of CoCo VM functionality even when the CPUID
  flags are not set. The exception is early boot memory setup on SEV-SNP, which
  tests the CPUID SEV-SNP flag. But not having the flag in a Hyper-V paravisor
  mode VM achieves the desired effect of not running SEV-SNP specific early
  boot memory setup.

* Device emulation. In paravisor mode, the Hyper-V paravisor provides
  emulation of devices such as the IO-APIC and TPM. Because the emulation
  happens in the paravisor in the guest context (instead of the hypervisor/VMM
  context), MMIO accesses to these devices must be encrypted references instead
  of the decrypted references that would be used in a fully-enlightened CoCo
  VM.
  The __ioremap_caller() function has been enhanced to make a callback to
  check whether a particular address range should be treated as encrypted
  (private). See the "is_private_mmio" callback.

* Encrypt/decrypt memory transitions. In a CoCo VM, transitioning guest
  memory between encrypted and decrypted requires coordinating with the
  hypervisor/VMM. This is done via callbacks invoked from
  __set_memory_enc_pgtable(). In fully-enlightened mode, the normal SEV-SNP and
  TDX implementations of these callbacks are used. In paravisor mode, a Hyper-V
  specific set of callbacks is used. These callbacks invoke the paravisor so
  that the paravisor can coordinate the transitions and inform the hypervisor
  as necessary. See hv_vtom_init() where these callbacks are set up.

* Interrupt injection. In fully enlightened mode, a malicious hypervisor
  could inject interrupts into the guest OS at times that violate x86/x64
  architectural rules. For full protection, the guest OS should include
  enlightenments that use the interrupt injection management features provided
  by CoCo-capable processors. In paravisor mode, the paravisor mediates
  interrupt injection into the guest OS, and ensures that the guest OS only
  sees interrupts that are "legal". The paravisor uses the interrupt injection
  management features provided by the CoCo-capable physical processor, thereby
  masking these complexities from the guest OS.

Hyper-V Hypercalls
------------------
When in fully-enlightened mode, hypercalls made by the Linux guest are routed
directly to the hypervisor, just as in a non-CoCo VM. But in paravisor mode,
normal hypercalls trap to the paravisor first, which may in turn invoke the
hypervisor.
But the paravisor is idiosyncratic in this regard, and a few
hypercalls made by the Linux guest must always be routed directly to the
hypervisor. These hypercall sites test for a paravisor being present, and use
a special invocation sequence. See hv_post_message(), for example.

Guest communication with Hyper-V
--------------------------------
Separate from the generic Linux kernel handling of memory encryption in Linux
CoCo VMs, Hyper-V has VMBus and VMBus devices that communicate using memory
shared between the Linux guest and the host. This shared memory must be
marked decrypted to enable communication. Furthermore, since the threat model
includes a compromised and potentially malicious host, the guest must guard
against leaking any unintended data to the host through this shared memory.

These Hyper-V and VMBus memory pages are marked as decrypted:

* VMBus monitor pages

* Synthetic interrupt controller (synic) related pages (unless supplied by
  the paravisor)

* Per-cpu hypercall input and output pages (unless running with a paravisor)

* VMBus ring buffers. The direct mapping is marked decrypted in
  __vmbus_establish_gpadl(). The secondary mapping created in
  hv_ringbuffer_init() must also include the "decrypted" attribute.

When the guest writes data to memory that is shared with the host, it must
ensure that only the intended data is written. Padding or unused fields must
be initialized to zeros before copying into the shared memory so that random
kernel data is not inadvertently given to the host.

Similarly, when the guest reads memory that is shared with the host, it must
validate the data before acting on it so that a malicious host cannot induce
the guest to expose unintended data.
Doing such validation can be tricky
because the host can modify the shared memory areas even while or after
validation is performed. For messages passed from the host to the guest in a
VMBus ring buffer, the length of the message is validated, and the message is
copied into a temporary (encrypted) buffer for further validation and
processing. The copying adds a small amount of overhead, but is the only way
to protect against a malicious host. See hv_pkt_iter_first().

Many drivers for VMBus devices have been "hardened" by adding code to fully
validate messages received over VMBus, instead of assuming that Hyper-V is
acting cooperatively. Such drivers are marked as "allowed_in_isolated" in the
vmbus_devs[] table. Other drivers for VMBus devices that are not needed in a
CoCo VM have not been hardened, and they are not allowed to load in a CoCo
VM. See vmbus_is_valid_offer() where such devices are excluded.

Two VMBus devices depend on the Hyper-V host to do DMA data transfers:
storvsc for disk I/O and netvsc for network I/O. storvsc uses the normal
Linux kernel DMA APIs, and so bounce buffering through decrypted swiotlb
memory is done implicitly. netvsc has two modes for data transfers. The first
mode goes through send and receive buffer space that is explicitly allocated
by the netvsc driver, and is used for most smaller packets. These send and
receive buffers are marked decrypted by __vmbus_establish_gpadl(). Because
the netvsc driver explicitly copies packets to/from these buffers, the
equivalent of bounce buffering between encrypted and decrypted memory is
already part of the data path. The second mode uses the normal Linux kernel
DMA APIs, and is bounce buffered through swiotlb memory implicitly like in
storvsc.

Finally, the VMBus virtual PCI driver needs special handling in a CoCo VM.
Linux PCI device drivers access PCI config space using standard APIs provided
by the Linux PCI subsystem. On Hyper-V, these functions directly access MMIO
space, and the access traps to Hyper-V for emulation. But in CoCo VMs, memory
encryption prevents Hyper-V from reading the guest instruction stream to
emulate the access. So in a CoCo VM, these functions must make a hypercall
with arguments explicitly describing the access. See
_hv_pcifront_read_config() and _hv_pcifront_write_config() and the
"use_calls" flag indicating to use hypercalls.

load_unaligned_zeropad()
------------------------
When transitioning memory between encrypted and decrypted, the caller of
set_memory_encrypted() or set_memory_decrypted() is responsible for ensuring
the memory isn't in use and isn't referenced while the transition is in
progress. The transition has multiple steps, and includes interaction with
the Hyper-V host. The memory is in an inconsistent state until all steps are
complete. A reference while the state is inconsistent could result in an
exception that can't be cleanly fixed up.

However, the kernel load_unaligned_zeropad() mechanism may make stray
references that can't be prevented by the caller of set_memory_encrypted() or
set_memory_decrypted(), so there's specific code in the #VC or #VE exception
handler to fix up this case. But a CoCo VM running on Hyper-V may be
configured to run with a paravisor, with the #VC or #VE exception routed to
the paravisor. There's no architectural way to forward the exceptions back to
the guest kernel, and in such a case, the load_unaligned_zeropad() fixup code
in the #VC/#VE handlers doesn't run.

To avoid this problem, the Hyper-V specific functions for notifying the
hypervisor of the transition mark pages as "not present" while a transition
is in progress.
If load_unaligned_zeropad() causes a stray reference, a
normal page fault is generated instead of #VC or #VE, and the page-fault-based
handlers for load_unaligned_zeropad() fix up the reference. When the
encrypted/decrypted transition is complete, the pages are marked as "present"
again. See hv_vtom_clear_present() and hv_vtom_set_host_visibility().
+1
Documentation/virt/hyperv/index.rst
···
   vmbus
   clocks
   vpci
 + coco
+1 -4
arch/x86/hyperv/hv_init.c
···
  #include <clocksource/hyperv_timer.h>
  #include <linux/highmem.h>

 -int hyperv_init_cpuhp;
  u64 hv_current_partition_id = ~0ull;
  EXPORT_SYMBOL_GPL(hv_current_partition_id);
···
  	register_syscore_ops(&hv_syscore_ops);

 -	hyperv_init_cpuhp = cpuhp;
 -
  	if (cpuid_ebx(HYPERV_CPUID_FEATURES) & HV_ACCESS_PARTITION_ID)
  		hv_get_partition_id();
···
  clean_guest_os_id:
  	wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);
  	hv_ivm_msr_write(HV_X64_MSR_GUEST_OS_ID, 0);
 -	cpuhp_remove_state(cpuhp);
 +	cpuhp_remove_state(CPUHP_AP_HYPERV_ONLINE);
  free_ghcb_page:
  	free_percpu(hv_ghcb_pg);
  free_vp_assist_page:
-1
arch/x86/include/asm/mshyperv.h
···
  }

  #if IS_ENABLED(CONFIG_HYPERV)
 -extern int hyperv_init_cpuhp;
  extern bool hyperv_paravisor_present;

  extern void *hv_hypercall_pg;
+18 -3
arch/x86/kernel/cpu/mshyperv.c
···
  	 * Call hv_cpu_die() on all the CPUs, otherwise later the hypervisor
  	 * corrupts the old VP Assist Pages and can crash the kexec kernel.
  	 */
 -	if (kexec_in_progress && hyperv_init_cpuhp > 0)
 -		cpuhp_remove_state(hyperv_init_cpuhp);
 +	if (kexec_in_progress)
 +		cpuhp_remove_state(CPUHP_AP_HYPERV_ONLINE);

  	/* The function calls stop_other_cpus(). */
  	native_machine_shutdown();
···
  	    ms_hyperv.misc_features & HV_FEATURE_FREQUENCY_MSRS_AVAILABLE) {
  		x86_platform.calibrate_tsc = hv_get_tsc_khz;
  		x86_platform.calibrate_cpu = hv_get_tsc_khz;
 +		setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
  	}

  	if (ms_hyperv.priv_high & HV_ISOLATION) {
···
  		ms_hyperv.hints &= ~HV_X64_APIC_ACCESS_RECOMMENDED;

  		if (!ms_hyperv.paravisor_present) {
 -			/* To be supported: more work is required. */
 +			/*
 +			 * Mark the Hyper-V TSC page feature as disabled
 +			 * in a TDX VM without paravisor so that the
 +			 * Invariant TSC, which is a better clocksource
 +			 * anyway, is used instead.
 +			 */
  			ms_hyperv.features &= ~HV_MSR_REFERENCE_TSC_AVAILABLE;
 +
 +			/*
 +			 * The Invariant TSC is expected to be available
 +			 * in a TDX VM without paravisor, but if not,
 +			 * print a warning message. The slower Hyper-V MSR-based
 +			 * Ref Counter should end up being the clocksource.
 +			 */
 +			if (!(ms_hyperv.features & HV_ACCESS_TSC_INVARIANT))
 +				pr_warn("Hyper-V: Invariant TSC is unavailable\n");

  			/* HV_MSR_CRASH_CTL is unsupported. */
  			ms_hyperv.misc_features &= ~HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE;
+15 -1
drivers/clocksource/hyperv_timer.c
···
  	ce->name = "Hyper-V clockevent";
  	ce->features = CLOCK_EVT_FEAT_ONESHOT;
  	ce->cpumask = cpumask_of(cpu);
 -	ce->rating = 1000;
 +
 +	/*
 +	 * Lower the rating of the Hyper-V timer in a TDX VM without paravisor,
 +	 * so the local APIC timer (lapic_clockevent) is the default timer in
 +	 * such a VM. The Hyper-V timer is not preferred in such a VM because
 +	 * it depends on the slow VM Reference Counter MSR (the Hyper-V TSC
 +	 * page is not enabled in such a VM because the VM uses Invariant TSC
 +	 * as a better clocksource and it's challenging to mark the Hyper-V
 +	 * TSC page shared in very early boot).
 +	 */
 +	if (!ms_hyperv.paravisor_present && hv_isolation_type_tdx())
 +		ce->rating = 90;
 +	else
 +		ce->rating = 1000;
 +
  	ce->set_state_shutdown = hv_ce_shutdown;
  	ce->set_state_oneshot = hv_ce_set_oneshot;
  	ce->set_next_event = hv_ce_set_next_event;
+3 -3
drivers/hv/hv.c
···
  	return 0;
  }

 -/*
 - * hv_synic_cleanup - Cleanup routine for hv_synic_init().
 - */
  void hv_synic_disable_regs(unsigned int cpu)
  {
  	struct hv_per_cpu_context *hv_cpu =
···
  	return pending;
  }

 +/*
 + * hv_synic_cleanup - Cleanup routine for hv_synic_init().
 + */
  int hv_synic_cleanup(unsigned int cpu)
  {
  	struct vmbus_channel *channel, *sc;
-6
drivers/hv/hyperv_vmbus.h
···
  int hv_vss_pre_suspend(void);
  int hv_vss_pre_resume(void);
  void hv_vss_onchannelcallback(void *context);
 -
 -int hv_fcopy_init(struct hv_util_service *srv);
 -void hv_fcopy_deinit(void);
 -int hv_fcopy_pre_suspend(void);
 -int hv_fcopy_pre_resume(void);
 -void hv_fcopy_onchannelcallback(void *context);
  void vmbus_initiate_unload(bool crash);

  static inline void hv_poll_channel(struct vmbus_channel *channel,
+2 -2
drivers/hv/vmbus_drv.c
···
  	return attr->mode;
  }

 -static struct attribute_group vmbus_chan_group = {
 +static const struct attribute_group vmbus_chan_group = {
  	.attrs = vmbus_chan_attrs,
  	.is_visible = vmbus_chan_attr_is_visible
  };

 -static struct kobj_type vmbus_chan_ktype = {
 +static const struct kobj_type vmbus_chan_ktype = {
  	.sysfs_ops = &vmbus_chan_sysfs_ops,
  	.release = vmbus_chan_release,
  };
+1 -1
tools/hv/Makefile
···
  clean:
  	rm -f $(ALL_PROGRAMS)
 -	find $(or $(OUTPUT),.) -name '*.o' -delete -o -name '\.*.d' -delete
 +	find $(or $(OUTPUT),.) -name '*.o' -delete -o -name '\.*.d' -delete -o -name '\.*.cmd' -delete

  install: $(ALL_PROGRAMS)
  	install -d -m 755 $(DESTDIR)$(sbindir); \
+1 -1
tools/hv/lsvmbus
···
 -#!/usr/bin/env python
 +#!/usr/bin/env python3
  # SPDX-License-Identifier: GPL-2.0

  import os