Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'hyperv-next-signed-20251207' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux

Pull hyperv updates from Wei Liu:

- Enhancements to Linux as the root partition for the Microsoft Hypervisor:
- Support a new mode called L1VH, which allows Linux to drive the
hypervisor running on the Azure host directly
- Support for MSHV crash dump collection
- Allow Linux's memory management subsystem to better manage guest
memory regions
- Fix issues that prevented a clean shutdown of the whole system on
bare metal and nested configurations
- ARM64 support for the MSHV driver
- Various other bug fixes and cleanups

- Add support for Confidential VMBus for Linux guest on Hyper-V

- Secure AVIC support for Linux guests on Hyper-V

- Add the mshv_vtl driver to allow Linux to run as the secure kernel in
a higher virtual trust level for Hyper-V

* tag 'hyperv-next-signed-20251207' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: (58 commits)
mshv: Cleanly shutdown root partition with MSHV
mshv: Use reboot notifier to configure sleep state
mshv: Add definitions for MSHV sleep state configuration
mshv: Add support for movable memory regions
mshv: Add refcount and locking to mem regions
mshv: Fix huge page handling in memory region traversal
mshv: Move region management to mshv_regions.c
mshv: Centralize guest memory region destruction
mshv: Refactor and rename memory region handling functions
mshv: adjust interrupt control structure for ARM64
Drivers: hv: use kmalloc_array() instead of kmalloc()
mshv: Add ioctl for self targeted passthrough hvcalls
Drivers: hv: Introduce mshv_vtl driver
Drivers: hv: Export some symbols for mshv_vtl
static_call: allow using STATIC_CALL_TRAMP_STR() from assembly
mshv: Extend create partition ioctl to support cpu features
mshv: Allow mappings that overlap in uaddr
mshv: Fix create memory region overlap check
mshv: add WQ_PERCPU to alloc_workqueue users
Drivers: hv: Use kmalloc_array() instead of kmalloc()
...

+5013 -702
+138 -1
Documentation/virt/hyperv/coco.rst
··· 178 178 179 179 * VMBus monitor pages 180 180 181 - * Synthetic interrupt controller (synic) related pages (unless supplied by 181 + * Synthetic interrupt controller (SynIC) related pages (unless supplied by 182 182 the paravisor) 183 183 184 184 * Per-cpu hypercall input and output pages (unless running with a paravisor) ··· 231 231 with arguments explicitly describing the access. See 232 232 _hv_pcifront_read_config() and _hv_pcifront_write_config() and the 233 233 "use_calls" flag indicating to use hypercalls. 234 + 235 + Confidential VMBus 236 + ------------------ 237 + The confidential VMBus enables the confidential guest not to interact with 238 + the untrusted host partition and the untrusted hypervisor. Instead, the guest 239 + relies on the trusted paravisor to communicate with the devices processing 240 + sensitive data. The hardware (SNP or TDX) encrypts the guest memory and the 241 + register state while measuring the paravisor image using the platform security 242 + processor to ensure trusted and confidential computing. 243 + 244 + Confidential VMBus provides a secure communication channel between the guest 245 + and the paravisor, ensuring that sensitive data is protected from hypervisor- 246 + level access through memory encryption and register state isolation. 247 + 248 + Confidential VMBus is an extension of Confidential Computing (CoCo) VMs 249 + (a.k.a. "Isolated" VMs in Hyper-V terminology). Without Confidential VMBus, 250 + guest VMBus device drivers (the "VSC"s in VMBus terminology) communicate 251 + with VMBus servers (the VSPs) running on the Hyper-V host. The 252 + communication must be through memory that has been decrypted so the 253 + host can access it. With Confidential VMBus, one or more of the VSPs reside 254 + in the trusted paravisor layer in the guest VM. 
Since the paravisor layer also 255 + operates in encrypted memory, the memory used for communication with 256 + such VSPs does not need to be decrypted and thereby exposed to the 257 + Hyper-V host. The paravisor is responsible for communicating securely 258 + with the Hyper-V host as necessary. 259 + 260 + The data is transferred directly between the VM and a vPCI device (a.k.a. 261 + a PCI pass-thru device, see :doc:`vpci`) that is directly assigned to VTL2 262 + and that supports encrypted memory. In such a case, neither the host partition 263 + nor the hypervisor has any access to the data. The guest needs to establish 264 + a VMBus connection only with the paravisor for the channels that process 265 + sensitive data, and the paravisor abstracts the details of communicating 266 + with the specific devices away providing the guest with the well-established 267 + VSP (Virtual Service Provider) interface that has had support in the Hyper-V 268 + drivers for a decade. 269 + 270 + In the case the device does not support encrypted memory, the paravisor 271 + provides bounce-buffering, and although the data is not encrypted, the backing 272 + pages aren't mapped into the host partition through SLAT. While not impossible, 273 + it becomes much more difficult for the host partition to exfiltrate the data 274 + than it would be with a conventional VMBus connection where the host partition 275 + has direct access to the memory used for communication. 
276 + 277 + Here is the data flow for a conventional VMBus connection (`C` stands for the 278 + client or VSC, `S` for the server or VSP, the `DEVICE` is a physical one, might 279 + be with multiple virtual functions):: 280 + 281 + +---- GUEST ----+ +----- DEVICE ----+ +----- HOST -----+ 282 + | | | | | | 283 + | | | | | | 284 + | | | ========== | 285 + | | | | | | 286 + | | | | | | 287 + | | | | | | 288 + +----- C -------+ +-----------------+ +------- S ------+ 289 + || || 290 + || || 291 + +------||------------------ VMBus --------------------------||------+ 292 + | Interrupts, MMIO | 293 + +-------------------------------------------------------------------+ 294 + 295 + and the Confidential VMBus connection:: 296 + 297 + +---- GUEST --------------- VTL0 ------+ +-- DEVICE --+ 298 + | | | | 299 + | +- PARAVISOR --------- VTL2 -----+ | | | 300 + | | +-- VMBus Relay ------+ ====+================ | 301 + | | | Interrupts, MMIO | | | | | 302 + | | +-------- S ----------+ | | +------------+ 303 + | | || | | 304 + | +---------+ || | | 305 + | | Linux | || OpenHCL | | 306 + | | kernel | || | | 307 + | +---- C --+-----||---------------+ | 308 + | || || | 309 + +-------++------- C -------------------+ +------------+ 310 + || | HOST | 311 + || +---- S -----+ 312 + +-------||----------------- VMBus ---------------------------||-----+ 313 + | Interrupts, MMIO | 314 + +-------------------------------------------------------------------+ 315 + 316 + An implementation of the VMBus relay that offers the Confidential VMBus 317 + channels is available in the OpenVMM project as a part of the OpenHCL 318 + paravisor. Please refer to 319 + 320 + * https://openvmm.dev/, and 321 + * https://github.com/microsoft/openvmm 322 + 323 + for more information about the OpenHCL paravisor. 324 + 325 + A guest that is running with a paravisor must determine at runtime if 326 + Confidential VMBus is supported by the current paravisor. 
The x86_64-specific 327 + approach relies on the CPUID Virtualization Stack leaf; the ARM64 implementation 328 + is expected to support the Confidential VMBus unconditionally when running 329 + ARM CCA guests. 330 + 331 + Confidential VMBus is a characteristic of the VMBus connection as a whole, 332 + and of each VMBus channel that is created. When a Confidential VMBus 333 + connection is established, the paravisor provides the guest the message-passing 334 + path that is used for VMBus device creation and deletion, and it provides a 335 + per-CPU synthetic interrupt controller (SynIC) just like the SynIC that is 336 + offered by the Hyper-V host. Each VMBus device that is offered to the guest 337 + indicates the degree to which it participates in Confidential VMBus. The offer 338 + indicates if the device uses encrypted ring buffers, and if the device uses 339 + encrypted memory for DMA that is done outside the ring buffer. These settings 340 + may be different for different devices using the same Confidential VMBus 341 + connection. 342 + 343 + Although these settings are separate, in practice it'll always be encrypted 344 + ring buffer only, or both encrypted ring buffer and external data. If a channel 345 + is offered by the paravisor with confidential VMBus, the ring buffer can always 346 + be encrypted since it's strictly for communication between the VTL2 paravisor 347 + and the VTL0 guest. However, other memory regions are often used for e.g. DMA, 348 + so they need to be accessible by the underlying hardware, and must be 349 + unencrypted (unless the device supports encrypted memory). Currently, there are 350 + not any VSPs in OpenHCL that support encrypted external memory, but future 351 + versions are expected to enable this capability. 
352 + 353 + Because some devices on a Confidential VMBus may require decrypted ring buffers 354 + and DMA transfers, the guest must interact with two SynICs -- the one provided 355 + by the paravisor and the one provided by the Hyper-V host when Confidential 356 + VMBus is not offered. Interrupts are always signaled by the paravisor SynIC, 357 + but the guest must check for messages and for channel interrupts on both SynICs. 358 + 359 + In the case of a confidential VMBus, regular SynIC access by the guest is 360 + intercepted by the paravisor (this includes various MSRs such as the SIMP and 361 + SIEFP, as well as hypercalls like HvPostMessage and HvSignalEvent). If the 362 + guest actually wants to communicate with the hypervisor, it has to use special 363 + mechanisms (GHCB page on SNP, or tdcall on TDX). Messages can be of either 364 + kind: with confidential VMBus, messages use the paravisor SynIC, and if the 365 + guest chose to communicate directly to the hypervisor, they use the hypervisor 366 + SynIC. For interrupt signaling, some channels may be running on the host 367 + (non-confidential, using the VMBus relay) and use the hypervisor SynIC, and 368 + some on the paravisor and use its SynIC. The RelIDs are coordinated by the 369 + OpenHCL VMBus server and are guaranteed to be unique regardless of whether 370 + the channel originated on the host or the paravisor. 234 371 235 372 load_unaligned_zeropad() 236 373 ------------------------
+3
MAINTAINERS
··· 11705 11705 M: Haiyang Zhang <haiyangz@microsoft.com> 11706 11706 M: Wei Liu <wei.liu@kernel.org> 11707 11707 M: Dexuan Cui <decui@microsoft.com> 11708 + M: Long Li <longli@microsoft.com> 11708 11709 L: linux-hyperv@vger.kernel.org 11709 11710 S: Supported 11710 11711 T: git git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git ··· 11723 11722 F: drivers/clocksource/hyperv_timer.c 11724 11723 F: drivers/hid/hid-hyperv.c 11725 11724 F: drivers/hv/ 11725 + F: drivers/infiniband/hw/mana/ 11726 11726 F: drivers/input/serio/hyperv-keyboard.c 11727 11727 F: drivers/iommu/hyperv-iommu.c 11728 11728 F: drivers/net/ethernet/microsoft/ ··· 11742 11740 F: include/linux/hyperv.h 11743 11741 F: include/net/mana 11744 11742 F: include/uapi/linux/hyperv.h 11743 + F: include/uapi/rdma/mana-abi.h 11745 11744 F: net/vmw_vsock/hyperv_transport.c 11746 11745 F: tools/hv/ 11747 11746
+15 -1
arch/x86/hyperv/Makefile
··· 1 1 # SPDX-License-Identifier: GPL-2.0-only 2 2 obj-y := hv_init.o mmu.o nested.o irqdomain.o ivm.o 3 3 obj-$(CONFIG_X86_64) += hv_apic.o 4 - obj-$(CONFIG_HYPERV_VTL_MODE) += hv_vtl.o 4 + obj-$(CONFIG_HYPERV_VTL_MODE) += hv_vtl.o mshv_vtl_asm.o 5 + 6 + $(obj)/mshv_vtl_asm.o: $(obj)/mshv-asm-offsets.h 7 + 8 + $(obj)/mshv-asm-offsets.h: $(obj)/mshv-asm-offsets.s FORCE 9 + $(call filechk,offsets,__MSHV_ASM_OFFSETS_H__) 5 10 6 11 ifdef CONFIG_X86_64 7 12 obj-$(CONFIG_PARAVIRT_SPINLOCKS) += hv_spinlock.o 13 + 14 + ifdef CONFIG_MSHV_ROOT 15 + CFLAGS_REMOVE_hv_trampoline.o += -pg 16 + CFLAGS_hv_trampoline.o += -fno-stack-protector 17 + obj-$(CONFIG_CRASH_DUMP) += hv_crash.o hv_trampoline.o 18 + endif 8 19 endif 20 + 21 + targets += mshv-asm-offsets.s 22 + clean-files += mshv-asm-offsets.h
+8
arch/x86/hyperv/hv_apic.c
··· 53 53 wrmsrq(HV_X64_MSR_ICR, reg_val); 54 54 } 55 55 56 + void hv_enable_coco_interrupt(unsigned int cpu, unsigned int vector, bool set) 57 + { 58 + apic_update_vector(cpu, vector, set); 59 + } 60 + 56 61 static u32 hv_apic_read(u32 reg) 57 62 { 58 63 u32 reg_val, hi; ··· 298 293 299 294 void __init hv_apic_init(void) 300 295 { 296 + if (cc_platform_has(CC_ATTR_SNP_SECURE_AVIC)) 297 + return; 298 + 301 299 if (ms_hyperv.hints & HV_X64_CLUSTER_IPI_RECOMMENDED) { 302 300 pr_info("Hyper-V: Using IPI hypercalls\n"); 303 301 /*
+642
arch/x86/hyperv/hv_crash.c
··· 1 + // SPDX-License-Identifier: GPL-2.0-only 2 + /* 3 + * X86 specific Hyper-V root partition kdump/crash support module 4 + * 5 + * Copyright (C) 2025, Microsoft, Inc. 6 + * 7 + * This module implements hypervisor RAM collection into vmcore for both 8 + * cases of the hypervisor crash and Linux root crash. Hyper-V implements 9 + * a disable hypercall with a 32bit protected mode ABI callback. This 10 + * mechanism must be used to unlock hypervisor RAM. Since the hypervisor RAM 11 + * is already mapped in Linux, it is automatically collected into Linux vmcore, 12 + * and can be examined by the crash command (raw RAM dump) or windbg. 13 + * 14 + * At a high level: 15 + * 16 + * Hypervisor Crash: 17 + * Upon crash, hypervisor goes into an emergency minimal dispatch loop, a 18 + * restrictive mode with very limited hypercall and MSR support. Each cpu 19 + * then injects NMIs into root vcpus. A shared page is used to check 20 + * by Linux in the NMI handler if the hypervisor has crashed. This shared 21 + * page is setup in hv_root_crash_init during boot. 22 + * 23 + * Linux Crash: 24 + * In case of Linux crash, the callback hv_crash_stop_other_cpus will send 25 + * NMIs to all cpus, then proceed to the crash_nmi_callback where it waits 26 + * for all cpus to be in NMI. 27 + * 28 + * NMI Handler (upon quorum): 29 + * Eventually, in both cases, all cpus will end up in the NMI handler. 30 + * Hyper-V requires the disable hypervisor must be done from the BSP. So 31 + * the BSP NMI handler saves current context, does some fixups and makes 32 + * the hypercall to disable the hypervisor, ie, devirtualize. Hypervisor 33 + * at that point will suspend all vcpus (except the BSP), unlock all its 34 + * RAM, and return to Linux at the 32bit mode entry RIP. 35 + * 36 + * Linux 32bit entry trampoline will then restore long mode and call C 37 + * function here to restore context and continue execution to crash kexec. 
38 + */ 39 + 40 + #include <linux/delay.h> 41 + #include <linux/kexec.h> 42 + #include <linux/crash_dump.h> 43 + #include <linux/panic.h> 44 + #include <asm/apic.h> 45 + #include <asm/desc.h> 46 + #include <asm/page.h> 47 + #include <asm/pgalloc.h> 48 + #include <asm/mshyperv.h> 49 + #include <asm/nmi.h> 50 + #include <asm/idtentry.h> 51 + #include <asm/reboot.h> 52 + #include <asm/intel_pt.h> 53 + 54 + bool hv_crash_enabled; 55 + EXPORT_SYMBOL_GPL(hv_crash_enabled); 56 + 57 + struct hv_crash_ctxt { 58 + ulong rsp; 59 + ulong cr0; 60 + ulong cr2; 61 + ulong cr4; 62 + ulong cr8; 63 + 64 + u16 cs; 65 + u16 ss; 66 + u16 ds; 67 + u16 es; 68 + u16 fs; 69 + u16 gs; 70 + 71 + u16 gdt_fill; 72 + struct desc_ptr gdtr; 73 + char idt_fill[6]; 74 + struct desc_ptr idtr; 75 + 76 + u64 gsbase; 77 + u64 efer; 78 + u64 pat; 79 + }; 80 + static struct hv_crash_ctxt hv_crash_ctxt; 81 + 82 + /* Shared hypervisor page that contains crash dump area we peek into. 83 + * NB: windbg looks for "hv_cda" symbol so don't change it. 
84 + */ 85 + static struct hv_crashdump_area *hv_cda; 86 + 87 + static u32 trampoline_pa, devirt_arg; 88 + static atomic_t crash_cpus_wait; 89 + static void *hv_crash_ptpgs[4]; 90 + static bool hv_has_crashed, lx_has_crashed; 91 + 92 + static void __noreturn hv_panic_timeout_reboot(void) 93 + { 94 + #define PANIC_TIMER_STEP 100 95 + 96 + if (panic_timeout > 0) { 97 + int i; 98 + 99 + for (i = 0; i < panic_timeout * 1000; i += PANIC_TIMER_STEP) 100 + mdelay(PANIC_TIMER_STEP); 101 + } 102 + 103 + if (panic_timeout) 104 + native_wrmsrq(HV_X64_MSR_RESET, 1); /* get hyp to reboot */ 105 + 106 + for (;;) 107 + cpu_relax(); 108 + } 109 + 110 + /* This cannot be inlined as it needs stack */ 111 + static noinline __noclone void hv_crash_restore_tss(void) 112 + { 113 + load_TR_desc(); 114 + } 115 + 116 + /* This cannot be inlined as it needs stack */ 117 + static noinline void hv_crash_clear_kernpt(void) 118 + { 119 + pgd_t *pgd; 120 + p4d_t *p4d; 121 + 122 + /* Clear entry so it's not confusing to someone looking at the core */ 123 + pgd = pgd_offset_k(trampoline_pa); 124 + p4d = p4d_offset(pgd, trampoline_pa); 125 + native_p4d_clear(p4d); 126 + } 127 + 128 + /* 129 + * This is the C entry point from the asm glue code after the disable hypercall. 130 + * We enter here in IA32-e long mode, ie, full 64bit mode running on kernel 131 + * page tables with our below 4G page identity mapped, but using a temporary 132 + * GDT. ds/fs/gs/es are null. ss is not usable. bp is null. stack is not 133 + * available. We restore kernel GDT, and rest of the context, and continue 134 + * to kexec. 
135 + */ 136 + static asmlinkage void __noreturn hv_crash_c_entry(void) 137 + { 138 + struct hv_crash_ctxt *ctxt = &hv_crash_ctxt; 139 + 140 + /* first thing, restore kernel gdt */ 141 + native_load_gdt(&ctxt->gdtr); 142 + 143 + asm volatile("movw %%ax, %%ss" : : "a"(ctxt->ss)); 144 + asm volatile("movq %0, %%rsp" : : "m"(ctxt->rsp)); 145 + 146 + asm volatile("movw %%ax, %%ds" : : "a"(ctxt->ds)); 147 + asm volatile("movw %%ax, %%es" : : "a"(ctxt->es)); 148 + asm volatile("movw %%ax, %%fs" : : "a"(ctxt->fs)); 149 + asm volatile("movw %%ax, %%gs" : : "a"(ctxt->gs)); 150 + 151 + native_wrmsrq(MSR_IA32_CR_PAT, ctxt->pat); 152 + asm volatile("movq %0, %%cr0" : : "r"(ctxt->cr0)); 153 + 154 + asm volatile("movq %0, %%cr8" : : "r"(ctxt->cr8)); 155 + asm volatile("movq %0, %%cr4" : : "r"(ctxt->cr4)); 156 + asm volatile("movq %0, %%cr2" : : "r"(ctxt->cr2)); 157 + 158 + native_load_idt(&ctxt->idtr); 159 + native_wrmsrq(MSR_GS_BASE, ctxt->gsbase); 160 + native_wrmsrq(MSR_EFER, ctxt->efer); 161 + 162 + /* restore the original kernel CS now via far return */ 163 + asm volatile("movzwq %0, %%rax\n\t" 164 + "pushq %%rax\n\t" 165 + "pushq $1f\n\t" 166 + "lretq\n\t" 167 + "1:nop\n\t" : : "m"(ctxt->cs) : "rax"); 168 + 169 + /* We are in asmlinkage without stack frame, hence make C function 170 + * calls which will buy stack frames. 171 + */ 172 + hv_crash_restore_tss(); 173 + hv_crash_clear_kernpt(); 174 + 175 + /* we are now fully in devirtualized normal kernel mode */ 176 + __crash_kexec(NULL); 177 + 178 + hv_panic_timeout_reboot(); 179 + } 180 + /* Tell gcc we are using lretq long jump in the above function intentionally */ 181 + STACK_FRAME_NON_STANDARD(hv_crash_c_entry); 182 + 183 + static void hv_mark_tss_not_busy(void) 184 + { 185 + struct desc_struct *desc = get_current_gdt_rw(); 186 + tss_desc tss; 187 + 188 + memcpy(&tss, &desc[GDT_ENTRY_TSS], sizeof(tss_desc)); 189 + tss.type = 0x9; /* available 64-bit TSS.
0xB is busy TSS */ 190 + write_gdt_entry(desc, GDT_ENTRY_TSS, &tss, DESC_TSS); 191 + } 192 + 193 + /* Save essential context */ 194 + static void hv_hvcrash_ctxt_save(void) 195 + { 196 + struct hv_crash_ctxt *ctxt = &hv_crash_ctxt; 197 + 198 + asm volatile("movq %%rsp,%0" : "=m"(ctxt->rsp)); 199 + 200 + ctxt->cr0 = native_read_cr0(); 201 + ctxt->cr4 = native_read_cr4(); 202 + 203 + asm volatile("movq %%cr2, %0" : "=a"(ctxt->cr2)); 204 + asm volatile("movq %%cr8, %0" : "=a"(ctxt->cr8)); 205 + 206 + asm volatile("movl %%cs, %%eax" : "=a"(ctxt->cs)); 207 + asm volatile("movl %%ss, %%eax" : "=a"(ctxt->ss)); 208 + asm volatile("movl %%ds, %%eax" : "=a"(ctxt->ds)); 209 + asm volatile("movl %%es, %%eax" : "=a"(ctxt->es)); 210 + asm volatile("movl %%fs, %%eax" : "=a"(ctxt->fs)); 211 + asm volatile("movl %%gs, %%eax" : "=a"(ctxt->gs)); 212 + 213 + native_store_gdt(&ctxt->gdtr); 214 + store_idt(&ctxt->idtr); 215 + 216 + ctxt->gsbase = __rdmsr(MSR_GS_BASE); 217 + ctxt->efer = __rdmsr(MSR_EFER); 218 + ctxt->pat = __rdmsr(MSR_IA32_CR_PAT); 219 + } 220 + 221 + /* Add trampoline page to the kernel pagetable for transition to kernel PT */ 222 + static void hv_crash_fixup_kernpt(void) 223 + { 224 + pgd_t *pgd; 225 + p4d_t *p4d; 226 + 227 + pgd = pgd_offset_k(trampoline_pa); 228 + p4d = p4d_offset(pgd, trampoline_pa); 229 + 230 + /* trampoline_pa is below 4G, so no pre-existing entry to clobber */ 231 + p4d_populate(&init_mm, p4d, (pud_t *)hv_crash_ptpgs[1]); 232 + p4d->p4d = p4d->p4d & ~(_PAGE_NX); /* enable execute */ 233 + } 234 + 235 + /* 236 + * Notify the hyp that Linux has crashed. This will cause the hyp to quiesce 237 + * and suspend all guest VPs. 
238 + */ 239 + static void hv_notify_prepare_hyp(void) 240 + { 241 + u64 status; 242 + struct hv_input_notify_partition_event *input; 243 + struct hv_partition_event_root_crashdump_input *cda; 244 + 245 + input = *this_cpu_ptr(hyperv_pcpu_input_arg); 246 + cda = &input->input.crashdump_input; 247 + memset(input, 0, sizeof(*input)); 248 + input->event = HV_PARTITION_EVENT_ROOT_CRASHDUMP; 249 + 250 + cda->crashdump_action = HV_CRASHDUMP_ENTRY; 251 + status = hv_do_hypercall(HVCALL_NOTIFY_PARTITION_EVENT, input, NULL); 252 + if (!hv_result_success(status)) 253 + return; 254 + 255 + cda->crashdump_action = HV_CRASHDUMP_SUSPEND_ALL_VPS; 256 + hv_do_hypercall(HVCALL_NOTIFY_PARTITION_EVENT, input, NULL); 257 + } 258 + 259 + /* 260 + * Common function for all cpus before devirtualization. 261 + * 262 + * Hypervisor crash: all cpus get here in NMI context. 263 + * Linux crash: the panicing cpu gets here at base level, all others in NMI 264 + * context. Note, panicing cpu may not be the BSP. 265 + * 266 + * The function is not inlined so it will show on the stack. It is named so 267 + * because the crash cmd looks for certain well known function names on the 268 + * stack before looking into the cpu saved note in the elf section, and 269 + * that work is currently incomplete. 270 + * 271 + * Notes: 272 + * Hypervisor crash: 273 + * - the hypervisor is in a very restrictive mode at this point and any 274 + * vmexit it cannot handle would result in reboot. So, no mumbo jumbo, 275 + * just get to kexec as quickly as possible. 276 + * 277 + * Devirtualization is supported from the BSP only at present. 
278 + */ 279 + static noinline __noclone void crash_nmi_callback(struct pt_regs *regs) 280 + { 281 + struct hv_input_disable_hyp_ex *input; 282 + u64 status; 283 + int msecs = 1000, ccpu = smp_processor_id(); 284 + 285 + if (ccpu == 0) { 286 + /* crash_save_cpu() will be done in the kexec path */ 287 + cpu_emergency_stop_pt(); /* disable performance trace */ 288 + atomic_inc(&crash_cpus_wait); 289 + } else { 290 + crash_save_cpu(regs, ccpu); 291 + cpu_emergency_stop_pt(); /* disable performance trace */ 292 + atomic_inc(&crash_cpus_wait); 293 + for (;;) 294 + cpu_relax(); 295 + } 296 + 297 + while (atomic_read(&crash_cpus_wait) < num_online_cpus() && msecs--) 298 + mdelay(1); 299 + 300 + stop_nmi(); 301 + if (!hv_has_crashed) 302 + hv_notify_prepare_hyp(); 303 + 304 + if (crashing_cpu == -1) 305 + crashing_cpu = ccpu; /* crash cmd uses this */ 306 + 307 + hv_hvcrash_ctxt_save(); 308 + hv_mark_tss_not_busy(); 309 + hv_crash_fixup_kernpt(); 310 + 311 + input = *this_cpu_ptr(hyperv_pcpu_input_arg); 312 + memset(input, 0, sizeof(*input)); 313 + input->rip = trampoline_pa; 314 + input->arg = devirt_arg; 315 + 316 + status = hv_do_hypercall(HVCALL_DISABLE_HYP_EX, input, NULL); 317 + 318 + hv_panic_timeout_reboot(); 319 + } 320 + 321 + 322 + static DEFINE_SPINLOCK(hv_crash_reboot_lk); 323 + 324 + /* 325 + * Generic NMI callback handler: could be called without any crash also. 
326 + * hv crash: hypervisor injects NMI's into all cpus 327 + * lx crash: panicing cpu sends NMI to all but self via crash_stop_other_cpus 328 + */ 329 + static int hv_crash_nmi_local(unsigned int cmd, struct pt_regs *regs) 330 + { 331 + if (!hv_has_crashed && hv_cda && hv_cda->cda_valid) 332 + hv_has_crashed = true; 333 + 334 + if (!hv_has_crashed && !lx_has_crashed) 335 + return NMI_DONE; /* ignore the NMI */ 336 + 337 + if (hv_has_crashed && !kexec_crash_loaded()) { 338 + if (spin_trylock(&hv_crash_reboot_lk)) 339 + hv_panic_timeout_reboot(); 340 + else 341 + for (;;) 342 + cpu_relax(); 343 + } 344 + 345 + crash_nmi_callback(regs); 346 + 347 + return NMI_DONE; 348 + } 349 + 350 + /* 351 + * hv_crash_stop_other_cpus() == smp_ops.crash_stop_other_cpus 352 + * 353 + * On normal Linux panic, this is called twice: first from panic and then again 354 + * from native_machine_crash_shutdown. 355 + * 356 + * In case of hyperv, 3 ways to get here: 357 + * 1. hv crash (only BSP will get here): 358 + * BSP : NMI callback -> DisableHv -> hv_crash_asm32 -> hv_crash_c_entry 359 + * -> __crash_kexec -> native_machine_crash_shutdown 360 + * -> crash_smp_send_stop -> smp_ops.crash_stop_other_cpus 361 + * Linux panic: 362 + * 2. panic cpu x: panic() -> crash_smp_send_stop 363 + * -> smp_ops.crash_stop_other_cpus 364 + * 3. BSP: native_machine_crash_shutdown -> crash_smp_send_stop 365 + * 366 + * NB: noclone and non standard stack because of call to crash_setup_regs(). 367 + */ 368 + static void __noclone hv_crash_stop_other_cpus(void) 369 + { 370 + static bool crash_stop_done; 371 + struct pt_regs lregs; 372 + int ccpu = smp_processor_id(); 373 + 374 + if (hv_has_crashed) 375 + return; /* all cpus already in NMI handler path */ 376 + 377 + if (!kexec_crash_loaded()) { 378 + hv_notify_prepare_hyp(); 379 + hv_panic_timeout_reboot(); /* no return */ 380 + } 381 + 382 + /* If the hv crashes also, we could come here again before cpus_stopped 383 + * is set in crash_smp_send_stop(). 
So use our own check. 384 + */ 385 + if (crash_stop_done) 386 + return; 387 + crash_stop_done = true; 388 + 389 + /* Linux has crashed: hv is healthy, we can IPI safely */ 390 + lx_has_crashed = true; 391 + wmb(); /* NMI handlers look at lx_has_crashed */ 392 + 393 + apic->send_IPI_allbutself(NMI_VECTOR); 394 + 395 + if (crashing_cpu == -1) 396 + crashing_cpu = ccpu; /* crash cmd uses this */ 397 + 398 + /* crash_setup_regs() happens in kexec also, but for the kexec cpu which 399 + * is the BSP. We could be here on non-BSP cpu, collect regs if so. 400 + */ 401 + if (ccpu) 402 + crash_setup_regs(&lregs, NULL); 403 + 404 + crash_nmi_callback(&lregs); 405 + } 406 + STACK_FRAME_NON_STANDARD(hv_crash_stop_other_cpus); 407 + 408 + /* This GDT is accessed in IA32-e compat mode which uses 32bits addresses */ 409 + struct hv_gdtreg_32 { 410 + u16 fill; 411 + u16 limit; 412 + u32 address; 413 + } __packed; 414 + 415 + /* We need a CS with L bit to goto IA32-e long mode from 32bit compat mode */ 416 + struct hv_crash_tramp_gdt { 417 + u64 null; /* index 0, selector 0, null selector */ 418 + u64 cs64; /* index 1, selector 8, cs64 selector */ 419 + } __packed; 420 + 421 + /* No stack, so jump via far ptr in memory to load the 64bit CS */ 422 + struct hv_cs_jmptgt { 423 + u32 address; 424 + u16 csval; 425 + u16 fill; 426 + } __packed; 427 + 428 + /* Linux use only, hypervisor doesn't look at this struct */ 429 + struct hv_crash_tramp_data { 430 + u64 tramp32_cr3; 431 + u64 kernel_cr3; 432 + struct hv_gdtreg_32 gdtr32; 433 + struct hv_crash_tramp_gdt tramp_gdt; 434 + struct hv_cs_jmptgt cs_jmptgt; 435 + u64 c_entry_addr; 436 + } __packed; 437 + 438 + /* 439 + * Setup a temporary gdt to allow the asm code to switch to the long mode. 440 + * Since the asm code is relocated/copied to a below 4G page, it cannot use rip 441 + * relative addressing, hence we must use trampoline_pa here. Also, save other 442 + * info like jmp and C entry targets for same reasons. 
443 + * 444 + * Returns: 0 on success, -1 on error 445 + */ 446 + static int hv_crash_setup_trampdata(u64 trampoline_va) 447 + { 448 + int size, offs; 449 + void *dest; 450 + struct hv_crash_tramp_data *tramp; 451 + 452 + /* These must match exactly the ones in the corresponding asm file */ 453 + BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, tramp32_cr3) != 0); 454 + BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, kernel_cr3) != 8); 455 + BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, gdtr32.limit) != 18); 456 + BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, 457 + cs_jmptgt.address) != 40); 458 + BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, c_entry_addr) != 48); 459 + 460 + /* hv_crash_asm_end is beyond last byte by 1 */ 461 + size = &hv_crash_asm_end - &hv_crash_asm32; 462 + if (size + sizeof(struct hv_crash_tramp_data) > PAGE_SIZE) { 463 + pr_err("%s: trampoline page overflow\n", __func__); 464 + return -1; 465 + } 466 + 467 + dest = (void *)trampoline_va; 468 + memcpy(dest, &hv_crash_asm32, size); 469 + 470 + dest += size; 471 + dest = (void *)round_up((ulong)dest, 16); 472 + tramp = (struct hv_crash_tramp_data *)dest; 473 + 474 + /* see MAX_ASID_AVAILABLE in tlb.c: "PCID 0 is reserved for use by 475 + * non-PCID-aware users". 
+	 * Build cr3 with pcid 0
+	 */
+	tramp->tramp32_cr3 = __sme_pa(hv_crash_ptpgs[0]);
+
+	/* Note, when restoring X86_CR4_PCIDE, cr3[11:0] must be zero */
+	tramp->kernel_cr3 = __sme_pa(init_mm.pgd);
+
+	tramp->gdtr32.limit = sizeof(struct hv_crash_tramp_gdt);
+	tramp->gdtr32.address = trampoline_pa +
+				(ulong)&tramp->tramp_gdt - trampoline_va;
+
+	/* base:0 limit:0xfffff type:b dpl:0 P:1 L:1 D:0 avl:0 G:1 */
+	tramp->tramp_gdt.cs64 = 0x00af9a000000ffff;
+
+	tramp->cs_jmptgt.csval = 0x8;
+	offs = (ulong)&hv_crash_asm64 - (ulong)&hv_crash_asm32;
+	tramp->cs_jmptgt.address = trampoline_pa + offs;
+
+	tramp->c_entry_addr = (u64)&hv_crash_c_entry;
+
+	devirt_arg = trampoline_pa + (ulong)dest - trampoline_va;
+
+	return 0;
+}
+
+/*
+ * Build 32bit trampoline page table for transition from protected mode
+ * non-paging to long-mode paging. This transition needs pagetables below 4G.
+ */
+static void hv_crash_build_tramp_pt(void)
+{
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+	u64 pa, addr = trampoline_pa;
+
+	p4d = hv_crash_ptpgs[0] + pgd_index(addr) * sizeof(p4d);
+	pa = virt_to_phys(hv_crash_ptpgs[1]);
+	set_p4d(p4d, __p4d(_PAGE_TABLE | pa));
+	p4d->p4d &= ~(_PAGE_NX);	/* enable execute */
+
+	pud = hv_crash_ptpgs[1] + pud_index(addr) * sizeof(pud);
+	pa = virt_to_phys(hv_crash_ptpgs[2]);
+	set_pud(pud, __pud(_PAGE_TABLE | pa));
+
+	pmd = hv_crash_ptpgs[2] + pmd_index(addr) * sizeof(pmd);
+	pa = virt_to_phys(hv_crash_ptpgs[3]);
+	set_pmd(pmd, __pmd(_PAGE_TABLE | pa));
+
+	pte = hv_crash_ptpgs[3] + pte_index(addr) * sizeof(pte);
+	set_pte(pte, pfn_pte(addr >> PAGE_SHIFT, PAGE_KERNEL_EXEC));
+}
+
+/*
+ * Setup trampoline for devirtualization:
+ *   - a page below 4G, ie 32bit addr containing asm glue code that hyp jmps to
+ *     in protected mode.
+ *   - 4 pages for a temporary page table that asm code uses to turn paging on
+ *   - a temporary gdt to use in the compat mode.
+ *
+ * Returns: 0 on success
+ */
+static int hv_crash_trampoline_setup(void)
+{
+	int i, rc, order;
+	struct page *page;
+	u64 trampoline_va;
+	gfp_t flags32 = GFP_KERNEL | GFP_DMA32 | __GFP_ZERO;
+
+	/* page for 32bit trampoline assembly code + hv_crash_tramp_data */
+	page = alloc_page(flags32);
+	if (page == NULL) {
+		pr_err("%s: failed to alloc asm stub page\n", __func__);
+		return -1;
+	}
+
+	trampoline_va = (u64)page_to_virt(page);
+	trampoline_pa = (u32)page_to_phys(page);
+
+	order = 2;	/* alloc 2^2 pages */
+	page = alloc_pages(flags32, order);
+	if (page == NULL) {
+		pr_err("%s: failed to alloc pt pages\n", __func__);
+		free_page(trampoline_va);
+		return -1;
+	}
+
+	for (i = 0; i < 4; i++, page++)
+		hv_crash_ptpgs[i] = page_to_virt(page);
+
+	hv_crash_build_tramp_pt();
+
+	rc = hv_crash_setup_trampdata(trampoline_va);
+	if (rc)
+		goto errout;
+
+	return 0;
+
+errout:
+	free_page(trampoline_va);
+	free_pages((ulong)hv_crash_ptpgs[0], order);
+
+	return rc;
+}
+
+/* Setup for kdump kexec to collect hypervisor RAM when running as root */
+void hv_root_crash_init(void)
+{
+	int rc;
+	struct hv_input_get_system_property *input;
+	struct hv_output_get_system_property *output;
+	unsigned long flags;
+	u64 status;
+	union hv_pfn_range cda_info;
+
+	if (pgtable_l5_enabled()) {
+		pr_err("Hyper-V: crash dump not yet supported on 5level PTs\n");
+		return;
+	}
+
+	rc = register_nmi_handler(NMI_LOCAL, hv_crash_nmi_local, NMI_FLAG_FIRST,
+				  "hv_crash_nmi");
+	if (rc) {
+		pr_err("Hyper-V: failed to register crash nmi handler\n");
+		return;
+	}
+
+	local_irq_save(flags);
+	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+	output = *this_cpu_ptr(hyperv_pcpu_output_arg);
+
+	memset(input, 0, sizeof(*input));
+	input->property_id = HV_SYSTEM_PROPERTY_CRASHDUMPAREA;
+
+	status = hv_do_hypercall(HVCALL_GET_SYSTEM_PROPERTY, input, output);
+	cda_info.as_uint64 = output->hv_cda_info.as_uint64;
+	local_irq_restore(flags);
+
+	if (!hv_result_success(status)) {
+		pr_err("Hyper-V: %s: property:%d %s\n", __func__,
+		       input->property_id, hv_result_to_string(status));
+		goto err_out;
+	}
+
+	if (cda_info.base_pfn == 0) {
+		pr_err("Hyper-V: hypervisor crash dump area pfn is 0\n");
+		goto err_out;
+	}
+
+	hv_cda = phys_to_virt(cda_info.base_pfn << HV_HYP_PAGE_SHIFT);
+
+	rc = hv_crash_trampoline_setup();
+	if (rc)
+		goto err_out;
+
+	smp_ops.crash_stop_other_cpus = hv_crash_stop_other_cpus;
+
+	crash_kexec_post_notifiers = true;
+	hv_crash_enabled = true;
+	pr_info("Hyper-V: both linux and hypervisor kdump support enabled\n");
+
+	return;
+
+err_out:
+	unregister_nmi_handler(NMI_LOCAL, "hv_crash_nmi");
+	pr_err("Hyper-V: only linux root kdump support enabled\n");
+}
+9
arch/x86/hyperv/hv_init.c
···
 		wrmsrq(HV_X64_MSR_VP_ASSIST_PAGE, msr.as_uint64);
 	}
 
+	/* Allow Hyper-V stimer vector to be injected from Hypervisor. */
+	if (ms_hyperv.misc_features & HV_STIMER_DIRECT_MODE_AVAILABLE)
+		apic_update_vector(cpu, HYPERV_STIMER0_VECTOR, true);
+
 	return hyperv_init_ghcb();
 }
···
 		iounmap(*ghcb_va);
 		*ghcb_va = NULL;
 	}
+
+	if (ms_hyperv.misc_features & HV_STIMER_DIRECT_MODE_AVAILABLE)
+		apic_update_vector(cpu, HYPERV_STIMER0_VECTOR, false);
 
 	hv_common_cpu_die(cpu);
···
 		memunmap(src);
 
 		hv_remap_tsc_clocksource();
+		hv_root_crash_init();
+		hv_sleep_notifiers_register();
 	} else {
 		hypercall_msr.guest_physical_address = vmalloc_to_pfn(hv_hypercall_pg);
 		wrmsrq(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
+101
arch/x86/hyperv/hv_trampoline.S
···
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * X86 specific Hyper-V kdump/crash related code.
+ *
+ * Copyright (C) 2025, Microsoft, Inc.
+ *
+ */
+#include <linux/linkage.h>
+#include <asm/alternative.h>
+#include <asm/msr.h>
+#include <asm/processor-flags.h>
+#include <asm/nospec-branch.h>
+
+/*
+ * void noreturn hv_crash_asm32(arg1)
+ * arg1 == edi == 32bit PA of struct hv_crash_tramp_data
+ *
+ * The hypervisor jumps here upon devirtualization in protected mode. This
+ * code gets copied to a page in the low 4G ie, 32bit space so it can run
+ * in the protected mode. Hence we cannot use any compile/link time offsets or
+ * addresses. It restores long mode via temporary gdt and page tables and
+ * eventually jumps to kernel code entry at HV_CRASHDATA_OFFS_C_entry.
+ *
+ * PreCondition (ie, Hypervisor call back ABI):
+ *  o CR0 is set to 0x0021: PE(prot mode) and NE are set, paging is disabled
+ *  o CR4 is set to 0x0
+ *  o IA32_EFER is set to 0x901 (SCE and NXE are set)
+ *  o EDI is set to the Arg passed to HVCALL_DISABLE_HYP_EX.
+ *  o CS, DS, ES, FS, GS are all initialized with a base of 0 and limit 0xFFFF
+ *  o IDTR, TR and GDTR are initialized with a base of 0 and limit of 0xFFFF
+ *  o LDTR is initialized as invalid (limit of 0)
+ *  o MSR PAT is power on default.
+ *  o Other state/registers are cleared. All TLBs flushed.
+ */
+
+#define HV_CRASHDATA_OFFS_TRAMPCR3	0x0	/* 0 */
+#define HV_CRASHDATA_OFFS_KERNCR3	0x8	/* 8 */
+#define HV_CRASHDATA_OFFS_GDTRLIMIT	0x12	/* 18 */
+#define HV_CRASHDATA_OFFS_CS_JMPTGT	0x28	/* 40 */
+#define HV_CRASHDATA_OFFS_C_entry	0x30	/* 48 */
+
+	.text
+	.code32
+
+SYM_CODE_START(hv_crash_asm32)
+	UNWIND_HINT_UNDEFINED
+	ENDBR
+	movl	$X86_CR4_PAE, %ecx
+	movl	%ecx, %cr4
+
+	movl	%edi, %ebx
+	add	$HV_CRASHDATA_OFFS_TRAMPCR3, %ebx
+	movl	%cs:(%ebx), %eax
+	movl	%eax, %cr3
+
+	/* Setup EFER for long mode now */
+	movl	$MSR_EFER, %ecx
+	rdmsr
+	btsl	$_EFER_LME, %eax
+	wrmsr
+
+	/* Turn paging on using the temp 32bit trampoline page table */
+	movl	%cr0, %eax
+	orl	$(X86_CR0_PG), %eax
+	movl	%eax, %cr0
+
+	/* since kernel cr3 could be above 4G, we need to be in the long mode
+	 * before we can load 64bits of the kernel cr3. We use a temp gdt for
+	 * that with CS.L=1 and CS.D=0 */
+	mov	%edi, %eax
+	add	$HV_CRASHDATA_OFFS_GDTRLIMIT, %eax
+	lgdtl	%cs:(%eax)
+
+	/* not done yet, restore CS now to switch to CS.L=1 */
+	mov	%edi, %eax
+	add	$HV_CRASHDATA_OFFS_CS_JMPTGT, %eax
+	ljmp	%cs:*(%eax)
+SYM_CODE_END(hv_crash_asm32)
+
+/* we now run in full 64bit IA32-e long mode, CS.L=1 and CS.D=0 */
+	.code64
+	.balign 8
+SYM_CODE_START(hv_crash_asm64)
+	UNWIND_HINT_UNDEFINED
+	ENDBR
+	/* restore kernel page tables so we can jump to kernel code */
+	mov	%edi, %eax
+	add	$HV_CRASHDATA_OFFS_KERNCR3, %eax
+	movq	%cs:(%eax), %rbx
+	movq	%rbx, %cr3
+
+	mov	%edi, %eax
+	add	$HV_CRASHDATA_OFFS_C_entry, %eax
+	movq	%cs:(%eax), %rbx
+	ANNOTATE_RETPOLINE_SAFE
+	jmp	*%rbx
+
+	int	$3
+
+SYM_INNER_LABEL(hv_crash_asm_end, SYM_L_GLOBAL)
+SYM_CODE_END(hv_crash_asm64)
+30
arch/x86/hyperv/hv_vtl.c
···
 #include <asm/apic.h>
 #include <asm/boot.h>
 #include <asm/desc.h>
+#include <asm/fpu/api.h>
+#include <asm/fpu/types.h>
 #include <asm/i8259.h>
 #include <asm/mshyperv.h>
 #include <asm/msr.h>
 #include <asm/realmode.h>
 #include <asm/reboot.h>
+#include <asm/smap.h>
+#include <linux/export.h>
 #include <../kernel/smpboot.h>
+#include "../../kernel/fpu/legacy.h"
 
 extern struct boot_params boot_params;
 static struct real_mode_header hv_vtl_real_mode_header;
···
 
 	return 0;
 }
+
+DEFINE_STATIC_CALL_NULL(__mshv_vtl_return_hypercall, void (*)(void));
+
+void mshv_vtl_return_call_init(u64 vtl_return_offset)
+{
+	static_call_update(__mshv_vtl_return_hypercall,
+			   (void *)((u8 *)hv_hypercall_pg + vtl_return_offset));
+}
+EXPORT_SYMBOL(mshv_vtl_return_call_init);
+
+void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0)
+{
+	struct hv_vp_assist_page *hvp;
+
+	hvp = hv_vp_assist_page[smp_processor_id()];
+	hvp->vtl_ret_x64rax = vtl0->rax;
+	hvp->vtl_ret_x64rcx = vtl0->rcx;
+
+	kernel_fpu_begin_mask(0);
+	fxrstor(&vtl0->fx_state);
+	__mshv_vtl_return_call(vtl0);
+	fxsave(&vtl0->fx_state);
+	kernel_fpu_end();
+}
+EXPORT_SYMBOL(mshv_vtl_return_call);
+37
arch/x86/hyperv/mshv-asm-offsets.c
···
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Generate definitions needed by assembly language modules.
+ * This code generates raw asm output which is post-processed to extract
+ * and format the required data.
+ *
+ * Copyright (c) 2025, Microsoft Corporation.
+ *
+ * Author:
+ *	Naman Jain <namjain@microsoft.com>
+ */
+#define COMPILE_OFFSETS
+
+#include <linux/kbuild.h>
+#include <asm/mshyperv.h>
+
+static void __used common(void)
+{
+	if (IS_ENABLED(CONFIG_HYPERV_VTL_MODE)) {
+		OFFSET(MSHV_VTL_CPU_CONTEXT_rax, mshv_vtl_cpu_context, rax);
+		OFFSET(MSHV_VTL_CPU_CONTEXT_rcx, mshv_vtl_cpu_context, rcx);
+		OFFSET(MSHV_VTL_CPU_CONTEXT_rdx, mshv_vtl_cpu_context, rdx);
+		OFFSET(MSHV_VTL_CPU_CONTEXT_rbx, mshv_vtl_cpu_context, rbx);
+		OFFSET(MSHV_VTL_CPU_CONTEXT_rbp, mshv_vtl_cpu_context, rbp);
+		OFFSET(MSHV_VTL_CPU_CONTEXT_rsi, mshv_vtl_cpu_context, rsi);
+		OFFSET(MSHV_VTL_CPU_CONTEXT_rdi, mshv_vtl_cpu_context, rdi);
+		OFFSET(MSHV_VTL_CPU_CONTEXT_r8, mshv_vtl_cpu_context, r8);
+		OFFSET(MSHV_VTL_CPU_CONTEXT_r9, mshv_vtl_cpu_context, r9);
+		OFFSET(MSHV_VTL_CPU_CONTEXT_r10, mshv_vtl_cpu_context, r10);
+		OFFSET(MSHV_VTL_CPU_CONTEXT_r11, mshv_vtl_cpu_context, r11);
+		OFFSET(MSHV_VTL_CPU_CONTEXT_r12, mshv_vtl_cpu_context, r12);
+		OFFSET(MSHV_VTL_CPU_CONTEXT_r13, mshv_vtl_cpu_context, r13);
+		OFFSET(MSHV_VTL_CPU_CONTEXT_r14, mshv_vtl_cpu_context, r14);
+		OFFSET(MSHV_VTL_CPU_CONTEXT_r15, mshv_vtl_cpu_context, r15);
+		OFFSET(MSHV_VTL_CPU_CONTEXT_cr2, mshv_vtl_cpu_context, cr2);
+	}
+}
+116
arch/x86/hyperv/mshv_vtl_asm.S
···
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Assembly level code for mshv_vtl VTL transition
+ *
+ * Copyright (c) 2025, Microsoft Corporation.
+ *
+ * Author:
+ *	Naman Jain <namjain@microsoft.com>
+ */
+
+#include <linux/linkage.h>
+#include <linux/static_call_types.h>
+#include <asm/asm.h>
+#include <asm/asm-offsets.h>
+#include <asm/frame.h>
+#include "mshv-asm-offsets.h"
+
+	.text
+	.section .noinstr.text, "ax"
+/*
+ * void __mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0)
+ *
+ * This function is used to context switch between different Virtual Trust Levels.
+ * It is marked as 'noinstr' to prevent against instrumentation and debugging facilities.
+ * NMIs aren't a problem because the NMI handler saves/restores CR2 specifically to guard
+ * against #PFs in NMI context clobbering the guest state.
+ */
+SYM_FUNC_START(__mshv_vtl_return_call)
+	/* Push callee save registers */
+	pushq	%rbp
+	mov	%rsp, %rbp
+	pushq	%r12
+	pushq	%r13
+	pushq	%r14
+	pushq	%r15
+	pushq	%rbx
+
+	/* register switch to VTL0 clobbers all registers except rax/rcx */
+	mov	%_ASM_ARG1, %rax
+
+	/* grab rbx/rbp/rsi/rdi/r8-r15 */
+	mov	MSHV_VTL_CPU_CONTEXT_rbx(%rax), %rbx
+	mov	MSHV_VTL_CPU_CONTEXT_rbp(%rax), %rbp
+	mov	MSHV_VTL_CPU_CONTEXT_rsi(%rax), %rsi
+	mov	MSHV_VTL_CPU_CONTEXT_rdi(%rax), %rdi
+	mov	MSHV_VTL_CPU_CONTEXT_r8(%rax), %r8
+	mov	MSHV_VTL_CPU_CONTEXT_r9(%rax), %r9
+	mov	MSHV_VTL_CPU_CONTEXT_r10(%rax), %r10
+	mov	MSHV_VTL_CPU_CONTEXT_r11(%rax), %r11
+	mov	MSHV_VTL_CPU_CONTEXT_r12(%rax), %r12
+	mov	MSHV_VTL_CPU_CONTEXT_r13(%rax), %r13
+	mov	MSHV_VTL_CPU_CONTEXT_r14(%rax), %r14
+	mov	MSHV_VTL_CPU_CONTEXT_r15(%rax), %r15
+
+	mov	MSHV_VTL_CPU_CONTEXT_cr2(%rax), %rdx
+	mov	%rdx, %cr2
+	mov	MSHV_VTL_CPU_CONTEXT_rdx(%rax), %rdx
+
+	/* stash host registers on stack */
+	pushq	%rax
+	pushq	%rcx
+
+	xor	%ecx, %ecx
+
+	/* make a hypercall to switch VTL */
+	call	STATIC_CALL_TRAMP_STR(__mshv_vtl_return_hypercall)
+
+	/* stash guest registers on stack, restore saved host copies */
+	pushq	%rax
+	pushq	%rcx
+	mov	16(%rsp), %rcx
+	mov	24(%rsp), %rax
+
+	mov	%rdx, MSHV_VTL_CPU_CONTEXT_rdx(%rax)
+	mov	%cr2, %rdx
+	mov	%rdx, MSHV_VTL_CPU_CONTEXT_cr2(%rax)
+	pop	MSHV_VTL_CPU_CONTEXT_rcx(%rax)
+	pop	MSHV_VTL_CPU_CONTEXT_rax(%rax)
+	add	$16, %rsp
+
+	/* save rbx/rbp/rsi/rdi/r8-r15 */
+	mov	%rbx, MSHV_VTL_CPU_CONTEXT_rbx(%rax)
+	mov	%rbp, MSHV_VTL_CPU_CONTEXT_rbp(%rax)
+	mov	%rsi, MSHV_VTL_CPU_CONTEXT_rsi(%rax)
+	mov	%rdi, MSHV_VTL_CPU_CONTEXT_rdi(%rax)
+	mov	%r8, MSHV_VTL_CPU_CONTEXT_r8(%rax)
+	mov	%r9, MSHV_VTL_CPU_CONTEXT_r9(%rax)
+	mov	%r10, MSHV_VTL_CPU_CONTEXT_r10(%rax)
+	mov	%r11, MSHV_VTL_CPU_CONTEXT_r11(%rax)
+	mov	%r12, MSHV_VTL_CPU_CONTEXT_r12(%rax)
+	mov	%r13, MSHV_VTL_CPU_CONTEXT_r13(%rax)
+	mov	%r14, MSHV_VTL_CPU_CONTEXT_r14(%rax)
+	mov	%r15, MSHV_VTL_CPU_CONTEXT_r15(%rax)
+
+	/* pop callee-save registers r12-r15, rbx */
+	pop	%rbx
+	pop	%r15
+	pop	%r14
+	pop	%r13
+	pop	%r12
+
+	pop	%rbp
+	RET
+SYM_FUNC_END(__mshv_vtl_return_call)
+/*
+ * Make sure that static_call_key symbol: __SCK____mshv_vtl_return_hypercall is accessible here.
+ * Below code is inspired from __ADDRESSABLE(sym) macro. Symbol name is kept simple, to avoid
+ * naming it something like "__UNIQUE_ID_addressable___SCK____mshv_vtl_return_hypercall_662.0"
+ * which would otherwise have been generated by the macro.
+ */
+	.section .discard.addressable, "aw"
+	.align 8
+	.type mshv_vtl_return_sym, @object
+	.size mshv_vtl_return_sym, 8
+mshv_vtl_return_sym:
+	.quad	__SCK____mshv_vtl_return_hypercall
+45
arch/x86/include/asm/mshyperv.h
···
 #include <asm/paravirt.h>
 #include <asm/msr.h>
 #include <hyperv/hvhdk.h>
+#include <asm/fpu/types.h>
 
 /*
  * Hyper-V always provides a single IO-APIC at this MMIO address.
···
 int hyperv_fill_flush_guest_mapping_list(
 		struct hv_guest_mapping_flush_list *flush,
 		u64 start_gfn, u64 end_gfn);
+void hv_sleep_notifiers_register(void);
+void hv_machine_power_off(void);
 
 #ifdef CONFIG_X86_64
 void hv_apic_init(void);
···
 }
 int hv_apicid_to_vp_index(u32 apic_id);
 
+#if IS_ENABLED(CONFIG_MSHV_ROOT) && IS_ENABLED(CONFIG_CRASH_DUMP)
+void hv_root_crash_init(void);
+void hv_crash_asm32(void);
+void hv_crash_asm64(void);
+void hv_crash_asm_end(void);
+#else /* CONFIG_MSHV_ROOT && CONFIG_CRASH_DUMP */
+static inline void hv_root_crash_init(void) {}
+#endif /* CONFIG_MSHV_ROOT && CONFIG_CRASH_DUMP */
+
 #else /* CONFIG_HYPERV */
 static inline void hyperv_init(void) {}
 static inline void hyperv_setup_mmu_ops(void) {}
···
 static inline int hv_apicid_to_vp_index(u32 apic_id) { return -EINVAL; }
 #endif /* CONFIG_HYPERV */
 
+struct mshv_vtl_cpu_context {
+	union {
+		struct {
+			u64 rax;
+			u64 rcx;
+			u64 rdx;
+			u64 rbx;
+			u64 cr2;
+			u64 rbp;
+			u64 rsi;
+			u64 rdi;
+			u64 r8;
+			u64 r9;
+			u64 r10;
+			u64 r11;
+			u64 r12;
+			u64 r13;
+			u64 r14;
+			u64 r15;
+		};
+		u64 gp_regs[16];
+	};
+
+	struct fxregs_state fx_state;
+};
 
 #ifdef CONFIG_HYPERV_VTL_MODE
 void __init hv_vtl_init_platform(void);
 int __init hv_vtl_early_init(void);
+void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0);
+void mshv_vtl_return_call_init(u64 vtl_return_offset);
+void mshv_vtl_return_hypercall(void);
+void __mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0);
 #else
 static inline void __init hv_vtl_init_platform(void) {}
 static inline int __init hv_vtl_early_init(void) { return 0; }
+static inline void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0) {}
+static inline void mshv_vtl_return_call_init(u64 vtl_return_offset) {}
+static inline void mshv_vtl_return_hypercall(void) {}
+static inline void __mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0) {}
 #endif
 
 #include <asm-generic/mshyperv.h>
+68 -20
arch/x86/kernel/cpu/mshyperv.c
···
 #include <asm/apic.h>
 #include <asm/timer.h>
 #include <asm/reboot.h>
+#include <asm/msr.h>
 #include <asm/nmi.h>
 #include <clocksource/hyperv_timer.h>
-#include <asm/msr.h>
 #include <asm/numa.h>
 #include <asm/svm.h>
···
 struct ms_hyperv_info ms_hyperv;
 
 #if IS_ENABLED(CONFIG_HYPERV)
+/*
+ * When running with the paravisor, controls proxying the synthetic interrupts
+ * from the host
+ */
+static bool hv_para_sint_proxy;
+
 static inline unsigned int hv_get_nested_msr(unsigned int reg)
 {
 	if (hv_is_sint_msr(reg))
···
 void hv_set_non_nested_msr(unsigned int reg, u64 value)
 {
 	if (hv_is_synic_msr(reg) && ms_hyperv.paravisor_present) {
+		/* The hypervisor will get the intercept. */
 		hv_ivm_msr_write(reg, value);
 
-		/* Write proxy bit via wrmsl instruction */
-		if (hv_is_sint_msr(reg))
-			wrmsrq(reg, value | 1 << 20);
+		/* Using wrmsrq so the following goes to the paravisor. */
+		if (hv_is_sint_msr(reg)) {
+			union hv_synic_sint sint = { .as_uint64 = value };
+
+			sint.proxy = hv_para_sint_proxy;
+			native_wrmsrq(reg, sint.as_uint64);
+		}
 	} else {
-		wrmsrq(reg, value);
+		native_wrmsrq(reg, value);
 	}
 }
 EXPORT_SYMBOL_GPL(hv_set_non_nested_msr);
+
+/*
+ * Enable or disable proxying synthetic interrupts
+ * to the paravisor.
+ */
+void hv_para_set_sint_proxy(bool enable)
+{
+	hv_para_sint_proxy = enable;
+}
+
+/*
+ * Get the SynIC register value from the paravisor.
+ */
+u64 hv_para_get_synic_register(unsigned int reg)
+{
+	if (WARN_ON(!ms_hyperv.paravisor_present || !hv_is_synic_msr(reg)))
+		return ~0ULL;
+	return native_read_msr(reg);
+}
+
+/*
+ * Set the SynIC register value with the paravisor.
+ */
+void hv_para_set_synic_register(unsigned int reg, u64 val)
+{
+	if (WARN_ON(!ms_hyperv.paravisor_present || !hv_is_synic_msr(reg)))
+		return;
+	native_write_msr(reg, val);
+}
 
 u64 hv_get_msr(unsigned int reg)
 {
···
 #endif /* CONFIG_KEXEC_CORE */
 
 #ifdef CONFIG_CRASH_DUMP
-static void hv_machine_crash_shutdown(struct pt_regs *regs)
+static void hv_guest_crash_shutdown(struct pt_regs *regs)
 {
 	if (hv_crash_handler)
 		hv_crash_handler(regs);
···
 
 static void __init ms_hyperv_init_platform(void)
 {
-	int hv_max_functions_eax;
+	int hv_max_functions_eax, eax;
 
 #ifdef CONFIG_PARAVIRT
 	pv_info.name = "Hyper-V";
···
 
 	hv_identify_partition_type();
 
+	if (cc_platform_has(CC_ATTR_SNP_SECURE_AVIC))
+		ms_hyperv.hints |= HV_DEPRECATING_AEOI_RECOMMENDED;
+
 	if (ms_hyperv.hints & HV_X64_HYPERV_NESTED) {
 		hv_nested = true;
 		pr_info("Hyper-V: running on a nested hypervisor\n");
 	}
+
+	/*
+	 * There is no check against the max function for HYPERV_CPUID_VIRT_STACK_* CPUID
+	 * leaves as the hypervisor doesn't handle them. Even a nested root partition (L2
+	 * root) will not get them because the nested (L1) hypervisor filters them out.
+	 * These are handled through intercept processing by the Windows Hyper-V stack
+	 * or the paravisor.
+	 */
+	eax = cpuid_eax(HYPERV_CPUID_VIRT_STACK_PROPERTIES);
+	ms_hyperv.confidential_vmbus_available =
+		eax & HYPERV_VS_PROPERTIES_EAX_CONFIDENTIAL_VMBUS_AVAILABLE;
+	ms_hyperv.msi_ext_dest_id =
+		eax & HYPERV_VS_PROPERTIES_EAX_EXTENDED_IOAPIC_RTE;
 
 	if (ms_hyperv.features & HV_ACCESS_FREQUENCY_MSRS &&
 	    ms_hyperv.misc_features & HV_FEATURE_FREQUENCY_MSRS_AVAILABLE) {
···
 #endif
 
 #if IS_ENABLED(CONFIG_HYPERV)
+	if (hv_root_partition())
+		machine_ops.power_off = hv_machine_power_off;
 #if defined(CONFIG_KEXEC_CORE)
 	machine_ops.shutdown = hv_machine_shutdown;
 #endif
 #if defined(CONFIG_CRASH_DUMP)
-	machine_ops.crash_shutdown = hv_machine_crash_shutdown;
+	if (!hv_root_partition())
+		machine_ops.crash_shutdown = hv_guest_crash_shutdown;
 #endif
 #endif
 	/*
···
  * pci-hyperv host bridge.
  *
  * Note: for a Hyper-V root partition, this will always return false.
- * The hypervisor doesn't expose these HYPERV_CPUID_VIRT_STACK_* cpuids by
- * default, they are implemented as intercepts by the Windows Hyper-V stack.
- * Even a nested root partition (L2 root) will not get them because the
- * nested (L1) hypervisor filters them out.
  */
 static bool __init ms_hyperv_msi_ext_dest_id(void)
 {
-	u32 eax;
-
-	eax = cpuid_eax(HYPERV_CPUID_VIRT_STACK_INTERFACE);
-	if (eax != HYPERV_VS_INTERFACE_EAX_SIGNATURE)
-		return false;
-
-	eax = cpuid_eax(HYPERV_CPUID_VIRT_STACK_PROPERTIES);
-	return eax & HYPERV_VS_PROPERTIES_EAX_EXTENDED_IOAPIC_RTE;
+	return ms_hyperv.msi_ext_dest_id;
 }
 
 #ifdef CONFIG_AMD_MEM_ENCRYPT
+28 -1
drivers/hv/Kconfig
···
 
 config HYPERV_VTL_MODE
 	bool "Enable Linux to boot in VTL context"
-	depends on (X86_64 || ARM64) && HYPERV
+	depends on (X86_64 && HAVE_STATIC_CALL) || ARM64
+	depends on HYPERV
 	depends on SMP
 	default n
 	help
···
 	depends on PAGE_SIZE_4KB
 	select EVENTFD
 	select VIRT_XFER_TO_GUEST_WORK
+	select HMM_MIRROR
+	select MMU_NOTIFIER
 	default n
 	help
 	  Select this option to enable support for booting and running as root
 	  partition on Microsoft Hyper-V.
+
+	  If unsure, say N.
+
+config MSHV_VTL
+	tristate "Microsoft Hyper-V VTL driver"
+	depends on X86_64 && HYPERV_VTL_MODE
+	depends on HYPERV_VMBUS
+	# Mapping VTL0 memory to a userspace process in VTL2 is supported in OpenHCL.
+	# VTL2 for OpenHCL makes use of Huge Pages to improve performance on VMs,
+	# specially with large memory requirements.
+	depends on TRANSPARENT_HUGEPAGE
+	# MTRRs are controlled by VTL0, and are not specific to individual VTLs.
+	# Therefore, do not attempt to access or modify MTRRs here.
+	depends on !MTRR
+	select CPUMASK_OFFSTACK
+	select VIRT_XFER_TO_GUEST_WORK
+	default n
+	help
+	  Select this option to enable Hyper-V VTL driver support.
+	  This driver provides interfaces for Virtual Machine Manager (VMM) running in VTL2
+	  userspace to create VTLs and partitions, setup and manage VTL0 memory and
+	  allow userspace to make direct hypercalls. This also allows to map VTL0's address
+	  space to a usermode process in VTL2 and supports getting new VMBus messages and channel
+	  events in VTL2.
 
 	  If unsure, say N.
 
+7 -2
drivers/hv/Makefile
···
 obj-$(CONFIG_HYPERV_UTILS)	+= hv_utils.o
 obj-$(CONFIG_HYPERV_BALLOON)	+= hv_balloon.o
 obj-$(CONFIG_MSHV_ROOT)		+= mshv_root.o
+obj-$(CONFIG_MSHV_VTL)		+= mshv_vtl.o
 
 CFLAGS_hv_trace.o = -I$(src)
 CFLAGS_hv_balloon.o = -I$(src)
···
 hv_vmbus-$(CONFIG_HYPERV_TESTING)	+= hv_debugfs.o
 hv_utils-y := hv_util.o hv_kvp.o hv_snapshot.o hv_utils_transport.o
 mshv_root-y := mshv_root_main.o mshv_synic.o mshv_eventfd.o mshv_irq.o \
-	       mshv_root_hv_call.o mshv_portid_table.o
+	       mshv_root_hv_call.o mshv_portid_table.o mshv_regions.o
+mshv_vtl-y := mshv_vtl_main.o
 
 # Code that must be built-in
 obj-$(CONFIG_HYPERV)		+= hv_common.o
-obj-$(subst m,y,$(CONFIG_MSHV_ROOT))	+= hv_proc.o mshv_common.o
+obj-$(subst m,y,$(CONFIG_MSHV_ROOT))	+= hv_proc.o
+ifneq ($(CONFIG_MSHV_ROOT)$(CONFIG_MSHV_VTL),)
+obj-y += mshv_common.o
+endif
+48 -27
drivers/hv/channel.c
···
 	return 0;
 }
 
+static void vmbus_free_channel_msginfo(struct vmbus_channel_msginfo *msginfo)
+{
+	struct vmbus_channel_msginfo *submsginfo, *tmp;
+
+	if (!msginfo)
+		return;
+
+	list_for_each_entry_safe(submsginfo, tmp, &msginfo->submsglist,
+				 msglistentry) {
+		kfree(submsginfo);
+	}
+
+	kfree(msginfo);
+}
+
 /*
  * __vmbus_establish_gpadl - Establish a GPADL for a buffer or ringbuffer
  *
···
 	struct vmbus_channel_gpadl_header *gpadlmsg;
 	struct vmbus_channel_gpadl_body *gpadl_body;
 	struct vmbus_channel_msginfo *msginfo = NULL;
-	struct vmbus_channel_msginfo *submsginfo, *tmp;
+	struct vmbus_channel_msginfo *submsginfo;
 	struct list_head *curr;
 	u32 next_gpadl_handle;
 	unsigned long flags;
···
 		return ret;
 	}
 
-	/*
-	 * Set the "decrypted" flag to true for the set_memory_decrypted()
-	 * success case. In the failure case, the encryption state of the
-	 * memory is unknown. Leave "decrypted" as true to ensure the
-	 * memory will be leaked instead of going back on the free list.
-	 */
-	gpadl->decrypted = true;
-	ret = set_memory_decrypted((unsigned long)kbuffer,
-				   PFN_UP(size));
-	if (ret) {
-		dev_warn(&channel->device_obj->device,
-			 "Failed to set host visibility for new GPADL %d.\n",
-			 ret);
-		return ret;
+	gpadl->decrypted = !((channel->co_external_memory && type == HV_GPADL_BUFFER) ||
+			     (channel->co_ring_buffer && type == HV_GPADL_RING));
+	if (gpadl->decrypted) {
+		/*
+		 * The "decrypted" flag being true assumes that set_memory_decrypted() succeeds.
+		 * But if it fails, the encryption state of the memory is unknown. In that case,
+		 * leave "decrypted" as true to ensure the memory is leaked instead of going back
+		 * on the free list.
+		 */
+		ret = set_memory_decrypted((unsigned long)kbuffer,
+					   PFN_UP(size));
+		if (ret) {
+			dev_warn(&channel->device_obj->device,
+				 "Failed to set host visibility for new GPADL %d.\n",
+				 ret);
+			vmbus_free_channel_msginfo(msginfo);
+			return ret;
+		}
 	}
 
 	init_completion(&msginfo->waitevent);
···
 	spin_lock_irqsave(&vmbus_connection.channelmsg_lock, flags);
 	list_del(&msginfo->msglistentry);
 	spin_unlock_irqrestore(&vmbus_connection.channelmsg_lock, flags);
-	list_for_each_entry_safe(submsginfo, tmp, &msginfo->submsglist,
-				 msglistentry) {
-		kfree(submsginfo);
-	}
 
-	kfree(msginfo);
+	vmbus_free_channel_msginfo(msginfo);
 
 	if (ret) {
 		/*
···
 		 * left as true so the memory is leaked instead of being
 		 * put back on the free list.
 		 */
-		if (!set_memory_encrypted((unsigned long)kbuffer, PFN_UP(size)))
-			gpadl->decrypted = false;
+		if (gpadl->decrypted) {
+			if (!set_memory_encrypted((unsigned long)kbuffer, PFN_UP(size)))
+				gpadl->decrypted = false;
+		}
 	}
 
 	return ret;
···
  * keeps track of the next available slot in the array. Initially, each
  * slot points to the next one (as in a Linked List). The last slot
  * does not point to anything, so its value is U64_MAX by default.
- * @size The size of the array
+ * @size: The size of the array
  */
 static u64 *request_arr_init(u32 size)
 {
···
 		goto error_clean_ring;
 
 	err = hv_ringbuffer_init(&newchannel->outbound,
-				 page, send_pages, 0);
+				 page, send_pages, 0, newchannel->co_ring_buffer);
 	if (err)
 		goto error_free_gpadl;
 
 	err = hv_ringbuffer_init(&newchannel->inbound, &page[send_pages],
-				 recv_pages, newchannel->max_pkt_size);
+				 recv_pages, newchannel->max_pkt_size,
+				 newchannel->co_ring_buffer);
 	if (err)
 		goto error_free_gpadl;
 
···
 
 	kfree(info);
 
-	ret = set_memory_encrypted((unsigned long)gpadl->buffer,
-				   PFN_UP(gpadl->size));
+	if (gpadl->decrypted)
+		ret = set_memory_encrypted((unsigned long)gpadl->buffer,
+					   PFN_UP(gpadl->size));
+	else
+		ret = 0;
 	if (ret)
 		pr_warn("Fail to set mem host visibility in GPADL teardown %d.\n", ret);
 
+23 -4
drivers/hv/channel_mgmt.c
···
 			= per_cpu_ptr(hv_context.cpu_context, cpu);
 
 		/*
-		 * In a CoCo VM the synic_message_page is not allocated
+		 * In a CoCo VM the hyp_synic_message_page is not allocated
 		 * in hv_synic_alloc(). Instead it is set/cleared in
-		 * hv_synic_enable_regs() and hv_synic_disable_regs()
+		 * hv_hyp_synic_enable_regs() and hv_hyp_synic_disable_regs()
 		 * such that it is set only when the CPU is online. If
 		 * not all present CPUs are online, the message page
 		 * might be NULL, so skip such CPUs.
 		 */
-		page_addr = hv_cpu->synic_message_page;
+		page_addr = hv_cpu->hyp_synic_message_page;
 		if (!page_addr)
 			continue;
···
 		struct hv_per_cpu_context *hv_cpu
 			= per_cpu_ptr(hv_context.cpu_context, cpu);
 
-		page_addr = hv_cpu->synic_message_page;
+		page_addr = hv_cpu->hyp_synic_message_page;
 		if (!page_addr)
 			continue;
···
 	struct vmbus_channel_offer_channel *offer;
 	struct vmbus_channel *oldchannel, *newchannel;
 	size_t offer_sz;
+	bool co_ring_buffer, co_external_memory;
 
 	offer = (struct vmbus_channel_offer_channel *)hdr;
···
 			offer->child_relid);
 		atomic_dec(&vmbus_connection.offer_in_progress);
 		return;
 	}
+
+	co_ring_buffer = is_co_ring_buffer(offer);
+	co_external_memory = is_co_external_memory(offer);
+	if (!co_ring_buffer && co_external_memory) {
+		pr_err("Invalid offer relid=%d: the ring buffer isn't encrypted\n",
+		       offer->child_relid);
+		return;
+	}
+	if (co_ring_buffer || co_external_memory) {
+		if (vmbus_proto_version < VERSION_WIN10_V6_0 || !vmbus_is_confidential()) {
+			pr_err("Invalid offer relid=%d: no support for confidential VMBus\n",
+			       offer->child_relid);
+			atomic_dec(&vmbus_connection.offer_in_progress);
+			return;
+		}
+	}
 
 	oldchannel = find_primary_channel_by_offer(offer);
···
 		pr_err("Unable to allocate channel object\n");
 		return;
 	}
+	newchannel->co_ring_buffer = co_ring_buffer;
+	newchannel->co_external_memory = co_external_memory;
 
 	vmbus_setup_channel_state(newchannel, offer);
+5 -1
drivers/hv/connection.c
···
  * Linux guests and are not listed.
  */
 static __u32 vmbus_versions[] = {
+	VERSION_WIN10_V6_0,
 	VERSION_WIN10_V5_3,
 	VERSION_WIN10_V5_2,
 	VERSION_WIN10_V5_1,
···
  * Maximal VMBus protocol version guests can negotiate. Useful to cap the
  * VMBus version for testing and debugging purpose.
  */
-static uint max_version = VERSION_WIN10_V5_3;
+static uint max_version = VERSION_WIN10_V6_0;
 
 module_param(max_version, uint, S_IRUGO);
 MODULE_PARM_DESC(max_version,
···
 		msg->interrupt_page = virt_to_phys(vmbus_connection.int_page);
 		vmbus_connection.msg_conn_id = VMBUS_MESSAGE_CONNECTION_ID;
 	}
+
+	if (vmbus_is_confidential() && version >= VERSION_WIN10_V6_0)
+		msg->feature_flags = VMBUS_FEATURE_FLAG_CONFIDENTIAL_CHANNELS;
 
 	/*
 	 * shared_gpa_boundary is zero in non-SNP VMs, so it's safe to always
+255 -122
drivers/hv/hv.c
··· 18 18 #include <linux/clockchips.h> 19 19 #include <linux/delay.h> 20 20 #include <linux/interrupt.h> 21 + #include <linux/export.h> 21 22 #include <clocksource/hyperv_timer.h> 22 23 #include <asm/mshyperv.h> 23 24 #include <linux/set_memory.h> ··· 26 25 27 26 /* The one and only */ 28 27 struct hv_context hv_context; 28 + EXPORT_SYMBOL_FOR_MODULES(hv_context, "mshv_vtl"); 29 29 30 30 /* 31 31 * hv_init - Main initialization routine. ··· 76 74 aligned_msg->payload_size = payload_size; 77 75 memcpy((void *)aligned_msg->payload, payload, payload_size); 78 76 79 - if (ms_hyperv.paravisor_present) { 77 + if (ms_hyperv.paravisor_present && !vmbus_is_confidential()) { 78 + /* 79 + * If the VMBus isn't confidential, use the CoCo-specific 80 + * mechanism to communicate with the hypervisor. 81 + */ 80 82 if (hv_isolation_type_tdx()) 81 83 status = hv_tdx_hypercall(HVCALL_POST_MESSAGE, 82 84 virt_to_phys(aligned_msg), 0); ··· 94 88 u64 control = HVCALL_POST_MESSAGE; 95 89 96 90 control |= hv_nested ? HV_HYPERCALL_NESTED : 0; 91 + /* 92 + * If there is no paravisor, this will go to the hypervisor. 93 + * In the Confidential VMBus case, there is the paravisor 94 + * to which this will trap. 95 + */ 97 96 status = hv_do_hypercall(control, aligned_msg, NULL); 98 97 } 99 98 ··· 106 95 107 96 return hv_result(status); 108 97 } 98 + EXPORT_SYMBOL_FOR_MODULES(hv_post_message, "mshv_vtl"); 99 + 100 + static int hv_alloc_page(void **page, bool decrypt, const char *note) 101 + { 102 + int ret = 0; 103 + 104 + /* 105 + * After the page changes its encryption status, its contents might 106 + * appear scrambled on some hardware. Thus `get_zeroed_page` would 107 + * zero the page out in vain, so do that explicitly exactly once. 108 + * 109 + * By default, the page is allocated encrypted in a CoCo VM. 
110 + */ 111 + *page = (void *)__get_free_page(GFP_KERNEL); 112 + if (!*page) 113 + return -ENOMEM; 114 + 115 + if (decrypt) 116 + ret = set_memory_decrypted((unsigned long)*page, 1); 117 + if (ret) 118 + goto failed; 119 + 120 + memset(*page, 0, PAGE_SIZE); 121 + return 0; 122 + 123 + failed: 124 + /* 125 + * Report the failure but don't put the page back on the free list as 126 + * its encryption status is unknown. 127 + */ 128 + pr_err("allocation failed for %s page, error %d, decrypted %d\n", 129 + note, ret, decrypt); 130 + *page = NULL; 131 + return ret; 132 + } 133 + 134 + static int hv_free_page(void **page, bool encrypt, const char *note) 135 + { 136 + int ret = 0; 137 + 138 + if (!*page) 139 + return 0; 140 + 141 + if (encrypt) 142 + ret = set_memory_encrypted((unsigned long)*page, 1); 143 + 144 + /* 145 + * If re-encryption fails, the page is leaked: something is wrong, and 146 + * it's better to lose a page with unknown encryption status and stay afloat. 147 + */ 148 + if (ret) 149 + pr_err("deallocation failed for %s page, error %d, encrypt %d\n", 150 + note, ret, encrypt); 151 + else 152 + free_page((unsigned long)*page); 153 + 154 + *page = NULL; 155 + 156 + return ret; 157 + } 109 158 110 159 int hv_synic_alloc(void) 111 160 { 112 161 int cpu, ret = -ENOMEM; 113 162 struct hv_per_cpu_context *hv_cpu; 163 + const bool decrypt = !vmbus_is_confidential(); 114 164 115 165 /* 116 166 * First, zero all per-cpu memory areas so hv_synic_free() can ··· 197 125 vmbus_on_msg_dpc, (unsigned long)hv_cpu); 198 126 199 127 if (ms_hyperv.paravisor_present && hv_isolation_type_tdx()) { 200 - hv_cpu->post_msg_page = (void *)get_zeroed_page(GFP_ATOMIC); 201 - if (!hv_cpu->post_msg_page) { 202 - pr_err("Unable to allocate post msg page\n"); 128 + ret = hv_alloc_page(&hv_cpu->post_msg_page, 129 + decrypt, "post msg"); 130 + if (ret) 203 131 goto err; 204 - } 205 - 206 - ret = set_memory_decrypted((unsigned long)hv_cpu->post_msg_page, 1); 207 - if (ret) { 208 - 
pr_err("Failed to decrypt post msg page: %d\n", ret); 209 - /* Just leak the page, as it's unsafe to free the page. */ 210 - hv_cpu->post_msg_page = NULL; 211 - goto err; 212 - } 213 - 214 - memset(hv_cpu->post_msg_page, 0, PAGE_SIZE); 215 132 } 216 133 217 134 /* 218 - * Synic message and event pages are allocated by paravisor. 219 - * Skip these pages allocation here. 135 + * If these SynIC pages are not allocated, SIEF and SIM pages 136 + * are configured using what the root partition or the paravisor 137 + * provides upon reading the SIEFP and SIMP registers. 220 138 */ 221 139 if (!ms_hyperv.paravisor_present && !hv_root_partition()) { 222 - hv_cpu->synic_message_page = 223 - (void *)get_zeroed_page(GFP_ATOMIC); 224 - if (!hv_cpu->synic_message_page) { 225 - pr_err("Unable to allocate SYNIC message page\n"); 140 + ret = hv_alloc_page(&hv_cpu->hyp_synic_message_page, 141 + decrypt, "hypervisor SynIC msg"); 142 + if (ret) 226 143 goto err; 227 - } 228 - 229 - hv_cpu->synic_event_page = 230 - (void *)get_zeroed_page(GFP_ATOMIC); 231 - if (!hv_cpu->synic_event_page) { 232 - pr_err("Unable to allocate SYNIC event page\n"); 233 - 234 - free_page((unsigned long)hv_cpu->synic_message_page); 235 - hv_cpu->synic_message_page = NULL; 144 + ret = hv_alloc_page(&hv_cpu->hyp_synic_event_page, 145 + decrypt, "hypervisor SynIC event"); 146 + if (ret) 236 147 goto err; 237 - } 238 148 } 239 149 240 - if (!ms_hyperv.paravisor_present && 241 - (hv_isolation_type_snp() || hv_isolation_type_tdx())) { 242 - ret = set_memory_decrypted((unsigned long) 243 - hv_cpu->synic_message_page, 1); 244 - if (ret) { 245 - pr_err("Failed to decrypt SYNIC msg page: %d\n", ret); 246 - hv_cpu->synic_message_page = NULL; 247 - 248 - /* 249 - * Free the event page here so that hv_synic_free() 250 - * won't later try to re-encrypt it. 
251 - */ 252 - free_page((unsigned long)hv_cpu->synic_event_page); 253 - hv_cpu->synic_event_page = NULL; 150 + if (vmbus_is_confidential()) { 151 + ret = hv_alloc_page(&hv_cpu->para_synic_message_page, 152 + false, "paravisor SynIC msg"); 153 + if (ret) 254 154 goto err; 255 - } 256 - 257 - ret = set_memory_decrypted((unsigned long) 258 - hv_cpu->synic_event_page, 1); 259 - if (ret) { 260 - pr_err("Failed to decrypt SYNIC event page: %d\n", ret); 261 - hv_cpu->synic_event_page = NULL; 155 + ret = hv_alloc_page(&hv_cpu->para_synic_event_page, 156 + false, "paravisor SynIC event"); 157 + if (ret) 262 158 goto err; 263 - } 264 - 265 - memset(hv_cpu->synic_message_page, 0, PAGE_SIZE); 266 - memset(hv_cpu->synic_event_page, 0, PAGE_SIZE); 267 159 } 268 160 } 269 161 ··· 243 207 244 208 void hv_synic_free(void) 245 209 { 246 - int cpu, ret; 210 + int cpu; 211 + const bool encrypt = !vmbus_is_confidential(); 247 212 248 213 for_each_present_cpu(cpu) { 249 214 struct hv_per_cpu_context *hv_cpu = 250 215 per_cpu_ptr(hv_context.cpu_context, cpu); 251 216 252 - /* It's better to leak the page if the encryption fails. 
*/ 253 - if (ms_hyperv.paravisor_present && hv_isolation_type_tdx()) { 254 - if (hv_cpu->post_msg_page) { 255 - ret = set_memory_encrypted((unsigned long) 256 - hv_cpu->post_msg_page, 1); 257 - if (ret) { 258 - pr_err("Failed to encrypt post msg page: %d\n", ret); 259 - hv_cpu->post_msg_page = NULL; 260 - } 261 - } 217 + if (ms_hyperv.paravisor_present && hv_isolation_type_tdx()) 218 + hv_free_page(&hv_cpu->post_msg_page, 219 + encrypt, "post msg"); 220 + if (!ms_hyperv.paravisor_present && !hv_root_partition()) { 221 + hv_free_page(&hv_cpu->hyp_synic_event_page, 222 + encrypt, "hypervisor SynIC event"); 223 + hv_free_page(&hv_cpu->hyp_synic_message_page, 224 + encrypt, "hypervisor SynIC msg"); 262 225 } 263 - 264 - if (!ms_hyperv.paravisor_present && 265 - (hv_isolation_type_snp() || hv_isolation_type_tdx())) { 266 - if (hv_cpu->synic_message_page) { 267 - ret = set_memory_encrypted((unsigned long) 268 - hv_cpu->synic_message_page, 1); 269 - if (ret) { 270 - pr_err("Failed to encrypt SYNIC msg page: %d\n", ret); 271 - hv_cpu->synic_message_page = NULL; 272 - } 273 - } 274 - 275 - if (hv_cpu->synic_event_page) { 276 - ret = set_memory_encrypted((unsigned long) 277 - hv_cpu->synic_event_page, 1); 278 - if (ret) { 279 - pr_err("Failed to encrypt SYNIC event page: %d\n", ret); 280 - hv_cpu->synic_event_page = NULL; 281 - } 282 - } 226 + if (vmbus_is_confidential()) { 227 + hv_free_page(&hv_cpu->para_synic_event_page, 228 + false, "paravisor SynIC event"); 229 + hv_free_page(&hv_cpu->para_synic_message_page, 230 + false, "paravisor SynIC msg"); 283 231 } 284 - 285 - free_page((unsigned long)hv_cpu->post_msg_page); 286 - free_page((unsigned long)hv_cpu->synic_event_page); 287 - free_page((unsigned long)hv_cpu->synic_message_page); 288 232 } 289 233 290 234 kfree(hv_context.hv_numa_map); 291 235 } 292 236 293 237 /* 294 - * hv_synic_init - Initialize the Synthetic Interrupt Controller. 
295 - * 296 - * If it is already initialized by another entity (ie x2v shim), we need to 297 - * retrieve the initialized message and event pages. Otherwise, we create and 298 - * initialize the message and event pages. 238 + * hv_hyp_synic_enable_regs - Initialize the Synthetic Interrupt Controller 239 + * with the hypervisor. 299 240 */ 300 - void hv_synic_enable_regs(unsigned int cpu) 241 + void hv_hyp_synic_enable_regs(unsigned int cpu) 301 242 { 302 243 struct hv_per_cpu_context *hv_cpu = 303 244 per_cpu_ptr(hv_context.cpu_context, cpu); 304 245 union hv_synic_simp simp; 305 246 union hv_synic_siefp siefp; 306 247 union hv_synic_sint shared_sint; 307 - union hv_synic_scontrol sctrl; 308 248 309 - /* Setup the Synic's message page */ 249 + /* Setup the Synic's message page with the hypervisor. */ 310 250 simp.as_uint64 = hv_get_msr(HV_MSR_SIMP); 311 251 simp.simp_enabled = 1; 312 252 ··· 290 278 /* Mask out vTOM bit. ioremap_cache() maps decrypted */ 291 279 u64 base = (simp.base_simp_gpa << HV_HYP_PAGE_SHIFT) & 292 280 ~ms_hyperv.shared_gpa_boundary; 293 - hv_cpu->synic_message_page = 281 + hv_cpu->hyp_synic_message_page = 294 282 (void *)ioremap_cache(base, HV_HYP_PAGE_SIZE); 295 - if (!hv_cpu->synic_message_page) 283 + if (!hv_cpu->hyp_synic_message_page) 296 284 pr_err("Fail to map synic message page.\n"); 297 285 } else { 298 - simp.base_simp_gpa = virt_to_phys(hv_cpu->synic_message_page) 286 + simp.base_simp_gpa = virt_to_phys(hv_cpu->hyp_synic_message_page) 299 287 >> HV_HYP_PAGE_SHIFT; 300 288 } 301 289 302 290 hv_set_msr(HV_MSR_SIMP, simp.as_uint64); 303 291 304 - /* Setup the Synic's event page */ 292 + /* Setup the Synic's event page with the hypervisor. */ 305 293 siefp.as_uint64 = hv_get_msr(HV_MSR_SIEFP); 306 294 siefp.siefp_enabled = 1; 307 295 ··· 309 297 /* Mask out vTOM bit. 
ioremap_cache() maps decrypted */ 310 298 u64 base = (siefp.base_siefp_gpa << HV_HYP_PAGE_SHIFT) & 311 299 ~ms_hyperv.shared_gpa_boundary; 312 - hv_cpu->synic_event_page = 300 + hv_cpu->hyp_synic_event_page = 313 301 (void *)ioremap_cache(base, HV_HYP_PAGE_SIZE); 314 - if (!hv_cpu->synic_event_page) 302 + if (!hv_cpu->hyp_synic_event_page) 315 303 pr_err("Fail to map synic event page.\n"); 316 304 } else { 317 - siefp.base_siefp_gpa = virt_to_phys(hv_cpu->synic_event_page) 305 + siefp.base_siefp_gpa = virt_to_phys(hv_cpu->hyp_synic_event_page) 318 306 >> HV_HYP_PAGE_SHIFT; 319 307 } 320 308 321 309 hv_set_msr(HV_MSR_SIEFP, siefp.as_uint64); 310 + hv_enable_coco_interrupt(cpu, vmbus_interrupt, true); 322 311 323 312 /* Setup the shared SINT. */ 324 313 if (vmbus_irq != -1) ··· 330 317 shared_sint.masked = false; 331 318 shared_sint.auto_eoi = hv_recommend_using_aeoi(); 332 319 hv_set_msr(HV_MSR_SINT0 + VMBUS_MESSAGE_SINT, shared_sint.as_uint64); 320 + } 321 + 322 + static void hv_hyp_synic_enable_interrupts(void) 323 + { 324 + union hv_synic_scontrol sctrl; 333 325 334 326 /* Enable the global synic bit */ 335 327 sctrl.as_uint64 = hv_get_msr(HV_MSR_SCONTROL); ··· 343 325 hv_set_msr(HV_MSR_SCONTROL, sctrl.as_uint64); 344 326 } 345 327 328 + static void hv_para_synic_enable_regs(unsigned int cpu) 329 + { 330 + union hv_synic_simp simp; 331 + union hv_synic_siefp siefp; 332 + struct hv_per_cpu_context *hv_cpu 333 + = per_cpu_ptr(hv_context.cpu_context, cpu); 334 + 335 + /* Setup the Synic's message page with the paravisor. */ 336 + simp.as_uint64 = hv_para_get_synic_register(HV_MSR_SIMP); 337 + simp.simp_enabled = 1; 338 + simp.base_simp_gpa = virt_to_phys(hv_cpu->para_synic_message_page) 339 + >> HV_HYP_PAGE_SHIFT; 340 + hv_para_set_synic_register(HV_MSR_SIMP, simp.as_uint64); 341 + 342 + /* Setup the Synic's event page with the paravisor. 
*/ 343 + siefp.as_uint64 = hv_para_get_synic_register(HV_MSR_SIEFP); 344 + siefp.siefp_enabled = 1; 345 + siefp.base_siefp_gpa = virt_to_phys(hv_cpu->para_synic_event_page) 346 + >> HV_HYP_PAGE_SHIFT; 347 + hv_para_set_synic_register(HV_MSR_SIEFP, siefp.as_uint64); 348 + } 349 + 350 + static void hv_para_synic_enable_interrupts(void) 351 + { 352 + union hv_synic_scontrol sctrl; 353 + 354 + /* Enable the global synic bit */ 355 + sctrl.as_uint64 = hv_para_get_synic_register(HV_MSR_SCONTROL); 356 + sctrl.enable = 1; 357 + hv_para_set_synic_register(HV_MSR_SCONTROL, sctrl.as_uint64); 358 + } 359 + 346 360 int hv_synic_init(unsigned int cpu) 347 361 { 348 - hv_synic_enable_regs(cpu); 362 + if (vmbus_is_confidential()) 363 + hv_para_synic_enable_regs(cpu); 364 + 365 + /* 366 + * The SINT is set in hv_hyp_synic_enable_regs() by calling 367 + * hv_set_msr(). hv_set_msr() in turn has special case code for the 368 + * SINT MSRs that write to the hypervisor version of the MSR *and* 369 + * the paravisor version of the MSR (but *without* the proxy bit when 370 + * VMBus is confidential). 371 + * 372 + * Then enable interrupts via the paravisor if VMBus is confidential, 373 + * and otherwise via the hypervisor. 
374 + */ 375 + 376 + hv_hyp_synic_enable_regs(cpu); 377 + if (vmbus_is_confidential()) 378 + hv_para_synic_enable_interrupts(); 379 + else 380 + hv_hyp_synic_enable_interrupts(); 349 381 350 382 hv_stimer_legacy_init(cpu, VMBUS_MESSAGE_SINT); 351 383 352 384 return 0; 353 385 } 354 386 355 - void hv_synic_disable_regs(unsigned int cpu) 387 + void hv_hyp_synic_disable_regs(unsigned int cpu) 356 388 { 357 389 struct hv_per_cpu_context *hv_cpu = 358 390 per_cpu_ptr(hv_context.cpu_context, cpu); 359 391 union hv_synic_sint shared_sint; 360 392 union hv_synic_simp simp; 361 393 union hv_synic_siefp siefp; 362 - union hv_synic_scontrol sctrl; 363 394 364 395 shared_sint.as_uint64 = hv_get_msr(HV_MSR_SINT0 + VMBUS_MESSAGE_SINT); 365 396 ··· 417 350 /* Need to correctly cleanup in the case of SMP!!! */ 418 351 /* Disable the interrupt */ 419 352 hv_set_msr(HV_MSR_SINT0 + VMBUS_MESSAGE_SINT, shared_sint.as_uint64); 353 + hv_enable_coco_interrupt(cpu, vmbus_interrupt, false); 420 354 421 355 simp.as_uint64 = hv_get_msr(HV_MSR_SIMP); 422 356 /* 423 - * In Isolation VM, sim and sief pages are allocated by 357 + * In Isolation VM, simp and sief pages are allocated by 424 358 * paravisor. These pages also will be used by kdump 425 359 * kernel. So just reset enable bit here and keep page 426 360 * addresses. 
427 361 */ 428 362 simp.simp_enabled = 0; 429 363 if (ms_hyperv.paravisor_present || hv_root_partition()) { 430 - iounmap(hv_cpu->synic_message_page); 431 - hv_cpu->synic_message_page = NULL; 364 + if (hv_cpu->hyp_synic_message_page) { 365 + iounmap(hv_cpu->hyp_synic_message_page); 366 + hv_cpu->hyp_synic_message_page = NULL; 367 + } 432 368 } else { 433 369 simp.base_simp_gpa = 0; 434 370 } ··· 442 372 siefp.siefp_enabled = 0; 443 373 444 374 if (ms_hyperv.paravisor_present || hv_root_partition()) { 445 - iounmap(hv_cpu->synic_event_page); 446 - hv_cpu->synic_event_page = NULL; 375 + if (hv_cpu->hyp_synic_event_page) { 376 + iounmap(hv_cpu->hyp_synic_event_page); 377 + hv_cpu->hyp_synic_event_page = NULL; 378 + } 447 379 } else { 448 380 siefp.base_siefp_gpa = 0; 449 381 } 450 382 451 383 hv_set_msr(HV_MSR_SIEFP, siefp.as_uint64); 384 + } 385 + 386 + static void hv_hyp_synic_disable_interrupts(void) 387 + { 388 + union hv_synic_scontrol sctrl; 452 389 453 390 /* Disable the global synic bit */ 454 391 sctrl.as_uint64 = hv_get_msr(HV_MSR_SCONTROL); 455 392 sctrl.enable = 0; 456 393 hv_set_msr(HV_MSR_SCONTROL, sctrl.as_uint64); 394 + } 457 395 458 - if (vmbus_irq != -1) 459 - disable_percpu_irq(vmbus_irq); 396 + static void hv_para_synic_disable_regs(unsigned int cpu) 397 + { 398 + union hv_synic_simp simp; 399 + union hv_synic_siefp siefp; 400 + 401 + /* Disable SynIC's message page in the paravisor. */ 402 + simp.as_uint64 = hv_para_get_synic_register(HV_MSR_SIMP); 403 + simp.simp_enabled = 0; 404 + hv_para_set_synic_register(HV_MSR_SIMP, simp.as_uint64); 405 + 406 + /* Disable SynIC's event page in the paravisor. 
*/ 407 + siefp.as_uint64 = hv_para_get_synic_register(HV_MSR_SIEFP); 408 + siefp.siefp_enabled = 0; 409 + hv_para_set_synic_register(HV_MSR_SIEFP, siefp.as_uint64); 410 + } 411 + 412 + static void hv_para_synic_disable_interrupts(void) 413 + { 414 + union hv_synic_scontrol sctrl; 415 + 416 + /* Disable the global synic bit */ 417 + sctrl.as_uint64 = hv_para_get_synic_register(HV_MSR_SCONTROL); 418 + sctrl.enable = 0; 419 + hv_para_set_synic_register(HV_MSR_SCONTROL, sctrl.as_uint64); 460 420 } 461 421 462 422 #define HV_MAX_TRIES 3 ··· 499 399 * that the normal interrupt handling mechanism will find and process the channel interrupt 500 400 * "very soon", and in the process clear the bit. 501 401 */ 502 - static bool hv_synic_event_pending(void) 402 + static bool __hv_synic_event_pending(union hv_synic_event_flags *event, int sint) 503 403 { 504 - struct hv_per_cpu_context *hv_cpu = this_cpu_ptr(hv_context.cpu_context); 505 - union hv_synic_event_flags *event = 506 - (union hv_synic_event_flags *)hv_cpu->synic_event_page + VMBUS_MESSAGE_SINT; 507 - unsigned long *recv_int_page = event->flags; /* assumes VMBus version >= VERSION_WIN8 */ 404 + unsigned long *recv_int_page; 508 405 bool pending; 509 406 u32 relid; 510 407 int tries = 0; 511 408 409 + if (!event) 410 + return false; 411 + 412 + event += sint; 413 + recv_int_page = event->flags; /* assumes VMBus version >= VERSION_WIN8 */ 512 414 retry: 513 415 pending = false; 514 416 for_each_set_bit(relid, recv_int_page, HV_EVENT_FLAGS_COUNT) { ··· 525 423 goto retry; 526 424 } 527 425 return pending; 426 + } 427 + 428 + static bool hv_synic_event_pending(void) 429 + { 430 + struct hv_per_cpu_context *hv_cpu = this_cpu_ptr(hv_context.cpu_context); 431 + union hv_synic_event_flags *hyp_synic_event_page = hv_cpu->hyp_synic_event_page; 432 + union hv_synic_event_flags *para_synic_event_page = hv_cpu->para_synic_event_page; 433 + 434 + return 435 + __hv_synic_event_pending(hyp_synic_event_page, VMBUS_MESSAGE_SINT) || 436 
+ __hv_synic_event_pending(para_synic_event_page, VMBUS_MESSAGE_SINT); 528 437 } 529 438 530 439 static int hv_pick_new_cpu(struct vmbus_channel *channel) ··· 630 517 always_cleanup: 631 518 hv_stimer_legacy_cleanup(cpu); 632 519 633 - hv_synic_disable_regs(cpu); 520 + /* 521 + * First, disable the event and message pages 522 + * used for communicating with the host, and then 523 + * disable the host interrupts if VMBus is not 524 + * confidential. 525 + */ 526 + hv_hyp_synic_disable_regs(cpu); 527 + if (!vmbus_is_confidential()) 528 + hv_hyp_synic_disable_interrupts(); 529 + 530 + /* 531 + * Perform the same steps for the Confidential VMBus. 532 + * The sequencing provides the guarantee that no data 533 + * may be posted for processing before disabling interrupts. 534 + */ 535 + if (vmbus_is_confidential()) { 536 + hv_para_synic_disable_regs(cpu); 537 + hv_para_synic_disable_interrupts(); 538 + } 539 + if (vmbus_irq != -1) 540 + disable_percpu_irq(vmbus_irq); 634 541 635 542 return ret; 636 543 }
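The new hv_alloc_page()/hv_free_page() helpers in this hunk centralize a subtle ordering: a page is zeroed only after its encryption attribute is changed (the attribute flip may scramble contents), and any page whose attribute change failed is deliberately leaked rather than returned to the allocator. A user-space sketch of that policy; the demo_* and stub_* names are illustrative stand-ins, not kernel API, and set_memory_decrypted()/set_memory_encrypted() are modeled by stubs:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_PAGE_SIZE 4096

/* Test knob standing in for a failing set_memory_decrypted() call. */
static int fail_decrypt;

static int stub_set_decrypted(void *page) { (void)page; return fail_decrypt ? -1 : 0; }
static int stub_set_encrypted(void *page) { (void)page; return 0; }

/*
 * Mirrors the shape of hv_alloc_page(): allocate, optionally change the
 * encryption attribute, and only then zero the page. On attribute-change
 * failure the page is deliberately leaked: its status is unknown, so it
 * must never go back on the free list.
 */
static int demo_alloc_page(void **page, bool decrypt)
{
	int ret = 0;

	*page = malloc(DEMO_PAGE_SIZE);
	if (!*page)
		return -12; /* -ENOMEM */

	if (decrypt)
		ret = stub_set_decrypted(*page);
	if (ret) {
		*page = NULL; /* leak on purpose, like the kernel helper */
		return ret;
	}

	memset(*page, 0, DEMO_PAGE_SIZE);
	return 0;
}

static int demo_free_page(void **page, bool encrypt)
{
	int ret = 0;

	if (!*page)
		return 0;
	if (encrypt)
		ret = stub_set_encrypted(*page);
	if (!ret)
		free(*page); /* only reuse pages known to be encrypted again */
	*page = NULL;
	return ret;
}
```

The asymmetry is the point: allocation failure and deallocation failure both null the caller's pointer, but only a fully successful attribute change lets the memory be reused.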
+24 -3
drivers/hv/hv_common.c
··· 315 315 int i; 316 316 union hv_hypervisor_version_info version; 317 317 318 - /* Get information about the Hyper-V host version */ 318 + /* Get information about the Microsoft Hypervisor version */ 319 319 if (!hv_get_hypervisor_version(&version)) 320 - pr_info("Hyper-V: Host Build %d.%d.%d.%d-%d-%d\n", 320 + pr_info("Hyper-V: Hypervisor Build %d.%d.%d.%d-%d-%d\n", 321 321 version.major_version, version.minor_version, 322 322 version.build_number, version.service_number, 323 323 version.service_pack, version.service_branch); ··· 487 487 * online and then taken offline 488 488 */ 489 489 if (!*inputarg) { 490 - mem = kmalloc(pgcount * HV_HYP_PAGE_SIZE, flags); 490 + mem = kmalloc_array(pgcount, HV_HYP_PAGE_SIZE, flags); 491 491 if (!mem) 492 492 return -ENOMEM; 493 493 ··· 715 715 return HV_STATUS_INVALID_PARAMETER; 716 716 } 717 717 EXPORT_SYMBOL_GPL(hv_tdx_hypercall); 718 + 719 + void __weak hv_enable_coco_interrupt(unsigned int cpu, unsigned int vector, bool set) 720 + { 721 + } 722 + EXPORT_SYMBOL_GPL(hv_enable_coco_interrupt); 723 + 724 + void __weak hv_para_set_sint_proxy(bool enable) 725 + { 726 + } 727 + EXPORT_SYMBOL_GPL(hv_para_set_sint_proxy); 728 + 729 + u64 __weak hv_para_get_synic_register(unsigned int reg) 730 + { 731 + return ~0ULL; 732 + } 733 + EXPORT_SYMBOL_GPL(hv_para_get_synic_register); 734 + 735 + void __weak hv_para_set_synic_register(unsigned int reg, u64 val) 736 + { 737 + } 738 + EXPORT_SYMBOL_GPL(hv_para_set_synic_register); 718 739 719 740 void hv_identify_partition_type(void) 720 741 {
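The hv_common.c hunk above adds `__weak` no-op defaults for the new paravisor SynIC accessors so that configurations without Confidential VMBus support still link; an architecture can supply a strong definition elsewhere to override them at link time. A minimal user-space illustration of the weak-symbol mechanism (GCC/Clang attribute; the para_* names here are made up, not the kernel's):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Weak default, like the hv_para_get_synic_register() stub above: it
 * links and returns a sentinel unless some other translation unit
 * provides a strong definition, which then wins at link time.
 */
__attribute__((weak)) uint64_t para_get_reg(unsigned int reg)
{
	(void)reg;
	return ~0ULL; /* "no such register" sentinel, as in the diff */
}

__attribute__((weak)) void para_set_reg(unsigned int reg, uint64_t val)
{
	(void)reg;
	(void)val; /* deliberate no-op default */
}
```

With no strong override present, callers get the harmless fallback behavior, which is exactly what generic code wants from an optional arch hook.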
+1 -1
drivers/hv/hv_util.c
··· 586 586 (struct hv_util_service *)dev_id->driver_data; 587 587 int ret; 588 588 589 - srv->recv_buffer = kmalloc(HV_HYP_PAGE_SIZE * 4, GFP_KERNEL); 589 + srv->recv_buffer = kmalloc_array(4, HV_HYP_PAGE_SIZE, GFP_KERNEL); 590 590 if (!srv->recv_buffer) 591 591 return -ENOMEM; 592 592 srv->channel = dev->channel;
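Both kmalloc() → kmalloc_array() conversions above buy an overflow check on the count-times-size multiplication, which an open-coded `kmalloc(n * size, ...)` silently wraps around on. A user-space sketch of the same guard (alloc_array is an illustrative name, not a kernel function):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Refuse the allocation when n * size would overflow size_t, which is
 * the check kmalloc_array() performs before delegating to kmalloc().
 */
static void *alloc_array(size_t n, size_t size)
{
	if (size != 0 && n > SIZE_MAX / size)
		return NULL; /* would overflow: fail cleanly */
	return malloc(n * size);
}
```

Without the check, a huge `n` could wrap to a tiny allocation that later writes overflow; failing with NULL turns that silent corruption into an ordinary allocation failure.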
+71 -5
drivers/hv/hyperv_vmbus.h
··· 15 15 #include <linux/list.h> 16 16 #include <linux/bitops.h> 17 17 #include <asm/sync_bitops.h> 18 + #include <asm/mshyperv.h> 18 19 #include <linux/atomic.h> 19 20 #include <linux/hyperv.h> 20 21 #include <linux/interrupt.h> ··· 33 32 */ 34 33 #define HV_UTIL_NEGO_TIMEOUT 55 35 34 35 + void vmbus_isr(void); 36 36 37 37 /* Definitions for the monitored notification facility */ 38 38 union hv_monitor_trigger_group { ··· 122 120 * Per cpu state for channel handling 123 121 */ 124 122 struct hv_per_cpu_context { 125 - void *synic_message_page; 126 - void *synic_event_page; 123 + /* 124 + * SynIC pages for communicating with the host. 125 + * 126 + * These pages are accessible to the host partition and the hypervisor. 127 + * They may be used for exchanging data with the host partition and the 128 + * hypervisor even though neither is trusted; the guest partition 129 + * must be prepared to handle malicious behavior. 130 + */ 131 + void *hyp_synic_message_page; 132 + void *hyp_synic_event_page; 133 + /* 134 + * SynIC pages for communicating with the paravisor. 135 + * 136 + * These pages may be accessed from within the guest partition only in 137 + * CoCo VMs. Neither the host partition nor the hypervisor can access 138 + * these pages in that case; they are used for exchanging data with the 139 + * paravisor. 
140 + */ 141 + void *para_synic_message_page; 142 + void *para_synic_event_page; 127 143 128 144 /* 129 145 * The page is only used in hv_post_message() for a TDX VM (with the ··· 191 171 192 172 extern void hv_synic_free(void); 193 173 194 - extern void hv_synic_enable_regs(unsigned int cpu); 174 + extern void hv_hyp_synic_enable_regs(unsigned int cpu); 195 175 extern int hv_synic_init(unsigned int cpu); 196 176 197 - extern void hv_synic_disable_regs(unsigned int cpu); 177 + extern void hv_hyp_synic_disable_regs(unsigned int cpu); 198 178 extern int hv_synic_cleanup(unsigned int cpu); 199 179 200 180 /* Interface */ ··· 202 182 void hv_ringbuffer_pre_init(struct vmbus_channel *channel); 203 183 204 184 int hv_ringbuffer_init(struct hv_ring_buffer_info *ring_info, 205 - struct page *pages, u32 pagecnt, u32 max_pkt_size); 185 + struct page *pages, u32 pagecnt, u32 max_pkt_size, 186 + bool confidential); 206 187 207 188 void hv_ringbuffer_cleanup(struct hv_ring_buffer_info *ring_info); 208 189 ··· 353 332 354 333 355 334 /* General vmbus interface */ 335 + 336 + bool vmbus_is_confidential(void); 337 + 338 + #if IS_ENABLED(CONFIG_HYPERV_VMBUS) 339 + /* Free the message slot and signal end-of-message if required */ 340 + static inline void vmbus_signal_eom(struct hv_message *msg, u32 old_msg_type) 341 + { 342 + /* 343 + * On crash we're reading some other CPU's message page and we need 344 + * to be careful: this other CPU may already have cleared the header 345 + * and the host may already have delivered some other message there. 346 + * If we blindly write msg->header.message_type, we're going 347 + * to lose it. We can still lose a message of the same type but 348 + * we count on the fact that there can only be one 349 + * CHANNELMSG_UNLOAD_RESPONSE and we don't care about other messages 350 + * on crash. 
351 + */ 352 + if (cmpxchg(&msg->header.message_type, old_msg_type, 353 + HVMSG_NONE) != old_msg_type) 354 + return; 355 + 356 + /* 357 + * The cmpxchg() above does an implicit memory barrier to 358 + * ensure the write to MessageType (i.e. setting it to 359 + * HVMSG_NONE) happens before we read the 360 + * MessagePending and EOMing. Otherwise, the EOMing 361 + * will not deliver any more messages since there is 362 + * no empty slot. 363 + */ 364 + if (msg->header.message_flags.msg_pending) { 365 + /* 366 + * This will cause message queue rescan to 367 + * possibly deliver another msg from the 368 + * hypervisor 369 + */ 370 + if (vmbus_is_confidential()) 371 + hv_para_set_synic_register(HV_MSR_EOM, 0); 372 + else 373 + hv_set_msr(HV_MSR_EOM, 0); 374 + } 375 + } 376 + 377 + extern int vmbus_interrupt; 378 + extern int vmbus_irq; 379 + #endif /* CONFIG_HYPERV_VMBUS */ 356 380 357 381 struct hv_device *vmbus_device_create(const guid_t *type, 358 382 const guid_t *instance,
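vmbus_signal_eom() above frees a message slot with cmpxchg() so that a message the host delivered concurrently is never clobbered: the slot is cleared only if it still holds the message type we consumed. A compact C11-atomics sketch of that compare-and-swap pattern (signal_eom and MSG_NONE are illustrative names):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define MSG_NONE 0u

/*
 * Free a message slot only if it still holds the type we consumed.
 * If another writer (the hypervisor, in the VMBus case) replaced the
 * message in the meantime, leave the slot alone so the new message
 * isn't lost. Returns 1 if the slot was freed, 0 otherwise.
 */
static int signal_eom(_Atomic uint32_t *slot, uint32_t old_type)
{
	uint32_t expected = old_type;

	return atomic_compare_exchange_strong(slot, &expected, MSG_NONE);
}
```

The blind alternative, `*slot = MSG_NONE;`, has exactly the race the kernel comment describes: a concurrently delivered message of a different type would be silently overwritten.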
+99
drivers/hv/mshv_common.c
··· 14 14 #include <asm/mshyperv.h> 15 15 #include <linux/resume_user_mode.h> 16 16 #include <linux/export.h> 17 + #include <linux/acpi.h> 18 + #include <linux/notifier.h> 19 + #include <linux/reboot.h> 17 20 18 21 #include "mshv.h" 19 22 ··· 141 138 return 0; 142 139 } 143 140 EXPORT_SYMBOL_GPL(hv_call_get_partition_property); 141 + 142 + /* 143 + * Corresponding sleep states have to be initialized in order for a subsequent 144 + * HVCALL_ENTER_SLEEP_STATE call to succeed. Currently only S5 state as per 145 + * ACPI 6.4 chapter 7.4.2 is relevant, while S1, S2 and S3 can be supported. 146 + * 147 + * In order to pass proper PM values to mshv, ACPI should be initialized and 148 + * should support S5 sleep state when this method is invoked. 149 + */ 150 + static int hv_initialize_sleep_states(void) 151 + { 152 + u64 status; 153 + unsigned long flags; 154 + struct hv_input_set_system_property *in; 155 + acpi_status acpi_status; 156 + u8 sleep_type_a, sleep_type_b; 157 + 158 + if (!acpi_sleep_state_supported(ACPI_STATE_S5)) { 159 + pr_err("%s: S5 sleep state not supported.\n", __func__); 160 + return -ENODEV; 161 + } 162 + 163 + acpi_status = acpi_get_sleep_type_data(ACPI_STATE_S5, &sleep_type_a, 164 + &sleep_type_b); 165 + if (ACPI_FAILURE(acpi_status)) 166 + return -ENODEV; 167 + 168 + local_irq_save(flags); 169 + in = *this_cpu_ptr(hyperv_pcpu_input_arg); 170 + memset(in, 0, sizeof(*in)); 171 + 172 + in->property_id = HV_SYSTEM_PROPERTY_SLEEP_STATE; 173 + in->set_sleep_state_info.sleep_state = HV_SLEEP_STATE_S5; 174 + in->set_sleep_state_info.pm1a_slp_typ = sleep_type_a; 175 + in->set_sleep_state_info.pm1b_slp_typ = sleep_type_b; 176 + 177 + status = hv_do_hypercall(HVCALL_SET_SYSTEM_PROPERTY, in, NULL); 178 + local_irq_restore(flags); 179 + 180 + if (!hv_result_success(status)) { 181 + hv_status_err(status, "\n"); 182 + return hv_result_to_errno(status); 183 + } 184 + 185 + return 0; 186 + } 187 + 188 + /* 189 + * This notifier initializes sleep states in mshv 
hypervisor, which will be 190 + * used during power-off. 191 + */ 192 + static int hv_reboot_notifier_handler(struct notifier_block *this, 193 + unsigned long code, void *another) 194 + { 195 + int ret = 0; 196 + 197 + if (code == SYS_HALT || code == SYS_POWER_OFF) 198 + ret = hv_initialize_sleep_states(); 199 + 200 + return ret ? NOTIFY_DONE : NOTIFY_OK; 201 + } 202 + 203 + static struct notifier_block hv_reboot_notifier = { 204 + .notifier_call = hv_reboot_notifier_handler, 205 + }; 206 + 207 + void hv_sleep_notifiers_register(void) 208 + { 209 + int ret; 210 + 211 + ret = register_reboot_notifier(&hv_reboot_notifier); 212 + if (ret) 213 + pr_err("%s: cannot register reboot notifier %d\n", __func__, 214 + ret); 215 + } 216 + 217 + /* 218 + * Power off the machine by entering S5 sleep state via Hyper-V hypercall. 219 + * This call does not return if successful. 220 + */ 221 + void hv_machine_power_off(void) 222 + { 223 + unsigned long flags; 224 + struct hv_input_enter_sleep_state *in; 225 + 226 + local_irq_save(flags); 227 + in = *this_cpu_ptr(hyperv_pcpu_input_arg); 228 + in->sleep_state = HV_SLEEP_STATE_S5; 229 + 230 + (void)hv_do_hypercall(HVCALL_ENTER_SLEEP_STATE, in, NULL); 231 + local_irq_restore(flags); 232 + 233 + /* should never reach here */ 234 + BUG(); 235 + 236 + }
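The reboot notifier above defers sleep-state initialization until a halt or power-off is actually requested, and ignores other transitions such as restart. A user-space sketch of the notifier-chain idea it relies on; the demo_* names mimic, but are not, the kernel's notifier API:

```c
#include <assert.h>
#include <stddef.h>

enum { DEMO_SYS_RESTART, DEMO_SYS_HALT, DEMO_SYS_POWER_OFF };

/* A singly linked chain of callbacks, in the spirit of notifier_block. */
struct demo_notifier {
	int (*call)(struct demo_notifier *nb, unsigned long code);
	struct demo_notifier *next;
};

static struct demo_notifier *chain;

static void demo_register(struct demo_notifier *nb)
{
	nb->next = chain;
	chain = nb;
}

/* Walk the chain, letting each callback react to the event code. */
static void demo_notify(unsigned long code)
{
	for (struct demo_notifier *nb = chain; nb; nb = nb->next)
		nb->call(nb, code);
}

/* Like hv_reboot_notifier_handler(): act only on halt/power-off. */
static int sleep_state_inits;

static int demo_handler(struct demo_notifier *nb, unsigned long code)
{
	(void)nb;
	if (code == DEMO_SYS_HALT || code == DEMO_SYS_POWER_OFF)
		sleep_state_inits++; /* stands in for hv_initialize_sleep_states() */
	return 0;
}
```

Registering at boot but doing the hypercall only from the notifier keeps the ACPI-dependent setup out of the common path, matching the commit's intent.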
+7 -1
drivers/hv/mshv_eventfd.c
··· 163 163 if (hv_scheduler_type != HV_SCHEDULER_TYPE_ROOT) 164 164 return -EOPNOTSUPP; 165 165 166 + #if IS_ENABLED(CONFIG_X86) 166 167 if (irq->lapic_control.logical_dest_mode) 167 168 return -EOPNOTSUPP; 169 + #endif 168 170 169 171 vp = partition->pt_vp_array[irq->lapic_apic_id]; 170 172 ··· 198 196 unsigned int seq; 199 197 int idx; 200 198 199 + #if IS_ENABLED(CONFIG_X86) 201 200 WARN_ON(irqfd->irqfd_resampler && 202 201 !irq->lapic_control.level_triggered); 202 + #endif 203 203 204 204 idx = srcu_read_lock(&partition->pt_irq_srcu); 205 205 if (irqfd->irqfd_girq_ent.guest_irq_num) { ··· 473 469 init_poll_funcptr(&irqfd->irqfd_polltbl, mshv_irqfd_queue_proc); 474 470 475 471 spin_lock_irq(&pt->pt_irqfds_lock); 472 + #if IS_ENABLED(CONFIG_X86) 476 473 if (args->flags & BIT(MSHV_IRQFD_BIT_RESAMPLE) && 477 474 !irqfd->irqfd_lapic_irq.lapic_control.level_triggered) { 478 475 /* ··· 484 479 ret = -EINVAL; 485 480 goto fail; 486 481 } 482 + #endif 487 483 ret = 0; 488 484 hlist_for_each_entry(tmp, &pt->pt_irqfds_list, irqfd_hnode) { 489 485 if (irqfd->irqfd_eventfd_ctx != tmp->irqfd_eventfd_ctx) ··· 598 592 599 593 int mshv_irqfd_wq_init(void) 600 594 { 601 - irqfd_cleanup_wq = alloc_workqueue("mshv-irqfd-cleanup", 0, 0); 595 + irqfd_cleanup_wq = alloc_workqueue("mshv-irqfd-cleanup", WQ_PERCPU, 0); 602 596 if (!irqfd_cleanup_wq) 603 597 return -ENOMEM; 604 598
+4
drivers/hv/mshv_irq.c
··· 119 119 lirq->lapic_vector = ent->girq_irq_data & 0xFF; 120 120 lirq->lapic_apic_id = (ent->girq_addr_lo >> 12) & 0xFF; 121 121 lirq->lapic_control.interrupt_type = (ent->girq_irq_data & 0x700) >> 8; 122 + #if IS_ENABLED(CONFIG_X86) 122 123 lirq->lapic_control.level_triggered = (ent->girq_irq_data >> 15) & 0x1; 123 124 lirq->lapic_control.logical_dest_mode = (ent->girq_addr_lo >> 2) & 0x1; 125 + #elif IS_ENABLED(CONFIG_ARM64) 126 + lirq->lapic_control.asserted = 1; 127 + #endif 124 128 }
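The hunk above decodes an MSI-style routing entry into LAPIC IRQ fields with shifts and masks: the vector, delivery type, and trigger mode come from girq_irq_data, while the APIC ID and destination mode come from girq_addr_lo. A small sketch of the same decoding (struct and function names here are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Decoded view of one guest IRQ routing entry (illustrative layout). */
struct demo_lapic_irq {
	uint8_t vector;
	uint8_t apic_id;
	uint8_t interrupt_type;
	uint8_t level_triggered;
	uint8_t logical_dest_mode;
};

/* Same masks and shifts as the mshv_irq.c hunk above. */
static void demo_decode(uint32_t irq_data, uint32_t addr_lo,
			struct demo_lapic_irq *out)
{
	out->vector = irq_data & 0xFF;                  /* bits 0-7  */
	out->apic_id = (addr_lo >> 12) & 0xFF;          /* bits 12-19 */
	out->interrupt_type = (irq_data & 0x700) >> 8;  /* bits 8-10 */
	out->level_triggered = (irq_data >> 15) & 0x1;  /* bit 15    */
	out->logical_dest_mode = (addr_lo >> 2) & 0x1;  /* bit 2     */
}
```

On ARM64 the trigger/destination bits don't apply, which is why the diff guards them with CONFIG_X86 and simply asserts the interrupt on ARM64.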
+555
drivers/hv/mshv_regions.c
// SPDX-License-Identifier: GPL-2.0-only
/*
 * Copyright (c) 2025, Microsoft Corporation.
 *
 * Memory region management for mshv_root module.
 *
 * Authors: Microsoft Linux virtualization team
 */

#include <linux/hmm.h>
#include <linux/hyperv.h>
#include <linux/kref.h>
#include <linux/mm.h>
#include <linux/vmalloc.h>

#include <asm/mshyperv.h>

#include "mshv_root.h"

#define MSHV_MAP_FAULT_IN_PAGES PTRS_PER_PMD

/**
 * mshv_region_process_chunk - Processes a contiguous chunk of memory pages
 *                             in a region.
 * @region     : Pointer to the memory region structure.
 * @flags      : Flags to pass to the handler.
 * @page_offset: Offset into the region's pages array to start processing.
 * @page_count : Number of pages to process.
 * @handler    : Callback function to handle the chunk.
 *
 * This function scans the region's pages starting from @page_offset,
 * checking for contiguous present pages of the same size (normal or huge).
 * It invokes @handler for the chunk of contiguous pages found. Returns the
 * number of pages handled, or a negative error code if the first page is
 * not present or the handler fails.
 *
 * Note: The @handler callback must be able to handle both normal and huge
 * pages.
 *
 * Return: Number of pages handled, or negative error code.
 */
static long mshv_region_process_chunk(struct mshv_mem_region *region,
				      u32 flags,
				      u64 page_offset, u64 page_count,
				      int (*handler)(struct mshv_mem_region *region,
						     u32 flags,
						     u64 page_offset,
						     u64 page_count))
{
	u64 count, stride;
	unsigned int page_order;
	struct page *page;
	int ret;

	page = region->pages[page_offset];
	if (!page)
		return -EINVAL;

	page_order = folio_order(page_folio(page));
	/* The hypervisor only supports 4K and 2M page sizes */
	if (page_order && page_order != HPAGE_PMD_ORDER)
		return -EINVAL;

	stride = 1 << page_order;

	/* Start at stride since the first page is validated */
	for (count = stride; count < page_count; count += stride) {
		page = region->pages[page_offset + count];

		/* Break if current page is not present */
		if (!page)
			break;

		/* Break if page size changes */
		if (page_order != folio_order(page_folio(page)))
			break;
	}

	ret = handler(region, flags, page_offset, count);
	if (ret)
		return ret;

	return count;
}

/**
 * mshv_region_process_range - Processes a range of memory pages in a
 *                             region.
 * @region     : Pointer to the memory region structure.
 * @flags      : Flags to pass to the handler.
 * @page_offset: Offset into the region's pages array to start processing.
 * @page_count : Number of pages to process.
 * @handler    : Callback function to handle each chunk of contiguous
 *               pages.
 *
 * Iterates over the specified range of pages in @region, skipping
 * non-present pages. For each contiguous chunk of present pages, invokes
 * @handler via mshv_region_process_chunk.
 *
 * Note: The @handler callback must be able to handle both normal and huge
 * pages.
 *
 * Returns 0 on success, or a negative error code on failure.
 */
static int mshv_region_process_range(struct mshv_mem_region *region,
				     u32 flags,
				     u64 page_offset, u64 page_count,
				     int (*handler)(struct mshv_mem_region *region,
						    u32 flags,
						    u64 page_offset,
						    u64 page_count))
{
	long ret;

	if (page_offset + page_count > region->nr_pages)
		return -EINVAL;

	while (page_count) {
		/* Skip non-present pages */
		if (!region->pages[page_offset]) {
			page_offset++;
			page_count--;
			continue;
		}

		ret = mshv_region_process_chunk(region, flags,
						page_offset,
						page_count,
						handler);
		if (ret < 0)
			return ret;

		page_offset += ret;
		page_count -= ret;
	}

	return 0;
}

struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
					   u64 uaddr, u32 flags)
{
	struct mshv_mem_region *region;

	region = vzalloc(sizeof(*region) + sizeof(struct page *) * nr_pages);
	if (!region)
		return ERR_PTR(-ENOMEM);

	region->nr_pages = nr_pages;
	region->start_gfn = guest_pfn;
	region->start_uaddr = uaddr;
	region->hv_map_flags = HV_MAP_GPA_READABLE | HV_MAP_GPA_ADJUSTABLE;
	if (flags & BIT(MSHV_SET_MEM_BIT_WRITABLE))
		region->hv_map_flags |= HV_MAP_GPA_WRITABLE;
	if (flags & BIT(MSHV_SET_MEM_BIT_EXECUTABLE))
		region->hv_map_flags |= HV_MAP_GPA_EXECUTABLE;

	kref_init(&region->refcount);

	return region;
}

static int mshv_region_chunk_share(struct mshv_mem_region *region,
				   u32 flags,
				   u64 page_offset, u64 page_count)
{
	struct page *page = region->pages[page_offset];

	if (PageHuge(page) || PageTransCompound(page))
		flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;

	return hv_call_modify_spa_host_access(region->partition->pt_id,
					      region->pages + page_offset,
					      page_count,
					      HV_MAP_GPA_READABLE |
					      HV_MAP_GPA_WRITABLE,
					      flags, true);
}

int mshv_region_share(struct mshv_mem_region *region)
{
	u32 flags = HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_SHARED;

	return mshv_region_process_range(region, flags,
					 0, region->nr_pages,
					 mshv_region_chunk_share);
}

static int mshv_region_chunk_unshare(struct mshv_mem_region *region,
				     u32 flags,
				     u64 page_offset, u64 page_count)
{
	struct page *page = region->pages[page_offset];

	if (PageHuge(page) || PageTransCompound(page))
		flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;

	return hv_call_modify_spa_host_access(region->partition->pt_id,
					      region->pages + page_offset,
					      page_count, 0,
					      flags, false);
}

int mshv_region_unshare(struct mshv_mem_region *region)
{
	u32 flags = HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_EXCLUSIVE;

	return mshv_region_process_range(region, flags,
					 0, region->nr_pages,
					 mshv_region_chunk_unshare);
}

static int mshv_region_chunk_remap(struct mshv_mem_region *region,
				   u32 flags,
				   u64 page_offset, u64 page_count)
{
	struct page *page = region->pages[page_offset];

	if (PageHuge(page) || PageTransCompound(page))
		flags |= HV_MAP_GPA_LARGE_PAGE;

	return hv_call_map_gpa_pages(region->partition->pt_id,
				     region->start_gfn + page_offset,
				     page_count, flags,
				     region->pages + page_offset);
}

static int mshv_region_remap_pages(struct mshv_mem_region *region,
				   u32 map_flags,
				   u64 page_offset, u64 page_count)
{
	return mshv_region_process_range(region, map_flags,
					 page_offset, page_count,
					 mshv_region_chunk_remap);
}

int mshv_region_map(struct mshv_mem_region *region)
{
	u32 map_flags = region->hv_map_flags;

	return mshv_region_remap_pages(region, map_flags,
				       0, region->nr_pages);
}

static void mshv_region_invalidate_pages(struct mshv_mem_region *region,
					 u64 page_offset, u64 page_count)
{
	if (region->type == MSHV_REGION_TYPE_MEM_PINNED)
		unpin_user_pages(region->pages + page_offset, page_count);

	memset(region->pages + page_offset, 0,
	       page_count * sizeof(struct page *));
}

void mshv_region_invalidate(struct mshv_mem_region *region)
{
	mshv_region_invalidate_pages(region, 0, region->nr_pages);
}

int mshv_region_pin(struct mshv_mem_region *region)
{
	u64 done_count, nr_pages;
	struct page **pages;
	__u64 userspace_addr;
	int ret;

	for (done_count = 0; done_count < region->nr_pages; done_count += ret) {
		pages = region->pages + done_count;
		userspace_addr = region->start_uaddr +
				 done_count * HV_HYP_PAGE_SIZE;
		nr_pages = min(region->nr_pages - done_count,
			       MSHV_PIN_PAGES_BATCH_SIZE);

		/*
		 * Pinning assuming 4k pages works for large pages too.
		 * All page structs within the large page are returned.
		 *
		 * Pin requests are batched because pin_user_pages_fast
		 * with the FOLL_LONGTERM flag does a large temporary
		 * allocation of contiguous memory.
		 */
		ret = pin_user_pages_fast(userspace_addr, nr_pages,
					  FOLL_WRITE | FOLL_LONGTERM,
					  pages);
		if (ret < 0)
			goto release_pages;
	}

	return 0;

release_pages:
	mshv_region_invalidate_pages(region, 0, done_count);
	return ret;
}

static int mshv_region_chunk_unmap(struct mshv_mem_region *region,
				   u32 flags,
				   u64 page_offset, u64 page_count)
{
	struct page *page = region->pages[page_offset];

	if (PageHuge(page) || PageTransCompound(page))
		flags |= HV_UNMAP_GPA_LARGE_PAGE;

	return hv_call_unmap_gpa_pages(region->partition->pt_id,
				       region->start_gfn + page_offset,
				       page_count, flags);
}

static int mshv_region_unmap(struct mshv_mem_region *region)
{
	return mshv_region_process_range(region, 0,
					 0, region->nr_pages,
					 mshv_region_chunk_unmap);
}

static void mshv_region_destroy(struct kref *ref)
{
	struct mshv_mem_region *region =
		container_of(ref, struct mshv_mem_region, refcount);
	struct mshv_partition *partition = region->partition;
	int ret;

	if (region->type == MSHV_REGION_TYPE_MEM_MOVABLE)
		mshv_region_movable_fini(region);

	if (mshv_partition_encrypted(partition)) {
		ret = mshv_region_share(region);
		if (ret) {
			pt_err(partition,
			       "Failed to regain access to memory, unpinning user pages will fail and crash the host error: %d\n",
			       ret);
			return;
		}
	}

	mshv_region_unmap(region);

	mshv_region_invalidate(region);

	vfree(region);
}

void mshv_region_put(struct mshv_mem_region *region)
{
	kref_put(&region->refcount, mshv_region_destroy);
}

int mshv_region_get(struct mshv_mem_region *region)
{
	return kref_get_unless_zero(&region->refcount);
}

/**
 * mshv_region_hmm_fault_and_lock - Handle HMM faults and lock the memory region
 * @region: Pointer to the memory region structure
 * @range: Pointer to the HMM range structure
 *
 * This function performs the following steps:
 * 1. Reads the notifier sequence for the HMM range.
 * 2. Acquires a read lock on the memory map.
 * 3. Handles HMM faults for the specified range.
 * 4. Releases the read lock on the memory map.
 * 5. If successful, locks the memory region mutex.
 * 6. Verifies if the notifier sequence has changed during the operation.
 *    If it has, releases the mutex and returns -EBUSY to match with
 *    hmm_range_fault() return code for repeating.
 *
 * Return: 0 on success, a negative error code otherwise.
 */
static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
					  struct hmm_range *range)
{
	int ret;

	range->notifier_seq = mmu_interval_read_begin(range->notifier);
	mmap_read_lock(region->mni.mm);
	ret = hmm_range_fault(range);
	mmap_read_unlock(region->mni.mm);
	if (ret)
		return ret;

	mutex_lock(&region->mutex);

	if (mmu_interval_read_retry(range->notifier, range->notifier_seq)) {
		mutex_unlock(&region->mutex);
		cond_resched();
		return -EBUSY;
	}

	return 0;
}

/**
 * mshv_region_range_fault - Handle memory range faults for a given region.
 * @region: Pointer to the memory region structure.
 * @page_offset: Offset of the page within the region.
 * @page_count: Number of pages to handle.
 *
 * This function resolves memory faults for a specified range of pages
 * within a memory region. It uses HMM (Heterogeneous Memory Management)
 * to fault in the required pages and updates the region's page array.
 *
 * Return: 0 on success, negative error code on failure.
 */
static int mshv_region_range_fault(struct mshv_mem_region *region,
				   u64 page_offset, u64 page_count)
{
	struct hmm_range range = {
		.notifier = &region->mni,
		.default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
	};
	unsigned long *pfns;
	int ret;
	u64 i;

	pfns = kmalloc_array(page_count, sizeof(*pfns), GFP_KERNEL);
	if (!pfns)
		return -ENOMEM;

	range.hmm_pfns = pfns;
	range.start = region->start_uaddr + page_offset * HV_HYP_PAGE_SIZE;
	range.end = range.start + page_count * HV_HYP_PAGE_SIZE;

	do {
		ret = mshv_region_hmm_fault_and_lock(region, &range);
	} while (ret == -EBUSY);

	if (ret)
		goto out;

	for (i = 0; i < page_count; i++)
		region->pages[page_offset + i] = hmm_pfn_to_page(pfns[i]);

	ret = mshv_region_remap_pages(region, region->hv_map_flags,
				      page_offset, page_count);

	mutex_unlock(&region->mutex);
out:
	kfree(pfns);
	return ret;
}

bool mshv_region_handle_gfn_fault(struct mshv_mem_region *region, u64 gfn)
{
	u64 page_offset, page_count;
	int ret;

	/* Align the page offset to the nearest MSHV_MAP_FAULT_IN_PAGES. */
	page_offset = ALIGN_DOWN(gfn - region->start_gfn,
				 MSHV_MAP_FAULT_IN_PAGES);

	/* Map more pages than requested to reduce the number of faults. */
	page_count = min(region->nr_pages - page_offset,
			 MSHV_MAP_FAULT_IN_PAGES);

	ret = mshv_region_range_fault(region, page_offset, page_count);

	WARN_ONCE(ret,
		  "p%llu: GPA intercept failed: region %#llx-%#llx, gfn %#llx, page_offset %llu, page_count %llu\n",
		  region->partition->pt_id, region->start_uaddr,
		  region->start_uaddr + (region->nr_pages << HV_HYP_PAGE_SHIFT),
		  gfn, page_offset, page_count);

	return !ret;
}

/**
 * mshv_region_interval_invalidate - Invalidate a range of memory region
 * @mni: Pointer to the mmu_interval_notifier structure
 * @range: Pointer to the mmu_notifier_range structure
 * @cur_seq: Current sequence number for the interval notifier
 *
 * This function invalidates a memory region by remapping its pages with
 * no access permissions. It locks the region's mutex to ensure thread safety
 * and updates the sequence number for the interval notifier. If the range
 * is blockable, it uses a blocking lock; otherwise, it attempts a non-blocking
 * lock and returns false if unsuccessful.
 *
 * NOTE: Failure to invalidate a region is a serious error, as the pages will
 * be considered freed while they are still mapped by the hypervisor.
 * Any attempt to access such pages will likely crash the system.
 *
 * Return: true if the region was successfully invalidated, false otherwise.
 */
static bool mshv_region_interval_invalidate(struct mmu_interval_notifier *mni,
					    const struct mmu_notifier_range *range,
					    unsigned long cur_seq)
{
	struct mshv_mem_region *region = container_of(mni,
						      struct mshv_mem_region,
						      mni);
	u64 page_offset, page_count;
	unsigned long mstart, mend;
	int ret = -EPERM;

	if (mmu_notifier_range_blockable(range))
		mutex_lock(&region->mutex);
	else if (!mutex_trylock(&region->mutex))
		goto out_fail;

	mmu_interval_set_seq(mni, cur_seq);

	mstart = max(range->start, region->start_uaddr);
	mend = min(range->end, region->start_uaddr +
		   (region->nr_pages << HV_HYP_PAGE_SHIFT));

	page_offset = HVPFN_DOWN(mstart - region->start_uaddr);
	page_count = HVPFN_DOWN(mend - mstart);

	ret = mshv_region_remap_pages(region, HV_MAP_GPA_NO_ACCESS,
				      page_offset, page_count);
	if (ret)
		goto out_fail;

	mshv_region_invalidate_pages(region, page_offset, page_count);

	mutex_unlock(&region->mutex);

	return true;

out_fail:
	WARN_ONCE(ret,
		  "Failed to invalidate region %#llx-%#llx (range %#lx-%#lx, event: %u, pages %#llx-%#llx, mm: %#llx): %d\n",
		  region->start_uaddr,
		  region->start_uaddr + (region->nr_pages << HV_HYP_PAGE_SHIFT),
		  range->start, range->end, range->event,
		  page_offset, page_offset + page_count - 1, (u64)range->mm, ret);
	return false;
}

static const struct mmu_interval_notifier_ops mshv_region_mni_ops = {
	.invalidate = mshv_region_interval_invalidate,
};

void mshv_region_movable_fini(struct mshv_mem_region *region)
{
	mmu_interval_notifier_remove(&region->mni);
}

bool mshv_region_movable_init(struct mshv_mem_region *region)
{
	int ret;

	ret = mmu_interval_notifier_insert(&region->mni, current->mm,
					   region->start_uaddr,
					   region->nr_pages << HV_HYP_PAGE_SHIFT,
					   &mshv_region_mni_ops);
	if (ret)
		return false;

	mutex_init(&region->mutex);

	return true;
}
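The contiguous-chunk walk above (mshv_region_process_range feeding mshv_region_process_chunk) can be illustrated outside the kernel. Below is a minimal user-space C sketch, not the kernel code itself: a hypothetical `orders[]` array stands in for the region's pages array, where `orders[i]` plays the role of `folio_order()` for page `i` and `-1` marks a non-present page.

```c
#include <stddef.h>

/*
 * Length of the contiguous run starting at 'offset': entries must be
 * present and share the same order, and the scan steps by the page
 * stride (1 << order), as mshv_region_process_chunk does.
 * Returns -1 if the first entry is not present.
 */
static long chunk_len(const int *orders, size_t offset, size_t total)
{
	size_t stride, count;
	int order = orders[offset];

	if (order < 0)
		return -1;	/* first page must be present */

	stride = (size_t)1 << order;
	for (count = stride; offset + count < total; count += stride) {
		if (orders[offset + count] < 0)
			break;	/* run ends at a hole */
		if (orders[offset + count] != order)
			break;	/* run ends when the page size changes */
	}
	return (long)count;
}

/*
 * Walk a whole range, skipping holes and counting the chunks a handler
 * would be invoked for, mirroring mshv_region_process_range's loop.
 */
static long count_chunks(const int *orders, size_t total)
{
	size_t offset = 0;
	long chunks = 0;

	while (offset < total) {
		long n;

		if (orders[offset] < 0) {	/* skip non-present pages */
			offset++;
			continue;
		}
		n = chunk_len(orders, offset, total);
		if (n < 0)
			return -1;
		chunks++;
		offset += (size_t)n;
	}
	return chunks;
}
```

The stride-based step is why the kernel only needs to inspect one `struct page` per huge page: all tail entries of a 2M folio are known to be part of the same chunk.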
+41 -16
drivers/hv/mshv_root.h
···
 #include <linux/hashtable.h>
 #include <linux/dev_printk.h>
 #include <linux/build_bug.h>
+#include <linux/mmu_notifier.h>
 #include <uapi/linux/mshv.h>

 /*
···
 #define vp_info(v, fmt, ...) vp_devprintk(info, v, fmt, ##__VA_ARGS__)
 #define vp_dbg(v, fmt, ...) vp_devprintk(dbg, v, fmt, ##__VA_ARGS__)

+enum mshv_region_type {
+	MSHV_REGION_TYPE_MEM_PINNED,
+	MSHV_REGION_TYPE_MEM_MOVABLE,
+	MSHV_REGION_TYPE_MMIO
+};
+
 struct mshv_mem_region {
 	struct hlist_node hnode;
+	struct kref refcount;
 	u64 nr_pages;
 	u64 start_gfn;
 	u64 start_uaddr;
 	u32 hv_map_flags;
-	struct {
-		u64 large_pages: 1; /* 2MiB */
-		u64 range_pinned: 1;
-		u64 reserved: 62;
-	} flags;
 	struct mshv_partition *partition;
+	enum mshv_region_type type;
+	struct mmu_interval_notifier mni;
+	struct mutex mutex; /* protects region pages remapping */
 	struct page *pages[];
 };
···
 	u64 pt_id;
 	refcount_t pt_ref_count;
 	struct mutex pt_mutex;
+
+	spinlock_t pt_mem_regions_lock;
 	struct hlist_head pt_mem_regions; // not ordered

 	u32 pt_vp_count;
···
 };

 struct hv_synic_pages {
-	struct hv_message_page *synic_message_page;
+	struct hv_message_page *hyp_synic_message_page;
 	struct hv_synic_event_flags_page *synic_event_flags_page;
 	struct hv_synic_event_ring_page *synic_event_ring_page;
 };
···
 	struct hv_synic_pages __percpu *synic_pages;
 	spinlock_t pt_ht_lock;
 	DECLARE_HASHTABLE(pt_htable, MSHV_PARTITIONS_HASH_BITS);
+	struct hv_partition_property_vmm_capabilities vmm_caps;
 };

 /*
···
 		/* Choose between pages and bytes */
 		struct hv_vp_state_data state_data, u64 page_count,
 		struct page **pages, u32 num_bytes, u8 *bytes);
-int hv_call_map_vp_state_page(u64 partition_id, u32 vp_index, u32 type,
-			      union hv_input_vtl input_vtl,
-			      struct page **state_page);
-int hv_call_unmap_vp_state_page(u64 partition_id, u32 vp_index, u32 type,
-				union hv_input_vtl input_vtl);
+int hv_map_vp_state_page(u64 partition_id, u32 vp_index, u32 type,
+			 union hv_input_vtl input_vtl,
+			 struct page **state_page);
+int hv_unmap_vp_state_page(u64 partition_id, u32 vp_index, u32 type,
+			   struct page *state_page,
+			   union hv_input_vtl input_vtl);
 int hv_call_create_port(u64 port_partition_id, union hv_port_id port_id,
 			u64 connection_partition_id, struct hv_port_info *port_info,
 			u8 port_vtl, u8 min_connection_vtl, int node);
···
 int hv_call_disconnect_port(u64 connection_partition_id,
 			    union hv_connection_id connection_id);
 int hv_call_notify_port_ring_empty(u32 sint_index);
-int hv_call_map_stat_page(enum hv_stats_object_type type,
-			  const union hv_stats_object_identity *identity,
-			  void **addr);
-int hv_call_unmap_stat_page(enum hv_stats_object_type type,
-			    const union hv_stats_object_identity *identity);
+int hv_map_stats_page(enum hv_stats_object_type type,
+		      const union hv_stats_object_identity *identity,
+		      void **addr);
+int hv_unmap_stats_page(enum hv_stats_object_type type, void *page_addr,
+			const union hv_stats_object_identity *identity);
 int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
 				   u64 page_struct_count, u32 host_access,
 				   u32 flags, u8 acquire);
+int hv_call_get_partition_property_ex(u64 partition_id, u64 property_code, u64 arg,
+				      void *property_value, size_t property_value_sz);

 extern struct mshv_root mshv_root;
 extern enum hv_scheduler_type hv_scheduler_type;
 extern u8 * __percpu *hv_synic_eventring_tail;
+
+struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
+					   u64 uaddr, u32 flags);
+int mshv_region_share(struct mshv_mem_region *region);
+int mshv_region_unshare(struct mshv_mem_region *region);
+int mshv_region_map(struct mshv_mem_region *region);
+void mshv_region_invalidate(struct mshv_mem_region *region);
+int mshv_region_pin(struct mshv_mem_region *region);
+void mshv_region_put(struct mshv_mem_region *region);
+int mshv_region_get(struct mshv_mem_region *region);
+bool mshv_region_handle_gfn_fault(struct mshv_mem_region *region, u64 gfn);
+void mshv_region_movable_fini(struct mshv_mem_region *region);
+bool mshv_region_movable_init(struct mshv_mem_region *region);

 #endif /* _MSHV_ROOT_H_ */
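The new `struct kref refcount` plus `pt_mem_regions_lock` enable the lookup-then-get pattern: a reader finds a region under the spinlock and takes a reference only if the count has not already dropped to zero (i.e. destruction has not begun). A minimal user-space sketch of those `kref` semantics, using C11 atomics as a hypothetical stand-in for the kernel's implementation:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kref embedded in struct mshv_mem_region. */
struct ref {
	atomic_uint count;
};

static void ref_init(struct ref *r)
{
	atomic_store(&r->count, 1);	/* as after kref_init() */
}

/*
 * kref_get_unless_zero() semantics: take a reference only if the object
 * is not already on its way to destruction. This is what lets a lookup
 * race safely against the final mshv_region_put().
 */
static bool ref_get_unless_zero(struct ref *r)
{
	unsigned int old = atomic_load(&r->count);

	while (old != 0) {
		/* On failure, 'old' is reloaded and the loop retries. */
		if (atomic_compare_exchange_weak(&r->count, &old, old + 1))
			return true;
	}
	return false;
}

/* Returns true when the caller dropped the last reference (destroy now). */
static bool ref_put(struct ref *r)
{
	return atomic_fetch_sub(&r->count, 1) == 1;
}
```

Once the count reaches zero, no new reference can ever be taken, so a destructor running after the last put never races with a successful lookup.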
+185 -11
drivers/hv/mshv_root_hv_call.c
···
 	memset(input, 0, sizeof(*input));
 	input->partition_id = partition_id;
 	input->vector = vector;
+	/*
+	 * NOTE: dest_addr only needs to be provided while asserting an
+	 * interrupt on x86 platform
+	 */
+#if IS_ENABLED(CONFIG_X86)
 	input->dest_addr = dest_addr;
+#endif
 	input->control = control;
 	status = hv_do_hypercall(HVCALL_ASSERT_VIRTUAL_INTERRUPT, input, NULL);
 	local_irq_restore(flags);
···
 	return ret;
 }

-int hv_call_map_vp_state_page(u64 partition_id, u32 vp_index, u32 type,
-			      union hv_input_vtl input_vtl,
-			      struct page **state_page)
+static int hv_call_map_vp_state_page(u64 partition_id, u32 vp_index, u32 type,
+				     union hv_input_vtl input_vtl,
+				     struct page **state_page)
 {
 	struct hv_input_map_vp_state_page *input;
 	struct hv_output_map_vp_state_page *output;
···
 	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
 	output = *this_cpu_ptr(hyperv_pcpu_output_arg);

+	memset(input, 0, sizeof(*input));
 	input->partition_id = partition_id;
 	input->vp_index = vp_index;
 	input->type = type;
 	input->input_vtl = input_vtl;

-	status = hv_do_hypercall(HVCALL_MAP_VP_STATE_PAGE, input, output);
+	if (*state_page) {
+		input->flags.map_location_provided = 1;
+		input->requested_map_location =
+			page_to_pfn(*state_page);
+	}
+
+	status = hv_do_hypercall(HVCALL_MAP_VP_STATE_PAGE, input,
+				 output);

 	if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY) {
 		if (hv_result_success(status))
···
 	return ret;
 }

-int hv_call_unmap_vp_state_page(u64 partition_id, u32 vp_index, u32 type,
-				union hv_input_vtl input_vtl)
+static bool mshv_use_overlay_gpfn(void)
+{
+	return hv_l1vh_partition() &&
+	       mshv_root.vmm_caps.vmm_can_provide_overlay_gpfn;
+}
+
+int hv_map_vp_state_page(u64 partition_id, u32 vp_index, u32 type,
+			 union hv_input_vtl input_vtl,
+			 struct page **state_page)
+{
+	int ret = 0;
+	struct page *allocated_page = NULL;
+
+	if (mshv_use_overlay_gpfn()) {
+		allocated_page = alloc_page(GFP_KERNEL);
+		if (!allocated_page)
+			return -ENOMEM;
+		*state_page = allocated_page;
+	} else {
+		*state_page = NULL;
+	}
+
+	ret = hv_call_map_vp_state_page(partition_id, vp_index, type, input_vtl,
+					state_page);
+
+	if (ret && allocated_page) {
+		__free_page(allocated_page);
+		*state_page = NULL;
+	}
+
+	return ret;
+}
+
+static int hv_call_unmap_vp_state_page(u64 partition_id, u32 vp_index, u32 type,
+				       union hv_input_vtl input_vtl)
 {
 	unsigned long flags;
 	u64 status;
···
 	local_irq_restore(flags);

 	return hv_result_to_errno(status);
+}
+
+int hv_unmap_vp_state_page(u64 partition_id, u32 vp_index, u32 type,
+			   struct page *state_page, union hv_input_vtl input_vtl)
+{
+	int ret = hv_call_unmap_vp_state_page(partition_id, vp_index, type, input_vtl);
+
+	if (mshv_use_overlay_gpfn() && state_page)
+		__free_page(state_page);
+
+	return ret;
+}
+
+int hv_call_get_partition_property_ex(u64 partition_id, u64 property_code,
+				      u64 arg, void *property_value,
+				      size_t property_value_sz)
+{
+	u64 status;
+	unsigned long flags;
+	struct hv_input_get_partition_property_ex *input;
+	struct hv_output_get_partition_property_ex *output;
+
+	local_irq_save(flags);
+	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+	output = *this_cpu_ptr(hyperv_pcpu_output_arg);
+
+	memset(input, 0, sizeof(*input));
+	input->partition_id = partition_id;
+	input->property_code = property_code;
+	input->arg = arg;
+	status = hv_do_hypercall(HVCALL_GET_PARTITION_PROPERTY_EX, input, output);
+
+	if (!hv_result_success(status)) {
+		local_irq_restore(flags);
+		hv_status_debug(status, "\n");
+		return hv_result_to_errno(status);
+	}
+	memcpy(property_value, &output->property_value, property_value_sz);
+
+	local_irq_restore(flags);
+
+	return 0;
 }

 int
···
 	return hv_result_to_errno(status);
 }

-int hv_call_map_stat_page(enum hv_stats_object_type type,
-			  const union hv_stats_object_identity *identity,
-			  void **addr)
+static int hv_call_map_stats_page2(enum hv_stats_object_type type,
+				   const union hv_stats_object_identity *identity,
+				   u64 map_location)
+{
+	unsigned long flags;
+	struct hv_input_map_stats_page2 *input;
+	u64 status;
+	int ret;
+
+	if (!map_location || !mshv_use_overlay_gpfn())
+		return -EINVAL;
+
+	do {
+		local_irq_save(flags);
+		input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+
+		memset(input, 0, sizeof(*input));
+		input->type = type;
+		input->identity = *identity;
+		input->map_location = map_location;
+
+		status = hv_do_hypercall(HVCALL_MAP_STATS_PAGE2, input, NULL);
+
+		local_irq_restore(flags);
+
+		ret = hv_result_to_errno(status);
+
+		if (!ret)
+			break;
+
+		if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY) {
+			hv_status_debug(status, "\n");
+			break;
+		}
+
+		ret = hv_call_deposit_pages(NUMA_NO_NODE,
+					    hv_current_partition_id, 1);
+	} while (!ret);
+
+	return ret;
+}
+
+static int hv_call_map_stats_page(enum hv_stats_object_type type,
+				  const union hv_stats_object_identity *identity,
+				  void **addr)
 {
 	unsigned long flags;
 	struct hv_input_map_stats_page *input;
···
 	return ret;
 }

-int hv_call_unmap_stat_page(enum hv_stats_object_type type,
-			    const union hv_stats_object_identity *identity)
+int hv_map_stats_page(enum hv_stats_object_type type,
+		      const union hv_stats_object_identity *identity,
+		      void **addr)
+{
+	int ret;
+	struct page *allocated_page = NULL;
+
+	if (!addr)
+		return -EINVAL;
+
+	if (mshv_use_overlay_gpfn()) {
+		allocated_page = alloc_page(GFP_KERNEL);
+		if (!allocated_page)
+			return -ENOMEM;
+
+		ret = hv_call_map_stats_page2(type, identity,
+					      page_to_pfn(allocated_page));
+		*addr = page_address(allocated_page);
+	} else {
+		ret = hv_call_map_stats_page(type, identity, addr);
+	}
+
+	if (ret && allocated_page) {
+		__free_page(allocated_page);
+		*addr = NULL;
+	}
+
+	return ret;
+}
+
+static int hv_call_unmap_stats_page(enum hv_stats_object_type type,
+				    const union hv_stats_object_identity *identity)
 {
 	unsigned long flags;
 	struct hv_input_unmap_stats_page *input;
···
 	local_irq_restore(flags);

 	return hv_result_to_errno(status);
+}
+
+int hv_unmap_stats_page(enum hv_stats_object_type type, void *page_addr,
+			const union hv_stats_object_identity *identity)
+{
+	int ret;
+
+	ret = hv_call_unmap_stats_page(type, identity);
+
+	if (mshv_use_overlay_gpfn() && page_addr)
+		__free_page(virt_to_page(page_addr));
+
+	return ret;
 }

 int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
+409 -340
drivers/hv/mshv_root_main.c
···
 /* TODO move this to another file when debugfs code is added */
 enum hv_stats_vp_counters { /* HV_THREAD_COUNTER */
 #if defined(CONFIG_X86)
-	VpRootDispatchThreadBlocked = 201,
+	VpRootDispatchThreadBlocked = 202,
 #elif defined(CONFIG_ARM64)
 	VpRootDispatchThreadBlocked = 94,
 #endif
···
  */
 static u16 mshv_passthru_hvcalls[] = {
 	HVCALL_GET_PARTITION_PROPERTY,
+	HVCALL_GET_PARTITION_PROPERTY_EX,
 	HVCALL_SET_PARTITION_PROPERTY,
 	HVCALL_INSTALL_INTERCEPT,
 	HVCALL_GET_VP_REGISTERS,
···
 	HVCALL_GET_VP_CPUID_VALUES,
 };

+/*
+ * Only allow hypercalls that are safe to be called by the VMM with the host
+ * partition as target (i.e. HV_PARTITION_ID_SELF). Carefully audit that a
+ * hypercall cannot be misused by the VMM before adding it to this list.
+ */
+static u16 mshv_self_passthru_hvcalls[] = {
+	HVCALL_GET_PARTITION_PROPERTY,
+	HVCALL_GET_PARTITION_PROPERTY_EX,
+};
+
 static bool mshv_hvcall_is_async(u16 code)
 {
 	switch (code) {
···
 	return false;
 }

+static bool mshv_passthru_hvcall_allowed(u16 code, u64 pt_id)
+{
+	int i;
+	int n = ARRAY_SIZE(mshv_passthru_hvcalls);
+	u16 *allowed_hvcalls = mshv_passthru_hvcalls;
+
+	if (pt_id == HV_PARTITION_ID_SELF) {
+		n = ARRAY_SIZE(mshv_self_passthru_hvcalls);
+		allowed_hvcalls = mshv_self_passthru_hvcalls;
+	}
+
+	for (i = 0; i < n; ++i)
+		if (allowed_hvcalls[i] == code)
+			return true;
+
+	return false;
+}
+
 static int mshv_ioctl_passthru_hvcall(struct mshv_partition *partition,
 				      bool partition_locked,
 				      void __user *user_args)
 {
 	u64 status;
-	int ret = 0, i;
+	int ret = 0;
 	bool is_async;
 	struct mshv_root_hvcall args;
 	struct page *page;
 	unsigned int pages_order;
 	void *input_pg = NULL;
 	void *output_pg = NULL;
+	u16 reps_completed;
+	u64 pt_id = partition ? partition->pt_id : HV_PARTITION_ID_SELF;

 	if (copy_from_user(&args, user_args, sizeof(args)))
 		return -EFAULT;
···
 	if (args.out_ptr && (!args.out_sz || args.out_sz > HV_HYP_PAGE_SIZE))
 		return -EINVAL;

-	for (i = 0; i < ARRAY_SIZE(mshv_passthru_hvcalls); ++i)
-		if (args.code == mshv_passthru_hvcalls[i])
-			break;
-
-	if (i >= ARRAY_SIZE(mshv_passthru_hvcalls))
+	if (!mshv_passthru_hvcall_allowed(args.code, pt_id))
 		return -EINVAL;

 	is_async = mshv_hvcall_is_async(args.code);
 	if (is_async) {
 		/* async hypercalls can only be called from partition fd */
-		if (!partition_locked)
+		if (!partition || !partition_locked)
 			return -EINVAL;
 		ret = mshv_init_async_handler(partition);
 		if (ret)
···
 	 * NOTE: This only works because all the allowed hypercalls' input
 	 * structs begin with a u64 partition_id field.
 	 */
-	*(u64 *)input_pg = partition->pt_id;
+	*(u64 *)input_pg = pt_id;

-	if (args.reps)
-		status = hv_do_rep_hypercall(args.code, args.reps, 0,
-					     input_pg, output_pg);
-	else
-		status = hv_do_hypercall(args.code, input_pg, output_pg);
-
-	if (hv_result(status) == HV_STATUS_CALL_PENDING) {
-		if (is_async) {
-			mshv_async_hvcall_handler(partition, &status);
-		} else { /* Paranoia check. This shouldn't happen! */
-			ret = -EBADFD;
-			goto free_pages_out;
+	reps_completed = 0;
+	do {
+		if (args.reps) {
+			status = hv_do_rep_hypercall_ex(args.code, args.reps,
+							0, reps_completed,
+							input_pg, output_pg);
+			reps_completed = hv_repcomp(status);
+		} else {
+			status = hv_do_hypercall(args.code, input_pg, output_pg);
 		}
-	}

-	if (hv_result(status) == HV_STATUS_INSUFFICIENT_MEMORY) {
-		ret = hv_call_deposit_pages(NUMA_NO_NODE, partition->pt_id, 1);
-		if (!ret)
-			ret = -EAGAIN;
-	} else if (!hv_result_success(status)) {
-		ret = hv_result_to_errno(status);
-	}
+		if (hv_result(status) == HV_STATUS_CALL_PENDING) {
+			if (is_async) {
+				mshv_async_hvcall_handler(partition, &status);
+			} else { /* Paranoia check. This shouldn't happen! */
+				ret = -EBADFD;
+				goto free_pages_out;
+			}
+		}

-	/*
-	 * Always return the status and output data regardless of result.
-	 * The VMM may need it to determine how to proceed. E.g. the status may
-	 * contain the number of reps completed if a rep hypercall partially
-	 * succeeded.
-	 */
+		if (hv_result_success(status))
+			break;
+
+		if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY)
+			ret = hv_result_to_errno(status);
+		else
+			ret = hv_call_deposit_pages(NUMA_NO_NODE,
+						    pt_id, 1);
+	} while (!ret);
+
 	args.status = hv_result(status);
-	args.reps = args.reps ? hv_repcomp(status) : 0;
+	args.reps = reps_completed;
 	if (copy_to_user(user_args, &args, sizeof(args)))
 		ret = -EFAULT;

-	if (output_pg &&
+	if (!ret && output_pg &&
 	    copy_to_user((void __user *)args.out_ptr, output_pg, args.out_sz))
 		ret = -EFAULT;
···
 static_assert(sizeof(struct hv_message) <= MSHV_RUN_VP_BUF_SZ,
 	      "sizeof(struct hv_message) must not exceed MSHV_RUN_VP_BUF_SZ");

+static struct mshv_mem_region *
+mshv_partition_region_by_gfn(struct mshv_partition *partition, u64 gfn)
+{
+	struct mshv_mem_region *region;
+
+	hlist_for_each_entry(region, &partition->pt_mem_regions, hnode) {
+		if (gfn >= region->start_gfn &&
+		    gfn < region->start_gfn + region->nr_pages)
+			return region;
+	}
+
+	return NULL;
+}
+
+#ifdef CONFIG_X86_64
+static struct mshv_mem_region *
+mshv_partition_region_by_gfn_get(struct mshv_partition *p, u64 gfn)
+{
+	struct mshv_mem_region *region;
+
+	spin_lock(&p->pt_mem_regions_lock);
+	region = mshv_partition_region_by_gfn(p, gfn);
+	if (!region || !mshv_region_get(region)) {
+		spin_unlock(&p->pt_mem_regions_lock);
+		return NULL;
+	}
+	spin_unlock(&p->pt_mem_regions_lock);
+
+	return region;
+}
+
+/**
+ * mshv_handle_gpa_intercept - Handle GPA (Guest Physical Address) intercepts.
+ * @vp: Pointer to the virtual processor structure.
+ *
+ * This function processes GPA intercepts by identifying the memory region
+ * corresponding to the intercepted GPA, aligning the page offset, and
+ * mapping the required pages. It ensures that the region is valid and
+ * handles faults efficiently by mapping multiple pages at once.
+ *
+ * Return: true if the intercept was handled successfully, false otherwise.
+ */
+static bool mshv_handle_gpa_intercept(struct mshv_vp *vp)
+{
+	struct mshv_partition *p = vp->vp_partition;
+	struct mshv_mem_region *region;
+	struct hv_x64_memory_intercept_message *msg;
+	bool ret;
+	u64 gfn;
+
+	msg = (struct hv_x64_memory_intercept_message *)
+		vp->vp_intercept_msg_page->u.payload;
+
+	gfn = HVPFN_DOWN(msg->guest_physical_address);
+
+	region = mshv_partition_region_by_gfn_get(p, gfn);
+	if (!region)
+		return false;
+
+	/* Only movable memory ranges are supported for GPA intercepts */
+	if (region->type == MSHV_REGION_TYPE_MEM_MOVABLE)
+		ret = mshv_region_handle_gfn_fault(region, gfn);
+	else
+		ret = false;
+
+	mshv_region_put(region);
+
+	return ret;
+}
+#else /* CONFIG_X86_64 */
+static bool mshv_handle_gpa_intercept(struct mshv_vp *vp) { return false; }
+#endif /* CONFIG_X86_64 */
+
+static bool mshv_vp_handle_intercept(struct mshv_vp *vp)
+{
+	switch (vp->vp_intercept_msg_page->header.message_type) {
+	case HVMSG_GPA_INTERCEPT:
+		return mshv_handle_gpa_intercept(vp);
+	}
+	return false;
+}
+
 static long mshv_vp_ioctl_run_vp(struct mshv_vp *vp, void __user *ret_msg)
 {
 	long rc;

-	if (hv_scheduler_type == HV_SCHEDULER_TYPE_ROOT)
-		rc = mshv_run_vp_with_root_scheduler(vp);
-	else
-		rc = mshv_run_vp_with_hyp_scheduler(vp);
+	do {
+		if (hv_scheduler_type == HV_SCHEDULER_TYPE_ROOT)
+			rc = mshv_run_vp_with_root_scheduler(vp);
+		else
+			rc = mshv_run_vp_with_hyp_scheduler(vp);
+	} while (rc == 0 && mshv_vp_handle_intercept(vp));

 	if (rc)
 		return rc;
···
 	return 0;
 }

-static void mshv_vp_stats_unmap(u64 partition_id, u32 vp_index)
+static void mshv_vp_stats_unmap(u64 partition_id, u32 vp_index,
+				void *stats_pages[])
 {
 	union hv_stats_object_identity identity = {
 		.vp.partition_id = partition_id,
···
 	};

 	identity.vp.stats_area_type = HV_STATS_AREA_SELF;
-	hv_call_unmap_stat_page(HV_STATS_OBJECT_VP, &identity);
+	hv_unmap_stats_page(HV_STATS_OBJECT_VP, NULL, &identity);

 	identity.vp.stats_area_type = HV_STATS_AREA_PARENT;
-	hv_call_unmap_stat_page(HV_STATS_OBJECT_VP, &identity);
+	hv_unmap_stats_page(HV_STATS_OBJECT_VP, NULL, &identity);
 }

 static int mshv_vp_stats_map(u64 partition_id, u32 vp_index,
···
 	int err;

 	identity.vp.stats_area_type = HV_STATS_AREA_SELF;
-	err = hv_call_map_stat_page(HV_STATS_OBJECT_VP, &identity,
-				    &stats_pages[HV_STATS_AREA_SELF]);
+	err = hv_map_stats_page(HV_STATS_OBJECT_VP, &identity,
+				&stats_pages[HV_STATS_AREA_SELF]);
 	if (err)
 		return err;

 	identity.vp.stats_area_type = HV_STATS_AREA_PARENT;
-	err = hv_call_map_stat_page(HV_STATS_OBJECT_VP, &identity,
-				    &stats_pages[HV_STATS_AREA_PARENT]);
+	err = hv_map_stats_page(HV_STATS_OBJECT_VP, &identity,
+				&stats_pages[HV_STATS_AREA_PARENT]);
 	if (err)
 		goto unmap_self;
···

 unmap_self:
 	identity.vp.stats_area_type = HV_STATS_AREA_SELF;
-	hv_call_unmap_stat_page(HV_STATS_OBJECT_VP, &identity);
+	hv_unmap_stats_page(HV_STATS_OBJECT_VP, NULL, &identity);
 	return err;
 }
···
 {
 	struct mshv_create_vp args;
 	struct mshv_vp *vp;
-	struct page *intercept_message_page, *register_page, *ghcb_page;
+	struct page *intercept_msg_page, *register_page, *ghcb_page;
 	void *stats_pages[2];
 	long ret;
···
 	if (ret)
 		return ret;

-	ret = hv_call_map_vp_state_page(partition->pt_id, args.vp_index,
-					HV_VP_STATE_PAGE_INTERCEPT_MESSAGE,
-					input_vtl_zero,
-					&intercept_message_page);
+	ret =
hv_map_vp_state_page(partition->pt_id, args.vp_index, 915 + HV_VP_STATE_PAGE_INTERCEPT_MESSAGE, 916 + input_vtl_zero, &intercept_msg_page); 1031 917 if (ret) 1032 918 goto destroy_vp; 1033 919 1034 920 if (!mshv_partition_encrypted(partition)) { 1035 - ret = hv_call_map_vp_state_page(partition->pt_id, args.vp_index, 1036 - HV_VP_STATE_PAGE_REGISTERS, 1037 - input_vtl_zero, 1038 - &register_page); 921 + ret = hv_map_vp_state_page(partition->pt_id, args.vp_index, 922 + HV_VP_STATE_PAGE_REGISTERS, 923 + input_vtl_zero, &register_page); 1039 924 if (ret) 1040 925 goto unmap_intercept_message_page; 1041 926 } 1042 927 1043 928 if (mshv_partition_encrypted(partition) && 1044 929 is_ghcb_mapping_available()) { 1045 - ret = hv_call_map_vp_state_page(partition->pt_id, args.vp_index, 1046 - HV_VP_STATE_PAGE_GHCB, 1047 - input_vtl_normal, 1048 - &ghcb_page); 930 + ret = hv_map_vp_state_page(partition->pt_id, args.vp_index, 931 + HV_VP_STATE_PAGE_GHCB, 932 + input_vtl_normal, &ghcb_page); 1049 933 if (ret) 1050 934 goto unmap_register_page; 1051 935 } 1052 936 1053 - if (hv_parent_partition()) { 937 + /* 938 + * This mapping of the stats page is for detecting if dispatch thread 939 + * is blocked - only relevant for root scheduler 940 + */ 941 + if (hv_scheduler_type == HV_SCHEDULER_TYPE_ROOT) { 1054 942 ret = mshv_vp_stats_map(partition->pt_id, args.vp_index, 1055 943 stats_pages); 1056 944 if (ret) ··· 1073 959 atomic64_set(&vp->run.vp_signaled_count, 0); 1074 960 1075 961 vp->vp_index = args.vp_index; 1076 - vp->vp_intercept_msg_page = page_to_virt(intercept_message_page); 962 + vp->vp_intercept_msg_page = page_to_virt(intercept_msg_page); 1077 963 if (!mshv_partition_encrypted(partition)) 1078 964 vp->vp_register_page = page_to_virt(register_page); 1079 965 1080 966 if (mshv_partition_encrypted(partition) && is_ghcb_mapping_available()) 1081 967 vp->vp_ghcb_page = page_to_virt(ghcb_page); 1082 968 1083 - if (hv_parent_partition()) 969 + if (hv_scheduler_type == 
HV_SCHEDULER_TYPE_ROOT) 1084 970 memcpy(vp->vp_stats_pages, stats_pages, sizeof(stats_pages)); 1085 971 1086 972 /* ··· 1103 989 free_vp: 1104 990 kfree(vp); 1105 991 unmap_stats_pages: 1106 - if (hv_parent_partition()) 1107 - mshv_vp_stats_unmap(partition->pt_id, args.vp_index); 992 + if (hv_scheduler_type == HV_SCHEDULER_TYPE_ROOT) 993 + mshv_vp_stats_unmap(partition->pt_id, args.vp_index, stats_pages); 1108 994 unmap_ghcb_page: 1109 - if (mshv_partition_encrypted(partition) && is_ghcb_mapping_available()) { 1110 - hv_call_unmap_vp_state_page(partition->pt_id, args.vp_index, 1111 - HV_VP_STATE_PAGE_GHCB, 1112 - input_vtl_normal); 1113 - } 995 + if (mshv_partition_encrypted(partition) && is_ghcb_mapping_available()) 996 + hv_unmap_vp_state_page(partition->pt_id, args.vp_index, 997 + HV_VP_STATE_PAGE_GHCB, ghcb_page, 998 + input_vtl_normal); 1114 999 unmap_register_page: 1115 - if (!mshv_partition_encrypted(partition)) { 1116 - hv_call_unmap_vp_state_page(partition->pt_id, args.vp_index, 1117 - HV_VP_STATE_PAGE_REGISTERS, 1118 - input_vtl_zero); 1119 - } 1000 + if (!mshv_partition_encrypted(partition)) 1001 + hv_unmap_vp_state_page(partition->pt_id, args.vp_index, 1002 + HV_VP_STATE_PAGE_REGISTERS, 1003 + register_page, input_vtl_zero); 1120 1004 unmap_intercept_message_page: 1121 - hv_call_unmap_vp_state_page(partition->pt_id, args.vp_index, 1122 - HV_VP_STATE_PAGE_INTERCEPT_MESSAGE, 1123 - input_vtl_zero); 1005 + hv_unmap_vp_state_page(partition->pt_id, args.vp_index, 1006 + HV_VP_STATE_PAGE_INTERCEPT_MESSAGE, 1007 + intercept_msg_page, input_vtl_zero); 1124 1008 destroy_vp: 1125 1009 hv_call_delete_vp(partition->pt_id, args.vp_index); 1126 1010 return ret; ··· 1146 1034 *status = partition->async_hypercall_status; 1147 1035 } 1148 1036 1149 - static int 1150 - mshv_partition_region_share(struct mshv_mem_region *region) 1151 - { 1152 - u32 flags = HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_SHARED; 1153 - 1154 - if (region->flags.large_pages) 1155 - flags |= 
HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE; 1156 - 1157 - return hv_call_modify_spa_host_access(region->partition->pt_id, 1158 - region->pages, region->nr_pages, 1159 - HV_MAP_GPA_READABLE | HV_MAP_GPA_WRITABLE, 1160 - flags, true); 1161 - } 1162 - 1163 - static int 1164 - mshv_partition_region_unshare(struct mshv_mem_region *region) 1165 - { 1166 - u32 flags = HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_EXCLUSIVE; 1167 - 1168 - if (region->flags.large_pages) 1169 - flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE; 1170 - 1171 - return hv_call_modify_spa_host_access(region->partition->pt_id, 1172 - region->pages, region->nr_pages, 1173 - 0, 1174 - flags, false); 1175 - } 1176 - 1177 - static int 1178 - mshv_region_remap_pages(struct mshv_mem_region *region, u32 map_flags, 1179 - u64 page_offset, u64 page_count) 1180 - { 1181 - if (page_offset + page_count > region->nr_pages) 1182 - return -EINVAL; 1183 - 1184 - if (region->flags.large_pages) 1185 - map_flags |= HV_MAP_GPA_LARGE_PAGE; 1186 - 1187 - /* ask the hypervisor to map guest ram */ 1188 - return hv_call_map_gpa_pages(region->partition->pt_id, 1189 - region->start_gfn + page_offset, 1190 - page_count, map_flags, 1191 - region->pages + page_offset); 1192 - } 1193 - 1194 - static int 1195 - mshv_region_map(struct mshv_mem_region *region) 1196 - { 1197 - u32 map_flags = region->hv_map_flags; 1198 - 1199 - return mshv_region_remap_pages(region, map_flags, 1200 - 0, region->nr_pages); 1201 - } 1202 - 1203 - static void 1204 - mshv_region_evict_pages(struct mshv_mem_region *region, 1205 - u64 page_offset, u64 page_count) 1206 - { 1207 - if (region->flags.range_pinned) 1208 - unpin_user_pages(region->pages + page_offset, page_count); 1209 - 1210 - memset(region->pages + page_offset, 0, 1211 - page_count * sizeof(struct page *)); 1212 - } 1213 - 1214 - static void 1215 - mshv_region_evict(struct mshv_mem_region *region) 1216 - { 1217 - mshv_region_evict_pages(region, 0, region->nr_pages); 1218 - } 1219 - 1220 - static int 
1221 - mshv_region_populate_pages(struct mshv_mem_region *region, 1222 - u64 page_offset, u64 page_count) 1223 - { 1224 - u64 done_count, nr_pages; 1225 - struct page **pages; 1226 - __u64 userspace_addr; 1227 - int ret; 1228 - 1229 - if (page_offset + page_count > region->nr_pages) 1230 - return -EINVAL; 1231 - 1232 - for (done_count = 0; done_count < page_count; done_count += ret) { 1233 - pages = region->pages + page_offset + done_count; 1234 - userspace_addr = region->start_uaddr + 1235 - (page_offset + done_count) * 1236 - HV_HYP_PAGE_SIZE; 1237 - nr_pages = min(page_count - done_count, 1238 - MSHV_PIN_PAGES_BATCH_SIZE); 1239 - 1240 - /* 1241 - * Pinning assuming 4k pages works for large pages too. 1242 - * All page structs within the large page are returned. 1243 - * 1244 - * Pin requests are batched because pin_user_pages_fast 1245 - * with the FOLL_LONGTERM flag does a large temporary 1246 - * allocation of contiguous memory. 1247 - */ 1248 - if (region->flags.range_pinned) 1249 - ret = pin_user_pages_fast(userspace_addr, 1250 - nr_pages, 1251 - FOLL_WRITE | FOLL_LONGTERM, 1252 - pages); 1253 - else 1254 - ret = -EOPNOTSUPP; 1255 - 1256 - if (ret < 0) 1257 - goto release_pages; 1258 - } 1259 - 1260 - if (PageHuge(region->pages[page_offset])) 1261 - region->flags.large_pages = true; 1262 - 1263 - return 0; 1264 - 1265 - release_pages: 1266 - mshv_region_evict_pages(region, page_offset, done_count); 1267 - return ret; 1268 - } 1269 - 1270 - static int 1271 - mshv_region_populate(struct mshv_mem_region *region) 1272 - { 1273 - return mshv_region_populate_pages(region, 0, region->nr_pages); 1274 - } 1275 - 1276 - static struct mshv_mem_region * 1277 - mshv_partition_region_by_gfn(struct mshv_partition *partition, u64 gfn) 1278 - { 1279 - struct mshv_mem_region *region; 1280 - 1281 - hlist_for_each_entry(region, &partition->pt_mem_regions, hnode) { 1282 - if (gfn >= region->start_gfn && 1283 - gfn < region->start_gfn + region->nr_pages) 1284 - return region; 
1285 - } 1286 - 1287 - return NULL; 1288 - } 1289 - 1290 - static struct mshv_mem_region * 1291 - mshv_partition_region_by_uaddr(struct mshv_partition *partition, u64 uaddr) 1292 - { 1293 - struct mshv_mem_region *region; 1294 - 1295 - hlist_for_each_entry(region, &partition->pt_mem_regions, hnode) { 1296 - if (uaddr >= region->start_uaddr && 1297 - uaddr < region->start_uaddr + 1298 - (region->nr_pages << HV_HYP_PAGE_SHIFT)) 1299 - return region; 1300 - } 1301 - 1302 - return NULL; 1303 - } 1304 - 1305 1037 /* 1306 1038 * NB: caller checks and makes sure mem->size is page aligned 1307 1039 * Returns: 0 with regionpp updated on success, or -errno ··· 1155 1199 struct mshv_mem_region **regionpp, 1156 1200 bool is_mmio) 1157 1201 { 1158 - struct mshv_mem_region *region; 1202 + struct mshv_mem_region *rg; 1159 1203 u64 nr_pages = HVPFN_DOWN(mem->size); 1160 1204 1161 1205 /* Reject overlapping regions */ 1162 - if (mshv_partition_region_by_gfn(partition, mem->guest_pfn) || 1163 - mshv_partition_region_by_gfn(partition, mem->guest_pfn + nr_pages - 1) || 1164 - mshv_partition_region_by_uaddr(partition, mem->userspace_addr) || 1165 - mshv_partition_region_by_uaddr(partition, mem->userspace_addr + mem->size - 1)) 1206 + spin_lock(&partition->pt_mem_regions_lock); 1207 + hlist_for_each_entry(rg, &partition->pt_mem_regions, hnode) { 1208 + if (mem->guest_pfn + nr_pages <= rg->start_gfn || 1209 + rg->start_gfn + rg->nr_pages <= mem->guest_pfn) 1210 + continue; 1211 + spin_unlock(&partition->pt_mem_regions_lock); 1166 1212 return -EEXIST; 1213 + } 1214 + spin_unlock(&partition->pt_mem_regions_lock); 1167 1215 1168 - region = vzalloc(sizeof(*region) + sizeof(struct page *) * nr_pages); 1169 - if (!region) 1170 - return -ENOMEM; 1216 + rg = mshv_region_create(mem->guest_pfn, nr_pages, 1217 + mem->userspace_addr, mem->flags); 1218 + if (IS_ERR(rg)) 1219 + return PTR_ERR(rg); 1171 1220 1172 - region->nr_pages = nr_pages; 1173 - region->start_gfn = mem->guest_pfn; 1174 - 
region->start_uaddr = mem->userspace_addr; 1175 - region->hv_map_flags = HV_MAP_GPA_READABLE | HV_MAP_GPA_ADJUSTABLE; 1176 - if (mem->flags & BIT(MSHV_SET_MEM_BIT_WRITABLE)) 1177 - region->hv_map_flags |= HV_MAP_GPA_WRITABLE; 1178 - if (mem->flags & BIT(MSHV_SET_MEM_BIT_EXECUTABLE)) 1179 - region->hv_map_flags |= HV_MAP_GPA_EXECUTABLE; 1221 + if (is_mmio) 1222 + rg->type = MSHV_REGION_TYPE_MMIO; 1223 + else if (mshv_partition_encrypted(partition) || 1224 + !mshv_region_movable_init(rg)) 1225 + rg->type = MSHV_REGION_TYPE_MEM_PINNED; 1226 + else 1227 + rg->type = MSHV_REGION_TYPE_MEM_MOVABLE; 1180 1228 1181 - /* Note: large_pages flag populated when we pin the pages */ 1182 - if (!is_mmio) 1183 - region->flags.range_pinned = true; 1229 + rg->partition = partition; 1184 1230 1185 - region->partition = partition; 1186 - 1187 - *regionpp = region; 1231 + *regionpp = rg; 1188 1232 1189 1233 return 0; 1190 1234 } 1191 1235 1192 - /* 1193 - * Map guest ram. if snp, make sure to release that from the host first 1194 - * Side Effects: In case of failure, pages are unpinned when feasible. 1236 + /** 1237 + * mshv_prepare_pinned_region - Pin and map memory regions 1238 + * @region: Pointer to the memory region structure 1239 + * 1240 + * This function processes memory regions that are explicitly marked as pinned. 1241 + * Pinned regions are preallocated, mapped upfront, and do not rely on fault-based 1242 + * population. The function ensures the region is properly populated, handles 1243 + * encryption requirements for SNP partitions if applicable, maps the region, 1244 + * and performs necessary sharing or eviction operations based on the mapping 1245 + * result. 1246 + * 1247 + * Return: 0 on success, negative error code on failure. 
1195 1248 */ 1196 - static int 1197 - mshv_partition_mem_region_map(struct mshv_mem_region *region) 1249 + static int mshv_prepare_pinned_region(struct mshv_mem_region *region) 1198 1250 { 1199 1251 struct mshv_partition *partition = region->partition; 1200 1252 int ret; 1201 1253 1202 - ret = mshv_region_populate(region); 1254 + ret = mshv_region_pin(region); 1203 1255 if (ret) { 1204 - pt_err(partition, "Failed to populate memory region: %d\n", 1256 + pt_err(partition, "Failed to pin memory region: %d\n", 1205 1257 ret); 1206 1258 goto err_out; 1207 1259 } ··· 1222 1258 * access to guest memory regions. 1223 1259 */ 1224 1260 if (mshv_partition_encrypted(partition)) { 1225 - ret = mshv_partition_region_unshare(region); 1261 + ret = mshv_region_unshare(region); 1226 1262 if (ret) { 1227 1263 pt_err(partition, 1228 1264 "Failed to unshare memory region (guest_pfn: %llu): %d\n", 1229 1265 region->start_gfn, ret); 1230 - goto evict_region; 1266 + goto invalidate_region; 1231 1267 } 1232 1268 } 1233 1269 ··· 1235 1271 if (ret && mshv_partition_encrypted(partition)) { 1236 1272 int shrc; 1237 1273 1238 - shrc = mshv_partition_region_share(region); 1274 + shrc = mshv_region_share(region); 1239 1275 if (!shrc) 1240 - goto evict_region; 1276 + goto invalidate_region; 1241 1277 1242 1278 pt_err(partition, 1243 1279 "Failed to share memory region (guest_pfn: %llu): %d\n", ··· 1251 1287 1252 1288 return 0; 1253 1289 1254 - evict_region: 1255 - mshv_region_evict(region); 1290 + invalidate_region: 1291 + mshv_region_invalidate(region); 1256 1292 err_out: 1257 1293 return ret; 1258 1294 } ··· 1297 1333 if (ret) 1298 1334 return ret; 1299 1335 1300 - if (is_mmio) 1301 - ret = hv_call_map_mmio_pages(partition->pt_id, mem.guest_pfn, 1302 - mmio_pfn, HVPFN_DOWN(mem.size)); 1303 - else 1304 - ret = mshv_partition_mem_region_map(region); 1336 + switch (region->type) { 1337 + case MSHV_REGION_TYPE_MEM_PINNED: 1338 + ret = mshv_prepare_pinned_region(region); 1339 + break; 1340 + case 
MSHV_REGION_TYPE_MEM_MOVABLE: 1341 + /* 1342 + * For movable memory regions, remap with no access to let 1343 + * the hypervisor track dirty pages, enabling pre-copy live 1344 + * migration. 1345 + */ 1346 + ret = hv_call_map_gpa_pages(partition->pt_id, 1347 + region->start_gfn, 1348 + region->nr_pages, 1349 + HV_MAP_GPA_NO_ACCESS, NULL); 1350 + break; 1351 + case MSHV_REGION_TYPE_MMIO: 1352 + ret = hv_call_map_mmio_pages(partition->pt_id, 1353 + region->start_gfn, 1354 + mmio_pfn, 1355 + region->nr_pages); 1356 + break; 1357 + } 1305 1358 1306 1359 if (ret) 1307 1360 goto errout; 1308 1361 1309 - /* Install the new region */ 1362 + spin_lock(&partition->pt_mem_regions_lock); 1310 1363 hlist_add_head(&region->hnode, &partition->pt_mem_regions); 1364 + spin_unlock(&partition->pt_mem_regions_lock); 1311 1365 1312 1366 return 0; 1313 1367 ··· 1340 1358 struct mshv_user_mem_region mem) 1341 1359 { 1342 1360 struct mshv_mem_region *region; 1343 - u32 unmap_flags = 0; 1344 1361 1345 1362 if (!(mem.flags & BIT(MSHV_SET_MEM_BIT_UNMAP))) 1346 1363 return -EINVAL; 1347 1364 1365 + spin_lock(&partition->pt_mem_regions_lock); 1366 + 1348 1367 region = mshv_partition_region_by_gfn(partition, mem.guest_pfn); 1349 - if (!region) 1350 - return -EINVAL; 1368 + if (!region) { 1369 + spin_unlock(&partition->pt_mem_regions_lock); 1370 + return -ENOENT; 1371 + } 1351 1372 1352 1373 /* Paranoia check */ 1353 1374 if (region->start_uaddr != mem.userspace_addr || 1354 1375 region->start_gfn != mem.guest_pfn || 1355 - region->nr_pages != HVPFN_DOWN(mem.size)) 1376 + region->nr_pages != HVPFN_DOWN(mem.size)) { 1377 + spin_unlock(&partition->pt_mem_regions_lock); 1356 1378 return -EINVAL; 1379 + } 1357 1380 1358 1381 hlist_del(&region->hnode); 1359 1382 1360 - if (region->flags.large_pages) 1361 - unmap_flags |= HV_UNMAP_GPA_LARGE_PAGE; 1383 + spin_unlock(&partition->pt_mem_regions_lock); 1362 1384 1363 - /* ignore unmap failures and continue as process may be exiting */ 1364 - 
hv_call_unmap_gpa_pages(partition->pt_id, region->start_gfn, 1365 - region->nr_pages, unmap_flags); 1385 + mshv_region_put(region); 1366 1386 1367 - mshv_region_evict(region); 1368 - 1369 - vfree(region); 1370 1387 return 0; 1371 1388 } 1372 1389 ··· 1701 1720 { 1702 1721 struct mshv_vp *vp; 1703 1722 struct mshv_mem_region *region; 1704 - int i, ret; 1705 1723 struct hlist_node *n; 1724 + int i; 1706 1725 1707 1726 if (refcount_read(&partition->pt_ref_count)) { 1708 1727 pt_err(partition, ··· 1724 1743 if (!vp) 1725 1744 continue; 1726 1745 1727 - if (hv_parent_partition()) 1728 - mshv_vp_stats_unmap(partition->pt_id, vp->vp_index); 1746 + if (hv_scheduler_type == HV_SCHEDULER_TYPE_ROOT) 1747 + mshv_vp_stats_unmap(partition->pt_id, vp->vp_index, 1748 + (void **)vp->vp_stats_pages); 1729 1749 1730 1750 if (vp->vp_register_page) { 1731 - (void)hv_call_unmap_vp_state_page(partition->pt_id, 1732 - vp->vp_index, 1733 - HV_VP_STATE_PAGE_REGISTERS, 1734 - input_vtl_zero); 1751 + (void)hv_unmap_vp_state_page(partition->pt_id, 1752 + vp->vp_index, 1753 + HV_VP_STATE_PAGE_REGISTERS, 1754 + virt_to_page(vp->vp_register_page), 1755 + input_vtl_zero); 1735 1756 vp->vp_register_page = NULL; 1736 1757 } 1737 1758 1738 - (void)hv_call_unmap_vp_state_page(partition->pt_id, 1739 - vp->vp_index, 1740 - HV_VP_STATE_PAGE_INTERCEPT_MESSAGE, 1741 - input_vtl_zero); 1759 + (void)hv_unmap_vp_state_page(partition->pt_id, 1760 + vp->vp_index, 1761 + HV_VP_STATE_PAGE_INTERCEPT_MESSAGE, 1762 + virt_to_page(vp->vp_intercept_msg_page), 1763 + input_vtl_zero); 1742 1764 vp->vp_intercept_msg_page = NULL; 1743 1765 1744 1766 if (vp->vp_ghcb_page) { 1745 - (void)hv_call_unmap_vp_state_page(partition->pt_id, 1746 - vp->vp_index, 1747 - HV_VP_STATE_PAGE_GHCB, 1748 - input_vtl_normal); 1767 + (void)hv_unmap_vp_state_page(partition->pt_id, 1768 + vp->vp_index, 1769 + HV_VP_STATE_PAGE_GHCB, 1770 + virt_to_page(vp->vp_ghcb_page), 1771 + input_vtl_normal); 1749 1772 vp->vp_ghcb_page = NULL; 1750 1773 } 
1751 1774 ··· 1766 1781 1767 1782 remove_partition(partition); 1768 1783 1769 - /* Remove regions, regain access to the memory and unpin the pages */ 1770 1784 hlist_for_each_entry_safe(region, n, &partition->pt_mem_regions, 1771 1785 hnode) { 1772 1786 hlist_del(&region->hnode); 1773 - 1774 - if (mshv_partition_encrypted(partition)) { 1775 - ret = mshv_partition_region_share(region); 1776 - if (ret) { 1777 - pt_err(partition, 1778 - "Failed to regain access to memory, unpinning user pages will fail and crash the host error: %d\n", 1779 - ret); 1780 - return; 1781 - } 1782 - } 1783 - 1784 - mshv_region_evict(region); 1785 - 1786 - vfree(region); 1787 + mshv_region_put(region); 1787 1788 } 1788 1789 1789 1790 /* Withdraw and free all pages we deposited */ ··· 1836 1865 return 0; 1837 1866 } 1838 1867 1839 - static long 1840 - mshv_ioctl_create_partition(void __user *user_arg, struct device *module_dev) 1841 - { 1842 - struct mshv_create_partition args; 1843 - u64 creation_flags; 1844 - struct hv_partition_creation_properties creation_properties = {}; 1845 - union hv_partition_isolation_properties isolation_properties = {}; 1846 - struct mshv_partition *partition; 1847 - long ret; 1868 + static_assert(MSHV_NUM_CPU_FEATURES_BANKS == 1869 + HV_PARTITION_PROCESSOR_FEATURES_BANKS); 1848 1870 1849 - if (copy_from_user(&args, user_arg, sizeof(args))) 1871 + static long mshv_ioctl_process_pt_flags(void __user *user_arg, u64 *pt_flags, 1872 + struct hv_partition_creation_properties *cr_props, 1873 + union hv_partition_isolation_properties *isol_props) 1874 + { 1875 + int i; 1876 + struct mshv_create_partition_v2 args; 1877 + union hv_partition_processor_features *disabled_procs; 1878 + union hv_partition_processor_xsave_features *disabled_xsave; 1879 + 1880 + /* First, copy v1 struct in case user is on previous versions */ 1881 + if (copy_from_user(&args, user_arg, 1882 + sizeof(struct mshv_create_partition))) 1850 1883 return -EFAULT; 1851 1884 1852 1885 if ((args.pt_flags 
& ~MSHV_PT_FLAGS_MASK) || 1853 1886 args.pt_isolation >= MSHV_PT_ISOLATION_COUNT) 1854 1887 return -EINVAL; 1855 1888 1856 - /* Only support EXO partitions */ 1857 - creation_flags = HV_PARTITION_CREATION_FLAG_EXO_PARTITION | 1858 - HV_PARTITION_CREATION_FLAG_INTERCEPT_MESSAGE_PAGE_ENABLED; 1889 + disabled_procs = &cr_props->disabled_processor_features; 1890 + disabled_xsave = &cr_props->disabled_processor_xsave_features; 1859 1891 1860 - if (args.pt_flags & BIT(MSHV_PT_BIT_LAPIC)) 1861 - creation_flags |= HV_PARTITION_CREATION_FLAG_LAPIC_ENABLED; 1862 - if (args.pt_flags & BIT(MSHV_PT_BIT_X2APIC)) 1863 - creation_flags |= HV_PARTITION_CREATION_FLAG_X2APIC_CAPABLE; 1864 - if (args.pt_flags & BIT(MSHV_PT_BIT_GPA_SUPER_PAGES)) 1865 - creation_flags |= HV_PARTITION_CREATION_FLAG_GPA_SUPER_PAGES_ENABLED; 1892 + /* Check if user provided newer struct with feature fields */ 1893 + if (args.pt_flags & BIT_ULL(MSHV_PT_BIT_CPU_AND_XSAVE_FEATURES)) { 1894 + if (copy_from_user(&args, user_arg, sizeof(args))) 1895 + return -EFAULT; 1896 + 1897 + /* Re-validate v1 fields after second copy_from_user() */ 1898 + if ((args.pt_flags & ~MSHV_PT_FLAGS_MASK) || 1899 + args.pt_isolation >= MSHV_PT_ISOLATION_COUNT) 1900 + return -EINVAL; 1901 + 1902 + if (args.pt_num_cpu_fbanks != MSHV_NUM_CPU_FEATURES_BANKS || 1903 + mshv_field_nonzero(args, pt_rsvd) || 1904 + mshv_field_nonzero(args, pt_rsvd1)) 1905 + return -EINVAL; 1906 + 1907 + /* 1908 + * Note this assumes MSHV_NUM_CPU_FEATURES_BANKS will never 1909 + * change and equals HV_PARTITION_PROCESSOR_FEATURES_BANKS 1910 + * (i.e. 2). 1911 + * 1912 + * Further banks (index >= 2) will be modifiable as 'early' 1913 + * properties via the set partition property hypercall. 
1914 + */ 1915 + for (i = 0; i < HV_PARTITION_PROCESSOR_FEATURES_BANKS; i++) 1916 + disabled_procs->as_uint64[i] = args.pt_cpu_fbanks[i]; 1917 + 1918 + #if IS_ENABLED(CONFIG_X86_64) 1919 + disabled_xsave->as_uint64 = args.pt_disabled_xsave; 1920 + #else 1921 + /* 1922 + * In practice this field is ignored on arm64, but safer to 1923 + * zero it in case it is ever used. 1924 + */ 1925 + disabled_xsave->as_uint64 = 0; 1926 + 1927 + if (mshv_field_nonzero(args, pt_rsvd2)) 1928 + return -EINVAL; 1929 + #endif 1930 + } else { 1931 + /* 1932 + * v1 behavior: try to enable everything. The hypervisor will 1933 + * disable features that are not supported. The banks can be 1934 + * queried via the get partition property hypercall. 1935 + */ 1936 + for (i = 0; i < HV_PARTITION_PROCESSOR_FEATURES_BANKS; i++) 1937 + disabled_procs->as_uint64[i] = 0; 1938 + 1939 + disabled_xsave->as_uint64 = 0; 1940 + } 1941 + 1942 + /* Only support EXO partitions */ 1943 + *pt_flags = HV_PARTITION_CREATION_FLAG_EXO_PARTITION | 1944 + HV_PARTITION_CREATION_FLAG_INTERCEPT_MESSAGE_PAGE_ENABLED; 1945 + 1946 + if (args.pt_flags & BIT_ULL(MSHV_PT_BIT_LAPIC)) 1947 + *pt_flags |= HV_PARTITION_CREATION_FLAG_LAPIC_ENABLED; 1948 + if (args.pt_flags & BIT_ULL(MSHV_PT_BIT_X2APIC)) 1949 + *pt_flags |= HV_PARTITION_CREATION_FLAG_X2APIC_CAPABLE; 1950 + if (args.pt_flags & BIT_ULL(MSHV_PT_BIT_GPA_SUPER_PAGES)) 1951 + *pt_flags |= HV_PARTITION_CREATION_FLAG_GPA_SUPER_PAGES_ENABLED; 1952 + 1953 + isol_props->as_uint64 = 0; 1866 1954 1867 1955 switch (args.pt_isolation) { 1868 1956 case MSHV_PT_ISOLATION_NONE: 1869 - isolation_properties.isolation_type = 1870 - HV_PARTITION_ISOLATION_TYPE_NONE; 1957 + isol_props->isolation_type = HV_PARTITION_ISOLATION_TYPE_NONE; 1871 1958 break; 1872 1959 } 1960 + 1961 + return 0; 1962 + } 1963 + 1964 + static long 1965 + mshv_ioctl_create_partition(void __user *user_arg, struct device *module_dev) 1966 + { 1967 + u64 creation_flags; 1968 + struct hv_partition_creation_properties 
creation_properties; 1969 + union hv_partition_isolation_properties isolation_properties; 1970 + struct mshv_partition *partition; 1971 + long ret; 1972 + 1973 + ret = mshv_ioctl_process_pt_flags(user_arg, &creation_flags, 1974 + &creation_properties, 1975 + &isolation_properties); 1976 + if (ret) 1977 + return ret; 1873 1978 1874 1979 partition = kzalloc(sizeof(*partition), GFP_KERNEL); 1875 1980 if (!partition) ··· 1966 1919 1967 1920 INIT_HLIST_HEAD(&partition->pt_devices); 1968 1921 1922 + spin_lock_init(&partition->pt_mem_regions_lock); 1969 1923 INIT_HLIST_HEAD(&partition->pt_mem_regions); 1970 1924 1971 1925 mshv_eventfd_init(partition); ··· 2014 1966 case MSHV_CREATE_PARTITION: 2015 1967 return mshv_ioctl_create_partition((void __user *)arg, 2016 1968 misc->this_device); 1969 + case MSHV_ROOT_HVCALL: 1970 + return mshv_ioctl_passthru_hvcall(NULL, false, 1971 + (void __user *)arg); 2017 1972 } 2018 1973 2019 1974 return -ENOTTY; ··· 2233 2182 return err; 2234 2183 } 2235 2184 2185 + static void mshv_init_vmm_caps(struct device *dev) 2186 + { 2187 + /* 2188 + * This can only fail here if HVCALL_GET_PARTITION_PROPERTY_EX or 2189 + * HV_PARTITION_PROPERTY_VMM_CAPABILITIES are not supported. In that 2190 + * case it's valid to proceed as if all vmm_caps are disabled (zero). 2191 + */ 2192 + if (hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF, 2193 + HV_PARTITION_PROPERTY_VMM_CAPABILITIES, 2194 + 0, &mshv_root.vmm_caps, 2195 + sizeof(mshv_root.vmm_caps))) 2196 + dev_warn(dev, "Unable to get VMM capabilities\n"); 2197 + 2198 + dev_dbg(dev, "vmm_caps = %#llx\n", mshv_root.vmm_caps.as_uint64[0]); 2199 + } 2200 + 2236 2201 static int __init mshv_parent_partition_init(void) 2237 2202 { 2238 2203 int ret; ··· 2300 2233 ret = mshv_root_partition_init(dev); 2301 2234 if (ret) 2302 2235 goto remove_cpu_state; 2236 + 2237 + mshv_init_vmm_caps(dev); 2303 2238 2304 2239 ret = mshv_irqfd_wq_init(); 2305 2240 if (ret)
+3 -3
drivers/hv/mshv_synic.c
···
 void mshv_isr(void)
 {
     struct hv_synic_pages *spages = this_cpu_ptr(mshv_root.synic_pages);
-    struct hv_message_page **msg_page = &spages->synic_message_page;
+    struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
     struct hv_message *msg;
     bool handled;
···
 #endif
     union hv_synic_scontrol sctrl;
     struct hv_synic_pages *spages = this_cpu_ptr(mshv_root.synic_pages);
-    struct hv_message_page **msg_page = &spages->synic_message_page;
+    struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
     struct hv_synic_event_flags_page **event_flags_page =
         &spages->synic_event_flags_page;
     struct hv_synic_event_ring_page **event_ring_page =
···
     union hv_synic_sirbp sirbp;
     union hv_synic_scontrol sctrl;
     struct hv_synic_pages *spages = this_cpu_ptr(mshv_root.synic_pages);
-    struct hv_message_page **msg_page = &spages->synic_message_page;
+    struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
     struct hv_synic_event_flags_page **event_flags_page =
         &spages->synic_event_flags_page;
     struct hv_synic_event_ring_page **event_ring_page =
+25
drivers/hv/mshv_vtl.h
···
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _MSHV_VTL_H
+#define _MSHV_VTL_H
+
+#include <linux/mshv.h>
+#include <linux/types.h>
+
+struct mshv_vtl_run {
+    u32 cancel;
+    u32 vtl_ret_action_size;
+    u32 pad[2];
+    char exit_message[MSHV_MAX_RUN_MSG_SIZE];
+    union {
+        struct mshv_vtl_cpu_context cpu_context;
+
+        /*
+         * Reserving room for the cpu context to grow and to maintain
+         * compatibility with user mode.
+         */
+        char reserved[1024];
+    };
+    char vtl_ret_actions[MSHV_MAX_RUN_MSG_SIZE];
+};
+
+#endif /* _MSHV_VTL_H */
+1392
drivers/hv/mshv_vtl_main.c
··· 1 + // SPDX-License-Identifier: GPL-2.0-only 2 + /* 3 + * Copyright (c) 2023, Microsoft Corporation. 4 + * 5 + * Author: 6 + * Roman Kisel <romank@linux.microsoft.com> 7 + * Saurabh Sengar <ssengar@linux.microsoft.com> 8 + * Naman Jain <namjain@linux.microsoft.com> 9 + */ 10 + 11 + #include <linux/kernel.h> 12 + #include <linux/module.h> 13 + #include <linux/miscdevice.h> 14 + #include <linux/anon_inodes.h> 15 + #include <linux/cpuhotplug.h> 16 + #include <linux/count_zeros.h> 17 + #include <linux/entry-virt.h> 18 + #include <linux/eventfd.h> 19 + #include <linux/poll.h> 20 + #include <linux/file.h> 21 + #include <linux/vmalloc.h> 22 + #include <asm/debugreg.h> 23 + #include <asm/mshyperv.h> 24 + #include <trace/events/ipi.h> 25 + #include <uapi/asm/mtrr.h> 26 + #include <uapi/linux/mshv.h> 27 + #include <hyperv/hvhdk.h> 28 + 29 + #include "../../kernel/fpu/legacy.h" 30 + #include "mshv.h" 31 + #include "mshv_vtl.h" 32 + #include "hyperv_vmbus.h" 33 + 34 + MODULE_AUTHOR("Microsoft"); 35 + MODULE_LICENSE("GPL"); 36 + MODULE_DESCRIPTION("Microsoft Hyper-V VTL Driver"); 37 + 38 + #define MSHV_ENTRY_REASON_LOWER_VTL_CALL 0x1 39 + #define MSHV_ENTRY_REASON_INTERRUPT 0x2 40 + #define MSHV_ENTRY_REASON_INTERCEPT 0x3 41 + 42 + #define MSHV_REAL_OFF_SHIFT 16 43 + #define MSHV_PG_OFF_CPU_MASK (BIT_ULL(MSHV_REAL_OFF_SHIFT) - 1) 44 + #define MSHV_RUN_PAGE_OFFSET 0 45 + #define MSHV_REG_PAGE_OFFSET 1 46 + #define VTL2_VMBUS_SINT_INDEX 7 47 + 48 + static struct device *mem_dev; 49 + 50 + static struct tasklet_struct msg_dpc; 51 + static wait_queue_head_t fd_wait_queue; 52 + static bool has_message; 53 + static struct eventfd_ctx *flag_eventfds[HV_EVENT_FLAGS_COUNT]; 54 + static DEFINE_MUTEX(flag_lock); 55 + static bool __read_mostly mshv_has_reg_page; 56 + 57 + /* hvcall code is of type u16, allocate a bitmap of size (1 << 16) to accommodate it */ 58 + #define MAX_BITMAP_SIZE ((U16_MAX + 1) / 8) 59 + 60 + struct mshv_vtl_hvcall_fd { 61 + u8 allow_bitmap[MAX_BITMAP_SIZE]; 62 
	bool allow_map_initialized;
	/*
	 * Used to protect hvcall setup in IOCTLs
	 */
	struct mutex init_mutex;
	struct miscdevice *dev;
};

struct mshv_vtl_poll_file {
	struct file *file;
	wait_queue_entry_t wait;
	wait_queue_head_t *wqh;
	poll_table pt;
	int cpu;
};

struct mshv_vtl {
	struct device *module_dev;
	u64 id;
};

struct mshv_vtl_per_cpu {
	struct mshv_vtl_run *run;
	struct page *reg_page;
};

/* SYNIC_OVERLAY_PAGE_MSR - internal, identical to hv_synic_simp */
union hv_synic_overlay_page_msr {
	u64 as_uint64;
	struct {
		u64 enabled: 1;
		u64 reserved: 11;
		u64 pfn: 52;
	} __packed;
};

static struct mutex mshv_vtl_poll_file_lock;
static union hv_register_vsm_page_offsets mshv_vsm_page_offsets;
static union hv_register_vsm_capabilities mshv_vsm_capabilities;

static DEFINE_PER_CPU(struct mshv_vtl_poll_file, mshv_vtl_poll_file);
static DEFINE_PER_CPU(unsigned long long, num_vtl0_transitions);
static DEFINE_PER_CPU(struct mshv_vtl_per_cpu, mshv_vtl_per_cpu);

static const union hv_input_vtl input_vtl_zero;
static const union hv_input_vtl input_vtl_normal = {
	.use_target_vtl = 1,
};

static const struct file_operations mshv_vtl_fops;

static long
mshv_ioctl_create_vtl(void __user *user_arg, struct device *module_dev)
{
	struct mshv_vtl *vtl;
	struct file *file;
	int fd;

	vtl = kzalloc(sizeof(*vtl), GFP_KERNEL);
	if (!vtl)
		return -ENOMEM;

	fd = get_unused_fd_flags(O_CLOEXEC);
	if (fd < 0) {
		kfree(vtl);
		return fd;
	}
	file = anon_inode_getfile("mshv_vtl", &mshv_vtl_fops,
				  vtl, O_RDWR);
	if (IS_ERR(file)) {
		/* Release the reserved fd so it is not leaked on failure */
		put_unused_fd(fd);
		kfree(vtl);
		return PTR_ERR(file);
	}
	vtl->module_dev = module_dev;
	fd_install(fd, file);

	return fd;
}

static long
mshv_ioctl_check_extension(void __user *user_arg)
{
	u32 arg;

	if (copy_from_user(&arg, user_arg, sizeof(arg)))
		return -EFAULT;

	switch (arg) {
	case MSHV_CAP_CORE_API_STABLE:
		return 0;
	case MSHV_CAP_REGISTER_PAGE:
		return mshv_has_reg_page;
	case MSHV_CAP_VTL_RETURN_ACTION:
		return mshv_vsm_capabilities.return_action_available;
	case MSHV_CAP_DR6_SHARED:
		return mshv_vsm_capabilities.dr6_shared;
	}

	return -EOPNOTSUPP;
}

static long
mshv_dev_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
{
	struct miscdevice *misc = filp->private_data;

	switch (ioctl) {
	case MSHV_CHECK_EXTENSION:
		return mshv_ioctl_check_extension((void __user *)arg);
	case MSHV_CREATE_VTL:
		return mshv_ioctl_create_vtl((void __user *)arg, misc->this_device);
	}

	return -ENOTTY;
}

static const struct file_operations mshv_dev_fops = {
	.owner		= THIS_MODULE,
	.unlocked_ioctl	= mshv_dev_ioctl,
	.llseek		= noop_llseek,
};

static struct miscdevice mshv_dev = {
	.minor = MISC_DYNAMIC_MINOR,
	.name = "mshv",
	.fops = &mshv_dev_fops,
	.mode = 0600,
};

static struct mshv_vtl_run *mshv_vtl_this_run(void)
{
	return *this_cpu_ptr(&mshv_vtl_per_cpu.run);
}

static struct mshv_vtl_run *mshv_vtl_cpu_run(int cpu)
{
	return *per_cpu_ptr(&mshv_vtl_per_cpu.run, cpu);
}

static struct page *mshv_vtl_cpu_reg_page(int cpu)
{
	return *per_cpu_ptr(&mshv_vtl_per_cpu.reg_page, cpu);
}

static void mshv_vtl_configure_reg_page(struct mshv_vtl_per_cpu *per_cpu)
{
	struct hv_register_assoc reg_assoc = {};
	union hv_synic_overlay_page_msr overlay = {};
	struct page *reg_page;

	reg_page
		= alloc_page(GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL);
	if (!reg_page) {
		WARN(1, "failed to allocate register page\n");
		return;
	}

	overlay.enabled = 1;
	overlay.pfn = page_to_hvpfn(reg_page);
	reg_assoc.name = HV_X64_REGISTER_REG_PAGE;
	reg_assoc.value.reg64 = overlay.as_uint64;

	if (hv_call_set_vp_registers(HV_VP_INDEX_SELF, HV_PARTITION_ID_SELF,
				     1, input_vtl_zero, &reg_assoc)) {
		WARN(1, "failed to setup register page\n");
		__free_page(reg_page);
		return;
	}

	per_cpu->reg_page = reg_page;
	mshv_has_reg_page = true;
}

static void mshv_vtl_synic_enable_regs(unsigned int cpu)
{
	union hv_synic_sint sint;

	sint.as_uint64 = 0;
	sint.vector = HYPERVISOR_CALLBACK_VECTOR;
	sint.masked = false;
	sint.auto_eoi = hv_recommend_using_aeoi();

	/* Enable intercepts */
	if (!mshv_vsm_capabilities.intercept_page_available)
		hv_set_msr(HV_MSR_SINT0 + HV_SYNIC_INTERCEPTION_SINT_INDEX,
			   sint.as_uint64);

	/* VTL2 Host VSP SINT is (un)masked when the user mode requests that */
}

static int mshv_vtl_get_vsm_regs(void)
{
	struct hv_register_assoc registers[2];
	int ret, count = 2;

	registers[0].name = HV_REGISTER_VSM_CODE_PAGE_OFFSETS;
	registers[1].name = HV_REGISTER_VSM_CAPABILITIES;

	ret = hv_call_get_vp_registers(HV_VP_INDEX_SELF, HV_PARTITION_ID_SELF,
				       count, input_vtl_zero, registers);
	if (ret)
		return ret;

	mshv_vsm_page_offsets.as_uint64 = registers[0].value.reg64;
	mshv_vsm_capabilities.as_uint64 = registers[1].value.reg64;

	return ret;
}

static int mshv_vtl_configure_vsm_partition(struct device *dev)
{
	union hv_register_vsm_partition_config config;
	struct hv_register_assoc reg_assoc;

	config.as_uint64 = 0;
	config.default_vtl_protection_mask = HV_MAP_GPA_PERMISSIONS_MASK;
	config.enable_vtl_protection = 1;
	config.zero_memory_on_reset = 1;
	config.intercept_vp_startup = 1;
	config.intercept_cpuid_unimplemented = 1;

	if (mshv_vsm_capabilities.intercept_page_available) {
		dev_dbg(dev, "using intercept page\n");
		config.intercept_page = 1;
	}

	reg_assoc.name = HV_REGISTER_VSM_PARTITION_CONFIG;
	reg_assoc.value.reg64 = config.as_uint64;

	return hv_call_set_vp_registers(HV_VP_INDEX_SELF, HV_PARTITION_ID_SELF,
					1, input_vtl_zero, &reg_assoc);
}

static void mshv_vtl_vmbus_isr(void)
{
	struct hv_per_cpu_context *per_cpu;
	struct hv_message *msg;
	u32 message_type;
	union hv_synic_event_flags *event_flags;
	struct eventfd_ctx *eventfd;
	u16 i;

	per_cpu = this_cpu_ptr(hv_context.cpu_context);
	if (smp_processor_id() == 0) {
		msg = (struct hv_message *)per_cpu->hyp_synic_message_page + VTL2_VMBUS_SINT_INDEX;
		message_type = READ_ONCE(msg->header.message_type);
		if (message_type != HVMSG_NONE)
			tasklet_schedule(&msg_dpc);
	}

	event_flags = (union hv_synic_event_flags *)per_cpu->hyp_synic_event_page +
			VTL2_VMBUS_SINT_INDEX;
	for_each_set_bit(i, event_flags->flags, HV_EVENT_FLAGS_COUNT) {
		if (!sync_test_and_clear_bit(i, event_flags->flags))
			continue;
		rcu_read_lock();
		eventfd = READ_ONCE(flag_eventfds[i]);
		if (eventfd)
			eventfd_signal(eventfd);
		rcu_read_unlock();
	}

	vmbus_isr();
}

static int mshv_vtl_alloc_context(unsigned int cpu)
{
	struct mshv_vtl_per_cpu *per_cpu = this_cpu_ptr(&mshv_vtl_per_cpu);

	per_cpu->run = (struct mshv_vtl_run *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
	if (!per_cpu->run)
		return -ENOMEM;

	if (mshv_vsm_capabilities.intercept_page_available)
		mshv_vtl_configure_reg_page(per_cpu);

	mshv_vtl_synic_enable_regs(cpu);

	return 0;
}

static int mshv_vtl_cpuhp_online;

static int hv_vtl_setup_synic(void)
{
	int ret;

	/* Use our isr to first filter out packets destined for userspace */
	hv_setup_vmbus_handler(mshv_vtl_vmbus_isr);

	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "hyperv/vtl:online",
				mshv_vtl_alloc_context, NULL);
	if (ret < 0) {
		hv_setup_vmbus_handler(vmbus_isr);
		return ret;
	}

	mshv_vtl_cpuhp_online = ret;

	return 0;
}

static void hv_vtl_remove_synic(void)
{
	cpuhp_remove_state(mshv_vtl_cpuhp_online);
	hv_setup_vmbus_handler(vmbus_isr);
}

static int vtl_get_vp_register(struct hv_register_assoc *reg)
{
	return hv_call_get_vp_registers(HV_VP_INDEX_SELF, HV_PARTITION_ID_SELF,
					1, input_vtl_normal, reg);
}

static int vtl_set_vp_register(struct hv_register_assoc *reg)
{
	return hv_call_set_vp_registers(HV_VP_INDEX_SELF, HV_PARTITION_ID_SELF,
					1, input_vtl_normal, reg);
}

static int mshv_vtl_ioctl_add_vtl0_mem(struct mshv_vtl *vtl, void __user *arg)
{
	struct mshv_vtl_ram_disposition vtl0_mem;
	struct dev_pagemap *pgmap;
	void *addr;

	if (copy_from_user(&vtl0_mem, arg, sizeof(vtl0_mem)))
		return -EFAULT;
	/* vtl0_mem.last_pfn is excluded in the pagemap range for VTL0 as per design */
	if (vtl0_mem.last_pfn <= vtl0_mem.start_pfn) {
		dev_err(vtl->module_dev, "range start pfn (%llx) > end pfn (%llx)\n",
			vtl0_mem.start_pfn, vtl0_mem.last_pfn);
		return -EFAULT;
	}

	pgmap = kzalloc(sizeof(*pgmap), GFP_KERNEL);
	if (!pgmap)
		return -ENOMEM;

	pgmap->ranges[0].start = PFN_PHYS(vtl0_mem.start_pfn);
	pgmap->ranges[0].end = PFN_PHYS(vtl0_mem.last_pfn) - 1;
	pgmap->nr_range = 1;
	pgmap->type = MEMORY_DEVICE_GENERIC;

	/*
	 * Determine the highest page order that can be used for the given memory range.
	 * This works best when the range is aligned; i.e. both the start and the length.
	 */
	pgmap->vmemmap_shift = count_trailing_zeros(vtl0_mem.start_pfn | vtl0_mem.last_pfn);
	dev_dbg(vtl->module_dev,
		"Add VTL0 memory: start: 0x%llx, end_pfn: 0x%llx, page order: %lu\n",
		vtl0_mem.start_pfn, vtl0_mem.last_pfn, pgmap->vmemmap_shift);

	addr = devm_memremap_pages(mem_dev, pgmap);
	if (IS_ERR(addr)) {
		dev_err(vtl->module_dev, "devm_memremap_pages error: %ld\n", PTR_ERR(addr));
		kfree(pgmap);
		return -EFAULT;
	}

	/*
	 * Don't free pgmap, since it has to stick around until the memory
	 * is unmapped, which will never happen as there is no scenario
	 * where VTL0 can be released/shutdown without bringing down VTL2.
	 */
	return 0;
}

static void mshv_vtl_cancel(int cpu)
{
	int here = get_cpu();

	if (here != cpu) {
		if (!xchg_relaxed(&mshv_vtl_cpu_run(cpu)->cancel, 1))
			smp_send_reschedule(cpu);
	} else {
		WRITE_ONCE(mshv_vtl_this_run()->cancel, 1);
	}
	put_cpu();
}

static int mshv_vtl_poll_file_wake(wait_queue_entry_t *wait, unsigned int mode, int sync, void *key)
{
	struct mshv_vtl_poll_file *poll_file = container_of(wait, struct mshv_vtl_poll_file, wait);

	mshv_vtl_cancel(poll_file->cpu);

	return 0;
}

static void mshv_vtl_ptable_queue_proc(struct file *file, wait_queue_head_t *wqh, poll_table *pt)
{
	struct mshv_vtl_poll_file *poll_file = container_of(pt, struct mshv_vtl_poll_file, pt);

	WARN_ON(poll_file->wqh);
	poll_file->wqh = wqh;
	add_wait_queue(wqh, &poll_file->wait);
}

static int mshv_vtl_ioctl_set_poll_file(struct
mshv_vtl_set_poll_file __user *user_input)
{
	struct file *file, *old_file;
	struct mshv_vtl_poll_file *poll_file;
	struct mshv_vtl_set_poll_file input;

	if (copy_from_user(&input, user_input, sizeof(input)))
		return -EFAULT;

	if (input.cpu >= num_possible_cpus() || !cpu_online(input.cpu))
		return -EINVAL;
	/*
	 * CPU hotplug is not supported in VTL2 in OpenHCL, where this kernel driver exists.
	 * The CPU is expected to remain online after the cpu_online() check above.
	 */

	file = fget(input.fd);
	if (!file)
		return -EBADFD;

	poll_file = per_cpu_ptr(&mshv_vtl_poll_file, READ_ONCE(input.cpu));
	if (!poll_file)
		return -EINVAL;

	mutex_lock(&mshv_vtl_poll_file_lock);

	if (poll_file->wqh)
		remove_wait_queue(poll_file->wqh, &poll_file->wait);
	poll_file->wqh = NULL;

	old_file = poll_file->file;
	poll_file->file = file;
	poll_file->cpu = input.cpu;

	if (file) {
		init_waitqueue_func_entry(&poll_file->wait, mshv_vtl_poll_file_wake);
		init_poll_funcptr(&poll_file->pt, mshv_vtl_ptable_queue_proc);
		vfs_poll(file, &poll_file->pt);
	}

	mutex_unlock(&mshv_vtl_poll_file_lock);

	if (old_file)
		fput(old_file);

	return 0;
}

/* Static table mapping register names to their corresponding actions */
static const struct {
	enum hv_register_name reg_name;
	int debug_reg_num;	/* -1 if not a debug register */
	u32 msr_addr;		/* 0 if not an MSR */
} reg_table[] = {
	/* Debug registers */
	{HV_X64_REGISTER_DR0, 0, 0},
	{HV_X64_REGISTER_DR1, 1, 0},
	{HV_X64_REGISTER_DR2, 2, 0},
	{HV_X64_REGISTER_DR3, 3, 0},
	{HV_X64_REGISTER_DR6, 6, 0},
	/* MTRR MSRs */
	{HV_X64_REGISTER_MSR_MTRR_CAP, -1, MSR_MTRRcap},
	{HV_X64_REGISTER_MSR_MTRR_DEF_TYPE, -1, MSR_MTRRdefType},
	{HV_X64_REGISTER_MSR_MTRR_PHYS_BASE0, -1, MTRRphysBase_MSR(0)},
	{HV_X64_REGISTER_MSR_MTRR_PHYS_BASE1, -1, MTRRphysBase_MSR(1)},
	{HV_X64_REGISTER_MSR_MTRR_PHYS_BASE2, -1, MTRRphysBase_MSR(2)},
	{HV_X64_REGISTER_MSR_MTRR_PHYS_BASE3, -1, MTRRphysBase_MSR(3)},
	{HV_X64_REGISTER_MSR_MTRR_PHYS_BASE4, -1, MTRRphysBase_MSR(4)},
	{HV_X64_REGISTER_MSR_MTRR_PHYS_BASE5, -1, MTRRphysBase_MSR(5)},
	{HV_X64_REGISTER_MSR_MTRR_PHYS_BASE6, -1, MTRRphysBase_MSR(6)},
	{HV_X64_REGISTER_MSR_MTRR_PHYS_BASE7, -1, MTRRphysBase_MSR(7)},
	{HV_X64_REGISTER_MSR_MTRR_PHYS_BASE8, -1, MTRRphysBase_MSR(8)},
	{HV_X64_REGISTER_MSR_MTRR_PHYS_BASE9, -1, MTRRphysBase_MSR(9)},
	{HV_X64_REGISTER_MSR_MTRR_PHYS_BASEA, -1, MTRRphysBase_MSR(0xa)},
	{HV_X64_REGISTER_MSR_MTRR_PHYS_BASEB, -1, MTRRphysBase_MSR(0xb)},
	{HV_X64_REGISTER_MSR_MTRR_PHYS_BASEC, -1, MTRRphysBase_MSR(0xc)},
	{HV_X64_REGISTER_MSR_MTRR_PHYS_BASED, -1, MTRRphysBase_MSR(0xd)},
	{HV_X64_REGISTER_MSR_MTRR_PHYS_BASEE, -1, MTRRphysBase_MSR(0xe)},
	{HV_X64_REGISTER_MSR_MTRR_PHYS_BASEF, -1, MTRRphysBase_MSR(0xf)},
	{HV_X64_REGISTER_MSR_MTRR_PHYS_MASK0, -1, MTRRphysMask_MSR(0)},
	{HV_X64_REGISTER_MSR_MTRR_PHYS_MASK1, -1, MTRRphysMask_MSR(1)},
	{HV_X64_REGISTER_MSR_MTRR_PHYS_MASK2, -1, MTRRphysMask_MSR(2)},
	{HV_X64_REGISTER_MSR_MTRR_PHYS_MASK3, -1, MTRRphysMask_MSR(3)},
	{HV_X64_REGISTER_MSR_MTRR_PHYS_MASK4, -1, MTRRphysMask_MSR(4)},
	{HV_X64_REGISTER_MSR_MTRR_PHYS_MASK5, -1, MTRRphysMask_MSR(5)},
	{HV_X64_REGISTER_MSR_MTRR_PHYS_MASK6, -1, MTRRphysMask_MSR(6)},
	{HV_X64_REGISTER_MSR_MTRR_PHYS_MASK7, -1, MTRRphysMask_MSR(7)},
	{HV_X64_REGISTER_MSR_MTRR_PHYS_MASK8, -1, MTRRphysMask_MSR(8)},
	{HV_X64_REGISTER_MSR_MTRR_PHYS_MASK9, -1, MTRRphysMask_MSR(9)},
	{HV_X64_REGISTER_MSR_MTRR_PHYS_MASKA, -1, MTRRphysMask_MSR(0xa)},
	{HV_X64_REGISTER_MSR_MTRR_PHYS_MASKB, -1, MTRRphysMask_MSR(0xb)},
	{HV_X64_REGISTER_MSR_MTRR_PHYS_MASKC, -1, MTRRphysMask_MSR(0xc)},
	{HV_X64_REGISTER_MSR_MTRR_PHYS_MASKD, -1, MTRRphysMask_MSR(0xd)},
	{HV_X64_REGISTER_MSR_MTRR_PHYS_MASKE, -1, MTRRphysMask_MSR(0xe)},
	{HV_X64_REGISTER_MSR_MTRR_PHYS_MASKF, -1, MTRRphysMask_MSR(0xf)},
	{HV_X64_REGISTER_MSR_MTRR_FIX64K00000, -1, MSR_MTRRfix64K_00000},
	{HV_X64_REGISTER_MSR_MTRR_FIX16K80000, -1, MSR_MTRRfix16K_80000},
	{HV_X64_REGISTER_MSR_MTRR_FIX16KA0000, -1, MSR_MTRRfix16K_A0000},
	{HV_X64_REGISTER_MSR_MTRR_FIX4KC0000, -1, MSR_MTRRfix4K_C0000},
	{HV_X64_REGISTER_MSR_MTRR_FIX4KC8000, -1, MSR_MTRRfix4K_C8000},
	{HV_X64_REGISTER_MSR_MTRR_FIX4KD0000, -1, MSR_MTRRfix4K_D0000},
	{HV_X64_REGISTER_MSR_MTRR_FIX4KD8000, -1, MSR_MTRRfix4K_D8000},
	{HV_X64_REGISTER_MSR_MTRR_FIX4KE0000, -1, MSR_MTRRfix4K_E0000},
	{HV_X64_REGISTER_MSR_MTRR_FIX4KE8000, -1, MSR_MTRRfix4K_E8000},
	{HV_X64_REGISTER_MSR_MTRR_FIX4KF0000, -1, MSR_MTRRfix4K_F0000},
	{HV_X64_REGISTER_MSR_MTRR_FIX4KF8000, -1, MSR_MTRRfix4K_F8000},
};

static int mshv_vtl_get_set_reg(struct hv_register_assoc *regs, bool set)
{
	u64 *reg64;
	enum hv_register_name gpr_name;
	int i;

	gpr_name = regs->name;
	reg64 = &regs->value.reg64;

	/* Search for the register in the table */
	for (i = 0; i < ARRAY_SIZE(reg_table); i++) {
		if (reg_table[i].reg_name != gpr_name)
			continue;
		if (reg_table[i].debug_reg_num != -1) {
			/* Handle debug registers */
			if (gpr_name == HV_X64_REGISTER_DR6 &&
			    !mshv_vsm_capabilities.dr6_shared)
				goto hypercall;
			if (set)
				native_set_debugreg(reg_table[i].debug_reg_num, *reg64);
			else
				*reg64 = native_get_debugreg(reg_table[i].debug_reg_num);
		} else {
			/* Handle MSRs */
			if (set)
				wrmsrl(reg_table[i].msr_addr, *reg64);
			else
				rdmsrl(reg_table[i].msr_addr, *reg64);
		}
		return 0;
	}

hypercall:
	return 1;
}

static void mshv_vtl_return(struct mshv_vtl_cpu_context *vtl0)
{
	struct hv_vp_assist_page *hvp;

	hvp = hv_vp_assist_page[smp_processor_id()];

	/*
	 * Process signal event direct set in the run page, if any.
	 */
	if (mshv_vsm_capabilities.return_action_available) {
		u32 offset = READ_ONCE(mshv_vtl_this_run()->vtl_ret_action_size);

		WRITE_ONCE(mshv_vtl_this_run()->vtl_ret_action_size, 0);

		/*
		 * Hypervisor will take care of clearing out the actions
		 * set in the assist page.
		 */
		memcpy(hvp->vtl_ret_actions,
		       mshv_vtl_this_run()->vtl_ret_actions,
		       min_t(u32, offset, sizeof(hvp->vtl_ret_actions)));
	}

	mshv_vtl_return_call(vtl0);
}

static bool mshv_vtl_process_intercept(void)
{
	struct hv_per_cpu_context *mshv_cpu;
	void *synic_message_page;
	struct hv_message *msg;
	u32 message_type;

	mshv_cpu = this_cpu_ptr(hv_context.cpu_context);
	synic_message_page = mshv_cpu->hyp_synic_message_page;
	if (unlikely(!synic_message_page))
		return true;

	msg = (struct hv_message *)synic_message_page + HV_SYNIC_INTERCEPTION_SINT_INDEX;
	message_type = READ_ONCE(msg->header.message_type);
	if (message_type == HVMSG_NONE)
		return true;

	memcpy(mshv_vtl_this_run()->exit_message, msg, sizeof(*msg));
	vmbus_signal_eom(msg, message_type);

	return false;
}

static int mshv_vtl_ioctl_return_to_lower_vtl(void)
{
	preempt_disable();
	for (;;) {
		unsigned long irq_flags;
		struct hv_vp_assist_page *hvp;
		int ret;

		if (__xfer_to_guest_mode_work_pending()) {
			preempt_enable();
			ret = xfer_to_guest_mode_handle_work();
			if (ret)
				return ret;
			preempt_disable();
		}

		local_irq_save(irq_flags);
		if
		    (READ_ONCE(mshv_vtl_this_run()->cancel)) {
			local_irq_restore(irq_flags);
			preempt_enable();
			return -EINTR;
		}

		mshv_vtl_return(&mshv_vtl_this_run()->cpu_context);
		local_irq_restore(irq_flags);

		hvp = hv_vp_assist_page[smp_processor_id()];
		this_cpu_inc(num_vtl0_transitions);
		switch (hvp->vtl_entry_reason) {
		case MSHV_ENTRY_REASON_INTERRUPT:
			if (!mshv_vsm_capabilities.intercept_page_available &&
			    likely(!mshv_vtl_process_intercept()))
				goto done;
			break;

		case MSHV_ENTRY_REASON_INTERCEPT:
			WARN_ON(!mshv_vsm_capabilities.intercept_page_available);
			memcpy(mshv_vtl_this_run()->exit_message, hvp->intercept_message,
			       sizeof(hvp->intercept_message));
			goto done;

		default:
			panic("unknown entry reason: %d", hvp->vtl_entry_reason);
		}
	}

done:
	preempt_enable();

	return 0;
}

static long
mshv_vtl_ioctl_get_regs(void __user *user_args)
{
	struct mshv_vp_registers args;
	struct hv_register_assoc reg;
	long ret;

	if (copy_from_user(&args, user_args, sizeof(args)))
		return -EFAULT;

	/*
	 * This IOCTL supports processing only one register at a time.
	 */
	if (args.count != 1)
		return -EINVAL;

	if (copy_from_user(&reg, (void __user *)args.regs_ptr,
			   sizeof(reg)))
		return -EFAULT;

	ret = mshv_vtl_get_set_reg(&reg, false);
	if (!ret)
		goto copy_args;	/* No need of hypercall */
	ret = vtl_get_vp_register(&reg);
	if (ret)
		return ret;

copy_args:
	if (copy_to_user((void __user *)args.regs_ptr, &reg, sizeof(reg)))
		ret = -EFAULT;

	return ret;
}

static long
mshv_vtl_ioctl_set_regs(void __user *user_args)
{
	struct mshv_vp_registers args;
	struct hv_register_assoc reg;
	long ret;

	if (copy_from_user(&args, user_args, sizeof(args)))
		return -EFAULT;

	/* This IOCTL supports processing only one register at a time. */
	if (args.count != 1)
		return -EINVAL;

	if (copy_from_user(&reg, (void __user *)args.regs_ptr, sizeof(reg)))
		return -EFAULT;

	ret = mshv_vtl_get_set_reg(&reg, true);
	if (!ret)
		return ret;	/* No need of hypercall */
	ret = vtl_set_vp_register(&reg);

	return ret;
}

static long
mshv_vtl_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
{
	long ret;
	struct mshv_vtl *vtl = filp->private_data;

	switch (ioctl) {
	case MSHV_SET_POLL_FILE:
		ret = mshv_vtl_ioctl_set_poll_file((struct mshv_vtl_set_poll_file __user *)arg);
		break;
	case MSHV_GET_VP_REGISTERS:
		ret = mshv_vtl_ioctl_get_regs((void __user *)arg);
		break;
	case MSHV_SET_VP_REGISTERS:
		ret = mshv_vtl_ioctl_set_regs((void __user *)arg);
		break;
	case MSHV_RETURN_TO_LOWER_VTL:
		ret = mshv_vtl_ioctl_return_to_lower_vtl();
		break;
	case MSHV_ADD_VTL0_MEMORY:
		ret = mshv_vtl_ioctl_add_vtl0_mem(vtl, (void __user *)arg);
		break;
	default:
		dev_err(vtl->module_dev, "invalid vtl ioctl: "
			"%#x\n", ioctl);
		ret = -ENOTTY;
	}

	return ret;
}

static vm_fault_t mshv_vtl_fault(struct vm_fault *vmf)
{
	struct page *page;
	int cpu = vmf->pgoff & MSHV_PG_OFF_CPU_MASK;
	int real_off = vmf->pgoff >> MSHV_REAL_OFF_SHIFT;

	if (!cpu_online(cpu))
		return VM_FAULT_SIGBUS;
	/*
	 * CPU Hotplug is not supported in VTL2 in OpenHCL, where this kernel driver exists.
	 * CPU is expected to remain online after above cpu_online() check.
	 */

	if (real_off == MSHV_RUN_PAGE_OFFSET) {
		page = virt_to_page(mshv_vtl_cpu_run(cpu));
	} else if (real_off == MSHV_REG_PAGE_OFFSET) {
		if (!mshv_has_reg_page)
			return VM_FAULT_SIGBUS;
		page = mshv_vtl_cpu_reg_page(cpu);
	} else {
		return VM_FAULT_NOPAGE;
	}

	get_page(page);
	vmf->page = page;

	return 0;
}

static const struct vm_operations_struct mshv_vtl_vm_ops = {
	.fault = mshv_vtl_fault,
};

static int mshv_vtl_mmap(struct file *filp, struct vm_area_struct *vma)
{
	vma->vm_ops = &mshv_vtl_vm_ops;

	return 0;
}

static int mshv_vtl_release(struct inode *inode, struct file *filp)
{
	struct mshv_vtl *vtl = filp->private_data;

	kfree(vtl);

	return 0;
}

static const struct file_operations mshv_vtl_fops = {
	.owner		= THIS_MODULE,
	.unlocked_ioctl	= mshv_vtl_ioctl,
	.release	= mshv_vtl_release,
	.mmap		= mshv_vtl_mmap,
};

static void mshv_vtl_synic_mask_vmbus_sint(const u8 *mask)
{
	union hv_synic_sint sint;

	sint.as_uint64 = 0;
	sint.vector = HYPERVISOR_CALLBACK_VECTOR;
	sint.masked = (*mask != 0);
	sint.auto_eoi = hv_recommend_using_aeoi();

	hv_set_msr(HV_MSR_SINT0 + VTL2_VMBUS_SINT_INDEX,
		   sint.as_uint64);

	if (!sint.masked)
		pr_debug("%s: "
			 "Unmasking VTL2 VMBUS SINT on VP %d\n", __func__, smp_processor_id());
	else
		pr_debug("%s: Masking VTL2 VMBUS SINT on VP %d\n", __func__, smp_processor_id());
}

static void mshv_vtl_read_remote(void *buffer)
{
	struct hv_per_cpu_context *mshv_cpu = this_cpu_ptr(hv_context.cpu_context);
	struct hv_message *msg = (struct hv_message *)mshv_cpu->hyp_synic_message_page +
					VTL2_VMBUS_SINT_INDEX;
	u32 message_type = READ_ONCE(msg->header.message_type);

	WRITE_ONCE(has_message, false);
	if (message_type == HVMSG_NONE)
		return;

	memcpy(buffer, msg, sizeof(*msg));
	vmbus_signal_eom(msg, message_type);
}

static bool vtl_synic_mask_vmbus_sint_masked = true;

static ssize_t mshv_vtl_sint_read(struct file *filp, char __user *arg, size_t size, loff_t *offset)
{
	struct hv_message msg = {};
	int ret;

	if (size < sizeof(msg))
		return -EINVAL;

	for (;;) {
		smp_call_function_single(VMBUS_CONNECT_CPU, mshv_vtl_read_remote, &msg, true);
		if (msg.header.message_type != HVMSG_NONE)
			break;

		if (READ_ONCE(vtl_synic_mask_vmbus_sint_masked))
			return 0;	/* EOF */

		if (filp->f_flags & O_NONBLOCK)
			return -EAGAIN;

		ret = wait_event_interruptible(fd_wait_queue,
					       READ_ONCE(has_message) ||
					       READ_ONCE(vtl_synic_mask_vmbus_sint_masked));
		if (ret)
			return ret;
	}

	if (copy_to_user(arg, &msg, sizeof(msg)))
		return -EFAULT;

	return sizeof(msg);
}

static __poll_t mshv_vtl_sint_poll(struct file *filp, poll_table *wait)
{
	__poll_t mask = 0;

	poll_wait(filp, &fd_wait_queue, wait);
	if (READ_ONCE(has_message) || READ_ONCE(vtl_synic_mask_vmbus_sint_masked))
		mask |= EPOLLIN | EPOLLRDNORM;

	return mask;
}

static void mshv_vtl_sint_on_msg_dpc(unsigned long data)
{
	WRITE_ONCE(has_message, true);
	wake_up_interruptible_poll(&fd_wait_queue, EPOLLIN);
}

static int mshv_vtl_sint_ioctl_post_msg(struct mshv_vtl_sint_post_msg __user *arg)
{
	struct mshv_vtl_sint_post_msg message;
	u8 payload[HV_MESSAGE_PAYLOAD_BYTE_COUNT];

	if (copy_from_user(&message, arg, sizeof(message)))
		return -EFAULT;
	if (message.payload_size > HV_MESSAGE_PAYLOAD_BYTE_COUNT)
		return -EINVAL;
	if (copy_from_user(payload, (void __user *)message.payload_ptr,
			   message.payload_size))
		return -EFAULT;

	return hv_post_message((union hv_connection_id)message.connection_id,
			       message.message_type, (void *)payload,
			       message.payload_size);
}

static int mshv_vtl_sint_ioctl_signal_event(struct mshv_vtl_signal_event __user *arg)
{
	u64 input, status;
	struct mshv_vtl_signal_event signal_event;

	if (copy_from_user(&signal_event, arg, sizeof(signal_event)))
		return -EFAULT;

	input = signal_event.connection_id | ((u64)signal_event.flag << 32);

	status = hv_do_fast_hypercall8(HVCALL_SIGNAL_EVENT, input);

	return hv_result_to_errno(status);
}

static int mshv_vtl_sint_ioctl_set_eventfd(struct mshv_vtl_set_eventfd __user *arg)
{
	struct mshv_vtl_set_eventfd set_eventfd;
	struct eventfd_ctx *eventfd, *old_eventfd;

	if (copy_from_user(&set_eventfd, arg, sizeof(set_eventfd)))
		return -EFAULT;
	if (set_eventfd.flag >= HV_EVENT_FLAGS_COUNT)
		return -EINVAL;

	eventfd = NULL;
	if (set_eventfd.fd >= 0) {
		eventfd = eventfd_ctx_fdget(set_eventfd.fd);
		if (IS_ERR(eventfd))
			return PTR_ERR(eventfd);
	}

	guard(mutex)(&flag_lock);
	old_eventfd = READ_ONCE(flag_eventfds[set_eventfd.flag]);
	WRITE_ONCE(flag_eventfds[set_eventfd.flag], eventfd);

	if (old_eventfd) {
		synchronize_rcu();
		eventfd_ctx_put(old_eventfd);
	}

	return 0;
}

static int mshv_vtl_sint_ioctl_pause_msg_stream(struct mshv_sint_mask __user *arg)
{
	static DEFINE_MUTEX(vtl2_vmbus_sint_mask_mutex);
	struct mshv_sint_mask mask;

	if (copy_from_user(&mask, arg, sizeof(mask)))
		return -EFAULT;
	guard(mutex)(&vtl2_vmbus_sint_mask_mutex);
	on_each_cpu((smp_call_func_t)mshv_vtl_synic_mask_vmbus_sint, &mask.mask, 1);
	WRITE_ONCE(vtl_synic_mask_vmbus_sint_masked, mask.mask != 0);
	if (mask.mask)
		wake_up_interruptible_poll(&fd_wait_queue, EPOLLIN);

	return 0;
}

static long mshv_vtl_sint_ioctl(struct file *f, unsigned int cmd, unsigned long arg)
{
	switch (cmd) {
	case MSHV_SINT_POST_MESSAGE:
		return mshv_vtl_sint_ioctl_post_msg((struct mshv_vtl_sint_post_msg __user *)arg);
	case MSHV_SINT_SIGNAL_EVENT:
		return mshv_vtl_sint_ioctl_signal_event((struct mshv_vtl_signal_event __user *)arg);
	case MSHV_SINT_SET_EVENTFD:
		return mshv_vtl_sint_ioctl_set_eventfd((struct mshv_vtl_set_eventfd __user *)arg);
	case MSHV_SINT_PAUSE_MESSAGE_STREAM:
		return mshv_vtl_sint_ioctl_pause_msg_stream((struct mshv_sint_mask __user *)arg);
	default:
		return -ENOIOCTLCMD;
	}
}

static const struct file_operations mshv_vtl_sint_ops = {
	.owner		= THIS_MODULE,
	.read		= mshv_vtl_sint_read,
	.poll		= mshv_vtl_sint_poll,
	.unlocked_ioctl	= mshv_vtl_sint_ioctl,
};

static struct miscdevice mshv_vtl_sint_dev = {
	.name = "mshv_sint",
	.fops = &mshv_vtl_sint_ops,
	.mode = 0600,
	.minor = MISC_DYNAMIC_MINOR,
};

static int mshv_vtl_hvcall_dev_open(struct inode *node, struct file *f)
{
	struct miscdevice *dev = f->private_data;
	struct mshv_vtl_hvcall_fd *fd;

	if (!capable(CAP_SYS_ADMIN))
		return -EPERM;

	fd = vzalloc(sizeof(*fd));
	if (!fd)
		return -ENOMEM;
	fd->dev = dev;
	f->private_data = fd;
	mutex_init(&fd->init_mutex);

	return 0;
}

static int mshv_vtl_hvcall_dev_release(struct inode *node, struct file *f)
{
	struct mshv_vtl_hvcall_fd *fd;

	fd = f->private_data;
	if (fd) {
		vfree(fd);
		f->private_data = NULL;
	}

	return 0;
}

static int mshv_vtl_hvcall_do_setup(struct mshv_vtl_hvcall_fd *fd,
				    struct mshv_vtl_hvcall_setup __user *hvcall_setup_user)
{
	struct mshv_vtl_hvcall_setup hvcall_setup;

	guard(mutex)(&fd->init_mutex);

	if (fd->allow_map_initialized) {
		dev_err(fd->dev->this_device,
			"Hypercall allow map has already been set, pid %d\n",
			current->pid);
		return -EINVAL;
	}

	if (copy_from_user(&hvcall_setup, hvcall_setup_user,
			   sizeof(struct mshv_vtl_hvcall_setup))) {
		return -EFAULT;
	}
	if (hvcall_setup.bitmap_array_size > ARRAY_SIZE(fd->allow_bitmap))
		return -EINVAL;

	if (copy_from_user(&fd->allow_bitmap,
			   (void __user *)hvcall_setup.allow_bitmap_ptr,
			   hvcall_setup.bitmap_array_size)) {
		return -EFAULT;
	}

	dev_info(fd->dev->this_device, "Hypercall allow map has been set, pid %d\n",
		 current->pid);
	fd->allow_map_initialized = true;
	return 0;
}

static bool mshv_vtl_hvcall_is_allowed(struct mshv_vtl_hvcall_fd *fd, u16 call_code)
{
	return test_bit(call_code, (unsigned long *)fd->allow_bitmap);
}

static int mshv_vtl_hvcall_call(struct mshv_vtl_hvcall_fd *fd,
				struct mshv_vtl_hvcall __user *hvcall_user)
{
	struct mshv_vtl_hvcall hvcall;
	void *in, *out;
	int ret;

if (copy_from_user(&hvcall, hvcall_user, sizeof(struct mshv_vtl_hvcall))) 1117 + return -EFAULT; 1118 + if (hvcall.input_size > HV_HYP_PAGE_SIZE) 1119 + return -EINVAL; 1120 + if (hvcall.output_size > HV_HYP_PAGE_SIZE) 1121 + return -EINVAL; 1122 + 1123 + /* 1124 + * By default, all hypercalls are not allowed. 1125 + * The user mode code has to set up the allow bitmap once. 1126 + */ 1127 + 1128 + if (!mshv_vtl_hvcall_is_allowed(fd, hvcall.control & 0xFFFF)) { 1129 + dev_err(fd->dev->this_device, 1130 + "Hypercall with control data %#llx isn't allowed\n", 1131 + hvcall.control); 1132 + return -EPERM; 1133 + } 1134 + 1135 + /* 1136 + * This may create a problem for Confidential VM (CVM) usecase where we need to use 1137 + * Hyper-V driver allocated per-cpu input and output pages (hyperv_pcpu_input_arg and 1138 + * hyperv_pcpu_output_arg) for making a hypervisor call. 1139 + * 1140 + * TODO: Take care of this when CVM support is added. 1141 + */ 1142 + in = (void *)__get_free_page(GFP_KERNEL); 1143 + out = (void *)__get_free_page(GFP_KERNEL); 1144 + 1145 + if (copy_from_user(in, (void __user *)hvcall.input_ptr, hvcall.input_size)) { 1146 + ret = -EFAULT; 1147 + goto free_pages; 1148 + } 1149 + 1150 + hvcall.status = hv_do_hypercall(hvcall.control, in, out); 1151 + 1152 + if (copy_to_user((void __user *)hvcall.output_ptr, out, hvcall.output_size)) { 1153 + ret = -EFAULT; 1154 + goto free_pages; 1155 + } 1156 + ret = put_user(hvcall.status, &hvcall_user->status); 1157 + free_pages: 1158 + free_page((unsigned long)in); 1159 + free_page((unsigned long)out); 1160 + 1161 + return ret; 1162 + } 1163 + 1164 + static long mshv_vtl_hvcall_dev_ioctl(struct file *f, unsigned int cmd, unsigned long arg) 1165 + { 1166 + struct mshv_vtl_hvcall_fd *fd = f->private_data; 1167 + 1168 + switch (cmd) { 1169 + case MSHV_HVCALL_SETUP: 1170 + return mshv_vtl_hvcall_do_setup(fd, (struct mshv_vtl_hvcall_setup __user *)arg); 1171 + case MSHV_HVCALL: 1172 + return mshv_vtl_hvcall_call(fd, 
(struct mshv_vtl_hvcall __user *)arg); 1173 + default: 1174 + break; 1175 + } 1176 + 1177 + return -ENOIOCTLCMD; 1178 + } 1179 + 1180 + static const struct file_operations mshv_vtl_hvcall_dev_file_ops = { 1181 + .owner = THIS_MODULE, 1182 + .open = mshv_vtl_hvcall_dev_open, 1183 + .release = mshv_vtl_hvcall_dev_release, 1184 + .unlocked_ioctl = mshv_vtl_hvcall_dev_ioctl, 1185 + }; 1186 + 1187 + static struct miscdevice mshv_vtl_hvcall_dev = { 1188 + .name = "mshv_hvcall", 1189 + .nodename = "mshv_hvcall", 1190 + .fops = &mshv_vtl_hvcall_dev_file_ops, 1191 + .mode = 0600, 1192 + .minor = MISC_DYNAMIC_MINOR, 1193 + }; 1194 + 1195 + static int mshv_vtl_low_open(struct inode *inodep, struct file *filp) 1196 + { 1197 + pid_t pid = task_pid_vnr(current); 1198 + uid_t uid = current_uid().val; 1199 + int ret = 0; 1200 + 1201 + pr_debug("%s: Opening VTL low, task group %d, uid %d\n", __func__, pid, uid); 1202 + 1203 + if (capable(CAP_SYS_ADMIN)) { 1204 + filp->private_data = inodep; 1205 + } else { 1206 + pr_err("%s: VTL low open failed: CAP_SYS_ADMIN required. 
task group %d, uid %d", 1207 + __func__, pid, uid); 1208 + ret = -EPERM; 1209 + } 1210 + 1211 + return ret; 1212 + } 1213 + 1214 + static bool can_fault(struct vm_fault *vmf, unsigned long size, unsigned long *pfn) 1215 + { 1216 + unsigned long mask = size - 1; 1217 + unsigned long start = vmf->address & ~mask; 1218 + unsigned long end = start + size; 1219 + bool is_valid; 1220 + 1221 + is_valid = (vmf->address & mask) == ((vmf->pgoff << PAGE_SHIFT) & mask) && 1222 + start >= vmf->vma->vm_start && 1223 + end <= vmf->vma->vm_end; 1224 + 1225 + if (is_valid) 1226 + *pfn = vmf->pgoff & ~(mask >> PAGE_SHIFT); 1227 + 1228 + return is_valid; 1229 + } 1230 + 1231 + static vm_fault_t mshv_vtl_low_huge_fault(struct vm_fault *vmf, unsigned int order) 1232 + { 1233 + unsigned long pfn = vmf->pgoff; 1234 + vm_fault_t ret = VM_FAULT_FALLBACK; 1235 + 1236 + switch (order) { 1237 + case 0: 1238 + return vmf_insert_mixed(vmf->vma, vmf->address, pfn); 1239 + 1240 + case PMD_ORDER: 1241 + if (can_fault(vmf, PMD_SIZE, &pfn)) 1242 + ret = vmf_insert_pfn_pmd(vmf, pfn, vmf->flags & FAULT_FLAG_WRITE); 1243 + return ret; 1244 + 1245 + case PUD_ORDER: 1246 + if (can_fault(vmf, PUD_SIZE, &pfn)) 1247 + ret = vmf_insert_pfn_pud(vmf, pfn, vmf->flags & FAULT_FLAG_WRITE); 1248 + return ret; 1249 + 1250 + default: 1251 + return VM_FAULT_SIGBUS; 1252 + } 1253 + } 1254 + 1255 + static vm_fault_t mshv_vtl_low_fault(struct vm_fault *vmf) 1256 + { 1257 + return mshv_vtl_low_huge_fault(vmf, 0); 1258 + } 1259 + 1260 + static const struct vm_operations_struct mshv_vtl_low_vm_ops = { 1261 + .fault = mshv_vtl_low_fault, 1262 + .huge_fault = mshv_vtl_low_huge_fault, 1263 + }; 1264 + 1265 + static int mshv_vtl_low_mmap(struct file *filp, struct vm_area_struct *vma) 1266 + { 1267 + vma->vm_ops = &mshv_vtl_low_vm_ops; 1268 + vm_flags_set(vma, VM_HUGEPAGE | VM_MIXEDMAP); 1269 + 1270 + return 0; 1271 + } 1272 + 1273 + static const struct file_operations mshv_vtl_low_file_ops = { 1274 + .owner = THIS_MODULE, 1275 
+ .open = mshv_vtl_low_open, 1276 + .mmap = mshv_vtl_low_mmap, 1277 + }; 1278 + 1279 + static struct miscdevice mshv_vtl_low = { 1280 + .name = "mshv_vtl_low", 1281 + .nodename = "mshv_vtl_low", 1282 + .fops = &mshv_vtl_low_file_ops, 1283 + .mode = 0600, 1284 + .minor = MISC_DYNAMIC_MINOR, 1285 + }; 1286 + 1287 + static int __init mshv_vtl_init(void) 1288 + { 1289 + int ret; 1290 + struct device *dev = mshv_dev.this_device; 1291 + 1292 + /* 1293 + * This creates /dev/mshv which provides functionality to create VTLs and partitions. 1294 + */ 1295 + ret = misc_register(&mshv_dev); 1296 + if (ret) { 1297 + dev_err(dev, "mshv device register failed: %d\n", ret); 1298 + goto free_dev; 1299 + } 1300 + 1301 + tasklet_init(&msg_dpc, mshv_vtl_sint_on_msg_dpc, 0); 1302 + init_waitqueue_head(&fd_wait_queue); 1303 + 1304 + if (mshv_vtl_get_vsm_regs()) { 1305 + dev_emerg(dev, "Unable to get VSM capabilities !!\n"); 1306 + ret = -ENODEV; 1307 + goto free_dev; 1308 + } 1309 + if (mshv_vtl_configure_vsm_partition(dev)) { 1310 + dev_emerg(dev, "VSM configuration failed !!\n"); 1311 + ret = -ENODEV; 1312 + goto free_dev; 1313 + } 1314 + 1315 + mshv_vtl_return_call_init(mshv_vsm_page_offsets.vtl_return_offset); 1316 + ret = hv_vtl_setup_synic(); 1317 + if (ret) 1318 + goto free_dev; 1319 + 1320 + /* 1321 + * mshv_sint device adds VMBus relay ioctl support. 1322 + * This provides a channel for VTL0 to communicate with VTL2. 1323 + */ 1324 + ret = misc_register(&mshv_vtl_sint_dev); 1325 + if (ret) 1326 + goto free_synic; 1327 + 1328 + /* 1329 + * mshv_hvcall device adds interface to enable userspace for direct hypercalls support. 1330 + */ 1331 + ret = misc_register(&mshv_vtl_hvcall_dev); 1332 + if (ret) 1333 + goto free_sint; 1334 + 1335 + /* 1336 + * mshv_vtl_low device is used to map VTL0 address space to a user-mode process in VTL2. 1337 + * It implements mmap() to allow a user-mode process in VTL2 to map to the address of VTL0. 
1338 + */ 1339 + ret = misc_register(&mshv_vtl_low); 1340 + if (ret) 1341 + goto free_hvcall; 1342 + 1343 + /* 1344 + * "mshv vtl mem dev" device is later used to setup VTL0 memory. 1345 + */ 1346 + mem_dev = kzalloc(sizeof(*mem_dev), GFP_KERNEL); 1347 + if (!mem_dev) { 1348 + ret = -ENOMEM; 1349 + goto free_low; 1350 + } 1351 + 1352 + mutex_init(&mshv_vtl_poll_file_lock); 1353 + 1354 + device_initialize(mem_dev); 1355 + dev_set_name(mem_dev, "mshv vtl mem dev"); 1356 + ret = device_add(mem_dev); 1357 + if (ret) { 1358 + dev_err(dev, "mshv vtl mem dev add: %d\n", ret); 1359 + goto free_mem; 1360 + } 1361 + 1362 + return 0; 1363 + 1364 + free_mem: 1365 + kfree(mem_dev); 1366 + free_low: 1367 + misc_deregister(&mshv_vtl_low); 1368 + free_hvcall: 1369 + misc_deregister(&mshv_vtl_hvcall_dev); 1370 + free_sint: 1371 + misc_deregister(&mshv_vtl_sint_dev); 1372 + free_synic: 1373 + hv_vtl_remove_synic(); 1374 + free_dev: 1375 + misc_deregister(&mshv_dev); 1376 + 1377 + return ret; 1378 + } 1379 + 1380 + static void __exit mshv_vtl_exit(void) 1381 + { 1382 + device_del(mem_dev); 1383 + kfree(mem_dev); 1384 + misc_deregister(&mshv_vtl_low); 1385 + misc_deregister(&mshv_vtl_hvcall_dev); 1386 + misc_deregister(&mshv_vtl_sint_dev); 1387 + hv_vtl_remove_synic(); 1388 + misc_deregister(&mshv_dev); 1389 + } 1390 + 1391 + module_init(mshv_vtl_init); 1392 + module_exit(mshv_vtl_exit);
+3 -2
drivers/hv/ring_buffer.c
··· 184 184 185 185 /* Initialize the ring buffer. */ 186 186 int hv_ringbuffer_init(struct hv_ring_buffer_info *ring_info, 187 - struct page *pages, u32 page_cnt, u32 max_pkt_size) 187 + struct page *pages, u32 page_cnt, u32 max_pkt_size, 188 + bool confidential) 188 189 { 189 190 struct page **pages_wraparound; 190 191 int i; ··· 209 208 210 209 ring_info->ring_buffer = (struct hv_ring_buffer *) 211 210 vmap(pages_wraparound, page_cnt * 2 - 1, VM_MAP, 212 - pgprot_decrypted(PAGE_KERNEL)); 211 + confidential ? PAGE_KERNEL : pgprot_decrypted(PAGE_KERNEL)); 213 212 214 213 kfree(pages_wraparound); 215 214 if (!ring_info->ring_buffer)
+131 -73
drivers/hv/vmbus_drv.c
··· 36 36 #include <linux/syscore_ops.h> 37 37 #include <linux/dma-map-ops.h> 38 38 #include <linux/pci.h> 39 + #include <linux/export.h> 39 40 #include <clocksource/hyperv_timer.h> 40 41 #include <asm/mshyperv.h> 41 42 #include "hyperv_vmbus.h" ··· 56 55 /* Values parsed from ACPI DSDT */ 57 56 int vmbus_irq; 58 57 int vmbus_interrupt; 58 + 59 + /* 60 + * If the Confidential VMBus is used, the data on the "wire" is not 61 + * visible to either the host or the hypervisor. 62 + */ 63 + static bool is_confidential; 64 + 65 + bool vmbus_is_confidential(void) 66 + { 67 + return is_confidential; 68 + } 69 + EXPORT_SYMBOL_GPL(vmbus_is_confidential); 59 70 60 71 /* 61 72 * The panic notifier below is responsible solely for unloading the ··· 1058 1045 kfree(ctx); 1059 1046 } 1060 1047 1061 - void vmbus_on_msg_dpc(unsigned long data) 1048 + static void __vmbus_on_msg_dpc(void *message_page_addr) 1062 1049 { 1063 - struct hv_per_cpu_context *hv_cpu = (void *)data; 1064 - void *page_addr = hv_cpu->synic_message_page; 1065 - struct hv_message msg_copy, *msg = (struct hv_message *)page_addr + 1066 - VMBUS_MESSAGE_SINT; 1050 + struct hv_message msg_copy, *msg; 1067 1051 struct vmbus_channel_message_header *hdr; 1068 1052 enum vmbus_channel_message_type msgtype; 1069 1053 const struct vmbus_channel_message_table_entry *entry; 1070 1054 struct onmessage_work_context *ctx; 1071 1055 __u8 payload_size; 1072 1056 u32 message_type; 1057 + 1058 + if (!message_page_addr) 1059 + return; 1060 + msg = (struct hv_message *)message_page_addr + VMBUS_MESSAGE_SINT; 1073 1061 1074 1062 /* 1075 1063 * 'enum vmbus_channel_message_type' is supposed to always be 'u32' as ··· 1197 1183 vmbus_signal_eom(msg, message_type); 1198 1184 } 1199 1185 1186 + void vmbus_on_msg_dpc(unsigned long data) 1187 + { 1188 + struct hv_per_cpu_context *hv_cpu = (void *)data; 1189 + 1190 + __vmbus_on_msg_dpc(hv_cpu->hyp_synic_message_page); 1191 + __vmbus_on_msg_dpc(hv_cpu->para_synic_message_page); 1192 + } 1193 + 
1200 1194 #ifdef CONFIG_PM_SLEEP 1201 1195 /* 1202 1196 * Fake RESCIND_CHANNEL messages to clean up hv_sock channels by force for ··· 1243 1221 #endif /* CONFIG_PM_SLEEP */ 1244 1222 1245 1223 /* 1246 - * Schedule all channels with events pending 1224 + * Schedule all channels with events pending. 1225 + * The event page can be directly checked to get the id of 1226 + * the channel that has the interrupt pending. 1247 1227 */ 1248 - static void vmbus_chan_sched(struct hv_per_cpu_context *hv_cpu) 1228 + static void vmbus_chan_sched(void *event_page_addr) 1249 1229 { 1250 1230 unsigned long *recv_int_page; 1251 1231 u32 maxbits, relid; 1232 + union hv_synic_event_flags *event; 1252 1233 1253 - /* 1254 - * The event page can be directly checked to get the id of 1255 - * the channel that has the interrupt pending. 1256 - */ 1257 - void *page_addr = hv_cpu->synic_event_page; 1258 - union hv_synic_event_flags *event 1259 - = (union hv_synic_event_flags *)page_addr + 1260 - VMBUS_MESSAGE_SINT; 1234 + if (!event_page_addr) 1235 + return; 1236 + event = (union hv_synic_event_flags *)event_page_addr + VMBUS_MESSAGE_SINT; 1261 1237 1262 1238 maxbits = HV_EVENT_FLAGS_COUNT; 1263 1239 recv_int_page = event->flags; ··· 1263 1243 if (unlikely(!recv_int_page)) 1264 1244 return; 1265 1245 1246 + /* 1247 + * Suggested-by: Michael Kelley <mhklinux@outlook.com> 1248 + * One possible optimization would be to keep track of the largest relID that's in use, 1249 + * and only scan up to that relID. 
1250 + */ 1266 1251 for_each_set_bit(relid, recv_int_page, maxbits) { 1267 1252 void (*callback_fn)(void *context); 1268 1253 struct vmbus_channel *channel; ··· 1331 1306 } 1332 1307 } 1333 1308 1334 - static void vmbus_isr(void) 1309 + static void vmbus_message_sched(struct hv_per_cpu_context *hv_cpu, void *message_page_addr) 1335 1310 { 1336 - struct hv_per_cpu_context *hv_cpu 1337 - = this_cpu_ptr(hv_context.cpu_context); 1338 - void *page_addr; 1339 1311 struct hv_message *msg; 1340 1312 1341 - vmbus_chan_sched(hv_cpu); 1342 - 1343 - page_addr = hv_cpu->synic_message_page; 1344 - msg = (struct hv_message *)page_addr + VMBUS_MESSAGE_SINT; 1313 + if (!message_page_addr) 1314 + return; 1315 + msg = (struct hv_message *)message_page_addr + VMBUS_MESSAGE_SINT; 1345 1316 1346 1317 /* Check if there are actual msgs to be processed */ 1347 1318 if (msg->header.message_type != HVMSG_NONE) { 1348 1319 if (msg->header.message_type == HVMSG_TIMER_EXPIRED) { 1349 1320 hv_stimer0_isr(); 1350 1321 vmbus_signal_eom(msg, HVMSG_TIMER_EXPIRED); 1351 - } else 1322 + } else { 1352 1323 tasklet_schedule(&hv_cpu->msg_dpc); 1324 + } 1353 1325 } 1326 + } 1327 + 1328 + void vmbus_isr(void) 1329 + { 1330 + struct hv_per_cpu_context *hv_cpu 1331 + = this_cpu_ptr(hv_context.cpu_context); 1332 + 1333 + vmbus_chan_sched(hv_cpu->hyp_synic_event_page); 1334 + vmbus_chan_sched(hv_cpu->para_synic_event_page); 1335 + 1336 + vmbus_message_sched(hv_cpu, hv_cpu->hyp_synic_message_page); 1337 + vmbus_message_sched(hv_cpu, hv_cpu->para_synic_message_page); 1354 1338 1355 1339 add_interrupt_randomness(vmbus_interrupt); 1356 1340 } 1341 + EXPORT_SYMBOL_FOR_MODULES(vmbus_isr, "mshv_vtl"); 1357 1342 1358 1343 static irqreturn_t vmbus_percpu_isr(int irq, void *dev_id) 1359 1344 { ··· 1378 1343 hv_synic_init(cpu); 1379 1344 } 1380 1345 1381 - /* 1382 - * vmbus_bus_init -Main vmbus driver initialization routine. 
1383 - * 1384 - * Here, we 1385 - * - initialize the vmbus driver context 1386 - * - invoke the vmbus hv main init routine 1387 - * - retrieve the channel offers 1388 - */ 1389 - static int vmbus_bus_init(void) 1346 + static int vmbus_alloc_synic_and_connect(void) 1390 1347 { 1391 1348 int ret, cpu; 1392 1349 struct work_struct __percpu *works; 1393 - 1394 - ret = hv_init(); 1395 - if (ret != 0) { 1396 - pr_err("Unable to initialize the hypervisor - 0x%x\n", ret); 1397 - return ret; 1398 - } 1399 - 1400 - ret = bus_register(&hv_bus); 1401 - if (ret) 1402 - return ret; 1403 - 1404 - /* 1405 - * VMbus interrupts are best modeled as per-cpu interrupts. If 1406 - * on an architecture with support for per-cpu IRQs (e.g. ARM64), 1407 - * allocate a per-cpu IRQ using standard Linux kernel functionality. 1408 - * If not on such an architecture (e.g., x86/x64), then rely on 1409 - * code in the arch-specific portion of the code tree to connect 1410 - * the VMbus interrupt handler. 1411 - */ 1412 - 1413 - if (vmbus_irq == -1) { 1414 - hv_setup_vmbus_handler(vmbus_isr); 1415 - } else { 1416 - vmbus_evt = alloc_percpu(long); 1417 - ret = request_percpu_irq(vmbus_irq, vmbus_percpu_isr, 1418 - "Hyper-V VMbus", vmbus_evt); 1419 - if (ret) { 1420 - pr_err("Can't request Hyper-V VMbus IRQ %d, Err %d", 1421 - vmbus_irq, ret); 1422 - free_percpu(vmbus_evt); 1423 - goto err_setup; 1424 - } 1425 - } 1350 + int hyperv_cpuhp_online; 1426 1351 1427 1352 ret = hv_synic_alloc(); 1428 - if (ret) 1353 + if (ret < 0) 1429 1354 goto err_alloc; 1430 1355 1431 1356 works = alloc_percpu(struct work_struct); ··· 1421 1426 ret = vmbus_connect(); 1422 1427 if (ret) 1423 1428 goto err_connect; 1429 + return 0; 1430 + 1431 + err_connect: 1432 + cpuhp_remove_state(hyperv_cpuhp_online); 1433 + return -ENODEV; 1434 + err_alloc: 1435 + hv_synic_free(); 1436 + return -ENOMEM; 1437 + } 1438 + 1439 + /* 1440 + * vmbus_bus_init -Main vmbus driver initialization routine. 
1441 + * 1442 + * Here, we 1443 + * - initialize the vmbus driver context 1444 + * - invoke the vmbus hv main init routine 1445 + * - retrieve the channel offers 1446 + */ 1447 + static int vmbus_bus_init(void) 1448 + { 1449 + int ret; 1450 + 1451 + ret = hv_init(); 1452 + if (ret != 0) { 1453 + pr_err("Unable to initialize the hypervisor - 0x%x\n", ret); 1454 + return ret; 1455 + } 1456 + 1457 + ret = bus_register(&hv_bus); 1458 + if (ret) 1459 + return ret; 1460 + 1461 + /* 1462 + * VMbus interrupts are best modeled as per-cpu interrupts. If 1463 + * on an architecture with support for per-cpu IRQs (e.g. ARM64), 1464 + * allocate a per-cpu IRQ using standard Linux kernel functionality. 1465 + * If not on such an architecture (e.g., x86/x64), then rely on 1466 + * code in the arch-specific portion of the code tree to connect 1467 + * the VMbus interrupt handler. 1468 + */ 1469 + 1470 + if (vmbus_irq == -1) { 1471 + hv_setup_vmbus_handler(vmbus_isr); 1472 + } else { 1473 + vmbus_evt = alloc_percpu(long); 1474 + ret = request_percpu_irq(vmbus_irq, vmbus_percpu_isr, 1475 + "Hyper-V VMbus", vmbus_evt); 1476 + if (ret) { 1477 + pr_err("Can't request Hyper-V VMbus IRQ %d, Err %d", 1478 + vmbus_irq, ret); 1479 + free_percpu(vmbus_evt); 1480 + goto err_setup; 1481 + } 1482 + } 1483 + 1484 + /* 1485 + * Cache the value as getting it involves a VM exit on x86(_64), and 1486 + * doing that on each VP while initializing SynIC's wastes time. 
1487 + */ 1488 + is_confidential = ms_hyperv.confidential_vmbus_available; 1489 + if (is_confidential) 1490 + pr_info("Establishing connection to the confidential VMBus\n"); 1491 + hv_para_set_sint_proxy(!is_confidential); 1492 + ret = vmbus_alloc_synic_and_connect(); 1493 + if (ret) 1494 + goto err_connect; 1424 1495 1425 1496 /* 1426 1497 * Always register the vmbus unload panic notifier because we ··· 1500 1439 return 0; 1501 1440 1502 1441 err_connect: 1503 - cpuhp_remove_state(hyperv_cpuhp_online); 1504 - err_alloc: 1505 - hv_synic_free(); 1506 1442 if (vmbus_irq == -1) { 1507 1443 hv_remove_vmbus_handler(); 1508 1444 } else { ··· 2856 2798 */ 2857 2799 cpu = smp_processor_id(); 2858 2800 hv_stimer_cleanup(cpu); 2859 - hv_synic_disable_regs(cpu); 2801 + hv_hyp_synic_disable_regs(cpu); 2860 2802 }; 2861 2803 2862 2804 static int hv_synic_suspend(void *data) ··· 2881 2823 * interrupts-disabled context. 2882 2824 */ 2883 2825 2884 - hv_synic_disable_regs(0); 2826 + hv_hyp_synic_disable_regs(0); 2885 2827 2886 2828 return 0; 2887 2829 } 2888 2830 2889 2831 static void hv_synic_resume(void *data) 2890 2832 { 2891 - hv_synic_enable_regs(0); 2833 + hv_hyp_synic_enable_regs(0); 2892 2834 2893 2835 /* 2894 2836 * Note: we don't need to call hv_stimer_init(0), because the timer
+20 -43
include/asm-generic/mshyperv.h
··· 62 62 }; 63 63 }; 64 64 u64 shared_gpa_boundary; 65 + bool msi_ext_dest_id; 66 + bool confidential_vmbus_available; 65 67 }; 66 68 extern struct ms_hyperv_info ms_hyperv; 67 69 extern bool hv_nested; ··· 126 124 127 125 /* 128 126 * Rep hypercalls. Callers of this functions are supposed to ensure that 129 - * rep_count and varhead_size comply with Hyper-V hypercall definition. 127 + * rep_count, varhead_size, and rep_start comply with Hyper-V hypercall 128 + * definition. 130 129 */ 131 - static inline u64 hv_do_rep_hypercall(u16 code, u16 rep_count, u16 varhead_size, 132 - void *input, void *output) 130 + static inline u64 hv_do_rep_hypercall_ex(u16 code, u16 rep_count, 131 + u16 varhead_size, u16 rep_start, 132 + void *input, void *output) 133 133 { 134 134 u64 control = code; 135 135 u64 status; ··· 139 135 140 136 control |= (u64)varhead_size << HV_HYPERCALL_VARHEAD_OFFSET; 141 137 control |= (u64)rep_count << HV_HYPERCALL_REP_COMP_OFFSET; 138 + control |= (u64)rep_start << HV_HYPERCALL_REP_START_OFFSET; 142 139 143 140 do { 144 141 status = hv_do_hypercall(control, input, output); ··· 157 152 return status; 158 153 } 159 154 155 + /* For the typical case where rep_start is 0 */ 156 + static inline u64 hv_do_rep_hypercall(u16 code, u16 rep_count, u16 varhead_size, 157 + void *input, void *output) 158 + { 159 + return hv_do_rep_hypercall_ex(code, rep_count, varhead_size, 0, 160 + input, output); 161 + } 162 + 160 163 /* Generate the guest OS identifier as described in the Hyper-V TLFS */ 161 164 static inline u64 hv_generate_guest_id(u64 kernel_version) 162 165 { ··· 175 162 176 163 return guest_id; 177 164 } 178 - 179 - #if IS_ENABLED(CONFIG_HYPERV_VMBUS) 180 - /* Free the message slot and signal end-of-message if required */ 181 - static inline void vmbus_signal_eom(struct hv_message *msg, u32 old_msg_type) 182 - { 183 - /* 184 - * On crash we're reading some other CPU's message page and we need 185 - * to be careful: this other CPU may already had cleared 
the header 186 - * and the host may already had delivered some other message there. 187 - * In case we blindly write msg->header.message_type we're going 188 - * to lose it. We can still lose a message of the same type but 189 - * we count on the fact that there can only be one 190 - * CHANNELMSG_UNLOAD_RESPONSE and we don't care about other messages 191 - * on crash. 192 - */ 193 - if (cmpxchg(&msg->header.message_type, old_msg_type, 194 - HVMSG_NONE) != old_msg_type) 195 - return; 196 - 197 - /* 198 - * The cmxchg() above does an implicit memory barrier to 199 - * ensure the write to MessageType (ie set to 200 - * HVMSG_NONE) happens before we read the 201 - * MessagePending and EOMing. Otherwise, the EOMing 202 - * will not deliver any more messages since there is 203 - * no empty slot 204 - */ 205 - if (msg->header.message_flags.msg_pending) { 206 - /* 207 - * This will cause message queue rescan to 208 - * possibly deliver another msg from the 209 - * hypervisor 210 - */ 211 - hv_set_msr(HV_MSR_EOM, 0); 212 - } 213 - } 214 - 215 - extern int vmbus_interrupt; 216 - extern int vmbus_irq; 217 - #endif /* CONFIG_HYPERV_VMBUS */ 218 165 219 166 int hv_get_hypervisor_version(union hv_hypervisor_version_info *info); 220 167 ··· 309 336 bool hv_isolation_type_snp(void); 310 337 u64 hv_ghcb_hypercall(u64 control, void *input, void *output, u32 input_size); 311 338 u64 hv_tdx_hypercall(u64 control, u64 param1, u64 param2); 339 + void hv_enable_coco_interrupt(unsigned int cpu, unsigned int vector, bool set); 340 + void hv_para_set_sint_proxy(bool enable); 341 + u64 hv_para_get_synic_register(unsigned int reg); 342 + void hv_para_set_synic_register(unsigned int reg, u64 val); 312 343 void hyperv_cleanup(void); 313 344 bool hv_query_ext_cap(u64 cap_query); 314 345 void hv_setup_dma_ops(struct device *dev, bool coherent);
+114 -1
include/hyperv/hvgdk_mini.h
··· 260 260 #define HYPERV_CPUID_VIRT_STACK_PROPERTIES 0x40000082 261 261 /* Support for the extended IOAPIC RTE format */ 262 262 #define HYPERV_VS_PROPERTIES_EAX_EXTENDED_IOAPIC_RTE BIT(2) 263 + #define HYPERV_VS_PROPERTIES_EAX_CONFIDENTIAL_VMBUS_AVAILABLE BIT(3) 263 264 264 265 #define HYPERV_HYPERVISOR_PRESENT_BIT 0x80000000 265 266 #define HYPERV_CPUID_MIN 0x40000005 ··· 465 464 #define HVCALL_RESET_DEBUG_SESSION 0x006b 466 465 #define HVCALL_MAP_STATS_PAGE 0x006c 467 466 #define HVCALL_UNMAP_STATS_PAGE 0x006d 467 + #define HVCALL_SET_SYSTEM_PROPERTY 0x006f 468 468 #define HVCALL_ADD_LOGICAL_PROCESSOR 0x0076 469 469 #define HVCALL_GET_SYSTEM_PROPERTY 0x007b 470 470 #define HVCALL_MAP_DEVICE_INTERRUPT 0x007c 471 471 #define HVCALL_UNMAP_DEVICE_INTERRUPT 0x007d 472 472 #define HVCALL_RETARGET_INTERRUPT 0x007e 473 + #define HVCALL_NOTIFY_PARTITION_EVENT 0x0087 474 + #define HVCALL_ENTER_SLEEP_STATE 0x0084 473 475 #define HVCALL_NOTIFY_PORT_RING_EMPTY 0x008b 474 476 #define HVCALL_REGISTER_INTERCEPT_RESULT 0x0091 475 477 #define HVCALL_ASSERT_VIRTUAL_INTERRUPT 0x0094 476 478 #define HVCALL_CREATE_PORT 0x0095 477 479 #define HVCALL_CONNECT_PORT 0x0096 478 480 #define HVCALL_START_VP 0x0099 479 - #define HVCALL_GET_VP_INDEX_FROM_APIC_ID 0x009a 481 + #define HVCALL_GET_VP_INDEX_FROM_APIC_ID 0x009a 480 482 #define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_SPACE 0x00af 481 483 #define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_LIST 0x00b0 482 484 #define HVCALL_SIGNAL_EVENT_DIRECT 0x00c0 ··· 494 490 #define HVCALL_GET_VP_STATE 0x00e3 495 491 #define HVCALL_SET_VP_STATE 0x00e4 496 492 #define HVCALL_GET_VP_CPUID_VALUES 0x00f4 493 + #define HVCALL_GET_PARTITION_PROPERTY_EX 0x0101 497 494 #define HVCALL_MMIO_READ 0x0106 498 495 #define HVCALL_MMIO_WRITE 0x0107 496 + #define HVCALL_DISABLE_HYP_EX 0x010f 497 + #define HVCALL_MAP_STATS_PAGE2 0x0131 499 498 500 499 /* HV_HYPERCALL_INPUT */ 501 500 #define HV_HYPERCALL_RESULT_MASK GENMASK_ULL(15, 0) ··· 887 880 u32 apic_ids[]; 888 881 } 
__packed; 889 882 883 + union hv_register_vsm_partition_config { 884 + u64 as_uint64; 885 + struct { 886 + u64 enable_vtl_protection : 1; 887 + u64 default_vtl_protection_mask : 4; 888 + u64 zero_memory_on_reset : 1; 889 + u64 deny_lower_vtl_startup : 1; 890 + u64 intercept_acceptance : 1; 891 + u64 intercept_enable_vtl_protection : 1; 892 + u64 intercept_vp_startup : 1; 893 + u64 intercept_cpuid_unimplemented : 1; 894 + u64 intercept_unrecoverable_exception : 1; 895 + u64 intercept_page : 1; 896 + u64 mbz : 51; 897 + } __packed; 898 + }; 899 + 900 + union hv_register_vsm_capabilities { 901 + u64 as_uint64; 902 + struct { 903 + u64 dr6_shared: 1; 904 + u64 mbec_vtl_mask: 16; 905 + u64 deny_lower_vtl_startup: 1; 906 + u64 supervisor_shadow_stack: 1; 907 + u64 hardware_hvpt_available: 1; 908 + u64 software_hvpt_available: 1; 909 + u64 hardware_hvpt_range_bits: 6; 910 + u64 intercept_page_available: 1; 911 + u64 return_action_available: 1; 912 + u64 reserved: 35; 913 + } __packed; 914 + }; 915 + 916 + union hv_register_vsm_page_offsets { 917 + struct { 918 + u64 vtl_call_offset : 12; 919 + u64 vtl_return_offset : 12; 920 + u64 reserved_mbz : 40; 921 + } __packed; 922 + u64 as_uint64; 923 + }; 924 + 890 925 struct hv_nested_enlightenments_control { 891 926 struct { 892 927 u32 directhypercall : 1; ··· 1051 1002 1052 1003 /* VSM */ 1053 1004 HV_REGISTER_VSM_VP_STATUS = 0x000D0003, 1005 + 1006 + /* Synthetic VSM registers */ 1007 + HV_REGISTER_VSM_CODE_PAGE_OFFSETS = 0x000D0002, 1008 + HV_REGISTER_VSM_CAPABILITIES = 0x000D0006, 1009 + HV_REGISTER_VSM_PARTITION_CONFIG = 0x000D0007, 1010 + 1011 + #if defined(CONFIG_X86) 1012 + /* X64 Debug Registers */ 1013 + HV_X64_REGISTER_DR0 = 0x00050000, 1014 + HV_X64_REGISTER_DR1 = 0x00050001, 1015 + HV_X64_REGISTER_DR2 = 0x00050002, 1016 + HV_X64_REGISTER_DR3 = 0x00050003, 1017 + HV_X64_REGISTER_DR6 = 0x00050004, 1018 + HV_X64_REGISTER_DR7 = 0x00050005, 1019 + 1020 + /* X64 Cache control MSRs */ 1021 + HV_X64_REGISTER_MSR_MTRR_CAP = 
0x0008000D, 1022 + HV_X64_REGISTER_MSR_MTRR_DEF_TYPE = 0x0008000E, 1023 + HV_X64_REGISTER_MSR_MTRR_PHYS_BASE0 = 0x00080010, 1024 + HV_X64_REGISTER_MSR_MTRR_PHYS_BASE1 = 0x00080011, 1025 + HV_X64_REGISTER_MSR_MTRR_PHYS_BASE2 = 0x00080012, 1026 + HV_X64_REGISTER_MSR_MTRR_PHYS_BASE3 = 0x00080013, 1027 + HV_X64_REGISTER_MSR_MTRR_PHYS_BASE4 = 0x00080014, 1028 + HV_X64_REGISTER_MSR_MTRR_PHYS_BASE5 = 0x00080015, 1029 + HV_X64_REGISTER_MSR_MTRR_PHYS_BASE6 = 0x00080016, 1030 + HV_X64_REGISTER_MSR_MTRR_PHYS_BASE7 = 0x00080017, 1031 + HV_X64_REGISTER_MSR_MTRR_PHYS_BASE8 = 0x00080018, 1032 + HV_X64_REGISTER_MSR_MTRR_PHYS_BASE9 = 0x00080019, 1033 + HV_X64_REGISTER_MSR_MTRR_PHYS_BASEA = 0x0008001A, 1034 + HV_X64_REGISTER_MSR_MTRR_PHYS_BASEB = 0x0008001B, 1035 + HV_X64_REGISTER_MSR_MTRR_PHYS_BASEC = 0x0008001C, 1036 + HV_X64_REGISTER_MSR_MTRR_PHYS_BASED = 0x0008001D, 1037 + HV_X64_REGISTER_MSR_MTRR_PHYS_BASEE = 0x0008001E, 1038 + HV_X64_REGISTER_MSR_MTRR_PHYS_BASEF = 0x0008001F, 1039 + HV_X64_REGISTER_MSR_MTRR_PHYS_MASK0 = 0x00080040, 1040 + HV_X64_REGISTER_MSR_MTRR_PHYS_MASK1 = 0x00080041, 1041 + HV_X64_REGISTER_MSR_MTRR_PHYS_MASK2 = 0x00080042, 1042 + HV_X64_REGISTER_MSR_MTRR_PHYS_MASK3 = 0x00080043, 1043 + HV_X64_REGISTER_MSR_MTRR_PHYS_MASK4 = 0x00080044, 1044 + HV_X64_REGISTER_MSR_MTRR_PHYS_MASK5 = 0x00080045, 1045 + HV_X64_REGISTER_MSR_MTRR_PHYS_MASK6 = 0x00080046, 1046 + HV_X64_REGISTER_MSR_MTRR_PHYS_MASK7 = 0x00080047, 1047 + HV_X64_REGISTER_MSR_MTRR_PHYS_MASK8 = 0x00080048, 1048 + HV_X64_REGISTER_MSR_MTRR_PHYS_MASK9 = 0x00080049, 1049 + HV_X64_REGISTER_MSR_MTRR_PHYS_MASKA = 0x0008004A, 1050 + HV_X64_REGISTER_MSR_MTRR_PHYS_MASKB = 0x0008004B, 1051 + HV_X64_REGISTER_MSR_MTRR_PHYS_MASKC = 0x0008004C, 1052 + HV_X64_REGISTER_MSR_MTRR_PHYS_MASKD = 0x0008004D, 1053 + HV_X64_REGISTER_MSR_MTRR_PHYS_MASKE = 0x0008004E, 1054 + HV_X64_REGISTER_MSR_MTRR_PHYS_MASKF = 0x0008004F, 1055 + HV_X64_REGISTER_MSR_MTRR_FIX64K00000 = 0x00080070, 1056 + HV_X64_REGISTER_MSR_MTRR_FIX16K80000 = 
0x00080071, 1057 + HV_X64_REGISTER_MSR_MTRR_FIX16KA0000 = 0x00080072, 1058 + HV_X64_REGISTER_MSR_MTRR_FIX4KC0000 = 0x00080073, 1059 + HV_X64_REGISTER_MSR_MTRR_FIX4KC8000 = 0x00080074, 1060 + HV_X64_REGISTER_MSR_MTRR_FIX4KD0000 = 0x00080075, 1061 + HV_X64_REGISTER_MSR_MTRR_FIX4KD8000 = 0x00080076, 1062 + HV_X64_REGISTER_MSR_MTRR_FIX4KE0000 = 0x00080077, 1063 + HV_X64_REGISTER_MSR_MTRR_FIX4KE8000 = 0x00080078, 1064 + HV_X64_REGISTER_MSR_MTRR_FIX4KF0000 = 0x00080079, 1065 + HV_X64_REGISTER_MSR_MTRR_FIX4KF8000 = 0x0008007A, 1066 + 1067 + HV_X64_REGISTER_REG_PAGE = 0x0009001C, 1068 + #endif 1054 1069 }; 1055 1070 1056 1071 /*
+46
include/hyperv/hvhdk.h
··· 376 376 u64 property_value; 377 377 } __packed; 378 378 379 + union hv_partition_property_arg { 380 + u64 as_uint64; 381 + struct { 382 + union { 383 + u32 arg; 384 + u32 vp_index; 385 + }; 386 + u16 reserved0; 387 + u8 reserved1; 388 + u8 object_type; 389 + } __packed; 390 + }; 391 + 392 + struct hv_input_get_partition_property_ex { 393 + u64 partition_id; 394 + u32 property_code; /* enum hv_partition_property_code */ 395 + u32 padding; 396 + union { 397 + union hv_partition_property_arg arg_data; 398 + u64 arg; 399 + }; 400 + } __packed; 401 + 402 + /* 403 + * NOTE: Should use hv_input_set_partition_property_ex_header to compute this 404 + * size, but hv_input_get_partition_property_ex is identical so it suffices 405 + */ 406 + #define HV_PARTITION_PROPERTY_EX_MAX_VAR_SIZE \ 407 + (HV_HYP_PAGE_SIZE - sizeof(struct hv_input_get_partition_property_ex)) 408 + 409 + union hv_partition_property_ex { 410 + u8 buffer[HV_PARTITION_PROPERTY_EX_MAX_VAR_SIZE]; 411 + struct hv_partition_property_vmm_capabilities vmm_capabilities; 412 + /* More fields to be filled in when needed */ 413 + }; 414 + 415 + struct hv_output_get_partition_property_ex { 416 + union hv_partition_property_ex property_value; 417 + } __packed; 418 + 379 419 enum hv_vp_state_page_type { 380 420 HV_VP_STATE_PAGE_REGISTERS = 0, 381 421 HV_VP_STATE_PAGE_INTERCEPT_MESSAGE = 1, ··· 579 539 u64 as_uint64; 580 540 struct { 581 541 u32 interrupt_type; /* enum hv_interrupt_type */ 542 + #if IS_ENABLED(CONFIG_X86) 582 543 u32 level_triggered : 1; 583 544 u32 logical_dest_mode : 1; 584 545 u32 rsvd : 30; 546 + #elif IS_ENABLED(CONFIG_ARM64) 547 + u32 rsvd1 : 2; 548 + u32 asserted : 1; 549 + u32 rsvd2 : 29; 550 + #endif 585 551 } __packed; 586 552 }; 587 553
+128
include/hyperv/hvhdk_mini.h
···
 	HV_PARTITION_PROPERTY_XSAVE_STATES = 0x00060007,
 	HV_PARTITION_PROPERTY_MAX_XSAVE_DATA_SIZE = 0x00060008,
 	HV_PARTITION_PROPERTY_PROCESSOR_CLOCK_FREQUENCY = 0x00060009,
+
+	/* Extended properties with larger property values */
+	HV_PARTITION_PROPERTY_VMM_CAPABILITIES = 0x00090007,
 };
+
+#define HV_PARTITION_VMM_CAPABILITIES_BANK_COUNT 1
+#define HV_PARTITION_VMM_CAPABILITIES_RESERVED_BITFIELD_COUNT 59
+
+struct hv_partition_property_vmm_capabilities {
+	u16 bank_count;
+	u16 reserved[3];
+	union {
+		u64 as_uint64[HV_PARTITION_VMM_CAPABILITIES_BANK_COUNT];
+		struct {
+			u64 map_gpa_preserve_adjustable: 1;
+			u64 vmm_can_provide_overlay_gpfn: 1;
+			u64 vp_affinity_property: 1;
+#if IS_ENABLED(CONFIG_ARM64)
+			u64 vmm_can_provide_gic_overlay_locations: 1;
+#else
+			u64 reservedbit3: 1;
+#endif
+			u64 assignable_synthetic_proc_features: 1;
+			u64 reserved0: HV_PARTITION_VMM_CAPABILITIES_RESERVED_BITFIELD_COUNT;
+		} __packed;
+	};
+} __packed;
 
 enum hv_snp_status {
 	HV_SNP_STATUS_NONE = 0,
···
 
 enum hv_system_property {
 	/* Add more values when needed */
+	HV_SYSTEM_PROPERTY_SLEEP_STATE = 3,
 	HV_SYSTEM_PROPERTY_SCHEDULER_TYPE = 15,
 	HV_DYNAMIC_PROCESSOR_FEATURE_PROPERTY = 21,
+	HV_SYSTEM_PROPERTY_CRASHDUMPAREA = 47,
+};
+
+#define HV_PFN_RANGE_PGBITS 24 /* HV_SPA_PAGE_RANGE_ADDITIONAL_PAGES_BITS */
+union hv_pfn_range { /* HV_SPA_PAGE_RANGE */
+	u64 as_uint64;
+	struct {
+		/* 39:0: base pfn.  63:40: additional pages */
+		u64 base_pfn : 64 - HV_PFN_RANGE_PGBITS;
+		u64 add_pfns : HV_PFN_RANGE_PGBITS;
+	} __packed;
+};
+
+enum hv_sleep_state {
+	HV_SLEEP_STATE_S1 = 1,
+	HV_SLEEP_STATE_S2 = 2,
+	HV_SLEEP_STATE_S3 = 3,
+	HV_SLEEP_STATE_S4 = 4,
+	HV_SLEEP_STATE_S5 = 5,
+	/*
+	 * After hypervisor has received this, any follow up sleep
+	 * state registration requests will be rejected.
+	 */
+	HV_SLEEP_STATE_LOCK = 6
 };
 
 enum hv_dynamic_processor_feature_property {
···
 #if IS_ENABLED(CONFIG_X86)
 		u64 hv_processor_feature_value;
 #endif
+		union hv_pfn_range hv_cda_info; /* CrashdumpAreaAddress */
+		u64 hv_tramp_pa; /* CrashdumpTrampolineAddress */
 	};
+} __packed;
+
+struct hv_sleep_state_info {
+	u32 sleep_state; /* enum hv_sleep_state */
+	u8 pm1a_slp_typ;
+	u8 pm1b_slp_typ;
+} __packed;
+
+struct hv_input_set_system_property {
+	u32 property_id; /* enum hv_system_property */
+	u32 reserved;
+	union {
+		/* More fields to be filled in when needed */
+		struct hv_sleep_state_info set_sleep_state_info;
+
+		/*
+		 * Add a reserved field to ensure the union is 8-byte aligned as
+		 * existing members may not be. This is a temporary measure
+		 * until all remaining members are added.
+		 */
+		u64 reserved0[8];
+	};
+} __packed;
+
+struct hv_input_enter_sleep_state { /* HV_INPUT_ENTER_SLEEP_STATE */
+	u32 sleep_state; /* enum hv_sleep_state */
 } __packed;
 
 struct hv_input_map_stats_page {
 	u32 type; /* enum hv_stats_object_type */
 	u32 padding;
 	union hv_stats_object_identity identity;
+} __packed;
+
+struct hv_input_map_stats_page2 {
+	u32 type; /* enum hv_stats_object_type */
+	u32 padding;
+	union hv_stats_object_identity identity;
+	u64 map_location;
 } __packed;
 
 struct hv_output_map_stats_page {
···
 		u8 reserved: 6;
 	};
 	u8 as_uint8;
+} __packed;
+
+enum hv_crashdump_action {
+	HV_CRASHDUMP_NONE = 0,
+	HV_CRASHDUMP_SUSPEND_ALL_VPS,
+	HV_CRASHDUMP_PREPARE_FOR_STATE_SAVE,
+	HV_CRASHDUMP_STATE_SAVED,
+	HV_CRASHDUMP_ENTRY,
+};
+
+struct hv_partition_event_root_crashdump_input {
+	u32 crashdump_action; /* enum hv_crashdump_action */
+} __packed;
+
+struct hv_input_disable_hyp_ex { /* HV_X64_INPUT_DISABLE_HYPERVISOR_EX */
+	u64 rip;
+	u64 arg;
+} __packed;
+
+struct hv_crashdump_area { /* HV_CRASHDUMP_AREA */
+	u32 version;
+	union {
+		u32 flags_as_uint32;
+		struct {
+			u32 cda_valid : 1;
+			u32 cda_unused : 31;
+		} __packed;
+	};
+	/* more unused fields */
+} __packed;
+
+union hv_partition_event_input {
+	struct hv_partition_event_root_crashdump_input crashdump_input;
+};
+
+enum hv_partition_event {
+	HV_PARTITION_EVENT_ROOT_CRASHDUMP = 2,
+};
+
+struct hv_input_notify_partition_event {
+	u32 event; /* enum hv_partition_event */
+	union hv_partition_event_input input;
 } __packed;
 
 struct hv_lp_startup_status {
+4 -4
include/linux/compiler_types.h
···
 #define __has_builtin(x) (0)
 #endif
 
+/* Indirect macros required for expanded argument pasting, eg. __LINE__. */
+#define ___PASTE(a, b) a##b
+#define __PASTE(a, b) ___PASTE(a, b)
+
 #ifndef __ASSEMBLY__
 
 /*
···
 # define ACCESS_PRIVATE(p, member) ((p)->member)
 # define __builtin_warning(x, y...) (1)
 #endif /* __CHECKER__ */
-
-/* Indirect macros required for expanded argument pasting, eg. __LINE__. */
-#define ___PASTE(a,b) a##b
-#define __PASTE(a,b) ___PASTE(a,b)
 
 #ifdef __KERNEL__
 
+50 -19
include/linux/hyperv.h
···
  * Linux kernel.
  */
 
-#define VERSION_WS2008 ((0 << 16) | (13))
-#define VERSION_WIN7 ((1 << 16) | (1))
-#define VERSION_WIN8 ((2 << 16) | (4))
-#define VERSION_WIN8_1 ((3 << 16) | (0))
-#define VERSION_WIN10 ((4 << 16) | (0))
-#define VERSION_WIN10_V4_1 ((4 << 16) | (1))
-#define VERSION_WIN10_V5 ((5 << 16) | (0))
-#define VERSION_WIN10_V5_1 ((5 << 16) | (1))
-#define VERSION_WIN10_V5_2 ((5 << 16) | (2))
-#define VERSION_WIN10_V5_3 ((5 << 16) | (3))
+#define VMBUS_MAKE_VERSION(MAJ, MIN) ((((u32)MAJ) << 16) | (MIN))
+#define VERSION_WS2008 VMBUS_MAKE_VERSION(0, 13)
+#define VERSION_WIN7 VMBUS_MAKE_VERSION(1, 1)
+#define VERSION_WIN8 VMBUS_MAKE_VERSION(2, 4)
+#define VERSION_WIN8_1 VMBUS_MAKE_VERSION(3, 0)
+#define VERSION_WIN10 VMBUS_MAKE_VERSION(4, 0)
+#define VERSION_WIN10_V4_1 VMBUS_MAKE_VERSION(4, 1)
+#define VERSION_WIN10_V5 VMBUS_MAKE_VERSION(5, 0)
+#define VERSION_WIN10_V5_1 VMBUS_MAKE_VERSION(5, 1)
+#define VERSION_WIN10_V5_2 VMBUS_MAKE_VERSION(5, 2)
+#define VERSION_WIN10_V5_3 VMBUS_MAKE_VERSION(5, 3)
+#define VERSION_WIN10_V6_0 VMBUS_MAKE_VERSION(6, 0)
 
 /* Make maximum size of pipe payload of 16K */
 #define MAX_PIPE_DATA_PAYLOAD (sizeof(u8) * 16384)
···
 } __packed;
 
 /* Server Flags */
-#define VMBUS_CHANNEL_ENUMERATE_DEVICE_INTERFACE 1
-#define VMBUS_CHANNEL_SERVER_SUPPORTS_TRANSFER_PAGES 2
-#define VMBUS_CHANNEL_SERVER_SUPPORTS_GPADLS 4
-#define VMBUS_CHANNEL_NAMED_PIPE_MODE 0x10
-#define VMBUS_CHANNEL_LOOPBACK_OFFER 0x100
-#define VMBUS_CHANNEL_PARENT_OFFER 0x200
-#define VMBUS_CHANNEL_REQUEST_MONITORED_NOTIFICATION 0x400
-#define VMBUS_CHANNEL_TLNPI_PROVIDER_OFFER 0x2000
+#define VMBUS_CHANNEL_ENUMERATE_DEVICE_INTERFACE 0x0001
+/*
+ * This flag indicates that the channel is offered by the paravisor, and must
+ * use encrypted memory for the channel ring buffer.
+ */
+#define VMBUS_CHANNEL_CONFIDENTIAL_RING_BUFFER 0x0002
+/*
+ * This flag indicates that the channel is offered by the paravisor, and must
+ * use encrypted memory for GPA direct packets and additional GPADLs.
+ */
+#define VMBUS_CHANNEL_CONFIDENTIAL_EXTERNAL_MEMORY 0x0004
+#define VMBUS_CHANNEL_NAMED_PIPE_MODE 0x0010
+#define VMBUS_CHANNEL_LOOPBACK_OFFER 0x0100
+#define VMBUS_CHANNEL_PARENT_OFFER 0x0200
+#define VMBUS_CHANNEL_REQUEST_MONITORED_NOTIFICATION 0x0400
+#define VMBUS_CHANNEL_TLNPI_PROVIDER_OFFER 0x2000
 
 struct vmpacket_descriptor {
 	u16 type;
···
 	u32 child_relid;
 } __packed;
 
+/*
+ * Used by the paravisor only, means that the encrypted ring buffers and
+ * the encrypted external memory are supported
+ */
+#define VMBUS_FEATURE_FLAG_CONFIDENTIAL_CHANNELS 0x10
+
 struct vmbus_channel_initiate_contact {
 	struct vmbus_channel_message_header header;
 	u32 vmbus_version_requested;
···
 		struct {
 			u8 msg_sint;
 			u8 msg_vtl;
-			u8 reserved[6];
+			u8 reserved[2];
+			u32 feature_flags; /* VMBus version 6.0 */
 		};
 	};
 	u64 monitor_page1;
···
 
 	/* boolean to control visibility of sysfs for ring buffer */
 	bool ring_sysfs_visible;
+	/* The ring buffer is encrypted */
+	bool co_ring_buffer;
+	/* The external memory is encrypted */
+	bool co_external_memory;
 };
 
 #define lock_requestor(channel, flags) \
···
 u64 vmbus_request_addr_match(struct vmbus_channel *channel, u64 trans_id,
 			     u64 rqst_addr);
 u64 vmbus_request_addr(struct vmbus_channel *channel, u64 trans_id);
+
+static inline bool is_co_ring_buffer(const struct vmbus_channel_offer_channel *o)
+{
+	return !!(o->offer.chn_flags & VMBUS_CHANNEL_CONFIDENTIAL_RING_BUFFER);
+}
+
+static inline bool is_co_external_memory(const struct vmbus_channel_offer_channel *o)
+{
+	return !!(o->offer.chn_flags & VMBUS_CHANNEL_CONFIDENTIAL_EXTERNAL_MEMORY);
+}
 
 static inline bool is_hvsock_offer(const struct vmbus_channel_offer_channel *o)
 {
+4
include/linux/static_call_types.h
···
 #define STATIC_CALL_SITE_INIT 2UL /* init section */
 #define STATIC_CALL_SITE_FLAGS 3UL
 
+#ifndef __ASSEMBLY__
+
 /*
  * The static call site table needs to be created by external tooling (objtool
  * or a compiler plugin).
···
 	((typeof(STATIC_CALL_TRAMP(name))*)(STATIC_CALL_KEY(name).func))
 
 #endif /* CONFIG_HAVE_STATIC_CALL */
+
+#endif /* __ASSEMBLY__ */
 
 #endif /* _STATIC_CALL_TYPES_H */
+115 -1
include/uapi/linux/mshv.h
···
 	MSHV_PT_BIT_LAPIC,
 	MSHV_PT_BIT_X2APIC,
 	MSHV_PT_BIT_GPA_SUPER_PAGES,
+	MSHV_PT_BIT_CPU_AND_XSAVE_FEATURES,
 	MSHV_PT_BIT_COUNT,
 };
 
···
  * @pt_flags: Bitmask of 1 << MSHV_PT_BIT_*
  * @pt_isolation: MSHV_PT_ISOLATION_*
  *
+ * This is the initial/v1 version for backward compatibility.
+ *
  * Returns a file descriptor to act as a handle to a guest partition.
  * At this point the partition is not yet initialized in the hypervisor.
  * Some operations must be done with the partition in this state, e.g. setting
···
 	__u64 pt_flags;
 	__u64 pt_isolation;
 };
+
+#define MSHV_NUM_CPU_FEATURES_BANKS 2
+
+/**
+ * struct mshv_create_partition_v2
+ *
+ * This is extended version of the above initial MSHV_CREATE_PARTITION
+ * ioctl and allows for following additional parameters:
+ *
+ * @pt_num_cpu_fbanks: Must be set to MSHV_NUM_CPU_FEATURES_BANKS.
+ * @pt_cpu_fbanks: Disabled processor feature banks array.
+ * @pt_disabled_xsave: Disabled xsave feature bits.
+ *
+ * pt_cpu_fbanks and pt_disabled_xsave are passed through as-is to the create
+ * partition hypercall.
+ *
+ * Returns : same as above original mshv_create_partition
+ */
+struct mshv_create_partition_v2 {
+	__u64 pt_flags;
+	__u64 pt_isolation;
+	__u16 pt_num_cpu_fbanks;
+	__u8 pt_rsvd[6]; /* MBZ */
+	__u64 pt_cpu_fbanks[MSHV_NUM_CPU_FEATURES_BANKS];
+	__u64 pt_rsvd1[2]; /* MBZ */
+#if defined(__x86_64__)
+	__u64 pt_disabled_xsave;
+#else
+	__u64 pt_rsvd2; /* MBZ */
+#endif
+} __packed;
 
 /* /dev/mshv */
 #define MSHV_CREATE_PARTITION _IOW(MSHV_IOCTL, 0x00, struct mshv_create_partition)
···
  * @rsvd: MBZ
  *
  * Map or unmap a region of userspace memory to Guest Physical Addresses (GPA).
- * Mappings can't overlap in GPA space or userspace.
+ * Mappings can't overlap in GPA space.
  * To unmap, these fields must match an existing mapping.
  */
 struct mshv_user_mem_region {
···
  * #define MSHV_ROOT_HVCALL _IOWR(MSHV_IOCTL, 0x07, struct mshv_root_hvcall)
  */
 
+/* Structure definitions, macros and IOCTLs for mshv_vtl */
+
+#define MSHV_CAP_CORE_API_STABLE 0x0
+#define MSHV_CAP_REGISTER_PAGE 0x1
+#define MSHV_CAP_VTL_RETURN_ACTION 0x2
+#define MSHV_CAP_DR6_SHARED 0x3
+#define MSHV_MAX_RUN_MSG_SIZE 256
+
+struct mshv_vp_registers {
+	__u32 count; /* supports only 1 register at a time */
+	__u32 reserved; /* Reserved for alignment or future use */
+	__u64 regs_ptr; /* pointer to struct hv_register_assoc */
+};
+
+struct mshv_vtl_set_eventfd {
+	__s32 fd;
+	__u32 flag;
+};
+
+struct mshv_vtl_signal_event {
+	__u32 connection_id;
+	__u32 flag;
+};
+
+struct mshv_vtl_sint_post_msg {
+	__u64 message_type;
+	__u32 connection_id;
+	__u32 payload_size; /* Must not exceed HV_MESSAGE_PAYLOAD_BYTE_COUNT */
+	__u64 payload_ptr; /* pointer to message payload (bytes) */
+};
+
+struct mshv_vtl_ram_disposition {
+	__u64 start_pfn;
+	__u64 last_pfn;
+};
+
+struct mshv_vtl_set_poll_file {
+	__u32 cpu;
+	__u32 fd;
+};
+
+struct mshv_vtl_hvcall_setup {
+	__u64 bitmap_array_size; /* stores number of bytes */
+	__u64 allow_bitmap_ptr;
+};
+
+struct mshv_vtl_hvcall {
+	__u64 control; /* Hypercall control code */
+	__u64 input_size; /* Size of the input data */
+	__u64 input_ptr; /* Pointer to the input struct */
+	__u64 status; /* Status of the hypercall (output) */
+	__u64 output_size; /* Size of the output data */
+	__u64 output_ptr; /* Pointer to the output struct */
+};
+
+struct mshv_sint_mask {
+	__u8 mask;
+	__u8 reserved[7];
+};
+
+/* /dev/mshv device IOCTL */
+#define MSHV_CHECK_EXTENSION _IOW(MSHV_IOCTL, 0x00, __u32)
+
+/* vtl device */
+#define MSHV_CREATE_VTL _IOR(MSHV_IOCTL, 0x1D, char)
+#define MSHV_ADD_VTL0_MEMORY _IOW(MSHV_IOCTL, 0x21, struct mshv_vtl_ram_disposition)
+#define MSHV_SET_POLL_FILE _IOW(MSHV_IOCTL, 0x25, struct mshv_vtl_set_poll_file)
+#define MSHV_RETURN_TO_LOWER_VTL _IO(MSHV_IOCTL, 0x27)
+#define MSHV_GET_VP_REGISTERS _IOWR(MSHV_IOCTL, 0x05, struct mshv_vp_registers)
+#define MSHV_SET_VP_REGISTERS _IOW(MSHV_IOCTL, 0x06, struct mshv_vp_registers)
+
+/* VMBus device IOCTLs */
+#define MSHV_SINT_SIGNAL_EVENT _IOW(MSHV_IOCTL, 0x22, struct mshv_vtl_signal_event)
+#define MSHV_SINT_POST_MESSAGE _IOW(MSHV_IOCTL, 0x23, struct mshv_vtl_sint_post_msg)
+#define MSHV_SINT_SET_EVENTFD _IOW(MSHV_IOCTL, 0x24, struct mshv_vtl_set_eventfd)
+#define MSHV_SINT_PAUSE_MESSAGE_STREAM _IOW(MSHV_IOCTL, 0x25, struct mshv_sint_mask)
+
+/* hv_hvcall device */
+#define MSHV_HVCALL_SETUP _IOW(MSHV_IOCTL, 0x1E, struct mshv_vtl_hvcall_setup)
+#define MSHV_HVCALL _IOWR(MSHV_IOCTL, 0x1F, struct mshv_vtl_hvcall)
 #endif
+4
tools/include/linux/static_call_types.h
···
 #define STATIC_CALL_SITE_INIT 2UL /* init section */
 #define STATIC_CALL_SITE_FLAGS 3UL
 
+#ifndef __ASSEMBLY__
+
 /*
  * The static call site table needs to be created by external tooling (objtool
  * or a compiler plugin).
···
 	((typeof(STATIC_CALL_TRAMP(name))*)(STATIC_CALL_KEY(name).func))
 
 #endif /* CONFIG_HAVE_STATIC_CALL */
+
+#endif /* __ASSEMBLY__ */
 
 #endif /* _STATIC_CALL_TYPES_H */