Documentation: hyperv: Add overview of PCI pass-thru device support

Add documentation topic for PCI pass-thru devices in Linux guests
on Hyper-V and for the associated PCI controller driver (pci-hyperv.c).

Signed-off-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Link: https://lore.kernel.org/r/20240222200710.305259-1-mhklinux@outlook.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <20240222200710.305259-1-mhklinux@outlook.com>

Authored by Michael Kelley, committed by Wei Liu (04ed680e 9645e744)

2 files changed, 317 insertions(+)

Documentation/virt/hyperv/index.rst (+1):

       overview
       vmbus
       clocks
   +   vpci

Documentation/virt/hyperv/vpci.rst (new file, +316):

.. SPDX-License-Identifier: GPL-2.0

PCI pass-thru devices
=====================
In a Hyper-V guest VM, PCI pass-thru devices (also called virtual PCI
devices, or vPCI devices) are physical PCI devices that are mapped
directly into the VM's physical address space.  Guest device drivers
can interact directly with the hardware without intermediation by the
host hypervisor.  This approach provides higher bandwidth access to
the device with lower latency, compared with devices that are
virtualized by the hypervisor.  The device should appear to the guest
just as it would when running on bare metal, so no changes are
required to the Linux device drivers for the device.

Hyper-V terminology for vPCI devices is "Discrete Device Assignment"
(DDA).  Public documentation for Hyper-V DDA is available here: `DDA`_

.. _DDA: https://learn.microsoft.com/en-us/windows-server/virtualization/hyper-v/plan/plan-for-deploying-devices-using-discrete-device-assignment

DDA is typically used for storage controllers, such as NVMe, and for
GPUs.  A similar mechanism for NICs is called SR-IOV and produces the
same benefits by allowing a guest device driver to interact directly
with the hardware.  See Hyper-V public documentation here: `SR-IOV`_

.. _SR-IOV: https://learn.microsoft.com/en-us/windows-hardware/drivers/network/overview-of-single-root-i-o-virtualization--sr-iov-

This discussion of vPCI devices includes DDA and SR-IOV devices.

Device Presentation
-------------------
Hyper-V provides full PCI functionality for a vPCI device when it is
operating, so the Linux device driver for the device can be used
unchanged, provided it uses the correct Linux kernel APIs for
accessing PCI config space and for other integration with Linux.  But
the initial detection of the PCI device and its integration with the
Linux PCI subsystem must use Hyper-V specific mechanisms.
Consequently, vPCI devices on Hyper-V have a dual identity.  They are
initially presented to Linux guests as VMBus devices via the standard
VMBus "offer" mechanism, so they have a VMBus identity and appear
under /sys/bus/vmbus/devices.  The VMBus vPCI driver in Linux at
drivers/pci/controller/pci-hyperv.c handles a newly introduced vPCI
device by fabricating a PCI bus topology and creating all the normal
PCI device data structures in Linux that would exist if the PCI
device were discovered via ACPI on a bare-metal system.  Once those
data structures are set up, the device also has a normal PCI identity
in Linux, and the normal Linux device driver for the vPCI device can
function as if it were running in Linux on bare-metal.  Because vPCI
devices are presented dynamically through the VMBus offer mechanism,
they do not appear in the Linux guest's ACPI tables.  vPCI devices
may be added to a VM or removed from a VM at any time during the life
of the VM, and not just during initial boot.

With this approach, the vPCI device is a VMBus device and a PCI
device at the same time.  In response to the VMBus offer message, the
hv_pci_probe() function runs and establishes a VMBus connection to
the vPCI VSP on the Hyper-V host.  That connection has a single VMBus
channel.  The channel is used to exchange messages with the vPCI VSP
for the purpose of setting up and configuring the vPCI device in
Linux.  Once the device is fully configured in Linux as a PCI device,
the VMBus channel is used only if Linux changes the vCPU to be
interrupted in the guest, or if the vPCI device is removed from the
VM while the VM is running.  The ongoing operation of the device
happens directly between the Linux device driver for the device and
the hardware, with VMBus and the VMBus channel playing no role.

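The pci-hyperv.c driver receives the "offer" by registering with
VMBus like any other VMBus driver.  The sketch below shows only the
general registration pattern, not the actual pci-hyperv.c code: the
driver name, function names, and the device-class GUID value are
placeholders (the real PCI pass-through class GUID is defined in
include/linux/hyperv.h)::

  #include <linux/module.h>
  #include <linux/hyperv.h>

  /* Placeholder device-class table; the GUID value is illustrative. */
  static const struct hv_vmbus_device_id example_vpci_id_table[] = {
          { .guid = GUID_INIT(0x00000000, 0x0000, 0x0000, 0x00, 0x00,
                              0x00, 0x00, 0x00, 0x00, 0x00, 0x00) },
          { },
  };

  /* Called when VMBus relays an "offer" for a matching device class. */
  static int example_vpci_probe(struct hv_device *hdev,
                                const struct hv_vmbus_device_id *dev_id)
  {
          /*
           * The real hv_pci_probe() opens the device's VMBus channel
           * here and exchanges setup messages with the vPCI VSP.
           */
          return 0;
  }

  static struct hv_driver example_vpci_drv = {
          .name     = "example-vpci",
          .id_table = example_vpci_id_table,
          .probe    = example_vpci_probe,
  };

  static int __init example_vpci_init(void)
  {
          return vmbus_driver_register(&example_vpci_drv);
  }
  module_init(example_vpci_init);
  MODULE_LICENSE("GPL");
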
PCI Device Setup
----------------
PCI device setup follows a sequence that Hyper-V originally created
for Windows guests, and that can be ill-suited for Linux guests due
to differences in the overall structure of the Linux PCI subsystem
compared with Windows.  Nonetheless, with a bit of hackery in the
Hyper-V virtual PCI driver for Linux, the virtual PCI device is set
up in Linux so that generic Linux PCI subsystem code and the Linux
driver for the device "just work".

Each vPCI device is set up in Linux to be in its own PCI domain with
a host bridge.  The PCI domainID is derived from bytes 4 and 5 of the
instance GUID assigned to the VMBus vPCI device.  The Hyper-V host
does not guarantee that these bytes are unique, so hv_pci_probe() has
an algorithm to resolve collisions.  The collision resolution is
intended to be stable across reboots of the same VM so that the PCI
domainIDs don't change, as the domainID appears in the user space
configuration of some devices.

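As a rough illustration of the domainID derivation only (this is not
the actual collision-resolution algorithm in hv_pci_probe(); the
in-use table and the linear probing are simplifications)::

  #include <stdint.h>
  #include <stdbool.h>

  /* Hypothetical record of domain IDs already handed out. */
  static bool domain_in_use[1 << 16];

  /*
   * Derive a candidate PCI domain ID from bytes 4 and 5 of the
   * 16-byte VMBus instance GUID, then step past any collision.
   */
  static uint16_t example_assign_domain(const uint8_t instance_guid[16])
  {
          uint16_t dom = (uint16_t)instance_guid[5] << 8 | instance_guid[4];

          while (domain_in_use[dom])
                  dom++;          /* simplistic collision resolution */
          domain_in_use[dom] = true;
          return dom;
  }
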
hv_pci_probe() allocates a guest MMIO range to be used as PCI config
space for the device.  This MMIO range is communicated to the Hyper-V
host over the VMBus channel as part of telling the host that the
device is ready to enter D0.  See hv_pci_enter_d0().  When the guest
subsequently accesses this MMIO range, the Hyper-V host intercepts
the accesses and maps them to the physical device PCI config space.

hv_pci_probe() also gets BAR information for the device from the
Hyper-V host, and uses this information to allocate MMIO space for
the BARs.  That MMIO space is then set up to be associated with the
host bridge so that it works when generic PCI subsystem code in Linux
processes the BARs.

Finally, hv_pci_probe() creates the root PCI bus.  At this point the
Hyper-V virtual PCI driver hackery is done, and the normal Linux PCI
machinery for scanning the root bus works to detect the device, to
perform driver matching, and to initialize the driver and device.

PCI Device Removal
------------------
A Hyper-V host may initiate removal of a vPCI device from a guest VM
at any time during the life of the VM.  The removal is instigated by
an admin action taken on the Hyper-V host and is not under the
control of the guest OS.

A guest VM is notified of the removal by an unsolicited "Eject"
message sent from the host to the guest over the VMBus channel
associated with the vPCI device.  Upon receipt of such a message, the
Hyper-V virtual PCI driver in Linux asynchronously invokes Linux
kernel PCI subsystem calls to shut down and remove the device.  When
those calls are complete, an "Ejection Complete" message is sent back
to Hyper-V over the VMBus channel indicating that the device has been
removed.  At this point, Hyper-V sends a VMBus rescind message to the
Linux guest, which the VMBus driver in Linux processes by removing
the VMBus identity for the device.  Once that processing is complete,
all vestiges of the device having been present are gone from the
Linux kernel.  The rescind message also indicates to the guest that
Hyper-V has stopped providing support for the vPCI device in the
guest.  If the guest were to attempt to access that device's MMIO
space, it would be an invalid reference.  Hypercalls affecting the
device return errors, and any further messages sent in the VMBus
channel are ignored.

After sending the Eject message, Hyper-V allows the guest VM 60
seconds to cleanly shut down the device and respond with Ejection
Complete before sending the VMBus rescind message.  If for any reason
the Eject steps don't complete within the allowed 60 seconds, the
Hyper-V host forcibly performs the rescind steps, which will likely
result in cascading errors in the guest because the device is now no
longer present from the guest standpoint and accessing the device
MMIO space will fail.

Because ejection is asynchronous and can happen at any point during
the guest VM lifecycle, proper synchronization in the Hyper-V virtual
PCI driver is very tricky.  Ejection has been observed even before a
newly offered vPCI device has been fully set up.  The Hyper-V virtual
PCI driver has been updated several times over the years to fix race
conditions when ejections happen at inopportune times.  Care must be
taken when modifying this code to prevent re-introducing such
problems.  See comments in the code.

Interrupt Assignment
--------------------
The Hyper-V virtual PCI driver supports vPCI devices using MSI,
multi-MSI, or MSI-X.  Assigning the guest vCPU that will receive the
interrupt for a particular MSI or MSI-X message is complex because of
the way the Linux setup of IRQs maps onto the Hyper-V interfaces.
For the single-MSI and MSI-X cases, Linux calls hv_compose_msi_msg()
twice, with the first call containing a dummy vCPU and the second
call containing the real vCPU.  Finally, hv_irq_unmask() is called
(on x86) or the GICD registers are set (on arm64) to specify the real
vCPU again.  Each of these three calls interacts with Hyper-V, which
must decide which physical CPU should receive the interrupt before it
is forwarded to the guest VM.  Unfortunately, the Hyper-V
decision-making process is a bit limited, and can result in
concentrating the physical interrupts on a single CPU, causing a
performance bottleneck.  See details about how this is resolved in
the extensive comment above the function
hv_compose_msi_req_get_cpu().

The Hyper-V virtual PCI driver implements the
irq_chip.irq_compose_msi_msg function as hv_compose_msi_msg().
Unfortunately, on Hyper-V the implementation requires sending a VMBus
message to the Hyper-V host and awaiting an interrupt indicating
receipt of a reply message.  Since irq_chip.irq_compose_msi_msg can
be called with IRQ locks held, it doesn't work to do the normal sleep
until awakened by the interrupt.  Instead, hv_compose_msi_msg() must
send the VMBus message, and then poll for the completion message.  As
further complexity, the vPCI device could be ejected/rescinded while
the polling is in progress, so this scenario must be detected as
well.  See comments in the code regarding this very tricky area.

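The wait therefore looks roughly like the following sketch.  The
function and variable names here are hypothetical, and the real loop
in hv_compose_msi_msg() does additional work, such as handling
pending messages on the VMBus channel while it polls::

  #include <linux/compiler.h>
  #include <linux/completion.h>
  #include <linux/delay.h>
  #include <linux/errno.h>

  /*
   * Poll for the host's reply instead of sleeping, since the caller
   * may hold IRQ locks.  Bail out if the device is rescinded while
   * the reply is outstanding.
   */
  static int example_wait_for_host_reply(struct completion *reply_done,
                                         bool *device_rescinded)
  {
          while (!try_wait_for_completion(reply_done)) {
                  if (READ_ONCE(*device_rescinded))
                          return -ENODEV;
                  udelay(100);    /* brief pause, then poll again */
          }
          return 0;
  }
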
Most of the code in the Hyper-V virtual PCI driver (pci-hyperv.c)
applies to Hyper-V and Linux guests running on x86 and on arm64
architectures.  But there are differences in how interrupt
assignments are managed.  On x86, the Hyper-V virtual PCI driver in
the guest must make a hypercall to tell Hyper-V which guest vCPU
should be interrupted by each MSI/MSI-X interrupt, and the x86
interrupt vector number that the x86_vector IRQ domain has picked for
the interrupt.  This hypercall is made by hv_arch_irq_unmask().  On
arm64, the Hyper-V virtual PCI driver manages the allocation of an
SPI for each MSI/MSI-X interrupt.  The Hyper-V virtual PCI driver
stores the allocated SPI in the architectural GICD registers, which
Hyper-V emulates, so no hypercall is necessary as with x86.  Hyper-V
does not support using LPIs for vPCI devices in arm64 guest VMs
because it does not emulate a GICv3 ITS.

The Hyper-V virtual PCI driver in Linux supports vPCI devices whose
drivers create managed or unmanaged Linux IRQs.  If the smp_affinity
for an unmanaged IRQ is updated via the /proc/irq interface, the
Hyper-V virtual PCI driver is called to tell the Hyper-V host to
change the interrupt targeting and everything works properly.
However, on x86 if the x86_vector IRQ domain needs to reassign an
interrupt vector due to running out of vectors on a CPU, there's no
path to inform the Hyper-V host of the change, and things break.
Fortunately, guest VMs operate in a constrained device environment
where using all the vectors on a CPU doesn't happen.  Since such a
problem is only a theoretical concern rather than a practical
concern, it has been left unaddressed.

DMA
---
By default, Hyper-V pins all guest VM memory in the host when the VM
is created, and programs the physical IOMMU to allow the VM to have
DMA access to all its memory.  Hence it is safe to assign PCI devices
to the VM, and allow the guest operating system to program the DMA
transfers.  The physical IOMMU prevents a malicious guest from
initiating DMA to memory belonging to the host or to other VMs on the
host.  From the Linux guest standpoint, such DMA transfers are in
"direct" mode since Hyper-V does not provide a virtual IOMMU in the
guest.

Hyper-V assumes that physical PCI devices always perform
cache-coherent DMA.  When running on x86, this behavior is required
by the architecture.  When running on arm64, the architecture allows
for both cache-coherent and non-cache-coherent devices, with the
behavior of each device specified in the ACPI DSDT.  But when a PCI
device is assigned to a guest VM, that device does not appear in the
DSDT, so the Hyper-V VMBus driver propagates cache-coherency
information from the VMBus node in the ACPI DSDT to all VMBus
devices, including vPCI devices (since they have a dual identity as a
VMBus device and as a PCI device).  See vmbus_dma_configure().
Current Hyper-V versions always indicate that the VMBus is cache
coherent, so vPCI devices on arm64 always get marked as cache
coherent and the CPU does not perform any sync operations as part of
dma_map/unmap_*() calls.

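From the device driver's standpoint this is just the ordinary
streaming DMA API.  A minimal sketch follows; the function name,
device pointer, buffer, and length are placeholders for whatever the
driver is actually transferring::

  #include <linux/dma-mapping.h>

  /*
   * Map a buffer for a device-bound transfer, hand it to the device,
   * then unmap it.  On a Hyper-V vPCI device this goes through the
   * "direct" DMA path; on arm64 the cache-coherent marking determines
   * whether any CPU cache maintenance happens in the map/unmap calls.
   */
  static int example_dma_to_device(struct device *dev, void *buf, size_t len)
  {
          dma_addr_t dma_addr;

          dma_addr = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
          if (dma_mapping_error(dev, dma_addr))
                  return -ENOMEM;

          /* ... program the device with dma_addr, wait for it to finish ... */

          dma_unmap_single(dev, dma_addr, len, DMA_TO_DEVICE);
          return 0;
  }
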
vPCI protocol versions
----------------------
As previously described, during vPCI device setup and teardown,
messages are passed over a VMBus channel between the Hyper-V host and
the Hyper-V vPCI driver in the Linux guest.  Some messages have been
revised in newer versions of Hyper-V, so the guest and host must
agree on the vPCI protocol version to be used.  The version is
negotiated when communication over the VMBus channel is first
established.  See hv_pci_protocol_negotiation().  Newer versions of
the protocol extend support to VMs with more than 64 vCPUs, and
provide additional information about the vPCI device, such as the
guest virtual NUMA node to which it is most closely affined in the
underlying hardware.

Guest NUMA node affinity
------------------------
When the vPCI protocol version provides it, the guest NUMA node
affinity of the vPCI device is stored as part of the Linux device
information for subsequent use by the Linux driver.  See
hv_pci_assign_numa_node().  If the negotiated protocol version does
not support the host providing NUMA affinity information, the Linux
guest defaults the device NUMA node to 0.  But even when the
negotiated protocol version includes NUMA affinity information, the
ability of the host to provide such information depends on certain
host configuration options.  If the guest receives NUMA node value
"0", it could mean NUMA node 0, or it could mean "no information is
available".  Unfortunately it is not possible to distinguish the two
cases from the guest side.

PCI config space access in a CoCo VM
------------------------------------
Linux PCI device drivers access PCI config space using a standard set
of functions provided by the Linux PCI subsystem.  In Hyper-V guests
these standard functions map to functions hv_pcifront_read_config()
and hv_pcifront_write_config() in the Hyper-V virtual PCI driver.  In
normal VMs, these hv_pcifront_*() functions directly access the PCI
config space, and the accesses trap to Hyper-V to be handled.  But in
CoCo VMs, memory encryption prevents Hyper-V from reading the guest
instruction stream to emulate the access, so the hv_pcifront_*()
functions must invoke hypercalls with explicit arguments describing
the access to be made.

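For example, a routine config space read in a device driver, like the
one sketched below, is serviced by hv_pcifront_read_config() when the
driver runs in a Hyper-V guest (the function name here is
hypothetical)::

  #include <linux/pci.h>

  /*
   * A standard config space read.  In a Hyper-V guest this reaches
   * hv_pcifront_read_config(); in a CoCo VM that function issues an
   * explicit hypercall instead of relying on a trapped access.
   */
  static u16 example_read_vendor_id(struct pci_dev *pdev)
  {
          u16 vendor;

          pci_read_config_word(pdev, PCI_VENDOR_ID, &vendor);
          return vendor;
  }
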
Config Block back-channel
-------------------------
The Hyper-V host and Hyper-V virtual PCI driver in Linux together
implement a non-standard back-channel communication path between the
host and guest.  The back-channel path uses messages sent over the
VMBus channel associated with the vPCI device.  The functions
hyperv_read_cfg_blk() and hyperv_write_cfg_blk() are the primary
interfaces provided to other parts of the Linux kernel.  As of this
writing, these interfaces are used only by the Mellanox mlx5 driver
to pass diagnostic data to a Hyper-V host running in the Azure public
cloud.  The functions hyperv_read_cfg_blk() and hyperv_write_cfg_blk()
are implemented in a separate module (pci-hyperv-intf.c, under
CONFIG_PCI_HYPERV_INTERFACE) that effectively stubs them out when
running in non-Hyper-V environments.

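A caller uses the back-channel interface roughly as follows.  This is
a sketch only: the function name, block ID, and buffer are
placeholders, and the hyperv_read_cfg_blk()/hyperv_write_cfg_blk()
signatures shown reflect the declarations in include/linux/hyperv.h
at the time of writing::

  #include <linux/hyperv.h>
  #include <linux/pci.h>

  /*
   * Round-trip one block of data over the vPCI back-channel.  The
   * block ID value used here is purely illustrative.
   */
  static int example_cfg_blk_roundtrip(struct pci_dev *pdev)
  {
          u8 buf[128];
          unsigned int bytes_returned;
          int ret;

          /* Read one config block from the host. */
          ret = hyperv_read_cfg_blk(pdev, buf, sizeof(buf),
                                    1 /* hypothetical block ID */,
                                    &bytes_returned);
          if (ret)
                  return ret;

          /* Write a block of diagnostic data back to the host. */
          return hyperv_write_cfg_blk(pdev, buf, bytes_returned,
                                      1 /* hypothetical block ID */);
  }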