vfio.txt: standardize document format

Each text file under Documentation follows a different
format. Some don't even have titles!

Change its representation to follow the adopted standard,
using ReST markup so that it can be parsed by Sphinx:
- adjust title marks;
- use footnote marks;
- mark literal blocks;
- adjust indentation.
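
For reference, the ReST constructs these bullets refer to have roughly
this shape (a minimal sketch of the markup, not text taken from the
patch itself):

  ==============
  Document Title
  ==============

  Section Title
  -------------

  Footnotes are referenced as [1]_, and a literal block is introduced
  by a double colon::

      $ example command
      example output

  .. [1] The footnote text, collected and rendered by Sphinx.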

Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>

authored by Mauro Carvalho Chehab and committed by Jonathan Corbet c6f4d413 2a26ed8e

+134 -127
Documentation/vfio.txt
···
-VFIO - "Virtual Function I/O"[1]
--------------------------------------------------------------------------------
+==================================
+VFIO - "Virtual Function I/O" [1]_
+==================================
+
 Many modern system now provide DMA and interrupt remapping facilities
 to help ensure I/O devices behave within the boundaries they've been
 allotted. This includes x86 hardware with AMD-Vi and Intel VT-d,
···
 systems such as Freescale PAMU. The VFIO driver is an IOMMU/device
 agnostic framework for exposing direct device access to userspace, in
 a secure, IOMMU protected environment. In other words, this allows
-safe[2], non-privileged, userspace drivers.
+safe [2]_, non-privileged, userspace drivers.
 
 Why do we want that? Virtual machines often make use of direct device
 access ("device assignment") when configured for the highest possible
 I/O performance. From a device and host perspective, this simply
 turns the VM into a userspace driver, with the benefits of
 significantly reduced latency, higher bandwidth, and direct use of
-bare-metal device drivers[3].
+bare-metal device drivers [3]_.
 
 Some applications, particularly in the high performance computing
 field, also benefit from low-overhead, direct device access from
···
 secure, more featureful userspace driver environment than UIO.
 
 Groups, Devices, and IOMMUs
--------------------------------------------------------------------------------
+---------------------------
 
 Devices are the main target of any I/O driver. Devices typically
 create a programming interface made up of I/O access, interrupts,
···
 notifications.
 
 VFIO Usage Example
--------------------------------------------------------------------------------
+------------------
 
-Assume user wants to access PCI device 0000:06:0d.0
+Assume user wants to access PCI device 0000:06:0d.0::
 
-$ readlink /sys/bus/pci/devices/0000:06:0d.0/iommu_group
-../../../../kernel/iommu_groups/26
+  $ readlink /sys/bus/pci/devices/0000:06:0d.0/iommu_group
+  ../../../../kernel/iommu_groups/26
 
 This device is therefore in IOMMU group 26. This device is on the
 pci bus, therefore the user will make use of vfio-pci to manage the
-group:
+group::
 
-# modprobe vfio-pci
+  # modprobe vfio-pci
 
 Binding this device to the vfio-pci driver creates the VFIO group
-character devices for this group:
+character devices for this group::
 
-$ lspci -n -s 0000:06:0d.0
-06:0d.0 0401: 1102:0002 (rev 08)
-# echo 0000:06:0d.0 > /sys/bus/pci/devices/0000:06:0d.0/driver/unbind
-# echo 1102 0002 > /sys/bus/pci/drivers/vfio-pci/new_id
+  $ lspci -n -s 0000:06:0d.0
+  06:0d.0 0401: 1102:0002 (rev 08)
+  # echo 0000:06:0d.0 > /sys/bus/pci/devices/0000:06:0d.0/driver/unbind
+  # echo 1102 0002 > /sys/bus/pci/drivers/vfio-pci/new_id
 
 Now we need to look at what other devices are in the group to free
-it for use by VFIO:
+it for use by VFIO::
 
-$ ls -l /sys/bus/pci/devices/0000:06:0d.0/iommu_group/devices
-total 0
-lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:00:1e.0 ->
-  ../../../../devices/pci0000:00/0000:00:1e.0
-lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:06:0d.0 ->
-  ../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.0
-lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:06:0d.1 ->
-  ../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.1
+  $ ls -l /sys/bus/pci/devices/0000:06:0d.0/iommu_group/devices
+  total 0
+  lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:00:1e.0 ->
+    ../../../../devices/pci0000:00/0000:00:1e.0
+  lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:06:0d.0 ->
+    ../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.0
+  lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:06:0d.1 ->
+    ../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.1
 
-This device is behind a PCIe-to-PCI bridge[4], therefore we also
+This device is behind a PCIe-to-PCI bridge [4]_, therefore we also
 need to add device 0000:06:0d.1 to the group following the same
 procedure as above. Device 0000:00:1e.0 is a bridge that does
 not currently have a host driver, therefore it's not required to
···
 The final step is to provide the user with access to the group if
 unprivileged operation is desired (note that /dev/vfio/vfio provides
 no capabilities on its own and is therefore expected to be set to
-mode 0666 by the system).
+mode 0666 by the system)::
 
-# chown user:user /dev/vfio/26
+  # chown user:user /dev/vfio/26
 
 The user now has full access to all the devices and the iommu for this
-group and can access them as follows:
+group and can access them as follows::
 
   int container, group, device, i;
   struct vfio_group_status group_status =
···
 VFIO bus drivers, such as vfio-pci make use of only a few interfaces
 into VFIO core. When devices are bound and unbound to the driver,
 the driver should call vfio_add_group_dev() and vfio_del_group_dev()
-respectively:
+respectively::
 
-extern int vfio_add_group_dev(struct iommu_group *iommu_group,
-                              struct device *dev,
-                              const struct vfio_device_ops *ops,
-                              void *device_data);
+  extern int vfio_add_group_dev(struct iommu_group *iommu_group,
+                                struct device *dev,
+                                const struct vfio_device_ops *ops,
+                                void *device_data);
 
-extern void *vfio_del_group_dev(struct device *dev);
+  extern void *vfio_del_group_dev(struct device *dev);
 
 vfio_add_group_dev() indicates to the core to begin tracking the
 specified iommu_group and register the specified dev as owned by
 a VFIO bus driver. The driver provides an ops structure for callbacks
-similar to a file operations structure:
+similar to a file operations structure::
 
-struct vfio_device_ops {
-    int     (*open)(void *device_data);
-    void    (*release)(void *device_data);
-    ssize_t (*read)(void *device_data, char __user *buf,
-                    size_t count, loff_t *ppos);
-    ssize_t (*write)(void *device_data, const char __user *buf,
-                     size_t size, loff_t *ppos);
-    long    (*ioctl)(void *device_data, unsigned int cmd,
-                     unsigned long arg);
-    int     (*mmap)(void *device_data, struct vm_area_struct *vma);
-};
+  struct vfio_device_ops {
+      int     (*open)(void *device_data);
+      void    (*release)(void *device_data);
+      ssize_t (*read)(void *device_data, char __user *buf,
+                      size_t count, loff_t *ppos);
+      ssize_t (*write)(void *device_data, const char __user *buf,
+                       size_t size, loff_t *ppos);
+      long    (*ioctl)(void *device_data, unsigned int cmd,
+                       unsigned long arg);
+      int     (*mmap)(void *device_data, struct vm_area_struct *vma);
+  };
 
 Each function is passed the device_data that was originally registered
 in the vfio_add_group_dev() call above. This allows the bus driver
···
 
 
 PPC64 sPAPR implementation note
--------------------------------------------------------------------------------
+-------------------------------
 
 This implementation has some specifics:
 
 1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per
-container is supported as an IOMMU table is allocated at the boot time,
-one table per a IOMMU group which is a Partitionable Endpoint (PE)
-(PE is often a PCI domain but not always).
-Newer systems (POWER8 with IODA2) have improved hardware design which allows
-to remove this limitation and have multiple IOMMU groups per a VFIO container.
+   container is supported as an IOMMU table is allocated at the boot time,
+   one table per a IOMMU group which is a Partitionable Endpoint (PE)
+   (PE is often a PCI domain but not always).
+
+   Newer systems (POWER8 with IODA2) have improved hardware design which allows
+   to remove this limitation and have multiple IOMMU groups per a VFIO
+   container.
 
 2) The hardware supports so called DMA windows - the PCI address range
-within which DMA transfer is allowed, any attempt to access address space
-out of the window leads to the whole PE isolation.
+   within which DMA transfer is allowed, any attempt to access address space
+   out of the window leads to the whole PE isolation.
 
 3) PPC64 guests are paravirtualized but not fully emulated. There is an API
-to map/unmap pages for DMA, and it normally maps 1..32 pages per call and
-currently there is no way to reduce the number of calls. In order to make things
-faster, the map/unmap handling has been implemented in real mode which provides
-an excellent performance which has limitations such as inability to do
-locked pages accounting in real time.
+   to map/unmap pages for DMA, and it normally maps 1..32 pages per call and
+   currently there is no way to reduce the number of calls. In order to make
+   things faster, the map/unmap handling has been implemented in real mode
+   which provides an excellent performance which has limitations such as
+   inability to do locked pages accounting in real time.
 
 4) According to sPAPR specification, A Partitionable Endpoint (PE) is an I/O
-subtree that can be treated as a unit for the purposes of partitioning and
-error recovery. A PE may be a single or multi-function IOA (IO Adapter), a
-function of a multi-function IOA, or multiple IOAs (possibly including switch
-and bridge structures above the multiple IOAs). PPC64 guests detect PCI errors
-and recover from them via EEH RTAS services, which works on the basis of
-additional ioctl commands.
+   subtree that can be treated as a unit for the purposes of partitioning and
+   error recovery. A PE may be a single or multi-function IOA (IO Adapter), a
+   function of a multi-function IOA, or multiple IOAs (possibly including
+   switch and bridge structures above the multiple IOAs). PPC64 guests detect
+   PCI errors and recover from them via EEH RTAS services, which works on the
+   basis of additional ioctl commands.
 
-So 4 additional ioctls have been added:
+   So 4 additional ioctls have been added:
 
-    VFIO_IOMMU_SPAPR_TCE_GET_INFO - returns the size and the start
-        of the DMA window on the PCI bus.
+   VFIO_IOMMU_SPAPR_TCE_GET_INFO
+       returns the size and the start of the DMA window on the PCI bus.
 
-    VFIO_IOMMU_ENABLE - enables the container. The locked pages accounting
+   VFIO_IOMMU_ENABLE
+       enables the container. The locked pages accounting
        is done at this point. This lets user first to know what
        the DMA window is and adjust rlimit before doing any real job.
 
-    VFIO_IOMMU_DISABLE - disables the container.
+   VFIO_IOMMU_DISABLE
+       disables the container.
 
-    VFIO_EEH_PE_OP - provides an API for EEH setup, error detection and recovery.
+   VFIO_EEH_PE_OP
+       provides an API for EEH setup, error detection and recovery.
 
-The code flow from the example above should be slightly changed:
+   The code flow from the example above should be slightly changed::
 
   struct vfio_eeh_pe_op pe_op = { .argsz = sizeof(pe_op), .flags = 0 };
 
···
   ....
 
 5) There is v2 of SPAPR TCE IOMMU. It deprecates VFIO_IOMMU_ENABLE/
-VFIO_IOMMU_DISABLE and implements 2 new ioctls:
-VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY
-(which are unsupported in v1 IOMMU).
+   VFIO_IOMMU_DISABLE and implements 2 new ioctls:
+   VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY
+   (which are unsupported in v1 IOMMU).
 
-PPC64 paravirtualized guests generate a lot of map/unmap requests,
-and the handling of those includes pinning/unpinning pages and updating
-mm::locked_vm counter to make sure we do not exceed the rlimit.
-The v2 IOMMU splits accounting and pinning into separate operations:
+   PPC64 paravirtualized guests generate a lot of map/unmap requests,
+   and the handling of those includes pinning/unpinning pages and updating
+   mm::locked_vm counter to make sure we do not exceed the rlimit.
+   The v2 IOMMU splits accounting and pinning into separate operations:
 
-- VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY ioctls
-receive a user space address and size of the block to be pinned.
-Bisecting is not supported and VFIO_IOMMU_UNREGISTER_MEMORY is expected to
-be called with the exact address and size used for registering
-the memory block. The userspace is not expected to call these often.
-The ranges are stored in a linked list in a VFIO container.
+   - VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY ioctls
+     receive a user space address and size of the block to be pinned.
+     Bisecting is not supported and VFIO_IOMMU_UNREGISTER_MEMORY is expected to
+     be called with the exact address and size used for registering
+     the memory block. The userspace is not expected to call these often.
+     The ranges are stored in a linked list in a VFIO container.
 
-- VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA ioctls only update the actual
-IOMMU table and do not do pinning; instead these check that the userspace
-address is from pre-registered range.
+   - VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA ioctls only update the actual
+     IOMMU table and do not do pinning; instead these check that the userspace
+     address is from pre-registered range.
 
-This separation helps in optimizing DMA for guests.
+   This separation helps in optimizing DMA for guests.
 
 6) sPAPR specification allows guests to have an additional DMA window(s) on
-a PCI bus with a variable page size. Two ioctls have been added to support
-this: VFIO_IOMMU_SPAPR_TCE_CREATE and VFIO_IOMMU_SPAPR_TCE_REMOVE.
-The platform has to support the functionality or error will be returned to
-the userspace. The existing hardware supports up to 2 DMA windows, one is
-2GB long, uses 4K pages and called "default 32bit window"; the other can
-be as big as entire RAM, use different page size, it is optional - guests
-create those in run-time if the guest driver supports 64bit DMA.
+   a PCI bus with a variable page size. Two ioctls have been added to support
+   this: VFIO_IOMMU_SPAPR_TCE_CREATE and VFIO_IOMMU_SPAPR_TCE_REMOVE.
+   The platform has to support the functionality or error will be returned to
+   the userspace. The existing hardware supports up to 2 DMA windows, one is
+   2GB long, uses 4K pages and called "default 32bit window"; the other can
+   be as big as entire RAM, use different page size, it is optional - guests
+   create those in run-time if the guest driver supports 64bit DMA.
 
-VFIO_IOMMU_SPAPR_TCE_CREATE receives a page shift, a DMA window size and
-a number of TCE table levels (if a TCE table is going to be big enough and
-the kernel may not be able to allocate enough of physically contiguous memory).
-It creates a new window in the available slot and returns the bus address where
-the new window starts. Due to hardware limitation, the user space cannot choose
-the location of DMA windows.
+   VFIO_IOMMU_SPAPR_TCE_CREATE receives a page shift, a DMA window size and
+   a number of TCE table levels (if a TCE table is going to be big enough and
+   the kernel may not be able to allocate enough of physically contiguous
+   memory). It creates a new window in the available slot and returns the bus
+   address where the new window starts. Due to hardware limitation, the user
+   space cannot choose the location of DMA windows.
 
-VFIO_IOMMU_SPAPR_TCE_REMOVE receives the bus start address of the window
-and removes it.
+   VFIO_IOMMU_SPAPR_TCE_REMOVE receives the bus start address of the window
+   and removes it.
 
 -------------------------------------------------------------------------------
 
-[1] VFIO was originally an acronym for "Virtual Function I/O" in its
-initial implementation by Tom Lyon while as Cisco. We've since
-outgrown the acronym, but it's catchy.
+.. [1] VFIO was originally an acronym for "Virtual Function I/O" in its
+   initial implementation by Tom Lyon while as Cisco. We've since
+   outgrown the acronym, but it's catchy.
 
-[2] "safe" also depends upon a device being "well behaved". It's
-possible for multi-function devices to have backdoors between
-functions and even for single function devices to have alternative
-access to things like PCI config space through MMIO registers. To
-guard against the former we can include additional precautions in the
-IOMMU driver to group multi-function PCI devices together
-(iommu=group_mf). The latter we can't prevent, but the IOMMU should
-still provide isolation. For PCI, SR-IOV Virtual Functions are the
-best indicator of "well behaved", as these are designed for
-virtualization usage models.
+.. [2] "safe" also depends upon a device being "well behaved". It's
+   possible for multi-function devices to have backdoors between
+   functions and even for single function devices to have alternative
+   access to things like PCI config space through MMIO registers. To
+   guard against the former we can include additional precautions in the
+   IOMMU driver to group multi-function PCI devices together
+   (iommu=group_mf). The latter we can't prevent, but the IOMMU should
+   still provide isolation. For PCI, SR-IOV Virtual Functions are the
+   best indicator of "well behaved", as these are designed for
+   virtualization usage models.
 
-[3] As always there are trade-offs to virtual machine device
-assignment that are beyond the scope of VFIO. It's expected that
-future IOMMU technologies will reduce some, but maybe not all, of
-these trade-offs.
+.. [3] As always there are trade-offs to virtual machine device
+   assignment that are beyond the scope of VFIO. It's expected that
+   future IOMMU technologies will reduce some, but maybe not all, of
+   these trade-offs.
 
-[4] In this case the device is below a PCI bridge, so transactions
-from either function of the device are indistinguishable to the iommu:
+.. [4] In this case the device is below a PCI bridge, so transactions
+   from either function of the device are indistinguishable to the iommu::
 
--[0000:00]-+-1e.0-[06]--+-0d.0
-                        \-0d.1
+      -[0000:00]-+-1e.0-[06]--+-0d.0
+                              \-0d.1
 
-00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90)
+      00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90)
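
To check that a converted file renders, the documentation can be
rebuilt with Sphinx. A minimal sketch, assuming a kernel tree with
Sphinx installed (the output location may differ between kernel
versions):

  $ make htmldocs              # build the ReST documentation with Sphinx
  $ ls Documentation/output/   # rendered HTML lands here in recent trees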