vfio.txt: standardize document format

Each text file under Documentation follows a different
format. Some don't even have titles!

Change its representation to follow the adopted standard,
using ReST markup so that it can be parsed by Sphinx:
- adjust title marks;
- use footnote marks;
- mark literal blocks;
- adjust indentation.
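
A typical conversion looks like this (illustrative excerpt based on the
patch below: heading underlines are trimmed to the title length, footnote
references become ReST footnotes, and literal blocks are introduced
with "::"):

  Before:
    Groups, Devices, and IOMMUs
    -------------------------------------------------------------------------------

    ... this allows safe[2], non-privileged, userspace drivers.

    character devices for this group:

  After:
    Groups, Devices, and IOMMUs
    ---------------------------

    ... this allows safe [2]_, non-privileged, userspace drivers.

    character devices for this group::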

Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>

Authored by Mauro Carvalho Chehab and committed by Jonathan Corbet (c6f4d413 2a26ed8e)

+134 -127
Documentation/vfio.txt
···
1 - VFIO - "Virtual Function I/O"[1]
2 - -------------------------------------------------------------------------------
3 Many modern system now provide DMA and interrupt remapping facilities
4 to help ensure I/O devices behave within the boundaries they've been
5 allotted. This includes x86 hardware with AMD-Vi and Intel VT-d,
···
9 systems such as Freescale PAMU. The VFIO driver is an IOMMU/device
10 agnostic framework for exposing direct device access to userspace, in
11 a secure, IOMMU protected environment. In other words, this allows
12 - safe[2], non-privileged, userspace drivers.
13
14 Why do we want that? Virtual machines often make use of direct device
15 access ("device assignment") when configured for the highest possible
16 I/O performance. From a device and host perspective, this simply
17 turns the VM into a userspace driver, with the benefits of
18 significantly reduced latency, higher bandwidth, and direct use of
19 - bare-metal device drivers[3].
20
21 Some applications, particularly in the high performance computing
22 field, also benefit from low-overhead, direct device access from
···
33 secure, more featureful userspace driver environment than UIO.
34
35 Groups, Devices, and IOMMUs
36 - -------------------------------------------------------------------------------
37
38 Devices are the main target of any I/O driver. Devices typically
39 create a programming interface made up of I/O access, interrupts,
···
116 notifications.
117
118 VFIO Usage Example
119 - -------------------------------------------------------------------------------
120
121 - Assume user wants to access PCI device 0000:06:0d.0
122
123 - $ readlink /sys/bus/pci/devices/0000:06:0d.0/iommu_group
124 - ../../../../kernel/iommu_groups/26
125
126 This device is therefore in IOMMU group 26. This device is on the
127 pci bus, therefore the user will make use of vfio-pci to manage the
128 - group:
129
130 - # modprobe vfio-pci
131
132 Binding this device to the vfio-pci driver creates the VFIO group
133 - character devices for this group:
134
135 - $ lspci -n -s 0000:06:0d.0
136 - 06:0d.0 0401: 1102:0002 (rev 08)
137 - # echo 0000:06:0d.0 > /sys/bus/pci/devices/0000:06:0d.0/driver/unbind
138 - # echo 1102 0002 > /sys/bus/pci/drivers/vfio-pci/new_id
139
140 Now we need to look at what other devices are in the group to free
141 - it for use by VFIO:
142
143 - $ ls -l /sys/bus/pci/devices/0000:06:0d.0/iommu_group/devices
144 - total 0
145 - lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:00:1e.0 ->
146 - ../../../../devices/pci0000:00/0000:00:1e.0
147 - lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:06:0d.0 ->
148 - ../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.0
149 - lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:06:0d.1 ->
150 - ../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.1
151
152 - This device is behind a PCIe-to-PCI bridge[4], therefore we also
153 need to add device 0000:06:0d.1 to the group following the same
154 procedure as above. Device 0000:00:1e.0 is a bridge that does
155 not currently have a host driver, therefore it's not required to
···
159 The final step is to provide the user with access to the group if
160 unprivileged operation is desired (note that /dev/vfio/vfio provides
161 no capabilities on its own and is therefore expected to be set to
162 - mode 0666 by the system).
163
164 - # chown user:user /dev/vfio/26
165
166 The user now has full access to all the devices and the iommu for this
167 - group and can access them as follows:
168
169 int container, group, device, i;
170 struct vfio_group_status group_status =
···
250 VFIO bus drivers, such as vfio-pci make use of only a few interfaces
251 into VFIO core. When devices are bound and unbound to the driver,
252 the driver should call vfio_add_group_dev() and vfio_del_group_dev()
253 - respectively:
254
255 - extern int vfio_add_group_dev(struct iommu_group *iommu_group,
256 - struct device *dev,
257 - const struct vfio_device_ops *ops,
258 - void *device_data);
259
260 - extern void *vfio_del_group_dev(struct device *dev);
261
262 vfio_add_group_dev() indicates to the core to begin tracking the
263 specified iommu_group and register the specified dev as owned by
264 a VFIO bus driver. The driver provides an ops structure for callbacks
265 - similar to a file operations structure:
266
267 - struct vfio_device_ops {
268 - int (*open)(void *device_data);
269 - void (*release)(void *device_data);
270 - ssize_t (*read)(void *device_data, char __user *buf,
271 - size_t count, loff_t *ppos);
272 - ssize_t (*write)(void *device_data, const char __user *buf,
273 - size_t size, loff_t *ppos);
274 - long (*ioctl)(void *device_data, unsigned int cmd,
275 - unsigned long arg);
276 - int (*mmap)(void *device_data, struct vm_area_struct *vma);
277 - };
278
279 Each function is passed the device_data that was originally registered
280 in the vfio_add_group_dev() call above. This allows the bus driver
···
287
288
289 PPC64 sPAPR implementation note
290 - -------------------------------------------------------------------------------
291
292 This implementation has some specifics:
293
294 1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per
295 - container is supported as an IOMMU table is allocated at the boot time,
296 - one table per a IOMMU group which is a Partitionable Endpoint (PE)
297 - (PE is often a PCI domain but not always).
298 - Newer systems (POWER8 with IODA2) have improved hardware design which allows
299 - to remove this limitation and have multiple IOMMU groups per a VFIO container.
300
301 2) The hardware supports so called DMA windows - the PCI address range
302 - within which DMA transfer is allowed, any attempt to access address space
303 - out of the window leads to the whole PE isolation.
304
305 3) PPC64 guests are paravirtualized but not fully emulated. There is an API
306 - to map/unmap pages for DMA, and it normally maps 1..32 pages per call and
307 - currently there is no way to reduce the number of calls. In order to make things
308 - faster, the map/unmap handling has been implemented in real mode which provides
309 - an excellent performance which has limitations such as inability to do
310 - locked pages accounting in real time.
311
312 4) According to sPAPR specification, A Partitionable Endpoint (PE) is an I/O
313 - subtree that can be treated as a unit for the purposes of partitioning and
314 - error recovery. A PE may be a single or multi-function IOA (IO Adapter), a
315 - function of a multi-function IOA, or multiple IOAs (possibly including switch
316 - and bridge structures above the multiple IOAs). PPC64 guests detect PCI errors
317 - and recover from them via EEH RTAS services, which works on the basis of
318 - additional ioctl commands.
319
320 - So 4 additional ioctls have been added:
321
322 - VFIO_IOMMU_SPAPR_TCE_GET_INFO - returns the size and the start
323 - of the DMA window on the PCI bus.
324
325 - VFIO_IOMMU_ENABLE - enables the container. The locked pages accounting
326 is done at this point. This lets user first to know what
327 the DMA window is and adjust rlimit before doing any real job.
328
329 - VFIO_IOMMU_DISABLE - disables the container.
330
331 - VFIO_EEH_PE_OP - provides an API for EEH setup, error detection and recovery.
332
333 - The code flow from the example above should be slightly changed:
334
335 struct vfio_eeh_pe_op pe_op = { .argsz = sizeof(pe_op), .flags = 0 };
336
···
449 ....
450
451 5) There is v2 of SPAPR TCE IOMMU. It deprecates VFIO_IOMMU_ENABLE/
452 - VFIO_IOMMU_DISABLE and implements 2 new ioctls:
453 - VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY
454 - (which are unsupported in v1 IOMMU).
455
456 - PPC64 paravirtualized guests generate a lot of map/unmap requests,
457 - and the handling of those includes pinning/unpinning pages and updating
458 - mm::locked_vm counter to make sure we do not exceed the rlimit.
459 - The v2 IOMMU splits accounting and pinning into separate operations:
460
461 - - VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY ioctls
462 - receive a user space address and size of the block to be pinned.
463 - Bisecting is not supported and VFIO_IOMMU_UNREGISTER_MEMORY is expected to
464 - be called with the exact address and size used for registering
465 - the memory block. The userspace is not expected to call these often.
466 - The ranges are stored in a linked list in a VFIO container.
467
468 - - VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA ioctls only update the actual
469 - IOMMU table and do not do pinning; instead these check that the userspace
470 - address is from pre-registered range.
471
472 - This separation helps in optimizing DMA for guests.
473
474 6) sPAPR specification allows guests to have an additional DMA window(s) on
475 - a PCI bus with a variable page size. Two ioctls have been added to support
476 - this: VFIO_IOMMU_SPAPR_TCE_CREATE and VFIO_IOMMU_SPAPR_TCE_REMOVE.
477 - The platform has to support the functionality or error will be returned to
478 - the userspace. The existing hardware supports up to 2 DMA windows, one is
479 - 2GB long, uses 4K pages and called "default 32bit window"; the other can
480 - be as big as entire RAM, use different page size, it is optional - guests
481 - create those in run-time if the guest driver supports 64bit DMA.
482
483 - VFIO_IOMMU_SPAPR_TCE_CREATE receives a page shift, a DMA window size and
484 - a number of TCE table levels (if a TCE table is going to be big enough and
485 - the kernel may not be able to allocate enough of physically contiguous memory).
486 - It creates a new window in the available slot and returns the bus address where
487 - the new window starts. Due to hardware limitation, the user space cannot choose
488 - the location of DMA windows.
489
490 - VFIO_IOMMU_SPAPR_TCE_REMOVE receives the bus start address of the window
491 - and removes it.
492
493 -------------------------------------------------------------------------------
494
495 - [1] VFIO was originally an acronym for "Virtual Function I/O" in its
496 - initial implementation by Tom Lyon while as Cisco. We've since
497 - outgrown the acronym, but it's catchy.
498
499 - [2] "safe" also depends upon a device being "well behaved". It's
500 - possible for multi-function devices to have backdoors between
501 - functions and even for single function devices to have alternative
502 - access to things like PCI config space through MMIO registers. To
503 - guard against the former we can include additional precautions in the
504 - IOMMU driver to group multi-function PCI devices together
505 - (iommu=group_mf). The latter we can't prevent, but the IOMMU should
506 - still provide isolation. For PCI, SR-IOV Virtual Functions are the
507 - best indicator of "well behaved", as these are designed for
508 - virtualization usage models.
509
510 - [3] As always there are trade-offs to virtual machine device
511 - assignment that are beyond the scope of VFIO. It's expected that
512 - future IOMMU technologies will reduce some, but maybe not all, of
513 - these trade-offs.
514
515 - [4] In this case the device is below a PCI bridge, so transactions
516 - from either function of the device are indistinguishable to the iommu:
517
518 - -[0000:00]-+-1e.0-[06]--+-0d.0
519 - \-0d.1
520
521 - 00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90)
···
1 + ==================================
2 + VFIO - "Virtual Function I/O" [1]_
3 + ==================================
4 +
5 Many modern system now provide DMA and interrupt remapping facilities
6 to help ensure I/O devices behave within the boundaries they've been
7 allotted. This includes x86 hardware with AMD-Vi and Intel VT-d,
···
7 systems such as Freescale PAMU. The VFIO driver is an IOMMU/device
8 agnostic framework for exposing direct device access to userspace, in
9 a secure, IOMMU protected environment. In other words, this allows
10 + safe [2]_, non-privileged, userspace drivers.
11
12 Why do we want that? Virtual machines often make use of direct device
13 access ("device assignment") when configured for the highest possible
14 I/O performance. From a device and host perspective, this simply
15 turns the VM into a userspace driver, with the benefits of
16 significantly reduced latency, higher bandwidth, and direct use of
17 + bare-metal device drivers [3]_.
18
19 Some applications, particularly in the high performance computing
20 field, also benefit from low-overhead, direct device access from
···
31 secure, more featureful userspace driver environment than UIO.
32
33 Groups, Devices, and IOMMUs
34 + ---------------------------
35
36 Devices are the main target of any I/O driver. Devices typically
37 create a programming interface made up of I/O access, interrupts,
···
114 notifications.
115
116 VFIO Usage Example
117 + ------------------
118
119 + Assume user wants to access PCI device 0000:06:0d.0::
120
121 + $ readlink /sys/bus/pci/devices/0000:06:0d.0/iommu_group
122 + ../../../../kernel/iommu_groups/26
123
124 This device is therefore in IOMMU group 26. This device is on the
125 pci bus, therefore the user will make use of vfio-pci to manage the
126 + group::
127
128 + # modprobe vfio-pci
129
130 Binding this device to the vfio-pci driver creates the VFIO group
131 + character devices for this group::
132
133 + $ lspci -n -s 0000:06:0d.0
134 + 06:0d.0 0401: 1102:0002 (rev 08)
135 + # echo 0000:06:0d.0 > /sys/bus/pci/devices/0000:06:0d.0/driver/unbind
136 + # echo 1102 0002 > /sys/bus/pci/drivers/vfio-pci/new_id
137
138 Now we need to look at what other devices are in the group to free
139 + it for use by VFIO::
140
141 + $ ls -l /sys/bus/pci/devices/0000:06:0d.0/iommu_group/devices
142 + total 0
143 + lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:00:1e.0 ->
144 + ../../../../devices/pci0000:00/0000:00:1e.0
145 + lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:06:0d.0 ->
146 + ../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.0
147 + lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:06:0d.1 ->
148 + ../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.1
149
150 + This device is behind a PCIe-to-PCI bridge [4]_, therefore we also
151 need to add device 0000:06:0d.1 to the group following the same
152 procedure as above. Device 0000:00:1e.0 is a bridge that does
153 not currently have a host driver, therefore it's not required to
···
157 The final step is to provide the user with access to the group if
158 unprivileged operation is desired (note that /dev/vfio/vfio provides
159 no capabilities on its own and is therefore expected to be set to
160 + mode 0666 by the system)::
161
162 + # chown user:user /dev/vfio/26
163
164 The user now has full access to all the devices and the iommu for this
165 + group and can access them as follows::
166
167 int container, group, device, i;
168 struct vfio_group_status group_status =
···
248 VFIO bus drivers, such as vfio-pci make use of only a few interfaces
249 into VFIO core. When devices are bound and unbound to the driver,
250 the driver should call vfio_add_group_dev() and vfio_del_group_dev()
251 + respectively::
252
253 + extern int vfio_add_group_dev(struct iommu_group *iommu_group,
254 + struct device *dev,
255 + const struct vfio_device_ops *ops,
256 + void *device_data);
257
258 + extern void *vfio_del_group_dev(struct device *dev);
259
260 vfio_add_group_dev() indicates to the core to begin tracking the
261 specified iommu_group and register the specified dev as owned by
262 a VFIO bus driver. The driver provides an ops structure for callbacks
263 + similar to a file operations structure::
264
265 + struct vfio_device_ops {
266 + int (*open)(void *device_data);
267 + void (*release)(void *device_data);
268 + ssize_t (*read)(void *device_data, char __user *buf,
269 + size_t count, loff_t *ppos);
270 + ssize_t (*write)(void *device_data, const char __user *buf,
271 + size_t size, loff_t *ppos);
272 + long (*ioctl)(void *device_data, unsigned int cmd,
273 + unsigned long arg);
274 + int (*mmap)(void *device_data, struct vm_area_struct *vma);
275 + };
276
277 Each function is passed the device_data that was originally registered
278 in the vfio_add_group_dev() call above. This allows the bus driver
···
285
286
287 PPC64 sPAPR implementation note
288 + -------------------------------
289
290 This implementation has some specifics:
291
292 1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per
293 + container is supported as an IOMMU table is allocated at the boot time,
294 + one table per a IOMMU group which is a Partitionable Endpoint (PE)
295 + (PE is often a PCI domain but not always).
296 +
297 + Newer systems (POWER8 with IODA2) have improved hardware design which allows
298 + to remove this limitation and have multiple IOMMU groups per a VFIO
299 + container.
300
301 2) The hardware supports so called DMA windows - the PCI address range
302 + within which DMA transfer is allowed, any attempt to access address space
303 + out of the window leads to the whole PE isolation.
304
305 3) PPC64 guests are paravirtualized but not fully emulated. There is an API
306 + to map/unmap pages for DMA, and it normally maps 1..32 pages per call and
307 + currently there is no way to reduce the number of calls. In order to make
308 + things faster, the map/unmap handling has been implemented in real mode
309 + which provides an excellent performance which has limitations such as
310 + inability to do locked pages accounting in real time.
311
312 4) According to sPAPR specification, A Partitionable Endpoint (PE) is an I/O
313 + subtree that can be treated as a unit for the purposes of partitioning and
314 + error recovery. A PE may be a single or multi-function IOA (IO Adapter), a
315 + function of a multi-function IOA, or multiple IOAs (possibly including
316 + switch and bridge structures above the multiple IOAs). PPC64 guests detect
317 + PCI errors and recover from them via EEH RTAS services, which works on the
318 + basis of additional ioctl commands.
319
320 + So 4 additional ioctls have been added:
321
322 + VFIO_IOMMU_SPAPR_TCE_GET_INFO
323 + returns the size and the start of the DMA window on the PCI bus.
324
325 + VFIO_IOMMU_ENABLE
326 + enables the container. The locked pages accounting
327 is done at this point. This lets user first to know what
328 the DMA window is and adjust rlimit before doing any real job.
329
330 + VFIO_IOMMU_DISABLE
331 + disables the container.
332
333 + VFIO_EEH_PE_OP
334 + provides an API for EEH setup, error detection and recovery.
335
336 + The code flow from the example above should be slightly changed::
337
338 struct vfio_eeh_pe_op pe_op = { .argsz = sizeof(pe_op), .flags = 0 };
339
···
442 ....
443
444 5) There is v2 of SPAPR TCE IOMMU. It deprecates VFIO_IOMMU_ENABLE/
445 + VFIO_IOMMU_DISABLE and implements 2 new ioctls:
446 + VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY
447 + (which are unsupported in v1 IOMMU).
448
449 + PPC64 paravirtualized guests generate a lot of map/unmap requests,
450 + and the handling of those includes pinning/unpinning pages and updating
451 + mm::locked_vm counter to make sure we do not exceed the rlimit.
452 + The v2 IOMMU splits accounting and pinning into separate operations:
453
454 + - VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY ioctls
455 + receive a user space address and size of the block to be pinned.
456 + Bisecting is not supported and VFIO_IOMMU_UNREGISTER_MEMORY is expected to
457 + be called with the exact address and size used for registering
458 + the memory block. The userspace is not expected to call these often.
459 + The ranges are stored in a linked list in a VFIO container.
460
461 + - VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA ioctls only update the actual
462 + IOMMU table and do not do pinning; instead these check that the userspace
463 + address is from pre-registered range.
464
465 + This separation helps in optimizing DMA for guests.
466
467 6) sPAPR specification allows guests to have an additional DMA window(s) on
468 + a PCI bus with a variable page size. Two ioctls have been added to support
469 + this: VFIO_IOMMU_SPAPR_TCE_CREATE and VFIO_IOMMU_SPAPR_TCE_REMOVE.
470 + The platform has to support the functionality or error will be returned to
471 + the userspace. The existing hardware supports up to 2 DMA windows, one is
472 + 2GB long, uses 4K pages and called "default 32bit window"; the other can
473 + be as big as entire RAM, use different page size, it is optional - guests
474 + create those in run-time if the guest driver supports 64bit DMA.
475
476 + VFIO_IOMMU_SPAPR_TCE_CREATE receives a page shift, a DMA window size and
477 + a number of TCE table levels (if a TCE table is going to be big enough and
478 + the kernel may not be able to allocate enough of physically contiguous
479 + memory). It creates a new window in the available slot and returns the bus
480 + address where the new window starts. Due to hardware limitation, the user
481 + space cannot choose the location of DMA windows.
482
483 + VFIO_IOMMU_SPAPR_TCE_REMOVE receives the bus start address of the window
484 + and removes it.
485
486 -------------------------------------------------------------------------------
487
488 + .. [1] VFIO was originally an acronym for "Virtual Function I/O" in its
489 + initial implementation by Tom Lyon while as Cisco. We've since
490 + outgrown the acronym, but it's catchy.
491
492 + .. [2] "safe" also depends upon a device being "well behaved". It's
493 + possible for multi-function devices to have backdoors between
494 + functions and even for single function devices to have alternative
495 + access to things like PCI config space through MMIO registers. To
496 + guard against the former we can include additional precautions in the
497 + IOMMU driver to group multi-function PCI devices together
498 + (iommu=group_mf). The latter we can't prevent, but the IOMMU should
499 + still provide isolation. For PCI, SR-IOV Virtual Functions are the
500 + best indicator of "well behaved", as these are designed for
501 + virtualization usage models.
502
503 + .. [3] As always there are trade-offs to virtual machine device
504 + assignment that are beyond the scope of VFIO. It's expected that
505 + future IOMMU technologies will reduce some, but maybe not all, of
506 + these trade-offs.
507
508 + .. [4] In this case the device is below a PCI bridge, so transactions
509 + from either function of the device are indistinguishable to the iommu::
510
511 + -[0000:00]-+-1e.0-[06]--+-0d.0
512 + \-0d.1
513
514 + 00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90)