Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
at v3.7-rc7
[Generated file: see http://ozlabs.org/~rusty/virtio-spec/]
Virtio PCI Card Specification
v0.9.5 DRAFT

Rusty Russell <rusty@rustcorp.com.au> IBM Corporation (Editor)

2012 May 7.

Purpose and Description

This document describes the specifications of the “virtio” family
of PCI devices. These devices are found in virtual environments,
yet by design they are not all that different from physical PCI
devices, and this document treats them as such. This allows the
guest to use standard PCI drivers and discovery mechanisms.

The purpose of virtio and this specification is that virtual
environments and guests should have a straightforward, efficient,
standard and extensible mechanism for virtual devices, rather
than boutique per-environment or per-OS mechanisms.

  Straightforward: Virtio PCI devices use normal PCI mechanisms
  of interrupts and DMA which should be familiar to any device
  driver author. There is no exotic page-flipping or COW
  mechanism: it's just a PCI device.[footnote:
  This lack of page-sharing implies that the implementation of the
  device (e.g. the hypervisor or host) needs full access to the
  guest memory. Communication with untrusted parties (i.e.
  inter-guest communication) requires copying.
  ]

  Efficient: Virtio PCI devices consist of rings of descriptors
  for input and output, which are neatly separated to avoid cache
  effects from both guest and device writing to the same cache
  lines.

  Standard: Virtio PCI makes no assumptions about the environment
  in which it operates, beyond supporting PCI.
  In fact the virtio devices specified in the appendices do not
  require PCI at all: they have been implemented on non-PCI
  buses.[footnote:
  The Linux implementation further separates the PCI virtio code
  from the specific virtio drivers: these drivers are shared with
  the non-PCI implementations (currently lguest and S/390).
  ]

  Extensible: Virtio PCI devices contain feature bits which are
  acknowledged by the guest operating system during device setup.
  This allows forwards and backwards compatibility: the device
  offers all the features it knows about, and the driver
  acknowledges those it understands and wishes to use.

  Virtqueues

The mechanism for bulk data transport on virtio PCI devices is
pretentiously called a virtqueue. Each device can have zero or
more virtqueues: for example, the network device has one for
transmit and one for receive.

Each virtqueue occupies two or more physically-contiguous pages
(defined, for the purposes of this specification, as 4096 bytes),
and consists of three parts:

+-------------------+-----------------------------------+-----------+
| Descriptor Table  |   Available Ring     (padding)    | Used Ring |
+-------------------+-----------------------------------+-----------+

When the driver wants to send a buffer to the device, it fills in
a slot in the descriptor table (or chains several together), and
writes the descriptor index into the available ring. It then
notifies the device. When the device has finished a buffer, it
writes the descriptor into the used ring, and sends an interrupt.

Specification

  PCI Discovery

Any PCI device with Vendor ID 0x1AF4, and Device ID 0x1000
through 0x103F inclusive is a virtio device[footnote:
The actual value within this range is ignored
]. The device must also have a Revision ID of 0 to match this
specification.

The Subsystem Device ID indicates which virtio device is
supported by the device.
The Subsystem Vendor ID should reflect
the PCI Vendor ID of the environment (it's currently only used
for informational purposes by the guest).

+----------------------+--------------------+---------------+
| Subsystem Device ID  |   Virtio Device    | Specification |
+----------------------+--------------------+---------------+
| 1                    | network card       | Appendix C    |
+----------------------+--------------------+---------------+
| 2                    | block device       | Appendix D    |
+----------------------+--------------------+---------------+
| 3                    | console            | Appendix E    |
+----------------------+--------------------+---------------+
| 4                    | entropy source     | Appendix F    |
+----------------------+--------------------+---------------+
| 5                    | memory ballooning  | Appendix G    |
+----------------------+--------------------+---------------+
| 6                    | ioMemory           | -             |
+----------------------+--------------------+---------------+
| 7                    | rpmsg              | Appendix H    |
+----------------------+--------------------+---------------+
| 8                    | SCSI host          | Appendix I    |
+----------------------+--------------------+---------------+
| 9                    | 9P transport       | -             |
+----------------------+--------------------+---------------+
| 10                   | mac80211 wlan      | -             |
+----------------------+--------------------+---------------+

  Device Configuration

To configure the device, we use the first I/O region of the PCI
device. This contains a virtio header followed by a
device-specific region.

There may be different widths of accesses to the I/O region; the
“natural” access method for each field in the virtio header must
be used (i.e. 32-bit accesses for 32-bit fields, etc), but the
device-specific region can be accessed using any width accesses,
and should obtain the same results.

Note that this is possible because while the virtio header is PCI
(i.e. little) endian, the device-specific region is encoded in
the native endian of the guest (where such distinction is
applicable).

    Device Initialization Sequence

We start with an overview of device initialization, then expand
on the details of the device and how each step is performed.

  1. Reset the device. This is not required on initial start up.

  2. The ACKNOWLEDGE status bit is set: we have noticed the device.

  3. The DRIVER status bit is set: we know how to drive the device.

  4. Device-specific setup, including reading the Device Feature
     Bits, discovery of virtqueues for the device, optional MSI-X
     setup, and reading and possibly writing the virtio
     configuration space.

  5. The subset of Device Feature Bits understood by the driver is
     written to the device.

  6. The DRIVER_OK status bit is set.

  7. The device can now be used (i.e. buffers added to the
     virtqueues)[footnote:
     Historically, drivers have used the device before steps 5 and 6.
     This is only allowed if the driver does not use any features
     which would alter this early use of the device.
     ]

If any of these steps go irrecoverably wrong, the guest should
set the FAILED status bit to indicate that it has given up on the
device (it can reset the device later to restart if desired).

We now cover the fields required for general setup in detail.
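As a rough illustration of the initialization sequence above, the
following sketch walks the steps against an in-memory stand-in for
the virtio header. The struct fake_virtio_header, the helper names
and the "understood"/"required" feature masks are assumptions made
for this example only; a real driver would access the first I/O
region of the PCI device instead. The status bit values are the
ones listed in the Device Status section.

```c
#include <assert.h>
#include <stdint.h>

/* Status bit values as listed in the Device Status section. */
#define STATUS_ACKNOWLEDGE 1u
#define STATUS_DRIVER      2u
#define STATUS_DRIVER_OK   4u
#define STATUS_FAILED      128u

/* Hypothetical stand-in for the header fields used during setup. */
struct fake_virtio_header {
    uint32_t device_features; /* read-only to the driver */
    uint32_t guest_features;
    uint8_t  device_status;
};

/* Status bits accumulate: each step ORs its bit into the field. */
static void add_status(struct fake_virtio_header *h, uint8_t bit)
{
    h->device_status = (uint8_t)(h->device_status | bit);
}

/*
 * Returns 0 when the device is ready for use, -1 if the driver gave up.
 * "understood" are features the driver can use, "required" the ones it
 * cannot work without (both are assumptions of this sketch).
 */
static int init_sequence_sketch(struct fake_virtio_header *h,
                                uint32_t understood, uint32_t required)
{
    assert((required & ~understood) == 0);

    h->device_status = 0;                      /* 1. reset */
    add_status(h, STATUS_ACKNOWLEDGE);         /* 2. device noticed */
    add_status(h, STATUS_DRIVER);              /* 3. we can drive it */

    uint32_t offered = h->device_features;     /* 4. device-specific setup */
    if ((offered & required) != required) {
        add_status(h, STATUS_FAILED);          /* give up on the device */
        return -1;
    }

    h->guest_features = offered & understood;  /* 5. write feature subset */
    add_status(h, STATUS_DRIVER_OK);           /* 6. driver is set up */
    return 0;                                  /* 7. device can be used */
}
```

On the failure path the ACKNOWLEDGE and DRIVER bits stay set
alongside FAILED, so the status field still records how far the
driver got before giving up.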

    Virtio Header

The virtio header looks as follows:

+------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+
| Bits       || 32                  | 32                  | 32       | 16     | 16      | 16      | 8       | 8      |
+------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+
| Read/Write || R                   | R+W                 | R+W      | R      | R+W     | R+W     | R+W     | R      |
+------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+
| Purpose    || Device              | Guest               | Queue    | Queue  | Queue   | Queue   | Device  | ISR    |
|            || Features bits 0:31  | Features bits 0:31  | Address  | Size   | Select  | Notify  | Status  | Status |
+------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+

If MSI-X is enabled for the device, two additional fields
immediately follow this header:[footnote:
i.e. once you enable MSI-X on the device, the other fields move.
If you turn it off again, they move back!
]

+------------++----------------+--------+
| Bits       || 16             | 16     |
+------------++----------------+--------+
| Read/Write || R+W            | R+W    |
+------------++----------------+--------+
| Purpose    || Configuration  | Queue  |
| (MSI-X)    || Vector         | Vector |
+------------++----------------+--------+

Immediately following these general headers, there may be
device-specific headers:

+------------++--------------------+
| Bits       || Device Specific    |
+------------++--------------------+
| Read/Write || Device Specific    |
+------------++--------------------+
| Purpose    || Device Specific... |
|            ||                    |
+------------++--------------------+

      Device Status

The Device Status field is updated by the guest to indicate its
progress.
This provides a simple low-level diagnostic: it's most
useful to imagine them hooked up to traffic lights on the console
indicating the status of each device.

The device can be reset by writing a 0 to this field, otherwise
at least one bit should be set:

  ACKNOWLEDGE (1) Indicates that the guest OS has found the
  device and recognized it as a valid virtio device.

  DRIVER (2) Indicates that the guest OS knows how to drive the
  device. Under Linux, drivers can be loadable modules so there
  may be a significant (or infinite) delay before setting this
  bit.

  DRIVER_OK (4) Indicates that the driver is set up and ready to
  drive the device.

  FAILED (128) Indicates that something went wrong in the guest,
  and it has given up on the device. This could be an internal
  error, or the driver didn't like the device for some reason, or
  even a fatal error during device operation. The device must be
  reset before attempting to re-initialize.

      Feature Bits

The first configuration field indicates the features that the
device supports. The bits are allocated as follows:

  0 to 23 Feature bits for the specific device type

  24 to 32 Feature bits reserved for extensions to the queue and
  feature negotiation mechanisms

For example, feature bit 0 for a network device (i.e. Subsystem
Device ID 1) indicates that the device supports checksumming of
packets.

The feature bits are negotiated: the device lists all the
features it understands in the Device Features field, and the
guest writes the subset that it understands into the Guest
Features field. The only way to renegotiate is to reset the
device.

In particular, new fields in the device configuration header are
indicated by offering a feature bit, so the guest can check
before accessing that part of the configuration space.
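The negotiation rule and the "check before accessing" rule can be
sketched as follows. This is only an illustration: the feature bit
EXAMPLE_F_EXTRA_FIELD, its number, the field offset and the helper
names are hypothetical and do not belong to any device described in
this document.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* The guest acknowledges only bits that are both offered by the
 * device and understood by the driver. */
static uint32_t negotiate_features(uint32_t device_features,
                                   uint32_t driver_understands)
{
    return device_features & driver_understands;
}

static int has_feature(uint32_t negotiated, unsigned int bit)
{
    assert(bit < 32);
    return (int)((negotiated >> bit) & 1u);
}

/* Hypothetical feature bit announcing an extra 16-bit field at a
 * hypothetical offset in the device-specific region. */
#define EXAMPLE_F_EXTRA_FIELD      3u
#define EXAMPLE_EXTRA_FIELD_OFFSET 2u

/* Reads the extra field only if its feature bit was negotiated.
 * Returns 0 on success, -1 if that part of the configuration space
 * must not be touched. */
static int read_extra_field(uint32_t negotiated, const uint8_t *cfg,
                            uint16_t *out)
{
    uint16_t value;

    if (!has_feature(negotiated, EXAMPLE_F_EXTRA_FIELD))
        return -1;
    /* The device-specific region uses the guest's native endian. */
    memcpy(&value, cfg + EXAMPLE_EXTRA_FIELD_OFFSET, sizeof value);
    *out = value;
    return 0;
}
```

A driver that does not understand the bit never sets it in Guest
Features, so it never reaches the read and keeps working with the
older layout.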

This allows for forwards and backwards compatibility: if the
device is enhanced with a new feature bit, older guests will not
write that feature bit back to the Guest Features field and it
can go into backwards compatibility mode. Similarly, if a guest
is enhanced with a feature that the device doesn't support, it
will not see that feature bit in the Device Features field and
can go into backwards compatibility mode (or, for poor
implementations, set the FAILED Device Status bit).

      Configuration/Queue Vectors

When MSI-X capability is present and enabled in the device
(through standard PCI configuration space) 4 bytes at byte offset
20 are used to map configuration change and queue interrupts to
MSI-X vectors. In this case, the ISR Status field is unused, and
device specific configuration starts at byte offset 24 in virtio
header structure. When MSI-X capability is not enabled, device
specific configuration starts at byte offset 20 in virtio header.

Writing a valid MSI-X Table entry number, 0 to 0x7FF, to one of
Configuration/Queue Vector registers, maps interrupts triggered
by the configuration change/selected queue events respectively to
the corresponding MSI-X vector. To disable interrupts for a
specific event type, unmap it by writing a special NO_VECTOR
value:

/* Vector value used to disable MSI for queue */
#define VIRTIO_MSI_NO_VECTOR            0xffff

Reading these registers returns vector mapped to a given event,
or NO_VECTOR if unmapped. All queue and configuration change
events are unmapped by default.

Note that mapping an event to vector might require allocating
internal device resources, and might fail. Devices report such
failures by returning the NO_VECTOR value when the relevant
Vector field is read.
After mapping an event to vector, the
driver must verify success by reading the Vector field value: on
success, the previously written value is returned, and on
failure, NO_VECTOR is returned. If a mapping failure is detected,
the driver can retry mapping with fewer vectors, or disable MSI-X.

  Virtqueue Configuration

As a device can have zero or more virtqueues for bulk data
transport (for example, the network driver has two), the driver
needs to configure them as part of the device-specific
configuration.

This is done as follows, for each virtqueue a device has:

  1. Write the virtqueue index (first queue is 0) to the Queue
     Select field.

  2. Read the virtqueue size from the Queue Size field, which is
     always a power of 2. This controls how big the virtqueue is
     (see below). If this field is 0, the virtqueue does not exist.

  3. Allocate and zero virtqueue in contiguous physical memory, on a
     4096 byte alignment. Write the physical address, divided by
     4096 to the Queue Address field.[footnote:
     The 4096 is based on the x86 page size, but it's also large
     enough to ensure that the separate parts of the virtqueue are on
     separate cache lines.
     ]

  4. Optionally, if MSI-X capability is present and enabled on the
     device, select a vector to use to request interrupts triggered
     by virtqueue events. Write the MSI-X Table entry number
     corresponding to this vector in Queue Vector field. Read the
     Queue Vector field: on success, previously written value is
     returned; on failure, NO_VECTOR value is returned.
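The first three configuration steps can be sketched in user-space C
as below. Everything here is illustrative: struct fake_queue_regs
stands in for the Queue Select, Queue Size and Queue Address fields,
aligned_alloc() stands in for allocating physically contiguous guest
memory, and the virtual address is used (and truncated to 32 bits)
where a real driver would use the guest-physical address. The size
calculation follows the formula this document gives for the Queue
Size field (16-byte descriptors, 8-byte used elements). The optional
MSI-X step is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define VRING_PAGE 4096u
#define VRING_ALIGN_UP(x) \
    (((x) + VRING_PAGE - 1) & ~(size_t)(VRING_PAGE - 1))

/* Hypothetical stand-in for the queue-related header fields. */
struct fake_queue_regs {
    uint16_t queue_select;
    uint16_t queue_size;    /* pre-set by the "device" in this sketch */
    uint32_t queue_address;
};

/* Total bytes for a queue of qsz entries, per the document's formula. */
static size_t ring_bytes(unsigned int qsz)
{
    return VRING_ALIGN_UP((size_t)16 * qsz + (size_t)2 * (2 + qsz))
         + VRING_ALIGN_UP((size_t)8 * qsz);
}

/* Returns the zeroed ring memory, or NULL if the queue does not
 * exist or allocation failed. The caller frees it with free(). */
static void *setup_queue_sketch(struct fake_queue_regs *regs,
                                uint16_t index)
{
    regs->queue_select = index;              /* step 1 */

    unsigned int qsz = regs->queue_size;     /* step 2 */
    if (qsz == 0)
        return NULL;                         /* no such virtqueue */
    assert((qsz & (qsz - 1)) == 0);          /* always a power of 2 */

    size_t bytes = ring_bytes(qsz);          /* a multiple of 4096 */
    void *ring = aligned_alloc(VRING_PAGE, bytes);   /* step 3 */
    if (ring == NULL)
        return NULL;
    memset(ring, 0, bytes);

    /* Illustrative only: a driver writes the physical address / 4096. */
    regs->queue_address = (uint32_t)((uintptr_t)ring / VRING_PAGE);
    return ring;
}
```

For a Queue Size of 256 the formula yields 8192 bytes for the
descriptor table and available ring plus 4096 bytes for the used
ring, 12288 bytes in total.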

The Queue Size field controls the total number of bytes required
for the virtqueue according to the following formula:

#define ALIGN(x) (((x) + 4095) & ~4095)

static inline unsigned vring_size(unsigned int qsz)
{
     return ALIGN(sizeof(struct vring_desc)*qsz + sizeof(u16)*(2 + qsz))
          + ALIGN(sizeof(struct vring_used_elem)*qsz);
}

This currently wastes some space with padding, but also allows
future extensions. The virtqueue layout structure looks like this
(qsz is the Queue Size field, which is a variable, so this code
won't compile):

struct vring {
    /* The actual descriptors (16 bytes each) */
    struct vring_desc desc[qsz];

    /* A ring of available descriptor heads with free-running index. */
    struct vring_avail avail;

    // Padding to the next 4096 boundary.
    char pad[];

    // A ring of used descriptor heads with free-running index.
    struct vring_used used;
};

    A Note on Virtqueue Endianness

Note that the endian of these fields and everything else in the
virtqueue is the native endian of the guest, not little-endian as
PCI normally is. This makes for simpler guest code, and it is
assumed that the host already has to be deeply aware of the guest
endian so such an “endian-aware” device is not a significant
issue.

    Descriptor Table

The descriptor table refers to the buffers the guest is using for
the device. The addresses are physical addresses, and the buffers
can be chained via the next field. Each descriptor describes a
buffer which is read-only or write-only, but a chain of
descriptors can contain both read-only and write-only buffers.

No descriptor chain may be more than 2^32 bytes long in total.

struct vring_desc {
    /* Address (guest-physical). */
    u64 addr;
    /* Length. */
    u32 len;

/* This marks a buffer as continuing via the next field. */
#define VRING_DESC_F_NEXT   1
/* This marks a buffer as write-only (otherwise read-only). */
#define VRING_DESC_F_WRITE     2
/* This means the buffer contains a list of buffer descriptors. */
#define VRING_DESC_F_INDIRECT   4

    /* The flags as indicated above. */
    u16 flags;
    /* Next field if flags & NEXT */
    u16 next;
};

The number of descriptors in the table is specified by the Queue
Size field for this virtqueue.

    Indirect Descriptors

Some devices benefit by concurrently dispatching a large number
of large requests. The VIRTIO_RING_F_INDIRECT_DESC feature can be
used to allow this (see the Reserved Feature Bits appendix). To
increase ring capacity it is possible to store a table of indirect
descriptors anywhere in memory, and insert a descriptor in main
virtqueue (with flags&INDIRECT on) that refers to memory buffer
containing this indirect descriptor table; fields addr and len
refer to the indirect table address and length in bytes,
respectively. The indirect table layout structure looks like this
(len is the length of the descriptor that refers to this table,
which is a variable, so this code won't compile):

struct indirect_descriptor_table {
    /* The actual descriptors (16 bytes each) */
    struct vring_desc desc[len / 16];
};

The first indirect descriptor is located at start of the indirect
descriptor table (index 0), additional indirect descriptors are
chained by next field. An indirect descriptor without next field
(with flags&NEXT off) signals the end of the indirect descriptor
table, and transfers control back to the main virtqueue. An
indirect descriptor can not refer to another indirect descriptor
table (flags&INDIRECT must be off).
A single indirect descriptor
table can include both read-only and write-only descriptors;
write-only flag (flags&WRITE) in the descriptor that refers to it
is ignored.

    Available Ring

The available ring refers to what descriptors we are offering the
device: it refers to the head of a descriptor chain. The “flags”
field is currently 0 or 1: 1 indicating that we do not need an
interrupt when the device consumes a descriptor from the
available ring. Alternatively, the guest can ask the device to
delay interrupts until an entry with an index specified by the
“used_event” field is written in the used ring (equivalently,
until the idx field in the used ring will reach the value
used_event + 1). The method employed by the device is controlled
by the VIRTIO_RING_F_EVENT_IDX feature bit (see the Reserved
Feature Bits appendix). This interrupt suppression is merely an
optimization; it may not suppress interrupts entirely.

The “idx” field indicates where we would put the next descriptor
entry (modulo the ring size). This starts at 0, and increases.

struct vring_avail {
#define VRING_AVAIL_F_NO_INTERRUPT      1
   u16 flags;
   u16 idx;
   u16 ring[qsz]; /* qsz is the Queue Size field read from device */
   u16 used_event;
};

    Used Ring

The used ring is where the device returns buffers once it is done
with them. The flags field can be used by the device to hint that
no notification is necessary when the guest adds to the available
ring. Alternatively, the “avail_event” field can be used by the
device to hint that no notification is necessary until an entry
with an index specified by the “avail_event” is written in the
available ring (equivalently, until the idx field in the
available ring will reach the value avail_event + 1).
The method
employed by the device is controlled by the guest through the
VIRTIO_RING_F_EVENT_IDX feature bit (see the Reserved Feature
Bits appendix). [footnote:
These fields are kept here because this is the only part of the
virtqueue written by the device
].

Each entry in the ring is a pair: the head entry of the
descriptor chain describing the buffer (this matches an entry
placed in the available ring by the guest earlier), and the total
of bytes written into the buffer. The latter is extremely useful
for guests using untrusted buffers: if you do not know exactly
how much has been written by the device, you usually have to zero
the buffer to ensure no data leakage occurs.

/* u32 is used here for ids for padding reasons. */
struct vring_used_elem {
    /* Index of start of used descriptor chain. */
    u32 id;
    /* Total length of the descriptor chain which was used (written to) */
    u32 len;
};

struct vring_used {
#define VRING_USED_F_NO_NOTIFY  1
    u16 flags;
    u16 idx;
    struct vring_used_elem ring[qsz];
    u16 avail_event;
};

    Helpers for Managing Virtqueues

The Linux Kernel Source code contains the definitions above and
helper routines in a more usable form, in
include/linux/virtio_ring.h. This was explicitly licensed by IBM
and Red Hat under the (3-clause) BSD license so that it can be
freely used by all other projects, and is reproduced (with slight
variation to remove Linux assumptions) in Appendix A.

  Device Operation

There are two parts to device operation: supplying new buffers to
the device, and processing used buffers from the device. As an
example, the virtio network device has two virtqueues: the
transmit virtqueue and the receive virtqueue.
The driver adds
outgoing (read-only) packets to the transmit virtqueue, and then
frees them after they are used. Similarly, incoming (write-only)
buffers are added to the receive virtqueue, and processed after
they are used.

    Supplying Buffers to The Device

Actual transfer of buffers from the guest OS to the device
operates as follows:

  1. Place the buffer(s) into free descriptor(s).

     (a) If there are no free descriptors, the guest may choose to
         notify the device even if notifications are suppressed (to
         reduce latency).[footnote:
         The Linux drivers do this only for read-only buffers: for
         write-only buffers, it is assumed that the driver is merely
         trying to keep the receive buffer ring full, and no
         notification of this expected condition is necessary.
         ]

  2. Place the id of the buffer in the next ring entry of the
     available ring.

  3. The steps (1) and (2) may be performed repeatedly if batching
     is possible.

  4. A memory barrier should be executed to ensure the device sees
     the updated descriptor table and available ring before the next
     step.

  5. The available “idx” field should be increased by the number of
     entries added to the available ring.

  6. A memory barrier should be executed to ensure that we update
     the idx field before checking for notification suppression.

  7. If notifications are not suppressed, the device should be
     notified of the new buffers.

Note that the above code does not take precautions against the
available ring buffer wrapping around: this is not possible since
the ring buffer is the same size as the descriptor table, so step
(1) will prevent such a condition.

In addition, the maximum queue size is 32768 (it must be a power
of 2 which fits in 16 bits), so the 16-bit “idx” value can always
distinguish between a full and empty buffer.

Here is a description of each stage in more detail.
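Before the individual stages, the index arithmetic behind the last
remark can be illustrated with a few helpers. The function names and
the notion of a "consumed" index (how far the other side has
processed the ring) are assumptions made for this sketch; only the
free-running 16-bit idx and the 32768 limit come from the text above.

```c
#include <assert.h>
#include <stdint.h>

/* Entries made available but not yet consumed. The subtraction is
 * done in 16 bits, so it wraps exactly as the idx field does. */
static uint16_t ring_outstanding(uint16_t avail_idx, uint16_t consumed_idx)
{
    return (uint16_t)(avail_idx - consumed_idx);
}

static int ring_is_empty(uint16_t avail_idx, uint16_t consumed_idx)
{
    return ring_outstanding(avail_idx, consumed_idx) == 0;
}

/* With qsz at most 32768, "full" (difference == qsz) can never be
 * confused with "empty" (difference == 0) in 16-bit arithmetic. */
static int ring_is_full(uint16_t avail_idx, uint16_t consumed_idx,
                        uint16_t qsz)
{
    return ring_outstanding(avail_idx, consumed_idx) == qsz;
}

/* Ring slot for a free-running index; qsz must be a power of 2. */
static uint16_t ring_slot(uint16_t idx, uint16_t qsz)
{
    assert(qsz != 0 && (qsz & (qsz - 1)) == 0);
    return (uint16_t)(idx % qsz);
}
```

For instance, with the available index at 2 and the consumed index
at 65534, the wrapped difference is 4 outstanding entries.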

      Placing Buffers Into The Descriptor Table

A buffer consists of zero or more read-only physically-contiguous
elements followed by zero or more physically-contiguous
write-only elements (it must have at least one element). This
algorithm maps it into the descriptor table:

  for each buffer element, b:

    Get the next free descriptor table entry, d

    Set d.addr to the physical address of the start of b

    Set d.len to the length of b.

    If b is write-only, set d.flags to VRING_DESC_F_WRITE,
    otherwise 0.

    If there is a buffer element after this:

      Set d.next to the index of the next free descriptor element.

      Set the VRING_DESC_F_NEXT bit in d.flags.

In practice, the d.next fields are usually used to chain free
descriptors, and a separate count kept to check there are enough
free descriptors before beginning the mappings.

      Updating The Available Ring

The head of the buffer we mapped is the first d in the algorithm
above. A naive implementation would do the following:

avail->ring[avail->idx % qsz] = head;

However, in general we can add many descriptors before we update
the “idx” field (at which point they become visible to the
device), so we keep a counter of how many we've added:

avail->ring[(avail->idx + added++) % qsz] = head;

      Updating The Index Field

Once the idx field of the virtqueue is updated, the device will
be able to access the descriptor entries we've created and the
memory they refer to. This is why a memory barrier is generally
used before the idx update, to ensure it sees the most up-to-date
copy.
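The stages described so far (mapping a buffer into the descriptor
table, recording its head in the available ring, and publishing with
a barrier) might be combined as in the following sketch. The fixed
QSZ, struct sketch_vq, struct buf_elem and the function names are
assumptions made for illustration, and the C11 fence stands in for
whatever barrier the platform actually requires. Free descriptors
are chained through their next fields and counted separately, as the
text suggests.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define QSZ 8u                    /* hypothetical queue size */
#define VRING_DESC_F_NEXT  1
#define VRING_DESC_F_WRITE 2

struct vring_desc { uint64_t addr; uint32_t len; uint16_t flags; uint16_t next; };
struct vring_avail { uint16_t flags; uint16_t idx; uint16_t ring[QSZ]; uint16_t used_event; };

/* Hypothetical driver-side state; only desc and avail are ring layout. */
struct sketch_vq {
    struct vring_desc desc[QSZ];
    struct vring_avail avail;
    uint16_t free_head;  /* first free descriptor, chained via next */
    uint16_t num_free;   /* separate count of free descriptors */
    uint16_t added;      /* heads written but not yet published */
};

/* One physically contiguous piece of a buffer (hypothetical type). */
struct buf_elem { uint64_t addr; uint32_t len; int write_only; };

static void vq_init(struct sketch_vq *vq)
{
    for (uint16_t i = 0; i < QSZ; i++)
        vq->desc[i] = (struct vring_desc){ .next = (uint16_t)(i + 1) };
    vq->avail = (struct vring_avail){ 0 };
    vq->free_head = 0;
    vq->num_free = QSZ;
    vq->added = 0;
}

/* Maps n elements (read-only ones first) into one descriptor chain.
 * Returns the head index, or -1 if there are too few free descriptors. */
static int vq_add_buf(struct sketch_vq *vq, const struct buf_elem *e,
                      unsigned int n)
{
    assert(n > 0);
    if (n > vq->num_free)
        return -1;

    uint16_t head = vq->free_head;
    for (unsigned int i = 0; i < n; i++) {
        uint16_t d = vq->free_head;            /* next free descriptor */
        vq->free_head = vq->desc[d].next;
        vq->desc[d].addr = e[i].addr;
        vq->desc[d].len = e[i].len;
        vq->desc[d].flags = e[i].write_only ? VRING_DESC_F_WRITE : 0;
        if (i + 1 < n) {                       /* another element follows */
            vq->desc[d].next = vq->free_head;
            vq->desc[d].flags |= VRING_DESC_F_NEXT;
        }
    }
    vq->num_free = (uint16_t)(vq->num_free - n);

    /* Not visible to the device until idx is updated. */
    vq->avail.ring[(vq->avail.idx + vq->added++) % QSZ] = head;
    return head;
}

/* Barrier first, then expose everything added so far. */
static void vq_publish(struct sketch_vq *vq)
{
    atomic_thread_fence(memory_order_seq_cst);
    vq->avail.idx = (uint16_t)(vq->avail.idx + vq->added);
    vq->added = 0;
}
```

Several vq_add_buf() calls can be batched before a single
vq_publish(), which is the reason for keeping the added counter
separate from idx.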

The idx field always increments, and we let it wrap naturally at
65536:

avail->idx += added;

      Notifying The Device

Device notification occurs by writing the 16-bit virtqueue index
of this virtqueue to the Queue Notify field of the virtio header
in the first I/O region of the PCI device. This can be expensive,
however, so the device can suppress such notifications if it
doesn't need them. We have to be careful to expose the new idx
value before checking the suppression flag: it's OK to notify
gratuitously, but not to omit a required notification. So again,
we use a memory barrier here before reading the flags or the
avail_event field.

If the VIRTIO_RING_F_EVENT_IDX feature is not negotiated, and if
the VRING_USED_F_NO_NOTIFY flag is not set, we go ahead and write
to the PCI configuration space.

If the VIRTIO_RING_F_EVENT_IDX feature is negotiated, we read the
avail_event field in the used ring structure. If the
available index crossed the avail_event field value since the
last notification, we go ahead and write to the PCI configuration
space. The avail_event field wraps naturally at 65536 as well:

(u16)(new_idx - avail_event - 1) < (u16)(new_idx - old_idx)

    Receiving Used Buffers From The Device

Once the device has used a buffer (read from or written to it, or
parts of both, depending on the nature of the virtqueue and the
device), it sends an interrupt, following an algorithm very
similar to the algorithm used for the driver to send the device a
buffer:

  Write the head descriptor number to the next field in the used
  ring.

  Update the used ring idx.

  Determine whether an interrupt is necessary:

    If the VIRTIO_RING_F_EVENT_IDX feature is not negotiated: check
    if the VRING_AVAIL_F_NO_INTERRUPT flag is not set in
    avail->flags

    If the VIRTIO_RING_F_EVENT_IDX feature is negotiated: check
    whether the used index crossed the used_event field value
    since the last update. The used_event field wraps naturally
    at 65536 as well:

    (u16)(new_idx - used_event - 1) < (u16)(new_idx - old_idx)

  If an interrupt is necessary:

    If MSI-X capability is disabled:

      Set the lower bit of the ISR Status field for the device.

      Send the appropriate PCI interrupt for the device.

    If MSI-X capability is enabled:

      Request the appropriate MSI-X interrupt message for the
      device; the Queue Vector field sets the MSI-X Table entry
      number.

      If Queue Vector field value is NO_VECTOR, no interrupt
      message is requested for this event.

The guest interrupt handler should:

  If MSI-X capability is disabled: read the ISR Status field,
  which will reset it to zero. If the lower bit is zero, the
  interrupt was not for this device. Otherwise, the guest driver
  should look through the used rings of each virtqueue for the
  device, to see if any progress has been made by the device
  which requires servicing.

  If MSI-X capability is enabled: look through the used rings of
  each virtqueue mapped to the specific MSI-X vector for the
  device, to see if any progress has been made by the device
  which requires servicing.

For each ring, the guest should then disable interrupts by writing
the VRING_AVAIL_F_NO_INTERRUPT flag in the avail structure, if
required. It can then process used ring entries, finally enabling
interrupts by clearing the VRING_AVAIL_F_NO_INTERRUPT flag or
updating the used_event field in the available structure. The
guest should then execute a memory barrier, and then recheck the
ring empty condition.
This is necessary to handle the case where, after the
last check and before enabling interrupts, an interrupt has been
suppressed by the device:

vring_disable_interrupts(vq);

for (;;) {
    if (vq->last_seen_used == vring->used.idx) {
        /* Ring looks empty: enable interrupts, then recheck. */
        vring_enable_interrupts(vq);
        mb();
        if (vq->last_seen_used == vring->used.idx)
            break;
        /* A buffer arrived in the meantime: keep processing. */
        vring_disable_interrupts(vq);
    }

    struct vring_used_elem *e =
        &vring->used.ring[vq->last_seen_used % qsz];
    process_buffer(e);
    vq->last_seen_used++;
}

    Dealing With Configuration Changes

Some virtio PCI devices can change the device configuration
state, as reflected in the virtio header in the PCI configuration
space. In this case:

  If MSI-X capability is disabled: an interrupt is delivered and
  the second lowest bit is set in the ISR Status field to
  indicate that the driver should re-examine the configuration
  space. Note that a single interrupt can indicate both that one
  or more virtqueue has been used and that the configuration
  space has changed: even if the config bit is set, virtqueues
  must be scanned.

  If MSI-X capability is enabled: an interrupt message is
  requested. The Configuration Vector field sets the MSI-X Table
  entry number to use. If Configuration Vector field value is
  NO_VECTOR, no interrupt message is requested for this event.

Creating New Device Types

Various considerations are necessary when creating a new device
type:

  How Many Virtqueues?

It is possible that a very simple device will operate entirely
through its configuration space, but most will need at least one
virtqueue in which it will place requests. A device with both
input and output (e.g. console and network devices described here)
needs two queues: one which the driver fills with buffers to
receive input, and one in which the driver places buffers to
transmit output.

  What Configuration Space Layout?

Configuration space is generally used for rarely-changing or
initialization-time parameters. But it is a limited resource, so
it might be better to use a virtqueue to update configuration
information (the network device does this for filtering,
otherwise the table in the config space could potentially be very
large).

Note that this space is generally the guest's native endian,
rather than PCI's little-endian.

  What Device Number?

Currently device numbers are assigned quite freely: a simple
request mail to the author of this document or the Linux
virtualization mailing list[footnote:
https://lists.linux-foundation.org/mailman/listinfo/virtualization
] will be sufficient to secure a unique one.

Meanwhile for experimental drivers, use 65535 and work backwards.

  How many MSI-X vectors?

Using the optional MSI-X capability, devices can speed up
interrupt processing by removing the need for the guest driver to
read the ISR Status register (which might be an expensive
operation), reducing interrupt sharing between devices and queues
within the device, and handling interrupts from multiple CPUs.
However, some systems impose a limit (which might be as low as
256) on the total number of MSI-X vectors that can be allocated to
all devices. Devices and/or device drivers should take this into
account, limiting the number of vectors used unless the device is
expected to cause a high volume of interrupts. Devices can
control the number of vectors used by limiting the MSI-X Table
Size or not presenting MSI-X capability in PCI configuration
space. Drivers can control this by mapping events to as small a
number of vectors as possible, or disabling MSI-X capability
altogether.

  Message Framing

The descriptors used for a buffer should not affect the semantics
of the message, except for the total length of the buffer.
For example, a network buffer consists of a 10 byte header followed
by the network packet. Whether this is presented in the ring
descriptor chain as (say) a 10 byte buffer and a 1514 byte
buffer, or a single 1524 byte buffer, or even three buffers,
should have no effect.

In particular, no implementation should use the descriptor
boundaries to determine the size of any header in a request.[footnote:
The current qemu device implementations mistakenly insist that
the first descriptor cover the header in these cases exactly, so
a cautious driver should arrange it so.
]

  Device Improvements

Any change to configuration space, or new virtqueues, or
behavioural changes, should be indicated by negotiation of a new
feature bit. This establishes clarity[footnote:
Even if it does mean documenting design or implementation
mistakes!
] and avoids future expansion problems.

Clusters of functionality which are always implemented together
can use a single bit, but if one feature makes sense without the
others they should not be gratuitously grouped together to
conserve feature bits. We can always extend the spec when the
first person needs more than 24 feature bits for their device.

Appendix A: virtio_ring.h

#ifndef VIRTIO_RING_H
#define VIRTIO_RING_H
/* An interface for efficient virtio implementation.
 *
 * This header is BSD licensed so anyone can use the definitions
 * to implement compatible drivers/servers.
 *
 * Copyright 2007, 2009, IBM Corporation
 * Copyright 2011, Red Hat, Inc
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1.
 *    Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 * 3. Neither the name of IBM nor the names of its contributors
 *    may be used to endorse or promote products derived from this software
 *    without specific prior written permission.
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ``AS IS'' AND
 * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED. IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE
 * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
 * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 */

#include <stdint.h>

/* This marks a buffer as continuing via the next field. */
#define VRING_DESC_F_NEXT 1
/* This marks a buffer as write-only (otherwise read-only). */
#define VRING_DESC_F_WRITE 2

/* The Host uses this in used->flags to advise the Guest: don't kick me
 * when you add a buffer. It's unreliable, so it's simply an
 * optimization. Guest will still kick if it's out of buffers.
 */
#define VRING_USED_F_NO_NOTIFY 1
/* The Guest uses this in avail->flags to advise the Host: don't
 * interrupt me when you consume a buffer. It's unreliable, so it's
 * simply an optimization. */
#define VRING_AVAIL_F_NO_INTERRUPT 1

/* Virtio ring descriptors: 16 bytes.
 * These can chain together via "next". */
struct vring_desc {
    /* Address (guest-physical). */
    uint64_t addr;
    /* Length. */
    uint32_t len;
    /* The flags as indicated above. */
    uint16_t flags;
    /* We chain unused descriptors via this, too */
    uint16_t next;
};

struct vring_avail {
    uint16_t flags;
    uint16_t idx;
    uint16_t ring[];
    /* Followed in memory, after ring[num], by: uint16_t used_event; */
};

/* u32 is used here for ids for padding reasons. */
struct vring_used_elem {
    /* Index of start of used descriptor chain. */
    uint32_t id;
    /* Total length of the descriptor chain which was written to. */
    uint32_t len;
};

struct vring_used {
    uint16_t flags;
    uint16_t idx;
    struct vring_used_elem ring[];
    /* Followed in memory, after ring[num], by: uint16_t avail_event; */
};

struct vring {
    unsigned int num;

    struct vring_desc *desc;
    struct vring_avail *avail;
    struct vring_used *used;
};

/* The standard layout for the ring is a continuous chunk of memory which
 * looks like this. We assume num is a power of 2.
 *
 * struct vring {
 *    // The actual descriptors (16 bytes each)
 *    struct vring_desc desc[num];
 *
 *    // A ring of available descriptor heads with free-running index.
 *    __u16 avail_flags;
 *    __u16 avail_idx;
 *    __u16 available[num];
 *
 *    // Padding to the next align boundary.
 *    char pad[];
 *
 *    // A ring of used descriptor heads with free-running index.
 *    __u16 used_flags;
 *    __u16 used_idx;
 *    struct vring_used_elem used[num];
 * };
 * Note: for virtio PCI, align is 4096.
 */
static inline void vring_init(struct vring *vr, unsigned int num, void *p,
                              unsigned long align)
{
    vr->num = num;
    vr->desc = p;
    /* The available ring follows the descriptor table directly. */
    vr->avail = (void *)((char *)p + num * sizeof(struct vring_desc));
    /* The used ring starts at the next align boundary. */
    vr->used = (void *)(((unsigned long)&vr->avail->ring[num]
                         + align - 1)
                        & ~(align - 1));
}

static inline unsigned vring_size(unsigned int num, unsigned long align)
{
    return ((sizeof(struct vring_desc)*num + sizeof(uint16_t)*(2+num)
             + align - 1) & ~(align - 1))
        + sizeof(uint16_t)*3 + sizeof(struct vring_used_elem)*num;
}

static inline int vring_need_event(uint16_t event_idx, uint16_t new_idx, uint16_t old_idx)
{
    return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old_idx);
}
#endif /* VIRTIO_RING_H */

<cha:Reserved-Feature-Bits>Appendix B: Reserved Feature Bits

Currently there are five device-independent feature bits defined:

  VIRTIO_F_NOTIFY_ON_EMPTY (24) Negotiating this feature
  indicates that the driver wants an interrupt if the device runs
  out of available descriptors on a virtqueue, even though
  interrupts are suppressed using the VRING_AVAIL_F_NO_INTERRUPT
  flag or the used_event field.
  An example of this is the
  networking driver: it doesn't need to know every time a packet
  is transmitted, but it does need to free the transmitted
  packets a finite time after they are transmitted. It can avoid
  using a timer if the device interrupts it when all the packets
  are transmitted.

  VIRTIO_F_RING_INDIRECT_DESC (28) Negotiating this feature
  indicates that the driver can use descriptors with the
  VRING_DESC_F_INDIRECT flag set, as described in
  [sub:Indirect-Descriptors].

  VIRTIO_F_RING_EVENT_IDX (29) This feature enables the used_event
  and the avail_event fields. If set, it indicates that the
  device should ignore the flags field in the available ring
  structure. Instead, the used_event field in this structure is
  used by the guest to suppress device interrupts. Further, the
  driver should ignore the flags field in the used ring
  structure. Instead, the avail_event field in this structure is
  used by the device to suppress notifications. If unset, the
  driver should ignore the used_event field; the device should
  ignore the avail_event field; the flags field is used

Appendix C: Network Device

The virtio network device is a virtual ethernet card, and is the
most complex of the devices supported so far by virtio. It has
been enhanced rapidly and demonstrates clearly how support for new
features should be added to an existing device. Empty buffers are
placed in one virtqueue for receiving packets, and outgoing
packets are enqueued into another for transmission in that order.
A third command queue is used to control advanced filtering
features.

  Configuration

  Subsystem Device ID 1

  Virtqueues 0:receiveq. 1:transmitq.
  2:controlq[footnote:
Only if VIRTIO_NET_F_CTRL_VQ set
]

  Feature bits

    VIRTIO_NET_F_CSUM (0) Device handles packets with partial
    checksum

    VIRTIO_NET_F_GUEST_CSUM (1) Guest handles packets with partial
    checksum

    VIRTIO_NET_F_MAC (5) Device has given MAC address.

    VIRTIO_NET_F_GSO (6) (Deprecated) device handles packets with
    any GSO type.[footnote:
It was supposed to indicate segmentation offload support, but
upon further investigation it became clear that multiple bits
were required.
]

    VIRTIO_NET_F_GUEST_TSO4 (7) Guest can receive TSOv4.

    VIRTIO_NET_F_GUEST_TSO6 (8) Guest can receive TSOv6.

    VIRTIO_NET_F_GUEST_ECN (9) Guest can receive TSO with ECN.

    VIRTIO_NET_F_GUEST_UFO (10) Guest can receive UFO.

    VIRTIO_NET_F_HOST_TSO4 (11) Device can receive TSOv4.

    VIRTIO_NET_F_HOST_TSO6 (12) Device can receive TSOv6.

    VIRTIO_NET_F_HOST_ECN (13) Device can receive TSO with ECN.

    VIRTIO_NET_F_HOST_UFO (14) Device can receive UFO.

    VIRTIO_NET_F_MRG_RXBUF (15) Guest can merge receive buffers.

    VIRTIO_NET_F_STATUS (16) Configuration status field is
    available.

    VIRTIO_NET_F_CTRL_VQ (17) Control channel is available.

    VIRTIO_NET_F_CTRL_RX (18) Control channel RX mode support.

    VIRTIO_NET_F_CTRL_VLAN (19) Control channel VLAN filtering.

    VIRTIO_NET_F_GUEST_ANNOUNCE (21) Guest can send gratuitous
    packets.

  Device configuration layout Two configuration fields are
  currently defined. The mac address field always exists (though
  is only valid if VIRTIO_NET_F_MAC is set), and the status field
  only exists if VIRTIO_NET_F_STATUS is set. Two read-only bits
  are currently defined for the status field:
  VIRTIO_NET_S_LINK_UP and VIRTIO_NET_S_ANNOUNCE.

#define VIRTIO_NET_S_LINK_UP 1
#define VIRTIO_NET_S_ANNOUNCE 2

struct virtio_net_config {
    u8 mac[6];
    u16 status;
};

  Device Initialization

  The initialization routine should identify the receive and
  transmission virtqueues.

  If the VIRTIO_NET_F_MAC feature bit is set, the configuration
  space “mac” entry indicates the “physical” address of the
  network card, otherwise a private MAC address should be
  assigned. All guests are expected to negotiate this feature if
  it is set.

  If the VIRTIO_NET_F_CTRL_VQ feature bit is negotiated, identify
  the control virtqueue.

  If the VIRTIO_NET_F_STATUS feature bit is negotiated, the link
  status can be read from the bottom bit of the “status” config
  field. Otherwise, the link should be assumed active.

  The receive virtqueue should be filled with receive buffers.
  This is described in detail below in “Setting Up Receive
  Buffers”.

  A driver can indicate that it will generate checksumless
  packets by negotiating the VIRTIO_NET_F_CSUM feature. This
  “checksum offload” is a common feature on modern network cards.

  If that feature is negotiated[footnote:
ie. VIRTIO_NET_F_HOST_TSO* and VIRTIO_NET_F_HOST_UFO are
dependent on VIRTIO_NET_F_CSUM; a device which offers the offload
features must offer the checksum feature, and a driver which
accepts the offload features must accept the checksum feature.
Similar logic applies to the VIRTIO_NET_F_GUEST_TSO4 features
depending on VIRTIO_NET_F_GUEST_CSUM.
], a driver can use TCP or UDP segmentation offload by
  negotiating the VIRTIO_NET_F_HOST_TSO4 (IPv4 TCP),
  VIRTIO_NET_F_HOST_TSO6 (IPv6 TCP) and VIRTIO_NET_F_HOST_UFO
  (UDP fragmentation) features.
  It should not send TCP packets
  requiring segmentation offload which have the Explicit
  Congestion Notification bit set, unless the
  VIRTIO_NET_F_HOST_ECN feature is negotiated.[footnote:
This is a common restriction in real, older network cards.
]

  The converse features are also available: a driver can save the
  virtual device some work by negotiating these features.[footnote:
For example, a network packet transported between two guests on
the same system may not require checksumming at all, nor
segmentation, if both guests are amenable.
] The VIRTIO_NET_F_GUEST_CSUM feature indicates that partially
  checksummed packets can be received, and if it can do that then
  the VIRTIO_NET_F_GUEST_TSO4, VIRTIO_NET_F_GUEST_TSO6,
  VIRTIO_NET_F_GUEST_UFO and VIRTIO_NET_F_GUEST_ECN are the input
  equivalents of the features described above. See “Receiving
  Packets” below.

  Device Operation

Packets are transmitted by placing them in the transmitq, and
buffers for incoming packets are placed in the receiveq. In each
case, the packet itself is preceded by a header:

struct virtio_net_hdr {
#define VIRTIO_NET_HDR_F_NEEDS_CSUM 1
    u8 flags;
#define VIRTIO_NET_HDR_GSO_NONE 0
#define VIRTIO_NET_HDR_GSO_TCPV4 1
#define VIRTIO_NET_HDR_GSO_UDP 3
#define VIRTIO_NET_HDR_GSO_TCPV6 4
#define VIRTIO_NET_HDR_GSO_ECN 0x80
    u8 gso_type;
    u16 hdr_len;
    u16 gso_size;
    u16 csum_start;
    u16 csum_offset;
/* Only if VIRTIO_NET_F_MRG_RXBUF: */
    u16 num_buffers;
};

The controlq is used to control device features such as
filtering.

  Packet Transmission

Transmitting a single packet is simple, but varies depending on
the different features the driver negotiated.

  If the driver negotiated VIRTIO_NET_F_CSUM, and the packet has
  not been fully checksummed, then the virtio_net_hdr's fields
  are set as follows. Otherwise, the packet must be fully
  checksummed, and flags is zero.

    flags has the VIRTIO_NET_HDR_F_NEEDS_CSUM set,

    <ite:csum_start-is-set>csum_start is set to the offset within
    the packet to begin checksumming, and

    csum_offset indicates how many bytes after the csum_start the
    new (16 bit ones' complement) checksum should be placed.[footnote:
For example, consider a partially checksummed TCP (IPv4) packet.
It will have a 14 byte ethernet header and 20 byte IP header
followed by the TCP header (with the TCP checksum field 16 bytes
into that header). csum_start will be 14+20 = 34 (the TCP
checksum includes the header), and csum_offset will be 16. The
value in the TCP checksum field should be initialized to the sum
of the TCP pseudo header, so that replacing it by the ones'
complement checksum of the TCP header and body will give the
correct result.
]

  <enu:If-the-driver>If the driver negotiated
  VIRTIO_NET_F_HOST_TSO4, TSO6 or UFO, and the packet requires
  TCP segmentation or UDP fragmentation, then the “gso_type”
  field is set to VIRTIO_NET_HDR_GSO_TCPV4, TCPV6 or UDP.
  (Otherwise, it is set to VIRTIO_NET_HDR_GSO_NONE). In this
  case, packets larger than 1514 bytes can be transmitted: the
  metadata indicates how to replicate the packet header to cut it
  into smaller packets. The other gso fields are set:

    hdr_len is a hint to the device as to how much of the header
    needs to be kept to copy into each packet, usually set to the
    length of the headers, including the transport header.[footnote:
Due to various bugs in implementations, this field is not useful
as a guarantee of the transport header size.
]

    gso_size is the maximum size of each packet beyond that header
    (ie. MSS).

    If the driver negotiated the VIRTIO_NET_F_HOST_ECN feature, the
    VIRTIO_NET_HDR_GSO_ECN bit may be set in “gso_type” as well,
    indicating that the TCP packet has the ECN bit set.[footnote:
This case is not handled by some older hardware, so is called out
specifically in the protocol.
]

  If the driver negotiated the VIRTIO_NET_F_MRG_RXBUF feature,
  the num_buffers field is set to zero.

  The header and packet are added as one output buffer to the
  transmitq, and the device is notified of the new entry (see
  [sub:Notifying-The-Device]).[footnote:
Note that the header will be two bytes longer for the
VIRTIO_NET_F_MRG_RXBUF case.
]

  Packet Transmission Interrupt

Often a driver will suppress transmission interrupts using the
VRING_AVAIL_F_NO_INTERRUPT flag (see [sub:Receiving-Used-Buffers])
and check for used packets in the transmit path of following
packets. However, it will still receive interrupts if the
VIRTIO_F_NOTIFY_ON_EMPTY feature is negotiated, indicating that
the transmission queue is completely emptied.

The normal behavior in this interrupt handler is to retrieve any
new descriptors from the used ring and free the corresponding
headers and packets.

  Setting Up Receive Buffers

It is generally a good idea to keep the receive virtqueue as
fully populated as possible: if it runs out, network performance
will suffer.

If the VIRTIO_NET_F_GUEST_TSO4, VIRTIO_NET_F_GUEST_TSO6 or
VIRTIO_NET_F_GUEST_UFO features are used, the Guest will need to
accept packets of up to 65550 bytes long (the maximum size of a
TCP or UDP packet, plus the 14 byte ethernet header), otherwise
1514 bytes.
So unless VIRTIO_NET_F_MRG_RXBUF is negotiated, every
buffer in the receive queue needs to be at least this length[footnote:
Obviously each one can be split across multiple descriptor
elements.
].

If VIRTIO_NET_F_MRG_RXBUF is negotiated, each buffer must be at
least the size of the struct virtio_net_hdr.

  Packet Receive Interrupt

When a packet is copied into a buffer in the receiveq, the
optimal path is to disable further interrupts for the receiveq
(see [sub:Receiving-Used-Buffers]) and process packets until no
more are found, then re-enable them.

Processing a packet involves:

  If the driver negotiated the VIRTIO_NET_F_MRG_RXBUF feature,
  then the “num_buffers” field indicates how many descriptors
  this packet is spread over (including this one). This allows
  receipt of large packets without having to allocate large
  buffers. In this case, there will be at least “num_buffers” in
  the used ring, and they should be chained together to form a
  single packet. The other buffers will not begin with a struct
  virtio_net_hdr.

  If the VIRTIO_NET_F_MRG_RXBUF feature was not negotiated, or
  the “num_buffers” field is one, then the entire packet will be
  contained within this buffer, immediately following the struct
  virtio_net_hdr.

  If the VIRTIO_NET_F_GUEST_CSUM feature was negotiated, the
  VIRTIO_NET_HDR_F_NEEDS_CSUM bit in the “flags” field may be
  set: if so, the checksum on the packet is incomplete and the
  “csum_start” and “csum_offset” fields indicate how to calculate
  it (see [ite:csum_start-is-set]).

  If the VIRTIO_NET_F_GUEST_TSO4, TSO6 or UFO options were
  negotiated, then the “gso_type” may be something other than
  VIRTIO_NET_HDR_GSO_NONE, and the “gso_size” field indicates the
  desired MSS (see [enu:If-the-driver]).
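
As an illustration of the VIRTIO_NET_HDR_F_NEEDS_CSUM case, the
following is a minimal sketch of how a receiver might complete a
partial checksum from “csum_start” and “csum_offset”. The helper
names csum_fold and complete_partial_csum are hypothetical and are
not part of this specification; the sketch assumes the packet is
held in one contiguous buffer and that the checksum field already
holds the pseudo-header sum, as described for transmission.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator into a 16-bit ones' complement sum. */
static uint16_t csum_fold(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/*
 * Hypothetical helper: sum the packet from csum_start to its end as
 * big-endian 16-bit words, then store the complemented result at
 * csum_start + csum_offset. Returns 0 on success, or -1 if the
 * offsets do not fit inside the packet.
 */
static int complete_partial_csum(uint8_t *pkt, size_t len,
                                 uint16_t csum_start, uint16_t csum_offset)
{
    size_t pos = (size_t)csum_start + csum_offset;
    uint32_t sum = 0;
    uint16_t csum;
    size_t i;

    if (csum_start > len || pos + 2 > len)
        return -1;

    for (i = csum_start; i + 1 < len; i += 2)
        sum += ((uint32_t)pkt[i] << 8) | pkt[i + 1];
    if (i < len)                    /* odd trailing byte, zero padded */
        sum += (uint32_t)pkt[i] << 8;

    csum = (uint16_t)~csum_fold(sum);
    pkt[pos] = (uint8_t)(csum >> 8);
    pkt[pos + 1] = (uint8_t)(csum & 0xff);
    return 0;
}
```

A driver that passes the packet to a stack which understands
partial checksums would normally forward the two offsets instead
of doing this work itself.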

  Control Virtqueue

The driver uses the control virtqueue (if VIRTIO_NET_F_CTRL_VQ is
negotiated) to send commands to manipulate various features of
the device which would not easily map into the configuration
space.

All commands are of the following form:

struct virtio_net_ctrl {
    u8 class;
    u8 command;
    u8 command-specific-data[];
    u8 ack;
};

/* ack values */
#define VIRTIO_NET_OK 0
#define VIRTIO_NET_ERR 1

The class, command and command-specific-data are set by the
driver, and the device sets the ack byte. There is little the
driver can do except issue a diagnostic if the ack byte is not
VIRTIO_NET_OK.

  Packet Receive Filtering

If the VIRTIO_NET_F_CTRL_RX feature is negotiated, the driver can
send control commands for promiscuous mode, multicast receiving,
and filtering of MAC addresses.

Note that in general, these commands are best-effort: unwanted
packets may still arrive.

  Setting Promiscuous Mode

#define VIRTIO_NET_CTRL_RX 0
 #define VIRTIO_NET_CTRL_RX_PROMISC 0
 #define VIRTIO_NET_CTRL_RX_ALLMULTI 1

The class VIRTIO_NET_CTRL_RX has two commands:
VIRTIO_NET_CTRL_RX_PROMISC turns promiscuous mode on and off, and
VIRTIO_NET_CTRL_RX_ALLMULTI turns all-multicast receive on and
off. The command-specific-data is one byte containing 0 (off) or
1 (on).
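
The byte layout of such a command can be sketched as follows. The
helper ctrl_rx_fill is hypothetical and not part of this
specification; it only fills a flat 4-byte buffer and leaves the
placement of that buffer on the control virtqueue (the first three
bytes readable by the device, the ack byte writable by it) to the
driver's normal virtqueue code.

```c
#include <stddef.h>
#include <stdint.h>

#define VIRTIO_NET_OK               0
#define VIRTIO_NET_ERR              1
#define VIRTIO_NET_CTRL_RX          0
#define VIRTIO_NET_CTRL_RX_PROMISC  0
#define VIRTIO_NET_CTRL_RX_ALLMULTI 1

/*
 * Hypothetical helper: lay out class, command, one data byte and
 * the ack byte. Returns the number of bytes the device only reads
 * (3), or 0 if the buffer is too small.
 */
static size_t ctrl_rx_fill(uint8_t *buf, size_t buflen,
                           uint8_t command, int on)
{
    if (buflen < 4)
        return 0;
    buf[0] = VIRTIO_NET_CTRL_RX;   /* class */
    buf[1] = command;              /* PROMISC or ALLMULTI */
    buf[2] = on ? 1 : 0;           /* command-specific-data */
    buf[3] = VIRTIO_NET_ERR;       /* ack: preset so an untouched byte reads as failure */
    return 3;
}
```

Presetting the ack byte to VIRTIO_NET_ERR is a design choice of
this sketch, so that a command the device never answered is not
mistaken for success.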

  Setting MAC Address Filtering

struct virtio_net_ctrl_mac {
    u32 entries;
    u8 macs[entries][ETH_ALEN];
};

#define VIRTIO_NET_CTRL_MAC 1
 #define VIRTIO_NET_CTRL_MAC_TABLE_SET 0

The device can filter incoming packets by any number of
destination MAC addresses.[footnote:
Since there are no guarantees, it can use a hash filter
or silently switch to allmulti or promiscuous mode if it is given
too many addresses.
] This table is set using the class VIRTIO_NET_CTRL_MAC and the
command VIRTIO_NET_CTRL_MAC_TABLE_SET. The command-specific-data
is two variable length tables of 6-byte MAC addresses. The first
table contains unicast addresses, and the second contains
multicast addresses.

  VLAN Filtering

If the driver negotiates the VIRTIO_NET_F_CTRL_VLAN feature, it
can control a VLAN filter table in the device.

#define VIRTIO_NET_CTRL_VLAN 2
 #define VIRTIO_NET_CTRL_VLAN_ADD 0
 #define VIRTIO_NET_CTRL_VLAN_DEL 1

Both the VIRTIO_NET_CTRL_VLAN_ADD and VIRTIO_NET_CTRL_VLAN_DEL
commands take a 16-bit VLAN id as the command-specific-data.

  Gratuitous Packet Sending

If the driver negotiates the VIRTIO_NET_F_GUEST_ANNOUNCE feature
(which depends on VIRTIO_NET_F_CTRL_VQ), the device can ask the
guest to send gratuitous packets; this is usually done after the
guest has been physically migrated, and needs to announce its
presence on the new network links. (As the hypervisor does not
have knowledge of the guest network configuration (eg. tagged
vlan) it is simplest to prod the guest in this way).

#define VIRTIO_NET_CTRL_ANNOUNCE 3
 #define VIRTIO_NET_CTRL_ANNOUNCE_ACK 0

The Guest needs to check the VIRTIO_NET_S_ANNOUNCE bit in the
status field when it notices a change of device configuration.
The command VIRTIO_NET_CTRL_ANNOUNCE_ACK is used to indicate that
the driver has received the notification; the device clears the
VIRTIO_NET_S_ANNOUNCE bit in the status field after it receives
this command.

Processing this notification involves:

  Sending the gratuitous packets, or marking that there are
  pending gratuitous packets to be sent and letting a deferred
  routine send them.

  Sending the VIRTIO_NET_CTRL_ANNOUNCE_ACK command through the
  control vq.

Appendix D: Block Device

The virtio block device is a simple virtual block device (ie.
disk). Read and write requests (and other exotic requests) are
placed in the queue, and serviced (probably out of order) by the
device except where noted.

  Configuration

  Subsystem Device ID 2

  Virtqueues 0:requestq.

  Feature bits

    VIRTIO_BLK_F_BARRIER (0) Host supports request barriers.

    VIRTIO_BLK_F_SIZE_MAX (1) Maximum size of any single segment is
    in “size_max”.

    VIRTIO_BLK_F_SEG_MAX (2) Maximum number of segments in a
    request is in “seg_max”.

    VIRTIO_BLK_F_GEOMETRY (4) Disk-style geometry specified in
    “geometry”.

    VIRTIO_BLK_F_RO (5) Device is read-only.

    VIRTIO_BLK_F_BLK_SIZE (6) Block size of disk is in “blk_size”.

    VIRTIO_BLK_F_SCSI (7) Device supports scsi packet commands.

    VIRTIO_BLK_F_FLUSH (9) Cache flush command support.

  Device configuration layout The capacity of the device
  (expressed in 512-byte sectors) is always present. The
  availability of the others all depend on various feature bits
  as indicated above.

struct virtio_blk_config {
    u64 capacity;
    u32 size_max;
    u32 seg_max;
    struct virtio_blk_geometry {
        u16 cylinders;
        u8 heads;
        u8 sectors;
    } geometry;
    u32 blk_size;
};

  Device Initialization

  The device size should be read from the “capacity”
  configuration field. No requests should be submitted which go
  beyond this limit.

  If the VIRTIO_BLK_F_BLK_SIZE feature is negotiated, the
  blk_size field can be read to determine the optimal sector size
  for the driver to use. This does not affect the units used in
  the protocol (always 512 bytes), but awareness of the correct
  value can affect performance.

  If the VIRTIO_BLK_F_RO feature is set by the device, any write
  requests will fail.

  Device Operation

The driver queues requests to the virtqueue, and they are used by
the device (not necessarily in order).
Each request is of the form:

struct virtio_blk_req {
    u32 type;
    u32 ioprio;
    u64 sector;
    char data[][512];
    u8 status;
};

If the device has the VIRTIO_BLK_F_SCSI feature, it can also
support scsi packet command requests. Each of these requests is
of the form:

struct virtio_scsi_pc_req {
    u32 type;
    u32 ioprio;
    u64 sector;
    char cmd[];
    char data[][512];
#define SCSI_SENSE_BUFFERSIZE 96
    u8 sense[SCSI_SENSE_BUFFERSIZE];
    u32 errors;
    u32 data_len;
    u32 sense_len;
    u32 residual;
    u8 status;
};

The type of the request is either a read (VIRTIO_BLK_T_IN), a
write (VIRTIO_BLK_T_OUT), a scsi packet command
(VIRTIO_BLK_T_SCSI_CMD or VIRTIO_BLK_T_SCSI_CMD_OUT[footnote:
the SCSI_CMD and SCSI_CMD_OUT types are equivalent, the device
does not distinguish between them
]) or a flush (VIRTIO_BLK_T_FLUSH or VIRTIO_BLK_T_FLUSH_OUT[footnote:
the FLUSH and FLUSH_OUT types are equivalent, the device does not
distinguish between them
]). If the device has the VIRTIO_BLK_F_BARRIER feature, the high
bit (VIRTIO_BLK_T_BARRIER) indicates that this request acts as a
barrier: all preceding requests must be complete before
this one, and all following requests must not be started until
this is complete. Note that a barrier does not flush caches in
the underlying backend device in the host, and thus does not
serve as a data consistency guarantee. The driver must use a
FLUSH request to flush the host cache.

#define VIRTIO_BLK_T_IN 0
#define VIRTIO_BLK_T_OUT 1
#define VIRTIO_BLK_T_SCSI_CMD 2
#define VIRTIO_BLK_T_SCSI_CMD_OUT 3
#define VIRTIO_BLK_T_FLUSH 4
#define VIRTIO_BLK_T_FLUSH_OUT 5
#define VIRTIO_BLK_T_BARRIER 0x80000000

The ioprio field is a hint about the relative priorities of
requests to the device: higher numbers indicate more important
requests.

The sector number indicates the offset (multiplied by 512) where
the read or write is to occur. This field is unused and set to 0
for scsi packet commands and for flush commands.

The cmd field is only present for scsi packet command requests,
and indicates the command to perform. This field must reside in a
single, separate read-only buffer; command length can be derived
from the length of this buffer.

Note that these first three (four for scsi packet commands)
fields are always read-only: the data field is either read-only
or write-only, depending on the request. The size of the read or
write can be derived from the total size of the request buffers.

The sense field is only present for scsi packet command requests,
and indicates the buffer for scsi sense data.

The data_len field is only present for scsi packet command
requests; this field is deprecated, and should be ignored by the
driver. Historically, devices copied data length there.

The sense_len field is only present for scsi packet command
requests and indicates the number of bytes actually written to
the sense buffer.

The residual field is only present for scsi packet command
requests and indicates the residual size, calculated as data
length - number of bytes actually transferred.
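
For plain read and write requests (not scsi packet commands), the
rule that the transfer size is derived from the total size of the
request buffers can be sketched as below, together with the check
against the “capacity” field required at initialization. The
helper names and the BLK_* constants are hypothetical; the 16-byte
header and 1-byte status sizes follow from struct virtio_blk_req.

```c
#include <stddef.h>
#include <stdint.h>

#define BLK_SECTOR_SIZE    512u  /* protocol unit, always 512 bytes */
#define BLK_REQ_HDR_LEN    16u   /* type (4) + ioprio (4) + sector (8) */
#define BLK_REQ_STATUS_LEN 1u    /* trailing status byte */

/*
 * Hypothetical helper: number of 512-byte sectors carried by a
 * request whose buffers total total_len bytes, or -1 if the length
 * cannot describe a well-formed request.
 */
static int64_t blk_req_data_sectors(size_t total_len)
{
    size_t data_len;

    if (total_len < BLK_REQ_HDR_LEN + BLK_REQ_STATUS_LEN)
        return -1;
    data_len = total_len - BLK_REQ_HDR_LEN - BLK_REQ_STATUS_LEN;
    if (data_len % BLK_SECTOR_SIZE != 0)
        return -1;
    return (int64_t)(data_len / BLK_SECTOR_SIZE);
}

/*
 * Hypothetical helper: nonzero if [sector, sector + nsectors) lies
 * within a device of the given capacity (in 512-byte sectors).
 * Written to avoid overflow in sector + nsectors.
 */
static int blk_req_in_range(uint64_t sector, uint64_t nsectors,
                            uint64_t capacity)
{
    return nsectors <= capacity && sector <= capacity - nsectors;
}
```

A flush request carries no data, so it yields zero sectors under
this sketch.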

The final status byte is written by the device: either
VIRTIO_BLK_S_OK for success, VIRTIO_BLK_S_IOERR for host or guest
error or VIRTIO_BLK_S_UNSUPP for a request unsupported by host:

#define VIRTIO_BLK_S_OK 0
#define VIRTIO_BLK_S_IOERR 1
#define VIRTIO_BLK_S_UNSUPP 2

Historically, devices assumed that the fields type, ioprio and
sector reside in a single, separate read-only buffer; the sense
field in a separate write-only buffer of size 96 bytes, by
itself; the fields errors, data_len, sense_len and residual in a
single, separate write-only buffer; and the status field in a
separate write-only buffer of size 1 byte, by itself.

Appendix E: Console Device

The virtio console device is a simple device for data input and
output. A device may have one or more ports. Each port has a pair
of input and output virtqueues. Moreover, a device has a pair of
control IO virtqueues. The control virtqueues are used to
communicate information between the device and the driver about
ports being opened and closed on either side of the connection,
indication from the host about whether a particular port is a
console port, adding new ports, port hot-plug/unplug, etc., and
indication from the guest about whether a port or a device was
successfully added, port open/close, etc. For data IO, one or
more empty buffers are placed in the receive queue for incoming
data and outgoing characters are placed in the transmit queue.

  Configuration

  Subsystem Device ID 3

  Virtqueues 0:receiveq(port0). 1:transmitq(port0), 2:control
  receiveq[footnote:
Ports 2 onwards only if VIRTIO_CONSOLE_F_MULTIPORT is set
], 3:control transmitq, 4:receiveq(port1), 5:transmitq(port1),
  ...

  Feature bits

  VIRTIO_CONSOLE_F_SIZE (0) Configuration cols and rows fields
  are valid.

  VIRTIO_CONSOLE_F_MULTIPORT (1) Device has support for multiple
  ports; configuration fields nr_ports and max_nr_ports are
  valid and control virtqueues will be used.

  Device configuration layout The size of the console is supplied
  in the configuration space if the VIRTIO_CONSOLE_F_SIZE feature
  is set. Furthermore, if the VIRTIO_CONSOLE_F_MULTIPORT feature
  is set, the maximum number of ports supported by the device can
  be fetched.

struct virtio_console_config {
    u16 cols;
    u16 rows;

    u32 max_nr_ports;
};

  Device Initialization

  If the VIRTIO_CONSOLE_F_SIZE feature is negotiated, the driver
  can read the console dimensions from the configuration fields.

  If the VIRTIO_CONSOLE_F_MULTIPORT feature is negotiated, the
  driver can spawn multiple ports, not all of which may be
  attached to a console. Some could be generic ports. In this
  case, the control virtqueues are enabled and according to the
  max_nr_ports configuration-space value, the appropriate number
  of virtqueues are created. A control message indicating the
  driver is ready is sent to the host. The host can then send
  control messages for adding new ports to the device. After
  creating and initializing each port, a
  VIRTIO_CONSOLE_PORT_READY control message is sent to the host
  for that port so the host can let us know of any additional
  configuration options set for that port.

  The receiveq for each port is populated with one or more
  receive buffers.
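
The virtqueue numbering listed under Configuration can be sketched
in C as below. This is an illustration only: the helper names are
hypothetical and not part of this specification, the formula for
ports 1 onwards is read off the listing (4 and 5 for port 1,
continuing in pairs), and the total assumes one receiveq/transmitq
pair per port plus the control pair.

```c
#include <assert.h>
#include <stdint.h>

/* Control virtqueue indices, as listed under Configuration. */
#define CONSOLE_CTRL_RECEIVEQ  2u
#define CONSOLE_CTRL_TRANSMITQ 3u

/* Hypothetical helper: receiveq index for a given port number.
 * Port 0 uses virtqueue 0; ports 1.. come after the control pair. */
static uint32_t console_port_receiveq(uint32_t port)
{
    return port == 0 ? 0u : 2u * port + 2u;
}

/* Hypothetical helper: the transmitq always follows the receiveq. */
static uint32_t console_port_transmitq(uint32_t port)
{
    return console_port_receiveq(port) + 1u;
}

/* Hypothetical helper: number of virtqueues to set up.
 * Assumes max_nr_ports was read from configuration space and is
 * only meaningful when VIRTIO_CONSOLE_F_MULTIPORT was negotiated. */
static uint32_t console_num_virtqueues(int multiport, uint32_t max_nr_ports)
{
    if (!multiport)
        return 2u;                    /* port 0 only */
    assert(max_nr_ports >= 1u);
    return 2u * (max_nr_ports + 1u);  /* ports plus control pair */
}
```

With multiport negotiated and max_nr_ports equal to 2, this sketch
yields six virtqueues: 0 and 1 for port 0, 2 and 3 for control, and
4 and 5 for port 1.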

  Device Operation

  For output, a buffer containing the characters is placed in the
  port's transmitq.[footnote:
Because this is high importance and low bandwidth, the current
Linux implementation polls for the buffer to be used, rather than
waiting for an interrupt, simplifying the implementation
significantly. However, for generic serial ports with the
O_NONBLOCK flag set, the polling limitation is relaxed and the
consumed buffers are freed upon the next write or poll call or
when a port is closed or hot-unplugged.
]

  When a buffer is used in the receiveq (signalled by an
  interrupt), its contents are the input to the port associated
  with the virtqueue for which the notification was received.

  If the driver negotiated the VIRTIO_CONSOLE_F_SIZE feature, a
  configuration change interrupt may occur. The updated size can
  be read from the configuration fields.

  If the driver negotiated the VIRTIO_CONSOLE_F_MULTIPORT
  feature, active ports are announced by the host using the
  VIRTIO_CONSOLE_PORT_ADD control message. The same message is
  used for port hot-plug as well.

  If the host specified a port `name', a sysfs attribute is
  created with the name filled in, so that udev rules can be
  written that can create a symlink from the port's name to the
  char device for port discovery by applications in the guest.

  Changes to ports' state are effected by control messages.
  Appropriate action is taken on the port indicated in the
  control message.
  The layout of the structure of the control
  buffer and the events associated are:

struct virtio_console_control {
    uint32_t id;    /* Port number */
    uint16_t event; /* The kind of control event */
    uint16_t value; /* Extra information for the event */
};

/* Some events for the internal messages (control packets) */

#define VIRTIO_CONSOLE_DEVICE_READY 0
#define VIRTIO_CONSOLE_PORT_ADD 1
#define VIRTIO_CONSOLE_PORT_REMOVE 2
#define VIRTIO_CONSOLE_PORT_READY 3
#define VIRTIO_CONSOLE_CONSOLE_PORT 4
#define VIRTIO_CONSOLE_RESIZE 5
#define VIRTIO_CONSOLE_PORT_OPEN 6
#define VIRTIO_CONSOLE_PORT_NAME 7

Appendix F: Entropy Device

The virtio entropy device supplies high-quality randomness for
guest use.

  Configuration

  Subsystem Device ID 4

  Virtqueues 0:requestq.

  Feature bits None currently defined.

  Device configuration layout None currently defined.

  Device Initialization

  The virtqueue is initialized.

  Device Operation

When the driver requires random bytes, it places the descriptor
of one or more buffers in the queue. Each buffer will be
completely filled with random data by the device.

Appendix G: Memory Balloon Device

The virtio memory balloon device is a primitive device for
managing guest memory: the device asks for a certain amount of
memory, and the guest supplies it (or withdraws it, if the device
has more than it asks for). This allows the guest to adapt to
changes in allowance of underlying physical memory. If the
feature is negotiated, the device can also be used to communicate
guest memory statistics to the host.

  Configuration

  Subsystem Device ID 5

  Virtqueues 0:inflateq. 1:deflateq.
  2:statsq.[footnote:
Only if VIRTIO_BALLOON_F_STATS_VQ set
]

  Feature bits

  VIRTIO_BALLOON_F_MUST_TELL_HOST (0) Host must be told before
  pages from the balloon are used.

  VIRTIO_BALLOON_F_STATS_VQ (1) A virtqueue for reporting guest
  memory statistics is present.

  Device configuration layout Both fields of this configuration
  are always available. Note that they are little endian, despite
  convention that device fields are guest endian:

struct virtio_balloon_config {
    u32 num_pages;
    u32 actual;
};

  Device Initialization

  The inflate and deflate virtqueues are identified.

  If the VIRTIO_BALLOON_F_STATS_VQ feature bit is negotiated:

  Identify the stats virtqueue.

  Add one empty buffer to the stats virtqueue and notify the
  host.

Device operation begins immediately.

  Device Operation

  Memory Ballooning The device is driven by the receipt of a
  configuration change interrupt.

  The “num_pages” configuration field is examined. If this is
  greater than the “actual” number of pages, memory must be given
  to the balloon. If it is less than the “actual” number of
  pages, memory may be taken back from the balloon for general
  use.

  To supply memory to the balloon (aka. inflate):

  The driver constructs an array of addresses of unused memory
  pages. These addresses are divided by 4096[footnote:
This is historical, and independent of the guest page size
] and the descriptor describing the resulting 32-bit array is
  added to the inflateq.

  To remove memory from the balloon (aka. deflate):

  The driver constructs an array of addresses of memory pages it
  has previously given to the balloon, as described above. This
  descriptor is added to the deflateq.
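
The comparison of “num_pages” with “actual” and the construction
of the 32-bit array described above can be sketched as follows.
This is a minimal illustration, not the Linux driver: the function
names are hypothetical, the addresses are assumed to be
guest-physical and 4096-byte aligned, and the little-endian
conversion of the configuration fields is assumed to have been
done by the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Balloon entries are always in 4096-byte units, independent of
 * the guest page size. */
#define BALLOON_PFN_SHIFT 12

/* Hypothetical helper: pages still to be moved.
 * Positive: inflate by that many pages; negative: deflate. */
static int64_t balloon_delta(uint32_t num_pages, uint32_t actual)
{
    return (int64_t)num_pages - (int64_t)actual;
}

/* Hypothetical helper: turn page addresses into the 32-bit array
 * that is queued on the inflateq or deflateq. Returns the number
 * of entries written. */
static size_t balloon_fill_pfns(const uint64_t *addrs, size_t n,
                                uint32_t *pfns)
{
    for (size_t i = 0; i < n; i++) {
        /* Assumption: addresses are 4096-byte aligned and the
         * resulting page number fits in 32 bits. */
        assert((addrs[i] & 0xfffu) == 0);
        assert((addrs[i] >> BALLOON_PFN_SHIFT) <= UINT32_MAX);
        pfns[i] = (uint32_t)(addrs[i] >> BALLOON_PFN_SHIFT);
    }
    return n;
}
```

The same array format serves both directions; only the virtqueue
the descriptor is added to differs.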

  If the VIRTIO_BALLOON_F_MUST_TELL_HOST feature is set, the
  guest may not use these requested pages until that descriptor
  in the deflateq has been used by the device.

  Otherwise, the guest may begin to re-use pages previously given
  to the balloon before the device has acknowledged their
  withdrawal. [footnote:
In this case, deflation advice is merely a courtesy
]

  In either case, once the device has completed the inflation or
  deflation, the “actual” field of the configuration should be
  updated to reflect the new number of pages in the balloon.[footnote:
As updates to configuration space are not atomic, this field
isn't particularly reliable, but can be used to diagnose buggy
guests.
]

  Memory Statistics

The stats virtqueue is atypical because communication is driven
by the device (not the driver). The channel becomes active at
driver initialization time when the driver adds an empty buffer
and notifies the device. A request for memory statistics proceeds
as follows:

  The device pushes the buffer onto the used ring and sends an
  interrupt.

  The driver pops the used buffer and discards it.

  The driver collects memory statistics and writes them into a
  new buffer.

  The driver adds the buffer to the virtqueue and notifies the
  device.

  The device pops the buffer (retaining it to initiate a
  subsequent request) and consumes the statistics.

  Memory Statistics Format Each statistic consists of a 16 bit
  tag and a 64 bit value. Both quantities are represented in the
  native endian of the guest. All statistics are optional and the
  driver may choose which ones to supply. To guarantee backwards
  compatibility, unsupported statistics should be omitted.

struct virtio_balloon_stat {
#define VIRTIO_BALLOON_S_SWAP_IN 0
#define VIRTIO_BALLOON_S_SWAP_OUT 1
#define VIRTIO_BALLOON_S_MAJFLT 2
#define VIRTIO_BALLOON_S_MINFLT 3
#define VIRTIO_BALLOON_S_MEMFREE 4
#define VIRTIO_BALLOON_S_MEMTOT 5
    u16 tag;
    u64 val;
} __attribute__((packed));

  Tags

  VIRTIO_BALLOON_S_SWAP_IN The amount of memory that has been
  swapped in (in bytes).

  VIRTIO_BALLOON_S_SWAP_OUT The amount of memory that has been
  swapped out to disk (in bytes).

  VIRTIO_BALLOON_S_MAJFLT The number of major page faults that
  have occurred.

  VIRTIO_BALLOON_S_MINFLT The number of minor page faults that
  have occurred.

  VIRTIO_BALLOON_S_MEMFREE The amount of memory not being used
  for any purpose (in bytes).

  VIRTIO_BALLOON_S_MEMTOT The total amount of memory available
  (in bytes).

Appendix H: Rpmsg: Remote Processor Messaging

Virtio rpmsg devices represent remote processors on the system
which run in asymmetric multi-processing (AMP) configuration, and
which are usually used to offload cpu-intensive tasks from the
main application processor (a typical SoC methodology).

Virtio is being used to communicate with those remote processors;
empty buffers are placed in one virtqueue for receiving messages,
and non-empty buffers, containing outbound messages, are enqueued
in a second virtqueue for transmission.

Numerous communication channels can be multiplexed over those two
virtqueues, so different entities, running on the application and
remote processor, can directly communicate in a point-to-point
fashion.

  Configuration

  Subsystem Device ID 7

  Virtqueues 0:receiveq. 1:transmitq.

  Feature bits

  VIRTIO_RPMSG_F_NS (0) Device sends (and is capable of receiving)
  name service messages announcing the creation (or
  destruction) of a channel:

/**
 * struct rpmsg_ns_msg - dynamic name service announcement message
 * @name: name of remote service that is published
 * @addr: address of remote service that is published
 * @flags: indicates whether service is created or destroyed
 *
 * This message is sent across to publish a new service (or announce
 * about its removal). When we receive these messages, an appropriate
 * rpmsg channel (i.e device) is created/destroyed.
 */
struct rpmsg_ns_msg {
    char name[RPMSG_NAME_SIZE];
    u32 addr;
    u32 flags;
} __packed;

/**
 * enum rpmsg_ns_flags - dynamic name service announcement flags
 *
 * @RPMSG_NS_CREATE: a new remote service was just created
 * @RPMSG_NS_DESTROY: a remote service was just destroyed
 */
enum rpmsg_ns_flags {
    RPMSG_NS_CREATE = 0,
    RPMSG_NS_DESTROY = 1,
};

  Device configuration layout

At this point none currently defined.

  Device Initialization

  The initialization routine should identify the receive and
  transmission virtqueues.

  The receive virtqueue should be filled with receive buffers.

  Device Operation

Messages are transmitted by placing them in the transmitq, and
buffers for inbound messages are placed in the receiveq.
In any case, messages are always preceded by the following header:

/**
 * struct rpmsg_hdr - common header for all rpmsg messages
 * @src: source address
 * @dst: destination address
 * @reserved: reserved for future use
 * @len: length of payload (in bytes)
 * @flags: message flags
 * @data: @len bytes of message payload data
 *
 * Every message sent(/received) on the rpmsg bus begins with
 * this header.
 */
struct rpmsg_hdr {
    u32 src;
    u32 dst;
    u32 reserved;
    u16 len;
    u16 flags;
    u8 data[0];
} __packed;

Appendix I: SCSI Host Device

The virtio SCSI host device groups together one or more virtual
logical units (such as disks), and allows communicating to them
using the SCSI protocol. An instance of the device represents a
SCSI host to which many targets and LUNs are attached.

The virtio SCSI device services two kinds of requests:

  command requests for a logical unit;

  task management functions related to a logical unit, target or
  command.

The device is also able to send out notifications about added and
removed logical units. Together, these capabilities provide a
SCSI transport protocol that uses virtqueues as the transfer
medium. In the transport protocol, the virtio driver acts as the
initiator, while the virtio SCSI host provides one or more
targets that receive and process the requests.

  Configuration

  Subsystem Device ID 8

  Virtqueues 0:controlq; 1:eventq; 2..n:request queues.

  Feature bits

  VIRTIO_SCSI_F_INOUT (0) A single request can include both
  read-only and write-only data buffers.

  VIRTIO_SCSI_F_HOTPLUG (1) The host should enable
  hot-plug/hot-unplug of new LUNs and targets on the SCSI bus.

  Device configuration layout All fields of this configuration
  are always available. sense_size and cdb_size are writable by
  the guest.

struct virtio_scsi_config {
    u32 num_queues;
    u32 seg_max;
    u32 max_sectors;
    u32 cmd_per_lun;
    u32 event_info_size;
    u32 sense_size;
    u32 cdb_size;
    u16 max_channel;
    u16 max_target;
    u32 max_lun;
};

  num_queues is the total number of request virtqueues exposed by
  the device. The driver is free to use only one request queue,
  or it can use more to achieve better performance.

  seg_max is the maximum number of segments that can be in a
  command. A bidirectional command can include seg_max input
  segments and seg_max output segments.

  max_sectors is a hint to the guest about the maximum transfer
  size it should use.

  cmd_per_lun is a hint to the guest about the maximum number of
  linked commands it should send to one LUN. The actual value
  to be used is the minimum of cmd_per_lun and the virtqueue
  size.

  event_info_size is the maximum size that the device will fill
  for buffers that the driver places in the eventq. The driver
  should always put buffers of at least this size. It is
  written by the device depending on the set of negotiated
  features.

  sense_size is the maximum size of the sense data that the
  device will write. The default value is written by the device
  and will always be 96, but the driver can modify it. It is
  restored to the default when the device is reset.

  cdb_size is the maximum size of the CDB that the driver will
  write. The default value is written by the device and will
  always be 32, but the driver can likewise modify it. It is
  restored to the default when the device is reset.

  max_channel, max_target and max_lun can be used by the driver
  as hints to constrain scanning the logical units on the
  host.

  Device Initialization

The initialization routine should first of all discover the
device's virtqueues.

If the driver uses the eventq, it should then place at least one
buffer in the eventq.

The driver can immediately issue requests (for example, INQUIRY
or REPORT LUNS) or task management functions (for example, I_T
RESET).

  Device Operation: request queues

The driver queues requests to an arbitrary request queue, and
they are used by the device on that same queue. It is the
responsibility of the driver to ensure strict request ordering
for commands placed on different queues, because they will be
consumed with no order constraints.

Requests have the following format:

struct virtio_scsi_req_cmd {
    // Read-only
    u8 lun[8];
    u64 id;
    u8 task_attr;
    u8 prio;
    u8 crn;
    char cdb[cdb_size];
    char dataout[];
    // Write-only part
    u32 sense_len;
    u32 residual;
    u16 status_qualifier;
    u8 status;
    u8 response;
    u8 sense[sense_size];
    char datain[];
};

/* command-specific response values */
#define VIRTIO_SCSI_S_OK 0
#define VIRTIO_SCSI_S_OVERRUN 1
#define VIRTIO_SCSI_S_ABORTED 2
#define VIRTIO_SCSI_S_BAD_TARGET 3
#define VIRTIO_SCSI_S_RESET 4
#define VIRTIO_SCSI_S_BUSY 5
#define VIRTIO_SCSI_S_TRANSPORT_FAILURE 6
#define VIRTIO_SCSI_S_TARGET_FAILURE 7
#define VIRTIO_SCSI_S_NEXUS_FAILURE 8
#define VIRTIO_SCSI_S_FAILURE 9

/* task_attr */
#define VIRTIO_SCSI_S_SIMPLE 0
#define VIRTIO_SCSI_S_ORDERED 1
#define VIRTIO_SCSI_S_HEAD 2
#define VIRTIO_SCSI_S_ACA 3

The lun field addresses a target and logical unit in the
virtio-scsi device's SCSI domain. The only supported format for
the LUN field is: first byte set to 1, second byte set to target,
third and fourth byte representing a single level LUN structure,
followed by four zero bytes. With this representation, a
virtio-scsi device can serve up to 256 targets and 16384 LUNs per
target.

The id field is the command identifier (“tag”).

task_attr, prio and crn should be left to zero. task_attr defines
the task attribute as in the table above, but all task attributes
may be mapped to SIMPLE by the device; crn may also be provided
by clients, but is generally expected to be 0. The maximum CRN
value defined by the protocol is 255, since CRN is stored in an
8-bit integer.

All of these fields are defined in SAM. They are always
read-only, as are the cdb and dataout fields. The cdb_size is
taken from the configuration space.

sense and subsequent fields are always write-only. The sense_len
field indicates the number of bytes actually written to the sense
buffer. The residual field indicates the residual size,
calculated as “data_length - number_of_transferred_bytes”, for
read or write operations. For bidirectional commands, the
number_of_transferred_bytes includes both read and written bytes.
A residual field that is less than the size of datain means that
the dataout field was processed entirely. A residual field that
exceeds the size of datain means that the dataout field was
processed partially and the datain field was not processed at
all.

The status byte is written by the device to be the status code as
defined in SAM.
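
The 8-byte lun format described above can be filled in as in the
following sketch. The function name is hypothetical. One detail is
an assumption not stated in this text: the 0x40 bit set in the
third byte marks the SAM flat space addressing method for the
single level LUN structure, which is what the Linux driver does;
it leaves 14 bits for the LUN, matching the 16384 LUNs per target
mentioned above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: encode target and LUN into the lun[8]
 * field of a virtio-scsi request. */
static void virtio_scsi_encode_lun(uint8_t out[8], uint8_t target,
                                   uint16_t lun)
{
    assert(lun < 16384);            /* only 14 bits are available */

    memset(out, 0, 8);              /* last four bytes stay zero */
    out[0] = 1;                     /* first byte is always 1 */
    out[1] = target;                /* second byte: target */
    /* Assumption: 0x40 selects SAM flat space addressing. */
    out[2] = (uint8_t)((lun >> 8) | 0x40);
    out[3] = (uint8_t)(lun & 0xff);
}
```

For target 3 and LUN 5 this produces the bytes 01 03 40 05 00 00
00 00.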

The response byte is written by the device to be one of the
following:

  VIRTIO_SCSI_S_OK when the request was completed and the status
  byte is filled with a SCSI status code (not necessarily
  "GOOD").

  VIRTIO_SCSI_S_OVERRUN if the content of the CDB requires
  transferring more data than is available in the data buffers.

  VIRTIO_SCSI_S_ABORTED if the request was cancelled due to an
  ABORT TASK or ABORT TASK SET task management function.

  VIRTIO_SCSI_S_BAD_TARGET if the request was never processed
  because the target indicated by the lun field does not exist.

  VIRTIO_SCSI_S_RESET if the request was cancelled due to a bus
  or device reset (including a task management function).

  VIRTIO_SCSI_S_TRANSPORT_FAILURE if the request failed due to a
  problem in the connection between the host and the target
  (severed link).

  VIRTIO_SCSI_S_TARGET_FAILURE if the target is suffering a
  failure and the guest should not retry on other paths.

  VIRTIO_SCSI_S_NEXUS_FAILURE if the nexus is suffering a failure
  but retrying on other paths might yield a different result.

  VIRTIO_SCSI_S_BUSY if the request failed but retrying on the
  same path should work.

  VIRTIO_SCSI_S_FAILURE for other host or guest error. In
  particular, if neither dataout nor datain is empty, and the
  VIRTIO_SCSI_F_INOUT feature has not been negotiated, the
  request will be immediately returned with a response equal to
  VIRTIO_SCSI_S_FAILURE.

  Device Operation: controlq

The controlq is used for other SCSI transport operations.
Requests have the following format:

struct virtio_scsi_ctrl {
    u32 type;
    ...
    u8 response;
};

/* response values valid for all commands */
#define VIRTIO_SCSI_S_OK 0
#define VIRTIO_SCSI_S_BAD_TARGET 3
#define VIRTIO_SCSI_S_BUSY 5
#define VIRTIO_SCSI_S_TRANSPORT_FAILURE 6
#define VIRTIO_SCSI_S_TARGET_FAILURE 7
#define VIRTIO_SCSI_S_NEXUS_FAILURE 8
#define VIRTIO_SCSI_S_FAILURE 9
#define VIRTIO_SCSI_S_INCORRECT_LUN 12

The type identifies the remaining fields.

The following commands are defined:

  Task management function

#define VIRTIO_SCSI_T_TMF 0

#define VIRTIO_SCSI_T_TMF_ABORT_TASK 0
#define VIRTIO_SCSI_T_TMF_ABORT_TASK_SET 1
#define VIRTIO_SCSI_T_TMF_CLEAR_ACA 2
#define VIRTIO_SCSI_T_TMF_CLEAR_TASK_SET 3
#define VIRTIO_SCSI_T_TMF_I_T_NEXUS_RESET 4
#define VIRTIO_SCSI_T_TMF_LOGICAL_UNIT_RESET 5
#define VIRTIO_SCSI_T_TMF_QUERY_TASK 6
#define VIRTIO_SCSI_T_TMF_QUERY_TASK_SET 7

struct virtio_scsi_ctrl_tmf
{
    // Read-only part
    u32 type;
    u32 subtype;
    u8 lun[8];
    u64 id;
    // Write-only part
    u8 response;
};

/* command-specific response values */
#define VIRTIO_SCSI_S_FUNCTION_COMPLETE 0
#define VIRTIO_SCSI_S_FUNCTION_SUCCEEDED 10
#define VIRTIO_SCSI_S_FUNCTION_REJECTED 11

  The type is VIRTIO_SCSI_T_TMF. All fields except response are
  filled by the driver. The subtype field must always be
  specified and identifies the requested task management
  function.

  Other fields may be irrelevant for the requested TMF; if so,
  they are ignored but they should still be present.
  The lun field is in the same format specified for request queues; the
  single level LUN is ignored when the task management function
  addresses a whole I_T nexus. When relevant, the value of the id
  field is matched against the id values passed on the requestq.

  The outcome of the task management function is written by the
  device in the response field. The command-specific response
  values map 1-to-1 with those defined in SAM.

  Asynchronous notification query

#define VIRTIO_SCSI_T_AN_QUERY 1

struct virtio_scsi_ctrl_an {
    // Read-only part
    u32 type;
    u8 lun[8];
    u32 event_requested;
    // Write-only part
    u32 event_actual;
    u8 response;
};

#define VIRTIO_SCSI_EVT_ASYNC_OPERATIONAL_CHANGE 2
#define VIRTIO_SCSI_EVT_ASYNC_POWER_MGMT 4
#define VIRTIO_SCSI_EVT_ASYNC_EXTERNAL_REQUEST 8
#define VIRTIO_SCSI_EVT_ASYNC_MEDIA_CHANGE 16
#define VIRTIO_SCSI_EVT_ASYNC_MULTI_HOST 32
#define VIRTIO_SCSI_EVT_ASYNC_DEVICE_BUSY 64

  By sending this command, the driver asks the device which
  events the given LUN can report, as described in paragraphs 6.6
  and A.6 of the SCSI MMC specification. The driver writes the
  events it is interested in into the event_requested; the device
  responds by writing the events that it supports into
  event_actual.

  The type is VIRTIO_SCSI_T_AN_QUERY. The lun and event_requested
  fields are written by the driver. The event_actual and response
  fields are written by the device.

  No command-specific values are defined for the response byte.
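
  Since the event values above are single bits, event_requested
  and event_actual can be treated as bit masks. The sketch below
  shows how a driver might check which of its requested events
  the device did not report as supported; the helper name is
  hypothetical and the masks are assumed to be already converted
  to host byte order.

```c
#include <assert.h>
#include <stdint.h>

#define VIRTIO_SCSI_EVT_ASYNC_OPERATIONAL_CHANGE 2
#define VIRTIO_SCSI_EVT_ASYNC_POWER_MGMT         4
#define VIRTIO_SCSI_EVT_ASYNC_EXTERNAL_REQUEST   8
#define VIRTIO_SCSI_EVT_ASYNC_MEDIA_CHANGE       16
#define VIRTIO_SCSI_EVT_ASYNC_MULTI_HOST         32
#define VIRTIO_SCSI_EVT_ASYNC_DEVICE_BUSY        64

/* Hypothetical helper: events the driver asked about that are
 * absent from the mask the device wrote into event_actual. */
static uint32_t an_events_unsupported(uint32_t event_requested,
                                      uint32_t event_actual)
{
    return event_requested & ~event_actual;
}
```

  A result of zero means every requested event is supported for
  the given LUN.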

  Asynchronous notification subscription

#define VIRTIO_SCSI_T_AN_SUBSCRIBE 2

struct virtio_scsi_ctrl_an {
    // Read-only part
    u32 type;
    u8 lun[8];
    u32 event_requested;
    // Write-only part
    u32 event_actual;
    u8 response;
};

  By sending this command, the driver asks the specified LUN to
  report events for its physical interface, again as described in
  the SCSI MMC specification. The driver writes the events it is
  interested in into the event_requested; the device responds by
  writing the events that it supports into event_actual.

  Event types are the same as for the asynchronous notification
  query message.

  The type is VIRTIO_SCSI_T_AN_SUBSCRIBE. The lun and
  event_requested fields are written by the driver. The
  event_actual and response fields are written by the device.

  No command-specific values are defined for the response byte.

  Device Operation: eventq

The eventq is used by the device to report information on logical
units that are attached to it. The driver should always leave a
few buffers ready in the eventq. In general, the device will not
queue events to cope with an empty eventq, and will end up
dropping events if it finds no buffer ready. However, when
reporting events for many LUNs (e.g. when a whole target
disappears), the device can throttle events to avoid dropping
them. For this reason, placing 10-15 buffers on the event queue
should be enough.

Buffers are placed in the eventq and filled by the device when
interesting events occur. The buffers should be strictly
write-only (device-filled) and the size of the buffers should be
at least the value given in the device's configuration
information.

Buffers returned by the device on the eventq will be referred to
as "events" in the rest of this section. Events have the
following format:

#define VIRTIO_SCSI_T_EVENTS_MISSED 0x80000000

struct virtio_scsi_event {
    // Write-only part
    u32 event;
    ...
};

If bit 31 is set in the event field, the device failed to report
an event due to missing buffers. In this case, the driver should
poll the logical units for unit attention conditions, and/or do
whatever form of bus scan is appropriate for the guest operating
system.

Other data that the device writes to the buffer depends on the
contents of the event field. The following events are defined:

  No event

#define VIRTIO_SCSI_T_NO_EVENT 0

  This event is fired in the following cases:

  When the device detects in the eventq a buffer that is shorter
  than what is indicated in the configuration field, it might
  use it immediately and put this dummy value in the event
  field. A well-written driver will never observe this
  situation.

  When events are dropped, the device may signal this event as
  soon as the driver makes a buffer available, in order to
  request action from the driver. In this case, of course, this
  event will be reported with the VIRTIO_SCSI_T_EVENTS_MISSED
  flag.
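
  Splitting the event field into the “events missed” flag (bit
  31) and the event type can be sketched as follows. The helper
  names are hypothetical, and the value is assumed to have been
  read from the buffer in the guest's byte order already.

```c
#include <assert.h>
#include <stdint.h>

#define VIRTIO_SCSI_T_EVENTS_MISSED 0x80000000u
#define VIRTIO_SCSI_T_NO_EVENT      0u

/* Hypothetical helper: non-zero if the device dropped events
 * because no buffer was available in the eventq. */
static int scsi_event_missed(uint32_t event)
{
    return (event & VIRTIO_SCSI_T_EVENTS_MISSED) != 0;
}

/* Hypothetical helper: the event type with the flag masked off. */
static uint32_t scsi_event_type(uint32_t event)
{
    return event & ~VIRTIO_SCSI_T_EVENTS_MISSED;
}
```

  An event field of 0x80000000 therefore decodes as “no event”
  with events missed, which is the case described in the second
  item above.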

  Transport reset

#define VIRTIO_SCSI_T_TRANSPORT_RESET 1

struct virtio_scsi_event_reset {
    // Write-only part
    u32 event;
    u8 lun[8];
    u32 reason;
};

#define VIRTIO_SCSI_EVT_RESET_HARD 0
#define VIRTIO_SCSI_EVT_RESET_RESCAN 1
#define VIRTIO_SCSI_EVT_RESET_REMOVED 2

  By sending this event, the device signals that a logical unit
  on a target has been reset, including the case of a new device
  appearing or disappearing on the bus. The device fills in all
  fields. The event field is set to
  VIRTIO_SCSI_T_TRANSPORT_RESET. The lun field addresses a
  logical unit in the SCSI host.

  The reason value is one of the three #define values appearing
  above:

  VIRTIO_SCSI_EVT_RESET_REMOVED (“LUN/target removed”) is used if
  the target or logical unit is no longer able to receive
  commands.

  VIRTIO_SCSI_EVT_RESET_HARD (“LUN hard reset”) is used if the
  logical unit has been reset, but is still present.

  VIRTIO_SCSI_EVT_RESET_RESCAN (“rescan LUN/target”) is used if a
  target or logical unit has just appeared on the device.

  The “removed” and “rescan” events, when sent for LUN 0, may
  apply to the entire target. After receiving them the driver
  should ask the initiator to rescan the target, in order to
  detect the case when an entire target has appeared or
  disappeared. These two events will never be reported unless the
  VIRTIO_SCSI_F_HOTPLUG feature was negotiated between the host
  and the guest.
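
  How a driver might dispatch on the reason field is sketched
  below. The enum and function names are hypothetical, and
  ignoring “removed” and “rescan” when VIRTIO_SCSI_F_HOTPLUG was
  not negotiated is a defensive assumption of this sketch: the
  text above only says the device will not send them in that
  case.

```c
#include <assert.h>
#include <stdint.h>

#define VIRTIO_SCSI_EVT_RESET_HARD    0
#define VIRTIO_SCSI_EVT_RESET_RESCAN  1
#define VIRTIO_SCSI_EVT_RESET_REMOVED 2

/* Hypothetical driver-side actions for a transport reset event. */
enum reset_action {
    RESET_ACTION_IGNORE,      /* unknown or unexpected reason */
    RESET_ACTION_NOTE_RESET,  /* LUN was reset but is still present */
    RESET_ACTION_RESCAN,      /* ask the initiator to rescan */
    RESET_ACTION_REMOVE       /* LUN or target can no longer be used */
};

static enum reset_action transport_reset_action(uint32_t reason,
                                                int hotplug_negotiated)
{
    switch (reason) {
    case VIRTIO_SCSI_EVT_RESET_HARD:
        return RESET_ACTION_NOTE_RESET;
    case VIRTIO_SCSI_EVT_RESET_RESCAN:
        return hotplug_negotiated ? RESET_ACTION_RESCAN
                                  : RESET_ACTION_IGNORE;
    case VIRTIO_SCSI_EVT_RESET_REMOVED:
        return hotplug_negotiated ? RESET_ACTION_REMOVE
                                  : RESET_ACTION_IGNORE;
    default:
        return RESET_ACTION_IGNORE;
    }
}
```

  For events addressed to LUN 0, a driver following the paragraph
  above would additionally rescan the whole target after either
  the “removed” or the “rescan” action.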

  Events will also be reported via sense codes (this obviously
  does not apply to newly appeared buses or targets, since the
  application has never discovered them):

  “LUN/target removed” maps to sense key ILLEGAL REQUEST, asc
  0x25, ascq 0x00 (LOGICAL UNIT NOT SUPPORTED)

  “LUN hard reset” maps to sense key UNIT ATTENTION, asc 0x29
  (POWER ON, RESET OR BUS DEVICE RESET OCCURRED)

  “rescan LUN/target” maps to sense key UNIT ATTENTION, asc 0x3f,
  ascq 0x0e (REPORTED LUNS DATA HAS CHANGED)

  The preferred way to detect transport reset is always to use
  events, because sense codes are only seen by the driver when it
  sends a SCSI command to the logical unit or target. However, in
  case events are dropped, the initiator will still be able to
  synchronize with the actual state of the controller if the
  driver asks the initiator to rescan the SCSI bus. During the
  rescan, the initiator will be able to observe the above sense
  codes, and it will process them as if the driver had
  received the equivalent event.

  Asynchronous notification

#define VIRTIO_SCSI_T_ASYNC_NOTIFY 2

struct virtio_scsi_event_an {
    // Write-only part
    u32 event;
    u8 lun[8];
    u32 reason;
};

  By sending this event, the device signals that an asynchronous
  event was fired from a physical interface.

  All fields are written by the device. The event field is set to
  VIRTIO_SCSI_T_ASYNC_NOTIFY. The lun field addresses a logical
  unit in the SCSI host. The reason field is a subset of the
  events that the driver has subscribed to via the "Asynchronous
  notification subscription" command.

  When dropped events are reported, the driver should poll for
  asynchronous events manually using SCSI commands.

Appendix X: virtio-mmio

Virtual environments without PCI support (a common situation in
embedded device models) might use a simple memory mapped device
(“virtio-mmio”) instead of the PCI device.

The memory mapped virtio device behaviour is based on the PCI
device specification. Therefore most operations, such as device
initialization, queue configuration and buffer transfers, are
nearly identical. The differences are described in the
following sections.

  Device Initialization

Instead of using the PCI IO space for the virtio header, the
“virtio-mmio” device provides a set of memory mapped control
registers, all 32 bits wide, followed by device-specific
configuration space. The following list presents their layout:

  Offset from the device base address | Direction | Name
    Description

  0x000 | R | MagicValue
    “virt” string.

  0x004 | R | Version
    Device version number. Currently must be 1.

  0x008 | R | DeviceID
    Virtio Subsystem Device ID (e.g. 1 for network card).

  0x00c | R | VendorID
    Virtio Subsystem Vendor ID.

  0x010 | R | HostFeatures
    Flags representing features the device supports.
    Reading from this register returns 32 consecutive flag bits,
    first bit depending on the last value written to the
    HostFeaturesSel register. Access to this register returns
    bits HostFeaturesSel*32 to (HostFeaturesSel*32)+31, e.g.
    feature bits 0 to 31 if HostFeaturesSel is set to 0 and
    feature bits 32 to 63 if HostFeaturesSel is set to 1. Also
    see [sub:Feature-Bits]

  0x014 | W | HostFeaturesSel
    Device (Host) features word selection.
    Writing to this register selects a set of 32 device feature
    bits accessible by reading from the HostFeatures register.
    The device driver must write a value to the HostFeaturesSel
    register before reading from the HostFeatures register.

  0x020 | W | GuestFeatures
    Flags representing device features understood and activated by
    the driver.
    Writing to this register sets 32 consecutive flag bits, first
    bit depending on the last value written to the
    GuestFeaturesSel register. Access to this register sets bits
    GuestFeaturesSel*32 to (GuestFeaturesSel*32)+31, e.g. feature
    bits 0 to 31 if GuestFeaturesSel is set to 0 and feature bits
    32 to 63 if GuestFeaturesSel is set to 1. Also see
    [sub:Feature-Bits]

  0x024 | W | GuestFeaturesSel
    Activated (Guest) features word selection.
    Writing to this register selects a set of 32 activated feature
    bits accessible by writing to the GuestFeatures register.
    The device driver must write a value to the GuestFeaturesSel
    register before writing to the GuestFeatures register.

  0x028 | W | GuestPageSize
    Guest page size.
    The device driver must write the guest page size in bytes to
    the register during initialization, before any queues are
    used. This value must be a power of 2 and is used by the Host
    to calculate the Guest address of the first queue page (see
    QueuePFN).

  0x030 | W | QueueSel
    Virtual queue index (first queue is 0).
    Writing to this register selects the virtual queue that the
    following operations on QueueNum, QueueAlign and QueuePFN apply
    to.

  0x034 | R | QueueNumMax
    Maximum virtual queue size.
    Reading from the register returns the maximum size of the queue
    the Host is ready to process or zero (0x0) if the queue is not
    available. This applies to the queue selected by writing to
    QueueSel and is allowed only when QueuePFN is set to zero
    (0x0), that is, when the queue is not actively used.

  0x038 | W | QueueNum
    Virtual queue size.
    Queue size is the number of elements in the queue, and
    therefore the size of the descriptor table and both
    available and used rings.
    Writing to this register notifies the Host what size of
    queue the Guest will use. This applies to the queue selected
    by writing to QueueSel.

  0x03c | W | QueueAlign
    Used Ring alignment in the virtual queue.
    Writing to this register notifies the Host about the
    alignment boundary of the Used Ring in bytes. This value must
    be a power of 2 and applies to the queue selected by writing
    to QueueSel.

  0x040 | RW | QueuePFN
    Guest physical page number of the virtual queue.
    Writing to this register notifies the Host about the location
    of the virtual queue in the Guest's physical address space.
    This value is the index number of a page starting with the
    queue Descriptor Table. Value zero (0x0) means physical
    address zero (0x00000000) and is illegal. When the Guest stops
    using the queue it must write zero (0x0) to this register.
    Reading from this register returns the currently used page
    number of the queue, therefore a value other than zero (0x0)
    means that the queue is in use.
    Both read and write accesses apply to the queue selected by
    writing to QueueSel.

  0x050 | W | QueueNotify
    Queue notifier.
    Writing a queue index to this register notifies the Host that
    there are new buffers to process in the queue.

  0x060 | R | InterruptStatus
    Interrupt status.
    Reading from this register returns a bit mask of interrupts
    asserted by the device. An interrupt is asserted if the
    corresponding bit is set, i.e. equals one (1).

      Bit 0 | Used Ring Update
        This interrupt is asserted when the Host has updated the
        Used Ring in at least one of the active virtual queues.

      Bit 1 | Configuration change
        This interrupt is asserted when configuration of the
        device has changed.

  0x064 | W | InterruptACK
    Interrupt acknowledge.
    Writing to this register notifies the Host that the Guest
    finished handling interrupts. Set bits in the value clear the
    corresponding bits of the InterruptStatus register.

  0x070 | RW | Status
    Device status.
    Reading from this register returns the current device status
    flags.
    Writing non-zero values to this register sets the status flags,
    indicating the Guest progress. Writing zero (0x0) to this
    register triggers a device reset.
    Also see [sub:Device-Initialization-Sequence]

  0x100+ | RW | Config
    Device-specific configuration space starts at offset 0x100
    and is accessed with byte alignment. Its meaning and size
    depend on the device and the driver.

Virtual queue size is the number of elements in the queue, and
therefore the size of the descriptor table and both available
and used rings.

The endianness of the registers follows the native endianness of
the Guest. Writing to registers described as “R” and reading from
registers described as “W” is not permitted and can cause
undefined behavior.

The device initialization is performed as described in
[sub:Device-Initialization-Sequence] with one exception: the
Guest must notify the Host about its page size, writing the size
in bytes to the GuestPageSize register before the initialization
is finished.

The memory mapped virtio devices generate a single interrupt
only, therefore no special configuration is required.

  Virtqueue Configuration

The virtual queue configuration is performed in a similar way to
the one described in [sec:Virtqueue-Configuration] with a few
additional operations:

  Select the queue by writing its index (first queue is 0) to the
  QueueSel register.

  Check that the queue is not already in use: read the QueuePFN
  register; the returned value should be zero (0x0).

  Read the maximum queue size (number of elements) from the
  QueueNumMax register. If the returned value is zero (0x0) the
  queue is not available.

  Allocate and zero the queue pages in contiguous virtual memory,
  aligning the Used Ring to an optimal boundary (usually page
  size). Size of the allocated queue may be smaller than or equal
  to the maximum size returned by the Host.

  Notify the Host about the queue size by writing the size to the
  QueueNum register.

  Notify the Host about the used alignment by writing its value
  in bytes to the QueueAlign register.

  Write the physical page number of the first page of the queue
  to the QueuePFN register.

The queue and the device are ready to begin normal operations
now.

  Device Operation

The memory mapped virtio device behaves in the same way as
described in [sec:Device-Operation], with the following
exceptions:

  The device is notified about new buffers available in a queue
  by writing the queue index to the QueueNotify register instead
  of the virtio header in PCI I/O space
  ([sub:Notifying-The-Device]).

  The memory mapped virtio device uses a single, dedicated
  interrupt signal, which is raised when at least one of the
  interrupts described in the InterruptStatus register
  description is asserted. After receiving an interrupt, the
  driver must read the InterruptStatus register to check what
  caused the interrupt (see the register description). After the
  interrupt is handled, the driver must acknowledge it by writing
  a bit mask corresponding to the serviced interrupt to the
  InterruptACK register.
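
The register accesses described in this appendix can be sketched
in C as follows. This is an illustration under stated
assumptions, not a reference driver: the MMIO_* macro names and
the mmio_*() helpers are hypothetical, the magic constant assumes
a little-endian guest reading the “virt” string as one 32-bit
value, and queue memory allocation, feature negotiation and the
Status register handshake are left out. The caller is assumed to
supply a valid queue size and the page number of already
allocated, zeroed queue memory.

```c
#include <assert.h>
#include <stdint.h>

/* Register offsets from the list in this appendix (names are illustrative). */
#define MMIO_MAGIC_VALUE       0x000
#define MMIO_VERSION           0x004
#define MMIO_DEVICE_ID         0x008
#define MMIO_GUEST_PAGE_SIZE   0x028
#define MMIO_QUEUE_SEL         0x030
#define MMIO_QUEUE_NUM_MAX     0x034
#define MMIO_QUEUE_NUM         0x038
#define MMIO_QUEUE_ALIGN       0x03c
#define MMIO_QUEUE_PFN         0x040
#define MMIO_QUEUE_NOTIFY      0x050
#define MMIO_INTERRUPT_STATUS  0x060
#define MMIO_INTERRUPT_ACK     0x064

/* "virt" read as a 32-bit word; assumes a little-endian guest. */
#define MMIO_MAGIC 0x74726976u

static uint32_t reg_read(volatile uint32_t *base, uint32_t off)
{
    return base[off / 4];
}

static void reg_write(volatile uint32_t *base, uint32_t off, uint32_t val)
{
    base[off / 4] = val;
}

/* Check magic and version, then announce the guest page size.
 * Returns the DeviceID, or 0 if magic or version do not match. */
static uint32_t mmio_probe(volatile uint32_t *base, uint32_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    if (reg_read(base, MMIO_MAGIC_VALUE) != MMIO_MAGIC)
        return 0;
    if (reg_read(base, MMIO_VERSION) != 1)
        return 0;
    reg_write(base, MMIO_GUEST_PAGE_SIZE, page_size);
    return reg_read(base, MMIO_DEVICE_ID);
}

/* Follow the Virtqueue Configuration steps for one queue.
 * Returns the queue size in use, or 0 if the queue is busy or absent. */
static uint32_t mmio_setup_queue(volatile uint32_t *base, uint32_t index,
                                 uint32_t wanted, uint32_t align, uint32_t pfn)
{
    uint32_t max, num;

    assert(align != 0 && (align & (align - 1)) == 0);
    assert(pfn != 0);                       /* page number zero is illegal */

    reg_write(base, MMIO_QUEUE_SEL, index);
    if (reg_read(base, MMIO_QUEUE_PFN) != 0)
        return 0;                           /* queue already in use */
    max = reg_read(base, MMIO_QUEUE_NUM_MAX);
    if (max == 0)
        return 0;                           /* queue not available */
    num = wanted < max ? wanted : max;      /* may be smaller than max */

    reg_write(base, MMIO_QUEUE_NUM, num);
    reg_write(base, MMIO_QUEUE_ALIGN, align);
    reg_write(base, MMIO_QUEUE_PFN, pfn);
    return num;
}

/* Tell the Host that the given queue has new buffers. */
static void mmio_notify(volatile uint32_t *base, uint32_t index)
{
    reg_write(base, MMIO_QUEUE_NOTIFY, index);
}

/* Read the interrupt cause and acknowledge exactly the bits seen.
 * A real handler would service the used rings (bit 0) and the
 * configuration change (bit 1) before acknowledging. */
static uint32_t mmio_handle_irq(volatile uint32_t *base)
{
    uint32_t status = reg_read(base, MMIO_INTERRUPT_STATUS);
    reg_write(base, MMIO_INTERRUPT_ACK, status);
    return status;
}
```

On real hardware base would point at the mapped device registers;
for experimentation an ordinary array of 32-bit words can stand
in for the device, since the helpers only index into it.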