Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

vfio: Extend the device migration protocol with RUNNING_P2P

The RUNNING_P2P state is designed to support multiple devices in the same
VM that are doing P2P transactions between themselves. When in RUNNING_P2P
the device must be able to accept incoming P2P transactions but should not
generate outgoing P2P transactions.

As an optional extension to the mandatory states it is defined as
in between STOP and RUNNING:
STOP -> RUNNING_P2P -> RUNNING -> RUNNING_P2P -> STOP
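
The ordering this state enables can be sketched for a user that stops several devices sharing P2P traffic. This is only a toy model, not VFIO code: model_set_state() is a hypothetical stand-in for a per-device SET_STATE request, and the enum mirrors the state values added by this patch rather than including the uAPI header. The model refuses to stop a device while any peer is still RUNNING, since such a peer could still target it with P2P DMA.

```c
#include <stddef.h>

/* Local mirror of the state values from the patch, not the uAPI header. */
enum mig_state {
	ST_ERROR = 0, ST_STOP = 1, ST_RUNNING = 2,
	ST_STOP_COPY = 3, ST_RESUMING = 4, ST_RUNNING_P2P = 5
};

/*
 * Hypothetical stand-in for issuing SET_STATE on device idx. It rejects a
 * move to STOP while a peer is still RUNNING, because that peer could still
 * send P2P transactions to the device being stopped.
 */
static int model_set_state(enum mig_state *devs, size_t n, size_t idx,
			   enum mig_state next)
{
	size_t i;

	if (next == ST_STOP)
		for (i = 0; i < n; i++)
			if (i != idx && devs[i] == ST_RUNNING)
				return -1;
	devs[idx] = next;
	return 0;
}

/* Quiesce outgoing P2P on every device first, then stop each device. */
static int stop_all_two_phase(enum mig_state *devs, size_t n)
{
	size_t i;

	/* Phase 1: devices still accept P2P but no longer generate it. */
	for (i = 0; i < n; i++)
		if (model_set_state(devs, n, i, ST_RUNNING_P2P))
			return -1;

	/* Phase 2: no peer can initiate P2P any more, so stopping is safe. */
	for (i = 0; i < n; i++)
		if (model_set_state(devs, n, i, ST_STOP))
			return -1;
	return 0;
}
```

In this model, stopping one device directly while its peers are RUNNING is rejected, whereas the two-phase sequence always succeeds.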

For drivers that are unable to support RUNNING_P2P the core code
silently merges RUNNING_P2P and RUNNING together. Unless driver support
is present, the new state cannot be used in SET_STATE.
Drivers that support this will be required to implement 4 FSM arcs
beyond the basic FSM. 2 of the basic FSM arcs become combination
transitions.

Compared to the v1 clarification, NDMA is redefined into FSM states and is
described in terms of the desired P2P quiescent behavior, noting that
halting all DMA is an acceptable implementation.

Link: https://lore.kernel.org/all/20220224142024.147653-11-yishaih@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>

Authored by Jason Gunthorpe and committed by Leon Romanovsky
8cb3d83b 115dcec6

+104 -20
+69 -18
drivers/vfio/vfio.c
···
 			    enum vfio_device_mig_state new_fsm,
 			    enum vfio_device_mig_state *next_fsm)
 {
-	enum { VFIO_DEVICE_NUM_STATES = VFIO_DEVICE_STATE_RESUMING + 1 };
+	enum { VFIO_DEVICE_NUM_STATES = VFIO_DEVICE_STATE_RUNNING_P2P + 1 };
 	/*
-	 * The coding in this table requires the driver to implement 6
-	 * FSM arcs:
+	 * The coding in this table requires the driver to implement the
+	 * following FSM arcs:
 	 *         RESUMING -> STOP
-	 *         RUNNING -> STOP
 	 *         STOP -> RESUMING
-	 *         STOP -> RUNNING
 	 *         STOP -> STOP_COPY
 	 *         STOP_COPY -> STOP
 	 *
-	 * The coding will step through multiple states for these combination
-	 * transitions:
-	 *         RESUMING -> STOP -> RUNNING
+	 * If P2P is supported then the driver must also implement these FSM
+	 * arcs:
+	 *         RUNNING -> RUNNING_P2P
+	 *         RUNNING_P2P -> RUNNING
+	 *         RUNNING_P2P -> STOP
+	 *         STOP -> RUNNING_P2P
+	 * Without P2P the driver must implement:
+	 *         RUNNING -> STOP
+	 *         STOP -> RUNNING
+	 *
+	 * The coding will step through multiple states for some combination
+	 * transitions; if all optional features are supported, this means the
+	 * following ones:
+	 *         RESUMING -> STOP -> RUNNING_P2P
+	 *         RESUMING -> STOP -> RUNNING_P2P -> RUNNING
 	 *         RESUMING -> STOP -> STOP_COPY
-	 *         RUNNING -> STOP -> RESUMING
-	 *         RUNNING -> STOP -> STOP_COPY
+	 *         RUNNING -> RUNNING_P2P -> STOP
+	 *         RUNNING -> RUNNING_P2P -> STOP -> RESUMING
+	 *         RUNNING -> RUNNING_P2P -> STOP -> STOP_COPY
+	 *         RUNNING_P2P -> STOP -> RESUMING
+	 *         RUNNING_P2P -> STOP -> STOP_COPY
+	 *         STOP -> RUNNING_P2P -> RUNNING
 	 *         STOP_COPY -> STOP -> RESUMING
-	 *         STOP_COPY -> STOP -> RUNNING
+	 *         STOP_COPY -> STOP -> RUNNING_P2P
+	 *         STOP_COPY -> STOP -> RUNNING_P2P -> RUNNING
 	 */
 	static const u8 vfio_from_fsm_table[VFIO_DEVICE_NUM_STATES][VFIO_DEVICE_NUM_STATES] = {
 		[VFIO_DEVICE_STATE_STOP] = {
 			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
-			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
 			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RESUMING,
+			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
 		},
 		[VFIO_DEVICE_STATE_RUNNING] = {
-			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
-			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP,
-			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
 		},
 		[VFIO_DEVICE_STATE_STOP_COPY] = {
···
 			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
 			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
 		},
 		[VFIO_DEVICE_STATE_RESUMING] = {
···
 			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RESUMING,
+			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
+		},
+		[VFIO_DEVICE_STATE_RUNNING_P2P] = {
+			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
 		},
 		[VFIO_DEVICE_STATE_ERROR] = {
···
 			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_ERROR,
 			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_ERROR,
 			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_ERROR,
+			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_ERROR,
 			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
 		},
 	};

-	if (WARN_ON(cur_fsm >= ARRAY_SIZE(vfio_from_fsm_table)))
+	static const unsigned int state_flags_table[VFIO_DEVICE_NUM_STATES] = {
+		[VFIO_DEVICE_STATE_STOP] = VFIO_MIGRATION_STOP_COPY,
+		[VFIO_DEVICE_STATE_RUNNING] = VFIO_MIGRATION_STOP_COPY,
+		[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_MIGRATION_STOP_COPY,
+		[VFIO_DEVICE_STATE_RESUMING] = VFIO_MIGRATION_STOP_COPY,
+		[VFIO_DEVICE_STATE_RUNNING_P2P] =
+			VFIO_MIGRATION_STOP_COPY | VFIO_MIGRATION_P2P,
+		[VFIO_DEVICE_STATE_ERROR] = ~0U,
+	};
+
+	if (WARN_ON(cur_fsm >= ARRAY_SIZE(vfio_from_fsm_table) ||
+		    (state_flags_table[cur_fsm] & device->migration_flags) !=
+			state_flags_table[cur_fsm]))
 		return -EINVAL;

-	if (new_fsm >= ARRAY_SIZE(vfio_from_fsm_table))
+	if (new_fsm >= ARRAY_SIZE(vfio_from_fsm_table) ||
+	   (state_flags_table[new_fsm] & device->migration_flags) !=
+			state_flags_table[new_fsm])
 		return -EINVAL;

+	/*
+	 * Arcs touching optional and unsupported states are skipped over. The
+	 * driver will instead see an arc from the original state to the next
+	 * logical state, as per the above comment.
+	 */
 	*next_fsm = vfio_from_fsm_table[cur_fsm][new_fsm];
+	while ((state_flags_table[*next_fsm] & device->migration_flags) !=
+			state_flags_table[*next_fsm])
+		*next_fsm = vfio_from_fsm_table[*next_fsm][new_fsm];
+
 	return (*next_fsm != VFIO_DEVICE_STATE_ERROR) ? 0 : -EINVAL;
 }
 EXPORT_SYMBOL_GPL(vfio_mig_get_next_state);
···
 					       size_t argsz)
 {
 	struct vfio_device_feature_migration mig = {
-		.flags = VFIO_MIGRATION_STOP_COPY,
+		.flags = device->migration_flags,
 	};
 	int ret;

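
The skip-over loop added above can be exercised outside the kernel with a small stand-alone model. The sketch below is not the kernel function: the names (mig_state, next_hop, need, next_state, count_arcs) are made up for illustration, the enum and flag values mirror the ones in this patch, and the two table entries hidden by hunk boundaries (STOP_COPY -> STOP and RESUMING -> STOP) are assumed from the arc list in the comment. Entries left out of the table default to 0, which is the ERROR state.

```c
/* Local mirrors of the values in the patch; not the kernel headers. */
enum mig_state {
	ST_ERROR = 0, ST_STOP = 1, ST_RUNNING = 2,
	ST_STOP_COPY = 3, ST_RESUMING = 4, ST_RUNNING_P2P = 5, ST_NUM
};
#define MIG_STOP_COPY (1u << 0)
#define MIG_P2P       (1u << 1)

/* next_hop[cur][target]; omitted entries default to ST_ERROR (0). */
static const unsigned char next_hop[ST_NUM][ST_NUM] = {
	[ST_STOP] = {
		[ST_STOP] = ST_STOP,
		[ST_RUNNING] = ST_RUNNING_P2P,
		[ST_STOP_COPY] = ST_STOP_COPY,
		[ST_RESUMING] = ST_RESUMING,
		[ST_RUNNING_P2P] = ST_RUNNING_P2P,
	},
	[ST_RUNNING] = {
		[ST_STOP] = ST_RUNNING_P2P,
		[ST_RUNNING] = ST_RUNNING,
		[ST_STOP_COPY] = ST_RUNNING_P2P,
		[ST_RESUMING] = ST_RUNNING_P2P,
		[ST_RUNNING_P2P] = ST_RUNNING_P2P,
	},
	[ST_STOP_COPY] = {
		[ST_STOP] = ST_STOP,	/* assumed: not visible in the diff */
		[ST_RUNNING] = ST_STOP,
		[ST_STOP_COPY] = ST_STOP_COPY,
		[ST_RESUMING] = ST_STOP,
		[ST_RUNNING_P2P] = ST_STOP,
	},
	[ST_RESUMING] = {
		[ST_STOP] = ST_STOP,	/* assumed: not visible in the diff */
		[ST_RUNNING] = ST_STOP,
		[ST_STOP_COPY] = ST_STOP,
		[ST_RESUMING] = ST_RESUMING,
		[ST_RUNNING_P2P] = ST_STOP,
	},
	[ST_RUNNING_P2P] = {
		[ST_STOP] = ST_STOP,
		[ST_RUNNING] = ST_RUNNING,
		[ST_STOP_COPY] = ST_STOP,
		[ST_RESUMING] = ST_STOP,
		[ST_RUNNING_P2P] = ST_RUNNING_P2P,
	},
};

/* Flags a device must advertise before a state may be used. */
static const unsigned int need[ST_NUM] = {
	[ST_STOP] = MIG_STOP_COPY,
	[ST_RUNNING] = MIG_STOP_COPY,
	[ST_STOP_COPY] = MIG_STOP_COPY,
	[ST_RESUMING] = MIG_STOP_COPY,
	[ST_RUNNING_P2P] = MIG_STOP_COPY | MIG_P2P,
	[ST_ERROR] = ~0u,
};

static int supported(unsigned int s, unsigned int flags)
{
	return s < ST_NUM && (need[s] & flags) == need[s];
}

/* One step toward target, skipping states the flags do not cover. */
static int next_state(unsigned int flags, unsigned int cur,
		      unsigned int target, unsigned int *next)
{
	if (!supported(cur, flags) || !supported(target, flags))
		return -1;
	*next = next_hop[cur][target];
	while (!supported(*next, flags))
		*next = next_hop[*next][target];
	return *next != ST_ERROR ? 0 : -1;
}

/* Number of driver-visible arcs from cur to target, or -1 if rejected. */
static int count_arcs(unsigned int flags, unsigned int cur, unsigned int target)
{
	unsigned int next;
	int arcs = 0;

	if (!supported(cur, flags) || !supported(target, flags))
		return -1;
	while (cur != target) {
		if (next_state(flags, cur, target, &next))
			return -1;
		cur = next;
		arcs++;
	}
	return arcs;
}
```

In this model a device advertising only MIG_STOP_COPY sees STOP -> RUNNING as a single arc, while one that also advertises MIG_P2P sees it as STOP -> RUNNING_P2P followed by RUNNING_P2P -> RUNNING. Requesting RUNNING_P2P without MIG_P2P is rejected up front, matching the check on new_fsm.
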
+1
include/linux/vfio.h
···
 	struct vfio_group *group;
 	struct vfio_device_set *dev_set;
 	struct list_head dev_set_list;
+	unsigned int migration_flags;

 	/* Members below here are private, not for driver use */
 	refcount_t refcount;
+34 -2
include/uapi/linux/vfio.h
···
  *
  * VFIO_MIGRATION_STOP_COPY means that STOP, STOP_COPY and
  * RESUMING are supported.
+ *
+ * VFIO_MIGRATION_STOP_COPY | VFIO_MIGRATION_P2P means that RUNNING_P2P
+ * is supported in addition to the STOP_COPY states.
+ *
+ * Other combinations of flags have behavior to be defined in the future.
  */
 struct vfio_device_feature_migration {
 	__aligned_u64 flags;
 #define VFIO_MIGRATION_STOP_COPY	(1 << 0)
+#define VFIO_MIGRATION_P2P		(1 << 1)
 };
 #define VFIO_DEVICE_FEATURE_MIGRATION 1

···
  *  RESUMING - The device is stopped and is loading a new internal state
  *  ERROR - The device has failed and must be reset
  *
+ * And 1 optional state to support VFIO_MIGRATION_P2P:
+ *  RUNNING_P2P - RUNNING, except the device cannot do peer to peer DMA
+ *
  * The FSM takes actions on the arcs between FSM states. The driver implements
  * the following behavior for the FSM arcs:
  *
- * RUNNING -> STOP
+ * RUNNING_P2P -> STOP
  * STOP_COPY -> STOP
  *   While in STOP the device must stop the operation of the device. The device
  *   must not generate interrupts, DMA, or any other change to external state.
···
  *
  *   To abort a RESUMING session the device must be reset.
  *
- * STOP -> RUNNING
+ * RUNNING_P2P -> RUNNING
  *   While in RUNNING the device is fully operational, the device may generate
  *   interrupts, DMA, respond to MMIO, all vfio device regions are functional,
  *   and the device may advance its internal state.
+ *
+ * RUNNING -> RUNNING_P2P
+ * STOP -> RUNNING_P2P
+ *   While in RUNNING_P2P the device is partially running in the P2P quiescent
+ *   state defined below.
  *
  * STOP -> STOP_COPY
  *   This arc begin the process of saving the device state and will return a
···
  *   To recover from ERROR VFIO_DEVICE_RESET must be used to return the
  *   device_state back to RUNNING.
  *
+ * The optional peer to peer (P2P) quiescent state is intended to be a quiescent
+ * state for the device for the purposes of managing multiple devices within a
+ * user context where peer-to-peer DMA between devices may be active. The
+ * RUNNING_P2P states must prevent the device from initiating
+ * any new P2P DMA transactions. If the device can identify P2P transactions
+ * then it can stop only P2P DMA, otherwise it must stop all DMA. The migration
+ * driver must complete any such outstanding operations prior to completing the
+ * FSM arc into a P2P state. For the purpose of specification the states
+ * behave as though the device was fully running if not supported. Like while in
+ * STOP or STOP_COPY the user must not touch the device, otherwise the state
+ * can be exited.
+ *
  * The remaining possible transitions are interpreted as combinations of the
  * above FSM arcs. As there are multiple paths through the FSM arcs the path
  * should be selected based on the following rules:
···
  * fails. When handling these types of errors users should anticipate future
  * revisions of this protocol using new states and those states becoming
  * visible in this case.
+ *
+ * The optional states cannot be used with SET_STATE if the device does not
+ * support them. The user can discover if these states are supported by using
+ * VFIO_DEVICE_FEATURE_MIGRATION. By using combination transitions the user can
+ * avoid knowing about these optional states if the kernel driver supports them.
  */
 enum vfio_device_mig_state {
 	VFIO_DEVICE_STATE_ERROR = 0,
···
 	VFIO_DEVICE_STATE_RUNNING = 2,
 	VFIO_DEVICE_STATE_STOP_COPY = 3,
 	VFIO_DEVICE_STATE_RESUMING = 4,
+	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
 };

 /* -------- API for Type1 VFIO IOMMU -------- */
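
The flag semantics documented in this header can be illustrated from the user's side. The helper below is a hypothetical sketch, not part of the uAPI: it takes a flags value as it might be reported through VFIO_DEVICE_FEATURE_MIGRATION and decides whether RUNNING_P2P may be named in SET_STATE. The macros are local mirrors of the bit values in this patch, and treating every other combination as undefined is one conservative reading of "behavior to be defined in the future".

```c
#include <stdint.h>

/* Local mirrors of the uAPI bit values shown in the patch. */
#define MIG_STOP_COPY (1u << 0)
#define MIG_P2P       (1u << 1)

enum mig_support {
	MIG_UNDEFINED,	/* combination the header leaves for the future */
	MIG_BASIC,	/* STOP, STOP_COPY and RESUMING */
	MIG_BASIC_P2P	/* the above plus RUNNING_P2P */
};

/* Map a reported flags word onto the combinations the header defines. */
static enum mig_support classify_migration_flags(uint64_t flags)
{
	if (flags == MIG_STOP_COPY)
		return MIG_BASIC;
	if (flags == (MIG_STOP_COPY | MIG_P2P))
		return MIG_BASIC_P2P;
	return MIG_UNDEFINED;
}

/* Nonzero if the user may request RUNNING_P2P directly in SET_STATE. */
static int may_request_running_p2p(uint64_t flags)
{
	return classify_migration_flags(flags) == MIG_BASIC_P2P;
}
```

A user that never names RUNNING_P2P can ignore this check entirely, since combination transitions such as RUNNING -> STOP are routed through the optional state by the kernel when the driver supports it.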