Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
"S390:

- ultravisor communication device driver

- fix TEID on terminating storage key ops

RISC-V:

- Added Sv57x4 support for G-stage page table

- Added range based local HFENCE functions

- Added remote HFENCE functions based on VCPU requests

- Added ISA extension registers in ONE_REG interface

- Updated KVM RISC-V maintainers entry to cover selftests support

ARM:

- Add support for the ARMv8.6 WFxT extension

- Guard pages for the EL2 stacks

- Trap and emulate AArch32 ID registers to hide unsupported features

- Ability to select and save/restore the set of hypercalls exposed to
the guest

- Support for PSCI-initiated suspend in collaboration with userspace

- GICv3 register-based LPI invalidation support

- Move host PMU event merging into the vcpu data structure

- GICv3 ITS save/restore fixes

- The usual set of small-scale cleanups and fixes

x86:

- New ioctls to get/set TSC frequency for a whole VM

- Allow userspace to opt out of hypercall patching

- Only do MSR filtering for MSRs accessed by rdmsr/wrmsr

AMD SEV improvements:

- Add KVM_EXIT_SHUTDOWN metadata for SEV-ES

- V_TSC_AUX support

Nested virtualization improvements for AMD:

- Support for "nested nested" optimizations (nested vVMLOAD/VMSAVE,
nested vGIF)

- Allow AVIC to co-exist with a nested guest running

- Fixes for LBR virtualization when a nested guest is running, and
nested LBR virtualization support

- PAUSE filtering for nested hypervisors

Guest support:

- Decoupling of vcpu_is_preempted from PV spinlocks"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (199 commits)
KVM: x86: Fix the intel_pt PMI handling wrongly considered from guest
KVM: selftests: x86: Sync the new name of the test case to .gitignore
Documentation: kvm: reorder ARM-specific section about KVM_SYSTEM_EVENT_SUSPEND
x86, kvm: use correct GFP flags for preemption disabled
KVM: LAPIC: Drop pending LAPIC timer injection when canceling the timer
x86/kvm: Alloc dummy async #PF token outside of raw spinlock
KVM: x86: avoid calling x86 emulator without a decoded instruction
KVM: SVM: Use kzalloc for sev ioctl interfaces to prevent kernel data leak
x86/fpu: KVM: Set the base guest FPU uABI size to sizeof(struct kvm_xsave)
s390/uv_uapi: depend on CONFIG_S390
KVM: selftests: x86: Fix test failure on arch lbr capable platforms
KVM: LAPIC: Trace LAPIC timer expiration on every vmentry
KVM: s390: selftest: Test suppression indication on key prot exception
KVM: s390: Don't indicate suppression on dirtying, failing memop
selftests: drivers/s390x: Add uvdevice tests
drivers/s390/char: Add Ultravisor io device
MAINTAINERS: Update KVM RISC-V entry to cover selftests support
RISC-V: KVM: Introduce ISA extension register
RISC-V: KVM: Cleanup stale TLB entries when host CPU changes
RISC-V: KVM: Add remote HFENCE functions based on VCPU requests
...

+7851 -2343
+2
Documentation/arm64/cpu-feature-registers.rst
··· 290 290 +------------------------------+---------+---------+ 291 291 | RPRES | [7-4] | y | 292 292 +------------------------------+---------+---------+ 293 + | WFXT | [3-0] | y | 294 + +------------------------------+---------+---------+ 293 295 294 296 295 297 Appendix I: Example
+4
Documentation/arm64/elf_hwcaps.rst
··· 297 297 298 298 Functionality implied by ID_AA64SMFR0_EL1.FA64 == 0b1. 299 299 300 + HWCAP2_WFXT 301 + 302 + Functionality implied by ID_AA64ISAR2_EL1.WFXT == 0b0010. 303 + 300 304 4. Unused AT_HWCAP bits 301 305 ----------------------- 302 306
+236 -16
Documentation/virt/kvm/api.rst
··· 982 982 __u8 pad2[30]; 983 983 }; 984 984 985 - If the KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL flag is returned from the 986 - KVM_CAP_XEN_HVM check, it may be set in the flags field of this ioctl. 987 - This requests KVM to generate the contents of the hypercall page 988 - automatically; hypercalls will be intercepted and passed to userspace 989 - through KVM_EXIT_XEN. In this case, all of the blob size and address 990 - fields must be zero. 985 + If certain flags are returned from the KVM_CAP_XEN_HVM check, they may 986 + be set in the flags field of this ioctl: 987 + 988 + The KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL flag requests KVM to generate 989 + the contents of the hypercall page automatically; hypercalls will be 990 + intercepted and passed to userspace through KVM_EXIT_XEN. In this 991 + case, all of the blob size and address fields must be zero. 992 + 993 + The KVM_XEN_HVM_CONFIG_EVTCHN_SEND flag indicates to KVM that userspace 994 + will always use the KVM_XEN_HVM_EVTCHN_SEND ioctl to deliver event 995 + channel interrupts rather than manipulating the guest's shared_info 996 + structures directly. This, in turn, may allow KVM to enable features 997 + such as intercepting the SCHEDOP_poll hypercall to accelerate PV 998 + spinlock operation for the guest. Userspace may still use the ioctl 999 + to deliver events if it was advertised, even if userspace does not 1000 + send this indication that it will always do so. 991 1001 992 1002 No other flags are currently valid in the struct kvm_xen_hvm_config. 993 1003 ··· 1486 1476 [s390] 1487 1477 KVM_MP_STATE_LOAD the vcpu is in a special load/startup state 1488 1478 [s390] 1479 + KVM_MP_STATE_SUSPENDED the vcpu is in a suspend state and is waiting 1480 + for a wakeup event [arm64] 1489 1481 ========================== =============================================== 1490 1482 1491 1483 On x86, this ioctl is only useful after KVM_CREATE_IRQCHIP.
Without an 1492 1484 in-kernel irqchip, the multiprocessing state must be maintained by userspace on 1493 1485 these architectures. 1494 1486 1495 - For arm64/riscv: 1496 - ^^^^^^^^^^^^^^^^ 1487 + For arm64: 1488 + ^^^^^^^^^^ 1489 + 1490 + If a vCPU is in the KVM_MP_STATE_SUSPENDED state, KVM will emulate the 1491 + architectural execution of a WFI instruction. 1492 + 1493 + If a wakeup event is recognized, KVM will exit to userspace with a 1494 + KVM_SYSTEM_EVENT exit, where the event type is KVM_SYSTEM_EVENT_WAKEUP. If 1495 + userspace wants to honor the wakeup, it must set the vCPU's MP state to 1496 + KVM_MP_STATE_RUNNABLE. If it does not, KVM will continue to await a wakeup 1497 + event in subsequent calls to KVM_RUN. 1498 + 1499 + .. warning:: 1500 + 1501 + If userspace intends to keep the vCPU in a SUSPENDED state, it is 1502 + strongly recommended that userspace take action to suppress the 1503 + wakeup event (such as masking an interrupt). Otherwise, subsequent 1504 + calls to KVM_RUN will immediately exit with a KVM_SYSTEM_EVENT_WAKEUP 1505 + event and inadvertently waste CPU cycles. 1506 + 1507 + Additionally, if userspace takes action to suppress a wakeup event, 1508 + it is strongly recommended that it also restores the vCPU to its 1509 + original state when the vCPU is made RUNNABLE again. For example, 1510 + if userspace masked a pending interrupt to suppress the wakeup, 1511 + the interrupt should be unmasked before returning control to the 1512 + guest. 1513 + 1514 + For riscv: 1515 + ^^^^^^^^^^ 1497 1516 1498 1517 The only states that are valid are KVM_MP_STATE_STOPPED and 1499 1518 KVM_MP_STATE_RUNNABLE which reflect if the vcpu is paused or not. 
··· 1926 1887 4.55 KVM_SET_TSC_KHZ 1927 1888 -------------------- 1928 1889 1929 - :Capability: KVM_CAP_TSC_CONTROL 1890 + :Capability: KVM_CAP_TSC_CONTROL / KVM_CAP_VM_TSC_CONTROL 1930 1891 :Architectures: x86 1931 - :Type: vcpu ioctl 1892 + :Type: vcpu ioctl / vm ioctl 1932 1893 :Parameters: virtual tsc_khz 1933 1894 :Returns: 0 on success, -1 on error 1934 1895 1935 1896 Specifies the tsc frequency for the virtual machine. The unit of the 1936 1897 frequency is KHz. 1937 1898 1899 + If the KVM_CAP_VM_TSC_CONTROL capability is advertised, this can also 1900 + be used as a vm ioctl to set the initial tsc frequency of subsequently 1901 + created vCPUs. 1938 1902 1939 1903 4.56 KVM_GET_TSC_KHZ 1940 1904 -------------------- 1941 1905 1942 - :Capability: KVM_CAP_GET_TSC_KHZ 1906 + :Capability: KVM_CAP_GET_TSC_KHZ / KVM_CAP_VM_TSC_CONTROL 1943 1907 :Architectures: x86 1944 - :Type: vcpu ioctl 1908 + :Type: vcpu ioctl / vm ioctl 1945 1909 :Parameters: none 1946 1910 :Returns: virtual tsc-khz on success, negative value on error 1947 1911 ··· 2642 2600 2643 2601 After the vcpu's SVE configuration is finalized, further attempts to 2644 2602 write this register will fail with EPERM. 2603 + 2604 + arm64 bitmap feature firmware pseudo-registers have the following bit pattern:: 2605 + 2606 + 0x6030 0000 0016 <regno:16> 2607 + 2608 + The bitmap feature firmware registers expose the hypercall services that 2609 + are available for userspace to configure. The set bits correspond to the 2610 + services that are available for the guests to access. By default, KVM 2611 + sets all the supported bits during VM initialization. Userspace can 2612 + discover the available services via KVM_GET_ONE_REG, and write back the 2613 + bitmap corresponding to the features that it wishes guests to see via 2614 + KVM_SET_ONE_REG. 2615 + 2616 + Note: These registers are immutable once any of the vCPUs of the VM has 2617 + run at least once.
A KVM_SET_ONE_REG in such a scenario will return 2618 + a -EBUSY to userspace. 2619 + 2620 + (See Documentation/virt/kvm/arm/hypercalls.rst for more details.) 2645 2621 2646 2622 2647 2623 MIPS registers are mapped using the lower 32 bits. The upper 16 of that is ··· 3814 3754 error number indicating the type of exception. This exception is also 3815 3755 raised directly at the corresponding VCPU if the flag 3816 3756 KVM_S390_MEMOP_F_INJECT_EXCEPTION is set. 3757 + On protection exceptions, unless specified otherwise, the injected 3758 + translation-exception identifier (TEID) indicates suppression. 3817 3759 3818 3760 If the KVM_S390_MEMOP_F_SKEY_PROTECTION flag is set, storage key 3819 3761 protection is also in effect and may cause exceptions if accesses are 3820 3762 prohibited given the access key designated by "key"; the valid range is 0..15. 3821 3763 KVM_S390_MEMOP_F_SKEY_PROTECTION is available if KVM_CAP_S390_MEM_OP_EXTENSION 3822 3764 is > 0. 3765 + Since the accessed memory may span multiple pages and those pages might have 3766 + different storage keys, it is possible that a protection exception occurs 3767 + after memory has been modified. In this case, if the exception is injected, 3768 + the TEID does not indicate suppression. 
3823 3769 3824 3770 Absolute read/write: 3825 3771 ^^^^^^^^^^^^^^^^^^^^ ··· 5282 5216 struct { 5283 5217 __u64 gfn; 5284 5218 } shared_info; 5285 - __u64 pad[4]; 5219 + struct { 5220 + __u32 send_port; 5221 + __u32 type; /* EVTCHNSTAT_ipi / EVTCHNSTAT_interdomain */ 5222 + __u32 flags; 5223 + union { 5224 + struct { 5225 + __u32 port; 5226 + __u32 vcpu; 5227 + __u32 priority; 5228 + } port; 5229 + struct { 5230 + __u32 port; /* Zero for eventfd */ 5231 + __s32 fd; 5232 + } eventfd; 5233 + __u32 padding[4]; 5234 + } deliver; 5235 + } evtchn; 5236 + __u32 xen_version; 5237 + __u64 pad[8]; 5286 5238 } u; 5287 5239 }; 5288 5240 ··· 5331 5247 5332 5248 KVM_XEN_ATTR_TYPE_UPCALL_VECTOR 5333 5249 Sets the exception vector used to deliver Xen event channel upcalls. 5250 + This is the HVM-wide vector injected directly by the hypervisor 5251 + (not through the local APIC), typically configured by a guest via 5252 + HVM_PARAM_CALLBACK_IRQ. 5253 + 5254 + KVM_XEN_ATTR_TYPE_EVTCHN 5255 + This attribute is available when the KVM_CAP_XEN_HVM ioctl indicates 5256 + support for KVM_XEN_HVM_CONFIG_EVTCHN_SEND features. It configures 5257 + an outbound port number for interception of EVTCHNOP_send requests 5258 + from the guest. A given sending port number may be directed back 5259 + to a specified vCPU (by APIC ID) / port / priority on the guest, 5260 + or to trigger events on an eventfd. The vCPU and priority can be 5261 + changed by setting KVM_XEN_EVTCHN_UPDATE in a subsequent call, 5262 + but other fields cannot change for a given sending port. A port 5263 + mapping is removed by using KVM_XEN_EVTCHN_DEASSIGN in the flags 5264 + field. 5265 + 5266 + KVM_XEN_ATTR_TYPE_XEN_VERSION 5267 + This attribute is available when the KVM_CAP_XEN_HVM ioctl indicates 5268 + support for KVM_XEN_HVM_CONFIG_EVTCHN_SEND features. It configures 5269 + the 32-bit version code returned to the guest when it invokes the 5270 + XENVER_version call; typically (XEN_MAJOR << 16 | XEN_MINOR). 
PV 5271 + Xen guests will often use this as a dummy hypercall to trigger 5272 + event channel delivery, so responding within the kernel without 5273 + exiting to userspace is beneficial. 5334 5274 5335 5275 4.127 KVM_XEN_HVM_GET_ATTR 5336 5276 -------------------------- ··· 5366 5258 :Returns: 0 on success, < 0 on error 5367 5259 5368 5260 Allows Xen VM attributes to be read. For the structure and types, 5369 - see KVM_XEN_HVM_SET_ATTR above. 5261 + see KVM_XEN_HVM_SET_ATTR above. The KVM_XEN_ATTR_TYPE_EVTCHN 5262 + attribute cannot be read. 5370 5263 5371 5264 4.128 KVM_XEN_VCPU_SET_ATTR 5372 5265 --------------------------- ··· 5394 5285 __u64 time_blocked; 5395 5286 __u64 time_offline; 5396 5287 } runstate; 5288 + __u32 vcpu_id; 5289 + struct { 5290 + __u32 port; 5291 + __u32 priority; 5292 + __u64 expires_ns; 5293 + } timer; 5294 + __u8 vector; 5397 5295 } u; 5398 5296 }; 5399 5297 ··· 5441 5325 runstate value (RUNSTATE_running, RUNSTATE_runnable, RUNSTATE_blocked 5442 5326 or RUNSTATE_offline) to set the current accounted state as of the 5443 5327 adjusted state_entry_time. 5328 + 5329 + KVM_XEN_VCPU_ATTR_TYPE_VCPU_ID 5330 + This attribute is available when the KVM_CAP_XEN_HVM ioctl indicates 5331 + support for KVM_XEN_HVM_CONFIG_EVTCHN_SEND features. It sets the Xen 5332 + vCPU ID of the given vCPU, to allow timer-related VCPU operations to 5333 + be intercepted by KVM. 5334 + 5335 + KVM_XEN_VCPU_ATTR_TYPE_TIMER 5336 + This attribute is available when the KVM_CAP_XEN_HVM ioctl indicates 5337 + support for KVM_XEN_HVM_CONFIG_EVTCHN_SEND features. It sets the 5338 + event channel port/priority for the VIRQ_TIMER of the vCPU, as well 5339 + as allowing a pending timer to be saved/restored. 5340 + 5341 + KVM_XEN_VCPU_ATTR_TYPE_UPCALL_VECTOR 5342 + This attribute is available when the KVM_CAP_XEN_HVM ioctl indicates 5343 + support for KVM_XEN_HVM_CONFIG_EVTCHN_SEND features.
It sets the 5344 + per-vCPU local APIC upcall vector, configured by a Xen guest with 5345 + the HVMOP_set_evtchn_upcall_vector hypercall. This is typically 5346 + used by Windows guests, and is distinct from the HVM-wide upcall 5347 + vector configured with HVM_PARAM_CALLBACK_IRQ. 5348 + 5444 5349 5445 5350 4.129 KVM_XEN_VCPU_GET_ATTR 5446 5351 --------------------------- ··· 5782 5645 The offsets of the state save areas in struct kvm_xsave follow the contents 5783 5646 of CPUID leaf 0xD on the host. 5784 5647 5648 + 4.135 KVM_XEN_HVM_EVTCHN_SEND 5649 + ----------------------------- 5650 + 5651 + :Capability: KVM_CAP_XEN_HVM / KVM_XEN_HVM_CONFIG_EVTCHN_SEND 5652 + :Architectures: x86 5653 + :Type: vm ioctl 5654 + :Parameters: struct kvm_irq_routing_xen_evtchn 5655 + :Returns: 0 on success, < 0 on error 5656 + 5657 + 5658 + :: 5659 + 5660 + struct kvm_irq_routing_xen_evtchn { 5661 + __u32 port; 5662 + __u32 vcpu; 5663 + __u32 priority; 5664 + }; 5665 + 5666 + This ioctl injects an event channel interrupt directly to the guest vCPU. 5785 5667 5786 5668 5. The kvm_run structure 5787 5669 ======================== ··· 6143 5987 #define KVM_SYSTEM_EVENT_SHUTDOWN 1 6144 5988 #define KVM_SYSTEM_EVENT_RESET 2 6145 5989 #define KVM_SYSTEM_EVENT_CRASH 3 5990 + #define KVM_SYSTEM_EVENT_WAKEUP 4 5991 + #define KVM_SYSTEM_EVENT_SUSPEND 5 5992 + #define KVM_SYSTEM_EVENT_SEV_TERM 6 6146 5993 __u32 type; 6147 5994 __u32 ndata; 6148 5995 __u64 data[16]; ··· 6170 6011 has requested a crash condition maintenance. Userspace can choose 6171 6012 to ignore the request, or to gather VM memory core dump and/or 6172 6013 reset/shutdown of the VM. 6014 + - KVM_SYSTEM_EVENT_SEV_TERM -- an AMD SEV guest requested termination. 6015 + The guest physical address of the guest's GHCB is stored in `data[0]`. 6016 + - KVM_SYSTEM_EVENT_WAKEUP -- the exiting vCPU is in a suspended state and 6017 + KVM has recognized a wakeup event. 
Userspace may honor this event by 6018 + marking the exiting vCPU as runnable, or deny it and call KVM_RUN again. 6019 + - KVM_SYSTEM_EVENT_SUSPEND -- the guest has requested a suspension of 6020 + the VM. 6173 6021 6174 6022 If KVM_CAP_SYSTEM_EVENT_DATA is present, the 'data' field can contain 6175 6023 architecture specific information for the system-level event. Only ··· 6192 6026 Previous versions of Linux defined a `flags` member in this struct. The 6193 6027 field is now aliased to `data[0]`. Userspace can assume that it is only 6194 6028 written if ndata is greater than 0. 6029 + 6030 + For arm/arm64: 6031 + -------------- 6032 + 6033 + KVM_SYSTEM_EVENT_SUSPEND exits are enabled with the 6034 + KVM_CAP_ARM_SYSTEM_SUSPEND VM capability. If a guest invokes the PSCI 6035 + SYSTEM_SUSPEND function, KVM will exit to userspace with this event 6036 + type. 6037 + 6038 + It is the sole responsibility of userspace to implement the PSCI 6039 + SYSTEM_SUSPEND call according to ARM DEN0022D.b 5.19 "SYSTEM_SUSPEND". 6040 + KVM does not change the vCPU's state before exiting to userspace, so 6041 + the call parameters are left in-place in the vCPU registers. 6042 + 6043 + Userspace is _required_ to take action for such an exit. It must 6044 + either: 6045 + 6046 + - Honor the guest request to suspend the VM. Userspace can request 6047 + in-kernel emulation of suspension by setting the calling vCPU's 6048 + state to KVM_MP_STATE_SUSPENDED. Userspace must configure the vCPU's 6049 + state according to the parameters passed to the PSCI function when 6050 + the calling vCPU is resumed. See ARM DEN0022D.b 5.19.1 "Intended use" 6051 + for details on the function parameters. 6052 + 6053 + - Deny the guest request to suspend the VM. See ARM DEN0022D.b 5.19.2 6054 + "Caller responsibilities" for possible return values. 
6195 6055 6196 6056 :: 6197 6057 ··· 7339 7147 Additionally, when this quirk is disabled, 7340 7148 KVM clears CPUID.01H:ECX[bit 3] if 7341 7149 IA32_MISC_ENABLE[bit 18] is cleared. 7150 + 7151 + KVM_X86_QUIRK_FIX_HYPERCALL_INSN By default, KVM rewrites guest 7152 + VMMCALL/VMCALL instructions to match the 7153 + vendor's hypercall instruction for the 7154 + system. When this quirk is disabled, KVM 7155 + will no longer rewrite invalid guest 7156 + hypercall instructions. Executing the 7157 + incorrect hypercall instruction will 7158 + generate a #UD within the guest. 7342 7159 =================================== ============================================ 7343 7160 7344 7161 8. Other capabilities. ··· 7825 7624 #define KVM_XEN_HVM_CONFIG_HYPERCALL_MSR (1 << 0) 7826 7625 #define KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL (1 << 1) 7827 7626 #define KVM_XEN_HVM_CONFIG_SHARED_INFO (1 << 2) 7828 - #define KVM_XEN_HVM_CONFIG_RUNSTATE (1 << 2) 7829 - #define KVM_XEN_HVM_CONFIG_EVTCHN_2LEVEL (1 << 3) 7627 + #define KVM_XEN_HVM_CONFIG_RUNSTATE (1 << 3) 7628 + #define KVM_XEN_HVM_CONFIG_EVTCHN_2LEVEL (1 << 4) 7629 + #define KVM_XEN_HVM_CONFIG_EVTCHN_SEND (1 << 5) 7830 7630 7831 7631 The KVM_XEN_HVM_CONFIG_HYPERCALL_MSR flag indicates that the KVM_XEN_HVM_CONFIG 7832 7632 ioctl is available, for the guest to set its hypercall page. ··· 7850 7648 The KVM_XEN_HVM_CONFIG_EVTCHN_2LEVEL flag indicates that IRQ routing entries 7851 7649 of the type KVM_IRQ_ROUTING_XEN_EVTCHN are supported, with the priority 7852 7650 field set to indicate 2 level event channel delivery. 7651 + 7652 + The KVM_XEN_HVM_CONFIG_EVTCHN_SEND flag indicates that KVM supports 7653 + injecting event channel events directly into the guest with the 7654 + KVM_XEN_HVM_EVTCHN_SEND ioctl. It also indicates support for the 7655 + KVM_XEN_ATTR_TYPE_EVTCHN/XEN_VERSION HVM attributes and the 7656 + KVM_XEN_VCPU_ATTR_TYPE_VCPU_ID/TIMER/UPCALL_VECTOR vCPU attributes. 
7657 + These are related to event channel delivery, timers, and the XENVER_version 7658 + interception. 7853 7659 7854 7660 8.31 KVM_CAP_PPC_MULTITCE 7855 7661 ------------------------- ··· 7945 7735 At this time, KVM_PMU_CAP_DISABLE is the only capability. Setting 7946 7736 this capability will disable PMU virtualization for that VM. Usermode 7947 7737 should adjust CPUID leaf 0xA to reflect that the PMU is disabled. 7738 + 7739 + 8.36 KVM_CAP_ARM_SYSTEM_SUSPEND 7740 + ------------------------------- 7741 + 7742 + :Capability: KVM_CAP_ARM_SYSTEM_SUSPEND 7743 + :Architectures: arm64 7744 + :Type: vm 7745 + 7746 + When enabled, KVM will exit to userspace with KVM_EXIT_SYSTEM_EVENT of 7747 + type KVM_SYSTEM_EVENT_SUSPEND to process the guest suspend request. 7948 7748 7949 7749 9. Known KVM API problems 7950 7750 =========================
+138
Documentation/virt/kvm/arm/hypercalls.rst
··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + ======================= 4 + ARM Hypercall Interface 5 + ======================= 6 + 7 + KVM handles the hypercall services as requested by the guests. New hypercall 8 + services are regularly made available by the ARM specification or by KVM (as 9 + vendor services) if they make sense from a virtualization point of view. 10 + 11 + This means that a guest booted on two different versions of KVM can observe 12 + two different "firmware" revisions. This could cause issues if a given guest 13 + is tied to a particular version of a hypercall service, or if a migration 14 + causes a different version to be exposed out of the blue to an unsuspecting 15 + guest. 16 + 17 + In order to remedy this situation, KVM exposes a set of "firmware 18 + pseudo-registers" that can be manipulated using the GET/SET_ONE_REG 19 + interface. These registers can be saved/restored by userspace, and set 20 + to a convenient value as required. 21 + 22 + The following registers are defined: 23 + 24 + * KVM_REG_ARM_PSCI_VERSION: 25 + 26 + KVM implements the PSCI (Power State Coordination Interface) 27 + specification in order to provide services such as CPU on/off, reset 28 + and power-off to the guest. 29 + 30 + - Only valid if the vcpu has the KVM_ARM_VCPU_PSCI_0_2 feature set 31 + (and thus has already been initialized) 32 + - Returns the current PSCI version on GET_ONE_REG (defaulting to the 33 + highest PSCI version implemented by KVM and compatible with v0.2) 34 + - Allows any PSCI version implemented by KVM and compatible with 35 + v0.2 to be set with SET_ONE_REG 36 + - Affects the whole VM (even if the register view is per-vcpu) 37 + 38 + * KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1: 39 + Holds the state of the firmware support to mitigate CVE-2017-5715, as 40 + offered by KVM to the guest via a HVC call. The workaround is described 41 + under SMCCC_ARCH_WORKAROUND_1 in [1]. 
42 + 43 + Accepted values are: 44 + 45 + KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_NOT_AVAIL: 46 + KVM does not offer 47 + firmware support for the workaround. The mitigation status for the 48 + guest is unknown. 49 + KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_AVAIL: 50 + The workaround HVC call is 51 + available to the guest and required for the mitigation. 52 + KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_NOT_REQUIRED: 53 + The workaround HVC call 54 + is available to the guest, but it is not needed on this VCPU. 55 + 56 + * KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2: 57 + Holds the state of the firmware support to mitigate CVE-2018-3639, as 58 + offered by KVM to the guest via a HVC call. The workaround is described 59 + under SMCCC_ARCH_WORKAROUND_2 in [1]_. 60 + 61 + Accepted values are: 62 + 63 + KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_NOT_AVAIL: 64 + A workaround is not 65 + available. KVM does not offer firmware support for the workaround. 66 + KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_UNKNOWN: 67 + The workaround state is 68 + unknown. KVM does not offer firmware support for the workaround. 69 + KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_AVAIL: 70 + The workaround is available, 71 + and can be disabled by a vCPU. If 72 + KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_ENABLED is set, it is active for 73 + this vCPU. 74 + KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_NOT_REQUIRED: 75 + The workaround is always active on this vCPU or it is not needed. 76 + 77 + 78 + Bitmap Feature Firmware Registers 79 + --------------------------------- 80 + 81 + Unlike the above registers, the following registers expose the 82 + hypercall services in the form of a feature bitmap to userspace. This 83 + bitmap is translated to the services that are available to the guest. 84 + There is a register defined per service call owner, which can be accessed 85 + via the GET/SET_ONE_REG interface. 86 + 87 + By default, these registers are set with the upper limit of the features 88 + that are supported.
This way userspace can discover all the usable 89 + hypercall services via GET_ONE_REG. Userspace can then write the 90 + desired bitmap back via SET_ONE_REG. The features for the registers that 91 + are untouched, probably because userspace isn't aware of them, will be 92 + exposed as is to the guest. 93 + 94 + Note that KVM will not allow userspace to configure these registers 95 + once any of the vCPUs has run at least once. Instead, it will 96 + return -EBUSY. 97 + 98 + The pseudo-firmware bitmap registers are as follows: 99 + 100 + * KVM_REG_ARM_STD_BMAP: 101 + Controls the bitmap of the ARM Standard Secure Service Calls. 102 + 103 + The following bits are accepted: 104 + 105 + Bit-0: KVM_REG_ARM_STD_BIT_TRNG_V1_0: 106 + The bit represents the services offered under v1.0 of ARM True Random 107 + Number Generator (TRNG) specification, ARM DEN0098. 108 + 109 + * KVM_REG_ARM_STD_HYP_BMAP: 110 + Controls the bitmap of the ARM Standard Hypervisor Service Calls. 111 + 112 + The following bits are accepted: 113 + 114 + Bit-0: KVM_REG_ARM_STD_HYP_BIT_PV_TIME: 115 + The bit represents the Paravirtualized Time service as represented by 116 + ARM DEN0057A. 117 + 118 + * KVM_REG_ARM_VENDOR_HYP_BMAP: 119 + Controls the bitmap of the Vendor specific Hypervisor Service Calls. 120 + 121 + The following bits are accepted: 122 + 123 + Bit-0: KVM_REG_ARM_VENDOR_HYP_BIT_FUNC_FEAT: 124 + The bit represents the ARM_SMCCC_VENDOR_HYP_KVM_FEATURES_FUNC_ID 125 + and ARM_SMCCC_VENDOR_HYP_CALL_UID_FUNC_ID function-ids. 126 + 127 + Bit-1: KVM_REG_ARM_VENDOR_HYP_BIT_PTP: 128 + The bit represents the Precision Time Protocol KVM service. 129 + 130 + Errors: 131 + 132 + ======= ============================================================= 133 + -ENOENT Unknown register accessed. 134 + -EBUSY Attempt a 'write' to the register after the VM has started. 135 + -EINVAL Invalid bitmap written to the register.
136 + ======= ============================================================= 137 + 138 + .. [1] https://developer.arm.com/-/media/developer/pdf/ARM_DEN_0070A_Firmware_interfaces_for_mitigating_CVE-2017-5715.pdf
+1 -1
Documentation/virt/kvm/arm/index.rst
··· 8 8 :maxdepth: 2 9 9 10 10 hyp-abi 11 - psci 11 + hypercalls 12 12 pvtime 13 13 ptp_kvm
-77
Documentation/virt/kvm/arm/psci.rst
··· 1 - .. SPDX-License-Identifier: GPL-2.0 2 - 3 - ========================================= 4 - Power State Coordination Interface (PSCI) 5 - ========================================= 6 - 7 - KVM implements the PSCI (Power State Coordination Interface) 8 - specification in order to provide services such as CPU on/off, reset 9 - and power-off to the guest. 10 - 11 - The PSCI specification is regularly updated to provide new features, 12 - and KVM implements these updates if they make sense from a virtualization 13 - point of view. 14 - 15 - This means that a guest booted on two different versions of KVM can 16 - observe two different "firmware" revisions. This could cause issues if 17 - a given guest is tied to a particular PSCI revision (unlikely), or if 18 - a migration causes a different PSCI version to be exposed out of the 19 - blue to an unsuspecting guest. 20 - 21 - In order to remedy this situation, KVM exposes a set of "firmware 22 - pseudo-registers" that can be manipulated using the GET/SET_ONE_REG 23 - interface. These registers can be saved/restored by userspace, and set 24 - to a convenient value if required. 25 - 26 - The following register is defined: 27 - 28 - * KVM_REG_ARM_PSCI_VERSION: 29 - 30 - - Only valid if the vcpu has the KVM_ARM_VCPU_PSCI_0_2 feature set 31 - (and thus has already been initialized) 32 - - Returns the current PSCI version on GET_ONE_REG (defaulting to the 33 - highest PSCI version implemented by KVM and compatible with v0.2) 34 - - Allows any PSCI version implemented by KVM and compatible with 35 - v0.2 to be set with SET_ONE_REG 36 - - Affects the whole VM (even if the register view is per-vcpu) 37 - 38 - * KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1: 39 - Holds the state of the firmware support to mitigate CVE-2017-5715, as 40 - offered by KVM to the guest via a HVC call. The workaround is described 41 - under SMCCC_ARCH_WORKAROUND_1 in [1]. 
42 - 43 - Accepted values are: 44 - 45 - KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_NOT_AVAIL: 46 - KVM does not offer 47 - firmware support for the workaround. The mitigation status for the 48 - guest is unknown. 49 - KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_AVAIL: 50 - The workaround HVC call is 51 - available to the guest and required for the mitigation. 52 - KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_NOT_REQUIRED: 53 - The workaround HVC call 54 - is available to the guest, but it is not needed on this VCPU. 55 - 56 - * KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2: 57 - Holds the state of the firmware support to mitigate CVE-2018-3639, as 58 - offered by KVM to the guest via a HVC call. The workaround is described 59 - under SMCCC_ARCH_WORKAROUND_2 in [1]_. 60 - 61 - Accepted values are: 62 - 63 - KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_NOT_AVAIL: 64 - A workaround is not 65 - available. KVM does not offer firmware support for the workaround. 66 - KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_UNKNOWN: 67 - The workaround state is 68 - unknown. KVM does not offer firmware support for the workaround. 69 - KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_AVAIL: 70 - The workaround is available, 71 - and can be disabled by a vCPU. If 72 - KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_ENABLED is set, it is active for 73 - this vCPU. 74 - KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_NOT_REQUIRED: 75 - The workaround is always active on this vCPU or it is not needed. 76 - 77 - .. [1] https://developer.arm.com/-/media/developer/pdf/ARM_DEN_0070A_Firmware_interfaces_for_mitigating_CVE-2017-5715.pdf
+4
Documentation/virt/kvm/x86/mmu.rst
··· 202 202 Is 1 if the MMU instance cannot use A/D bits. EPT did not have A/D 203 203 bits before Haswell; shadow EPT page tables also cannot use A/D bits 204 204 if the L1 hypervisor does not enable them. 205 + role.passthrough: 206 + The page is not backed by a guest page table, but its first entry 207 + points to one. This is set if NPT uses 5-level page tables (host 208 + CR4.LA57=1) and is shadowing L1's 4-level NPT (L1 CR4.LA57=1). 205 209 gfn: 206 210 Either the guest page table containing the translations shadowed by this 207 211 page, or the base page frame for linear translations. See role.direct.
+5
MAINTAINERS
··· 10830 10830 F: arch/riscv/include/asm/kvm* 10831 10831 F: arch/riscv/include/uapi/asm/kvm* 10832 10832 F: arch/riscv/kvm/ 10833 + F: tools/testing/selftests/kvm/*/riscv/ 10834 + F: tools/testing/selftests/kvm/riscv/ 10833 10835 10834 10836 KERNEL VIRTUAL MACHINE for s390 (KVM/s390) 10835 10837 M: Christian Borntraeger <borntraeger@linux.ibm.com> ··· 10846 10844 F: arch/s390/include/asm/gmap.h 10847 10845 F: arch/s390/include/asm/kvm* 10848 10846 F: arch/s390/include/uapi/asm/kvm* 10847 + F: arch/s390/include/uapi/asm/uvdevice.h 10849 10848 F: arch/s390/kernel/uv.c 10850 10849 F: arch/s390/kvm/ 10851 10850 F: arch/s390/mm/gmap.c 10851 + F: drivers/s390/char/uvdevice.c 10852 + F: tools/testing/selftests/drivers/s390x/uvdevice/ 10852 10853 F: tools/testing/selftests/kvm/*/s390x/ 10853 10854 F: tools/testing/selftests/kvm/s390x/ 10854 10855
+4
arch/arm64/include/asm/barrier.h
···
16 16
17 17 #define sev()		asm volatile("sev" : : : "memory")
18 18 #define wfe()		asm volatile("wfe" : : : "memory")
   19 + #define wfet(val)	asm volatile("msr s0_3_c1_c0_0, %0" \
   20 +				     : : "r" (val) : "memory")
19 21 #define wfi()		asm volatile("wfi" : : : "memory")
   22 + #define wfit(val)	asm volatile("msr s0_3_c1_c0_1, %0" \
   23 +				     : : "r" (val) : "memory")
20 24
21 25 #define isb()		asm volatile("isb" : : : "memory")
22 26 #define dmb(opt)	asm volatile("dmb " #opt : : : "memory")
+8
arch/arm64/include/asm/cputype.h
···
118 118
119 119 #define APPLE_CPU_PART_M1_ICESTORM	0x022
120 120 #define APPLE_CPU_PART_M1_FIRESTORM	0x023
    121 + #define APPLE_CPU_PART_M1_ICESTORM_PRO	0x024
    122 + #define APPLE_CPU_PART_M1_FIRESTORM_PRO	0x025
    123 + #define APPLE_CPU_PART_M1_ICESTORM_MAX	0x028
    124 + #define APPLE_CPU_PART_M1_FIRESTORM_MAX	0x029
121 125
122 126 #define MIDR_CORTEX_A53 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A53)
123 127 #define MIDR_CORTEX_A57 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A57)
···
168 164 #define MIDR_HISI_TSV110 MIDR_CPU_MODEL(ARM_CPU_IMP_HISI, HISI_CPU_PART_TSV110)
169 165 #define MIDR_APPLE_M1_ICESTORM MIDR_CPU_MODEL(ARM_CPU_IMP_APPLE, APPLE_CPU_PART_M1_ICESTORM)
170 166 #define MIDR_APPLE_M1_FIRESTORM MIDR_CPU_MODEL(ARM_CPU_IMP_APPLE, APPLE_CPU_PART_M1_FIRESTORM)
    167 + #define MIDR_APPLE_M1_ICESTORM_PRO MIDR_CPU_MODEL(ARM_CPU_IMP_APPLE, APPLE_CPU_PART_M1_ICESTORM_PRO)
    168 + #define MIDR_APPLE_M1_FIRESTORM_PRO MIDR_CPU_MODEL(ARM_CPU_IMP_APPLE, APPLE_CPU_PART_M1_FIRESTORM_PRO)
    169 + #define MIDR_APPLE_M1_ICESTORM_MAX MIDR_CPU_MODEL(ARM_CPU_IMP_APPLE, APPLE_CPU_PART_M1_ICESTORM_MAX)
    170 + #define MIDR_APPLE_M1_FIRESTORM_MAX MIDR_CPU_MODEL(ARM_CPU_IMP_APPLE, APPLE_CPU_PART_M1_FIRESTORM_MAX)
171 171
172 172 /* Fujitsu Erratum 010001 affects A64FX 1.0 and 1.1, (v0r0 and v1r0) */
173 173 #define MIDR_FUJITSU_ERRATUM_010001 MIDR_FUJITSU_A64FX
+6 -2
arch/arm64/include/asm/esr.h
···
135 135 #define ESR_ELx_CV		(UL(1) << 24)
136 136 #define ESR_ELx_COND_SHIFT	(20)
137 137 #define ESR_ELx_COND_MASK	(UL(0xF) << ESR_ELx_COND_SHIFT)
138     - #define ESR_ELx_WFx_ISS_TI	(UL(1) << 0)
    138 + #define ESR_ELx_WFx_ISS_RN	(UL(0x1F) << 5)
    139 + #define ESR_ELx_WFx_ISS_RV	(UL(1) << 2)
    140 + #define ESR_ELx_WFx_ISS_TI	(UL(3) << 0)
    141 + #define ESR_ELx_WFx_ISS_WFxT	(UL(2) << 0)
139 142 #define ESR_ELx_WFx_ISS_WFI	(UL(0) << 0)
140 143 #define ESR_ELx_WFx_ISS_WFE	(UL(1) << 0)
141 144 #define ESR_ELx_xVC_IMM_MASK	((UL(1) << 16) - 1)
···
151 148 #define DISR_EL1_ESR_MASK	(ESR_ELx_AET | ESR_ELx_EA | ESR_ELx_FSC)
152 149
153 150 /* ESR value templates for specific events */
154     - #define ESR_ELx_WFx_MASK	(ESR_ELx_EC_MASK | ESR_ELx_WFx_ISS_TI)
    151 + #define ESR_ELx_WFx_MASK	(ESR_ELx_EC_MASK | \
    152 +				 (ESR_ELx_WFx_ISS_TI & ~ESR_ELx_WFx_ISS_WFxT))
155 153 #define ESR_ELx_WFx_WFI_VAL	((ESR_ELx_EC_WFx << ESR_ELx_EC_SHIFT) | \
156 154				 ESR_ELx_WFx_ISS_WFI)
157 155
+1
arch/arm64/include/asm/hwcap.h
···
117 117 #define KERNEL_HWCAP_SME_B16F32	__khwcap2_feature(SME_B16F32)
118 118 #define KERNEL_HWCAP_SME_F32F32	__khwcap2_feature(SME_F32F32)
119 119 #define KERNEL_HWCAP_SME_FA64	__khwcap2_feature(SME_FA64)
    120 + #define KERNEL_HWCAP_WFXT	__khwcap2_feature(WFXT)
120 121
121 122 /*
122 123  * This yields a mask that user programs can use to figure out what
+2 -1
arch/arm64/include/asm/kvm_arm.h
···
80 80  * FMO: Override CPSR.F and enable signaling with VF
81 81  * SWIO: Turn set/way invalidates into set/way clean+invalidate
82 82  * PTW: Take a stage2 fault if a stage1 walk steps in device memory
   83 + * TID3: Trap EL1 reads of group 3 ID registers
83 84  */
84 85 #define HCR_GUEST_FLAGS (HCR_TSC | HCR_TSW | HCR_TWE | HCR_TWI | HCR_VM | \
85 86			 HCR_BSU_IS | HCR_FB | HCR_TACR | \
86 87			 HCR_AMO | HCR_SWIO | HCR_TIDCP | HCR_RW | HCR_TLOR | \
87    -			 HCR_FMO | HCR_IMO | HCR_PTW )
   88 +			 HCR_FMO | HCR_IMO | HCR_PTW | HCR_TID3 )
88 89 #define HCR_VIRT_EXCP_MASK (HCR_VSE | HCR_VI | HCR_VF)
89 90 #define HCR_HOST_NVHE_FLAGS (HCR_RW | HCR_API | HCR_APK | HCR_ATA)
90 91 #define HCR_HOST_NVHE_PROTECTED_FLAGS (HCR_HOST_NVHE_FLAGS | HCR_TSC)
+1
arch/arm64/include/asm/kvm_asm.h
···
169 169	unsigned long tcr_el2;
170 170	unsigned long tpidr_el2;
171 171	unsigned long stack_hyp_va;
    172 +	unsigned long stack_pa;
172 173	phys_addr_t pgd_pa;
173 174	unsigned long hcr_el2;
174 175	unsigned long vttbr;
-7
arch/arm64/include/asm/kvm_emulate.h
···
87 87
88 88	if (vcpu_el1_is_32bit(vcpu))
89 89		vcpu->arch.hcr_el2 &= ~HCR_RW;
90 -	else
91 -		/*
92 -		 * TID3: trap feature register accesses that we virtualise.
93 -		 * For now this is conditional, since no AArch32 feature regs
94 -		 * are currently virtualised.
95 -		 */
96 -		vcpu->arch.hcr_el2 |= HCR_TID3;
97 90
98 91	if (cpus_have_const_cap(ARM64_MISMATCHED_CACHE_TYPE) ||
99 92	    vcpu_el1_is_32bit(vcpu))
+27 -17
arch/arm64/include/asm/kvm_host.h
···
46 46 #define KVM_REQ_RECORD_STEAL	KVM_ARCH_REQ(3)
47 47 #define KVM_REQ_RELOAD_GICv4	KVM_ARCH_REQ(4)
48 48 #define KVM_REQ_RELOAD_PMU	KVM_ARCH_REQ(5)
   49 + #define KVM_REQ_SUSPEND		KVM_ARCH_REQ(6)
49 50
50 51 #define KVM_DIRTY_LOG_MANUAL_CAPS (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE | \
51 52				    KVM_DIRTY_LOG_INITIALLY_SET)
···
102 101 struct kvm_arch_memory_slot {
103 102 };
104 103
    104 + /**
    105 +  * struct kvm_smccc_features: Descriptor of the hypercall services exposed to the guests
    106 +  *
    107 +  * @std_bmap: Bitmap of standard secure service calls
    108 +  * @std_hyp_bmap: Bitmap of standard hypervisor service calls
    109 +  * @vendor_hyp_bmap: Bitmap of vendor specific hypervisor service calls
    110 +  */
    111 + struct kvm_smccc_features {
    112 +	unsigned long std_bmap;
    113 +	unsigned long std_hyp_bmap;
    114 +	unsigned long vendor_hyp_bmap;
    115 + };
    116 +
105 117 struct kvm_arch {
106 118	struct kvm_s2_mmu mmu;
107 119
108 120	/* VTCR_EL2 value for this VM */
109 121	u64 vtcr;
110     -
111     -	/* The maximum number of vCPUs depends on the used GIC model */
112     -	int max_vcpus;
113 122
114 123	/* Interrupt controller */
115 124	struct vgic_dist vgic;
···
147 136	 */
148 137 #define KVM_ARCH_FLAG_REG_WIDTH_CONFIGURED	3
149 138 #define KVM_ARCH_FLAG_EL1_32BIT			4
    139 + /* PSCI SYSTEM_SUSPEND enabled for the guest */
    140 + #define KVM_ARCH_FLAG_SYSTEM_SUSPEND_ENABLED	5
150 141
151 142	unsigned long flags;
152 143
···
163 150
164 151	u8 pfr0_csv2;
165 152	u8 pfr0_csv3;
    153 +
    154 +	/* Hypercall features firmware registers' descriptor */
    155 +	struct kvm_smccc_features smccc_feat;
166 156 };
167 157
168 158 struct kvm_vcpu_fault_info {
···
270 254	struct kvm_vcpu *__hyp_running_vcpu;
271 255 };
272 256
273     - struct kvm_pmu_events {
274     -	u32 events_host;
275     -	u32 events_guest;
276     - };
277     -
278 257 struct kvm_host_data {
279 258	struct kvm_cpu_context host_ctxt;
280     -	struct kvm_pmu_events pmu_events;
281 259 };
282 260
283 261 struct kvm_host_psci_config {
···
378 368	u32 mdscr_el1;
379 369 } guest_debug_preserved;
380 370
381     -	/* vcpu power-off state */
382     -	bool power_off;
    371 +	/* vcpu power state */
    372 +	struct kvm_mp_state mp_state;
383 373
384 374	/* Don't run the guest (internal implementation need) */
385 375	bool pause;
···
465 455 #define KVM_ARM64_FP_FOREIGN_FPSTATE	(1 << 14)
466 456 #define KVM_ARM64_ON_UNSUPPORTED_CPU	(1 << 15) /* Physical CPU not in supported_cpus */
467 457 #define KVM_ARM64_HOST_SME_ENABLED	(1 << 16) /* SME enabled for EL0 */
    458 + #define KVM_ARM64_WFIT		(1 << 17) /* WFIT instruction trapped */
468 459
469 460 #define KVM_GUESTDBG_VALID_MASK (KVM_GUESTDBG_ENABLE | \
470 461				 KVM_GUESTDBG_USE_SW_BP | \
···
698 687 int kvm_handle_cp15_32(struct kvm_vcpu *vcpu);
699 688 int kvm_handle_cp15_64(struct kvm_vcpu *vcpu);
700 689 int kvm_handle_sys_reg(struct kvm_vcpu *vcpu);
    690 + int kvm_handle_cp10_id(struct kvm_vcpu *vcpu);
701 691
702 692 void kvm_reset_sys_regs(struct kvm_vcpu *vcpu);
703 693
704     - void kvm_sys_reg_table_init(void);
    694 + int kvm_sys_reg_table_init(void);
705 695
706 696 /* MMIO helpers */
707 697 void kvm_mmio_write_buf(void *buf, unsigned int len, unsigned long data);
···
811 799 #ifdef CONFIG_KVM
812 800 void kvm_set_pmu_events(u32 set, struct perf_event_attr *attr);
813 801 void kvm_clr_pmu_events(u32 clr);
814     -
815     - void kvm_vcpu_pmu_restore_guest(struct kvm_vcpu *vcpu);
816     - void kvm_vcpu_pmu_restore_host(struct kvm_vcpu *vcpu);
817 802 #else
818 803 static inline void kvm_set_pmu_events(u32 set, struct perf_event_attr *attr) {}
819 804 static inline void kvm_clr_pmu_events(u32 clr) {}
···
842 833 #define kvm_has_mte(kvm)					\
843 834	(system_supports_mte() &&				\
844 835	 test_bit(KVM_ARCH_FLAG_MTE_ENABLED, &(kvm)->arch.flags))
845     - #define kvm_vcpu_has_pmu(vcpu)				\
846     -	(test_bit(KVM_ARM_VCPU_PMU_V3, (vcpu)->arch.features))
847 836
848 837 int kvm_trng_call(struct kvm_vcpu *vcpu);
849 838 #ifdef CONFIG_KVM
···
851 844 #else
852 845 static inline void kvm_hyp_reserve(void) { }
853 846 #endif
    847 +
    848 + void kvm_arm_vcpu_power_off(struct kvm_vcpu *vcpu);
    849 + bool kvm_arm_vcpu_stopped(struct kvm_vcpu *vcpu);
854 850
855 851 #endif /* __ARM64_KVM_HOST_H__ */
847 + 848 + void kvm_arm_vcpu_power_off(struct kvm_vcpu *vcpu); 849 + bool kvm_arm_vcpu_stopped(struct kvm_vcpu *vcpu); 854 850 855 851 #endif /* __ARM64_KVM_HOST_H__ */
+3
arch/arm64/include/asm/kvm_mmu.h
···
154 154 int kvm_share_hyp(void *from, void *to);
155 155 void kvm_unshare_hyp(void *from, void *to);
156 156 int create_hyp_mappings(void *from, void *to, enum kvm_pgtable_prot prot);
    157 + int __create_hyp_mappings(unsigned long start, unsigned long size,
    158 +			   unsigned long phys, enum kvm_pgtable_prot prot);
    159 + int hyp_alloc_private_va_range(size_t size, unsigned long *haddr);
157 160 int create_hyp_io_mappings(phys_addr_t phys_addr, size_t size,
158 161			   void __iomem **kaddr,
159 162			   void __iomem **haddr);
+1
arch/arm64/include/uapi/asm/hwcap.h
···
87 87 #define HWCAP2_SME_B16F32	(1 << 28)
88 88 #define HWCAP2_SME_F32F32	(1 << 29)
89 89 #define HWCAP2_SME_FA64		(1 << 30)
   90 + #define HWCAP2_WFXT		(1UL << 31)
90 91
91 92 #endif /* _UAPI__ASM_HWCAP_H */
+34
arch/arm64/include/uapi/asm/kvm.h
···
334 334 #define KVM_ARM64_SVE_VLS_WORDS	\
335 335	((KVM_ARM64_SVE_VQ_MAX - KVM_ARM64_SVE_VQ_MIN) / 64 + 1)
336 336
    337 + /* Bitmap feature firmware registers */
    338 + #define KVM_REG_ARM_FW_FEAT_BMAP	(0x0016 << KVM_REG_ARM_COPROC_SHIFT)
    339 + #define KVM_REG_ARM_FW_FEAT_BMAP_REG(r)	(KVM_REG_ARM64 | KVM_REG_SIZE_U64 | \
    340 +					 KVM_REG_ARM_FW_FEAT_BMAP |	\
    341 +					 ((r) & 0xffff))
    342 +
    343 + #define KVM_REG_ARM_STD_BMAP	KVM_REG_ARM_FW_FEAT_BMAP_REG(0)
    344 +
    345 + enum {
    346 +	KVM_REG_ARM_STD_BIT_TRNG_V1_0	= 0,
    347 + #ifdef __KERNEL__
    348 +	KVM_REG_ARM_STD_BMAP_BIT_COUNT,
    349 + #endif
    350 + };
    351 +
    352 + #define KVM_REG_ARM_STD_HYP_BMAP	KVM_REG_ARM_FW_FEAT_BMAP_REG(1)
    353 +
    354 + enum {
    355 +	KVM_REG_ARM_STD_HYP_BIT_PV_TIME	= 0,
    356 + #ifdef __KERNEL__
    357 +	KVM_REG_ARM_STD_HYP_BMAP_BIT_COUNT,
    358 + #endif
    359 + };
    360 +
    361 + #define KVM_REG_ARM_VENDOR_HYP_BMAP	KVM_REG_ARM_FW_FEAT_BMAP_REG(2)
    362 +
    363 + enum {
    364 +	KVM_REG_ARM_VENDOR_HYP_BIT_FUNC_FEAT	= 0,
    365 +	KVM_REG_ARM_VENDOR_HYP_BIT_PTP		= 1,
    366 + #ifdef __KERNEL__
    367 +	KVM_REG_ARM_VENDOR_HYP_BMAP_BIT_COUNT,
    368 + #endif
    369 + };
    370 +
337 371 /* Device Control API: ARM VGIC */
338 372 #define KVM_DEV_ARM_VGIC_GRP_ADDR	0
339 373 #define KVM_DEV_ARM_VGIC_GRP_DIST_REGS	1
+13
arch/arm64/kernel/cpufeature.c
···
237 237	ARM64_FTR_BITS(FTR_VISIBLE_IF_IS_ENABLED(CONFIG_ARM64_PTR_AUTH),
238 238		       FTR_STRICT, FTR_LOWER_SAFE, ID_AA64ISAR2_GPA3_SHIFT, 4, 0),
239 239	ARM64_FTR_BITS(FTR_VISIBLE, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64ISAR2_RPRES_SHIFT, 4, 0),
    240 +	ARM64_FTR_BITS(FTR_VISIBLE, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64ISAR2_WFXT_SHIFT, 4, 0),
240 241	ARM64_FTR_END,
241 242 };
242 243
···
2518 2517		.cpu_enable = fa64_kernel_enable,
2519 2518	},
2520 2519 #endif /* CONFIG_ARM64_SME */
     2520 +	{
     2521 +		.desc = "WFx with timeout",
     2522 +		.capability = ARM64_HAS_WFXT,
     2523 +		.type = ARM64_CPUCAP_SYSTEM_FEATURE,
     2524 +		.sys_reg = SYS_ID_AA64ISAR2_EL1,
     2525 +		.sign = FTR_UNSIGNED,
     2526 +		.field_pos = ID_AA64ISAR2_WFXT_SHIFT,
     2527 +		.field_width = 4,
     2528 +		.matches = has_cpuid_feature,
     2529 +		.min_field_value = ID_AA64ISAR2_WFXT_SUPPORTED,
     2530 +	},
2521 2531	{},
2522 2532 };
2523 2533
···
2662 2650	HWCAP_CAP(SYS_ID_AA64MMFR0_EL1, ID_AA64MMFR0_ECV_SHIFT, 4, FTR_UNSIGNED, 1, CAP_HWCAP, KERNEL_HWCAP_ECV),
2663 2651	HWCAP_CAP(SYS_ID_AA64MMFR1_EL1, ID_AA64MMFR1_AFP_SHIFT, 4, FTR_UNSIGNED, 1, CAP_HWCAP, KERNEL_HWCAP_AFP),
2664 2652	HWCAP_CAP(SYS_ID_AA64ISAR2_EL1, ID_AA64ISAR2_RPRES_SHIFT, 4, FTR_UNSIGNED, 1, CAP_HWCAP, KERNEL_HWCAP_RPRES),
     2653 +	HWCAP_CAP(SYS_ID_AA64ISAR2_EL1, ID_AA64ISAR2_WFXT_SHIFT, 4, FTR_UNSIGNED, ID_AA64ISAR2_WFXT_SUPPORTED, CAP_HWCAP, KERNEL_HWCAP_WFXT),
2665 2654 #ifdef CONFIG_ARM64_SME
2666 2655	HWCAP_CAP(SYS_ID_AA64PFR1_EL1, ID_AA64PFR1_SME_SHIFT, 4, FTR_UNSIGNED, ID_AA64PFR1_SME, CAP_HWCAP, KERNEL_HWCAP_SME),
2667 2656	HWCAP_CAP(SYS_ID_AA64SMFR0_EL1, ID_AA64SMFR0_FA64_SHIFT, 1, FTR_UNSIGNED, ID_AA64SMFR0_FA64, CAP_HWCAP, KERNEL_HWCAP_SME_FA64),
+1
arch/arm64/kernel/cpuinfo.c
···
106 106	[KERNEL_HWCAP_SME_B16F32]	= "smeb16f32",
107 107	[KERNEL_HWCAP_SME_F32F32]	= "smef32f32",
108 108	[KERNEL_HWCAP_SME_FA64]		= "smefa64",
    109 +	[KERNEL_HWCAP_WFXT]		= "wfxt",
109 110 };
110 111
111 112 #ifdef CONFIG_COMPAT
+2 -2
arch/arm64/kvm/Makefile
···
13 13 kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \
14 14	 inject_fault.o va_layout.o handle_exit.o \
15 15	 guest.o debug.o reset.o sys_regs.o \
16    -	 vgic-sys-reg-v3.o fpsimd.o pmu.o pkvm.o \
   16 +	 vgic-sys-reg-v3.o fpsimd.o pkvm.o \
17 17	 arch_timer.o trng.o vmid.o \
18 18	 vgic/vgic.o vgic/vgic-init.o \
19 19	 vgic/vgic-irqfd.o vgic/vgic-v2.o \
···
22 22	 vgic/vgic-mmio-v3.o vgic/vgic-kvm-device.o \
23 23	 vgic/vgic-its.o vgic/vgic-debug.o
24 24
25    - kvm-$(CONFIG_HW_PERF_EVENTS) += pmu-emul.o
   25 + kvm-$(CONFIG_HW_PERF_EVENTS) += pmu-emul.o pmu.o
26 26
27 27 always-y := hyp_constants.h hyp-constants.s
28 28
+31 -16
arch/arm64/kvm/arch_timer.c
···
208 208		return IRQ_HANDLED;
209 209 }
210 210
211     - static u64 kvm_timer_compute_delta(struct arch_timer_context *timer_ctx)
    211 + static u64 kvm_counter_compute_delta(struct arch_timer_context *timer_ctx,
    212 +				      u64 val)
212 213 {
213     -	u64 cval, now;
    214 +	u64 now = kvm_phys_timer_read() - timer_get_offset(timer_ctx);
214 215
215     -	cval = timer_get_cval(timer_ctx);
216     -	now = kvm_phys_timer_read() - timer_get_offset(timer_ctx);
217     -
218     -	if (now < cval) {
    216 +	if (now < val) {
219 217		u64 ns;
220 218
221 219		ns = cyclecounter_cyc2ns(timecounter->cc,
222     -					 cval - now,
    220 +					 val - now,
223 221					 timecounter->mask,
224 222					 &timecounter->frac);
225 223		return ns;
···
226 228		return 0;
227 229 }
228 230
    231 + static u64 kvm_timer_compute_delta(struct arch_timer_context *timer_ctx)
    232 + {
    233 +	return kvm_counter_compute_delta(timer_ctx, timer_get_cval(timer_ctx));
    234 + }
    235 +
229 236 static bool kvm_timer_irq_can_fire(struct arch_timer_context *timer_ctx)
230 237 {
231 238	WARN_ON(timer_ctx && timer_ctx->loaded);
232 239	return timer_ctx &&
233 240	       ((timer_get_ctl(timer_ctx) &
234 241		(ARCH_TIMER_CTRL_IT_MASK | ARCH_TIMER_CTRL_ENABLE)) == ARCH_TIMER_CTRL_ENABLE);
    242 + }
    243 +
    244 + static bool vcpu_has_wfit_active(struct kvm_vcpu *vcpu)
    245 + {
    246 +	return (cpus_have_final_cap(ARM64_HAS_WFXT) &&
    247 +		(vcpu->arch.flags & KVM_ARM64_WFIT));
    248 + }
    249 +
    250 + static u64 wfit_delay_ns(struct kvm_vcpu *vcpu)
    251 + {
    252 +	struct arch_timer_context *ctx = vcpu_vtimer(vcpu);
    253 +	u64 val = vcpu_get_reg(vcpu, kvm_vcpu_sys_get_rt(vcpu));
    254 +
    255 +	return kvm_counter_compute_delta(ctx, val);
235 256 }
236 257
237 258 /*
···
269 252	if (kvm_timer_irq_can_fire(ctx))
270 253		min_delta = min(min_delta, kvm_timer_compute_delta(ctx));
271 254	}
    255 +
    256 +	if (vcpu_has_wfit_active(vcpu))
    257 +		min_delta = min(min_delta, wfit_delay_ns(vcpu));
272 258
273 259	/* If none of timers can fire, then return 0 */
274 260	if (min_delta == ULLONG_MAX)
···
370 350	return cval <= now;
371 351 }
372 352
373     - bool kvm_timer_is_pending(struct kvm_vcpu *vcpu)
    353 + int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu)
374 354 {
375     -	struct timer_map map;
376     -
377     -	get_timer_map(vcpu, &map);
378     -
379     -	return kvm_timer_should_fire(map.direct_vtimer) ||
380     -	       kvm_timer_should_fire(map.direct_ptimer) ||
381     -	       kvm_timer_should_fire(map.emul_ptimer);
    355 +	return vcpu_has_wfit_active(vcpu) && wfit_delay_ns(vcpu) == 0;
382 356 }
383 357
384 358 /*
···
498 484	 */
499 485	if (!kvm_timer_irq_can_fire(map.direct_vtimer) &&
500 486	    !kvm_timer_irq_can_fire(map.direct_ptimer) &&
501     -	    !kvm_timer_irq_can_fire(map.emul_ptimer))
    487 +	    !kvm_timer_irq_can_fire(map.emul_ptimer) &&
    488 +	    !vcpu_has_wfit_active(vcpu))
502 489		return;
503 490
504 491	/*
+134 -30
arch/arm64/kvm/arm.c
···
97 97	}
98 98	mutex_unlock(&kvm->lock);
99 99	break;
    100 + case KVM_CAP_ARM_SYSTEM_SUSPEND:
    101 +	r = 0;
    102 +	set_bit(KVM_ARCH_FLAG_SYSTEM_SUSPEND_ENABLED, &kvm->arch.flags);
    103 +	break;
100 104 default:
101 105	r = -EINVAL;
102 106	break;
···
157 153	kvm_vgic_early_init(kvm);
158 154
159 155	/* The maximum number of VCPUs is limited by the host's GIC model */
160     -	kvm->arch.max_vcpus = kvm_arm_default_max_vcpus();
    156 +	kvm->max_vcpus = kvm_arm_default_max_vcpus();
161 157
162 158	set_default_spectre(kvm);
    159 +	kvm_arm_init_hypercalls(kvm);
163 160
164 161	return ret;
165 162 out_free_stage2_pgd:
···
215 210	case KVM_CAP_SET_GUEST_DEBUG:
216 211	case KVM_CAP_VCPU_ATTRIBUTES:
217 212	case KVM_CAP_PTP_KVM:
    213 +	case KVM_CAP_ARM_SYSTEM_SUSPEND:
218 214		r = 1;
219 215		break;
220 216	case KVM_CAP_SET_GUEST_DEBUG2:
···
236 230	case KVM_CAP_MAX_VCPUS:
237 231	case KVM_CAP_MAX_VCPU_ID:
238 232		if (kvm)
239     -			r = kvm->arch.max_vcpus;
    233 +			r = kvm->max_vcpus;
240 234		else
241 235			r = kvm_arm_default_max_vcpus();
242 236		break;
···
312 306	if (irqchip_in_kernel(kvm) && vgic_initialized(kvm))
313 307		return -EBUSY;
314 308
315     -	if (id >= kvm->arch.max_vcpus)
    309 +	if (id >= kvm->max_vcpus)
316 310		return -EINVAL;
317 311
318 312	return 0;
···
360 354	kvm_pmu_vcpu_destroy(vcpu);
361 355
362 356	kvm_arm_vcpu_destroy(vcpu);
363     - }
364     -
365     - int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu)
366     - {
367     -	return kvm_timer_is_pending(vcpu);
368 357 }
369 358
370 359 void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu)
···
433 432	vcpu->cpu = -1;
434 433 }
435 434
436     - static void vcpu_power_off(struct kvm_vcpu *vcpu)
    435 + void kvm_arm_vcpu_power_off(struct kvm_vcpu *vcpu)
437 436 {
438     -	vcpu->arch.power_off = true;
    437 +	vcpu->arch.mp_state.mp_state = KVM_MP_STATE_STOPPED;
439 438	kvm_make_request(KVM_REQ_SLEEP, vcpu);
440 439	kvm_vcpu_kick(vcpu);
    440 + }
    441 +
    442 + bool kvm_arm_vcpu_stopped(struct kvm_vcpu *vcpu)
    443 + {
    444 +	return vcpu->arch.mp_state.mp_state == KVM_MP_STATE_STOPPED;
    445 + }
    446 +
    447 + static void kvm_arm_vcpu_suspend(struct kvm_vcpu *vcpu)
    448 + {
    449 +	vcpu->arch.mp_state.mp_state = KVM_MP_STATE_SUSPENDED;
    450 +	kvm_make_request(KVM_REQ_SUSPEND, vcpu);
    451 +	kvm_vcpu_kick(vcpu);
    452 + }
    453 +
    454 + static bool kvm_arm_vcpu_suspended(struct kvm_vcpu *vcpu)
    455 + {
    456 +	return vcpu->arch.mp_state.mp_state == KVM_MP_STATE_SUSPENDED;
441 457 }
442 458
443 459 int kvm_arch_vcpu_ioctl_get_mpstate(struct kvm_vcpu *vcpu,
444 460				    struct kvm_mp_state *mp_state)
445 461 {
446     -	if (vcpu->arch.power_off)
447     -		mp_state->mp_state = KVM_MP_STATE_STOPPED;
448     -	else
449     -		mp_state->mp_state = KVM_MP_STATE_RUNNABLE;
    462 +	*mp_state = vcpu->arch.mp_state;
450 463
451 464	return 0;
452 465 }
···
472 457
473 458	switch (mp_state->mp_state) {
474 459	case KVM_MP_STATE_RUNNABLE:
475     -		vcpu->arch.power_off = false;
    460 +		vcpu->arch.mp_state = *mp_state;
476 461		break;
477 462	case KVM_MP_STATE_STOPPED:
478     -		vcpu_power_off(vcpu);
    463 +		kvm_arm_vcpu_power_off(vcpu);
    464 +		break;
    465 +	case KVM_MP_STATE_SUSPENDED:
    466 +		kvm_arm_vcpu_suspend(vcpu);
479 467		break;
480 468	default:
481 469		ret = -EINVAL;
···
498 480 {
499 481	bool irq_lines = *vcpu_hcr(v) & (HCR_VI | HCR_VF);
500 482	return ((irq_lines || kvm_vgic_vcpu_pending_irq(v))
501     -		&& !v->arch.power_off && !v->arch.pause);
    483 +		&& !kvm_arm_vcpu_stopped(v) && !v->arch.pause);
502 484 }
503 485
504 486 bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu)
···
610 592	}
611 593 }
612 594
613     - static void vcpu_req_sleep(struct kvm_vcpu *vcpu)
    595 + static void kvm_vcpu_sleep(struct kvm_vcpu *vcpu)
614 596 {
615 597	struct rcuwait *wait = kvm_arch_vcpu_get_wait(vcpu);
616 598
617 599	rcuwait_wait_event(wait,
618     -			   (!vcpu->arch.power_off) &&(!vcpu->arch.pause),
    600 +			   (!kvm_arm_vcpu_stopped(vcpu)) && (!vcpu->arch.pause),
619 601			   TASK_INTERRUPTIBLE);
620 602
621     -	if (vcpu->arch.power_off || vcpu->arch.pause) {
    603 +	if (kvm_arm_vcpu_stopped(vcpu) || vcpu->arch.pause) {
622 604		/* Awaken to handle a signal, request we sleep again later. */
623 605		kvm_make_request(KVM_REQ_SLEEP, vcpu);
624 606	}
···
657 639	preempt_enable();
658 640
659 641	kvm_vcpu_halt(vcpu);
    642 +	vcpu->arch.flags &= ~KVM_ARM64_WFIT;
660 643	kvm_clear_request(KVM_REQ_UNHALT, vcpu);
661 644
662 645	preempt_disable();
···
665 646	preempt_enable();
666 647 }
667 648
668     - static void check_vcpu_requests(struct kvm_vcpu *vcpu)
    649 + static int kvm_vcpu_suspend(struct kvm_vcpu *vcpu)
    650 + {
    651 +	if (!kvm_arm_vcpu_suspended(vcpu))
    652 +		return 1;
    653 +
    654 +	kvm_vcpu_wfi(vcpu);
    655 +
    656 +	/*
    657 +	 * The suspend state is sticky; we do not leave it until userspace
    658 +	 * explicitly marks the vCPU as runnable. Request that we suspend again
    659 +	 * later.
    660 +	 */
    661 +	kvm_make_request(KVM_REQ_SUSPEND, vcpu);
    662 +
    663 +	/*
    664 +	 * Check to make sure the vCPU is actually runnable. If so, exit to
    665 +	 * userspace informing it of the wakeup condition.
    666 +	 */
    667 +	if (kvm_arch_vcpu_runnable(vcpu)) {
    668 +		memset(&vcpu->run->system_event, 0, sizeof(vcpu->run->system_event));
    669 +		vcpu->run->system_event.type = KVM_SYSTEM_EVENT_WAKEUP;
    670 +		vcpu->run->exit_reason = KVM_EXIT_SYSTEM_EVENT;
    671 +		return 0;
    672 +	}
    673 +
    674 +	/*
    675 +	 * Otherwise, we were unblocked to process a different event, such as a
    676 +	 * pending signal. Return 1 and allow kvm_arch_vcpu_ioctl_run() to
    677 +	 * process the event.
    678 +	 */
    679 +	return 1;
    680 + }
    681 +
    682 + /**
    683 +  * check_vcpu_requests - check and handle pending vCPU requests
    684 +  * @vcpu: the VCPU pointer
    685 +  *
    686 +  * Return: 1 if we should enter the guest
    687 +  *	    0 if we should exit to userspace
    688 +  *	    < 0 if we should exit to userspace, where the return value indicates
    689 +  *	    an error
    690 +  */
    691 + static int check_vcpu_requests(struct kvm_vcpu *vcpu)
669 692 {
670 693	if (kvm_request_pending(vcpu)) {
671 694		if (kvm_check_request(KVM_REQ_SLEEP, vcpu))
672     -			vcpu_req_sleep(vcpu);
    695 +			kvm_vcpu_sleep(vcpu);
673 696
674 697		if (kvm_check_request(KVM_REQ_VCPU_RESET, vcpu))
675 698			kvm_reset_vcpu(vcpu);
···
736 675		if (kvm_check_request(KVM_REQ_RELOAD_PMU, vcpu))
737 676			kvm_pmu_handle_pmcr(vcpu,
738 677					    __vcpu_sys_reg(vcpu, PMCR_EL0));
    678 +
    679 +		if (kvm_check_request(KVM_REQ_SUSPEND, vcpu))
    680 +			return kvm_vcpu_suspend(vcpu);
739 681	}
    682 +
    683 +	return 1;
740 684 }
741 685
742 686 static bool vcpu_mode_is_bad_32bit(struct kvm_vcpu *vcpu)
···
858 792		if (!ret)
859 793			ret = 1;
860 794
861     -		check_vcpu_requests(vcpu);
    795 +		if (ret > 0)
    796 +			ret = check_vcpu_requests(vcpu);
862 797
863 798		/*
864 799		 * Preparing the interrupts to be injected also
···
882 815		local_irq_disable();
883 816
884 817		kvm_vgic_flush_hwstate(vcpu);
    818 +
    819 +		kvm_pmu_update_vcpu_events(vcpu);
885 820
886 821		/*
887 822		 * Ensure we set mode to IN_GUEST_MODE after we disable
···
1194 1125	 * Handle the "start in power-off" case.
1195 1126	 */
1196 1127	if (test_bit(KVM_ARM_VCPU_POWER_OFF, vcpu->arch.features))
1197      -		vcpu_power_off(vcpu);
     1128 +		kvm_arm_vcpu_power_off(vcpu);
1198 1129	else
1199      -		vcpu->arch.power_off = false;
     1130 +		vcpu->arch.mp_state.mp_state = KVM_MP_STATE_RUNNABLE;
1200 1131
1201 1132	return 0;
1202 1133 }
···
1554 1485	tcr |= (idmap_t0sz & GENMASK(TCR_TxSZ_WIDTH - 1, 0)) << TCR_T0SZ_OFFSET;
1555 1486	params->tcr_el2 = tcr;
1556 1487
1557      -	params->stack_hyp_va = kern_hyp_va(per_cpu(kvm_arm_hyp_stack_page, cpu) + PAGE_SIZE);
1558 1488	params->pgd_pa = kvm_mmu_get_httbr();
1559 1489	if (is_protected_kvm_enabled())
1560 1490		params->hcr_el2 = HCR_HOST_NVHE_PROTECTED_FLAGS;
···
1831 1763
1832 1764	kvm_register_perf_callbacks(NULL);
1833 1765
1834      -	kvm_sys_reg_table_init();
1835      -
1836 1766 out:
1837 1767	if (err || !is_protected_kvm_enabled())
1838 1768		on_each_cpu(_kvm_arch_hardware_disable, NULL, 1);
···
2001 1935	 * Map the Hyp stack pages
2002 1936	 */
2003 1937	for_each_possible_cpu(cpu) {
     1938 +		struct kvm_nvhe_init_params *params = per_cpu_ptr_nvhe_sym(kvm_init_params, cpu);
2004 1939		char *stack_page = (char *)per_cpu(kvm_arm_hyp_stack_page, cpu);
2005      -		err = create_hyp_mappings(stack_page, stack_page + PAGE_SIZE,
2006      -					  PAGE_HYP);
     1940 +		unsigned long hyp_addr;
2007 1941
     1942 +		/*
     1943 +		 * Allocate a contiguous HYP private VA range for the stack
     1944 +		 * and guard page. The allocation is also aligned based on
     1945 +		 * the order of its size.
     1946 +		 */
     1947 +		err = hyp_alloc_private_va_range(PAGE_SIZE * 2, &hyp_addr);
     1948 +		if (err) {
     1949 +			kvm_err("Cannot allocate hyp stack guard page\n");
     1950 +			goto out_err;
     1951 +		}
     1952 +
     1953 +		/*
     1954 +		 * Since the stack grows downwards, map the stack to the page
     1955 +		 * at the higher address and leave the lower guard page
     1956 +		 * unbacked.
     1957 +		 *
     1958 +		 * Any valid stack address now has the PAGE_SHIFT bit as 1
     1959 +		 * and addresses corresponding to the guard page have the
     1960 +		 * PAGE_SHIFT bit as 0 - this is used for overflow detection.
     1961 +		 */
     1962 +		err = __create_hyp_mappings(hyp_addr + PAGE_SIZE, PAGE_SIZE,
     1963 +					    __pa(stack_page), PAGE_HYP);
2008 1964		if (err) {
2009 1965			kvm_err("Cannot map hyp stack\n");
2010 1966			goto out_err;
2011 1967		}
     1968 +
     1969 +		/*
     1970 +		 * Save the stack PA in nvhe_init_params. This will be needed
     1971 +		 * to recreate the stack mapping in protected nVHE mode.
     1972 +		 * __hyp_pa() won't do the right thing there, since the stack
     1973 +		 * has been mapped in the flexible private VA space.
     1974 +		 */
     1975 +		params->stack_pa = __pa(stack_page);
     1976 +
     1977 +		params->stack_hyp_va = hyp_addr + (2 * PAGE_SIZE);
2012 1978	}
2013 1979
2014 1980	for_each_possible_cpu(cpu) {
···
2187 2089	if (kvm_get_mode() == KVM_MODE_NONE) {
2188 2090		kvm_info("KVM disabled from command line\n");
2189 2091		return -ENODEV;
     2092 +	}
     2093 +
     2094 +	err = kvm_sys_reg_table_init();
     2095 +	if (err) {
     2096 +		kvm_info("Error initializing system register tables");
     2097 +		return err;
2190 2098	}
2191 2099
2192 2100	in_hyp_mode = is_kernel_in_hyp_mode();
+7 -3
arch/arm64/kvm/guest.c
···
18 18 #include <linux/string.h>
19 19 #include <linux/vmalloc.h>
20 20 #include <linux/fs.h>
21    - #include <kvm/arm_psci.h>
   21 + #include <kvm/arm_hypercalls.h>
22 22 #include <asm/cputype.h>
23 23 #include <linux/uaccess.h>
24 24 #include <asm/fpsimd.h>
···
756 756
757 757	switch (reg->id & KVM_REG_ARM_COPROC_MASK) {
758 758	case KVM_REG_ARM_CORE:	return get_core_reg(vcpu, reg);
759     -	case KVM_REG_ARM_FW:	return kvm_arm_get_fw_reg(vcpu, reg);
    759 +	case KVM_REG_ARM_FW:
    760 +	case KVM_REG_ARM_FW_FEAT_BMAP:
    761 +		return kvm_arm_get_fw_reg(vcpu, reg);
760 762	case KVM_REG_ARM64_SVE:	return get_sve_reg(vcpu, reg);
761 763	}
762 764
···
776 774
777 775	switch (reg->id & KVM_REG_ARM_COPROC_MASK) {
778 776	case KVM_REG_ARM_CORE:	return set_core_reg(vcpu, reg);
779     -	case KVM_REG_ARM_FW:	return kvm_arm_set_fw_reg(vcpu, reg);
    777 +	case KVM_REG_ARM_FW:
    778 +	case KVM_REG_ARM_FW_FEAT_BMAP:
    779 +		return kvm_arm_set_fw_reg(vcpu, reg);
780 780	case KVM_REG_ARM64_SVE:	return set_sve_reg(vcpu, reg);
781 781	}
782 782
+37 -12
arch/arm64/kvm/handle_exit.c
···
80 80  *
81 81  * @vcpu:	the vcpu pointer
82 82  *
83    - * WFE: Yield the CPU and come back to this vcpu when the scheduler
   83 + * WFE[T]: Yield the CPU and come back to this vcpu when the scheduler
84 84  * decides to.
85 85  * WFI: Simply call kvm_vcpu_halt(), which will halt execution of
86 86  * world-switches and schedule other host processes until there is an
87 87  * incoming IRQ or FIQ to the VM.
   88 + * WFIT: Same as WFI, with a timed wakeup implemented as a background timer
   89 + *
   90 + * WF{I,E}T can immediately return if the deadline has already expired.
88 91  */
89 92 static int kvm_handle_wfx(struct kvm_vcpu *vcpu)
90 93 {
91    -	if (kvm_vcpu_get_esr(vcpu) & ESR_ELx_WFx_ISS_WFE) {
   94 +	u64 esr = kvm_vcpu_get_esr(vcpu);
   95 +
   96 +	if (esr & ESR_ELx_WFx_ISS_WFE) {
92 97		trace_kvm_wfx_arm64(*vcpu_pc(vcpu), true);
93 98		vcpu->stat.wfe_exit_stat++;
94    -		kvm_vcpu_on_spin(vcpu, vcpu_mode_priv(vcpu));
95 99	} else {
96 100		trace_kvm_wfx_arm64(*vcpu_pc(vcpu), false);
97 101		vcpu->stat.wfi_exit_stat++;
98    -		kvm_vcpu_wfi(vcpu);
99 102	}
100 103
    104 +	if (esr & ESR_ELx_WFx_ISS_WFxT) {
    105 +		if (esr & ESR_ELx_WFx_ISS_RV) {
    106 +			u64 val, now;
    107 +
    108 +			now = kvm_arm_timer_get_reg(vcpu, KVM_REG_ARM_TIMER_CNT);
    109 +			val = vcpu_get_reg(vcpu, kvm_vcpu_sys_get_rt(vcpu));
    110 +
    111 +			if (now >= val)
    112 +				goto out;
    113 +		} else {
    114 +			/* Treat WFxT as WFx if RN is invalid */
    115 +			esr &= ~ESR_ELx_WFx_ISS_WFxT;
    116 +		}
    117 +	}
    118 +
    119 +	if (esr & ESR_ELx_WFx_ISS_WFE) {
    120 +		kvm_vcpu_on_spin(vcpu, vcpu_mode_priv(vcpu));
    121 +	} else {
    122 +		if (esr & ESR_ELx_WFx_ISS_WFxT)
    123 +			vcpu->arch.flags |= KVM_ARM64_WFIT;
    124 +
    125 +		kvm_vcpu_wfi(vcpu);
    126 +	}
    127 + out:
101 128	kvm_incr_pc(vcpu);
102 129
103 130	return 1;
···
196 169	[ESR_ELx_EC_CP15_64]	= kvm_handle_cp15_64,
197 170	[ESR_ELx_EC_CP14_MR]	= kvm_handle_cp14_32,
198 171	[ESR_ELx_EC_CP14_LS]	= kvm_handle_cp14_load_store,
    172 +	[ESR_ELx_EC_CP10_ID]	= kvm_handle_cp10_id,
199 173	[ESR_ELx_EC_CP14_64]	= kvm_handle_cp14_64,
200 174	[ESR_ELx_EC_HVC32]	= handle_hvc,
201 175	[ESR_ELx_EC_SMC32]	= handle_smc,
···
325 297	u64 elr_in_kimg = __phys_to_kimg(elr_phys);
326 298	u64 hyp_offset = elr_in_kimg - kaslr_offset() - elr_virt;
327 299	u64 mode = spsr & PSR_MODE_MASK;
    300 +	u64 panic_addr = elr_virt + hyp_offset;
328 301
329     -	/*
330     -	 * The nVHE hyp symbols are not included by kallsyms to avoid issues
331     -	 * with aliasing. That means that the symbols cannot be printed with the
332     -	 * "%pS" format specifier, so fall back to the vmlinux address if
333     -	 * there's no better option.
334     -	 */
335 302	if (mode != PSR_MODE_EL2t && mode != PSR_MODE_EL2h) {
336 303		kvm_err("Invalid host exception to nVHE hyp!\n");
337 304	} else if (ESR_ELx_EC(esr) == ESR_ELx_EC_BRK64 &&
···
346 323		if (file)
347 324			kvm_err("nVHE hyp BUG at: %s:%u!\n", file, line);
348 325		else
349     -			kvm_err("nVHE hyp BUG at: %016llx!\n", elr_virt + hyp_offset);
    326 +			kvm_err("nVHE hyp BUG at: [<%016llx>] %pB!\n", panic_addr,
    327 +				(void *)panic_addr);
350 328	} else {
351     -		kvm_err("nVHE hyp panic at: %016llx!\n", elr_virt + hyp_offset);
    329 +		kvm_err("nVHE hyp panic at: [<%016llx>] %pB!\n", panic_addr,
    330 +			(void *)panic_addr);
352 331	}
353 332
354 333	/*
+4 -2
arch/arm64/kvm/hyp/include/nvhe/mm.h
···
19 19 int pkvm_cpu_set_vector(enum arm64_hyp_spectre_vector slot);
20 20 int pkvm_create_mappings(void *from, void *to, enum kvm_pgtable_prot prot);
21 21 int pkvm_create_mappings_locked(void *from, void *to, enum kvm_pgtable_prot prot);
22    - unsigned long __pkvm_create_private_mapping(phys_addr_t phys, size_t size,
23    -					    enum kvm_pgtable_prot prot);
   22 + int __pkvm_create_private_mapping(phys_addr_t phys, size_t size,
   23 +				  enum kvm_pgtable_prot prot,
   24 +				  unsigned long *haddr);
   25 + int pkvm_alloc_private_va_range(size_t size, unsigned long *haddr);
24 26
25 27 static inline void hyp_vmemmap_range(phys_addr_t phys, unsigned long size,
26 28				     unsigned long *start, unsigned long *end)
+27 -5
arch/arm64/kvm/hyp/nvhe/host.S
···
80 80	mov	lr, #(PSR_F_BIT | PSR_I_BIT | PSR_A_BIT | PSR_D_BIT |\
81 81		      PSR_MODE_EL1h)
82 82	msr	spsr_el2, lr
83    -	ldr	lr, =nvhe_hyp_panic_handler
   83 +	adr_l	lr, nvhe_hyp_panic_handler
84 84	hyp_kimg_va lr, x6
85 85	msr	elr_el2, lr
86 86
···
125 125	add	sp, sp, #16
126 126	/*
127 127	 * Compute the idmap address of __kvm_handle_stub_hvc and
128     -	 * jump there. Since we use kimage_voffset, do not use the
129     -	 * HYP VA for __kvm_handle_stub_hvc, but the kernel VA instead
130     -	 * (by loading it from the constant pool).
    128 +	 * jump there.
131 129	 *
132 130	 * Preserve x0-x4, which may contain stub parameters.
133 131	 */
134     -	ldr	x5, =__kvm_handle_stub_hvc
    132 +	adr_l	x5, __kvm_handle_stub_hvc
135 133	hyp_pa	x5, x6
136 134	br	x5
137 135 SYM_FUNC_END(__host_hvc)
···
151 153
152 154 .macro invalid_host_el2_vect
153 155	.align 7
    156 +
    157 +	/*
    158 +	 * Test whether the SP has overflowed, without corrupting a GPR.
    159 +	 * nVHE hypervisor stacks are aligned so that the PAGE_SHIFT bit
    160 +	 * of SP should always be 1.
    161 +	 */
    162 +	add	sp, sp, x0			// sp' = sp + x0
    163 +	sub	x0, sp, x0			// x0' = sp' - x0 = (sp + x0) - x0 = sp
    164 +	tbz	x0, #PAGE_SHIFT, .L__hyp_sp_overflow\@
    165 +	sub	x0, sp, x0			// x0'' = sp' - x0' = (sp + x0) - sp = x0
    166 +	sub	sp, sp, x0			// sp'' = sp' - x0 = (sp + x0) - x0 = sp
    167 +
154 168	/* If a guest is loaded, panic out of it. */
155 169	stp	x0, x1, [sp, #-16]!
156 170	get_loaded_vcpu x0, x1
···
175 165	 * been partially clobbered by __host_enter.
176 166	 */
177 167	b	hyp_panic
    168 +
    169 + .L__hyp_sp_overflow\@:
    170 +	/*
    171 +	 * Reset SP to the top of the stack, to allow handling the hyp_panic.
    172 +	 * This corrupts the stack but is ok, since we won't be attempting
    173 +	 * any unwinding here.
    174 +	 */
    175 +	ldr_this_cpu x0, kvm_init_params + NVHE_INIT_STACK_HYP_VA, x1
    176 +	mov	sp, x0
    177 +
    178 +	b	hyp_panic_bad_stack
    179 +	ASM_BUG()
178 180 .endm
179 181
180 182 .macro invalid_host_el1_vect
+17 -1
arch/arm64/kvm/hyp/nvhe/hyp-main.c
··· 160 160 DECLARE_REG(size_t, size, host_ctxt, 2); 161 161 DECLARE_REG(enum kvm_pgtable_prot, prot, host_ctxt, 3); 162 162 163 - cpu_reg(host_ctxt, 1) = __pkvm_create_private_mapping(phys, size, prot); 163 + /* 164 + * __pkvm_create_private_mapping() populates a pointer with the 165 + * hypervisor start address of the allocation. 166 + * 167 + * However, handle___pkvm_create_private_mapping() hypercall crosses the 168 + * EL1/EL2 boundary so the pointer would not be valid in this context. 169 + * 170 + * Instead pass the allocation address as the return value (or return 171 + * ERR_PTR() on failure). 172 + */ 173 + unsigned long haddr; 174 + int err = __pkvm_create_private_mapping(phys, size, prot, &haddr); 175 + 176 + if (err) 177 + haddr = (unsigned long)ERR_PTR(err); 178 + 179 + cpu_reg(host_ctxt, 1) = haddr; 164 180 } 165 181 166 182 static void handle___pkvm_prot_finalize(struct kvm_cpu_context *host_ctxt)
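Because the return value crosses the EL1/EL2 boundary in a single register, the handler above folds the out-parameter back into the kernel's usual ERR_PTR() convention: addresses and negative errnos share one unsigned long, with errors occupying the top 4095 values. A rough standalone model of that encoding (function names are illustrative):

```c
#include <assert.h>

#define MAX_ERRNO 4095UL

/* IS_ERR_VALUE()-style test: errors occupy the last MAX_ERRNO values. */
static int is_err_value(unsigned long v)
{
	return v >= (unsigned long)-MAX_ERRNO;
}

/*
 * Model of the hyp-main.c convention: pass the mapped address back on
 * success, or the ERR_PTR()-encoded errno in the same register on
 * failure, since a host pointer would not be valid at EL2.
 */
static unsigned long hypercall_retval(int err, unsigned long haddr)
{
	return err ? (unsigned long)(long)err : haddr;
}
```

The caller can then distinguish the two cases with a single comparison, exactly as IS_ERR_VALUE() does on the kernel side.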
+54 -30
arch/arm64/kvm/hyp/nvhe/mm.c
··· 37 37 return err; 38 38 } 39 39 40 - unsigned long __pkvm_create_private_mapping(phys_addr_t phys, size_t size, 41 - enum kvm_pgtable_prot prot) 40 + /** 41 + * pkvm_alloc_private_va_range - Allocates a private VA range. 42 + * @size: The size of the VA range to reserve. 43 + * @haddr: The hypervisor virtual start address of the allocation. 44 + * 45 + * The private virtual address (VA) range is allocated above __io_map_base 46 + * and aligned based on the order of @size. 47 + * 48 + * Return: 0 on success or negative error code on failure. 49 + */ 50 + int pkvm_alloc_private_va_range(size_t size, unsigned long *haddr) 51 + { 52 + unsigned long base, addr; 53 + int ret = 0; 54 + 55 + hyp_spin_lock(&pkvm_pgd_lock); 56 + 57 + /* Align the allocation based on the order of its size */ 58 + addr = ALIGN(__io_map_base, PAGE_SIZE << get_order(size)); 59 + 60 + /* The allocated size is always a multiple of PAGE_SIZE */ 61 + base = addr + PAGE_ALIGN(size); 62 + 63 + /* Are we overflowing on the vmemmap ? */ 64 + if (!addr || base > __hyp_vmemmap) 65 + ret = -ENOMEM; 66 + else { 67 + __io_map_base = base; 68 + *haddr = addr; 69 + } 70 + 71 + hyp_spin_unlock(&pkvm_pgd_lock); 72 + 73 + return ret; 74 + } 75 + 76 + int __pkvm_create_private_mapping(phys_addr_t phys, size_t size, 77 + enum kvm_pgtable_prot prot, 78 + unsigned long *haddr) 42 79 { 43 80 unsigned long addr; 44 81 int err; 45 82 46 - hyp_spin_lock(&pkvm_pgd_lock); 47 - 48 83 size = PAGE_ALIGN(size + offset_in_page(phys)); 49 - addr = __io_map_base; 50 - __io_map_base += size; 84 + err = pkvm_alloc_private_va_range(size, &addr); 85 + if (err) 86 + return err; 51 87 52 - /* Are we overflowing on the vmemmap ? */ 53 - if (__io_map_base > __hyp_vmemmap) { 54 - __io_map_base -= size; 55 - addr = (unsigned long)ERR_PTR(-ENOMEM); 56 - goto out; 57 - } 88 + err = __pkvm_create_mappings(addr, size, phys, prot); 89 + if (err) 90 + return err; 58 91 59 - err = kvm_pgtable_hyp_map(&pkvm_pgtable, addr, size, phys, prot); 60 - if (err) { 61 - addr = (unsigned long)ERR_PTR(err); 62 - goto out; 63 - } 64 - 65 - addr = addr + offset_in_page(phys); 66 - out: 67 - hyp_spin_unlock(&pkvm_pgd_lock); 68 - 69 - return addr; 92 + *haddr = addr + offset_in_page(phys); 93 + return err; 70 94 } 71 95 72 96 int pkvm_create_mappings_locked(void *from, void *to, enum kvm_pgtable_prot prot) ··· 170 146 int hyp_map_vectors(void) 171 147 { 172 148 phys_addr_t phys; 173 - void *bp_base; 149 + unsigned long bp_base; 150 + int ret; 174 151 175 152 if (!kvm_system_needs_idmapped_vectors()) { 176 153 __hyp_bp_vect_base = __bp_harden_hyp_vecs; ··· 179 154 } 180 155 181 156 phys = __hyp_pa(__bp_harden_hyp_vecs); 182 - bp_base = (void *)__pkvm_create_private_mapping(phys, 183 - __BP_HARDEN_HYP_VECS_SZ, 184 - PAGE_HYP_EXEC); 185 - if (IS_ERR_OR_NULL(bp_base)) 186 - return PTR_ERR(bp_base); 157 + ret = __pkvm_create_private_mapping(phys, __BP_HARDEN_HYP_VECS_SZ, 158 + PAGE_HYP_EXEC, &bp_base); 159 + if (ret) 160 + return ret; 187 161 188 - __hyp_bp_vect_base = bp_base; 162 + __hyp_bp_vect_base = (void *)bp_base; 189 163 190 164 return 0; 191 165 }
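The new allocator rounds __io_map_base up to `PAGE_SIZE << get_order(size)`, i.e. the next power-of-two-pages boundary covering the request, which is what guarantees the natural alignment the stack guard trick depends on. A standalone sketch of that computation (local `order_of`/`align_up` helpers are stand-ins for the kernel's `get_order()`/`ALIGN()` macros):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)

/* Smallest order such that PAGE_SIZE << order covers size (cf. get_order()). */
static unsigned int order_of(size_t size)
{
	unsigned int order = 0;

	while ((PAGE_SIZE << order) < size)
		order++;
	return order;
}

/* Round up to a power-of-two boundary (cf. ALIGN()). */
static uint64_t align_up(uint64_t x, uint64_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/* Base chosen for an upward-growing private VA allocation. */
static uint64_t pick_private_va(uint64_t io_map_base, size_t size)
{
	return align_up(io_map_base, PAGE_SIZE << order_of(size));
}
```

So a two-page request lands on a two-page boundary, a three-page request on a four-page boundary, and so on.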
+28 -3
arch/arm64/kvm/hyp/nvhe/setup.c
··· 99 99 return ret; 100 100 101 101 for (i = 0; i < hyp_nr_cpus; i++) { 102 + struct kvm_nvhe_init_params *params = per_cpu_ptr(&kvm_init_params, i); 103 + unsigned long hyp_addr; 104 + 102 105 start = (void *)kern_hyp_va(per_cpu_base[i]); 103 106 end = start + PAGE_ALIGN(hyp_percpu_size); 104 107 ret = pkvm_create_mappings(start, end, PAGE_HYP); 105 108 if (ret) 106 109 return ret; 107 110 108 - end = (void *)per_cpu_ptr(&kvm_init_params, i)->stack_hyp_va; 109 - start = end - PAGE_SIZE; 110 - ret = pkvm_create_mappings(start, end, PAGE_HYP); 111 + /* 112 + * Allocate a contiguous HYP private VA range for the stack 113 + * and guard page. The allocation is also aligned based on 114 + * the order of its size. 115 + */ 116 + ret = pkvm_alloc_private_va_range(PAGE_SIZE * 2, &hyp_addr); 111 117 if (ret) 112 118 return ret; 119 + 120 + /* 121 + * Since the stack grows downwards, map the stack to the page 122 + * at the higher address and leave the lower guard page 123 + * unbacked. 124 + * 125 + * Any valid stack address now has the PAGE_SHIFT bit as 1 126 + * and addresses corresponding to the guard page have the 127 + * PAGE_SHIFT bit as 0 - this is used for overflow detection. 128 + */ 129 + hyp_spin_lock(&pkvm_pgd_lock); 130 + ret = kvm_pgtable_hyp_map(&pkvm_pgtable, hyp_addr + PAGE_SIZE, 131 + PAGE_SIZE, params->stack_pa, PAGE_HYP); 132 + hyp_spin_unlock(&pkvm_pgd_lock); 133 + if (ret) 134 + return ret; 135 + 136 + /* Update stack_hyp_va to end of the stack's private VA range */ 137 + params->stack_hyp_va = hyp_addr + (2 * PAGE_SIZE); 113 138 } 114 139 115 140 /*
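Since the two-page private range comes back order-aligned, hyp_addr is a multiple of 2 * PAGE_SIZE: the low page stays unbacked as the guard and the high page carries the stack, making PAGE_SHIFT a clean valid/overflow discriminator. A small model of that layout (helper names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)

/* The stack occupies the high page of a 2-page-aligned private range. */
static int in_stack_page(uint64_t hyp_addr, uint64_t va)
{
	return va >= hyp_addr + PAGE_SIZE && va < hyp_addr + 2 * PAGE_SIZE;
}

/* The PAGE_SHIFT bit: 1 for stack addresses, 0 for guard-page addresses. */
static int page_shift_bit(uint64_t va)
{
	return !!(va & (1UL << PAGE_SHIFT));
}
```

Given a 2-page-aligned hyp_addr, the two predicates agree for every address in the range, which is exactly what the host.S entry check exploits.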
+21 -36
arch/arm64/kvm/hyp/nvhe/switch.c
··· 150 150 } 151 151 } 152 152 153 - /** 153 + /* 154 154 * Disable host events, enable guest events 155 155 */ 156 - static bool __pmu_switch_to_guest(struct kvm_cpu_context *host_ctxt) 156 + #ifdef CONFIG_HW_PERF_EVENTS 157 + static bool __pmu_switch_to_guest(struct kvm_vcpu *vcpu) 157 158 { 158 - struct kvm_host_data *host; 159 - struct kvm_pmu_events *pmu; 160 - 161 - host = container_of(host_ctxt, struct kvm_host_data, host_ctxt); 162 - pmu = &host->pmu_events; 159 + struct kvm_pmu_events *pmu = &vcpu->arch.pmu.events; 163 160 164 161 if (pmu->events_host) 165 162 write_sysreg(pmu->events_host, pmcntenclr_el0); ··· 167 170 return (pmu->events_host || pmu->events_guest); 168 171 } 169 172 170 - /** 173 + /* 171 174 * Disable guest events, enable host events 172 175 */ 173 - static void __pmu_switch_to_host(struct kvm_cpu_context *host_ctxt) 176 + static void __pmu_switch_to_host(struct kvm_vcpu *vcpu) 174 177 { 175 - struct kvm_host_data *host; 176 - struct kvm_pmu_events *pmu; 177 - 178 - host = container_of(host_ctxt, struct kvm_host_data, host_ctxt); 179 - pmu = &host->pmu_events; 178 + struct kvm_pmu_events *pmu = &vcpu->arch.pmu.events; 180 179 181 180 if (pmu->events_guest) 182 181 write_sysreg(pmu->events_guest, pmcntenclr_el0); ··· 180 187 if (pmu->events_host) 181 188 write_sysreg(pmu->events_host, pmcntenset_el0); 182 189 } 190 + #else 191 + #define __pmu_switch_to_guest(v) ({ false; }) 192 + #define __pmu_switch_to_host(v) do {} while (0) 193 + #endif 183 194 184 - /** 195 + /* 185 196 * Handler for protected VM MSR, MRS or System instruction execution in AArch64. 186 197 * 187 198 * Returns true if the hypervisor has handled the exit, and control should go ··· 200 203 */ 201 204 return (kvm_hyp_handle_sysreg(vcpu, exit_code) || 202 205 kvm_handle_pvm_sysreg(vcpu, exit_code)); 203 - } 204 - 205 - /** 206 - * Handler for protected floating-point and Advanced SIMD accesses. 
207 - * 208 - * Returns true if the hypervisor has handled the exit, and control should go 209 - * back to the guest, or false if it hasn't. 210 - */ 211 - static bool kvm_handle_pvm_fpsimd(struct kvm_vcpu *vcpu, u64 *exit_code) 212 - { 213 - /* Linux guests assume support for floating-point and Advanced SIMD. */ 214 - BUILD_BUG_ON(!FIELD_GET(ARM64_FEATURE_MASK(ID_AA64PFR0_FP), 215 - PVM_ID_AA64PFR0_ALLOW)); 216 - BUILD_BUG_ON(!FIELD_GET(ARM64_FEATURE_MASK(ID_AA64PFR0_ASIMD), 217 - PVM_ID_AA64PFR0_ALLOW)); 218 - 219 - return kvm_hyp_handle_fpsimd(vcpu, exit_code); 220 206 } 221 207 222 208 static const exit_handler_fn hyp_exit_handlers[] = { ··· 217 237 [0 ... ESR_ELx_EC_MAX] = NULL, 218 238 [ESR_ELx_EC_SYS64] = kvm_handle_pvm_sys64, 219 239 [ESR_ELx_EC_SVE] = kvm_handle_pvm_restricted, 220 - [ESR_ELx_EC_FP_ASIMD] = kvm_handle_pvm_fpsimd, 240 + [ESR_ELx_EC_FP_ASIMD] = kvm_hyp_handle_fpsimd, 221 241 [ESR_ELx_EC_IABT_LOW] = kvm_hyp_handle_iabt_low, 222 242 [ESR_ELx_EC_DABT_LOW] = kvm_hyp_handle_dabt_low, 223 243 [ESR_ELx_EC_PAC] = kvm_hyp_handle_ptrauth, ··· 284 304 host_ctxt->__hyp_running_vcpu = vcpu; 285 305 guest_ctxt = &vcpu->arch.ctxt; 286 306 287 - pmu_switch_needed = __pmu_switch_to_guest(host_ctxt); 307 + pmu_switch_needed = __pmu_switch_to_guest(vcpu); 288 308 289 309 __sysreg_save_state_nvhe(host_ctxt); 290 310 /* ··· 346 366 __debug_restore_host_buffers_nvhe(vcpu); 347 367 348 368 if (pmu_switch_needed) 349 - __pmu_switch_to_host(host_ctxt); 369 + __pmu_switch_to_host(vcpu); 350 370 351 371 /* Returning to host will clear PSR.I, remask PMR if needed */ 352 372 if (system_uses_irq_prio_masking()) ··· 357 377 return exit_code; 358 378 } 359 379 360 - void __noreturn hyp_panic(void) 380 + asmlinkage void __noreturn hyp_panic(void) 361 381 { 362 382 u64 spsr = read_sysreg_el2(SYS_SPSR); 363 383 u64 elr = read_sysreg_el2(SYS_ELR); ··· 377 397 378 398 __hyp_do_panic(host_ctxt, spsr, elr, par); 379 399 unreachable(); 400 + } 401 + 402 + asmlinkage void 
__noreturn hyp_panic_bad_stack(void) 403 + { 404 + hyp_panic(); 380 405 } 381 406 382 407 asmlinkage void kvm_unexpected_el2_exception(void)
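The __pmu_switch_to_guest()/__pmu_switch_to_host() pair above boils down to masked writes to PMCNTENCLR/PMCNTENSET with the host and guest event sets swapped. A hedged userspace model, with a plain variable standing in for the counter-enable system register:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct kvm_pmu_events {
	uint32_t events_host;
	uint32_t events_guest;
};

static uint32_t pmcnten;	/* stand-in for PMCNTEN{SET,CLR}_EL0 state */

/* Disable host events, enable guest events; report whether any switch was needed. */
static bool pmu_switch_to_guest(const struct kvm_pmu_events *pmu)
{
	if (pmu->events_host)
		pmcnten &= ~pmu->events_host;	/* write pmcntenclr_el0 */
	if (pmu->events_guest)
		pmcnten |= pmu->events_guest;	/* write pmcntenset_el0 */
	return pmu->events_host || pmu->events_guest;
}

/* Disable guest events, enable host events. */
static void pmu_switch_to_host(const struct kvm_pmu_events *pmu)
{
	if (pmu->events_guest)
		pmcnten &= ~pmu->events_guest;
	if (pmu->events_host)
		pmcnten |= pmu->events_host;
}
```

The boolean return mirrors pmu_switch_needed in the run loop: when both sets are empty, the exit path skips the second write entirely.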
-3
arch/arm64/kvm/hyp/nvhe/sys_regs.c
··· 90 90 u64 set_mask = 0; 91 91 u64 allow_mask = PVM_ID_AA64PFR0_ALLOW; 92 92 93 - if (!vcpu_has_sve(vcpu)) 94 - allow_mask &= ~ARM64_FEATURE_MASK(ID_AA64PFR0_SVE); 95 - 96 93 set_mask |= get_restricted_features_unsigned(id_aa64pfr0_el1_sys_val, 97 94 PVM_ID_AA64PFR0_RESTRICT_UNSIGNED); 98 95
+324 -3
arch/arm64/kvm/hypercalls.c
··· 9 9 #include <kvm/arm_hypercalls.h> 10 10 #include <kvm/arm_psci.h> 11 11 12 + #define KVM_ARM_SMCCC_STD_FEATURES \ 13 + GENMASK(KVM_REG_ARM_STD_BMAP_BIT_COUNT - 1, 0) 14 + #define KVM_ARM_SMCCC_STD_HYP_FEATURES \ 15 + GENMASK(KVM_REG_ARM_STD_HYP_BMAP_BIT_COUNT - 1, 0) 16 + #define KVM_ARM_SMCCC_VENDOR_HYP_FEATURES \ 17 + GENMASK(KVM_REG_ARM_VENDOR_HYP_BMAP_BIT_COUNT - 1, 0) 18 + 12 19 static void kvm_ptp_get_time(struct kvm_vcpu *vcpu, u64 *val) 13 20 { 14 21 struct system_time_snapshot systime_snapshot; ··· 65 58 val[3] = lower_32_bits(cycles); 66 59 } 67 60 61 + static bool kvm_hvc_call_default_allowed(u32 func_id) 62 + { 63 + switch (func_id) { 64 + /* 65 + * List of function-ids that are not gated with the bitmapped 66 + * feature firmware registers, and are to be allowed for 67 + * servicing the call by default. 68 + */ 69 + case ARM_SMCCC_VERSION_FUNC_ID: 70 + case ARM_SMCCC_ARCH_FEATURES_FUNC_ID: 71 + return true; 72 + default: 73 + /* PSCI 0.2 and up is in the 0:0x1f range */ 74 + if (ARM_SMCCC_OWNER_NUM(func_id) == ARM_SMCCC_OWNER_STANDARD && 75 + ARM_SMCCC_FUNC_NUM(func_id) <= 0x1f) 76 + return true; 77 + 78 + /* 79 + * KVM's PSCI 0.1 doesn't comply with SMCCC, and has 80 + * its own function-id base and range 81 + */ 82 + if (func_id >= KVM_PSCI_FN(0) && func_id <= KVM_PSCI_FN(3)) 83 + return true; 84 + 85 + return false; 86 + } 87 + } 88 + 89 + static bool kvm_hvc_call_allowed(struct kvm_vcpu *vcpu, u32 func_id) 90 + { 91 + struct kvm_smccc_features *smccc_feat = &vcpu->kvm->arch.smccc_feat; 92 + 93 + switch (func_id) { 94 + case ARM_SMCCC_TRNG_VERSION: 95 + case ARM_SMCCC_TRNG_FEATURES: 96 + case ARM_SMCCC_TRNG_GET_UUID: 97 + case ARM_SMCCC_TRNG_RND32: 98 + case ARM_SMCCC_TRNG_RND64: 99 + return test_bit(KVM_REG_ARM_STD_BIT_TRNG_V1_0, 100 + &smccc_feat->std_bmap); 101 + case ARM_SMCCC_HV_PV_TIME_FEATURES: 102 + case ARM_SMCCC_HV_PV_TIME_ST: 103 + return test_bit(KVM_REG_ARM_STD_HYP_BIT_PV_TIME, 104 + &smccc_feat->std_hyp_bmap); 105 + case 
ARM_SMCCC_VENDOR_HYP_KVM_FEATURES_FUNC_ID: 106 + case ARM_SMCCC_VENDOR_HYP_CALL_UID_FUNC_ID: 107 + return test_bit(KVM_REG_ARM_VENDOR_HYP_BIT_FUNC_FEAT, 108 + &smccc_feat->vendor_hyp_bmap); 109 + case ARM_SMCCC_VENDOR_HYP_KVM_PTP_FUNC_ID: 110 + return test_bit(KVM_REG_ARM_VENDOR_HYP_BIT_PTP, 111 + &smccc_feat->vendor_hyp_bmap); 112 + default: 113 + return kvm_hvc_call_default_allowed(func_id); 114 + } 115 + } 116 + 68 117 int kvm_hvc_call_handler(struct kvm_vcpu *vcpu) 69 118 { 119 + struct kvm_smccc_features *smccc_feat = &vcpu->kvm->arch.smccc_feat; 70 120 u32 func_id = smccc_get_function(vcpu); 71 121 u64 val[4] = {SMCCC_RET_NOT_SUPPORTED}; 72 122 u32 feature; 73 123 gpa_t gpa; 124 + 125 + if (!kvm_hvc_call_allowed(vcpu, func_id)) 126 + goto out; 74 127 75 128 switch (func_id) { 76 129 case ARM_SMCCC_VERSION_FUNC_ID: ··· 187 120 } 188 121 break; 189 122 case ARM_SMCCC_HV_PV_TIME_FEATURES: 190 - val[0] = SMCCC_RET_SUCCESS; 123 + if (test_bit(KVM_REG_ARM_STD_HYP_BIT_PV_TIME, 124 + &smccc_feat->std_hyp_bmap)) 125 + val[0] = SMCCC_RET_SUCCESS; 191 126 break; 192 127 } 193 128 break; ··· 208 139 val[3] = ARM_SMCCC_VENDOR_HYP_UID_KVM_REG_3; 209 140 break; 210 141 case ARM_SMCCC_VENDOR_HYP_KVM_FEATURES_FUNC_ID: 211 - val[0] = BIT(ARM_SMCCC_KVM_FUNC_FEATURES); 212 - val[0] |= BIT(ARM_SMCCC_KVM_FUNC_PTP); 142 + val[0] = smccc_feat->vendor_hyp_bmap; 213 143 break; 214 144 case ARM_SMCCC_VENDOR_HYP_KVM_PTP_FUNC_ID: 215 145 kvm_ptp_get_time(vcpu, val); ··· 223 155 return kvm_psci_call(vcpu); 224 156 } 225 157 158 + out: 226 159 smccc_set_retval(vcpu, val[0], val[1], val[2], val[3]); 227 160 return 1; 161 + } 162 + 163 + static const u64 kvm_arm_fw_reg_ids[] = { 164 + KVM_REG_ARM_PSCI_VERSION, 165 + KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1, 166 + KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2, 167 + KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_3, 168 + KVM_REG_ARM_STD_BMAP, 169 + KVM_REG_ARM_STD_HYP_BMAP, 170 + KVM_REG_ARM_VENDOR_HYP_BMAP, 171 + }; 172 + 173 + void kvm_arm_init_hypercalls(struct kvm 
*kvm) 174 + { 175 + struct kvm_smccc_features *smccc_feat = &kvm->arch.smccc_feat; 176 + 177 + smccc_feat->std_bmap = KVM_ARM_SMCCC_STD_FEATURES; 178 + smccc_feat->std_hyp_bmap = KVM_ARM_SMCCC_STD_HYP_FEATURES; 179 + smccc_feat->vendor_hyp_bmap = KVM_ARM_SMCCC_VENDOR_HYP_FEATURES; 180 + } 181 + 182 + int kvm_arm_get_fw_num_regs(struct kvm_vcpu *vcpu) 183 + { 184 + return ARRAY_SIZE(kvm_arm_fw_reg_ids); 185 + } 186 + 187 + int kvm_arm_copy_fw_reg_indices(struct kvm_vcpu *vcpu, u64 __user *uindices) 188 + { 189 + int i; 190 + 191 + for (i = 0; i < ARRAY_SIZE(kvm_arm_fw_reg_ids); i++) { 192 + if (put_user(kvm_arm_fw_reg_ids[i], uindices++)) 193 + return -EFAULT; 194 + } 195 + 196 + return 0; 197 + } 198 + 199 + #define KVM_REG_FEATURE_LEVEL_MASK GENMASK(3, 0) 200 + 201 + /* 202 + * Convert the workaround level into an easy-to-compare number, where higher 203 + * values mean better protection. 204 + */ 205 + static int get_kernel_wa_level(u64 regid) 206 + { 207 + switch (regid) { 208 + case KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1: 209 + switch (arm64_get_spectre_v2_state()) { 210 + case SPECTRE_VULNERABLE: 211 + return KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_NOT_AVAIL; 212 + case SPECTRE_MITIGATED: 213 + return KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_AVAIL; 214 + case SPECTRE_UNAFFECTED: 215 + return KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_NOT_REQUIRED; 216 + } 217 + return KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_NOT_AVAIL; 218 + case KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2: 219 + switch (arm64_get_spectre_v4_state()) { 220 + case SPECTRE_MITIGATED: 221 + /* 222 + * As for the hypercall discovery, we pretend we 223 + * don't have any FW mitigation if SSBS is there at 224 + * all times. 
225 + */ 226 + if (cpus_have_final_cap(ARM64_SSBS)) 227 + return KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_NOT_AVAIL; 228 + fallthrough; 229 + case SPECTRE_UNAFFECTED: 230 + return KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_NOT_REQUIRED; 231 + case SPECTRE_VULNERABLE: 232 + return KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_NOT_AVAIL; 233 + } 234 + break; 235 + case KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_3: 236 + switch (arm64_get_spectre_bhb_state()) { 237 + case SPECTRE_VULNERABLE: 238 + return KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_3_NOT_AVAIL; 239 + case SPECTRE_MITIGATED: 240 + return KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_3_AVAIL; 241 + case SPECTRE_UNAFFECTED: 242 + return KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_3_NOT_REQUIRED; 243 + } 244 + return KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_3_NOT_AVAIL; 245 + } 246 + 247 + return -EINVAL; 248 + } 249 + 250 + int kvm_arm_get_fw_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg) 251 + { 252 + struct kvm_smccc_features *smccc_feat = &vcpu->kvm->arch.smccc_feat; 253 + void __user *uaddr = (void __user *)(long)reg->addr; 254 + u64 val; 255 + 256 + switch (reg->id) { 257 + case KVM_REG_ARM_PSCI_VERSION: 258 + val = kvm_psci_version(vcpu); 259 + break; 260 + case KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1: 261 + case KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2: 262 + case KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_3: 263 + val = get_kernel_wa_level(reg->id) & KVM_REG_FEATURE_LEVEL_MASK; 264 + break; 265 + case KVM_REG_ARM_STD_BMAP: 266 + val = READ_ONCE(smccc_feat->std_bmap); 267 + break; 268 + case KVM_REG_ARM_STD_HYP_BMAP: 269 + val = READ_ONCE(smccc_feat->std_hyp_bmap); 270 + break; 271 + case KVM_REG_ARM_VENDOR_HYP_BMAP: 272 + val = READ_ONCE(smccc_feat->vendor_hyp_bmap); 273 + break; 274 + default: 275 + return -ENOENT; 276 + } 277 + 278 + if (copy_to_user(uaddr, &val, KVM_REG_SIZE(reg->id))) 279 + return -EFAULT; 280 + 281 + return 0; 282 + } 283 + 284 + static int kvm_arm_set_fw_reg_bmap(struct kvm_vcpu *vcpu, u64 reg_id, u64 val) 285 + { 286 + int ret = 0; 287 + struct kvm 
*kvm = vcpu->kvm; 288 + struct kvm_smccc_features *smccc_feat = &kvm->arch.smccc_feat; 289 + unsigned long *fw_reg_bmap, fw_reg_features; 290 + 291 + switch (reg_id) { 292 + case KVM_REG_ARM_STD_BMAP: 293 + fw_reg_bmap = &smccc_feat->std_bmap; 294 + fw_reg_features = KVM_ARM_SMCCC_STD_FEATURES; 295 + break; 296 + case KVM_REG_ARM_STD_HYP_BMAP: 297 + fw_reg_bmap = &smccc_feat->std_hyp_bmap; 298 + fw_reg_features = KVM_ARM_SMCCC_STD_HYP_FEATURES; 299 + break; 300 + case KVM_REG_ARM_VENDOR_HYP_BMAP: 301 + fw_reg_bmap = &smccc_feat->vendor_hyp_bmap; 302 + fw_reg_features = KVM_ARM_SMCCC_VENDOR_HYP_FEATURES; 303 + break; 304 + default: 305 + return -ENOENT; 306 + } 307 + 308 + /* Check for unsupported bit */ 309 + if (val & ~fw_reg_features) 310 + return -EINVAL; 311 + 312 + mutex_lock(&kvm->lock); 313 + 314 + if (test_bit(KVM_ARCH_FLAG_HAS_RAN_ONCE, &kvm->arch.flags) && 315 + val != *fw_reg_bmap) { 316 + ret = -EBUSY; 317 + goto out; 318 + } 319 + 320 + WRITE_ONCE(*fw_reg_bmap, val); 321 + out: 322 + mutex_unlock(&kvm->lock); 323 + return ret; 324 + } 325 + 326 + int kvm_arm_set_fw_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg) 327 + { 328 + void __user *uaddr = (void __user *)(long)reg->addr; 329 + u64 val; 330 + int wa_level; 331 + 332 + if (copy_from_user(&val, uaddr, KVM_REG_SIZE(reg->id))) 333 + return -EFAULT; 334 + 335 + switch (reg->id) { 336 + case KVM_REG_ARM_PSCI_VERSION: 337 + { 338 + bool wants_02; 339 + 340 + wants_02 = test_bit(KVM_ARM_VCPU_PSCI_0_2, vcpu->arch.features); 341 + 342 + switch (val) { 343 + case KVM_ARM_PSCI_0_1: 344 + if (wants_02) 345 + return -EINVAL; 346 + vcpu->kvm->arch.psci_version = val; 347 + return 0; 348 + case KVM_ARM_PSCI_0_2: 349 + case KVM_ARM_PSCI_1_0: 350 + case KVM_ARM_PSCI_1_1: 351 + if (!wants_02) 352 + return -EINVAL; 353 + vcpu->kvm->arch.psci_version = val; 354 + return 0; 355 + } 356 + break; 357 + } 358 + 359 + case KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1: 360 + case KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_3: 361 + 
if (val & ~KVM_REG_FEATURE_LEVEL_MASK) 362 + return -EINVAL; 363 + 364 + if (get_kernel_wa_level(reg->id) < val) 365 + return -EINVAL; 366 + 367 + return 0; 368 + 369 + case KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2: 370 + if (val & ~(KVM_REG_FEATURE_LEVEL_MASK | 371 + KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_ENABLED)) 372 + return -EINVAL; 373 + 374 + /* The enabled bit must not be set unless the level is AVAIL. */ 375 + if ((val & KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_ENABLED) && 376 + (val & KVM_REG_FEATURE_LEVEL_MASK) != KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_AVAIL) 377 + return -EINVAL; 378 + 379 + /* 380 + * Map all the possible incoming states to the only two we 381 + * really want to deal with. 382 + */ 383 + switch (val & KVM_REG_FEATURE_LEVEL_MASK) { 384 + case KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_NOT_AVAIL: 385 + case KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_UNKNOWN: 386 + wa_level = KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_NOT_AVAIL; 387 + break; 388 + case KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_AVAIL: 389 + case KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_NOT_REQUIRED: 390 + wa_level = KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_NOT_REQUIRED; 391 + break; 392 + default: 393 + return -EINVAL; 394 + } 395 + 396 + /* 397 + * We can deal with NOT_AVAIL on NOT_REQUIRED, but not the 398 + * other way around. 399 + */ 400 + if (get_kernel_wa_level(reg->id) < wa_level) 401 + return -EINVAL; 402 + 403 + return 0; 404 + case KVM_REG_ARM_STD_BMAP: 405 + case KVM_REG_ARM_STD_HYP_BMAP: 406 + case KVM_REG_ARM_VENDOR_HYP_BMAP: 407 + return kvm_arm_set_fw_reg_bmap(vcpu, reg->id, val); 408 + default: 409 + return -ENOENT; 410 + } 411 + 412 + return -EINVAL; 228 413 }
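The gating scheme above pairs a per-VM feature bitmap with a compile-time mask of the bits the kernel knows about: calls are serviced only if their bit is set, and userspace writes to the firmware register are rejected if they set unknown bits. A minimal sketch of both halves (the bit indices and -EINVAL literal here are illustrative; the real definitions live in the KVM UAPI headers):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit indices mirroring KVM_REG_ARM_STD_BMAP_BIT_COUNT. */
enum { STD_BIT_TRNG_V1_0 = 0, STD_BMAP_BIT_COUNT };

/* GENMASK(COUNT - 1, 0): every feature bit the kernel can expose. */
#define STD_FEATURES	((1UL << STD_BMAP_BIT_COUNT) - 1)

struct smccc_features {
	unsigned long std_bmap;
};

/* Gate a TRNG call on its feature bit, as kvm_hvc_call_allowed() does. */
static bool trng_call_allowed(const struct smccc_features *feat)
{
	return feat->std_bmap & (1UL << STD_BIT_TRNG_V1_0);
}

/* Userspace write path: reject any bit outside the supported mask. */
static int set_std_bmap(struct smccc_features *feat, unsigned long val)
{
	if (val & ~STD_FEATURES)
		return -22;	/* -EINVAL */
	feat->std_bmap = val;
	return 0;
}
```

Clearing a bit makes the corresponding call fall through to the SMCCC_RET_NOT_SUPPORTED default, which is how userspace opts a guest out of a hypercall range.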
+47 -25
arch/arm64/kvm/mmu.c
··· 258 258 return true; 259 259 } 260 260 261 - static int __create_hyp_mappings(unsigned long start, unsigned long size, 262 - unsigned long phys, enum kvm_pgtable_prot prot) 261 + int __create_hyp_mappings(unsigned long start, unsigned long size, 262 + unsigned long phys, enum kvm_pgtable_prot prot) 263 263 { 264 264 int err; 265 265 ··· 457 457 return 0; 458 458 } 459 459 460 - static int __create_hyp_private_mapping(phys_addr_t phys_addr, size_t size, 461 - unsigned long *haddr, 462 - enum kvm_pgtable_prot prot) 460 + 461 + /** 462 + * hyp_alloc_private_va_range - Allocates a private VA range. 463 + * @size: The size of the VA range to reserve. 464 + * @haddr: The hypervisor virtual start address of the allocation. 465 + * 466 + * The private virtual address (VA) range is allocated below io_map_base 467 + * and aligned based on the order of @size. 468 + * 469 + * Return: 0 on success or negative error code on failure. 470 + */ 471 + int hyp_alloc_private_va_range(size_t size, unsigned long *haddr) 463 472 { 464 473 unsigned long base; 465 474 int ret = 0; 466 - 467 - if (!kvm_host_owns_hyp_mappings()) { 468 - base = kvm_call_hyp_nvhe(__pkvm_create_private_mapping, 469 - phys_addr, size, prot); 470 - if (IS_ERR_OR_NULL((void *)base)) 471 - return PTR_ERR((void *)base); 472 - *haddr = base; 473 - 474 - return 0; 475 - } 476 475 477 476 mutex_lock(&kvm_hyp_pgd_mutex); 478 477 ··· 483 484 * 484 485 * The allocated size is always a multiple of PAGE_SIZE. 
485 486 */ 486 - size = PAGE_ALIGN(size + offset_in_page(phys_addr)); 487 - base = io_map_base - size; 487 + base = io_map_base - PAGE_ALIGN(size); 488 + 489 + /* Align the allocation based on the order of its size */ 490 + base = ALIGN_DOWN(base, PAGE_SIZE << get_order(size)); 488 491 489 492 /* 490 493 * Verify that BIT(VA_BITS - 1) hasn't been flipped by ··· 496 495 if ((base ^ io_map_base) & BIT(VA_BITS - 1)) 497 496 ret = -ENOMEM; 498 497 else 499 - io_map_base = base; 498 + *haddr = io_map_base = base; 500 499 501 500 mutex_unlock(&kvm_hyp_pgd_mutex); 502 501 503 - if (ret) 504 - goto out; 502 + return ret; 503 + } 505 504 506 - ret = __create_hyp_mappings(base, size, phys_addr, prot); 507 - if (ret) 508 - goto out; 505 + static int __create_hyp_private_mapping(phys_addr_t phys_addr, size_t size, 506 + unsigned long *haddr, 507 + enum kvm_pgtable_prot prot) 508 + { 509 + unsigned long addr; 510 + int ret = 0; 509 511 510 - *haddr = base + offset_in_page(phys_addr); 511 - out: 512 + if (!kvm_host_owns_hyp_mappings()) { 513 + addr = kvm_call_hyp_nvhe(__pkvm_create_private_mapping, 514 + phys_addr, size, prot); 515 + if (IS_ERR_VALUE(addr)) 516 + return addr; 517 + *haddr = addr; 518 + 519 + return 0; 520 + } 521 + 522 + size = PAGE_ALIGN(size + offset_in_page(phys_addr)); 523 + ret = hyp_alloc_private_va_range(size, &addr); 524 + if (ret) 525 + return ret; 526 + 527 + ret = __create_hyp_mappings(addr, size, phys_addr, prot); 528 + if (ret) 529 + return ret; 530 + 531 + *haddr = addr + offset_in_page(phys_addr); 512 532 return ret; 513 533 } 514 534
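On the host side the private range grows downwards from io_map_base, so the base is first lowered by the page-aligned size and then rounded down to the same order-of-size boundary. A sketch under the same illustrative helpers as on the nVHE side:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)

/* Round up to a whole number of pages (cf. PAGE_ALIGN()). */
static uint64_t page_align(uint64_t x)
{
	return (x + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1);
}

/* Round down to a power-of-two boundary (cf. ALIGN_DOWN()). */
static uint64_t align_down(uint64_t x, uint64_t a)
{
	return x & ~(a - 1);
}

/* Smallest order such that PAGE_SIZE << order covers size (cf. get_order()). */
static unsigned int order_of(size_t size)
{
	unsigned int order = 0;

	while ((PAGE_SIZE << order) < size)
		order++;
	return order;
}

/* Base chosen by the downward-growing allocator, as in hyp_alloc_private_va_range(). */
static uint64_t pick_va_down(uint64_t io_map_base, size_t size)
{
	uint64_t base = io_map_base - page_align(size);

	return align_down(base, PAGE_SIZE << order_of(size));
}
```

The real function additionally rejects the result if the subtraction flipped BIT(VA_BITS - 1), i.e. if the downward growth ran out of the hyp VA space.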
+1 -2
arch/arm64/kvm/pmu-emul.c
··· 774 774 { 775 775 struct arm_pmu_entry *entry; 776 776 777 - if (pmu->pmuver == 0 || pmu->pmuver == ID_AA64DFR0_PMUVER_IMP_DEF || 778 - is_protected_kvm_enabled()) 777 + if (pmu->pmuver == 0 || pmu->pmuver == ID_AA64DFR0_PMUVER_IMP_DEF) 779 778 return; 780 779 781 780 mutex_lock(&arm_pmus_lock);
+23 -17
arch/arm64/kvm/pmu.c
··· 5 5 */ 6 6 #include <linux/kvm_host.h> 7 7 #include <linux/perf_event.h> 8 - #include <asm/kvm_hyp.h> 8 + 9 + static DEFINE_PER_CPU(struct kvm_pmu_events, kvm_pmu_events); 9 10 10 11 /* 11 12 * Given the perf event attributes and system type, determine ··· 26 25 return (attr->exclude_host != attr->exclude_guest); 27 26 } 28 27 28 + struct kvm_pmu_events *kvm_get_pmu_events(void) 29 + { 30 + return this_cpu_ptr(&kvm_pmu_events); 31 + } 32 + 29 33 /* 30 34 * Add events to track that we may want to switch at guest entry/exit 31 35 * time. 32 36 */ 33 37 void kvm_set_pmu_events(u32 set, struct perf_event_attr *attr) 34 38 { 35 - struct kvm_host_data *ctx = this_cpu_ptr_hyp_sym(kvm_host_data); 39 + struct kvm_pmu_events *pmu = kvm_get_pmu_events(); 36 40 37 - if (!kvm_arm_support_pmu_v3() || !ctx || !kvm_pmu_switch_needed(attr)) 41 + if (!kvm_arm_support_pmu_v3() || !pmu || !kvm_pmu_switch_needed(attr)) 38 42 return; 39 43 40 44 if (!attr->exclude_host) 41 - ctx->pmu_events.events_host |= set; 45 + pmu->events_host |= set; 42 46 if (!attr->exclude_guest) 43 - ctx->pmu_events.events_guest |= set; 47 + pmu->events_guest |= set; 44 48 } 45 49 46 50 /* ··· 53 47 */ 54 48 void kvm_clr_pmu_events(u32 clr) 55 49 { 56 - struct kvm_host_data *ctx = this_cpu_ptr_hyp_sym(kvm_host_data); 50 + struct kvm_pmu_events *pmu = kvm_get_pmu_events(); 57 51 58 - if (!kvm_arm_support_pmu_v3() || !ctx) 52 + if (!kvm_arm_support_pmu_v3() || !pmu) 59 53 return; 60 54 61 - ctx->pmu_events.events_host &= ~clr; 62 - ctx->pmu_events.events_guest &= ~clr; 55 + pmu->events_host &= ~clr; 56 + pmu->events_guest &= ~clr; 63 57 } 64 58 65 59 #define PMEVTYPER_READ_CASE(idx) \ ··· 175 169 */ 176 170 void kvm_vcpu_pmu_restore_guest(struct kvm_vcpu *vcpu) 177 171 { 178 - struct kvm_host_data *host; 172 + struct kvm_pmu_events *pmu; 179 173 u32 events_guest, events_host; 180 174 181 175 if (!kvm_arm_support_pmu_v3() || !has_vhe()) 182 176 return; 183 177 184 178 preempt_disable(); 185 - host = this_cpu_ptr_hyp_sym(kvm_host_data); 186 - events_guest = host->pmu_events.events_guest; 187 - events_host = host->pmu_events.events_host; 179 + pmu = kvm_get_pmu_events(); 180 + events_guest = pmu->events_guest; 181 + events_host = pmu->events_host; 188 182 189 183 kvm_vcpu_pmu_enable_el0(events_guest); 190 184 kvm_vcpu_pmu_disable_el0(events_host); ··· 196 190 */ 197 191 void kvm_vcpu_pmu_restore_host(struct kvm_vcpu *vcpu) 198 192 { 199 - struct kvm_host_data *host; 193 + struct kvm_pmu_events *pmu; 200 194 u32 events_guest, events_host; 201 195 202 196 if (!kvm_arm_support_pmu_v3() || !has_vhe()) 203 197 return; 204 198 205 - host = this_cpu_ptr_hyp_sym(kvm_host_data); 206 - events_guest = host->pmu_events.events_guest; 207 - events_host = host->pmu_events.events_host; 199 + pmu = kvm_get_pmu_events(); 200 + events_guest = pmu->events_guest; 201 + events_host = pmu->events_host; 208 202 209 203 kvm_vcpu_pmu_enable_el0(events_host); 210 204 kvm_vcpu_pmu_disable_el0(events_guest);
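kvm_get_pmu_events() replaces the kvm_host_data indirection with a plain per-CPU variable. A rough model of that accessor using an array indexed by CPU id (NR_CPUS, this_cpu, and the function names are stand-ins for the kernel's per-CPU machinery):

```c
#include <assert.h>

#define NR_CPUS	4

struct kvm_pmu_events {
	unsigned int events_host;
	unsigned int events_guest;
};

/* Model of DEFINE_PER_CPU(struct kvm_pmu_events, kvm_pmu_events). */
static struct kvm_pmu_events pmu_events[NR_CPUS];
static int this_cpu;	/* stand-in for the current CPU id */

/* Model of this_cpu_ptr(&kvm_pmu_events). */
static struct kvm_pmu_events *get_pmu_events(void)
{
	return &pmu_events[this_cpu];
}

/* Track events to switch at guest entry/exit, cf. kvm_set_pmu_events(). */
static void set_pmu_events(unsigned int set, int exclude_host, int exclude_guest)
{
	struct kvm_pmu_events *pmu = get_pmu_events();

	if (!exclude_host)
		pmu->events_host |= set;
	if (!exclude_guest)
		pmu->events_guest |= set;
}
```

Each CPU keeps its own host/guest event sets, so the vcpu entry path only has to consult the local pair.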
+42 -206
arch/arm64/kvm/psci.c
··· 51 51 return PSCI_RET_SUCCESS; 52 52 } 53 53 54 - static void kvm_psci_vcpu_off(struct kvm_vcpu *vcpu) 55 - { 56 - vcpu->arch.power_off = true; 57 - kvm_make_request(KVM_REQ_SLEEP, vcpu); 58 - kvm_vcpu_kick(vcpu); 59 - } 60 - 61 54 static inline bool kvm_psci_valid_affinity(struct kvm_vcpu *vcpu, 62 55 unsigned long affinity) 63 56 { ··· 76 83 */ 77 84 if (!vcpu) 78 85 return PSCI_RET_INVALID_PARAMS; 79 - if (!vcpu->arch.power_off) { 86 + if (!kvm_arm_vcpu_stopped(vcpu)) { 80 87 if (kvm_psci_version(source_vcpu) != KVM_ARM_PSCI_0_1) 81 88 return PSCI_RET_ALREADY_ON; 82 89 else ··· 100 107 kvm_make_request(KVM_REQ_VCPU_RESET, vcpu); 101 108 102 109 /* 103 - * Make sure the reset request is observed if the change to 104 - * power_off is observed. 110 + * Make sure the reset request is observed if the RUNNABLE mp_state is 111 + * observed. 105 112 */ 106 113 smp_wmb(); 107 114 108 - vcpu->arch.power_off = false; 115 + vcpu->arch.mp_state.mp_state = KVM_MP_STATE_RUNNABLE; 109 116 kvm_vcpu_wake_up(vcpu); 110 117 111 118 return PSCI_RET_SUCCESS; ··· 143 150 mpidr = kvm_vcpu_get_mpidr_aff(tmp); 144 151 if ((mpidr & target_affinity_mask) == target_affinity) { 145 152 matching_cpus++; 146 - if (!tmp->arch.power_off) 153 + if (!kvm_arm_vcpu_stopped(tmp)) 147 154 return PSCI_0_2_AFFINITY_LEVEL_ON; 148 155 } 149 156 } ··· 169 176 * re-initialized. 
170 177 */ 171 178 kvm_for_each_vcpu(i, tmp, vcpu->kvm) 172 - tmp->arch.power_off = true; 179 + tmp->arch.mp_state.mp_state = KVM_MP_STATE_STOPPED; 173 180 kvm_make_all_cpus_request(vcpu->kvm, KVM_REQ_SLEEP); 174 181 175 182 memset(&vcpu->run->system_event, 0, sizeof(vcpu->run->system_event)); ··· 193 200 { 194 201 kvm_prepare_system_event(vcpu, KVM_SYSTEM_EVENT_RESET, 195 202 KVM_SYSTEM_EVENT_RESET_FLAG_PSCI_RESET2); 203 + } 204 + 205 + static void kvm_psci_system_suspend(struct kvm_vcpu *vcpu) 206 + { 207 + struct kvm_run *run = vcpu->run; 208 + 209 + memset(&run->system_event, 0, sizeof(vcpu->run->system_event)); 210 + run->system_event.type = KVM_SYSTEM_EVENT_SUSPEND; 211 + run->exit_reason = KVM_EXIT_SYSTEM_EVENT; 196 212 } 197 213 198 214 static void kvm_psci_narrow_to_32bit(struct kvm_vcpu *vcpu) ··· 247 245 val = kvm_psci_vcpu_suspend(vcpu); 248 246 break; 249 247 case PSCI_0_2_FN_CPU_OFF: 250 - kvm_psci_vcpu_off(vcpu); 248 + kvm_arm_vcpu_power_off(vcpu); 251 249 val = PSCI_RET_SUCCESS; 252 250 break; 253 251 case PSCI_0_2_FN_CPU_ON: ··· 307 305 308 306 static int kvm_psci_1_x_call(struct kvm_vcpu *vcpu, u32 minor) 309 307 { 308 + unsigned long val = PSCI_RET_NOT_SUPPORTED; 310 309 u32 psci_fn = smccc_get_function(vcpu); 310 + struct kvm *kvm = vcpu->kvm; 311 311 u32 arg; 312 - unsigned long val; 313 312 int ret = 1; 314 313 315 314 switch(psci_fn) { ··· 322 319 val = kvm_psci_check_allowed_function(vcpu, arg); 323 320 if (val) 324 321 break; 322 + 323 + val = PSCI_RET_NOT_SUPPORTED; 325 324 326 325 switch(arg) { 327 326 case PSCI_0_2_FN_PSCI_VERSION: ··· 341 336 case ARM_SMCCC_VERSION_FUNC_ID: 342 337 val = 0; 343 338 break; 339 + case PSCI_1_0_FN_SYSTEM_SUSPEND: 340 + case PSCI_1_0_FN64_SYSTEM_SUSPEND: 341 + if (test_bit(KVM_ARCH_FLAG_SYSTEM_SUSPEND_ENABLED, &kvm->arch.flags)) 342 + val = 0; 343 + break; 344 344 case PSCI_1_1_FN_SYSTEM_RESET2: 345 345 case PSCI_1_1_FN64_SYSTEM_RESET2: 346 - if (minor >= 1) { 346 + if (minor >= 1) 347 347 val = 0; 348 - 
break; 349 - } 350 - fallthrough; 351 - default: 352 - val = PSCI_RET_NOT_SUPPORTED; 353 348 break; 349 + } 350 + break; 351 + case PSCI_1_0_FN_SYSTEM_SUSPEND: 352 + kvm_psci_narrow_to_32bit(vcpu); 353 + fallthrough; 354 + case PSCI_1_0_FN64_SYSTEM_SUSPEND: 355 + /* 356 + * Return directly to userspace without changing the vCPU's 357 + * registers. Userspace depends on reading the SMCCC parameters 358 + * to implement SYSTEM_SUSPEND. 359 + */ 360 + if (test_bit(KVM_ARCH_FLAG_SYSTEM_SUSPEND_ENABLED, &kvm->arch.flags)) { 361 + kvm_psci_system_suspend(vcpu); 362 + return 0; 354 363 } 355 364 break; 356 365 case PSCI_1_1_FN_SYSTEM_RESET2: ··· 384 365 val = PSCI_RET_INVALID_PARAMS; 385 366 break; 386 367 } 387 - fallthrough; 368 + break; 388 369 default: 389 370 return kvm_psci_0_2_call(vcpu); 390 371 } ··· 401 382 402 383 switch (psci_fn) { 403 384 case KVM_PSCI_FN_CPU_OFF: 404 - kvm_psci_vcpu_off(vcpu); 385 + kvm_arm_vcpu_power_off(vcpu); 405 386 val = PSCI_RET_SUCCESS; 406 387 break; 407 388 case KVM_PSCI_FN_CPU_ON: ··· 455 436 default: 456 437 return -EINVAL; 457 438 } 458 - } 459 - 460 - int kvm_arm_get_fw_num_regs(struct kvm_vcpu *vcpu) 461 - { 462 - return 4; /* PSCI version and three workaround registers */ 463 - } 464 - 465 - int kvm_arm_copy_fw_reg_indices(struct kvm_vcpu *vcpu, u64 __user *uindices) 466 - { 467 - if (put_user(KVM_REG_ARM_PSCI_VERSION, uindices++)) 468 - return -EFAULT; 469 - 470 - if (put_user(KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1, uindices++)) 471 - return -EFAULT; 472 - 473 - if (put_user(KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2, uindices++)) 474 - return -EFAULT; 475 - 476 - if (put_user(KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_3, uindices++)) 477 - return -EFAULT; 478 - 479 - return 0; 480 - } 481 - 482 - #define KVM_REG_FEATURE_LEVEL_WIDTH 4 483 - #define KVM_REG_FEATURE_LEVEL_MASK (BIT(KVM_REG_FEATURE_LEVEL_WIDTH) - 1) 484 - 485 - /* 486 - * Convert the workaround level into an easy-to-compare number, where higher 487 - * values mean better 
protection. 488 - */ 489 - static int get_kernel_wa_level(u64 regid) 490 - { 491 - switch (regid) { 492 - case KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1: 493 - switch (arm64_get_spectre_v2_state()) { 494 - case SPECTRE_VULNERABLE: 495 - return KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_NOT_AVAIL; 496 - case SPECTRE_MITIGATED: 497 - return KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_AVAIL; 498 - case SPECTRE_UNAFFECTED: 499 - return KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_NOT_REQUIRED; 500 - } 501 - return KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_NOT_AVAIL; 502 - case KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2: 503 - switch (arm64_get_spectre_v4_state()) { 504 - case SPECTRE_MITIGATED: 505 - /* 506 - * As for the hypercall discovery, we pretend we 507 - * don't have any FW mitigation if SSBS is there at 508 - * all times. 509 - */ 510 - if (cpus_have_final_cap(ARM64_SSBS)) 511 - return KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_NOT_AVAIL; 512 - fallthrough; 513 - case SPECTRE_UNAFFECTED: 514 - return KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_NOT_REQUIRED; 515 - case SPECTRE_VULNERABLE: 516 - return KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_NOT_AVAIL; 517 - } 518 - break; 519 - case KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_3: 520 - switch (arm64_get_spectre_bhb_state()) { 521 - case SPECTRE_VULNERABLE: 522 - return KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_3_NOT_AVAIL; 523 - case SPECTRE_MITIGATED: 524 - return KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_3_AVAIL; 525 - case SPECTRE_UNAFFECTED: 526 - return KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_3_NOT_REQUIRED; 527 - } 528 - return KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_3_NOT_AVAIL; 529 - } 530 - 531 - return -EINVAL; 532 - } 533 - 534 - int kvm_arm_get_fw_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg) 535 - { 536 - void __user *uaddr = (void __user *)(long)reg->addr; 537 - u64 val; 538 - 539 - switch (reg->id) { 540 - case KVM_REG_ARM_PSCI_VERSION: 541 - val = kvm_psci_version(vcpu); 542 - break; 543 - case KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1: 544 - case KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2: 545 
- case KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_3: 546 - val = get_kernel_wa_level(reg->id) & KVM_REG_FEATURE_LEVEL_MASK; 547 - break; 548 - default: 549 - return -ENOENT; 550 - } 551 - 552 - if (copy_to_user(uaddr, &val, KVM_REG_SIZE(reg->id))) 553 - return -EFAULT; 554 - 555 - return 0; 556 - } 557 - 558 - int kvm_arm_set_fw_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg) 559 - { 560 - void __user *uaddr = (void __user *)(long)reg->addr; 561 - u64 val; 562 - int wa_level; 563 - 564 - if (copy_from_user(&val, uaddr, KVM_REG_SIZE(reg->id))) 565 - return -EFAULT; 566 - 567 - switch (reg->id) { 568 - case KVM_REG_ARM_PSCI_VERSION: 569 - { 570 - bool wants_02; 571 - 572 - wants_02 = test_bit(KVM_ARM_VCPU_PSCI_0_2, vcpu->arch.features); 573 - 574 - switch (val) { 575 - case KVM_ARM_PSCI_0_1: 576 - if (wants_02) 577 - return -EINVAL; 578 - vcpu->kvm->arch.psci_version = val; 579 - return 0; 580 - case KVM_ARM_PSCI_0_2: 581 - case KVM_ARM_PSCI_1_0: 582 - case KVM_ARM_PSCI_1_1: 583 - if (!wants_02) 584 - return -EINVAL; 585 - vcpu->kvm->arch.psci_version = val; 586 - return 0; 587 - } 588 - break; 589 - } 590 - 591 - case KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1: 592 - case KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_3: 593 - if (val & ~KVM_REG_FEATURE_LEVEL_MASK) 594 - return -EINVAL; 595 - 596 - if (get_kernel_wa_level(reg->id) < val) 597 - return -EINVAL; 598 - 599 - return 0; 600 - 601 - case KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2: 602 - if (val & ~(KVM_REG_FEATURE_LEVEL_MASK | 603 - KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_ENABLED)) 604 - return -EINVAL; 605 - 606 - /* The enabled bit must not be set unless the level is AVAIL. */ 607 - if ((val & KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_ENABLED) && 608 - (val & KVM_REG_FEATURE_LEVEL_MASK) != KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_AVAIL) 609 - return -EINVAL; 610 - 611 - /* 612 - * Map all the possible incoming states to the only two we 613 - * really want to deal with. 
614 - */ 615 - switch (val & KVM_REG_FEATURE_LEVEL_MASK) { 616 - case KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_NOT_AVAIL: 617 - case KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_UNKNOWN: 618 - wa_level = KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_NOT_AVAIL; 619 - break; 620 - case KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_AVAIL: 621 - case KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_NOT_REQUIRED: 622 - wa_level = KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_NOT_REQUIRED; 623 - break; 624 - default: 625 - return -EINVAL; 626 - } 627 - 628 - /* 629 - * We can deal with NOT_AVAIL on NOT_REQUIRED, but not the 630 - * other way around. 631 - */ 632 - if (get_kernel_wa_level(reg->id) < wa_level) 633 - return -EINVAL; 634 - 635 - return 0; 636 - default: 637 - return -ENOENT; 638 - } 639 - 640 - return -EINVAL; 641 439 }
+220 -74
arch/arm64/kvm/sys_regs.c
··· 1145 1145 if (!vcpu_has_ptrauth(vcpu)) 1146 1146 val &= ~(ARM64_FEATURE_MASK(ID_AA64ISAR2_APA3) | 1147 1147 ARM64_FEATURE_MASK(ID_AA64ISAR2_GPA3)); 1148 + if (!cpus_have_final_cap(ARM64_HAS_WFXT)) 1149 + val &= ~ARM64_FEATURE_MASK(ID_AA64ISAR2_WFXT); 1148 1150 break; 1149 1151 case SYS_ID_AA64DFR0_EL1: 1150 1152 /* Limit debug to ARMv8.0 */ ··· 2022 2020 { Op1( 0), CRm( 2), .access = trap_raz_wi }, 2023 2021 }; 2024 2022 2023 + #define CP15_PMU_SYS_REG(_map, _Op1, _CRn, _CRm, _Op2) \ 2024 + AA32(_map), \ 2025 + Op1(_Op1), CRn(_CRn), CRm(_CRm), Op2(_Op2), \ 2026 + .visibility = pmu_visibility 2027 + 2025 2028 /* Macro to expand the PMEVCNTRn register */ 2026 2029 #define PMU_PMEVCNTR(n) \ 2027 - /* PMEVCNTRn */ \ 2028 - { Op1(0), CRn(0b1110), \ 2029 - CRm((0b1000 | (((n) >> 3) & 0x3))), Op2(((n) & 0x7)), \ 2030 - access_pmu_evcntr } 2030 + { CP15_PMU_SYS_REG(DIRECT, 0, 0b1110, \ 2031 + (0b1000 | (((n) >> 3) & 0x3)), ((n) & 0x7)), \ 2032 + .access = access_pmu_evcntr } 2031 2033 2032 2034 /* Macro to expand the PMEVTYPERn register */ 2033 2035 #define PMU_PMEVTYPER(n) \ 2034 - /* PMEVTYPERn */ \ 2035 - { Op1(0), CRn(0b1110), \ 2036 - CRm((0b1100 | (((n) >> 3) & 0x3))), Op2(((n) & 0x7)), \ 2037 - access_pmu_evtyper } 2038 - 2036 + { CP15_PMU_SYS_REG(DIRECT, 0, 0b1110, \ 2037 + (0b1100 | (((n) >> 3) & 0x3)), ((n) & 0x7)), \ 2038 + .access = access_pmu_evtyper } 2039 2039 /* 2040 2040 * Trapped cp15 registers. 
TTBR0/TTBR1 get a double encoding, 2041 2041 * depending on the way they are accessed (as a 32bit or a 64bit ··· 2077 2073 { Op1( 0), CRn( 7), CRm(14), Op2( 2), access_dcsw }, 2078 2074 2079 2075 /* PMU */ 2080 - { Op1( 0), CRn( 9), CRm(12), Op2( 0), access_pmcr }, 2081 - { Op1( 0), CRn( 9), CRm(12), Op2( 1), access_pmcnten }, 2082 - { Op1( 0), CRn( 9), CRm(12), Op2( 2), access_pmcnten }, 2083 - { Op1( 0), CRn( 9), CRm(12), Op2( 3), access_pmovs }, 2084 - { Op1( 0), CRn( 9), CRm(12), Op2( 4), access_pmswinc }, 2085 - { Op1( 0), CRn( 9), CRm(12), Op2( 5), access_pmselr }, 2086 - { AA32(LO), Op1( 0), CRn( 9), CRm(12), Op2( 6), access_pmceid }, 2087 - { AA32(LO), Op1( 0), CRn( 9), CRm(12), Op2( 7), access_pmceid }, 2088 - { Op1( 0), CRn( 9), CRm(13), Op2( 0), access_pmu_evcntr }, 2089 - { Op1( 0), CRn( 9), CRm(13), Op2( 1), access_pmu_evtyper }, 2090 - { Op1( 0), CRn( 9), CRm(13), Op2( 2), access_pmu_evcntr }, 2091 - { Op1( 0), CRn( 9), CRm(14), Op2( 0), access_pmuserenr }, 2092 - { Op1( 0), CRn( 9), CRm(14), Op2( 1), access_pminten }, 2093 - { Op1( 0), CRn( 9), CRm(14), Op2( 2), access_pminten }, 2094 - { Op1( 0), CRn( 9), CRm(14), Op2( 3), access_pmovs }, 2095 - { AA32(HI), Op1( 0), CRn( 9), CRm(14), Op2( 4), access_pmceid }, 2096 - { AA32(HI), Op1( 0), CRn( 9), CRm(14), Op2( 5), access_pmceid }, 2076 + { CP15_PMU_SYS_REG(DIRECT, 0, 9, 12, 0), .access = access_pmcr }, 2077 + { CP15_PMU_SYS_REG(DIRECT, 0, 9, 12, 1), .access = access_pmcnten }, 2078 + { CP15_PMU_SYS_REG(DIRECT, 0, 9, 12, 2), .access = access_pmcnten }, 2079 + { CP15_PMU_SYS_REG(DIRECT, 0, 9, 12, 3), .access = access_pmovs }, 2080 + { CP15_PMU_SYS_REG(DIRECT, 0, 9, 12, 4), .access = access_pmswinc }, 2081 + { CP15_PMU_SYS_REG(DIRECT, 0, 9, 12, 5), .access = access_pmselr }, 2082 + { CP15_PMU_SYS_REG(LO, 0, 9, 12, 6), .access = access_pmceid }, 2083 + { CP15_PMU_SYS_REG(LO, 0, 9, 12, 7), .access = access_pmceid }, 2084 + { CP15_PMU_SYS_REG(DIRECT, 0, 9, 13, 0), .access = access_pmu_evcntr }, 2085 + { 
CP15_PMU_SYS_REG(DIRECT, 0, 9, 13, 1), .access = access_pmu_evtyper }, 2086 + { CP15_PMU_SYS_REG(DIRECT, 0, 9, 13, 2), .access = access_pmu_evcntr }, 2087 + { CP15_PMU_SYS_REG(DIRECT, 0, 9, 14, 0), .access = access_pmuserenr }, 2088 + { CP15_PMU_SYS_REG(DIRECT, 0, 9, 14, 1), .access = access_pminten }, 2089 + { CP15_PMU_SYS_REG(DIRECT, 0, 9, 14, 2), .access = access_pminten }, 2090 + { CP15_PMU_SYS_REG(DIRECT, 0, 9, 14, 3), .access = access_pmovs }, 2091 + { CP15_PMU_SYS_REG(HI, 0, 9, 14, 4), .access = access_pmceid }, 2092 + { CP15_PMU_SYS_REG(HI, 0, 9, 14, 5), .access = access_pmceid }, 2097 2093 /* PMMIR */ 2098 - { Op1( 0), CRn( 9), CRm(14), Op2( 6), trap_raz_wi }, 2094 + { CP15_PMU_SYS_REG(DIRECT, 0, 9, 14, 6), .access = trap_raz_wi }, 2099 2095 2100 2096 /* PRRR/MAIR0 */ 2101 2097 { AA32(LO), Op1( 0), CRn(10), CRm( 2), Op2( 0), access_vm_reg, NULL, MAIR_EL1 }, ··· 2180 2176 PMU_PMEVTYPER(29), 2181 2177 PMU_PMEVTYPER(30), 2182 2178 /* PMCCFILTR */ 2183 - { Op1(0), CRn(14), CRm(15), Op2(7), access_pmu_evtyper }, 2179 + { CP15_PMU_SYS_REG(DIRECT, 0, 14, 15, 7), .access = access_pmu_evtyper }, 2184 2180 2185 2181 { Op1(1), CRn( 0), CRm( 0), Op2(0), access_ccsidr }, 2186 2182 { Op1(1), CRn( 0), CRm( 0), Op2(1), access_clidr }, ··· 2189 2185 2190 2186 static const struct sys_reg_desc cp15_64_regs[] = { 2191 2187 { Op1( 0), CRn( 0), CRm( 2), Op2( 0), access_vm_reg, NULL, TTBR0_EL1 }, 2192 - { Op1( 0), CRn( 0), CRm( 9), Op2( 0), access_pmu_evcntr }, 2188 + { CP15_PMU_SYS_REG(DIRECT, 0, 0, 9, 0), .access = access_pmu_evcntr }, 2193 2189 { Op1( 0), CRn( 0), CRm(12), Op2( 0), access_gic_sgi }, /* ICC_SGI1R */ 2194 2190 { Op1( 1), CRn( 0), CRm( 2), Op2( 0), access_vm_reg, NULL, TTBR1_EL1 }, 2195 2191 { Op1( 1), CRn( 0), CRm(12), Op2( 0), access_gic_sgi }, /* ICC_ASGI1R */ ··· 2197 2193 { SYS_DESC(SYS_AARCH32_CNTP_CVAL), access_arch_timer }, 2198 2194 }; 2199 2195 2200 - static int check_sysreg_table(const struct sys_reg_desc *table, unsigned int n, 2201 - bool is_32) 
2196 + static bool check_sysreg_table(const struct sys_reg_desc *table, unsigned int n, 2197 + bool is_32) 2202 2198 { 2203 2199 unsigned int i; 2204 2200 2205 2201 for (i = 0; i < n; i++) { 2206 2202 if (!is_32 && table[i].reg && !table[i].reset) { 2207 - kvm_err("sys_reg table %p entry %d has lacks reset\n", 2208 - table, i); 2209 - return 1; 2203 + kvm_err("sys_reg table %pS entry %d lacks reset\n", &table[i], i); 2204 + return false; 2210 2205 } 2211 2206 2212 2207 if (i && cmp_sys_reg(&table[i-1], &table[i]) >= 0) { 2213 - kvm_err("sys_reg table %p out of order (%d)\n", table, i - 1); 2214 - return 1; 2208 + kvm_err("sys_reg table %pS entry %d out of order\n", &table[i - 1], i - 1); 2209 + return false; 2215 2210 } 2216 2211 } 2217 2212 2218 - return 0; 2213 + return true; 2219 2214 } 2220 2215 2221 2216 int kvm_handle_cp14_load_store(struct kvm_vcpu *vcpu) ··· 2255 2252 * @table: array of trap descriptors 2256 2253 * @num: size of the trap descriptor array 2257 2254 * 2258 - * Return 0 if the access has been handled, and -1 if not. 2255 + * Return true if the access has been handled, false if not. 
2259 2256 */ 2260 - static int emulate_cp(struct kvm_vcpu *vcpu, 2261 - struct sys_reg_params *params, 2262 - const struct sys_reg_desc *table, 2263 - size_t num) 2257 + static bool emulate_cp(struct kvm_vcpu *vcpu, 2258 + struct sys_reg_params *params, 2259 + const struct sys_reg_desc *table, 2260 + size_t num) 2264 2261 { 2265 2262 const struct sys_reg_desc *r; 2266 2263 2267 2264 if (!table) 2268 - return -1; /* Not handled */ 2265 + return false; /* Not handled */ 2269 2266 2270 2267 r = find_reg(params, table, num); 2271 2268 2272 2269 if (r) { 2273 2270 perform_access(vcpu, params, r); 2274 - return 0; 2271 + return true; 2275 2272 } 2276 2273 2277 2274 /* Not handled */ 2278 - return -1; 2275 + return false; 2279 2276 } 2280 2277 2281 2278 static void unhandled_cp_access(struct kvm_vcpu *vcpu, ··· 2339 2336 * potential register operation in the case of a read and return 2340 2337 * with success. 2341 2338 */ 2342 - if (!emulate_cp(vcpu, &params, global, nr_global)) { 2339 + if (emulate_cp(vcpu, &params, global, nr_global)) { 2343 2340 /* Split up the value between registers for the read side */ 2344 2341 if (!params.is_write) { 2345 2342 vcpu_set_reg(vcpu, Rt, lower_32_bits(params.regval)); ··· 2353 2350 return 1; 2354 2351 } 2355 2352 2353 + static bool emulate_sys_reg(struct kvm_vcpu *vcpu, struct sys_reg_params *params); 2354 + 2355 + /* 2356 + * The CP10 ID registers are architecturally mapped to AArch64 feature 2357 + * registers. Abuse that fact so we can rely on the AArch64 handler for accesses 2358 + * from AArch32. 
2359 + */ 2360 + static bool kvm_esr_cp10_id_to_sys64(u64 esr, struct sys_reg_params *params) 2361 + { 2362 + u8 reg_id = (esr >> 10) & 0xf; 2363 + bool valid; 2364 + 2365 + params->is_write = ((esr & 1) == 0); 2366 + params->Op0 = 3; 2367 + params->Op1 = 0; 2368 + params->CRn = 0; 2369 + params->CRm = 3; 2370 + 2371 + /* CP10 ID registers are read-only */ 2372 + valid = !params->is_write; 2373 + 2374 + switch (reg_id) { 2375 + /* MVFR0 */ 2376 + case 0b0111: 2377 + params->Op2 = 0; 2378 + break; 2379 + /* MVFR1 */ 2380 + case 0b0110: 2381 + params->Op2 = 1; 2382 + break; 2383 + /* MVFR2 */ 2384 + case 0b0101: 2385 + params->Op2 = 2; 2386 + break; 2387 + default: 2388 + valid = false; 2389 + } 2390 + 2391 + if (valid) 2392 + return true; 2393 + 2394 + kvm_pr_unimpl("Unhandled cp10 register %s: %u\n", 2395 + params->is_write ? "write" : "read", reg_id); 2396 + return false; 2397 + } 2398 + 2399 + /** 2400 + * kvm_handle_cp10_id() - Handles a VMRS trap on guest access to a 'Media and 2401 + * VFP Register' from AArch32. 2402 + * @vcpu: The vCPU pointer 2403 + * 2404 + * MVFR{0-2} are architecturally mapped to the AArch64 MVFR{0-2}_EL1 registers. 2405 + * Work out the correct AArch64 system register encoding and reroute to the 2406 + * AArch64 system register emulation. 2407 + */ 2408 + int kvm_handle_cp10_id(struct kvm_vcpu *vcpu) 2409 + { 2410 + int Rt = kvm_vcpu_sys_get_rt(vcpu); 2411 + u64 esr = kvm_vcpu_get_esr(vcpu); 2412 + struct sys_reg_params params; 2413 + 2414 + /* UNDEF on any unhandled register access */ 2415 + if (!kvm_esr_cp10_id_to_sys64(esr, &params)) { 2416 + kvm_inject_undefined(vcpu); 2417 + return 1; 2418 + } 2419 + 2420 + if (emulate_sys_reg(vcpu, &params)) 2421 + vcpu_set_reg(vcpu, Rt, params.regval); 2422 + 2423 + return 1; 2424 + } 2425 + 2426 + /** 2427 + * kvm_emulate_cp15_id_reg() - Handles an MRC trap on a guest CP15 access where 2428 + * CRn=0, which corresponds to the AArch32 feature 2429 + * registers. 
2430 + * @vcpu: the vCPU pointer 2431 + * @params: the system register access parameters. 2432 + * 2433 + * Our cp15 system register tables do not enumerate the AArch32 feature 2434 + * registers. Conveniently, our AArch64 table does, and the AArch32 system 2435 + * register encoding can be trivially remapped into the AArch64 for the feature 2436 + * registers: Append op0=3, leaving op1, CRn, CRm, and op2 the same. 2437 + * 2438 + * According to DDI0487G.b G7.3.1, paragraph "Behavior of VMSAv8-32 32-bit 2439 + * System registers with (coproc=0b1111, CRn==c0)", read accesses from this 2440 + * range are either UNKNOWN or RES0. Rerouting remains architectural as we 2441 + * treat undefined registers in this range as RAZ. 2442 + */ 2443 + static int kvm_emulate_cp15_id_reg(struct kvm_vcpu *vcpu, 2444 + struct sys_reg_params *params) 2445 + { 2446 + int Rt = kvm_vcpu_sys_get_rt(vcpu); 2447 + 2448 + /* Treat impossible writes to RO registers as UNDEFINED */ 2449 + if (params->is_write) { 2450 + unhandled_cp_access(vcpu, params); 2451 + return 1; 2452 + } 2453 + 2454 + params->Op0 = 3; 2455 + 2456 + /* 2457 + * All registers where CRm > 3 are known to be UNKNOWN/RAZ from AArch32. 2458 + * Avoid conflicting with future expansion of AArch64 feature registers 2459 + * and simply treat them as RAZ here. 
2460 + */ 2461 + if (params->CRm > 3) 2462 + params->regval = 0; 2463 + else if (!emulate_sys_reg(vcpu, params)) 2464 + return 1; 2465 + 2466 + vcpu_set_reg(vcpu, Rt, params->regval); 2467 + return 1; 2468 + } 2469 + 2356 2470 /** 2357 2471 * kvm_handle_cp_32 -- handles a mrc/mcr trap on a guest CP14/CP15 access 2358 2472 * @vcpu: The VCPU pointer 2359 2473 * @run: The kvm_run struct 2360 2474 */ 2361 2475 static int kvm_handle_cp_32(struct kvm_vcpu *vcpu, 2476 + struct sys_reg_params *params, 2362 2477 const struct sys_reg_desc *global, 2363 2478 size_t nr_global) 2364 2479 { 2365 - struct sys_reg_params params; 2366 - u64 esr = kvm_vcpu_get_esr(vcpu); 2367 2480 int Rt = kvm_vcpu_sys_get_rt(vcpu); 2368 2481 2369 - params.CRm = (esr >> 1) & 0xf; 2370 - params.regval = vcpu_get_reg(vcpu, Rt); 2371 - params.is_write = ((esr & 1) == 0); 2372 - params.CRn = (esr >> 10) & 0xf; 2373 - params.Op0 = 0; 2374 - params.Op1 = (esr >> 14) & 0x7; 2375 - params.Op2 = (esr >> 17) & 0x7; 2482 + params->regval = vcpu_get_reg(vcpu, Rt); 2376 2483 2377 - if (!emulate_cp(vcpu, &params, global, nr_global)) { 2378 - if (!params.is_write) 2379 - vcpu_set_reg(vcpu, Rt, params.regval); 2484 + if (emulate_cp(vcpu, params, global, nr_global)) { 2485 + if (!params->is_write) 2486 + vcpu_set_reg(vcpu, Rt, params->regval); 2380 2487 return 1; 2381 2488 } 2382 2489 2383 - unhandled_cp_access(vcpu, &params); 2490 + unhandled_cp_access(vcpu, params); 2384 2491 return 1; 2385 2492 } 2386 2493 ··· 2501 2388 2502 2389 int kvm_handle_cp15_32(struct kvm_vcpu *vcpu) 2503 2390 { 2504 - return kvm_handle_cp_32(vcpu, cp15_regs, ARRAY_SIZE(cp15_regs)); 2391 + struct sys_reg_params params; 2392 + 2393 + params = esr_cp1x_32_to_params(kvm_vcpu_get_esr(vcpu)); 2394 + 2395 + /* 2396 + * Certain AArch32 ID registers are handled by rerouting to the AArch64 2397 + * system register table. 
Registers in the ID range where CRm=0 are 2398 + * excluded from this scheme as they do not trivially map into AArch64 2399 + * system register encodings. 2400 + */ 2401 + if (params.Op1 == 0 && params.CRn == 0 && params.CRm) 2402 + return kvm_emulate_cp15_id_reg(vcpu, &params); 2403 + 2404 + return kvm_handle_cp_32(vcpu, &params, cp15_regs, ARRAY_SIZE(cp15_regs)); 2505 2405 } 2506 2406 2507 2407 int kvm_handle_cp14_64(struct kvm_vcpu *vcpu) ··· 2524 2398 2525 2399 int kvm_handle_cp14_32(struct kvm_vcpu *vcpu) 2526 2400 { 2527 - return kvm_handle_cp_32(vcpu, cp14_regs, ARRAY_SIZE(cp14_regs)); 2401 + struct sys_reg_params params; 2402 + 2403 + params = esr_cp1x_32_to_params(kvm_vcpu_get_esr(vcpu)); 2404 + 2405 + return kvm_handle_cp_32(vcpu, &params, cp14_regs, ARRAY_SIZE(cp14_regs)); 2528 2406 } 2529 2407 2530 2408 static bool is_imp_def_sys_reg(struct sys_reg_params *params) ··· 2537 2407 return params->Op0 == 3 && (params->CRn & 0b1011) == 0b1011; 2538 2408 } 2539 2409 2540 - static int emulate_sys_reg(struct kvm_vcpu *vcpu, 2410 + /** 2411 + * emulate_sys_reg - Emulate a guest access to an AArch64 system register 2412 + * @vcpu: The VCPU pointer 2413 + * @params: Decoded system register parameters 2414 + * 2415 + * Return: true if the system register access was successful, false otherwise. 
2416 + */ 2417 + static bool emulate_sys_reg(struct kvm_vcpu *vcpu, 2541 2418 struct sys_reg_params *params) 2542 2419 { 2543 2420 const struct sys_reg_desc *r; ··· 2553 2416 2554 2417 if (likely(r)) { 2555 2418 perform_access(vcpu, params, r); 2556 - } else if (is_imp_def_sys_reg(params)) { 2419 + return true; 2420 + } 2421 + 2422 + if (is_imp_def_sys_reg(params)) { 2557 2423 kvm_inject_undefined(vcpu); 2558 2424 } else { 2559 2425 print_sys_reg_msg(params, ··· 2564 2424 *vcpu_pc(vcpu), *vcpu_cpsr(vcpu)); 2565 2425 kvm_inject_undefined(vcpu); 2566 2426 } 2567 - return 1; 2427 + return false; 2568 2428 } 2569 2429 2570 2430 /** ··· 2592 2452 struct sys_reg_params params; 2593 2453 unsigned long esr = kvm_vcpu_get_esr(vcpu); 2594 2454 int Rt = kvm_vcpu_sys_get_rt(vcpu); 2595 - int ret; 2596 2455 2597 2456 trace_kvm_handle_sys_reg(esr); 2598 2457 2599 2458 params = esr_sys64_to_params(esr); 2600 2459 params.regval = vcpu_get_reg(vcpu, Rt); 2601 2460 2602 - ret = emulate_sys_reg(vcpu, &params); 2461 + if (!emulate_sys_reg(vcpu, &params)) 2462 + return 1; 2603 2463 2604 2464 if (!params.is_write) 2605 2465 vcpu_set_reg(vcpu, Rt, params.regval); 2606 - return ret; 2466 + return 1; 2607 2467 } 2608 2468 2609 2469 /****************************************************************************** ··· 3006 2866 return write_demux_regids(uindices); 3007 2867 } 3008 2868 3009 - void kvm_sys_reg_table_init(void) 2869 + int kvm_sys_reg_table_init(void) 3010 2870 { 2871 + bool valid = true; 3011 2872 unsigned int i; 3012 2873 struct sys_reg_desc clidr; 3013 2874 3014 2875 /* Make sure tables are unique and in order. 
*/ 3015 - BUG_ON(check_sysreg_table(sys_reg_descs, ARRAY_SIZE(sys_reg_descs), false)); 3016 - BUG_ON(check_sysreg_table(cp14_regs, ARRAY_SIZE(cp14_regs), true)); 3017 - BUG_ON(check_sysreg_table(cp14_64_regs, ARRAY_SIZE(cp14_64_regs), true)); 3018 - BUG_ON(check_sysreg_table(cp15_regs, ARRAY_SIZE(cp15_regs), true)); 3019 - BUG_ON(check_sysreg_table(cp15_64_regs, ARRAY_SIZE(cp15_64_regs), true)); 3020 - BUG_ON(check_sysreg_table(invariant_sys_regs, ARRAY_SIZE(invariant_sys_regs), false)); 2876 + valid &= check_sysreg_table(sys_reg_descs, ARRAY_SIZE(sys_reg_descs), false); 2877 + valid &= check_sysreg_table(cp14_regs, ARRAY_SIZE(cp14_regs), true); 2878 + valid &= check_sysreg_table(cp14_64_regs, ARRAY_SIZE(cp14_64_regs), true); 2879 + valid &= check_sysreg_table(cp15_regs, ARRAY_SIZE(cp15_regs), true); 2880 + valid &= check_sysreg_table(cp15_64_regs, ARRAY_SIZE(cp15_64_regs), true); 2881 + valid &= check_sysreg_table(invariant_sys_regs, ARRAY_SIZE(invariant_sys_regs), false); 2882 + 2883 + if (!valid) 2884 + return -EINVAL; 3021 2885 3022 2886 /* We abuse the reset function to overwrite the table itself. */ 3023 2887 for (i = 0; i < ARRAY_SIZE(invariant_sys_regs); i++) ··· 3044 2900 break; 3045 2901 /* Clear all higher bits. */ 3046 2902 cache_levels &= (1 << (i*3))-1; 2903 + 2904 + return 0; 3047 2905 }
+8 -1
arch/arm64/kvm/sys_regs.h
··· 35 35 .Op2 = ((esr) >> 17) & 0x7, \ 36 36 .is_write = !((esr) & 1) }) 37 37 38 + #define esr_cp1x_32_to_params(esr) \ 39 + ((struct sys_reg_params){ .Op1 = ((esr) >> 14) & 0x7, \ 40 + .CRn = ((esr) >> 10) & 0xf, \ 41 + .CRm = ((esr) >> 1) & 0xf, \ 42 + .Op2 = ((esr) >> 17) & 0x7, \ 43 + .is_write = !((esr) & 1) }) 44 + 38 45 struct sys_reg_desc { 39 46 /* Sysreg string for debug */ 40 47 const char *name; 41 48 42 49 enum { 43 - AA32_ZEROHIGH, 50 + AA32_DIRECT, 44 51 AA32_LO, 45 52 AA32_HI, 46 53 } aarch32_map;
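The new `esr_cp1x_32_to_params()` macro above pulls the coprocessor access parameters out of fixed ESR bitfields. A standalone sketch of the same extraction, with a re-encode helper (the helper is illustrative, not part of the patch) for a round-trip check:

```c
/*
 * Standalone decode of an ESR value for a 32-bit CP14/CP15 access,
 * mirroring the bitfields in esr_cp1x_32_to_params(): Op2 [19:17],
 * Op1 [16:14], CRn [13:10], CRm [4:1], direction in bit 0
 * (0 means a write).
 */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct cp_params { uint8_t Op1, Op2, CRn, CRm; bool is_write; };

static struct cp_params esr_cp1x_32_decode(uint64_t esr)
{
	return (struct cp_params){
		.Op1	  = (esr >> 14) & 0x7,
		.Op2	  = (esr >> 17) & 0x7,
		.CRn	  = (esr >> 10) & 0xf,
		.CRm	  = (esr >> 1)  & 0xf,
		.is_write = !(esr & 1),
	};
}

/* Illustrative inverse, for a quick round-trip sanity check. */
static uint64_t cp_params_encode(struct cp_params p)
{
	return ((uint64_t)p.Op2 << 17) | ((uint64_t)p.Op1 << 14) |
	       ((uint64_t)p.CRn << 10) | ((uint64_t)p.CRm << 1) |
	       (p.is_write ? 0 : 1);
}
```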
+9 -4
arch/arm64/kvm/vgic/vgic-init.c
··· 98 98 ret = 0; 99 99 100 100 if (type == KVM_DEV_TYPE_ARM_VGIC_V2) 101 - kvm->arch.max_vcpus = VGIC_V2_MAX_CPUS; 101 + kvm->max_vcpus = VGIC_V2_MAX_CPUS; 102 102 else 103 - kvm->arch.max_vcpus = VGIC_V3_MAX_CPUS; 103 + kvm->max_vcpus = VGIC_V3_MAX_CPUS; 104 104 105 - if (atomic_read(&kvm->online_vcpus) > kvm->arch.max_vcpus) { 105 + if (atomic_read(&kvm->online_vcpus) > kvm->max_vcpus) { 106 106 ret = -E2BIG; 107 107 goto out_unlock; 108 108 } ··· 319 319 320 320 vgic_debug_init(kvm); 321 321 322 - dist->implementation_rev = 2; 322 + /* 323 + * If userspace didn't set the GIC implementation revision, 324 + * default to the latest and greatest. You know you want it. 325 + */ 326 + if (!dist->implementation_rev) 327 + dist->implementation_rev = KVM_VGIC_IMP_REV_LATEST; 323 328 dist->initialized = true; 324 329 325 330 out:
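The vgic-init.c change above moves the vCPU cap from `kvm->arch.max_vcpus` to `kvm->max_vcpus`: creating a GICv2 device caps the VM at `VGIC_V2_MAX_CPUS`, a GICv3 device at `VGIC_V3_MAX_CPUS`, and creation fails with `-E2BIG` if more vCPUs are already online. A sketch of that check; the limits 8 and 512 are assumed stand-ins for the two constants:

```c
/*
 * Sketch of the vCPU-limit selection in vgic_create(). 8 and 512
 * stand in for VGIC_V2_MAX_CPUS and VGIC_V3_MAX_CPUS; E2BIG_ERR
 * stands in for -E2BIG.
 */
#include <assert.h>

#define E2BIG_ERR	(-7)

enum vgic_type { VGIC_V2, VGIC_V3 };

/* Returns the new max_vcpus, or E2BIG_ERR if too many are online. */
static int vgic_pick_max_vcpus(enum vgic_type type, int online_vcpus)
{
	int max = (type == VGIC_V2) ? 8 : 512;

	return (online_vcpus > max) ? E2BIG_ERR : max;
}
```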
+120 -40
arch/arm64/kvm/vgic/vgic-its.c
··· 683 683 if (!vcpu) 684 684 return E_ITS_INT_UNMAPPED_INTERRUPT; 685 685 686 - if (!vcpu->arch.vgic_cpu.lpis_enabled) 686 + if (!vgic_lpis_enabled(vcpu)) 687 687 return -EBUSY; 688 688 689 689 vgic_its_cache_translation(kvm, its, devid, eventid, ite->irq); ··· 894 894 return update_affinity(ite->irq, vcpu); 895 895 } 896 896 897 + static bool __is_visible_gfn_locked(struct vgic_its *its, gpa_t gpa) 898 + { 899 + gfn_t gfn = gpa >> PAGE_SHIFT; 900 + int idx; 901 + bool ret; 902 + 903 + idx = srcu_read_lock(&its->dev->kvm->srcu); 904 + ret = kvm_is_visible_gfn(its->dev->kvm, gfn); 905 + srcu_read_unlock(&its->dev->kvm->srcu, idx); 906 + return ret; 907 + } 908 + 897 909 /* 898 910 * Check whether an ID can be stored into the corresponding guest table. 899 911 * For a direct table this is pretty easy, but gets a bit nasty for ··· 920 908 u64 indirect_ptr, type = GITS_BASER_TYPE(baser); 921 909 phys_addr_t base = GITS_BASER_ADDR_48_to_52(baser); 922 910 int esz = GITS_BASER_ENTRY_SIZE(baser); 923 - int index, idx; 924 - gfn_t gfn; 925 - bool ret; 911 + int index; 926 912 927 913 switch (type) { 928 914 case GITS_BASER_TYPE_DEVICE: ··· 943 933 return false; 944 934 945 935 addr = base + id * esz; 946 - gfn = addr >> PAGE_SHIFT; 947 936 948 937 if (eaddr) 949 938 *eaddr = addr; 950 939 951 - goto out; 940 + return __is_visible_gfn_locked(its, addr); 952 941 } 953 942 954 943 /* calculate and check the index into the 1st level */ ··· 973 964 /* Find the address of the actual entry */ 974 965 index = id % (SZ_64K / esz); 975 966 indirect_ptr += index * esz; 976 - gfn = indirect_ptr >> PAGE_SHIFT; 977 967 978 968 if (eaddr) 979 969 *eaddr = indirect_ptr; 980 970 981 - out: 982 - idx = srcu_read_lock(&its->dev->kvm->srcu); 983 - ret = kvm_is_visible_gfn(its->dev->kvm, gfn); 984 - srcu_read_unlock(&its->dev->kvm->srcu, idx); 985 - return ret; 971 + return __is_visible_gfn_locked(its, indirect_ptr); 986 972 } 987 973 974 + /* 975 + * Check whether an event ID can be stored 
in the corresponding Interrupt 976 + * Translation Table, which starts at device->itt_addr. 977 + */ 978 + static bool vgic_its_check_event_id(struct vgic_its *its, struct its_device *device, 979 + u32 event_id) 980 + { 981 + const struct vgic_its_abi *abi = vgic_its_get_abi(its); 982 + int ite_esz = abi->ite_esz; 983 + gpa_t gpa; 984 + 985 + /* max table size is: BIT_ULL(device->num_eventid_bits) * ite_esz */ 986 + if (event_id >= BIT_ULL(device->num_eventid_bits)) 987 + return false; 988 + 989 + gpa = device->itt_addr + event_id * ite_esz; 990 + return __is_visible_gfn_locked(its, gpa); 991 + } 992 + 993 + /* 994 + * Add a new collection into the ITS collection table. 995 + * Returns 0 on success, and a negative error value for generic errors. 996 + */ 988 997 static int vgic_its_alloc_collection(struct vgic_its *its, 989 998 struct its_collection **colp, 990 999 u32 coll_id) 991 1000 { 992 1001 struct its_collection *collection; 993 - 994 - if (!vgic_its_check_id(its, its->baser_coll_table, coll_id, NULL)) 995 - return E_ITS_MAPC_COLLECTION_OOR; 996 1002 997 1003 collection = kzalloc(sizeof(*collection), GFP_KERNEL_ACCOUNT); 998 1004 if (!collection) ··· 1085 1061 if (!device) 1086 1062 return E_ITS_MAPTI_UNMAPPED_DEVICE; 1087 1063 1088 - if (event_id >= BIT_ULL(device->num_eventid_bits)) 1064 + if (!vgic_its_check_event_id(its, device, event_id)) 1089 1065 return E_ITS_MAPTI_ID_OOR; 1090 1066 1091 1067 if (its_cmd_get_command(its_cmd) == GITS_CMD_MAPTI) ··· 1102 1078 1103 1079 collection = find_collection(its, coll_id); 1104 1080 if (!collection) { 1105 - int ret = vgic_its_alloc_collection(its, &collection, coll_id); 1081 + int ret; 1082 + 1083 + if (!vgic_its_check_id(its, its->baser_coll_table, coll_id, NULL)) 1084 + return E_ITS_MAPC_COLLECTION_OOR; 1085 + 1086 + ret = vgic_its_alloc_collection(its, &collection, coll_id); 1106 1087 if (ret) 1107 1088 return ret; 1108 1089 new_coll = collection; ··· 1262 1233 if (!collection) { 1263 1234 int ret; 1264 1235 
1236 + if (!vgic_its_check_id(its, its->baser_coll_table, 1237 + coll_id, NULL)) 1238 + return E_ITS_MAPC_COLLECTION_OOR; 1239 + 1265 1240 ret = vgic_its_alloc_collection(its, &collection, 1266 1241 coll_id); 1267 1242 if (ret) ··· 1305 1272 return 0; 1306 1273 } 1307 1274 1275 + int vgic_its_inv_lpi(struct kvm *kvm, struct vgic_irq *irq) 1276 + { 1277 + return update_lpi_config(kvm, irq, NULL, true); 1278 + } 1279 + 1308 1280 /* 1309 1281 * The INV command syncs the configuration bits from the memory table. 1310 1282 * Must be called with the its_lock mutex held. ··· 1326 1288 if (!ite) 1327 1289 return E_ITS_INV_UNMAPPED_INTERRUPT; 1328 1290 1329 - return update_lpi_config(kvm, ite->irq, NULL, true); 1291 + return vgic_its_inv_lpi(kvm, ite->irq); 1292 + } 1293 + 1294 + /** 1295 + * vgic_its_invall - invalidate all LPIs targeting a given vcpu 1296 + * @vcpu: the vcpu for which the RD is targeted by an invalidation 1297 + * 1298 + * Contrary to the INVALL command, this targets a RD instead of a 1299 + * collection, and we don't need to hold the its_lock, since no ITS is 1300 + * involved here.
+ */
+int vgic_its_invall(struct kvm_vcpu *vcpu)
+{
+    struct kvm *kvm = vcpu->kvm;
+    int irq_count, i = 0;
+    u32 *intids;
+
+    irq_count = vgic_copy_lpi_list(kvm, vcpu, &intids);
+    if (irq_count < 0)
+        return irq_count;
+
+    for (i = 0; i < irq_count; i++) {
+        struct vgic_irq *irq = vgic_get_irq(kvm, NULL, intids[i]);
+        if (!irq)
+            continue;
+        update_lpi_config(kvm, irq, vcpu, false);
+        vgic_put_irq(kvm, irq);
+    }
+
+    kfree(intids);
+
+    if (vcpu->arch.vgic_cpu.vgic_v3.its_vpe.its_vm)
+        its_invall_vpe(&vcpu->arch.vgic_cpu.vgic_v3.its_vpe);
+
+    return 0;
  }
 
  /*
···
      u32 coll_id = its_cmd_get_collection(its_cmd);
      struct its_collection *collection;
      struct kvm_vcpu *vcpu;
-     struct vgic_irq *irq;
-     u32 *intids;
-     int irq_count, i;
 
      collection = find_collection(its, coll_id);
      if (!its_is_collection_mapped(collection))
          return E_ITS_INVALL_UNMAPPED_COLLECTION;
 
      vcpu = kvm_get_vcpu(kvm, collection->target_addr);
-
-     irq_count = vgic_copy_lpi_list(kvm, vcpu, &intids);
-     if (irq_count < 0)
-         return irq_count;
-
-     for (i = 0; i < irq_count; i++) {
-         irq = vgic_get_irq(kvm, NULL, intids[i]);
-         if (!irq)
-             continue;
-         update_lpi_config(kvm, irq, vcpu, false);
-         vgic_put_irq(kvm, irq);
-     }
-
-     kfree(intids);
-
-     if (vcpu->arch.vgic_cpu.vgic_v3.its_vpe.its_vm)
-         its_invall_vpe(&vcpu->arch.vgic_cpu.vgic_v3.its_vpe);
+     vgic_its_invall(vcpu);
 
      return 0;
  }
···
      if (!collection)
          return -EINVAL;
 
+     if (!vgic_its_check_event_id(its, dev, event_id))
+         return -EINVAL;
+
      ite = vgic_its_alloc_ite(dev, collection, event_id);
      if (IS_ERR(ite))
          return PTR_ERR(ite);
···
      vcpu = kvm_get_vcpu(kvm, collection->target_addr);
 
      irq = vgic_add_lpi(kvm, lpi_id, vcpu);
-     if (IS_ERR(irq))
+     if (IS_ERR(irq)) {
+         its_free_ite(kvm, ite);
          return PTR_ERR(irq);
+     }
      ite->irq = irq;
 
      return offset;
···
                void *ptr, void *opaque)
  {
      struct its_device *dev;
+     u64 baser = its->baser_device_table;
      gpa_t itt_addr;
      u8 num_eventid_bits;
      u64 entry = *(u64 *)ptr;
···
      /* dte entry is valid */
      offset = (entry & KVM_ITS_DTE_NEXT_MASK) >> KVM_ITS_DTE_NEXT_SHIFT;
+
+     if (!vgic_its_check_id(its, baser, id, NULL))
+         return -EINVAL;
 
      dev = vgic_its_alloc_device(its, id, itt_addr, num_eventid_bits);
      if (IS_ERR(dev))
···
      if (ret > 0)
          ret = 0;
 
+     if (ret < 0)
+         vgic_its_free_device_list(its->dev->kvm, its);
+
      return ret;
  }
···
      return kvm_write_guest_lock(its->dev->kvm, gpa, &val, esz);
  }
 
+ /*
+  * Restore a collection entry into the ITS collection table.
+  * Return +1 on success, 0 if the entry was invalid (which should be
+  * interpreted as end-of-table), and a negative error value for generic errors.
+  */
  static int vgic_its_restore_cte(struct vgic_its *its, gpa_t gpa, int esz)
  {
      struct its_collection *collection;
···
      collection = find_collection(its, coll_id);
      if (collection)
          return -EEXIST;
+
+     if (!vgic_its_check_id(its, its->baser_coll_table, coll_id, NULL))
+         return -EINVAL;
+
      ret = vgic_its_alloc_collection(its, &collection, coll_id);
      if (ret)
          return ret;
···
      if (ret > 0)
          return 0;
 
+     if (ret < 0)
+         vgic_its_free_collection_list(its->dev->kvm, its);
+
      return ret;
  }
···
      if (ret)
          return ret;
 
-     return vgic_its_restore_device_tables(its);
+     ret = vgic_its_restore_device_tables(its);
+     if (ret)
+         vgic_its_free_collection_list(its->dev->kvm, its);
+     return ret;
  }
 
  static int vgic_its_commit_v0(struct vgic_its *its)
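The restore path above uses a three-way return convention (+1 restored, 0 end-of-table, negative error) and, with this series, frees partially restored state on error. The sketch below models that contract in plain C; `restore_table`, `fake_restore`, and the integer table encoding are hypothetical names invented for this illustration, not kernel API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-entry callback mirroring the vgic_its_restore_cte()
 * convention: +1 on success, 0 for an invalid entry (end-of-table),
 * negative on error. */
typedef int (*restore_fn)(const int *table, size_t idx);

/* For the demo, the table directly encodes each entry's return value. */
static int fake_restore(const int *table, size_t idx)
{
    return table[idx];
}

/* Walk a table until end-of-table or error; count restored entries so a
 * caller can free them on failure, as the patched restore paths now do. */
static int restore_table(const int *table, size_t nr, restore_fn fn,
                         int *restored)
{
    size_t i;
    int ret = 1;

    for (i = 0; i < nr; i++) {
        ret = fn(table, i);
        if (ret <= 0)           /* 0: clean end, <0: error */
            break;
        (*restored)++;
    }

    /* ret > 0 means every entry was restored successfully */
    return (ret > 0) ? 0 : ret;
}
```

On error the caller would tear down the `*restored` entries already created, which is exactly the cleanup the diff adds to the device- and collection-table restore paths.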
+15 -3
arch/arm64/kvm/vgic/vgic-mmio-v2.c
···
                  gpa_t addr, unsigned int len,
                  unsigned long val)
  {
+     struct vgic_dist *dist = &vcpu->kvm->arch.vgic;
+     u32 reg;
+
      switch (addr & 0x0c) {
      case GIC_DIST_IIDR:
-         if (val != vgic_mmio_read_v2_misc(vcpu, addr, len))
+         reg = vgic_mmio_read_v2_misc(vcpu, addr, len);
+         if ((reg ^ val) & ~GICD_IIDR_REVISION_MASK)
              return -EINVAL;
 
          /*
···
           * migration from old kernels to new kernels with legacy
           * userspace.
           */
-         vcpu->kvm->arch.vgic.v2_groups_user_writable = true;
-         return 0;
+         reg = FIELD_GET(GICD_IIDR_REVISION_MASK, reg);
+         switch (reg) {
+         case KVM_VGIC_IMP_REV_2:
+         case KVM_VGIC_IMP_REV_3:
+             vcpu->kvm->arch.vgic.v2_groups_user_writable = true;
+             dist->implementation_rev = reg;
+             return 0;
+         default:
+             return -EINVAL;
+         }
      }
 
      vgic_mmio_write_v2_misc(vcpu, addr, len, val);
+114 -11
arch/arm64/kvm/vgic/vgic-mmio-v3.c
···
                  unsigned long val)
  {
      struct vgic_dist *dist = &vcpu->kvm->arch.vgic;
+     u32 reg;
 
      switch (addr & 0x0c) {
      case GICD_TYPER2:
-     case GICD_IIDR:
          if (val != vgic_mmio_read_v3_misc(vcpu, addr, len))
              return -EINVAL;
          return 0;
+     case GICD_IIDR:
+         reg = vgic_mmio_read_v3_misc(vcpu, addr, len);
+         if ((reg ^ val) & ~GICD_IIDR_REVISION_MASK)
+             return -EINVAL;
+
+         reg = FIELD_GET(GICD_IIDR_REVISION_MASK, reg);
+         switch (reg) {
+         case KVM_VGIC_IMP_REV_2:
+         case KVM_VGIC_IMP_REV_3:
+             dist->implementation_rev = reg;
+             return 0;
+         default:
+             return -EINVAL;
+         }
      case GICD_CTLR:
          /* Not a GICv4.1? No HW SGIs */
          if (!kvm_vgic_global_state.has_gicv4_1)
···
      vgic_put_irq(vcpu->kvm, irq);
  }
 
+ bool vgic_lpis_enabled(struct kvm_vcpu *vcpu)
+ {
+     struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic_cpu;
+
+     return atomic_read(&vgic_cpu->ctlr) == GICR_CTLR_ENABLE_LPIS;
+ }
+
  static unsigned long vgic_mmio_read_v3r_ctlr(struct kvm_vcpu *vcpu,
                                               gpa_t addr, unsigned int len)
  {
      struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic_cpu;
+     unsigned long val;
 
-     return vgic_cpu->lpis_enabled ? GICR_CTLR_ENABLE_LPIS : 0;
+     val = atomic_read(&vgic_cpu->ctlr);
+     if (vgic_get_implementation_rev(vcpu) >= KVM_VGIC_IMP_REV_3)
+         val |= GICR_CTLR_IR | GICR_CTLR_CES;
+
+     return val;
  }
-
 
  static void vgic_mmio_write_v3r_ctlr(struct kvm_vcpu *vcpu,
                                       gpa_t addr, unsigned int len,
                                       unsigned long val)
  {
      struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic_cpu;
-     bool was_enabled = vgic_cpu->lpis_enabled;
+     u32 ctlr;
 
      if (!vgic_has_its(vcpu->kvm))
          return;
 
-     vgic_cpu->lpis_enabled = val & GICR_CTLR_ENABLE_LPIS;
+     if (!(val & GICR_CTLR_ENABLE_LPIS)) {
+         /*
+          * Don't disable if RWP is set, as there is already an
+          * ongoing disable. Funky guest...
+          */
+         ctlr = atomic_cmpxchg_acquire(&vgic_cpu->ctlr,
+                                       GICR_CTLR_ENABLE_LPIS,
+                                       GICR_CTLR_RWP);
+         if (ctlr != GICR_CTLR_ENABLE_LPIS)
+             return;
 
-     if (was_enabled && !vgic_cpu->lpis_enabled) {
          vgic_flush_pending_lpis(vcpu);
          vgic_its_invalidate_cache(vcpu->kvm);
-     }
+         atomic_set_release(&vgic_cpu->ctlr, 0);
+     } else {
+         ctlr = atomic_cmpxchg_acquire(&vgic_cpu->ctlr, 0,
+                                       GICR_CTLR_ENABLE_LPIS);
+         if (ctlr != 0)
+             return;
 
-     if (!was_enabled && vgic_cpu->lpis_enabled)
          vgic_enable_lpis(vcpu);
+     }
  }
 
  static bool vgic_mmio_vcpu_rdist_is_last(struct kvm_vcpu *vcpu)
···
              unsigned long val)
  {
      struct vgic_dist *dist = &vcpu->kvm->arch.vgic;
-     struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic_cpu;
      u64 old_propbaser, propbaser;
 
      /* Storing a value with LPIs already enabled is undefined */
-     if (vgic_cpu->lpis_enabled)
+     if (vgic_lpis_enabled(vcpu))
          return;
 
      do {
···
      u64 old_pendbaser, pendbaser;
 
      /* Storing a value with LPIs already enabled is undefined */
-     if (vgic_cpu->lpis_enabled)
+     if (vgic_lpis_enabled(vcpu))
          return;
 
      do {
···
          pendbaser = vgic_sanitise_pendbaser(pendbaser);
      } while (cmpxchg64(&vgic_cpu->pendbaser, old_pendbaser,
                         pendbaser) != old_pendbaser);
+ }
+
+ static unsigned long vgic_mmio_read_sync(struct kvm_vcpu *vcpu,
+                                          gpa_t addr, unsigned int len)
+ {
+     return !!atomic_read(&vcpu->arch.vgic_cpu.syncr_busy);
+ }
+
+ static void vgic_set_rdist_busy(struct kvm_vcpu *vcpu, bool busy)
+ {
+     if (busy) {
+         atomic_inc(&vcpu->arch.vgic_cpu.syncr_busy);
+         smp_mb__after_atomic();
+     } else {
+         smp_mb__before_atomic();
+         atomic_dec(&vcpu->arch.vgic_cpu.syncr_busy);
+     }
+ }
+
+ static void vgic_mmio_write_invlpi(struct kvm_vcpu *vcpu,
+                                    gpa_t addr, unsigned int len,
+                                    unsigned long val)
+ {
+     struct vgic_irq *irq;
+
+     /*
+      * If the guest wrote only to the upper 32bit part of the
+      * register, drop the write on the floor, as it is only for
+      * vPEs (which we don't support for obvious reasons).
+      *
+      * Also discard the access if LPIs are not enabled.
+      */
+     if ((addr & 4) || !vgic_lpis_enabled(vcpu))
+         return;
+
+     vgic_set_rdist_busy(vcpu, true);
+
+     irq = vgic_get_irq(vcpu->kvm, NULL, lower_32_bits(val));
+     if (irq) {
+         vgic_its_inv_lpi(vcpu->kvm, irq);
+         vgic_put_irq(vcpu->kvm, irq);
+     }
+
+     vgic_set_rdist_busy(vcpu, false);
+ }
+
+ static void vgic_mmio_write_invall(struct kvm_vcpu *vcpu,
+                                    gpa_t addr, unsigned int len,
+                                    unsigned long val)
+ {
+     /* See vgic_mmio_write_invlpi() for the early return rationale */
+     if ((addr & 4) || !vgic_lpis_enabled(vcpu))
+         return;
+
+     vgic_set_rdist_busy(vcpu, true);
+     vgic_its_invall(vcpu);
+     vgic_set_rdist_busy(vcpu, false);
  }
 
  /*
···
      REGISTER_DESC_WITH_LENGTH(GICR_PENDBASER,
          vgic_mmio_read_pendbase, vgic_mmio_write_pendbase, 8,
          VGIC_ACCESS_64bit | VGIC_ACCESS_32bit),
+     REGISTER_DESC_WITH_LENGTH(GICR_INVLPIR,
+         vgic_mmio_read_raz, vgic_mmio_write_invlpi, 8,
+         VGIC_ACCESS_64bit | VGIC_ACCESS_32bit),
+     REGISTER_DESC_WITH_LENGTH(GICR_INVALLR,
+         vgic_mmio_read_raz, vgic_mmio_write_invall, 8,
+         VGIC_ACCESS_64bit | VGIC_ACCESS_32bit),
+     REGISTER_DESC_WITH_LENGTH(GICR_SYNCR,
+         vgic_mmio_read_sync, vgic_mmio_write_wi, 4,
+         VGIC_ACCESS_32bit),
      REGISTER_DESC_WITH_LENGTH(GICR_IDREGS,
          vgic_mmio_read_v3_idregs, vgic_mmio_write_wi, 48,
          VGIC_ACCESS_32bit),
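The GICR_CTLR rework above replaces a plain boolean with a three-state atomic (disabled, EnableLPIs set, RWP set while a disable is in flight), and cmpxchg ensures only one writer drives each transition. The userspace sketch below models that state machine with C11 `stdatomic`; the state names and counters are local to this example, not the kernel's, and locking-free single-transition ownership is the only property being demonstrated.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Three states, mirroring the patch: 0 (off), EnableLPIs, RWP (disable
 * in progress). Values are arbitrary for this model. */
enum { CTLR_OFF = 0, CTLR_ENABLE_LPIS = 1, CTLR_RWP = 2 };

static atomic_int ctlr = CTLR_OFF;
static int enable_count, disable_count;   /* stand-ins for the real work */

static void write_ctlr(bool enable_lpis)
{
    int expected;

    if (!enable_lpis) {
        /* Only the ENABLE_LPIS -> RWP transition starts a disable;
         * a racing writer that observes RWP or OFF backs off. */
        expected = CTLR_ENABLE_LPIS;
        if (!atomic_compare_exchange_strong(&ctlr, &expected, CTLR_RWP))
            return;
        disable_count++;                  /* flush/invalidate would go here */
        atomic_store(&ctlr, CTLR_OFF);    /* RWP clears once done */
    } else {
        expected = CTLR_OFF;
        if (!atomic_compare_exchange_strong(&ctlr, &expected,
                                            CTLR_ENABLE_LPIS))
            return;
        enable_count++;                   /* vgic_enable_lpis() equivalent */
    }
}
```

Redundant writes (enable-while-enabled, disable-while-disabling) fall through the failed cmpxchg and do nothing, which is the behaviour the kernel change relies on.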
+4
arch/arm64/kvm/vgic/vgic-v3.c
···
  static const struct midr_range broken_seis[] = {
      MIDR_ALL_VERSIONS(MIDR_APPLE_M1_ICESTORM),
      MIDR_ALL_VERSIONS(MIDR_APPLE_M1_FIRESTORM),
+     MIDR_ALL_VERSIONS(MIDR_APPLE_M1_ICESTORM_PRO),
+     MIDR_ALL_VERSIONS(MIDR_APPLE_M1_FIRESTORM_PRO),
+     MIDR_ALL_VERSIONS(MIDR_APPLE_M1_ICESTORM_MAX),
+     MIDR_ALL_VERSIONS(MIDR_APPLE_M1_FIRESTORM_MAX),
      {},
  };
 
+10
arch/arm64/kvm/vgic/vgic.h
···
  #define DEBUG_SPINLOCK_BUG_ON(p)
  #endif
 
+ static inline u32 vgic_get_implementation_rev(struct kvm_vcpu *vcpu)
+ {
+     return vcpu->kvm->arch.vgic.implementation_rev;
+ }
+
  /* Requires the irq_lock to be held by the caller. */
  static inline bool irq_is_pending(struct vgic_irq *irq)
  {
···
          (base < d->vgic_dist_base + KVM_VGIC_V3_DIST_SIZE);
  }
 
+ bool vgic_lpis_enabled(struct kvm_vcpu *vcpu);
  int vgic_copy_lpi_list(struct kvm *kvm, struct kvm_vcpu *vcpu, u32 **intid_ptr);
  int vgic_its_resolve_lpi(struct kvm *kvm, struct vgic_its *its,
                           u32 devid, u32 eventid, struct vgic_irq **irq);
···
  void vgic_lpi_translation_cache_init(struct kvm *kvm);
  void vgic_lpi_translation_cache_destroy(struct kvm *kvm);
  void vgic_its_invalidate_cache(struct kvm *kvm);
+
+ /* GICv4.1 MMIO interface */
+ int vgic_its_inv_lpi(struct kvm *kvm, struct vgic_irq *irq);
+ int vgic_its_invall(struct kvm_vcpu *vcpu);
 
  bool vgic_supports_direct_msis(struct kvm *kvm);
  int vgic_v4_init(struct kvm *kvm);
+11 -1
arch/arm64/lib/delay.c
···
  {
      cycles_t start = get_cycles();
 
-     if (arch_timer_evtstrm_available()) {
+     if (cpus_have_const_cap(ARM64_HAS_WFXT)) {
+         u64 end = start + cycles;
+
+         /*
+          * Start with WFIT. If an interrupt makes us resume
+          * early, use a WFET loop to complete the delay.
+          */
+         wfit(end);
+         while ((get_cycles() - start) < cycles)
+             wfet(end);
+     } else if (arch_timer_evtstrm_available()) {
          const cycles_t timer_evt_period =
              USECS_TO_CYCLES(ARCH_TIMER_EVT_STREAM_PERIOD_US);
 
+1
arch/arm64/tools/cpucaps
···
  HAS_SYSREG_GIC_CPUIF
  HAS_TLB_RANGE
  HAS_VIRT_HOST_EXTN
+ HAS_WFXT
  HW_DBM
  KVM_PROTECTED_MODE
  MISMATCHED_CACHE_TYPE
+1
arch/riscv/include/asm/csr.h
···
  #define HGATP_MODE_SV32X4    _AC(1, UL)
  #define HGATP_MODE_SV39X4    _AC(8, UL)
  #define HGATP_MODE_SV48X4    _AC(9, UL)
+ #define HGATP_MODE_SV57X4    _AC(10, UL)
 
  #define HGATP32_MODE_SHIFT   31
  #define HGATP32_VMID_SHIFT   22
+99 -21
arch/riscv/include/asm/kvm_host.h
···
  #include <linux/types.h>
  #include <linux/kvm.h>
  #include <linux/kvm_types.h>
+ #include <linux/spinlock.h>
  #include <asm/csr.h>
  #include <asm/kvm_vcpu_fp.h>
  #include <asm/kvm_vcpu_timer.h>
 
- #define KVM_MAX_VCPUS \
-     ((HGATP_VMID_MASK >> HGATP_VMID_SHIFT) + 1)
+ #define KVM_MAX_VCPUS            1024
 
  #define KVM_HALT_POLL_NS_DEFAULT 500000
···
      KVM_ARCH_REQ_FLAGS(0, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
  #define KVM_REQ_VCPU_RESET       KVM_ARCH_REQ(1)
  #define KVM_REQ_UPDATE_HGATP     KVM_ARCH_REQ(2)
+ #define KVM_REQ_FENCE_I          \
+     KVM_ARCH_REQ_FLAGS(3, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
+ #define KVM_REQ_HFENCE_GVMA_VMID_ALL KVM_REQ_TLB_FLUSH
+ #define KVM_REQ_HFENCE_VVMA_ALL  \
+     KVM_ARCH_REQ_FLAGS(4, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
+ #define KVM_REQ_HFENCE           \
+     KVM_ARCH_REQ_FLAGS(5, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
+
+ enum kvm_riscv_hfence_type {
+     KVM_RISCV_HFENCE_UNKNOWN = 0,
+     KVM_RISCV_HFENCE_GVMA_VMID_GPA,
+     KVM_RISCV_HFENCE_VVMA_ASID_GVA,
+     KVM_RISCV_HFENCE_VVMA_ASID_ALL,
+     KVM_RISCV_HFENCE_VVMA_GVA,
+ };
+
+ struct kvm_riscv_hfence {
+     enum kvm_riscv_hfence_type type;
+     unsigned long asid;
+     unsigned long order;
+     gpa_t addr;
+     gpa_t size;
+ };
+
+ #define KVM_RISCV_VCPU_MAX_HFENCE    64
 
  struct kvm_vm_stat {
      struct kvm_vm_stat_generic generic;
···
  };
 
  struct kvm_arch {
-     /* stage2 vmid */
+     /* G-stage vmid */
      struct kvm_vmid vmid;
 
-     /* stage2 page table */
+     /* G-stage page table */
      pgd_t *pgd;
      phys_addr_t pgd_phys;
···
      /* VCPU ran at least once */
      bool ran_atleast_once;
 
+     /* Last Host CPU on which Guest VCPU exited */
+     int last_exit_cpu;
+
      /* ISA feature bits (similar to MISA) */
      unsigned long isa;
···
      /* VCPU Timer */
      struct kvm_vcpu_timer timer;
 
+     /* HFENCE request queue */
+     spinlock_t hfence_lock;
+     unsigned long hfence_head;
+     unsigned long hfence_tail;
+     struct kvm_riscv_hfence hfence_queue[KVM_RISCV_VCPU_MAX_HFENCE];
+
      /* MMIO instruction details */
      struct kvm_mmio_decode mmio_decode;
···
 
  #define KVM_ARCH_WANT_MMU_NOTIFIER
 
- void __kvm_riscv_hfence_gvma_vmid_gpa(unsigned long gpa_divby_4,
-                                       unsigned long vmid);
- void __kvm_riscv_hfence_gvma_vmid(unsigned long vmid);
- void __kvm_riscv_hfence_gvma_gpa(unsigned long gpa_divby_4);
- void __kvm_riscv_hfence_gvma_all(void);
+ #define KVM_RISCV_GSTAGE_TLB_MIN_ORDER   12
 
- int kvm_riscv_stage2_map(struct kvm_vcpu *vcpu,
+ void kvm_riscv_local_hfence_gvma_vmid_gpa(unsigned long vmid,
+                                           gpa_t gpa, gpa_t gpsz,
+                                           unsigned long order);
+ void kvm_riscv_local_hfence_gvma_vmid_all(unsigned long vmid);
+ void kvm_riscv_local_hfence_gvma_gpa(gpa_t gpa, gpa_t gpsz,
+                                      unsigned long order);
+ void kvm_riscv_local_hfence_gvma_all(void);
+ void kvm_riscv_local_hfence_vvma_asid_gva(unsigned long vmid,
+                                           unsigned long asid,
+                                           unsigned long gva,
+                                           unsigned long gvsz,
+                                           unsigned long order);
+ void kvm_riscv_local_hfence_vvma_asid_all(unsigned long vmid,
+                                           unsigned long asid);
+ void kvm_riscv_local_hfence_vvma_gva(unsigned long vmid,
+                                      unsigned long gva, unsigned long gvsz,
+                                      unsigned long order);
+ void kvm_riscv_local_hfence_vvma_all(unsigned long vmid);
+
+ void kvm_riscv_local_tlb_sanitize(struct kvm_vcpu *vcpu);
+
+ void kvm_riscv_fence_i_process(struct kvm_vcpu *vcpu);
+ void kvm_riscv_hfence_gvma_vmid_all_process(struct kvm_vcpu *vcpu);
+ void kvm_riscv_hfence_vvma_all_process(struct kvm_vcpu *vcpu);
+ void kvm_riscv_hfence_process(struct kvm_vcpu *vcpu);
+
+ void kvm_riscv_fence_i(struct kvm *kvm,
+                        unsigned long hbase, unsigned long hmask);
+ void kvm_riscv_hfence_gvma_vmid_gpa(struct kvm *kvm,
+                                     unsigned long hbase, unsigned long hmask,
+                                     gpa_t gpa, gpa_t gpsz,
+                                     unsigned long order);
+ void kvm_riscv_hfence_gvma_vmid_all(struct kvm *kvm,
+                                     unsigned long hbase, unsigned long hmask);
+ void kvm_riscv_hfence_vvma_asid_gva(struct kvm *kvm,
+                                     unsigned long hbase, unsigned long hmask,
+                                     unsigned long gva, unsigned long gvsz,
+                                     unsigned long order, unsigned long asid);
+ void kvm_riscv_hfence_vvma_asid_all(struct kvm *kvm,
+                                     unsigned long hbase, unsigned long hmask,
+                                     unsigned long asid);
+ void kvm_riscv_hfence_vvma_gva(struct kvm *kvm,
+                                unsigned long hbase, unsigned long hmask,
+                                unsigned long gva, unsigned long gvsz,
+                                unsigned long order);
+ void kvm_riscv_hfence_vvma_all(struct kvm *kvm,
+                                unsigned long hbase, unsigned long hmask);
+
+ int kvm_riscv_gstage_map(struct kvm_vcpu *vcpu,
                           struct kvm_memory_slot *memslot,
                           gpa_t gpa, unsigned long hva, bool is_write);
- int kvm_riscv_stage2_alloc_pgd(struct kvm *kvm);
- void kvm_riscv_stage2_free_pgd(struct kvm *kvm);
- void kvm_riscv_stage2_update_hgatp(struct kvm_vcpu *vcpu);
- void kvm_riscv_stage2_mode_detect(void);
- unsigned long kvm_riscv_stage2_mode(void);
- int kvm_riscv_stage2_gpa_bits(void);
+ int kvm_riscv_gstage_alloc_pgd(struct kvm *kvm);
+ void kvm_riscv_gstage_free_pgd(struct kvm *kvm);
+ void kvm_riscv_gstage_update_hgatp(struct kvm_vcpu *vcpu);
+ void kvm_riscv_gstage_mode_detect(void);
+ unsigned long kvm_riscv_gstage_mode(void);
+ int kvm_riscv_gstage_gpa_bits(void);
 
- void kvm_riscv_stage2_vmid_detect(void);
- unsigned long kvm_riscv_stage2_vmid_bits(void);
- int kvm_riscv_stage2_vmid_init(struct kvm *kvm);
- bool kvm_riscv_stage2_vmid_ver_changed(struct kvm_vmid *vmid);
- void kvm_riscv_stage2_vmid_update(struct kvm_vcpu *vcpu);
+ void kvm_riscv_gstage_vmid_detect(void);
+ unsigned long kvm_riscv_gstage_vmid_bits(void);
+ int kvm_riscv_gstage_vmid_init(struct kvm *kvm);
+ bool kvm_riscv_gstage_vmid_ver_changed(struct kvm_vmid *vmid);
+ void kvm_riscv_gstage_vmid_update(struct kvm_vcpu *vcpu);
 
  void __kvm_riscv_unpriv_trap(void);
 
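The header above adds a per-VCPU HFENCE request queue: a fixed array of `KVM_RISCV_VCPU_MAX_HFENCE` entries indexed by `hfence_head`/`hfence_tail`. The sketch below models that fixed-size ring in plain C to show the head/tail arithmetic; the capacity, field names, and the "fall back to a full flush when full" comment are assumptions of this illustration (the real queue is protected by `hfence_lock` and drained from VCPU request processing).

```c
#include <assert.h>
#include <stdbool.h>

/* Toy single-producer model of the HFENCE ring. The kernel uses 64 slots;
 * a small capacity keeps the demo readable. */
#define MAX_HFENCE 4

struct hfence {
    int type;                         /* mirrors enum kvm_riscv_hfence_type */
    unsigned long addr, size, order;
};

static struct hfence queue[MAX_HFENCE];
static unsigned long head, tail;      /* head: next to serve, tail: next free */

static bool hfence_enqueue(const struct hfence *h)
{
    unsigned long next = (tail + 1) % MAX_HFENCE;

    if (next == head)                 /* full: caller would do a full flush */
        return false;
    queue[tail] = *h;
    tail = next;
    return true;
}

static bool hfence_dequeue(struct hfence *out)
{
    if (head == tail)                 /* empty */
        return false;
    *out = queue[head];
    head = (head + 1) % MAX_HFENCE;
    return true;
}
```

One slot is sacrificed to distinguish full from empty, so a ring of N entries holds N-1 pending fences; requests drain in FIFO order.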
+20
arch/riscv/include/uapi/asm/kvm.h
···
      __u64 state;
  };
 
+ /*
+  * ISA extension IDs specific to KVM. This is not the same as the host ISA
+  * extension IDs as that is internal to the host and should not be exposed
+  * to the guest. This should always be contiguous to keep the mapping simple
+  * in KVM implementation.
+  */
+ enum KVM_RISCV_ISA_EXT_ID {
+     KVM_RISCV_ISA_EXT_A = 0,
+     KVM_RISCV_ISA_EXT_C,
+     KVM_RISCV_ISA_EXT_D,
+     KVM_RISCV_ISA_EXT_F,
+     KVM_RISCV_ISA_EXT_H,
+     KVM_RISCV_ISA_EXT_I,
+     KVM_RISCV_ISA_EXT_M,
+     KVM_RISCV_ISA_EXT_MAX,
+ };
+
  /* Possible states for kvm_riscv_timer */
  #define KVM_RISCV_TIMER_STATE_OFF    0
  #define KVM_RISCV_TIMER_STATE_ON     1
···
  #define KVM_REG_RISCV_FP_D           (0x06 << KVM_REG_RISCV_TYPE_SHIFT)
  #define KVM_REG_RISCV_FP_D_REG(name) \
      (offsetof(struct __riscv_d_ext_state, name) / sizeof(__u64))
+
+ /* ISA Extension registers are mapped as type 7 */
+ #define KVM_REG_RISCV_ISA_EXT        (0x07 << KVM_REG_RISCV_TYPE_SHIFT)
 
  #endif
 
+7 -4
arch/riscv/kvm/main.c
···
          return -ENODEV;
      }
 
-     kvm_riscv_stage2_mode_detect();
+     kvm_riscv_gstage_mode_detect();
 
-     kvm_riscv_stage2_vmid_detect();
+     kvm_riscv_gstage_vmid_detect();
 
      kvm_info("hypervisor extension available\n");
 
-     switch (kvm_riscv_stage2_mode()) {
+     switch (kvm_riscv_gstage_mode()) {
      case HGATP_MODE_SV32X4:
          str = "Sv32x4";
          break;
···
      case HGATP_MODE_SV48X4:
          str = "Sv48x4";
          break;
+     case HGATP_MODE_SV57X4:
+         str = "Sv57x4";
+         break;
      default:
          return -ENODEV;
      }
      kvm_info("using %s G-stage page table format\n", str);
 
-     kvm_info("VMID %ld bits available\n", kvm_riscv_stage2_vmid_bits());
+     kvm_info("VMID %ld bits available\n", kvm_riscv_gstage_vmid_bits());
 
      return 0;
  }
+139 -127
arch/riscv/kvm/mmu.c
···
  #include <asm/csr.h>
  #include <asm/page.h>
  #include <asm/pgtable.h>
- #include <asm/sbi.h>
 
  #ifdef CONFIG_64BIT
- static unsigned long stage2_mode = (HGATP_MODE_SV39X4 << HGATP_MODE_SHIFT);
- static unsigned long stage2_pgd_levels = 3;
- #define stage2_index_bits    9
+ static unsigned long gstage_mode = (HGATP_MODE_SV39X4 << HGATP_MODE_SHIFT);
+ static unsigned long gstage_pgd_levels = 3;
+ #define gstage_index_bits    9
  #else
- static unsigned long stage2_mode = (HGATP_MODE_SV32X4 << HGATP_MODE_SHIFT);
- static unsigned long stage2_pgd_levels = 2;
- #define stage2_index_bits    10
+ static unsigned long gstage_mode = (HGATP_MODE_SV32X4 << HGATP_MODE_SHIFT);
+ static unsigned long gstage_pgd_levels = 2;
+ #define gstage_index_bits    10
  #endif
 
- #define stage2_pgd_xbits     2
- #define stage2_pgd_size      (1UL << (HGATP_PAGE_SHIFT + stage2_pgd_xbits))
- #define stage2_gpa_bits      (HGATP_PAGE_SHIFT + \
-                               (stage2_pgd_levels * stage2_index_bits) + \
-                               stage2_pgd_xbits)
- #define stage2_gpa_size      ((gpa_t)(1ULL << stage2_gpa_bits))
+ #define gstage_pgd_xbits     2
+ #define gstage_pgd_size      (1UL << (HGATP_PAGE_SHIFT + gstage_pgd_xbits))
+ #define gstage_gpa_bits      (HGATP_PAGE_SHIFT + \
+                               (gstage_pgd_levels * gstage_index_bits) + \
+                               gstage_pgd_xbits)
+ #define gstage_gpa_size      ((gpa_t)(1ULL << gstage_gpa_bits))
 
- #define stage2_pte_leaf(__ptep)  \
+ #define gstage_pte_leaf(__ptep)  \
      (pte_val(*(__ptep)) & (_PAGE_READ | _PAGE_WRITE | _PAGE_EXEC))
 
- static inline unsigned long stage2_pte_index(gpa_t addr, u32 level)
+ static inline unsigned long gstage_pte_index(gpa_t addr, u32 level)
  {
      unsigned long mask;
-     unsigned long shift = HGATP_PAGE_SHIFT + (stage2_index_bits * level);
+     unsigned long shift = HGATP_PAGE_SHIFT + (gstage_index_bits * level);
 
-     if (level == (stage2_pgd_levels - 1))
-         mask = (PTRS_PER_PTE * (1UL << stage2_pgd_xbits)) - 1;
+     if (level == (gstage_pgd_levels - 1))
+         mask = (PTRS_PER_PTE * (1UL << gstage_pgd_xbits)) - 1;
      else
          mask = PTRS_PER_PTE - 1;
 
      return (addr >> shift) & mask;
  }
 
- static inline unsigned long stage2_pte_page_vaddr(pte_t pte)
+ static inline unsigned long gstage_pte_page_vaddr(pte_t pte)
  {
      return (unsigned long)pfn_to_virt(pte_val(pte) >> _PAGE_PFN_SHIFT);
  }
 
- static int stage2_page_size_to_level(unsigned long page_size, u32 *out_level)
+ static int gstage_page_size_to_level(unsigned long page_size, u32 *out_level)
  {
      u32 i;
      unsigned long psz = 1UL << 12;
 
-     for (i = 0; i < stage2_pgd_levels; i++) {
-         if (page_size == (psz << (i * stage2_index_bits))) {
+     for (i = 0; i < gstage_pgd_levels; i++) {
+         if (page_size == (psz << (i * gstage_index_bits))) {
              *out_level = i;
              return 0;
          }
···
      return -EINVAL;
  }
 
- static int stage2_level_to_page_size(u32 level, unsigned long *out_pgsize)
+ static int gstage_level_to_page_order(u32 level, unsigned long *out_pgorder)
  {
-     if (stage2_pgd_levels < level)
+     if (gstage_pgd_levels < level)
          return -EINVAL;
 
-     *out_pgsize = 1UL << (12 + (level * stage2_index_bits));
-
+     *out_pgorder = 12 + (level * gstage_index_bits);
      return 0;
  }
 
- static bool stage2_get_leaf_entry(struct kvm *kvm, gpa_t addr,
+ static int gstage_level_to_page_size(u32 level, unsigned long *out_pgsize)
+ {
+     int rc;
+     unsigned long page_order = PAGE_SHIFT;
+
+     rc = gstage_level_to_page_order(level, &page_order);
+     if (rc)
+         return rc;
+
+     *out_pgsize = BIT(page_order);
+     return 0;
+ }
+
+ static bool gstage_get_leaf_entry(struct kvm *kvm, gpa_t addr,
                                    pte_t **ptepp, u32 *ptep_level)
  {
      pte_t *ptep;
-     u32 current_level = stage2_pgd_levels - 1;
+     u32 current_level = gstage_pgd_levels - 1;
 
      *ptep_level = current_level;
      ptep = (pte_t *)kvm->arch.pgd;
-     ptep = &ptep[stage2_pte_index(addr, current_level)];
+     ptep = &ptep[gstage_pte_index(addr, current_level)];
      while (ptep && pte_val(*ptep)) {
-         if (stage2_pte_leaf(ptep)) {
+         if (gstage_pte_leaf(ptep)) {
              *ptep_level = current_level;
              *ptepp = ptep;
              return true;
···
          if (current_level) {
              current_level--;
              *ptep_level = current_level;
-             ptep = (pte_t *)stage2_pte_page_vaddr(*ptep);
-             ptep = &ptep[stage2_pte_index(addr, current_level)];
+             ptep = (pte_t *)gstage_pte_page_vaddr(*ptep);
+             ptep = &ptep[gstage_pte_index(addr, current_level)];
          } else {
              ptep = NULL;
          }
···
      return false;
  }
 
- static void stage2_remote_tlb_flush(struct kvm *kvm, u32 level, gpa_t addr)
+ static void gstage_remote_tlb_flush(struct kvm *kvm, u32 level, gpa_t addr)
  {
-     unsigned long size = PAGE_SIZE;
-     struct kvm_vmid *vmid = &kvm->arch.vmid;
+     unsigned long order = PAGE_SHIFT;
 
-     if (stage2_level_to_page_size(level, &size))
+     if (gstage_level_to_page_order(level, &order))
          return;
-     addr &= ~(size - 1);
+     addr &= ~(BIT(order) - 1);
 
-     /*
-      * TODO: Instead of cpu_online_mask, we should only target CPUs
-      * where the Guest/VM is running.
-      */
-     preempt_disable();
-     sbi_remote_hfence_gvma_vmid(cpu_online_mask, addr, size,
-                                 READ_ONCE(vmid->vmid));
-     preempt_enable();
+     kvm_riscv_hfence_gvma_vmid_gpa(kvm, -1UL, 0, addr, BIT(order), order);
  }
 
- static int stage2_set_pte(struct kvm *kvm, u32 level,
+ static int gstage_set_pte(struct kvm *kvm, u32 level,
                            struct kvm_mmu_memory_cache *pcache,
                            gpa_t addr, const pte_t *new_pte)
  {
-     u32 current_level = stage2_pgd_levels - 1;
+     u32 current_level = gstage_pgd_levels - 1;
      pte_t *next_ptep = (pte_t *)kvm->arch.pgd;
-     pte_t *ptep = &next_ptep[stage2_pte_index(addr, current_level)];
+     pte_t *ptep = &next_ptep[gstage_pte_index(addr, current_level)];
 
      if (current_level < level)
          return -EINVAL;
 
      while (current_level != level) {
-         if (stage2_pte_leaf(ptep))
+         if (gstage_pte_leaf(ptep))
              return -EEXIST;
 
          if (!pte_val(*ptep)) {
···
              *ptep = pfn_pte(PFN_DOWN(__pa(next_ptep)),
                              __pgprot(_PAGE_TABLE));
          } else {
-             if (stage2_pte_leaf(ptep))
+             if (gstage_pte_leaf(ptep))
                  return -EEXIST;
-             next_ptep = (pte_t *)stage2_pte_page_vaddr(*ptep);
+             next_ptep = (pte_t *)gstage_pte_page_vaddr(*ptep);
          }
 
          current_level--;
-         ptep = &next_ptep[stage2_pte_index(addr, current_level)];
+         ptep = &next_ptep[gstage_pte_index(addr, current_level)];
      }
 
      *ptep = *new_pte;
-     if (stage2_pte_leaf(ptep))
-         stage2_remote_tlb_flush(kvm, current_level, addr);
+     if (gstage_pte_leaf(ptep))
+         gstage_remote_tlb_flush(kvm, current_level, addr);
 
      return 0;
  }
 
- static int stage2_map_page(struct kvm *kvm,
+ static int gstage_map_page(struct kvm *kvm,
                             struct kvm_mmu_memory_cache *pcache,
                             gpa_t gpa, phys_addr_t hpa,
                             unsigned long page_size,
···
      pte_t new_pte;
      pgprot_t prot;
 
-     ret = stage2_page_size_to_level(page_size, &level);
+     ret = gstage_page_size_to_level(page_size, &level);
      if (ret)
          return ret;
···
       * PTE so that software can update these bits.
       *
       * We support both options mentioned above. To achieve this, we
-      * always set 'A' and 'D' PTE bits at time of creating stage2
+      * always set 'A' and 'D' PTE bits at time of creating G-stage
       * mapping. To support KVM dirty page logging with both options
-      * mentioned above, we will write-protect stage2 PTEs to track
+      * mentioned above, we will write-protect G-stage PTEs to track
       * dirty pages.
       */
···
      new_pte = pfn_pte(PFN_DOWN(hpa), prot);
      new_pte = pte_mkdirty(new_pte);
 
-     return stage2_set_pte(kvm, level, pcache, gpa, &new_pte);
+     return gstage_set_pte(kvm, level, pcache, gpa, &new_pte);
  }
 
- enum stage2_op {
-     STAGE2_OP_NOP = 0,   /* Nothing */
-     STAGE2_OP_CLEAR,     /* Clear/Unmap */
-     STAGE2_OP_WP,        /* Write-protect */
+ enum gstage_op {
+     GSTAGE_OP_NOP = 0,   /* Nothing */
+     GSTAGE_OP_CLEAR,     /* Clear/Unmap */
+     GSTAGE_OP_WP,        /* Write-protect */
  };
 
- static void stage2_op_pte(struct kvm *kvm, gpa_t addr,
-                           pte_t *ptep, u32 ptep_level, enum stage2_op op)
+ static void gstage_op_pte(struct kvm *kvm, gpa_t addr,
+                           pte_t *ptep, u32 ptep_level, enum gstage_op op)
  {
      int i, ret;
      pte_t *next_ptep;
      u32 next_ptep_level;
      unsigned long next_page_size, page_size;
 
-     ret = stage2_level_to_page_size(ptep_level, &page_size);
+     ret = gstage_level_to_page_size(ptep_level, &page_size);
      if (ret)
          return;
···
      if (!pte_val(*ptep))
          return;
 
-     if (ptep_level && !stage2_pte_leaf(ptep)) {
-         next_ptep = (pte_t *)stage2_pte_page_vaddr(*ptep);
+     if (ptep_level && !gstage_pte_leaf(ptep)) {
+         next_ptep = (pte_t *)gstage_pte_page_vaddr(*ptep);
          next_ptep_level = ptep_level - 1;
-         ret = stage2_level_to_page_size(next_ptep_level,
+         ret = gstage_level_to_page_size(next_ptep_level,
                                          &next_page_size);
          if (ret)
              return;
 
-         if (op == STAGE2_OP_CLEAR)
+         if (op == GSTAGE_OP_CLEAR)
              set_pte(ptep, __pte(0));
          for (i = 0; i < PTRS_PER_PTE; i++)
-             stage2_op_pte(kvm, addr + i * next_page_size,
+             gstage_op_pte(kvm, addr + i * next_page_size,
                            &next_ptep[i], next_ptep_level, op);
-         if (op == STAGE2_OP_CLEAR)
+         if (op == GSTAGE_OP_CLEAR)
              put_page(virt_to_page(next_ptep));
      } else {
-         if (op == STAGE2_OP_CLEAR)
+         if (op == GSTAGE_OP_CLEAR)
              set_pte(ptep, __pte(0));
-         else if (op == STAGE2_OP_WP)
+         else if (op == GSTAGE_OP_WP)
              set_pte(ptep, __pte(pte_val(*ptep) & ~_PAGE_WRITE));
-         stage2_remote_tlb_flush(kvm, ptep_level, addr);
+         gstage_remote_tlb_flush(kvm, ptep_level, addr);
      }
  }
 
- static void stage2_unmap_range(struct kvm *kvm, gpa_t start,
+ static void gstage_unmap_range(struct kvm *kvm, gpa_t start,
                                 gpa_t size, bool may_block)
  {
      int ret;
···
      gpa_t addr = start, end = start + size;
 
      while (addr < end) {
-         found_leaf = stage2_get_leaf_entry(kvm, addr,
+         found_leaf = gstage_get_leaf_entry(kvm, addr,
                                             &ptep, &ptep_level);
-         ret = stage2_level_to_page_size(ptep_level, &page_size);
+         ret = gstage_level_to_page_size(ptep_level, &page_size);
          if (ret)
              break;
···
              goto next;
 
          if (!(addr & (page_size - 1)) && ((end - addr) >= page_size))
-             stage2_op_pte(kvm, addr, ptep,
-                           ptep_level, STAGE2_OP_CLEAR);
+             gstage_op_pte(kvm, addr, ptep,
+                           ptep_level, GSTAGE_OP_CLEAR);
 
  next:
          addr += page_size;
···
      }
  }
 
- static void stage2_wp_range(struct kvm *kvm, gpa_t start, gpa_t end)
+ static void gstage_wp_range(struct kvm *kvm, gpa_t start, gpa_t end)
  {
      int ret;
      pte_t *ptep;
···
      unsigned long page_size;
 
      while (addr < end) {
-         found_leaf = stage2_get_leaf_entry(kvm, addr,
+         found_leaf = gstage_get_leaf_entry(kvm, addr,
                                             &ptep, &ptep_level);
-         ret = stage2_level_to_page_size(ptep_level, &page_size);
+         ret = gstage_level_to_page_size(ptep_level, &page_size);
          if (ret)
              break;
···
              goto next;
 
          if (!(addr & (page_size - 1)) && ((end - addr) >= page_size))
-             stage2_op_pte(kvm, addr, ptep,
-                           ptep_level, STAGE2_OP_WP);
+             gstage_op_pte(kvm, addr, ptep,
+                           ptep_level, GSTAGE_OP_WP);
 
  next:
          addr += page_size;
      }
  }
 
- static void stage2_wp_memory_region(struct kvm *kvm, int slot)
+ static void gstage_wp_memory_region(struct kvm *kvm, int slot)
  {
      struct kvm_memslots *slots = kvm_memslots(kvm);
      struct kvm_memory_slot *memslot = id_to_memslot(slots, slot);
···
      phys_addr_t end = (memslot->base_gfn + memslot->npages) << PAGE_SHIFT;
 
      spin_lock(&kvm->mmu_lock);
-     stage2_wp_range(kvm, start, end);
+     gstage_wp_range(kvm, start, end);
      spin_unlock(&kvm->mmu_lock);
      kvm_flush_remote_tlbs(kvm);
  }
 
- static int stage2_ioremap(struct kvm *kvm, gpa_t gpa, phys_addr_t hpa,
+ static int gstage_ioremap(struct kvm *kvm, gpa_t gpa, phys_addr_t hpa,
                            unsigned long size, bool writable)
  {
      pte_t pte;
···
      if (!writable)
          pte = pte_wrprotect(pte);
 
-     ret = kvm_mmu_topup_memory_cache(&pcache, stage2_pgd_levels);
+     ret = kvm_mmu_topup_memory_cache(&pcache, gstage_pgd_levels);
      if (ret)
          goto out;
 
      spin_lock(&kvm->mmu_lock);
-     ret = stage2_set_pte(kvm, 0, &pcache, addr, &pte);
+     ret = gstage_set_pte(kvm, 0, &pcache, addr, &pte);
      spin_unlock(&kvm->mmu_lock);
      if (ret)
          goto out;
···
      phys_addr_t start = (base_gfn + __ffs(mask)) << PAGE_SHIFT;
      phys_addr_t end = (base_gfn + __fls(mask) + 1) << PAGE_SHIFT;
 
-     stage2_wp_range(kvm, start, end);
+     gstage_wp_range(kvm, start, end);
  }
 
  void kvm_arch_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot)
···
 
  void kvm_arch_flush_shadow_all(struct kvm *kvm)
  {
-     kvm_riscv_stage2_free_pgd(kvm);
+     kvm_riscv_gstage_free_pgd(kvm);
  }
 
  void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
···
      phys_addr_t size = slot->npages << PAGE_SHIFT;
 
      spin_lock(&kvm->mmu_lock);
-     stage2_unmap_range(kvm, gpa, size, false);
+     gstage_unmap_range(kvm, gpa, size, false);
      spin_unlock(&kvm->mmu_lock);
  }
···
       * the memory slot is write protected.
       */
      if (change != KVM_MR_DELETE && new->flags & KVM_MEM_LOG_DIRTY_PAGES)
-         stage2_wp_memory_region(kvm, new->id);
+         gstage_wp_memory_region(kvm, new->id);
  }
 
  int kvm_arch_prepare_memory_region(struct kvm *kvm,
···
       * space addressable by the KVM guest GPA space.
       */
      if ((new->base_gfn + new->npages) >=
-         (stage2_gpa_size >> PAGE_SHIFT))
+         (gstage_gpa_size >> PAGE_SHIFT))
          return -EFAULT;
 
      hva = new->userspace_addr;
···
          goto out;
      }
 
-     ret = stage2_ioremap(kvm, gpa, pa,
+     ret = gstage_ioremap(kvm, gpa, pa,
                           vm_end - vm_start, writable);
      if (ret)
          break;
···
 
      spin_lock(&kvm->mmu_lock);
      if (ret)
-         stage2_unmap_range(kvm, base_gpa, size, false);
+         gstage_unmap_range(kvm, base_gpa, size, false);
      spin_unlock(&kvm->mmu_lock);
 
  out:
···
      if (!kvm->arch.pgd)
          return false;
 
-     stage2_unmap_range(kvm, range->start << PAGE_SHIFT,
+     gstage_unmap_range(kvm, range->start << PAGE_SHIFT,
                         (range->end - range->start) << PAGE_SHIFT,
                         range->may_block);
      return false;
···
 
      WARN_ON(range->end - range->start != 1);
 
-     ret = stage2_map_page(kvm, NULL, range->start << PAGE_SHIFT,
+     ret = gstage_map_page(kvm, NULL, range->start << PAGE_SHIFT,
                            __pfn_to_phys(pfn), PAGE_SIZE, true, true);
      if (ret) {
-         kvm_debug("Failed to map stage2 page (error %d)\n", ret);
+         kvm_debug("Failed to map G-stage page (error %d)\n", ret);
          return true;
      }
···
 
      WARN_ON(size != PAGE_SIZE && size != PMD_SIZE && size != PGDIR_SIZE);
 
-     if (!stage2_get_leaf_entry(kvm, range->start << PAGE_SHIFT,
+     if (!gstage_get_leaf_entry(kvm, range->start << PAGE_SHIFT,
                                 &ptep, &ptep_level))
          return false;
···
 
      WARN_ON(size != PAGE_SIZE && size != PMD_SIZE && size != PGDIR_SIZE);
 
-     if (!stage2_get_leaf_entry(kvm, range->start << PAGE_SHIFT,
+     if (!gstage_get_leaf_entry(kvm, range->start << PAGE_SHIFT,
                                 &ptep, &ptep_level))
          return false;
 
      return pte_young(*ptep);
  }
 
- int
kvm_riscv_stage2_map(struct kvm_vcpu *vcpu, 605 + int kvm_riscv_gstage_map(struct kvm_vcpu *vcpu, 609 606 struct kvm_memory_slot *memslot, 610 607 gpa_t gpa, unsigned long hva, bool is_write) 611 608 { ··· 651 648 } 652 649 653 650 /* We need minimum second+third level pages */ 654 - ret = kvm_mmu_topup_memory_cache(pcache, stage2_pgd_levels); 651 + ret = kvm_mmu_topup_memory_cache(pcache, gstage_pgd_levels); 655 652 if (ret) { 656 - kvm_err("Failed to topup stage2 cache\n"); 653 + kvm_err("Failed to topup G-stage cache\n"); 657 654 return ret; 658 655 } 659 656 ··· 683 680 if (writeable) { 684 681 kvm_set_pfn_dirty(hfn); 685 682 mark_page_dirty(kvm, gfn); 686 - ret = stage2_map_page(kvm, pcache, gpa, hfn << PAGE_SHIFT, 683 + ret = gstage_map_page(kvm, pcache, gpa, hfn << PAGE_SHIFT, 687 684 vma_pagesize, false, true); 688 685 } else { 689 - ret = stage2_map_page(kvm, pcache, gpa, hfn << PAGE_SHIFT, 686 + ret = gstage_map_page(kvm, pcache, gpa, hfn << PAGE_SHIFT, 690 687 vma_pagesize, true, true); 691 688 } 692 689 693 690 if (ret) 694 - kvm_err("Failed to map in stage2\n"); 691 + kvm_err("Failed to map in G-stage\n"); 695 692 696 693 out_unlock: 697 694 spin_unlock(&kvm->mmu_lock); ··· 700 697 return ret; 701 698 } 702 699 703 - int kvm_riscv_stage2_alloc_pgd(struct kvm *kvm) 700 + int kvm_riscv_gstage_alloc_pgd(struct kvm *kvm) 704 701 { 705 702 struct page *pgd_page; 706 703 ··· 710 707 } 711 708 712 709 pgd_page = alloc_pages(GFP_KERNEL | __GFP_ZERO, 713 - get_order(stage2_pgd_size)); 710 + get_order(gstage_pgd_size)); 714 711 if (!pgd_page) 715 712 return -ENOMEM; 716 713 kvm->arch.pgd = page_to_virt(pgd_page); ··· 719 716 return 0; 720 717 } 721 718 722 - void kvm_riscv_stage2_free_pgd(struct kvm *kvm) 719 + void kvm_riscv_gstage_free_pgd(struct kvm *kvm) 723 720 { 724 721 void *pgd = NULL; 725 722 726 723 spin_lock(&kvm->mmu_lock); 727 724 if (kvm->arch.pgd) { 728 - stage2_unmap_range(kvm, 0UL, stage2_gpa_size, false); 725 + gstage_unmap_range(kvm, 0UL, 
gstage_gpa_size, false); 729 726 pgd = READ_ONCE(kvm->arch.pgd); 730 727 kvm->arch.pgd = NULL; 731 728 kvm->arch.pgd_phys = 0; ··· 733 730 spin_unlock(&kvm->mmu_lock); 734 731 735 732 if (pgd) 736 - free_pages((unsigned long)pgd, get_order(stage2_pgd_size)); 733 + free_pages((unsigned long)pgd, get_order(gstage_pgd_size)); 737 734 } 738 735 739 - void kvm_riscv_stage2_update_hgatp(struct kvm_vcpu *vcpu) 736 + void kvm_riscv_gstage_update_hgatp(struct kvm_vcpu *vcpu) 740 737 { 741 - unsigned long hgatp = stage2_mode; 738 + unsigned long hgatp = gstage_mode; 742 739 struct kvm_arch *k = &vcpu->kvm->arch; 743 740 744 741 hgatp |= (READ_ONCE(k->vmid.vmid) << HGATP_VMID_SHIFT) & ··· 747 744 748 745 csr_write(CSR_HGATP, hgatp); 749 746 750 - if (!kvm_riscv_stage2_vmid_bits()) 751 - __kvm_riscv_hfence_gvma_all(); 747 + if (!kvm_riscv_gstage_vmid_bits()) 748 + kvm_riscv_local_hfence_gvma_all(); 752 749 } 753 750 754 - void kvm_riscv_stage2_mode_detect(void) 751 + void kvm_riscv_gstage_mode_detect(void) 755 752 { 756 753 #ifdef CONFIG_64BIT 757 - /* Try Sv48x4 stage2 mode */ 754 + /* Try Sv57x4 G-stage mode */ 755 + csr_write(CSR_HGATP, HGATP_MODE_SV57X4 << HGATP_MODE_SHIFT); 756 + if ((csr_read(CSR_HGATP) >> HGATP_MODE_SHIFT) == HGATP_MODE_SV57X4) { 757 + gstage_mode = (HGATP_MODE_SV57X4 << HGATP_MODE_SHIFT); 758 + gstage_pgd_levels = 5; 759 + goto skip_sv48x4_test; 760 + } 761 + 762 + /* Try Sv48x4 G-stage mode */ 758 763 csr_write(CSR_HGATP, HGATP_MODE_SV48X4 << HGATP_MODE_SHIFT); 759 764 if ((csr_read(CSR_HGATP) >> HGATP_MODE_SHIFT) == HGATP_MODE_SV48X4) { 760 - stage2_mode = (HGATP_MODE_SV48X4 << HGATP_MODE_SHIFT); 761 - stage2_pgd_levels = 4; 765 + gstage_mode = (HGATP_MODE_SV48X4 << HGATP_MODE_SHIFT); 766 + gstage_pgd_levels = 4; 762 767 } 763 - csr_write(CSR_HGATP, 0); 768 + skip_sv48x4_test: 764 769 765 - __kvm_riscv_hfence_gvma_all(); 770 + csr_write(CSR_HGATP, 0); 771 + kvm_riscv_local_hfence_gvma_all(); 766 772 #endif 767 773 } 768 774 769 - unsigned long 
kvm_riscv_stage2_mode(void) 775 + unsigned long kvm_riscv_gstage_mode(void) 770 776 { 771 - return stage2_mode >> HGATP_MODE_SHIFT; 777 + return gstage_mode >> HGATP_MODE_SHIFT; 772 778 } 773 779 774 - int kvm_riscv_stage2_gpa_bits(void) 780 + int kvm_riscv_gstage_gpa_bits(void) 775 781 { 776 - return stage2_gpa_bits; 782 + return gstage_gpa_bits; 777 783 }
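The `gstage_mode_detect()` hunk above relies on hgatp's MODE field being WARL: write a candidate mode, read it back, and fall through to the next smaller mode if the write did not stick (Sv57x4 first, then Sv48x4). Below is a minimal host-independent sketch of that probe pattern with a mocked CSR; the constant values are invented stand-ins, not the real `HGATP_MODE_*` encodings from `asm/csr.h`.

```c
#include <assert.h>

/* Invented stand-ins for HGATP_MODE_* / HGATP_MODE_SHIFT (illustration only). */
#define MODE_SV39X4 8UL
#define MODE_SV48X4 9UL
#define MODE_SV57X4 10UL
#define MODE_SHIFT  60

/* Mocked hgatp CSR for a machine that supports modes only up to Sv48x4:
 * a write with an unsupported MODE value leaves the register unchanged
 * (WARL behaviour). */
static unsigned long fake_hgatp;

static void csr_write_hgatp(unsigned long val)
{
	unsigned long mode = val >> MODE_SHIFT;

	if (mode == 0 || mode == MODE_SV39X4 || mode == MODE_SV48X4)
		fake_hgatp = val;
}

static unsigned long csr_read_hgatp(void)
{
	return fake_hgatp;
}

/* Probe from the largest candidate downwards, as the kernel hunk does for
 * Sv57x4 then Sv48x4; Sv39x4 is the architectural minimum fallback. */
static unsigned long detect_gstage_mode(int *pgd_levels)
{
	static const struct { unsigned long mode; int levels; } try[] = {
		{ MODE_SV57X4, 5 }, { MODE_SV48X4, 4 },
	};

	for (int i = 0; i < 2; i++) {
		csr_write_hgatp(try[i].mode << MODE_SHIFT);
		if ((csr_read_hgatp() >> MODE_SHIFT) == try[i].mode) {
			csr_write_hgatp(0); /* real code also flushes the TLB */
			*pgd_levels = try[i].levels;
			return try[i].mode;
		}
	}
	csr_write_hgatp(0);
	*pgd_levels = 3;
	return MODE_SV39X4;
}
```

Against this fake hardware the probe rejects Sv57x4 and settles on Sv48x4 with four page-table levels, mirroring what the new `skip_sv48x4_test` flow does on a real Sv48x4-only machine.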
-74
arch/riscv/kvm/tlb.S
··· 1 - /* SPDX-License-Identifier: GPL-2.0 */ 2 - /* 3 - * Copyright (C) 2019 Western Digital Corporation or its affiliates. 4 - * 5 - * Authors: 6 - * Anup Patel <anup.patel@wdc.com> 7 - */ 8 - 9 - #include <linux/linkage.h> 10 - #include <asm/asm.h> 11 - 12 - .text 13 - .altmacro 14 - .option norelax 15 - 16 - /* 17 - * Instruction encoding of hfence.gvma is: 18 - * HFENCE.GVMA rs1, rs2 19 - * HFENCE.GVMA zero, rs2 20 - * HFENCE.GVMA rs1 21 - * HFENCE.GVMA 22 - * 23 - * rs1!=zero and rs2!=zero ==> HFENCE.GVMA rs1, rs2 24 - * rs1==zero and rs2!=zero ==> HFENCE.GVMA zero, rs2 25 - * rs1!=zero and rs2==zero ==> HFENCE.GVMA rs1 26 - * rs1==zero and rs2==zero ==> HFENCE.GVMA 27 - * 28 - * Instruction encoding of HFENCE.GVMA is: 29 - * 0110001 rs2(5) rs1(5) 000 00000 1110011 30 - */ 31 - 32 - ENTRY(__kvm_riscv_hfence_gvma_vmid_gpa) 33 - /* 34 - * rs1 = a0 (GPA >> 2) 35 - * rs2 = a1 (VMID) 36 - * HFENCE.GVMA a0, a1 37 - * 0110001 01011 01010 000 00000 1110011 38 - */ 39 - .word 0x62b50073 40 - ret 41 - ENDPROC(__kvm_riscv_hfence_gvma_vmid_gpa) 42 - 43 - ENTRY(__kvm_riscv_hfence_gvma_vmid) 44 - /* 45 - * rs1 = zero 46 - * rs2 = a0 (VMID) 47 - * HFENCE.GVMA zero, a0 48 - * 0110001 01010 00000 000 00000 1110011 49 - */ 50 - .word 0x62a00073 51 - ret 52 - ENDPROC(__kvm_riscv_hfence_gvma_vmid) 53 - 54 - ENTRY(__kvm_riscv_hfence_gvma_gpa) 55 - /* 56 - * rs1 = a0 (GPA >> 2) 57 - * rs2 = zero 58 - * HFENCE.GVMA a0 59 - * 0110001 00000 01010 000 00000 1110011 60 - */ 61 - .word 0x62050073 62 - ret 63 - ENDPROC(__kvm_riscv_hfence_gvma_gpa) 64 - 65 - ENTRY(__kvm_riscv_hfence_gvma_all) 66 - /* 67 - * rs1 = zero 68 - * rs2 = zero 69 - * HFENCE.GVMA 70 - * 0110001 00000 00000 000 00000 1110011 71 - */ 72 - .word 0x62000073 73 - ret 74 - ENDPROC(__kvm_riscv_hfence_gvma_all)
+461
arch/riscv/kvm/tlb.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* 3 + * Copyright (c) 2022 Ventana Micro Systems Inc. 4 + */ 5 + 6 + #include <linux/bitmap.h> 7 + #include <linux/cpumask.h> 8 + #include <linux/errno.h> 9 + #include <linux/err.h> 10 + #include <linux/module.h> 11 + #include <linux/smp.h> 12 + #include <linux/kvm_host.h> 13 + #include <asm/cacheflush.h> 14 + #include <asm/csr.h> 15 + 16 + /* 17 + * Instruction encoding of hfence.gvma is: 18 + * HFENCE.GVMA rs1, rs2 19 + * HFENCE.GVMA zero, rs2 20 + * HFENCE.GVMA rs1 21 + * HFENCE.GVMA 22 + * 23 + * rs1!=zero and rs2!=zero ==> HFENCE.GVMA rs1, rs2 24 + * rs1==zero and rs2!=zero ==> HFENCE.GVMA zero, rs2 25 + * rs1!=zero and rs2==zero ==> HFENCE.GVMA rs1 26 + * rs1==zero and rs2==zero ==> HFENCE.GVMA 27 + * 28 + * Instruction encoding of HFENCE.GVMA is: 29 + * 0110001 rs2(5) rs1(5) 000 00000 1110011 30 + */ 31 + 32 + void kvm_riscv_local_hfence_gvma_vmid_gpa(unsigned long vmid, 33 + gpa_t gpa, gpa_t gpsz, 34 + unsigned long order) 35 + { 36 + gpa_t pos; 37 + 38 + if (PTRS_PER_PTE < (gpsz >> order)) { 39 + kvm_riscv_local_hfence_gvma_vmid_all(vmid); 40 + return; 41 + } 42 + 43 + for (pos = gpa; pos < (gpa + gpsz); pos += BIT(order)) { 44 + /* 45 + * rs1 = a0 (GPA >> 2) 46 + * rs2 = a1 (VMID) 47 + * HFENCE.GVMA a0, a1 48 + * 0110001 01011 01010 000 00000 1110011 49 + */ 50 + asm volatile ("srli a0, %0, 2\n" 51 + "add a1, %1, zero\n" 52 + ".word 0x62b50073\n" 53 + :: "r" (pos), "r" (vmid) 54 + : "a0", "a1", "memory"); 55 + } 56 + } 57 + 58 + void kvm_riscv_local_hfence_gvma_vmid_all(unsigned long vmid) 59 + { 60 + /* 61 + * rs1 = zero 62 + * rs2 = a0 (VMID) 63 + * HFENCE.GVMA zero, a0 64 + * 0110001 01010 00000 000 00000 1110011 65 + */ 66 + asm volatile ("add a0, %0, zero\n" 67 + ".word 0x62a00073\n" 68 + :: "r" (vmid) : "a0", "memory"); 69 + } 70 + 71 + void kvm_riscv_local_hfence_gvma_gpa(gpa_t gpa, gpa_t gpsz, 72 + unsigned long order) 73 + { 74 + gpa_t pos; 75 + 76 + if (PTRS_PER_PTE < (gpsz >> order)) { 77 + 
kvm_riscv_local_hfence_gvma_all(); 78 + return; 79 + } 80 + 81 + for (pos = gpa; pos < (gpa + gpsz); pos += BIT(order)) { 82 + /* 83 + * rs1 = a0 (GPA >> 2) 84 + * rs2 = zero 85 + * HFENCE.GVMA a0 86 + * 0110001 00000 01010 000 00000 1110011 87 + */ 88 + asm volatile ("srli a0, %0, 2\n" 89 + ".word 0x62050073\n" 90 + :: "r" (pos) : "a0", "memory"); 91 + } 92 + } 93 + 94 + void kvm_riscv_local_hfence_gvma_all(void) 95 + { 96 + /* 97 + * rs1 = zero 98 + * rs2 = zero 99 + * HFENCE.GVMA 100 + * 0110001 00000 00000 000 00000 1110011 101 + */ 102 + asm volatile (".word 0x62000073" ::: "memory"); 103 + } 104 + 105 + /* 106 + * Instruction encoding of hfence.gvma is: 107 + * HFENCE.VVMA rs1, rs2 108 + * HFENCE.VVMA zero, rs2 109 + * HFENCE.VVMA rs1 110 + * HFENCE.VVMA 111 + * 112 + * rs1!=zero and rs2!=zero ==> HFENCE.VVMA rs1, rs2 113 + * rs1==zero and rs2!=zero ==> HFENCE.VVMA zero, rs2 114 + * rs1!=zero and rs2==zero ==> HFENCE.VVMA rs1 115 + * rs1==zero and rs2==zero ==> HFENCE.VVMA 116 + * 117 + * Instruction encoding of HFENCE.VVMA is: 118 + * 0010001 rs2(5) rs1(5) 000 00000 1110011 119 + */ 120 + 121 + void kvm_riscv_local_hfence_vvma_asid_gva(unsigned long vmid, 122 + unsigned long asid, 123 + unsigned long gva, 124 + unsigned long gvsz, 125 + unsigned long order) 126 + { 127 + unsigned long pos, hgatp; 128 + 129 + if (PTRS_PER_PTE < (gvsz >> order)) { 130 + kvm_riscv_local_hfence_vvma_asid_all(vmid, asid); 131 + return; 132 + } 133 + 134 + hgatp = csr_swap(CSR_HGATP, vmid << HGATP_VMID_SHIFT); 135 + 136 + for (pos = gva; pos < (gva + gvsz); pos += BIT(order)) { 137 + /* 138 + * rs1 = a0 (GVA) 139 + * rs2 = a1 (ASID) 140 + * HFENCE.VVMA a0, a1 141 + * 0010001 01011 01010 000 00000 1110011 142 + */ 143 + asm volatile ("add a0, %0, zero\n" 144 + "add a1, %1, zero\n" 145 + ".word 0x22b50073\n" 146 + :: "r" (pos), "r" (asid) 147 + : "a0", "a1", "memory"); 148 + } 149 + 150 + csr_write(CSR_HGATP, hgatp); 151 + } 152 + 153 + void 
kvm_riscv_local_hfence_vvma_asid_all(unsigned long vmid, 154 + unsigned long asid) 155 + { 156 + unsigned long hgatp; 157 + 158 + hgatp = csr_swap(CSR_HGATP, vmid << HGATP_VMID_SHIFT); 159 + 160 + /* 161 + * rs1 = zero 162 + * rs2 = a0 (ASID) 163 + * HFENCE.VVMA zero, a0 164 + * 0010001 01010 00000 000 00000 1110011 165 + */ 166 + asm volatile ("add a0, %0, zero\n" 167 + ".word 0x22a00073\n" 168 + :: "r" (asid) : "a0", "memory"); 169 + 170 + csr_write(CSR_HGATP, hgatp); 171 + } 172 + 173 + void kvm_riscv_local_hfence_vvma_gva(unsigned long vmid, 174 + unsigned long gva, unsigned long gvsz, 175 + unsigned long order) 176 + { 177 + unsigned long pos, hgatp; 178 + 179 + if (PTRS_PER_PTE < (gvsz >> order)) { 180 + kvm_riscv_local_hfence_vvma_all(vmid); 181 + return; 182 + } 183 + 184 + hgatp = csr_swap(CSR_HGATP, vmid << HGATP_VMID_SHIFT); 185 + 186 + for (pos = gva; pos < (gva + gvsz); pos += BIT(order)) { 187 + /* 188 + * rs1 = a0 (GVA) 189 + * rs2 = zero 190 + * HFENCE.VVMA a0 191 + * 0010001 00000 01010 000 00000 1110011 192 + */ 193 + asm volatile ("add a0, %0, zero\n" 194 + ".word 0x22050073\n" 195 + :: "r" (pos) : "a0", "memory"); 196 + } 197 + 198 + csr_write(CSR_HGATP, hgatp); 199 + } 200 + 201 + void kvm_riscv_local_hfence_vvma_all(unsigned long vmid) 202 + { 203 + unsigned long hgatp; 204 + 205 + hgatp = csr_swap(CSR_HGATP, vmid << HGATP_VMID_SHIFT); 206 + 207 + /* 208 + * rs1 = zero 209 + * rs2 = zero 210 + * HFENCE.VVMA 211 + * 0010001 00000 00000 000 00000 1110011 212 + */ 213 + asm volatile (".word 0x22000073" ::: "memory"); 214 + 215 + csr_write(CSR_HGATP, hgatp); 216 + } 217 + 218 + void kvm_riscv_local_tlb_sanitize(struct kvm_vcpu *vcpu) 219 + { 220 + unsigned long vmid; 221 + 222 + if (!kvm_riscv_gstage_vmid_bits() || 223 + vcpu->arch.last_exit_cpu == vcpu->cpu) 224 + return; 225 + 226 + /* 227 + * On RISC-V platforms with hardware VMID support, we share same 228 + * VMID for all VCPUs of a particular Guest/VM. 
This means we might 229 + * have stale G-stage TLB entries on the current Host CPU due to 230 + * some other VCPU of the same Guest which ran previously on the 231 + * current Host CPU. 232 + * 233 + * To cleanup stale TLB entries, we simply flush all G-stage TLB 234 + * entries by VMID whenever underlying Host CPU changes for a VCPU. 235 + */ 236 + 237 + vmid = READ_ONCE(vcpu->kvm->arch.vmid.vmid); 238 + kvm_riscv_local_hfence_gvma_vmid_all(vmid); 239 + } 240 + 241 + void kvm_riscv_fence_i_process(struct kvm_vcpu *vcpu) 242 + { 243 + local_flush_icache_all(); 244 + } 245 + 246 + void kvm_riscv_hfence_gvma_vmid_all_process(struct kvm_vcpu *vcpu) 247 + { 248 + struct kvm_vmid *vmid; 249 + 250 + vmid = &vcpu->kvm->arch.vmid; 251 + kvm_riscv_local_hfence_gvma_vmid_all(READ_ONCE(vmid->vmid)); 252 + } 253 + 254 + void kvm_riscv_hfence_vvma_all_process(struct kvm_vcpu *vcpu) 255 + { 256 + struct kvm_vmid *vmid; 257 + 258 + vmid = &vcpu->kvm->arch.vmid; 259 + kvm_riscv_local_hfence_vvma_all(READ_ONCE(vmid->vmid)); 260 + } 261 + 262 + static bool vcpu_hfence_dequeue(struct kvm_vcpu *vcpu, 263 + struct kvm_riscv_hfence *out_data) 264 + { 265 + bool ret = false; 266 + struct kvm_vcpu_arch *varch = &vcpu->arch; 267 + 268 + spin_lock(&varch->hfence_lock); 269 + 270 + if (varch->hfence_queue[varch->hfence_head].type) { 271 + memcpy(out_data, &varch->hfence_queue[varch->hfence_head], 272 + sizeof(*out_data)); 273 + varch->hfence_queue[varch->hfence_head].type = 0; 274 + 275 + varch->hfence_head++; 276 + if (varch->hfence_head == KVM_RISCV_VCPU_MAX_HFENCE) 277 + varch->hfence_head = 0; 278 + 279 + ret = true; 280 + } 281 + 282 + spin_unlock(&varch->hfence_lock); 283 + 284 + return ret; 285 + } 286 + 287 + static bool vcpu_hfence_enqueue(struct kvm_vcpu *vcpu, 288 + const struct kvm_riscv_hfence *data) 289 + { 290 + bool ret = false; 291 + struct kvm_vcpu_arch *varch = &vcpu->arch; 292 + 293 + spin_lock(&varch->hfence_lock); 294 + 295 + if 
(!varch->hfence_queue[varch->hfence_tail].type) { 296 + memcpy(&varch->hfence_queue[varch->hfence_tail], 297 + data, sizeof(*data)); 298 + 299 + varch->hfence_tail++; 300 + if (varch->hfence_tail == KVM_RISCV_VCPU_MAX_HFENCE) 301 + varch->hfence_tail = 0; 302 + 303 + ret = true; 304 + } 305 + 306 + spin_unlock(&varch->hfence_lock); 307 + 308 + return ret; 309 + } 310 + 311 + void kvm_riscv_hfence_process(struct kvm_vcpu *vcpu) 312 + { 313 + struct kvm_riscv_hfence d = { 0 }; 314 + struct kvm_vmid *v = &vcpu->kvm->arch.vmid; 315 + 316 + while (vcpu_hfence_dequeue(vcpu, &d)) { 317 + switch (d.type) { 318 + case KVM_RISCV_HFENCE_UNKNOWN: 319 + break; 320 + case KVM_RISCV_HFENCE_GVMA_VMID_GPA: 321 + kvm_riscv_local_hfence_gvma_vmid_gpa( 322 + READ_ONCE(v->vmid), 323 + d.addr, d.size, d.order); 324 + break; 325 + case KVM_RISCV_HFENCE_VVMA_ASID_GVA: 326 + kvm_riscv_local_hfence_vvma_asid_gva( 327 + READ_ONCE(v->vmid), d.asid, 328 + d.addr, d.size, d.order); 329 + break; 330 + case KVM_RISCV_HFENCE_VVMA_ASID_ALL: 331 + kvm_riscv_local_hfence_vvma_asid_all( 332 + READ_ONCE(v->vmid), d.asid); 333 + break; 334 + case KVM_RISCV_HFENCE_VVMA_GVA: 335 + kvm_riscv_local_hfence_vvma_gva( 336 + READ_ONCE(v->vmid), 337 + d.addr, d.size, d.order); 338 + break; 339 + default: 340 + break; 341 + } 342 + } 343 + } 344 + 345 + static void make_xfence_request(struct kvm *kvm, 346 + unsigned long hbase, unsigned long hmask, 347 + unsigned int req, unsigned int fallback_req, 348 + const struct kvm_riscv_hfence *data) 349 + { 350 + unsigned long i; 351 + struct kvm_vcpu *vcpu; 352 + unsigned int actual_req = req; 353 + DECLARE_BITMAP(vcpu_mask, KVM_MAX_VCPUS); 354 + 355 + bitmap_clear(vcpu_mask, 0, KVM_MAX_VCPUS); 356 + kvm_for_each_vcpu(i, vcpu, kvm) { 357 + if (hbase != -1UL) { 358 + if (vcpu->vcpu_id < hbase) 359 + continue; 360 + if (!(hmask & (1UL << (vcpu->vcpu_id - hbase)))) 361 + continue; 362 + } 363 + 364 + bitmap_set(vcpu_mask, i, 1); 365 + 366 + if (!data || !data->type) 367 + 
continue; 368 + 369 + /* 370 + * Enqueue hfence data to VCPU hfence queue. If we don't 371 + * have space in the VCPU hfence queue then fallback to 372 + * a more conservative hfence request. 373 + */ 374 + if (!vcpu_hfence_enqueue(vcpu, data)) 375 + actual_req = fallback_req; 376 + } 377 + 378 + kvm_make_vcpus_request_mask(kvm, actual_req, vcpu_mask); 379 + } 380 + 381 + void kvm_riscv_fence_i(struct kvm *kvm, 382 + unsigned long hbase, unsigned long hmask) 383 + { 384 + make_xfence_request(kvm, hbase, hmask, KVM_REQ_FENCE_I, 385 + KVM_REQ_FENCE_I, NULL); 386 + } 387 + 388 + void kvm_riscv_hfence_gvma_vmid_gpa(struct kvm *kvm, 389 + unsigned long hbase, unsigned long hmask, 390 + gpa_t gpa, gpa_t gpsz, 391 + unsigned long order) 392 + { 393 + struct kvm_riscv_hfence data; 394 + 395 + data.type = KVM_RISCV_HFENCE_GVMA_VMID_GPA; 396 + data.asid = 0; 397 + data.addr = gpa; 398 + data.size = gpsz; 399 + data.order = order; 400 + make_xfence_request(kvm, hbase, hmask, KVM_REQ_HFENCE, 401 + KVM_REQ_HFENCE_GVMA_VMID_ALL, &data); 402 + } 403 + 404 + void kvm_riscv_hfence_gvma_vmid_all(struct kvm *kvm, 405 + unsigned long hbase, unsigned long hmask) 406 + { 407 + make_xfence_request(kvm, hbase, hmask, KVM_REQ_HFENCE_GVMA_VMID_ALL, 408 + KVM_REQ_HFENCE_GVMA_VMID_ALL, NULL); 409 + } 410 + 411 + void kvm_riscv_hfence_vvma_asid_gva(struct kvm *kvm, 412 + unsigned long hbase, unsigned long hmask, 413 + unsigned long gva, unsigned long gvsz, 414 + unsigned long order, unsigned long asid) 415 + { 416 + struct kvm_riscv_hfence data; 417 + 418 + data.type = KVM_RISCV_HFENCE_VVMA_ASID_GVA; 419 + data.asid = asid; 420 + data.addr = gva; 421 + data.size = gvsz; 422 + data.order = order; 423 + make_xfence_request(kvm, hbase, hmask, KVM_REQ_HFENCE, 424 + KVM_REQ_HFENCE_VVMA_ALL, &data); 425 + } 426 + 427 + void kvm_riscv_hfence_vvma_asid_all(struct kvm *kvm, 428 + unsigned long hbase, unsigned long hmask, 429 + unsigned long asid) 430 + { 431 + struct kvm_riscv_hfence data; 432 + 433 + 
data.type = KVM_RISCV_HFENCE_VVMA_ASID_ALL; 434 + data.asid = asid; 435 + data.addr = data.size = data.order = 0; 436 + make_xfence_request(kvm, hbase, hmask, KVM_REQ_HFENCE, 437 + KVM_REQ_HFENCE_VVMA_ALL, &data); 438 + } 439 + 440 + void kvm_riscv_hfence_vvma_gva(struct kvm *kvm, 441 + unsigned long hbase, unsigned long hmask, 442 + unsigned long gva, unsigned long gvsz, 443 + unsigned long order) 444 + { 445 + struct kvm_riscv_hfence data; 446 + 447 + data.type = KVM_RISCV_HFENCE_VVMA_GVA; 448 + data.asid = 0; 449 + data.addr = gva; 450 + data.size = gvsz; 451 + data.order = order; 452 + make_xfence_request(kvm, hbase, hmask, KVM_REQ_HFENCE, 453 + KVM_REQ_HFENCE_VVMA_ALL, &data); 454 + } 455 + 456 + void kvm_riscv_hfence_vvma_all(struct kvm *kvm, 457 + unsigned long hbase, unsigned long hmask) 458 + { 459 + make_xfence_request(kvm, hbase, hmask, KVM_REQ_HFENCE_VVMA_ALL, 460 + KVM_REQ_HFENCE_VVMA_ALL, NULL); 461 + }
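The per-VCPU hfence queue introduced in tlb.c is a fixed-size ring where `type == 0` marks a free slot, so no separate element count is needed; when the tail slot is still occupied, `make_xfence_request()` falls back to the conservative full-flush request instead. A simplified sketch of that ring discipline (the names and the queue depth here are invented, not the kernel's `KVM_RISCV_VCPU_MAX_HFENCE` machinery, and the kernel versions also take a spinlock):

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_HFENCE 4	/* stands in for KVM_RISCV_VCPU_MAX_HFENCE */

struct hfence {
	int type;		/* 0 means "slot empty" */
	unsigned long addr, size, order;
};

struct hfence_queue {
	struct hfence slot[MAX_HFENCE];
	int head, tail;
};

/* Enqueue fails (returns false) when the tail slot is still occupied,
 * which is how the caller decides to fall back to a full flush. */
static bool hfence_enqueue(struct hfence_queue *q, const struct hfence *d)
{
	if (q->slot[q->tail].type)
		return false;
	q->slot[q->tail] = *d;
	q->tail = (q->tail + 1) % MAX_HFENCE;
	return true;
}

static bool hfence_dequeue(struct hfence_queue *q, struct hfence *out)
{
	if (!q->slot[q->head].type)
		return false;
	*out = q->slot[q->head];
	q->slot[q->head].type = 0;	/* mark the slot free again */
	q->head = (q->head + 1) % MAX_HFENCE;
	return true;
}
```

Because emptiness is encoded in the slots themselves, `head == tail` is ambiguous on its own (empty or full), and the `type` check resolves it without an extra counter.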
+137 -7
arch/riscv/kvm/vcpu.c
··· 67 67 if (loaded) 68 68 kvm_arch_vcpu_put(vcpu); 69 69 70 + vcpu->arch.last_exit_cpu = -1; 71 + 70 72 memcpy(csr, reset_csr, sizeof(*csr)); 71 73 72 74 memcpy(cntx, reset_cntx, sizeof(*cntx)); ··· 79 77 80 78 WRITE_ONCE(vcpu->arch.irqs_pending, 0); 81 79 WRITE_ONCE(vcpu->arch.irqs_pending_mask, 0); 80 + 81 + vcpu->arch.hfence_head = 0; 82 + vcpu->arch.hfence_tail = 0; 83 + memset(vcpu->arch.hfence_queue, 0, sizeof(vcpu->arch.hfence_queue)); 82 84 83 85 /* Reset the guest CSRs for hotplug usecase */ 84 86 if (loaded) ··· 106 100 107 101 /* Setup ISA features available to VCPU */ 108 102 vcpu->arch.isa = riscv_isa_extension_base(NULL) & KVM_RISCV_ISA_ALLOWED; 103 + 104 + /* Setup VCPU hfence queue */ 105 + spin_lock_init(&vcpu->arch.hfence_lock); 109 106 110 107 /* Setup reset state of shadow SSTATUS and HSTATUS CSRs */ 111 108 cntx = &vcpu->arch.guest_reset_context; ··· 146 137 /* Cleanup VCPU timer */ 147 138 kvm_riscv_vcpu_timer_deinit(vcpu); 148 139 149 - /* Free unused pages pre-allocated for Stage2 page table mappings */ 140 + /* Free unused pages pre-allocated for G-stage page table mappings */ 150 141 kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_cache); 151 142 } 152 143 ··· 374 365 return 0; 375 366 } 376 367 368 + /* Mapping between KVM ISA Extension ID & Host ISA extension ID */ 369 + static unsigned long kvm_isa_ext_arr[] = { 370 + RISCV_ISA_EXT_a, 371 + RISCV_ISA_EXT_c, 372 + RISCV_ISA_EXT_d, 373 + RISCV_ISA_EXT_f, 374 + RISCV_ISA_EXT_h, 375 + RISCV_ISA_EXT_i, 376 + RISCV_ISA_EXT_m, 377 + }; 378 + 379 + static int kvm_riscv_vcpu_get_reg_isa_ext(struct kvm_vcpu *vcpu, 380 + const struct kvm_one_reg *reg) 381 + { 382 + unsigned long __user *uaddr = 383 + (unsigned long __user *)(unsigned long)reg->addr; 384 + unsigned long reg_num = reg->id & ~(KVM_REG_ARCH_MASK | 385 + KVM_REG_SIZE_MASK | 386 + KVM_REG_RISCV_ISA_EXT); 387 + unsigned long reg_val = 0; 388 + unsigned long host_isa_ext; 389 + 390 + if (KVM_REG_SIZE(reg->id) != sizeof(unsigned long)) 
391 + return -EINVAL; 392 + 393 + if (reg_num >= KVM_RISCV_ISA_EXT_MAX || reg_num >= ARRAY_SIZE(kvm_isa_ext_arr)) 394 + return -EINVAL; 395 + 396 + host_isa_ext = kvm_isa_ext_arr[reg_num]; 397 + if (__riscv_isa_extension_available(&vcpu->arch.isa, host_isa_ext)) 398 + reg_val = 1; /* Mark the given extension as available */ 399 + 400 + if (copy_to_user(uaddr, &reg_val, KVM_REG_SIZE(reg->id))) 401 + return -EFAULT; 402 + 403 + return 0; 404 + } 405 + 406 + static int kvm_riscv_vcpu_set_reg_isa_ext(struct kvm_vcpu *vcpu, 407 + const struct kvm_one_reg *reg) 408 + { 409 + unsigned long __user *uaddr = 410 + (unsigned long __user *)(unsigned long)reg->addr; 411 + unsigned long reg_num = reg->id & ~(KVM_REG_ARCH_MASK | 412 + KVM_REG_SIZE_MASK | 413 + KVM_REG_RISCV_ISA_EXT); 414 + unsigned long reg_val; 415 + unsigned long host_isa_ext; 416 + unsigned long host_isa_ext_mask; 417 + 418 + if (KVM_REG_SIZE(reg->id) != sizeof(unsigned long)) 419 + return -EINVAL; 420 + 421 + if (reg_num >= KVM_RISCV_ISA_EXT_MAX || reg_num >= ARRAY_SIZE(kvm_isa_ext_arr)) 422 + return -EINVAL; 423 + 424 + if (copy_from_user(&reg_val, uaddr, KVM_REG_SIZE(reg->id))) 425 + return -EFAULT; 426 + 427 + host_isa_ext = kvm_isa_ext_arr[reg_num]; 428 + if (!__riscv_isa_extension_available(NULL, host_isa_ext)) 429 + return -EOPNOTSUPP; 430 + 431 + if (host_isa_ext >= RISCV_ISA_EXT_BASE && 432 + host_isa_ext < RISCV_ISA_EXT_MAX) { 433 + /* 434 + * Multi-letter ISA extension. Currently there is no provision 435 + * to enable/disable the multi-letter ISA extensions for guests. 436 + * Return success if the request is to enable any ISA extension 437 + * that is available in the hardware. 438 + * Return -EOPNOTSUPP otherwise. 
439 + */ 440 + if (!reg_val) 441 + return -EOPNOTSUPP; 442 + else 443 + return 0; 444 + } 445 + 446 + /* Single letter base ISA extension */ 447 + if (!vcpu->arch.ran_atleast_once) { 448 + host_isa_ext_mask = BIT_MASK(host_isa_ext); 449 + if (!reg_val && (host_isa_ext_mask & KVM_RISCV_ISA_DISABLE_ALLOWED)) 450 + vcpu->arch.isa &= ~host_isa_ext_mask; 451 + else 452 + vcpu->arch.isa |= host_isa_ext_mask; 453 + vcpu->arch.isa &= riscv_isa_extension_base(NULL); 454 + vcpu->arch.isa &= KVM_RISCV_ISA_ALLOWED; 455 + kvm_riscv_vcpu_fp_reset(vcpu); 456 + } else { 457 + return -EOPNOTSUPP; 458 + } 459 + 460 + return 0; 461 + } 462 + 377 463 static int kvm_riscv_vcpu_set_reg(struct kvm_vcpu *vcpu, 378 464 const struct kvm_one_reg *reg) 379 465 { ··· 486 382 else if ((reg->id & KVM_REG_RISCV_TYPE_MASK) == KVM_REG_RISCV_FP_D) 487 383 return kvm_riscv_vcpu_set_reg_fp(vcpu, reg, 488 384 KVM_REG_RISCV_FP_D); 385 + else if ((reg->id & KVM_REG_RISCV_TYPE_MASK) == KVM_REG_RISCV_ISA_EXT) 386 + return kvm_riscv_vcpu_set_reg_isa_ext(vcpu, reg); 489 387 490 388 return -EINVAL; 491 389 } ··· 509 403 else if ((reg->id & KVM_REG_RISCV_TYPE_MASK) == KVM_REG_RISCV_FP_D) 510 404 return kvm_riscv_vcpu_get_reg_fp(vcpu, reg, 511 405 KVM_REG_RISCV_FP_D); 406 + else if ((reg->id & KVM_REG_RISCV_TYPE_MASK) == KVM_REG_RISCV_ISA_EXT) 407 + return kvm_riscv_vcpu_get_reg_isa_ext(vcpu, reg); 512 408 513 409 return -EINVAL; 514 410 } ··· 743 635 csr_write(CSR_HVIP, csr->hvip); 744 636 csr_write(CSR_VSATP, csr->vsatp); 745 637 746 - kvm_riscv_stage2_update_hgatp(vcpu); 638 + kvm_riscv_gstage_update_hgatp(vcpu); 747 639 748 640 kvm_riscv_vcpu_timer_restore(vcpu); 749 641 ··· 798 690 kvm_riscv_reset_vcpu(vcpu); 799 691 800 692 if (kvm_check_request(KVM_REQ_UPDATE_HGATP, vcpu)) 801 - kvm_riscv_stage2_update_hgatp(vcpu); 693 + kvm_riscv_gstage_update_hgatp(vcpu); 802 694 803 - if (kvm_check_request(KVM_REQ_TLB_FLUSH, vcpu)) 804 - __kvm_riscv_hfence_gvma_all(); 695 + if (kvm_check_request(KVM_REQ_FENCE_I, 
vcpu)) 696 + kvm_riscv_fence_i_process(vcpu); 697 + 698 + /* 699 + * The generic KVM_REQ_TLB_FLUSH is same as 700 + * KVM_REQ_HFENCE_GVMA_VMID_ALL 701 + */ 702 + if (kvm_check_request(KVM_REQ_HFENCE_GVMA_VMID_ALL, vcpu)) 703 + kvm_riscv_hfence_gvma_vmid_all_process(vcpu); 704 + 705 + if (kvm_check_request(KVM_REQ_HFENCE_VVMA_ALL, vcpu)) 706 + kvm_riscv_hfence_vvma_all_process(vcpu); 707 + 708 + if (kvm_check_request(KVM_REQ_HFENCE, vcpu)) 709 + kvm_riscv_hfence_process(vcpu); 805 710 } 806 711 } 807 712 ··· 836 715 { 837 716 guest_state_enter_irqoff(); 838 717 __kvm_riscv_switch_to(&vcpu->arch); 718 + vcpu->arch.last_exit_cpu = vcpu->cpu; 839 719 guest_state_exit_irqoff(); 840 720 } 841 721 ··· 884 762 /* Check conditions before entering the guest */ 885 763 cond_resched(); 886 764 887 - kvm_riscv_stage2_vmid_update(vcpu); 765 + kvm_riscv_gstage_vmid_update(vcpu); 888 766 889 767 kvm_riscv_check_vcpu_requests(vcpu); 890 768 ··· 922 800 kvm_riscv_update_hvip(vcpu); 923 801 924 802 if (ret <= 0 || 925 - kvm_riscv_stage2_vmid_ver_changed(&vcpu->kvm->arch.vmid) || 803 + kvm_riscv_gstage_vmid_ver_changed(&vcpu->kvm->arch.vmid) || 926 804 kvm_request_pending(vcpu)) { 927 805 vcpu->mode = OUTSIDE_GUEST_MODE; 928 806 local_irq_enable(); ··· 930 808 kvm_vcpu_srcu_read_lock(vcpu); 931 809 continue; 932 810 } 811 + 812 + /* 813 + * Cleanup stale TLB enteries 814 + * 815 + * Note: This should be done after G-stage VMID has been 816 + * updated using kvm_riscv_gstage_vmid_ver_changed() 817 + */ 818 + kvm_riscv_local_tlb_sanitize(vcpu); 933 819 934 820 guest_timing_enter_irqoff(); 935 821
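The new ISA-extension ONE_REG handlers recover the extension index by masking the architecture, size, and subtype fields out of `reg->id`, then bounds-check the remainder against both the UAPI maximum and the host mapping table `kvm_isa_ext_arr[]`. A sketch of that decode with made-up field layouts (the real `KVM_REG_*` encodings live in the UAPI headers and differ):

```c
#include <assert.h>
#include <stdint.h>

/* Invented field layouts for illustration; not the real KVM_REG_* values. */
#define REG_ARCH_MASK   (0xffULL << 56)
#define REG_SIZE_MASK   (0x0fULL << 52)
#define REG_TYPE_ISA    (0x07ULL << 48)

#define NUM_ISA_EXTS    7	/* a, c, d, f, h, i, m in the hunk above */

/* Strip the fixed fields; the remainder indexes the extension table. */
static int64_t isa_ext_reg_num(uint64_t id)
{
	uint64_t num = id & ~(REG_ARCH_MASK | REG_SIZE_MASK | REG_TYPE_ISA);

	/* Same double bounds check as the handlers: reject indexes past
	 * either the UAPI maximum or the host mapping table size. */
	if (num >= NUM_ISA_EXTS)
		return -1;
	return (int64_t)num;
}
```

The double check matters because the UAPI enum and the host array can grow independently; an id that is valid UAPI-wise may still have no host mapping yet.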
+3 -3
arch/riscv/kvm/vcpu_exit.c
···
 	return 0;
 }
 
-static int stage2_page_fault(struct kvm_vcpu *vcpu, struct kvm_run *run,
+static int gstage_page_fault(struct kvm_vcpu *vcpu, struct kvm_run *run,
 			     struct kvm_cpu_trap *trap)
 {
 	struct kvm_memory_slot *memslot;
···
 		};
 	}
 
-	ret = kvm_riscv_stage2_map(vcpu, memslot, fault_addr, hva,
+	ret = kvm_riscv_gstage_map(vcpu, memslot, fault_addr, hva,
 		(trap->scause == EXC_STORE_GUEST_PAGE_FAULT) ? true : false);
 	if (ret < 0)
 		return ret;
···
 	case EXC_LOAD_GUEST_PAGE_FAULT:
 	case EXC_STORE_GUEST_PAGE_FAULT:
 		if (vcpu->arch.guest_context.hstatus & HSTATUS_SPV)
-			ret = stage2_page_fault(vcpu, run, trap);
+			ret = gstage_page_fault(vcpu, run, trap);
 		break;
 	case EXC_SUPERVISOR_SYSCALL:
 		if (vcpu->arch.guest_context.hstatus & HSTATUS_SPV)
+19 -21
arch/riscv/kvm/vcpu_sbi_replace.c
···
 			      struct kvm_cpu_trap *utrap, bool *exit)
 {
 	int ret = 0;
-	unsigned long i;
-	struct cpumask cm;
-	struct kvm_vcpu *tmp;
 	struct kvm_cpu_context *cp = &vcpu->arch.guest_context;
 	unsigned long hmask = cp->a0;
 	unsigned long hbase = cp->a1;
 	unsigned long funcid = cp->a6;
 
-	cpumask_clear(&cm);
-	kvm_for_each_vcpu(i, tmp, vcpu->kvm) {
-		if (hbase != -1UL) {
-			if (tmp->vcpu_id < hbase)
-				continue;
-			if (!(hmask & (1UL << (tmp->vcpu_id - hbase))))
-				continue;
-		}
-		if (tmp->cpu < 0)
-			continue;
-		cpumask_set_cpu(tmp->cpu, &cm);
-	}
-
 	switch (funcid) {
 	case SBI_EXT_RFENCE_REMOTE_FENCE_I:
-		ret = sbi_remote_fence_i(&cm);
+		kvm_riscv_fence_i(vcpu->kvm, hbase, hmask);
 		break;
 	case SBI_EXT_RFENCE_REMOTE_SFENCE_VMA:
-		ret = sbi_remote_hfence_vvma(&cm, cp->a2, cp->a3);
+		if (cp->a2 == 0 && cp->a3 == 0)
+			kvm_riscv_hfence_vvma_all(vcpu->kvm, hbase, hmask);
+		else
+			kvm_riscv_hfence_vvma_gva(vcpu->kvm, hbase, hmask,
+						  cp->a2, cp->a3, PAGE_SHIFT);
 		break;
 	case SBI_EXT_RFENCE_REMOTE_SFENCE_VMA_ASID:
-		ret = sbi_remote_hfence_vvma_asid(&cm, cp->a2,
-						  cp->a3, cp->a4);
+		if (cp->a2 == 0 && cp->a3 == 0)
+			kvm_riscv_hfence_vvma_asid_all(vcpu->kvm,
+						       hbase, hmask, cp->a4);
+		else
+			kvm_riscv_hfence_vvma_asid_gva(vcpu->kvm,
+						       hbase, hmask,
+						       cp->a2, cp->a3,
+						       PAGE_SHIFT, cp->a4);
 		break;
 	case SBI_EXT_RFENCE_REMOTE_HFENCE_GVMA:
 	case SBI_EXT_RFENCE_REMOTE_HFENCE_GVMA_VMID:
 	case SBI_EXT_RFENCE_REMOTE_HFENCE_VVMA:
 	case SBI_EXT_RFENCE_REMOTE_HFENCE_VVMA_ASID:
-		/* TODO: implement for nested hypervisor case */
+		/*
+		 * Until nested virtualization is implemented, the
+		 * SBI HFENCE calls should be treated as NOPs
+		 */
+		break;
 	default:
 		ret = -EOPNOTSUPP;
 	}
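Two conventions interact in the SBI RFENCE path above: a start/size pair of 0/0 requests a flush of the whole address space, and (in the local HFENCE helpers added in tlb.c) a range spanning more than one page table's worth of entries also degrades to a full flush, since issuing that many per-page fences would cost more than it saves. A compact sketch of that decision, assuming a 512-entry page table (the real code uses `PTRS_PER_PTE`):

```c
#include <assert.h>

#define FAKE_PTRS_PER_PTE 512UL	/* 4 KiB pages with 8-byte PTEs */

enum flush_kind { FLUSH_ALL, FLUSH_RANGE };

static enum flush_kind classify_flush(unsigned long start, unsigned long size,
				      unsigned long order)
{
	/* SBI convention: zero start and size request a full flush. */
	if (start == 0 && size == 0)
		return FLUSH_ALL;
	/* tlb.c guard: too many BIT(order)-sized fences, flush all instead. */
	if (FAKE_PTRS_PER_PTE < (size >> order))
		return FLUSH_ALL;
	return FLUSH_RANGE;
}
```

So a two-page SFENCE.VMA stays a range flush, while either the 0/0 wildcard or a multi-megabyte range with 4 KiB granularity collapses into a single full flush.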
+22 -13
arch/riscv/kvm/vcpu_sbi_v01.c
··· 23 23 int i, ret = 0; 24 24 u64 next_cycle; 25 25 struct kvm_vcpu *rvcpu; 26 - struct cpumask cm; 27 26 struct kvm *kvm = vcpu->kvm; 28 27 struct kvm_cpu_context *cp = &vcpu->arch.guest_context; 29 28 ··· 79 80 if (utrap->scause) 80 81 break; 81 82 82 - cpumask_clear(&cm); 83 - for_each_set_bit(i, &hmask, BITS_PER_LONG) { 84 - rvcpu = kvm_get_vcpu_by_id(vcpu->kvm, i); 85 - if (rvcpu->cpu < 0) 86 - continue; 87 - cpumask_set_cpu(rvcpu->cpu, &cm); 88 - } 89 83 if (cp->a7 == SBI_EXT_0_1_REMOTE_FENCE_I) 90 - ret = sbi_remote_fence_i(&cm); 91 - else if (cp->a7 == SBI_EXT_0_1_REMOTE_SFENCE_VMA) 92 - ret = sbi_remote_hfence_vvma(&cm, cp->a1, cp->a2); 93 - else 94 - ret = sbi_remote_hfence_vvma_asid(&cm, cp->a1, cp->a2, cp->a3); 84 + kvm_riscv_fence_i(vcpu->kvm, 0, hmask); 85 + else if (cp->a7 == SBI_EXT_0_1_REMOTE_SFENCE_VMA) { 86 + if (cp->a1 == 0 && cp->a2 == 0) 87 + kvm_riscv_hfence_vvma_all(vcpu->kvm, 88 + 0, hmask); 89 + else 90 + kvm_riscv_hfence_vvma_gva(vcpu->kvm, 91 + 0, hmask, 92 + cp->a1, cp->a2, 93 + PAGE_SHIFT); 94 + } else { 95 + if (cp->a1 == 0 && cp->a2 == 0) 96 + kvm_riscv_hfence_vvma_asid_all(vcpu->kvm, 97 + 0, hmask, 98 + cp->a3); 99 + else 100 + kvm_riscv_hfence_vvma_asid_gva(vcpu->kvm, 101 + 0, hmask, 102 + cp->a1, cp->a2, 103 + PAGE_SHIFT, 104 + cp->a3); 105 + } 95 106 break; 96 107 default: 97 108 ret = -EINVAL;
+4 -4
arch/riscv/kvm/vm.c
··· 31 31 { 32 32 int r; 33 33 34 - r = kvm_riscv_stage2_alloc_pgd(kvm); 34 + r = kvm_riscv_gstage_alloc_pgd(kvm); 35 35 if (r) 36 36 return r; 37 37 38 - r = kvm_riscv_stage2_vmid_init(kvm); 38 + r = kvm_riscv_gstage_vmid_init(kvm); 39 39 if (r) { 40 - kvm_riscv_stage2_free_pgd(kvm); 40 + kvm_riscv_gstage_free_pgd(kvm); 41 41 return r; 42 42 } 43 43 ··· 75 75 r = KVM_USER_MEM_SLOTS; 76 76 break; 77 77 case KVM_CAP_VM_GPA_BITS: 78 - r = kvm_riscv_stage2_gpa_bits(); 78 + r = kvm_riscv_gstage_gpa_bits(); 79 79 break; 80 80 default: 81 81 r = 0;
+18 -12
arch/riscv/kvm/vmid.c
··· 11 11 #include <linux/errno.h> 12 12 #include <linux/err.h> 13 13 #include <linux/module.h> 14 + #include <linux/smp.h> 14 15 #include <linux/kvm_host.h> 15 16 #include <asm/csr.h> 16 - #include <asm/sbi.h> 17 17 18 18 static unsigned long vmid_version = 1; 19 19 static unsigned long vmid_next; 20 20 static unsigned long vmid_bits; 21 21 static DEFINE_SPINLOCK(vmid_lock); 22 22 23 - void kvm_riscv_stage2_vmid_detect(void) 23 + void kvm_riscv_gstage_vmid_detect(void) 24 24 { 25 25 unsigned long old; 26 26 ··· 33 33 csr_write(CSR_HGATP, old); 34 34 35 35 /* We polluted local TLB so flush all guest TLB */ 36 - __kvm_riscv_hfence_gvma_all(); 36 + kvm_riscv_local_hfence_gvma_all(); 37 37 38 38 /* We don't use VMID bits if they are not sufficient */ 39 39 if ((1UL << vmid_bits) < num_possible_cpus()) 40 40 vmid_bits = 0; 41 41 } 42 42 43 - unsigned long kvm_riscv_stage2_vmid_bits(void) 43 + unsigned long kvm_riscv_gstage_vmid_bits(void) 44 44 { 45 45 return vmid_bits; 46 46 } 47 47 48 - int kvm_riscv_stage2_vmid_init(struct kvm *kvm) 48 + int kvm_riscv_gstage_vmid_init(struct kvm *kvm) 49 49 { 50 50 /* Mark the initial VMID and VMID version invalid */ 51 51 kvm->arch.vmid.vmid_version = 0; ··· 54 54 return 0; 55 55 } 56 56 57 - bool kvm_riscv_stage2_vmid_ver_changed(struct kvm_vmid *vmid) 57 + bool kvm_riscv_gstage_vmid_ver_changed(struct kvm_vmid *vmid) 58 58 { 59 59 if (!vmid_bits) 60 60 return false; ··· 63 63 READ_ONCE(vmid_version)); 64 64 } 65 65 66 - void kvm_riscv_stage2_vmid_update(struct kvm_vcpu *vcpu) 66 + static void __local_hfence_gvma_all(void *info) 67 + { 68 + kvm_riscv_local_hfence_gvma_all(); 69 + } 70 + 71 + void kvm_riscv_gstage_vmid_update(struct kvm_vcpu *vcpu) 67 72 { 68 73 unsigned long i; 69 74 struct kvm_vcpu *v; 70 75 struct kvm_vmid *vmid = &vcpu->kvm->arch.vmid; 71 76 72 - if (!kvm_riscv_stage2_vmid_ver_changed(vmid)) 77 + if (!kvm_riscv_gstage_vmid_ver_changed(vmid)) 73 78 return; 74 79 75 80 spin_lock(&vmid_lock); ··· 83 78 * We need 
to re-check the vmid_version here to ensure that if 84 79 * another vcpu already allocated a valid vmid for this vm. 85 80 */ 86 - if (!kvm_riscv_stage2_vmid_ver_changed(vmid)) { 81 + if (!kvm_riscv_gstage_vmid_ver_changed(vmid)) { 87 82 spin_unlock(&vmid_lock); 88 83 return; 89 84 } ··· 101 96 * instances is invalid and we have force VMID re-assignement 102 97 * for all Guest instances. The Guest instances that were not 103 98 * running will automatically pick-up new VMIDs because will 104 - * call kvm_riscv_stage2_vmid_update() whenever they enter 99 + * call kvm_riscv_gstage_vmid_update() whenever they enter 105 100 * in-kernel run loop. For Guest instances that are already 106 101 * running, we force VM exits on all host CPUs using IPI and 107 102 * flush all Guest TLBs. 108 103 */ 109 - sbi_remote_hfence_gvma(cpu_online_mask, 0, 0); 104 + on_each_cpu_mask(cpu_online_mask, __local_hfence_gvma_all, 105 + NULL, 1); 110 106 } 111 107 112 108 vmid->vmid = vmid_next; ··· 118 112 119 113 spin_unlock(&vmid_lock); 120 114 121 - /* Request stage2 page table update for all VCPUs */ 115 + /* Request G-stage page table update for all VCPUs */ 122 116 kvm_for_each_vcpu(i, v, vcpu->kvm) 123 117 kvm_make_request(KVM_REQ_UPDATE_HGATP, v); 124 118 }
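The hunk above keeps the existing version-based lazy VMID scheme and only swaps the flush primitive (`on_each_cpu_mask` running a local HFENCE instead of an SBI call). A toy model of that versioning scheme, with invented names and a deliberately tiny VMID space (the kernel masks `vmid_next` by `vmid_bits` rather than comparing against a constant):

```c
#include <assert.h>
#include <stdbool.h>

/* One record per guest: its VMID plus the version it was allocated under. */
struct vmid { unsigned long vmid, version; };

static unsigned long g_version = 1, g_next = 1;
static const unsigned long VMID_MAX = 4;	/* stands in for 1UL << vmid_bits */

/* A guest's VMID is stale once the global version has moved on. */
static bool ver_changed(const struct vmid *v)
{
	return v->version != g_version;
}

static void vmid_update(struct vmid *v)
{
	if (!ver_changed(v))
		return;
	if (g_next == VMID_MAX) {
		/* Rollover: bump the version, implicitly invalidating every
		 * guest's VMID; this is the point where the real code also
		 * flushes guest TLBs on all host CPUs. */
		g_version++;
		g_next = 1;	/* VMID 0 stays reserved for the host */
	}
	v->vmid = g_next++;
	v->version = g_version;
}
```

The version bump is what lets already-allocated VMIDs be invalidated in O(1): no per-guest bookkeeping is walked; each guest notices the mismatch on its next entry to the run loop.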
+22 -1
arch/s390/include/asm/uv.h
··· 2 2 /* 3 3 * Ultravisor Interfaces 4 4 * 5 - * Copyright IBM Corp. 2019 5 + * Copyright IBM Corp. 2019, 2022 6 6 * 7 7 * Author(s): 8 8 * Vasily Gorbik <gor@linux.ibm.com> ··· 52 52 #define UVC_CMD_UNPIN_PAGE_SHARED 0x0342 53 53 #define UVC_CMD_SET_SHARED_ACCESS 0x1000 54 54 #define UVC_CMD_REMOVE_SHARED_ACCESS 0x1001 55 + #define UVC_CMD_RETR_ATTEST 0x1020 55 56 56 57 /* Bits in installed uv calls */ 57 58 enum uv_cmds_inst { ··· 77 76 BIT_UVC_CMD_UNSHARE_ALL = 20, 78 77 BIT_UVC_CMD_PIN_PAGE_SHARED = 21, 79 78 BIT_UVC_CMD_UNPIN_PAGE_SHARED = 22, 79 + BIT_UVC_CMD_RETR_ATTEST = 28, 80 80 }; 81 81 82 82 enum uv_feat_ind { ··· 219 217 u64 reserved08[3]; 220 218 u64 paddr; 221 219 u64 reserved28; 220 + } __packed __aligned(8); 221 + 222 + /* Retrieve Attestation Measurement */ 223 + struct uv_cb_attest { 224 + struct uv_cb_header header; /* 0x0000 */ 225 + u64 reserved08[2]; /* 0x0008 */ 226 + u64 arcb_addr; /* 0x0018 */ 227 + u64 cont_token; /* 0x0020 */ 228 + u8 reserved28[6]; /* 0x0028 */ 229 + u16 user_data_len; /* 0x002e */ 230 + u8 user_data[256]; /* 0x0030 */ 231 + u32 reserved130[3]; /* 0x0130 */ 232 + u32 meas_len; /* 0x013c */ 233 + u64 meas_addr; /* 0x0140 */ 234 + u8 config_uid[16]; /* 0x0148 */ 235 + u32 reserved158; /* 0x0158 */ 236 + u32 add_data_len; /* 0x015c */ 237 + u64 add_data_addr; /* 0x0160 */ 238 + u64 reserved168[4]; /* 0x0168 */ 222 239 } __packed __aligned(8); 223 240 224 241 static inline int __uv_call(unsigned long r1, unsigned long r2)
+51
arch/s390/include/uapi/asm/uvdevice.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ 2 + /* 3 + * Copyright IBM Corp. 2022 4 + * Author(s): Steffen Eiden <seiden@linux.ibm.com> 5 + */ 6 + #ifndef __S390_ASM_UVDEVICE_H 7 + #define __S390_ASM_UVDEVICE_H 8 + 9 + #include <linux/types.h> 10 + 11 + struct uvio_ioctl_cb { 12 + __u32 flags; 13 + __u16 uv_rc; /* UV header rc value */ 14 + __u16 uv_rrc; /* UV header rrc value */ 15 + __u64 argument_addr; /* Userspace address of uvio argument */ 16 + __u32 argument_len; 17 + __u8 reserved14[0x40 - 0x14]; /* must be zero */ 18 + }; 19 + 20 + #define UVIO_ATT_USER_DATA_LEN 0x100 21 + #define UVIO_ATT_UID_LEN 0x10 22 + struct uvio_attest { 23 + __u64 arcb_addr; /* 0x0000 */ 24 + __u64 meas_addr; /* 0x0008 */ 25 + __u64 add_data_addr; /* 0x0010 */ 26 + __u8 user_data[UVIO_ATT_USER_DATA_LEN]; /* 0x0018 */ 27 + __u8 config_uid[UVIO_ATT_UID_LEN]; /* 0x0118 */ 28 + __u32 arcb_len; /* 0x0128 */ 29 + __u32 meas_len; /* 0x012c */ 30 + __u32 add_data_len; /* 0x0130 */ 31 + __u16 user_data_len; /* 0x0134 */ 32 + __u16 reserved136; /* 0x0136 */ 33 + }; 34 + 35 + /* 36 + * The following max values define an upper length for the IOCTL in/out buffers. 37 + * However, they do not represent the maximum the Ultravisor allows which is 38 + * often way smaller. By allowing larger buffer sizes we hopefully do not need 39 + * to update the code with every machine update. It is therefore possible for 40 + * userspace to request more memory than actually used by kernel/UV. 41 + */ 42 + #define UVIO_ATT_ARCB_MAX_LEN 0x100000 43 + #define UVIO_ATT_MEASUREMENT_MAX_LEN 0x8000 44 + #define UVIO_ATT_ADDITIONAL_MAX_LEN 0x8000 45 + 46 + #define UVIO_DEVICE_NAME "uv" 47 + #define UVIO_TYPE_UVC 'u' 48 + 49 + #define UVIO_IOCTL_ATT _IOWR(UVIO_TYPE_UVC, 0x01, struct uvio_ioctl_cb) 50 + 51 + #endif /* __S390_ASM_UVDEVICE_H */
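The `reserved14[0x40 - 0x14]` idiom in the new uapi header sizes the trailing padding from the offset of the previous field, pinning the control block at exactly 0x40 bytes. A stand-alone copy of the struct to show the arithmetic (the `uint*_t` typedefs replace the kernel's `__u*` types; this is illustration, not the uapi header itself):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirror of struct uvio_ioctl_cb from the hunk above. */
struct uvio_ioctl_cb {
	uint32_t flags;			/* 0x00 */
	uint16_t uv_rc;			/* 0x04 */
	uint16_t uv_rrc;		/* 0x06 */
	uint64_t argument_addr;		/* 0x08 */
	uint32_t argument_len;		/* 0x10 */
	uint8_t  reserved14[0x40 - 0x14];	/* pad to 0x40 */
};
```

Because the padding length is derived from the preceding offsets, inserting a field without adjusting `reserved14` changes `sizeof` and is caught immediately, which matters for an ioctl ABI that must stay fixed.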
+18 -4
arch/s390/kvm/gaccess.c
··· 491 491 PROT_TYPE_IEP = 4, 492 492 }; 493 493 494 - static int trans_exc(struct kvm_vcpu *vcpu, int code, unsigned long gva, 495 - u8 ar, enum gacc_mode mode, enum prot_type prot) 494 + static int trans_exc_ending(struct kvm_vcpu *vcpu, int code, unsigned long gva, u8 ar, 495 + enum gacc_mode mode, enum prot_type prot, bool terminate) 496 496 { 497 497 struct kvm_s390_pgm_info *pgm = &vcpu->arch.pgm; 498 498 struct trans_exc_code_bits *tec; ··· 519 519 case PROT_TYPE_DAT: 520 520 tec->b61 = 1; 521 521 break; 522 + } 523 + if (terminate) { 524 + tec->b56 = 0; 525 + tec->b60 = 0; 526 + tec->b61 = 0; 522 527 } 523 528 fallthrough; 524 529 case PGM_ASCE_TYPE: ··· 555 550 break; 556 551 } 557 552 return code; 553 + } 554 + 555 + static int trans_exc(struct kvm_vcpu *vcpu, int code, unsigned long gva, u8 ar, 556 + enum gacc_mode mode, enum prot_type prot) 557 + { 558 + return trans_exc_ending(vcpu, code, gva, ar, mode, prot, false); 558 559 } 559 560 560 561 static int get_vcpu_asce(struct kvm_vcpu *vcpu, union asce *asce, ··· 1120 1109 data += fragment_len; 1121 1110 ga = kvm_s390_logical_to_effective(vcpu, ga + fragment_len); 1122 1111 } 1123 - if (rc > 0) 1124 - rc = trans_exc(vcpu, rc, ga, ar, mode, prot); 1112 + if (rc > 0) { 1113 + bool terminate = (mode == GACC_STORE) && (idx > 0); 1114 + 1115 + rc = trans_exc_ending(vcpu, rc, ga, ar, mode, prot, terminate); 1116 + } 1125 1117 out_unlock: 1126 1118 if (need_ipte_lock) 1127 1119 ipte_unlock(vcpu);
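The new `terminate` path zeroes TEID bits 56, 60 and 61 when a multi-fragment store was partially completed before faulting, rather than reporting protection details that no longer hold for the whole access. A toy model of just that bit-clearing step (struct and function names are mine; the real `trans_exc_ending` does considerably more):

```c
#include <assert.h>
#include <stdbool.h>

/* Subset of the TEID bits the hunk above touches. */
struct teid_bits {
	unsigned b56 : 1;
	unsigned b60 : 1;
	unsigned b61 : 1;
};

/* On a terminating exception, suppress the detail bits as the fix does. */
static void teid_on_terminate(struct teid_bits *tec, bool terminate)
{
	if (terminate) {
		tec->b56 = 0;
		tec->b60 = 0;
		tec->b61 = 0;
	}
}
```

In the caller, `terminate` is derived as `(mode == GACC_STORE) && (idx > 0)`, i.e. only stores that already wrote at least one fragment take this path.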
+1
arch/x86/include/asm/cpufeatures.h
··· 407 407 #define X86_FEATURE_SEV (19*32+ 1) /* AMD Secure Encrypted Virtualization */ 408 408 #define X86_FEATURE_VM_PAGE_FLUSH (19*32+ 2) /* "" VM Page Flush MSR is supported */ 409 409 #define X86_FEATURE_SEV_ES (19*32+ 3) /* AMD Secure Encrypted Virtualization - Encrypted State */ 410 + #define X86_FEATURE_V_TSC_AUX (19*32+ 9) /* "" Virtual TSC_AUX */ 410 411 #define X86_FEATURE_SME_COHERENT (19*32+10) /* "" AMD hardware-enforced cache coherency */ 411 412 412 413 /*
+1
arch/x86/include/asm/kvm-x86-ops.h
··· 127 127 KVM_X86_OP(msr_filter_changed) 128 128 KVM_X86_OP(complete_emulated_msr) 129 129 KVM_X86_OP(vcpu_deliver_sipi_vector) 130 + KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons); 130 131 131 132 #undef KVM_X86_OP 132 133 #undef KVM_X86_OP_OPTIONAL
+31
arch/x86/include/asm/kvm-x86-pmu-ops.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + #if !defined(KVM_X86_PMU_OP) || !defined(KVM_X86_PMU_OP_OPTIONAL) 3 + BUILD_BUG_ON(1) 4 + #endif 5 + 6 + /* 7 + * KVM_X86_PMU_OP() and KVM_X86_PMU_OP_OPTIONAL() are used to help generate 8 + * both DECLARE/DEFINE_STATIC_CALL() invocations and 9 + * "static_call_update()" calls. 10 + * 11 + * KVM_X86_PMU_OP_OPTIONAL() can be used for those functions that can have 12 + * a NULL definition, for example if "static_call_cond()" will be used 13 + * at the call sites. 14 + */ 15 + KVM_X86_PMU_OP(pmc_perf_hw_id) 16 + KVM_X86_PMU_OP(pmc_is_enabled) 17 + KVM_X86_PMU_OP(pmc_idx_to_pmc) 18 + KVM_X86_PMU_OP(rdpmc_ecx_to_pmc) 19 + KVM_X86_PMU_OP(msr_idx_to_pmc) 20 + KVM_X86_PMU_OP(is_valid_rdpmc_ecx) 21 + KVM_X86_PMU_OP(is_valid_msr) 22 + KVM_X86_PMU_OP(get_msr) 23 + KVM_X86_PMU_OP(set_msr) 24 + KVM_X86_PMU_OP(refresh) 25 + KVM_X86_PMU_OP(init) 26 + KVM_X86_PMU_OP(reset) 27 + KVM_X86_PMU_OP_OPTIONAL(deliver_pmi) 28 + KVM_X86_PMU_OP_OPTIONAL(cleanup) 29 + 30 + #undef KVM_X86_PMU_OP 31 + #undef KVM_X86_PMU_OP_OPTIONAL
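The new header follows the same X-macro pattern as `kvm-x86-ops.h`: the op list is written once and `#include`d repeatedly with different definitions of `KVM_X86_PMU_OP()` to generate declarations, static-call definitions, and updates. A compact stand-alone illustration of the pattern with toy names (the kernel variant re-`#include`s the header between `#define`/`#undef` pairs; using a list macro here is equivalent in spirit):

```c
#include <assert.h>

/* Single source of truth for the op list. */
#define PMU_OPS(X)	\
	X(init)		\
	X(reset)

/* Expansion 1: declare a struct of function pointers. */
struct pmu_ops {
#define DECLARE_OP(name) int (*name)(void);
	PMU_OPS(DECLARE_OP)
#undef DECLARE_OP
};

static int my_init(void)  { return 1; }
static int my_reset(void) { return 2; }

/* Expansion 2: build an initializer from the same list. */
#define FILL_OP(name) .name = my_##name,
static const struct pmu_ops ops = { PMU_OPS(FILL_OP) };
#undef FILL_OP
```

Adding an op means touching the list once; every generated site (struct member, static call, update) stays in sync automatically, and the `BUILD_BUG_ON(1)` guard at the top of the kernel header rejects includes that forgot to define the generator macros.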
+49 -44
arch/x86/include/asm/kvm_host.h
··· 281 281 /* 282 282 * kvm_mmu_page_role tracks the properties of a shadow page (where shadow page 283 283 * also includes TDP pages) to determine whether or not a page can be used in 284 - * the given MMU context. This is a subset of the overall kvm_mmu_role to 284 + * the given MMU context. This is a subset of the overall kvm_cpu_role to 285 285 * minimize the size of kvm_memory_slot.arch.gfn_track, i.e. allows allocating 286 286 * 2 bytes per gfn instead of 4 bytes per gfn. 287 287 * 288 - * Indirect upper-level shadow pages are tracked for write-protection via 288 + * Upper-level shadow pages having gptes are tracked for write-protection via 289 289 * gfn_track. As above, gfn_track is a 16 bit counter, so KVM must not create 290 290 * more than 2^16-1 upper-level shadow pages at a single gfn, otherwise 291 291 * gfn_track will overflow and explosions will ensure. ··· 331 331 unsigned smap_andnot_wp:1; 332 332 unsigned ad_disabled:1; 333 333 unsigned guest_mode:1; 334 - unsigned :6; 334 + unsigned passthrough:1; 335 + unsigned :5; 335 336 336 337 /* 337 338 * This is left at the top of the word so that ··· 368 367 struct { 369 368 unsigned int valid:1; 370 369 unsigned int execonly:1; 371 - unsigned int cr0_pg:1; 372 - unsigned int cr4_pae:1; 373 370 unsigned int cr4_pse:1; 374 371 unsigned int cr4_pke:1; 375 372 unsigned int cr4_smap:1; ··· 377 378 }; 378 379 }; 379 380 380 - union kvm_mmu_role { 381 + union kvm_cpu_role { 381 382 u64 as_u64; 382 383 struct { 383 384 union kvm_mmu_page_role base; ··· 437 438 struct kvm_mmu_page *sp); 438 439 void (*invlpg)(struct kvm_vcpu *vcpu, gva_t gva, hpa_t root_hpa); 439 440 struct kvm_mmu_root_info root; 440 - union kvm_mmu_role mmu_role; 441 - u8 root_level; 442 - u8 shadow_root_level; 443 - u8 ept_ad; 444 - bool direct_map; 445 - struct kvm_mmu_root_info prev_roots[KVM_MMU_NUM_PREV_ROOTS]; 446 - 447 - /* 448 - * Bitmap; bit set = permission fault 449 - * Byte index: page fault error code [4:1] 450 - * Bit index: pte 
permissions in ACC_* format 451 - */ 452 - u8 permissions[16]; 441 + union kvm_cpu_role cpu_role; 442 + union kvm_mmu_page_role root_role; 453 443 454 444 /* 455 445 * The pkru_mask indicates if protection key checks are needed. It ··· 447 459 * Each domain has 2 bits which are ANDed with AD and WD from PKRU. 448 460 */ 449 461 u32 pkru_mask; 462 + 463 + struct kvm_mmu_root_info prev_roots[KVM_MMU_NUM_PREV_ROOTS]; 464 + 465 + /* 466 + * Bitmap; bit set = permission fault 467 + * Byte index: page fault error code [4:1] 468 + * Bit index: pte permissions in ACC_* format 469 + */ 470 + u8 permissions[16]; 450 471 451 472 u64 *pae_root; 452 473 u64 *pml4_root; ··· 604 607 struct kvm_vcpu_xen { 605 608 u64 hypercall_rip; 606 609 u32 current_runstate; 607 - bool vcpu_info_set; 608 - bool vcpu_time_info_set; 609 - bool runstate_set; 610 - struct gfn_to_hva_cache vcpu_info_cache; 611 - struct gfn_to_hva_cache vcpu_time_info_cache; 612 - struct gfn_to_hva_cache runstate_cache; 610 + u8 upcall_vector; 611 + struct gfn_to_pfn_cache vcpu_info_cache; 612 + struct gfn_to_pfn_cache vcpu_time_info_cache; 613 + struct gfn_to_pfn_cache runstate_cache; 613 614 u64 last_steal; 614 615 u64 runstate_entry_time; 615 616 u64 runstate_times[4]; 616 617 unsigned long evtchn_pending_sel; 618 + u32 vcpu_id; /* The Xen / ACPI vCPU ID */ 619 + u32 timer_virq; 620 + u64 timer_expires; /* In guest epoch */ 621 + atomic_t timer_pending; 622 + struct hrtimer timer; 623 + int poll_evtchn; 624 + struct timer_list poll_timer; 617 625 }; 618 626 619 627 struct kvm_vcpu_arch { ··· 755 753 gpa_t time; 756 754 struct pvclock_vcpu_time_info hv_clock; 757 755 unsigned int hw_tsc_khz; 758 - struct gfn_to_hva_cache pv_time; 759 - bool pv_time_enabled; 756 + struct gfn_to_pfn_cache pv_time; 760 757 /* set guest stopped flag in pvclock flags field */ 761 758 bool pvclock_set_guest_stopped_request; 762 759 ··· 1025 1024 1026 1025 /* Xen emulation context */ 1027 1026 struct kvm_xen { 1027 + u32 xen_version; 1028 
1028 bool long_mode; 1029 1029 u8 upcall_vector; 1030 1030 struct gfn_to_pfn_cache shinfo_cache; 1031 + struct idr evtchn_ports; 1032 + unsigned long poll_mask[BITS_TO_LONGS(KVM_MAX_VCPUS)]; 1031 1033 }; 1032 1034 1033 1035 enum kvm_irqchip_mode { ··· 1122 1118 u64 cur_tsc_offset; 1123 1119 u64 cur_tsc_generation; 1124 1120 int nr_vcpus_matched_tsc; 1121 + 1122 + u32 default_tsc_khz; 1125 1123 1126 1124 seqcount_raw_spinlock_t pvclock_sc; 1127 1125 bool use_master_clock; ··· 1269 1263 1270 1264 struct kvm_vcpu_stat { 1271 1265 struct kvm_vcpu_stat_generic generic; 1266 + u64 pf_taken; 1272 1267 u64 pf_fixed; 1268 + u64 pf_emulate; 1269 + u64 pf_spurious; 1270 + u64 pf_fast; 1271 + u64 pf_mmio_spte_created; 1273 1272 u64 pf_guest; 1274 1273 u64 tlb_flush; 1275 1274 u64 invlpg; ··· 1466 1455 int cpu_dirty_log_size; 1467 1456 void (*update_cpu_dirty_logging)(struct kvm_vcpu *vcpu); 1468 1457 1469 - /* pmu operations of sub-arch */ 1470 - const struct kvm_pmu_ops *pmu_ops; 1471 1458 const struct kvm_x86_nested_ops *nested_ops; 1472 1459 1473 1460 void (*vcpu_blocking)(struct kvm_vcpu *vcpu); ··· 1508 1499 int (*complete_emulated_msr)(struct kvm_vcpu *vcpu, int err); 1509 1500 1510 1501 void (*vcpu_deliver_sipi_vector)(struct kvm_vcpu *vcpu, u8 vector); 1502 + 1503 + /* 1504 + * Returns vCPU specific APICv inhibit reasons 1505 + */ 1506 + unsigned long (*vcpu_get_apicv_inhibit_reasons)(struct kvm_vcpu *vcpu); 1511 1507 }; 1512 1508 1513 1509 struct kvm_x86_nested_ops { 1514 1510 void (*leave_nested)(struct kvm_vcpu *vcpu); 1515 1511 int (*check_events)(struct kvm_vcpu *vcpu); 1512 + bool (*handle_page_fault_workaround)(struct kvm_vcpu *vcpu, 1513 + struct x86_exception *fault); 1516 1514 bool (*hv_timer_pending)(struct kvm_vcpu *vcpu); 1517 1515 void (*triple_fault)(struct kvm_vcpu *vcpu); 1518 1516 int (*get_state)(struct kvm_vcpu *vcpu, ··· 1544 1528 unsigned int (*handle_intel_pt_intr)(void); 1545 1529 1546 1530 struct kvm_x86_ops *runtime_ops; 1531 + struct 
kvm_pmu_ops *pmu_ops; 1547 1532 }; 1548 1533 1549 1534 struct kvm_arch_async_pf { ··· 1565 1548 #define KVM_X86_OP_OPTIONAL KVM_X86_OP 1566 1549 #define KVM_X86_OP_OPTIONAL_RET0 KVM_X86_OP 1567 1550 #include <asm/kvm-x86-ops.h> 1568 - 1569 - static inline void kvm_ops_static_call_update(void) 1570 - { 1571 - #define __KVM_X86_OP(func) \ 1572 - static_call_update(kvm_x86_##func, kvm_x86_ops.func); 1573 - #define KVM_X86_OP(func) \ 1574 - WARN_ON(!kvm_x86_ops.func); __KVM_X86_OP(func) 1575 - #define KVM_X86_OP_OPTIONAL __KVM_X86_OP 1576 - #define KVM_X86_OP_OPTIONAL_RET0(func) \ 1577 - static_call_update(kvm_x86_##func, (void *)kvm_x86_ops.func ? : \ 1578 - (void *)__static_call_return0); 1579 - #include <asm/kvm-x86-ops.h> 1580 - #undef __KVM_X86_OP 1581 - } 1582 1551 1583 1552 #define __KVM_HAVE_ARCH_VM_ALLOC 1584 1553 static inline struct kvm *kvm_arch_alloc_vm(void) ··· 1803 1800 struct x86_exception *exception); 1804 1801 1805 1802 bool kvm_apicv_activated(struct kvm *kvm); 1803 + bool kvm_vcpu_apicv_activated(struct kvm_vcpu *vcpu); 1806 1804 void kvm_vcpu_update_apicv(struct kvm_vcpu *vcpu); 1807 1805 void __kvm_set_or_clear_apicv_inhibit(struct kvm *kvm, 1808 1806 enum kvm_apicv_inhibit reason, bool set); ··· 1993 1989 KVM_X86_QUIRK_CD_NW_CLEARED | \ 1994 1990 KVM_X86_QUIRK_LAPIC_MMIO_HOLE | \ 1995 1991 KVM_X86_QUIRK_OUT_7E_INC_RIP | \ 1996 - KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT) 1992 + KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT | \ 1993 + KVM_X86_QUIRK_FIX_HYPERCALL_INSN) 1997 1994 1998 1995 #endif /* _ASM_X86_KVM_HOST_H */
+142
arch/x86/include/asm/uaccess.h
··· 382 382 383 383 #endif // CONFIG_CC_HAS_ASM_GOTO_OUTPUT 384 384 385 + #ifdef CONFIG_CC_HAS_ASM_GOTO_TIED_OUTPUT 386 + #define __try_cmpxchg_user_asm(itype, ltype, _ptr, _pold, _new, label) ({ \ 387 + bool success; \ 388 + __typeof__(_ptr) _old = (__typeof__(_ptr))(_pold); \ 389 + __typeof__(*(_ptr)) __old = *_old; \ 390 + __typeof__(*(_ptr)) __new = (_new); \ 391 + asm_volatile_goto("\n" \ 392 + "1: " LOCK_PREFIX "cmpxchg"itype" %[new], %[ptr]\n"\ 393 + _ASM_EXTABLE_UA(1b, %l[label]) \ 394 + : CC_OUT(z) (success), \ 395 + [ptr] "+m" (*_ptr), \ 396 + [old] "+a" (__old) \ 397 + : [new] ltype (__new) \ 398 + : "memory" \ 399 + : label); \ 400 + if (unlikely(!success)) \ 401 + *_old = __old; \ 402 + likely(success); }) 403 + 404 + #ifdef CONFIG_X86_32 405 + #define __try_cmpxchg64_user_asm(_ptr, _pold, _new, label) ({ \ 406 + bool success; \ 407 + __typeof__(_ptr) _old = (__typeof__(_ptr))(_pold); \ 408 + __typeof__(*(_ptr)) __old = *_old; \ 409 + __typeof__(*(_ptr)) __new = (_new); \ 410 + asm_volatile_goto("\n" \ 411 + "1: " LOCK_PREFIX "cmpxchg8b %[ptr]\n" \ 412 + _ASM_EXTABLE_UA(1b, %l[label]) \ 413 + : CC_OUT(z) (success), \ 414 + "+A" (__old), \ 415 + [ptr] "+m" (*_ptr) \ 416 + : "b" ((u32)__new), \ 417 + "c" ((u32)((u64)__new >> 32)) \ 418 + : "memory" \ 419 + : label); \ 420 + if (unlikely(!success)) \ 421 + *_old = __old; \ 422 + likely(success); }) 423 + #endif // CONFIG_X86_32 424 + #else // !CONFIG_CC_HAS_ASM_GOTO_TIED_OUTPUT 425 + #define __try_cmpxchg_user_asm(itype, ltype, _ptr, _pold, _new, label) ({ \ 426 + int __err = 0; \ 427 + bool success; \ 428 + __typeof__(_ptr) _old = (__typeof__(_ptr))(_pold); \ 429 + __typeof__(*(_ptr)) __old = *_old; \ 430 + __typeof__(*(_ptr)) __new = (_new); \ 431 + asm volatile("\n" \ 432 + "1: " LOCK_PREFIX "cmpxchg"itype" %[new], %[ptr]\n"\ 433 + CC_SET(z) \ 434 + "2:\n" \ 435 + _ASM_EXTABLE_TYPE_REG(1b, 2b, EX_TYPE_EFAULT_REG, \ 436 + %[errout]) \ 437 + : CC_OUT(z) (success), \ 438 + [errout] "+r" (__err), \ 439 + 
[ptr] "+m" (*_ptr), \ 440 + [old] "+a" (__old) \ 441 + : [new] ltype (__new) \ 442 + : "memory", "cc"); \ 443 + if (unlikely(__err)) \ 444 + goto label; \ 445 + if (unlikely(!success)) \ 446 + *_old = __old; \ 447 + likely(success); }) 448 + 449 + #ifdef CONFIG_X86_32 450 + /* 451 + * Unlike the normal CMPXCHG, hardcode ECX for both success/fail and error. 452 + * There are only six GPRs available and four (EAX, EBX, ECX, and EDX) are 453 + * hardcoded by CMPXCHG8B, leaving only ESI and EDI. If the compiler uses 454 + * both ESI and EDI for the memory operand, compilation will fail if the error 455 + * is an input+output as there will be no register available for input. 456 + */ 457 + #define __try_cmpxchg64_user_asm(_ptr, _pold, _new, label) ({ \ 458 + int __result; \ 459 + __typeof__(_ptr) _old = (__typeof__(_ptr))(_pold); \ 460 + __typeof__(*(_ptr)) __old = *_old; \ 461 + __typeof__(*(_ptr)) __new = (_new); \ 462 + asm volatile("\n" \ 463 + "1: " LOCK_PREFIX "cmpxchg8b %[ptr]\n" \ 464 + "mov $0, %%ecx\n\t" \ 465 + "setz %%cl\n" \ 466 + "2:\n" \ 467 + _ASM_EXTABLE_TYPE_REG(1b, 2b, EX_TYPE_EFAULT_REG, %%ecx) \ 468 + : [result]"=c" (__result), \ 469 + "+A" (__old), \ 470 + [ptr] "+m" (*_ptr) \ 471 + : "b" ((u32)__new), \ 472 + "c" ((u32)((u64)__new >> 32)) \ 473 + : "memory", "cc"); \ 474 + if (unlikely(__result < 0)) \ 475 + goto label; \ 476 + if (unlikely(!__result)) \ 477 + *_old = __old; \ 478 + likely(__result); }) 479 + #endif // CONFIG_X86_32 480 + #endif // CONFIG_CC_HAS_ASM_GOTO_TIED_OUTPUT 481 + 385 482 /* FIXME: this hack is definitely wrong -AK */ 386 483 struct __large_struct { unsigned long buf[100]; }; 387 484 #define __m(x) (*(struct __large_struct __user *)(x)) ··· 570 473 if (unlikely(__gu_err)) goto err_label; \ 571 474 } while (0) 572 475 #endif // CONFIG_CC_HAS_ASM_GOTO_OUTPUT 476 + 477 + extern void __try_cmpxchg_user_wrong_size(void); 478 + 479 + #ifndef CONFIG_X86_32 480 + #define __try_cmpxchg64_user_asm(_ptr, _oldp, _nval, _label) \ 481 + 
__try_cmpxchg_user_asm("q", "r", (_ptr), (_oldp), (_nval), _label) 482 + #endif 483 + 484 + /* 485 + * Force the pointer to u<size> to match the size expected by the asm helper. 486 + * clang/LLVM compiles all cases and only discards the unused paths after 487 + * processing errors, which breaks i386 if the pointer is an 8-byte value. 488 + */ 489 + #define unsafe_try_cmpxchg_user(_ptr, _oldp, _nval, _label) ({ \ 490 + bool __ret; \ 491 + __chk_user_ptr(_ptr); \ 492 + switch (sizeof(*(_ptr))) { \ 493 + case 1: __ret = __try_cmpxchg_user_asm("b", "q", \ 494 + (__force u8 *)(_ptr), (_oldp), \ 495 + (_nval), _label); \ 496 + break; \ 497 + case 2: __ret = __try_cmpxchg_user_asm("w", "r", \ 498 + (__force u16 *)(_ptr), (_oldp), \ 499 + (_nval), _label); \ 500 + break; \ 501 + case 4: __ret = __try_cmpxchg_user_asm("l", "r", \ 502 + (__force u32 *)(_ptr), (_oldp), \ 503 + (_nval), _label); \ 504 + break; \ 505 + case 8: __ret = __try_cmpxchg64_user_asm((__force u64 *)(_ptr), (_oldp),\ 506 + (_nval), _label); \ 507 + break; \ 508 + default: __try_cmpxchg_user_wrong_size(); \ 509 + } \ 510 + __ret; }) 511 + 512 + /* "Returns" 0 on success, 1 on failure, -EFAULT if the access faults. */ 513 + #define __try_cmpxchg_user(_ptr, _oldp, _nval, _label) ({ \ 514 + int __ret = -EFAULT; \ 515 + __uaccess_begin_nospec(); \ 516 + __ret = !unsafe_try_cmpxchg_user(_ptr, _oldp, _nval, _label); \ 517 + _label: \ 518 + __uaccess_end(); \ 519 + __ret; \ 520 + }) 573 521 574 522 /* 575 523 * We want the unsafe accessors to always be inlined and use
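Stripped of the user-access machinery, the contract of these macros is plain compare-and-exchange with write-back of the observed value on failure. A portable sketch of that contract using the GCC/Clang `__atomic` builtin (this cannot model the `-EFAULT`/exception-table path, which is the whole reason the kernel needs the asm versions):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Semantics of __try_cmpxchg_user_asm, minus faulting: compare *ptr
 * with *old, install new_val on match, and on mismatch write the value
 * actually observed back through old so the caller can retry.
 */
static bool try_cmpxchg(unsigned long *ptr, unsigned long *old,
			unsigned long new_val)
{
	return __atomic_compare_exchange_n(ptr, old, new_val, false,
					   __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}
```

The "update `*old` on failure" convention is what makes lock-free retry loops terse: the reload for the next iteration is already done by the failed attempt.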
+4 -6
arch/x86/include/asm/vmx.h
··· 543 543 #define EPT_VIOLATION_ACC_READ_BIT 0 544 544 #define EPT_VIOLATION_ACC_WRITE_BIT 1 545 545 #define EPT_VIOLATION_ACC_INSTR_BIT 2 546 - #define EPT_VIOLATION_READABLE_BIT 3 547 - #define EPT_VIOLATION_WRITABLE_BIT 4 548 - #define EPT_VIOLATION_EXECUTABLE_BIT 5 546 + #define EPT_VIOLATION_RWX_SHIFT 3 547 + #define EPT_VIOLATION_GVA_IS_VALID_BIT 7 549 548 #define EPT_VIOLATION_GVA_TRANSLATED_BIT 8 550 549 #define EPT_VIOLATION_ACC_READ (1 << EPT_VIOLATION_ACC_READ_BIT) 551 550 #define EPT_VIOLATION_ACC_WRITE (1 << EPT_VIOLATION_ACC_WRITE_BIT) 552 551 #define EPT_VIOLATION_ACC_INSTR (1 << EPT_VIOLATION_ACC_INSTR_BIT) 553 - #define EPT_VIOLATION_READABLE (1 << EPT_VIOLATION_READABLE_BIT) 554 - #define EPT_VIOLATION_WRITABLE (1 << EPT_VIOLATION_WRITABLE_BIT) 555 - #define EPT_VIOLATION_EXECUTABLE (1 << EPT_VIOLATION_EXECUTABLE_BIT) 552 + #define EPT_VIOLATION_RWX_MASK (VMX_EPT_RWX_MASK << EPT_VIOLATION_RWX_SHIFT) 553 + #define EPT_VIOLATION_GVA_IS_VALID (1 << EPT_VIOLATION_GVA_IS_VALID_BIT) 556 554 #define EPT_VIOLATION_GVA_TRANSLATED (1 << EPT_VIOLATION_GVA_TRANSLATED_BIT) 557 555 558 556 /*
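The hunk replaces three single-bit defines with one mask so that bits 3-5 of the EPT-violation exit qualification (the permissions of the faulting GPA) can be extracted in one operation, in the same R/W/X format as `VMX_EPT_RWX_MASK`. A sketch of the intended use, with the constants copied from the hunk and `VMX_EPT_RWX_MASK` filled in with its kernel value of 0x7:

```c
#include <assert.h>

#define VMX_EPT_RWX_MASK	0x7ULL
#define EPT_VIOLATION_RWX_SHIFT	3
#define EPT_VIOLATION_RWX_MASK	(VMX_EPT_RWX_MASK << EPT_VIOLATION_RWX_SHIFT)

/* Recover the faulting GPA's R/W/X permissions from the exit qualification. */
static unsigned long ept_violation_rwx(unsigned long long exit_qual)
{
	return (exit_qual & EPT_VIOLATION_RWX_MASK) >> EPT_VIOLATION_RWX_SHIFT;
}
```

Keeping the result in `VMX_EPT_RWX_MASK` format lets callers compare it directly against EPT permission masks without per-bit translation.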
+6 -5
arch/x86/include/uapi/asm/kvm.h
··· 428 428 struct kvm_vcpu_events events; 429 429 }; 430 430 431 - #define KVM_X86_QUIRK_LINT0_REENABLED (1 << 0) 432 - #define KVM_X86_QUIRK_CD_NW_CLEARED (1 << 1) 433 - #define KVM_X86_QUIRK_LAPIC_MMIO_HOLE (1 << 2) 434 - #define KVM_X86_QUIRK_OUT_7E_INC_RIP (1 << 3) 435 - #define KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT (1 << 4) 431 + #define KVM_X86_QUIRK_LINT0_REENABLED (1 << 0) 432 + #define KVM_X86_QUIRK_CD_NW_CLEARED (1 << 1) 433 + #define KVM_X86_QUIRK_LAPIC_MMIO_HOLE (1 << 2) 434 + #define KVM_X86_QUIRK_OUT_7E_INC_RIP (1 << 3) 435 + #define KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT (1 << 4) 436 + #define KVM_X86_QUIRK_FIX_HYPERCALL_INSN (1 << 5) 436 437 437 438 #define KVM_STATE_NESTED_FORMAT_VMX 0 438 439 #define KVM_STATE_NESTED_FORMAT_SVM 1
+2 -2
arch/x86/kernel/asm-offsets_64.c
··· 5 5 6 6 #include <asm/ia32.h> 7 7 8 - #if defined(CONFIG_KVM_GUEST) && defined(CONFIG_PARAVIRT_SPINLOCKS) 8 + #if defined(CONFIG_KVM_GUEST) 9 9 #include <asm/kvm_para.h> 10 10 #endif 11 11 ··· 20 20 BLANK(); 21 21 #endif 22 22 23 - #if defined(CONFIG_KVM_GUEST) && defined(CONFIG_PARAVIRT_SPINLOCKS) 23 + #if defined(CONFIG_KVM_GUEST) 24 24 OFFSET(KVM_STEAL_TIME_preempted, kvm_steal_time, preempted); 25 25 BLANK(); 26 26 #endif
+16 -1
arch/x86/kernel/fpu/core.c
··· 14 14 #include <asm/traps.h> 15 15 #include <asm/irq_regs.h> 16 16 17 + #include <uapi/asm/kvm.h> 18 + 17 19 #include <linux/hardirq.h> 18 20 #include <linux/pkeys.h> 19 21 #include <linux/vmalloc.h> ··· 234 232 gfpu->fpstate = fpstate; 235 233 gfpu->xfeatures = fpu_user_cfg.default_features; 236 234 gfpu->perm = fpu_user_cfg.default_features; 237 - gfpu->uabi_size = fpu_user_cfg.default_size; 235 + 236 + /* 237 + * KVM sets the FP+SSE bits in the XSAVE header when copying FPU state 238 + * to userspace, even when XSAVE is unsupported, so that restoring FPU 239 + * state on a different CPU that does support XSAVE can cleanly load 240 + * the incoming state using its natural XSAVE. In other words, KVM's 241 + * uABI size may be larger than this host's default size. Conversely, 242 + * the default size should never be larger than KVM's base uABI size; 243 + * all features that can expand the uABI size must be opt-in. 244 + */ 245 + gfpu->uabi_size = sizeof(struct kvm_xsave); 246 + if (WARN_ON_ONCE(fpu_user_cfg.default_size > gfpu->uabi_size)) 247 + gfpu->uabi_size = fpu_user_cfg.default_size; 248 + 238 249 fpu_init_guest_permissions(gfpu); 239 250 240 251 return true;
+67 -53
arch/x86/kernel/kvm.c
···
 {
 	u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
 	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
-	struct kvm_task_sleep_node *n;
+	struct kvm_task_sleep_node *n, *dummy = NULL;
 
 	if (token == ~0) {
 		apf_task_wake_all();
···
 	n = _find_apf_task(b, token);
 	if (!n) {
 		/*
-		 * async PF was not yet handled.
-		 * Add dummy entry for the token.
+		 * Async #PF not yet handled, add a dummy entry for the token.
+		 * Allocating the token must be done outside of the raw lock
+		 * as the allocator is preemptible on PREEMPT_RT kernels.
 		 */
-		n = kzalloc(sizeof(*n), GFP_ATOMIC);
-		if (!n) {
-			/*
-			 * Allocation failed! Busy wait while other cpu
-			 * handles async PF.
-			 */
+		if (!dummy) {
 			raw_spin_unlock(&b->lock);
-			cpu_relax();
+			dummy = kzalloc(sizeof(*dummy), GFP_ATOMIC);
+
+			/*
+			 * Continue looping on allocation failure, eventually
+			 * the async #PF will be handled and allocating a new
+			 * node will be unnecessary.
+			 */
+			if (!dummy)
+				cpu_relax();
+
+			/*
+			 * Recheck for async #PF completion before enqueueing
+			 * the dummy token to avoid duplicate list entries.
+			 */
 			goto again;
 		}
-		n->token = token;
-		n->cpu = smp_processor_id();
-		init_swait_queue_head(&n->wq);
-		hlist_add_head(&n->link, &b->list);
+		dummy->token = token;
+		dummy->cpu = smp_processor_id();
+		init_swait_queue_head(&dummy->wq);
+		hlist_add_head(&dummy->link, &b->list);
+		dummy = NULL;
 	} else {
 		apf_task_wake_one(n);
 	}
 	raw_spin_unlock(&b->lock);
-	return;
+
+	/* A dummy token might be allocated and ultimately not used.  */
+	if (dummy)
+		kfree(dummy);
 }
 EXPORT_SYMBOL_GPL(kvm_async_pf_task_wake);
···
 }
 #endif
 
+#if defined(CONFIG_X86_32) || !defined(CONFIG_SMP)
+bool __kvm_vcpu_is_preempted(long cpu);
+
+__visible bool __kvm_vcpu_is_preempted(long cpu)
+{
+	struct kvm_steal_time *src = &per_cpu(steal_time, cpu);
+
+	return !!(src->preempted & KVM_VCPU_PREEMPTED);
+}
+PV_CALLEE_SAVE_REGS_THUNK(__kvm_vcpu_is_preempted);
+
+#else
+
+#include <asm/asm-offsets.h>
+
+extern bool __raw_callee_save___kvm_vcpu_is_preempted(long);
+
+/*
+ * Hand-optimized version for x86-64 that avoids saving and restoring
+ * eight 64-bit registers to/from the stack.
+ */
+asm(
+".pushsection .text;"
+".global __raw_callee_save___kvm_vcpu_is_preempted;"
+".type __raw_callee_save___kvm_vcpu_is_preempted, @function;"
+"__raw_callee_save___kvm_vcpu_is_preempted:"
+ASM_ENDBR
+"movq __per_cpu_offset(,%rdi,8), %rax;"
+"cmpb $0, " __stringify(KVM_STEAL_TIME_preempted) "+steal_time(%rax);"
+"setne %al;"
+ASM_RET
+".size __raw_callee_save___kvm_vcpu_is_preempted, .-__raw_callee_save___kvm_vcpu_is_preempted;"
+".popsection");
+
+#endif
+
 static void __init kvm_guest_init(void)
 {
 	int i;
···
 	if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
 		has_steal_clock = 1;
 		static_call_update(pv_steal_clock, kvm_steal_clock);
+
+		pv_ops.lock.vcpu_is_preempted =
+			PV_CALLEE_SAVE(__kvm_vcpu_is_preempted);
 	}
 
 	if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
···
 	}
 }
 
-#ifdef CONFIG_X86_32
-__visible bool __kvm_vcpu_is_preempted(long cpu)
-{
-	struct kvm_steal_time *src = &per_cpu(steal_time, cpu);
-
-	return !!(src->preempted & KVM_VCPU_PREEMPTED);
-}
-PV_CALLEE_SAVE_REGS_THUNK(__kvm_vcpu_is_preempted);
-
-#else
-
-#include <asm/asm-offsets.h>
-
-extern bool __raw_callee_save___kvm_vcpu_is_preempted(long);
-
-/*
- * Hand-optimized version for x86-64 that avoids saving and restoring
- * eight 64-bit registers to/from the stack.
- */
-asm(
-".pushsection .text;"
-".global __raw_callee_save___kvm_vcpu_is_preempted;"
-".type __raw_callee_save___kvm_vcpu_is_preempted, @function;"
-"__raw_callee_save___kvm_vcpu_is_preempted:"
-ASM_ENDBR
-"movq __per_cpu_offset(,%rdi,8), %rax;"
-"cmpb $0, " __stringify(KVM_STEAL_TIME_preempted) "+steal_time(%rax);"
-"setne %al;"
-ASM_RET
-".size __raw_callee_save___kvm_vcpu_is_preempted, .-__raw_callee_save___kvm_vcpu_is_preempted;"
-".popsection");
-
-#endif
-
 /*
  * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present.
  */
···
 	pv_ops.lock.wait = kvm_wait;
 	pv_ops.lock.kick = kvm_kick_cpu;
 
-	if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
-		pv_ops.lock.vcpu_is_preempted =
-			PV_CALLEE_SAVE(__kvm_vcpu_is_preempted);
-	}
 	/*
 	 * When PV spinlock is enabled which is preferred over
 	 * virt_spin_lock(), virt_spin_lock_key's value is meaningless.
+1 -1
arch/x86/kernel/kvmclock.c
···
 
 static int __init kvm_setup_vsyscall_timeinfo(void)
 {
-	if (!kvm_para_available() || !kvmclock)
+	if (!kvm_para_available() || !kvmclock || nopv)
 		return 0;
 
 	kvmclock_init_mem();
-1
arch/x86/kvm/i8259.c
···
 		 */
 		irq2 = 7;
 		intno = s->pics[1].irq_base + irq2;
-		irq = irq2 + 8;
 	} else
 		intno = s->pics[0].irq_base + irq;
 } else {
+9 -3
arch/x86/kvm/irq.c
···
 */
int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu)
{
-	if (lapic_in_kernel(vcpu))
-		return apic_has_pending_timer(vcpu);
+	int r = 0;
 
-	return 0;
+	if (lapic_in_kernel(vcpu))
+		r = apic_has_pending_timer(vcpu);
+	if (kvm_xen_timer_enabled(vcpu))
+		r += kvm_xen_has_pending_timer(vcpu);
+
+	return r;
}
EXPORT_SYMBOL(kvm_cpu_has_pending_timer);
···
{
 	if (lapic_in_kernel(vcpu))
 		kvm_inject_apic_timer_irqs(vcpu);
+	if (kvm_xen_timer_enabled(vcpu))
+		kvm_xen_inject_timer_irqs(vcpu);
}
EXPORT_SYMBOL_GPL(kvm_inject_pending_timer_irqs);
+1 -1
arch/x86/kvm/irq_comm.c
···
 	if (!level)
 		return -1;
 
-	return kvm_xen_set_evtchn_fast(e, kvm);
+	return kvm_xen_set_evtchn_fast(&e->xen_evtchn, kvm);
 #endif
 	default:
 		break;
+3 -2
arch/x86/kvm/lapic.c
···
 	if (apic->lapic_timer.hv_timer_in_use)
 		cancel_hv_timer(apic);
 	preempt_enable();
+	atomic_set(&apic->lapic_timer.pending, 0);
 }
 
 static void apic_update_lvtt(struct kvm_lapic *apic)
···
 	tsc_deadline = apic->lapic_timer.expired_tscdeadline;
 	apic->lapic_timer.expired_tscdeadline = 0;
 	guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
-	apic->lapic_timer.advance_expire_delta = guest_tsc - tsc_deadline;
+	trace_kvm_wait_lapic_expire(vcpu->vcpu_id, guest_tsc - tsc_deadline);
 
 	if (lapic_timer_advance_dynamic) {
-		adjust_lapic_timer_advance(vcpu, apic->lapic_timer.advance_expire_delta);
+		adjust_lapic_timer_advance(vcpu, guest_tsc - tsc_deadline);
 		/*
 		 * If the timer fired early, reread the TSC to account for the
 		 * overhead of the above adjustment to avoid waiting longer
-1
arch/x86/kvm/lapic.h
···
 	u64 tscdeadline;
 	u64 expired_tscdeadline;
 	u32 timer_advance_ns;
-	s64 advance_expire_delta;
 	atomic_t pending;	/* accumulated triggered timers */
 	bool hv_timer_in_use;
 };
+21 -88
arch/x86/kvm/mmu.h
···
 	return (1ULL << (max_gpa_bits - PAGE_SHIFT)) - 1;
 }
 
+static inline u8 kvm_get_shadow_phys_bits(void)
+{
+	/*
+	 * boot_cpu_data.x86_phys_bits is reduced when MKTME or SME are detected
+	 * in CPU detection code, but the processor treats those reduced bits as
+	 * 'keyID' thus they are not reserved bits.  Therefore KVM needs to look at
+	 * the physical address bits reported by CPUID.
+	 */
+	if (likely(boot_cpu_data.extended_cpuid_level >= 0x80000008))
+		return cpuid_eax(0x80000008) & 0xff;
+
+	/*
+	 * Quite weird to have VMX or SVM but not MAXPHYADDR; probably a VM with
+	 * custom CPUID.  Proceed with whatever the kernel found since these features
+	 * aren't virtualizable (SME/SEV also require CPUIDs higher than 0x80000008).
+	 */
+	return boot_cpu_data.x86_phys_bits;
+}
+
 void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
+void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask);
 void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);
 
 void kvm_init_mmu(struct kvm_vcpu *vcpu);
···
 		return;
 
 	static_call(kvm_x86_load_mmu_pgd)(vcpu, root_hpa,
-					  vcpu->arch.mmu->shadow_root_level);
-}
-
-struct kvm_page_fault {
-	/* arguments to kvm_mmu_do_page_fault. */
-	const gpa_t addr;
-	const u32 error_code;
-	const bool prefetch;
-
-	/* Derived from error_code. */
-	const bool exec;
-	const bool write;
-	const bool present;
-	const bool rsvd;
-	const bool user;
-
-	/* Derived from mmu and global state. */
-	const bool is_tdp;
-	const bool nx_huge_page_workaround_enabled;
-
-	/*
-	 * Whether a >4KB mapping can be created or is forbidden due to NX
-	 * hugepages.
-	 */
-	bool huge_page_disallowed;
-
-	/*
-	 * Maximum page size that can be created for this fault; input to
-	 * FNAME(fetch), __direct_map and kvm_tdp_mmu_map.
-	 */
-	u8 max_level;
-
-	/*
-	 * Page size that can be created based on the max_level and the
-	 * page size used by the host mapping.
-	 */
-	u8 req_level;
-
-	/*
-	 * Page size that will be created based on the req_level and
-	 * huge_page_disallowed.
-	 */
-	u8 goal_level;
-
-	/* Shifted addr, or result of guest page table walk if addr is a gva. */
-	gfn_t gfn;
-
-	/* The memslot containing gfn. May be NULL. */
-	struct kvm_memory_slot *slot;
-
-	/* Outputs of kvm_faultin_pfn. */
-	kvm_pfn_t pfn;
-	hva_t hva;
-	bool map_writable;
-};
-
-int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
-
-extern int nx_huge_pages;
-static inline bool is_nx_huge_page_enabled(void)
-{
-	return READ_ONCE(nx_huge_pages);
-}
-
-static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
-					u32 err, bool prefetch)
-{
-	struct kvm_page_fault fault = {
-		.addr = cr2_or_gpa,
-		.error_code = err,
-		.exec = err & PFERR_FETCH_MASK,
-		.write = err & PFERR_WRITE_MASK,
-		.present = err & PFERR_PRESENT_MASK,
-		.rsvd = err & PFERR_RSVD_MASK,
-		.user = err & PFERR_USER_MASK,
-		.prefetch = prefetch,
-		.is_tdp = likely(vcpu->arch.mmu->page_fault == kvm_tdp_page_fault),
-		.nx_huge_page_workaround_enabled = is_nx_huge_page_enabled(),
-
-		.max_level = KVM_MAX_HUGEPAGE_LEVEL,
-		.req_level = PG_LEVEL_4K,
-		.goal_level = PG_LEVEL_4K,
-	};
-#ifdef CONFIG_RETPOLINE
-	if (fault.is_tdp)
-		return kvm_tdp_page_fault(vcpu, &fault);
-#endif
-	return vcpu->arch.mmu->page_fault(vcpu, &fault);
+					  vcpu->arch.mmu->root_role.level);
 }
 
 /*
+304 -293
arch/x86/kvm/mmu/mmu.c
···
 
 /*
  * Yes, lots of underscores.  They're a hint that you probably shouldn't be
- * reading from the role_regs.  Once the mmu_role is constructed, it becomes
+ * reading from the role_regs.  Once the root_role is constructed, it becomes
  * the single source of truth for the MMU's state.
  */
 #define BUILD_MMU_ROLE_REGS_ACCESSOR(reg, name, flag)			\
-static inline bool __maybe_unused ____is_##reg##_##name(struct kvm_mmu_role_regs *regs)\
+static inline bool __maybe_unused					\
+____is_##reg##_##name(const struct kvm_mmu_role_regs *regs)		\
 {									\
 	return !!(regs->reg & flag);					\
 }
···
 #define BUILD_MMU_ROLE_ACCESSOR(base_or_ext, reg, name)			\
 static inline bool __maybe_unused is_##reg##_##name(struct kvm_mmu *mmu)	\
 {									\
-	return !!(mmu->mmu_role. base_or_ext . reg##_##name);		\
+	return !!(mmu->cpu_role. base_or_ext . reg##_##name);		\
 }
-BUILD_MMU_ROLE_ACCESSOR(ext,  cr0, pg);
 BUILD_MMU_ROLE_ACCESSOR(base, cr0, wp);
 BUILD_MMU_ROLE_ACCESSOR(ext,  cr4, pse);
-BUILD_MMU_ROLE_ACCESSOR(ext,  cr4, pae);
 BUILD_MMU_ROLE_ACCESSOR(ext,  cr4, smep);
 BUILD_MMU_ROLE_ACCESSOR(ext,  cr4, smap);
 BUILD_MMU_ROLE_ACCESSOR(ext,  cr4, pke);
 BUILD_MMU_ROLE_ACCESSOR(ext,  cr4, la57);
 BUILD_MMU_ROLE_ACCESSOR(base, efer, nx);
+BUILD_MMU_ROLE_ACCESSOR(ext,  efer, lma);
+
+static inline bool is_cr0_pg(struct kvm_mmu *mmu)
+{
+	return mmu->cpu_role.base.level > 0;
+}
+
+static inline bool is_cr4_pae(struct kvm_mmu *mmu)
+{
+	return !mmu->cpu_role.base.has_4_byte_gpte;
+}
 
 static struct kvm_mmu_role_regs vcpu_to_role_regs(struct kvm_vcpu *vcpu)
 {
···
 	};
 
 	return regs;
-}
-
-static int role_regs_to_root_level(struct kvm_mmu_role_regs *regs)
-{
-	if (!____is_cr0_pg(regs))
-		return 0;
-	else if (____is_efer_lma(regs))
-		return ____is_cr4_la57(regs) ? PT64_ROOT_5LEVEL :
-					       PT64_ROOT_4LEVEL;
-	else if (____is_cr4_pae(regs))
-		return PT32E_ROOT_LEVEL;
-	else
-		return PT32_ROOT_LEVEL;
 }
 
 static inline bool kvm_available_flush_tlb_with_range(void)
···
 
 static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
 {
+	if (sp->role.passthrough)
+		return sp->gfn;
+
 	if (!sp->role.direct)
 		return sp->gfns[index];
···
 
 static void kvm_mmu_page_set_gfn(struct kvm_mmu_page *sp, int index, gfn_t gfn)
 {
+	if (sp->role.passthrough) {
+		WARN_ON_ONCE(gfn != sp->gfn);
+		return;
+	}
+
 	if (!sp->role.direct) {
 		sp->gfns[index] = gfn;
 		return;
···
 
 static void slot_rmap_walk_next(struct slot_rmap_walk_iterator *iterator)
 {
-	if (++iterator->rmap <= iterator->end_rmap) {
+	while (++iterator->rmap <= iterator->end_rmap) {
 		iterator->gfn += (1UL << KVM_HPAGE_GFN_SHIFT(iterator->level));
-		return;
+
+		if (iterator->rmap->val)
+			return;
 	}
 
 	if (++iterator->level > iterator->end_level) {
···
 static void kvm_mmu_commit_zap_page(struct kvm *kvm,
 				    struct list_head *invalid_list);
 
+static bool sp_has_gptes(struct kvm_mmu_page *sp)
+{
+	if (sp->role.direct)
+		return false;
+
+	if (sp->role.passthrough)
+		return false;
+
+	return true;
+}
+
 #define for_each_valid_sp(_kvm, _sp, _list)				\
 	hlist_for_each_entry(_sp, _list, hash_link)			\
 		if (is_obsolete_sp((_kvm), (_sp))) {			\
 		} else
 
-#define for_each_gfn_indirect_valid_sp(_kvm, _sp, _gfn)			\
+#define for_each_gfn_valid_sp_with_gptes(_kvm, _sp, _gfn)		\
 	for_each_valid_sp(_kvm, _sp,					\
 	  &(_kvm)->arch.mmu_page_hash[kvm_page_table_hashfn(_gfn)])	\
-		if ((_sp)->gfn != (_gfn) || (_sp)->role.direct) {} else
+		if ((_sp)->gfn != (_gfn) || !sp_has_gptes(_sp)) {} else
 
-static bool kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
+static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 			 struct list_head *invalid_list)
 {
 	int ret = vcpu->arch.mmu->sync_page(vcpu, sp);
 
-	if (ret < 0) {
+	if (ret < 0)
 		kvm_mmu_prepare_zap_page(vcpu->kvm, sp, invalid_list);
-		return false;
-	}
-
-	return !!ret;
+	return ret;
 }
 
 static bool kvm_mmu_remote_flush_or_zap(struct kvm *kvm,
···
 
 	for_each_sp(pages, sp, parents, i) {
 		kvm_unlink_unsync_page(vcpu->kvm, sp);
-		flush |= kvm_sync_page(vcpu, sp, &invalid_list);
+		flush |= kvm_sync_page(vcpu, sp, &invalid_list) > 0;
 		mmu_pages_clear_parents(&parents);
 	}
 	if (need_resched() || rwlock_needbreak(&vcpu->kvm->mmu_lock)) {
···
 			     int direct,
 			     unsigned int access)
 {
-	bool direct_mmu = vcpu->arch.mmu->direct_map;
+	bool direct_mmu = vcpu->arch.mmu->root_role.direct;
 	union kvm_mmu_page_role role;
 	struct hlist_head *sp_list;
 	unsigned quadrant;
 	struct kvm_mmu_page *sp;
+	int ret;
 	int collisions = 0;
 	LIST_HEAD(invalid_list);
 
-	role = vcpu->arch.mmu->mmu_role.base;
+	role = vcpu->arch.mmu->root_role;
 	role.level = level;
 	role.direct = direct;
 	role.access = access;
···
 		quadrant &= (1 << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1;
 		role.quadrant = quadrant;
 	}
+	if (level <= vcpu->arch.mmu->cpu_role.base.level)
+		role.passthrough = 0;
 
 	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
 	for_each_valid_sp(vcpu->kvm, sp, sp_list) {
···
 		 * If the sync fails, the page is zapped.  If so, break
 		 * in order to rebuild it.
 		 */
-		if (!kvm_sync_page(vcpu, sp, &invalid_list))
+		ret = kvm_sync_page(vcpu, sp, &invalid_list);
+		if (ret < 0)
 			break;
 
 		WARN_ON(!list_empty(&invalid_list));
-		kvm_flush_remote_tlbs(vcpu->kvm);
+		if (ret > 0)
+			kvm_flush_remote_tlbs(vcpu->kvm);
 	}
 
 	__clear_sp_write_flooding_count(sp);
···
 	sp->gfn = gfn;
 	sp->role = role;
 	hlist_add_head(&sp->hash_link, sp_list);
-	if (!direct) {
+	if (sp_has_gptes(sp)) {
 		account_shadowed(vcpu->kvm, sp);
 		if (level == PG_LEVEL_4K && kvm_vcpu_write_protect_gfn(vcpu, gfn))
 			kvm_flush_remote_tlbs_with_address(vcpu->kvm, gfn, 1);
···
 {
 	iterator->addr = addr;
 	iterator->shadow_addr = root;
-	iterator->level = vcpu->arch.mmu->shadow_root_level;
+	iterator->level = vcpu->arch.mmu->root_role.level;
 
 	if (iterator->level >= PT64_ROOT_4LEVEL &&
-	    vcpu->arch.mmu->root_level < PT64_ROOT_4LEVEL &&
-	    !vcpu->arch.mmu->direct_map)
+	    vcpu->arch.mmu->cpu_role.base.level < PT64_ROOT_4LEVEL &&
+	    !vcpu->arch.mmu->root_role.direct)
 		iterator->level = PT32E_ROOT_LEVEL;
 
 	if (iterator->level == PT32E_ROOT_LEVEL) {
···
 	/* Zapping children means active_mmu_pages has become unstable. */
 	list_unstable = *nr_zapped;
 
-	if (!sp->role.invalid && !sp->role.direct)
+	if (!sp->role.invalid && sp_has_gptes(sp))
 		unaccount_shadowed(kvm, sp);
 
 	if (sp->unsync)
···
 	pgprintk("%s: looking for gfn %llx\n", __func__, gfn);
 	r = 0;
 	write_lock(&kvm->mmu_lock);
-	for_each_gfn_indirect_valid_sp(kvm, sp, gfn) {
+	for_each_gfn_valid_sp_with_gptes(kvm, sp, gfn) {
 		pgprintk("%s: gfn %llx role %x\n", __func__, gfn,
 			 sp->role.word);
 		r = 1;
···
 	gpa_t gpa;
 	int r;
 
-	if (vcpu->arch.mmu->direct_map)
+	if (vcpu->arch.mmu->root_role.direct)
 		return 0;
 
 	gpa = kvm_mmu_gva_to_gpa_read(vcpu, gva, NULL);
···
 	 * that case, KVM must complete emulation of the guest TLB flush before
 	 * allowing shadow pages to become unsync (writable by the guest).
 	 */
-	for_each_gfn_indirect_valid_sp(kvm, sp, gfn) {
+	for_each_gfn_valid_sp_with_gptes(kvm, sp, gfn) {
 		if (!can_unsync)
 			return -EPERM;
···
 		 *sptep, write_fault, gfn);
 
 	if (unlikely(is_noslot_pfn(pfn))) {
+		vcpu->stat.pf_mmio_spte_created++;
 		mark_mmio_spte(vcpu, sptep, gfn, pte_access);
 		return RET_PF_EMULATE;
 	}
···
 		return ret;
 
 	direct_pte_prefetch(vcpu, it.sptep);
-	++vcpu->stat.pf_fixed;
 	return ret;
 }
···
 	return -EFAULT;
 }
 
-static bool handle_abnormal_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
-				unsigned int access, int *ret_val)
+static int handle_abnormal_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
+			       unsigned int access)
 {
 	/* The pfn is invalid, report the error! */
-	if (unlikely(is_error_pfn(fault->pfn))) {
-		*ret_val = kvm_handle_bad_page(vcpu, fault->gfn, fault->pfn);
-		return true;
-	}
+	if (unlikely(is_error_pfn(fault->pfn)))
+		return kvm_handle_bad_page(vcpu, fault->gfn, fault->pfn);
 
 	if (unlikely(!fault->slot)) {
 		gva_t gva = fault->is_tdp ? 0 : fault->addr;
···
 		 * and only if L1's MAXPHYADDR is inaccurate with respect to
 		 * the hardware's).
 		 */
-		if (unlikely(!shadow_mmio_value) ||
-		    unlikely(fault->gfn > kvm_mmu_max_gfn())) {
-			*ret_val = RET_PF_EMULATE;
-			return true;
-		}
+		if (unlikely(!enable_mmio_caching) ||
+		    unlikely(fault->gfn > kvm_mmu_max_gfn()))
+			return RET_PF_EMULATE;
 	}
 
-	return false;
+	return RET_PF_CONTINUE;
 }
 
 static bool page_fault_can_be_fast(struct kvm_page_fault *fault)
 {
 	/*
-	 * Do not fix the mmio spte with invalid generation number which
-	 * need to be updated by slow page fault path.
+	 * Page faults with reserved bits set, i.e. faults on MMIO SPTEs, only
+	 * reach the common page fault handler if the SPTE has an invalid MMIO
+	 * generation number.  Refreshing the MMIO generation needs to go down
+	 * the slow path.  Note, EPT Misconfigs do NOT set the PRESENT flag!
 	 */
 	if (fault->rsvd)
 		return false;
 
-	/* See if the page fault is due to an NX violation */
-	if (unlikely(fault->exec && fault->present))
-		return false;
-
 	/*
 	 * #PF can be fast if:
-	 * 1. The shadow page table entry is not present, which could mean that
-	 *    the fault is potentially caused by access tracking (if enabled).
-	 * 2. The shadow page table entry is present and the fault
-	 *    is caused by write-protect, that means we just need change the W
-	 *    bit of the spte which can be done out of mmu-lock.
 	 *
-	 * However, if access tracking is disabled we know that a non-present
-	 * page must be a genuine page fault where we have to create a new SPTE.
-	 * So, if access tracking is disabled, we return true only for write
-	 * accesses to a present page.
+	 * 1. The shadow page table entry is not present and A/D bits are
+	 *    disabled _by KVM_, which could mean that the fault is potentially
+	 *    caused by access tracking (if enabled).  If A/D bits are enabled
+	 *    by KVM, but disabled by L1 for L2, KVM is forced to disable A/D
+	 *    bits for L2 and employ access tracking, but the fast page fault
+	 *    mechanism only supports direct MMUs.
+	 * 2. The shadow page table entry is present, the access is a write,
+	 *    and no reserved bits are set (MMIO SPTEs cannot be "fixed"), i.e.
+	 *    the fault was caused by a write-protection violation.  If the
+	 *    SPTE is MMU-writable (determined later), the fault can be fixed
+	 *    by setting the Writable bit, which can be done out of mmu_lock.
 	 */
+	if (!fault->present)
+		return !kvm_ad_enabled();
 
-	return shadow_acc_track_mask != 0 || (fault->write && fault->present);
+	/*
+	 * Note, instruction fetches and writes are mutually exclusive, ignore
+	 * the "exec" flag.
+	 */
+	return fault->write;
 }
 
 /*
···
 
 	new_spte = spte;
 
-	if (is_access_track_spte(spte))
+	/*
+	 * KVM only supports fixing page faults outside of MMU lock for
+	 * direct MMUs, nested MMUs are always indirect, and KVM always
+	 * uses A/D bits for non-nested MMUs.  Thus, if A/D bits are
+	 * enabled, the SPTE can't be an access-tracked SPTE.
+	 */
+	if (unlikely(!kvm_ad_enabled()) && is_access_track_spte(spte))
 		new_spte = restore_acc_track_spte(new_spte);
 
 	/*
-	 * Currently, to simplify the code, write-protection can
-	 * be removed in the fast path only if the SPTE was
-	 * write-protected for dirty-logging or access tracking.
+	 * To keep things simple, only SPTEs that are MMU-writable can
+	 * be made fully writable outside of mmu_lock, e.g. only SPTEs
+	 * that were write-protected for dirty-logging or access
+	 * tracking are handled here.  Don't bother checking if the
+	 * SPTE is writable to prioritize running with A/D bits enabled.
+	 * The is_access_allowed() check above handles the common case
+	 * of the fault being spurious, and the SPTE is known to be
+	 * shadow-present, i.e. except for access tracking restoration
+	 * making the new SPTE writable, the check is wasteful.
 	 */
 	if (fault->write && is_mmu_writable_spte(spte)) {
 		new_spte |= PT_WRITABLE_MASK;
···
 
 	trace_fast_page_fault(vcpu, fault, sptep, spte, ret);
 	walk_shadow_page_lockless_end(vcpu);
+
+	if (ret != RET_PF_INVALID)
+		vcpu->stat.pf_fast++;
 
 	return ret;
 }
···
 	 * This should not be called while L2 is active, L2 can't invalidate
 	 * _only_ its own roots, e.g. INVVPID unconditionally exits.
 	 */
-	WARN_ON_ONCE(mmu->mmu_role.base.guest_mode);
+	WARN_ON_ONCE(mmu->root_role.guest_mode);
 
 	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
 		root_hpa = mmu->prev_roots[i].hpa;
···
 static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
 {
 	struct kvm_mmu *mmu = vcpu->arch.mmu;
-	u8 shadow_root_level = mmu->shadow_root_level;
+	u8 shadow_root_level = mmu->root_role.level;
 	hpa_t root;
 	unsigned i;
 	int r;
···
 	 * On SVM, reading PDPTRs might access guest memory, which might fault
 	 * and thus might sleep.  Grab the PDPTRs before acquiring mmu_lock.
 	 */
-	if (mmu->root_level == PT32E_ROOT_LEVEL) {
+	if (mmu->cpu_role.base.level == PT32E_ROOT_LEVEL) {
 		for (i = 0; i < 4; ++i) {
 			pdptrs[i] = mmu->get_pdptr(vcpu, i);
 			if (!(pdptrs[i] & PT_PRESENT_MASK))
···
 	 * Do we shadow a long mode page table? If so we need to
 	 * write-protect the guests page table root.
 	 */
-	if (mmu->root_level >= PT64_ROOT_4LEVEL) {
+	if (mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL) {
 		root = mmu_alloc_root(vcpu, root_gfn, 0,
-				      mmu->shadow_root_level, false);
+				      mmu->root_role.level, false);
 		mmu->root.hpa = root;
 		goto set_root_pgd;
 	}
···
 	 * or a PAE 3-level page table. In either case we need to be aware that
 	 * the shadow page table may be a PAE or a long mode page table.
 	 */
-	pm_mask = PT_PRESENT_MASK | shadow_me_mask;
-	if (mmu->shadow_root_level >= PT64_ROOT_4LEVEL) {
+	pm_mask = PT_PRESENT_MASK | shadow_me_value;
+	if (mmu->root_role.level >= PT64_ROOT_4LEVEL) {
 		pm_mask |= PT_ACCESSED_MASK | PT_WRITABLE_MASK | PT_USER_MASK;
 
 		if (WARN_ON_ONCE(!mmu->pml4_root)) {
···
 		}
 		mmu->pml4_root[0] = __pa(mmu->pae_root) | pm_mask;
 
-		if (mmu->shadow_root_level == PT64_ROOT_5LEVEL) {
+		if (mmu->root_role.level == PT64_ROOT_5LEVEL) {
 			if (WARN_ON_ONCE(!mmu->pml5_root)) {
 				r = -EIO;
 				goto out_unlock;
···
 	for (i = 0; i < 4; ++i) {
 		WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i]));
 
-		if (mmu->root_level == PT32E_ROOT_LEVEL) {
+		if (mmu->cpu_role.base.level == PT32E_ROOT_LEVEL) {
 			if (!(pdptrs[i] & PT_PRESENT_MASK)) {
 				mmu->pae_root[i] = INVALID_PAE_ROOT;
 				continue;
···
 		mmu->pae_root[i] = root | pm_mask;
 	}
 
-	if (mmu->shadow_root_level == PT64_ROOT_5LEVEL)
+	if (mmu->root_role.level == PT64_ROOT_5LEVEL)
 		mmu->root.hpa = __pa(mmu->pml5_root);
-	else if (mmu->shadow_root_level == PT64_ROOT_4LEVEL)
+	else if (mmu->root_role.level == PT64_ROOT_4LEVEL)
 		mmu->root.hpa = __pa(mmu->pml4_root);
 	else
 		mmu->root.hpa = __pa(mmu->pae_root);
···
 static int mmu_alloc_special_roots(struct kvm_vcpu *vcpu)
 {
 	struct kvm_mmu *mmu = vcpu->arch.mmu;
-	bool need_pml5 = mmu->shadow_root_level > PT64_ROOT_4LEVEL;
+	bool need_pml5 = mmu->root_role.level > PT64_ROOT_4LEVEL;
 	u64 *pml5_root = NULL;
 	u64 *pml4_root = NULL;
 	u64 *pae_root;
···
 	 * equivalent level in the guest's NPT to shadow.  Allocate the tables
 	 * on demand, as running a 32-bit L1 VMM on 64-bit KVM is very rare.
 	 */
-	if (mmu->direct_map || mmu->root_level >= PT64_ROOT_4LEVEL ||
-	    mmu->shadow_root_level < PT64_ROOT_4LEVEL)
+	if (mmu->root_role.direct ||
+	    mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL ||
+	    mmu->root_role.level < PT64_ROOT_4LEVEL)
 		return 0;
 
 	/*
···
 	int i;
 	struct kvm_mmu_page *sp;
 
-	if (vcpu->arch.mmu->direct_map)
+	if (vcpu->arch.mmu->root_role.direct)
 		return;
 
 	if (!VALID_PAGE(vcpu->arch.mmu->root.hpa))
···
 
 	vcpu_clear_mmio_info(vcpu, MMIO_GVA_ANY);
 
-	if (vcpu->arch.mmu->root_level >= PT64_ROOT_4LEVEL) {
+	if (vcpu->arch.mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL) {
 		hpa_t root = vcpu->arch.mmu->root.hpa;
 		sp = to_shadow_page(root);
···
 
 	arch.token = alloc_apf_token(vcpu);
 	arch.gfn = gfn;
-	arch.direct_map = vcpu->arch.mmu->direct_map;
+	arch.direct_map = vcpu->arch.mmu->root_role.direct;
 	arch.cr3 = vcpu->arch.mmu->get_guest_pgd(vcpu);
 
 	return kvm_setup_async_pf(vcpu, cr2_or_gpa,
 				  kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch);
 }
 
-static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, int *r)
+void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
+{
+	int r;
+
+	if ((vcpu->arch.mmu->root_role.direct != work->arch.direct_map) ||
+	      work->wakeup_all)
+		return;
+
+	r = kvm_mmu_reload(vcpu);
+	if (unlikely(r))
+		return;
+
+	if (!vcpu->arch.mmu->root_role.direct &&
+	    work->arch.cr3 != vcpu->arch.mmu->get_guest_pgd(vcpu))
+		return;
+
+	kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
+}
+
+static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
 	struct kvm_memory_slot *slot = fault->slot;
 	bool async;
···
 	 * be zapped before KVM inserts a new MMIO SPTE for the gfn.
 	 */
 	if (slot && (slot->flags & KVM_MEMSLOT_INVALID))
-		goto out_retry;
+		return RET_PF_RETRY;
 
 	if (!kvm_is_visible_memslot(slot)) {
 		/* Don't expose private memslots to L2. */
···
 			fault->slot = NULL;
 			fault->pfn = KVM_PFN_NOSLOT;
 			fault->map_writable = false;
-			return false;
+			return RET_PF_CONTINUE;
 		}
 		/*
 		 * If the APIC access page exists but is disabled, go directly
···
 		 * when the AVIC is re-enabled.
 		 */
 		if (slot && slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT &&
-		    !kvm_apicv_activated(vcpu->kvm)) {
-			*r = RET_PF_EMULATE;
-			return true;
-		}
+		    !kvm_apicv_activated(vcpu->kvm))
+			return RET_PF_EMULATE;
 	}
 
 	async = false;
···
 					  fault->write, &fault->map_writable,
 					  &fault->hva);
 	if (!async)
-		return false; /* *pfn has correct page already */
+		return RET_PF_CONTINUE; /* *pfn has correct page already */
 
 	if (!fault->prefetch && kvm_can_do_async_pf(vcpu)) {
 		trace_kvm_try_async_get_page(fault->addr, fault->gfn);
 		if (kvm_find_async_pf_gfn(vcpu, fault->gfn)) {
 			trace_kvm_async_pf_doublefault(fault->addr, fault->gfn);
 			kvm_make_request(KVM_REQ_APF_HALT, vcpu);
-			goto out_retry;
-		} else if (kvm_arch_setup_async_pf(vcpu, fault->addr, fault->gfn))
-			goto out_retry;
+			return RET_PF_RETRY;
+		} else if (kvm_arch_setup_async_pf(vcpu, fault->addr, fault->gfn)) {
+			return RET_PF_RETRY;
+		}
 	}
 
 	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, NULL,
 					  fault->write, &fault->map_writable,
 					  &fault->hva);
-	return false;
-
-out_retry:
-	*r = RET_PF_RETRY;
-	return true;
+	return RET_PF_CONTINUE;
 }
 
 /*
···
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
 
-	if (kvm_faultin_pfn(vcpu, fault, &r))
+	r = kvm_faultin_pfn(vcpu, fault);
+	if (r != RET_PF_CONTINUE)
 		return r;
 
-	if (handle_abnormal_pfn(vcpu, fault, ACC_ALL, &r))
+	r = handle_abnormal_pfn(vcpu, fault, ACC_ALL);
+	if (r != RET_PF_CONTINUE)
 		return r;
 
 	r = RET_PF_RETRY;
···
 	context->gva_to_gpa = nonpaging_gva_to_gpa;
 	context->sync_page = nonpaging_sync_page;
 	context->invlpg = NULL;
-	context->direct_map = true;
 }
 
 static inline bool is_root_usable(struct kvm_mmu_root_info *root, gpa_t pgd,
···
 void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd)
 {
 	struct kvm_mmu *mmu = vcpu->arch.mmu;
-	union kvm_mmu_page_role new_role = mmu->mmu_role.base;
+	union kvm_mmu_page_role new_role = mmu->root_role;
 
 	if (!fast_pgd_switch(vcpu->kvm, mmu, new_pgd, new_role)) {
 		/* kvm_mmu_ensure_valid_pgd will set up a new root.  */
···
 		guest_cpuid_has(vcpu, X86_FEATURE_GBPAGES);
 }
 
-static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
-				  struct kvm_mmu *context)
+static void reset_guest_rsvds_bits_mask(struct kvm_vcpu *vcpu,
+					struct kvm_mmu *context)
 {
 	__reset_rsvds_bits_mask(&context->guest_rsvd_check,
 				vcpu->arch.reserved_gpa_bits,
-				context->root_level, is_efer_nx(context),
+				context->cpu_role.base.level, is_efer_nx(context),
 				guest_can_use_gbpages(vcpu),
 				is_cr4_pse(context),
 				guest_cpuid_is_amd_or_hygon(vcpu));
···
 static void reset_shadow_zero_bits_mask(struct kvm_vcpu *vcpu,
 					struct kvm_mmu *context)
 {
-	/*
-	 * KVM uses NX when TDP is disabled to handle a variety of scenarios,
-	 * notably for huge SPTEs if iTLB multi-hit mitigation is enabled and
-	 * to generate correct permissions for CR0.WP=0/CR4.SMEP=1/EFER.NX=0.
-	 * The iTLB multi-hit workaround can be toggled at any time, so assume
-	 * NX can be used by any non-nested shadow MMU to avoid having to reset
-	 * MMU contexts.  Note, KVM forces EFER.NX=1 when TDP is disabled.
-	 */
-	bool uses_nx = is_efer_nx(context) || !tdp_enabled;
-
 	/* @amd adds a check on bit of SPTEs, which KVM shouldn't use anyways. */
 	bool is_amd = true;
 	/* KVM doesn't use 2-level page tables for the shadow MMU. */
···
 	struct rsvd_bits_validate *shadow_zero_check;
 	int i;
 
-	WARN_ON_ONCE(context->shadow_root_level < PT32E_ROOT_LEVEL);
+	WARN_ON_ONCE(context->root_role.level < PT32E_ROOT_LEVEL);
 
 	shadow_zero_check = &context->shadow_zero_check;
 	__reset_rsvds_bits_mask(shadow_zero_check, reserved_hpa_bits(),
-				context->shadow_root_level, uses_nx,
+				context->root_role.level,
+				context->root_role.efer_nx,
 				guest_can_use_gbpages(vcpu), is_pse, is_amd);
 
 	if (!shadow_me_mask)
 		return;
 
-	for (i = context->shadow_root_level; --i >= 0;) {
-		shadow_zero_check->rsvd_bits_mask[0][i] &= ~shadow_me_mask;
-		shadow_zero_check->rsvd_bits_mask[1][i] &= ~shadow_me_mask;
+	for (i = context->root_role.level; --i >= 0;) {
+		/*
+		 * So far shadow_me_value is a constant during KVM's life
+		 * time.  Bits in shadow_me_value are allowed to be set.
+		 * Bits in shadow_me_mask but not in shadow_me_value are
+		 * not allowed to be set.
+		 */
+		shadow_zero_check->rsvd_bits_mask[0][i] |= shadow_me_mask;
+		shadow_zero_check->rsvd_bits_mask[1][i] |= shadow_me_mask;
+		shadow_zero_check->rsvd_bits_mask[0][i] &= ~shadow_me_value;
+		shadow_zero_check->rsvd_bits_mask[1][i] &= ~shadow_me_value;
 	}
 
 }
···
 
 	if (boot_cpu_is_amd())
 		__reset_rsvds_bits_mask(shadow_zero_check, reserved_hpa_bits(),
-					context->shadow_root_level, false,
+					context->root_role.level, false,
 					boot_cpu_has(X86_FEATURE_GBPAGES),
 					false, true);
 	else
···
 	if (!shadow_me_mask)
 		return;
 
-	for (i = context->shadow_root_level; --i >= 0;) {
+	for (i = context->root_role.level; --i >= 0;) {
 		shadow_zero_check->rsvd_bits_mask[0][i] &= ~shadow_me_mask;
 		shadow_zero_check->rsvd_bits_mask[1][i] &= ~shadow_me_mask;
 	}
···
 	if (!is_cr0_pg(mmu))
 		return;
 
-	reset_rsvds_bits_mask(vcpu, mmu);
+	reset_guest_rsvds_bits_mask(vcpu, mmu);
 	update_permission_bitmask(mmu, false);
 	update_pkru_bitmask(mmu);
 }
···
 	context->gva_to_gpa = paging64_gva_to_gpa;
 	context->sync_page = paging64_sync_page;
 	context->invlpg = paging64_invlpg;
-	context->direct_map = false;
 }
 
 static void paging32_init_context(struct kvm_mmu *context)
···
 	context->gva_to_gpa = paging32_gva_to_gpa;
 	context->sync_page = paging32_sync_page;
 	context->invlpg = paging32_invlpg;
-	context->direct_map = false;
 }
 
-static union kvm_mmu_extended_role kvm_calc_mmu_role_ext(struct kvm_vcpu *vcpu,
-							 struct kvm_mmu_role_regs *regs)
+static union kvm_cpu_role
+kvm_calc_cpu_role(struct kvm_vcpu *vcpu, const struct kvm_mmu_role_regs *regs)
 {
-	union kvm_mmu_extended_role ext = {0};
-
-	if
(____is_cr0_pg(regs)) { 4783 - ext.cr0_pg = 1; 4784 - ext.cr4_pae = ____is_cr4_pae(regs); 4785 - ext.cr4_smep = ____is_cr4_smep(regs); 4786 - ext.cr4_smap = ____is_cr4_smap(regs); 4787 - ext.cr4_pse = ____is_cr4_pse(regs); 4788 - 4789 - /* PKEY and LA57 are active iff long mode is active. */ 4790 - ext.cr4_pke = ____is_efer_lma(regs) && ____is_cr4_pke(regs); 4791 - ext.cr4_la57 = ____is_efer_lma(regs) && ____is_cr4_la57(regs); 4792 - ext.efer_lma = ____is_efer_lma(regs); 4793 - } 4794 - 4795 - ext.valid = 1; 4796 - 4797 - return ext; 4798 - } 4799 - 4800 - static union kvm_mmu_role kvm_calc_mmu_role_common(struct kvm_vcpu *vcpu, 4801 - struct kvm_mmu_role_regs *regs, 4802 - bool base_only) 4803 - { 4804 - union kvm_mmu_role role = {0}; 4728 + union kvm_cpu_role role = {0}; 4805 4729 4806 4730 role.base.access = ACC_ALL; 4807 - if (____is_cr0_pg(regs)) { 4808 - role.base.efer_nx = ____is_efer_nx(regs); 4809 - role.base.cr0_wp = ____is_cr0_wp(regs); 4810 - } 4811 4731 role.base.smm = is_smm(vcpu); 4812 4732 role.base.guest_mode = is_guest_mode(vcpu); 4733 + role.ext.valid = 1; 4813 4734 4814 - if (base_only) 4735 + if (!____is_cr0_pg(regs)) { 4736 + role.base.direct = 1; 4815 4737 return role; 4738 + } 4816 4739 4817 - role.ext = kvm_calc_mmu_role_ext(vcpu, regs); 4740 + role.base.efer_nx = ____is_efer_nx(regs); 4741 + role.base.cr0_wp = ____is_cr0_wp(regs); 4742 + role.base.smep_andnot_wp = ____is_cr4_smep(regs) && !____is_cr0_wp(regs); 4743 + role.base.smap_andnot_wp = ____is_cr4_smap(regs) && !____is_cr0_wp(regs); 4744 + role.base.has_4_byte_gpte = !____is_cr4_pae(regs); 4818 4745 4746 + if (____is_efer_lma(regs)) 4747 + role.base.level = ____is_cr4_la57(regs) ? 
PT64_ROOT_5LEVEL 4748 + : PT64_ROOT_4LEVEL; 4749 + else if (____is_cr4_pae(regs)) 4750 + role.base.level = PT32E_ROOT_LEVEL; 4751 + else 4752 + role.base.level = PT32_ROOT_LEVEL; 4753 + 4754 + role.ext.cr4_smep = ____is_cr4_smep(regs); 4755 + role.ext.cr4_smap = ____is_cr4_smap(regs); 4756 + role.ext.cr4_pse = ____is_cr4_pse(regs); 4757 + 4758 + /* PKEY and LA57 are active iff long mode is active. */ 4759 + role.ext.cr4_pke = ____is_efer_lma(regs) && ____is_cr4_pke(regs); 4760 + role.ext.cr4_la57 = ____is_efer_lma(regs) && ____is_cr4_la57(regs); 4761 + role.ext.efer_lma = ____is_efer_lma(regs); 4819 4762 return role; 4820 4763 } 4821 4764 ··· 4826 4781 return max_tdp_level; 4827 4782 } 4828 4783 4829 - static union kvm_mmu_role 4784 + static union kvm_mmu_page_role 4830 4785 kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu, 4831 - struct kvm_mmu_role_regs *regs, bool base_only) 4786 + union kvm_cpu_role cpu_role) 4832 4787 { 4833 - union kvm_mmu_role role = kvm_calc_mmu_role_common(vcpu, regs, base_only); 4788 + union kvm_mmu_page_role role = {0}; 4834 4789 4835 - role.base.ad_disabled = (shadow_accessed_mask == 0); 4836 - role.base.level = kvm_mmu_get_tdp_level(vcpu); 4837 - role.base.direct = true; 4838 - role.base.has_4_byte_gpte = false; 4790 + role.access = ACC_ALL; 4791 + role.cr0_wp = true; 4792 + role.efer_nx = true; 4793 + role.smm = cpu_role.base.smm; 4794 + role.guest_mode = cpu_role.base.guest_mode; 4795 + role.ad_disabled = !kvm_ad_enabled(); 4796 + role.level = kvm_mmu_get_tdp_level(vcpu); 4797 + role.direct = true; 4798 + role.has_4_byte_gpte = false; 4839 4799 4840 4800 return role; 4841 4801 } 4842 4802 4843 - static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu) 4803 + static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu, 4804 + union kvm_cpu_role cpu_role) 4844 4805 { 4845 4806 struct kvm_mmu *context = &vcpu->arch.root_mmu; 4846 - struct kvm_mmu_role_regs regs = vcpu_to_role_regs(vcpu); 4847 - union kvm_mmu_role new_role = 4848 - 
kvm_calc_tdp_mmu_root_page_role(vcpu, &regs, false); 4807 + union kvm_mmu_page_role root_role = kvm_calc_tdp_mmu_root_page_role(vcpu, cpu_role); 4849 4808 4850 - if (new_role.as_u64 == context->mmu_role.as_u64) 4809 + if (cpu_role.as_u64 == context->cpu_role.as_u64 && 4810 + root_role.word == context->root_role.word) 4851 4811 return; 4852 4812 4853 - context->mmu_role.as_u64 = new_role.as_u64; 4813 + context->cpu_role.as_u64 = cpu_role.as_u64; 4814 + context->root_role.word = root_role.word; 4854 4815 context->page_fault = kvm_tdp_page_fault; 4855 4816 context->sync_page = nonpaging_sync_page; 4856 4817 context->invlpg = NULL; 4857 - context->shadow_root_level = kvm_mmu_get_tdp_level(vcpu); 4858 - context->direct_map = true; 4859 4818 context->get_guest_pgd = get_cr3; 4860 4819 context->get_pdptr = kvm_pdptr_read; 4861 4820 context->inject_page_fault = kvm_inject_page_fault; 4862 - context->root_level = role_regs_to_root_level(&regs); 4863 4821 4864 4822 if (!is_cr0_pg(context)) 4865 4823 context->gva_to_gpa = nonpaging_gva_to_gpa; ··· 4875 4827 reset_tdp_shadow_zero_bits_mask(context); 4876 4828 } 4877 4829 4878 - static union kvm_mmu_role 4879 - kvm_calc_shadow_root_page_role_common(struct kvm_vcpu *vcpu, 4880 - struct kvm_mmu_role_regs *regs, bool base_only) 4881 - { 4882 - union kvm_mmu_role role = kvm_calc_mmu_role_common(vcpu, regs, base_only); 4883 - 4884 - role.base.smep_andnot_wp = role.ext.cr4_smep && !____is_cr0_wp(regs); 4885 - role.base.smap_andnot_wp = role.ext.cr4_smap && !____is_cr0_wp(regs); 4886 - role.base.has_4_byte_gpte = ____is_cr0_pg(regs) && !____is_cr4_pae(regs); 4887 - 4888 - return role; 4889 - } 4890 - 4891 - static union kvm_mmu_role 4892 - kvm_calc_shadow_mmu_root_page_role(struct kvm_vcpu *vcpu, 4893 - struct kvm_mmu_role_regs *regs, bool base_only) 4894 - { 4895 - union kvm_mmu_role role = 4896 - kvm_calc_shadow_root_page_role_common(vcpu, regs, base_only); 4897 - 4898 - role.base.direct = !____is_cr0_pg(regs); 4899 - 4900 - if 
(!____is_efer_lma(regs)) 4901 - role.base.level = PT32E_ROOT_LEVEL; 4902 - else if (____is_cr4_la57(regs)) 4903 - role.base.level = PT64_ROOT_5LEVEL; 4904 - else 4905 - role.base.level = PT64_ROOT_4LEVEL; 4906 - 4907 - return role; 4908 - } 4909 - 4910 4830 static void shadow_mmu_init_context(struct kvm_vcpu *vcpu, struct kvm_mmu *context, 4911 - struct kvm_mmu_role_regs *regs, 4912 - union kvm_mmu_role new_role) 4831 + union kvm_cpu_role cpu_role, 4832 + union kvm_mmu_page_role root_role) 4913 4833 { 4914 - if (new_role.as_u64 == context->mmu_role.as_u64) 4834 + if (cpu_role.as_u64 == context->cpu_role.as_u64 && 4835 + root_role.word == context->root_role.word) 4915 4836 return; 4916 4837 4917 - context->mmu_role.as_u64 = new_role.as_u64; 4838 + context->cpu_role.as_u64 = cpu_role.as_u64; 4839 + context->root_role.word = root_role.word; 4918 4840 4919 4841 if (!is_cr0_pg(context)) 4920 4842 nonpaging_init_context(context); ··· 4892 4874 paging64_init_context(context); 4893 4875 else 4894 4876 paging32_init_context(context); 4895 - context->root_level = role_regs_to_root_level(regs); 4896 4877 4897 4878 reset_guest_paging_metadata(vcpu, context); 4898 - context->shadow_root_level = new_role.base.level; 4899 - 4900 4879 reset_shadow_zero_bits_mask(vcpu, context); 4901 4880 } 4902 4881 4903 4882 static void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, 4904 - struct kvm_mmu_role_regs *regs) 4883 + union kvm_cpu_role cpu_role) 4905 4884 { 4906 4885 struct kvm_mmu *context = &vcpu->arch.root_mmu; 4907 - union kvm_mmu_role new_role = 4908 - kvm_calc_shadow_mmu_root_page_role(vcpu, regs, false); 4886 + union kvm_mmu_page_role root_role; 4909 4887 4910 - shadow_mmu_init_context(vcpu, context, regs, new_role); 4911 - } 4888 + root_role = cpu_role.base; 4912 4889 4913 - static union kvm_mmu_role 4914 - kvm_calc_shadow_npt_root_page_role(struct kvm_vcpu *vcpu, 4915 - struct kvm_mmu_role_regs *regs) 4916 - { 4917 - union kvm_mmu_role role = 4918 - 
kvm_calc_shadow_root_page_role_common(vcpu, regs, false); 4890 + /* KVM uses PAE paging whenever the guest isn't using 64-bit paging. */ 4891 + root_role.level = max_t(u32, root_role.level, PT32E_ROOT_LEVEL); 4919 4892 4920 - role.base.direct = false; 4921 - role.base.level = kvm_mmu_get_tdp_level(vcpu); 4893 + /* 4894 + * KVM forces EFER.NX=1 when TDP is disabled, reflect it in the MMU role. 4895 + * KVM uses NX when TDP is disabled to handle a variety of scenarios, 4896 + * notably for huge SPTEs if iTLB multi-hit mitigation is enabled and 4897 + * to generate correct permissions for CR0.WP=0/CR4.SMEP=1/EFER.NX=0. 4898 + * The iTLB multi-hit workaround can be toggled at any time, so assume 4899 + * NX can be used by any non-nested shadow MMU to avoid having to reset 4900 + * MMU contexts. 4901 + */ 4902 + root_role.efer_nx = true; 4922 4903 4923 - return role; 4904 + shadow_mmu_init_context(vcpu, context, cpu_role, root_role); 4924 4905 } 4925 4906 4926 4907 void kvm_init_shadow_npt_mmu(struct kvm_vcpu *vcpu, unsigned long cr0, ··· 4931 4914 .cr4 = cr4 & ~X86_CR4_PKE, 4932 4915 .efer = efer, 4933 4916 }; 4934 - union kvm_mmu_role new_role; 4917 + union kvm_cpu_role cpu_role = kvm_calc_cpu_role(vcpu, &regs); 4918 + union kvm_mmu_page_role root_role; 4935 4919 4936 - new_role = kvm_calc_shadow_npt_root_page_role(vcpu, &regs); 4920 + /* NPT requires CR0.PG=1. 
*/ 4921 + WARN_ON_ONCE(cpu_role.base.direct); 4937 4922 4938 - shadow_mmu_init_context(vcpu, context, &regs, new_role); 4923 + root_role = cpu_role.base; 4924 + root_role.level = kvm_mmu_get_tdp_level(vcpu); 4925 + if (root_role.level == PT64_ROOT_5LEVEL && 4926 + cpu_role.base.level == PT64_ROOT_4LEVEL) 4927 + root_role.passthrough = 1; 4928 + 4929 + shadow_mmu_init_context(vcpu, context, cpu_role, root_role); 4939 4930 kvm_mmu_new_pgd(vcpu, nested_cr3); 4940 4931 } 4941 4932 EXPORT_SYMBOL_GPL(kvm_init_shadow_npt_mmu); 4942 4933 4943 - static union kvm_mmu_role 4934 + static union kvm_cpu_role 4944 4935 kvm_calc_shadow_ept_root_page_role(struct kvm_vcpu *vcpu, bool accessed_dirty, 4945 4936 bool execonly, u8 level) 4946 4937 { 4947 - union kvm_mmu_role role = {0}; 4938 + union kvm_cpu_role role = {0}; 4948 4939 4949 - /* SMM flag is inherited from root_mmu */ 4950 - role.base.smm = vcpu->arch.root_mmu.mmu_role.base.smm; 4951 - 4940 + /* 4941 + * KVM does not support SMM transfer monitors, and consequently does not 4942 + * support the "entry to SMM" control either. role.base.smm is always 0. 4943 + */ 4944 + WARN_ON_ONCE(is_smm(vcpu)); 4952 4945 role.base.level = level; 4953 4946 role.base.has_4_byte_gpte = false; 4954 4947 role.base.direct = false; ··· 4966 4939 role.base.guest_mode = true; 4967 4940 role.base.access = ACC_ALL; 4968 4941 4969 - /* EPT, and thus nested EPT, does not consume CR0, CR4, nor EFER. 
*/ 4970 4942 role.ext.word = 0; 4971 4943 role.ext.execonly = execonly; 4972 4944 role.ext.valid = 1; ··· 4979 4953 { 4980 4954 struct kvm_mmu *context = &vcpu->arch.guest_mmu; 4981 4955 u8 level = vmx_eptp_page_walk_level(new_eptp); 4982 - union kvm_mmu_role new_role = 4956 + union kvm_cpu_role new_mode = 4983 4957 kvm_calc_shadow_ept_root_page_role(vcpu, accessed_dirty, 4984 4958 execonly, level); 4985 4959 4986 - if (new_role.as_u64 != context->mmu_role.as_u64) { 4987 - context->mmu_role.as_u64 = new_role.as_u64; 4960 + if (new_mode.as_u64 != context->cpu_role.as_u64) { 4961 + /* EPT, and thus nested EPT, does not consume CR0, CR4, nor EFER. */ 4962 + context->cpu_role.as_u64 = new_mode.as_u64; 4963 + context->root_role.word = new_mode.base.word; 4988 4964 4989 - context->shadow_root_level = level; 4990 - 4991 - context->ept_ad = accessed_dirty; 4992 4965 context->page_fault = ept_page_fault; 4993 4966 context->gva_to_gpa = ept_gva_to_gpa; 4994 4967 context->sync_page = ept_sync_page; 4995 4968 context->invlpg = ept_invlpg; 4996 - context->root_level = level; 4997 - context->direct_map = false; 4969 + 4998 4970 update_permission_bitmask(context, true); 4999 4971 context->pkru_mask = 0; 5000 4972 reset_rsvds_bits_mask_ept(vcpu, context, execonly, huge_page_level); ··· 5003 4979 } 5004 4980 EXPORT_SYMBOL_GPL(kvm_init_shadow_ept_mmu); 5005 4981 5006 - static void init_kvm_softmmu(struct kvm_vcpu *vcpu) 4982 + static void init_kvm_softmmu(struct kvm_vcpu *vcpu, 4983 + union kvm_cpu_role cpu_role) 5007 4984 { 5008 4985 struct kvm_mmu *context = &vcpu->arch.root_mmu; 5009 - struct kvm_mmu_role_regs regs = vcpu_to_role_regs(vcpu); 5010 4986 5011 - kvm_init_shadow_mmu(vcpu, &regs); 4987 + kvm_init_shadow_mmu(vcpu, cpu_role); 5012 4988 5013 4989 context->get_guest_pgd = get_cr3; 5014 4990 context->get_pdptr = kvm_pdptr_read; 5015 4991 context->inject_page_fault = kvm_inject_page_fault; 5016 4992 } 5017 4993 5018 - static union kvm_mmu_role 5019 - 
kvm_calc_nested_mmu_role(struct kvm_vcpu *vcpu, struct kvm_mmu_role_regs *regs) 4994 + static void init_kvm_nested_mmu(struct kvm_vcpu *vcpu, 4995 + union kvm_cpu_role new_mode) 5020 4996 { 5021 - union kvm_mmu_role role; 5022 - 5023 - role = kvm_calc_shadow_root_page_role_common(vcpu, regs, false); 5024 - 5025 - /* 5026 - * Nested MMUs are used only for walking L2's gva->gpa, they never have 5027 - * shadow pages of their own and so "direct" has no meaning. Set it 5028 - * to "true" to try to detect bogus usage of the nested MMU. 5029 - */ 5030 - role.base.direct = true; 5031 - role.base.level = role_regs_to_root_level(regs); 5032 - return role; 5033 - } 5034 - 5035 - static void init_kvm_nested_mmu(struct kvm_vcpu *vcpu) 5036 - { 5037 - struct kvm_mmu_role_regs regs = vcpu_to_role_regs(vcpu); 5038 - union kvm_mmu_role new_role = kvm_calc_nested_mmu_role(vcpu, &regs); 5039 4997 struct kvm_mmu *g_context = &vcpu->arch.nested_mmu; 5040 4998 5041 - if (new_role.as_u64 == g_context->mmu_role.as_u64) 4999 + if (new_mode.as_u64 == g_context->cpu_role.as_u64) 5042 5000 return; 5043 5001 5044 - g_context->mmu_role.as_u64 = new_role.as_u64; 5002 + g_context->cpu_role.as_u64 = new_mode.as_u64; 5045 5003 g_context->get_guest_pgd = get_cr3; 5046 5004 g_context->get_pdptr = kvm_pdptr_read; 5047 5005 g_context->inject_page_fault = kvm_inject_page_fault; 5048 - g_context->root_level = new_role.base.level; 5049 5006 5050 5007 /* 5051 5008 * L2 page tables are never shadowed, so there is no need to sync ··· 5056 5051 5057 5052 void kvm_init_mmu(struct kvm_vcpu *vcpu) 5058 5053 { 5054 + struct kvm_mmu_role_regs regs = vcpu_to_role_regs(vcpu); 5055 + union kvm_cpu_role cpu_role = kvm_calc_cpu_role(vcpu, &regs); 5056 + 5059 5057 if (mmu_is_nested(vcpu)) 5060 - init_kvm_nested_mmu(vcpu); 5058 + init_kvm_nested_mmu(vcpu, cpu_role); 5061 5059 else if (tdp_enabled) 5062 - init_kvm_tdp_mmu(vcpu); 5060 + init_kvm_tdp_mmu(vcpu, cpu_role); 5063 5061 else 5064 - init_kvm_softmmu(vcpu); 5062 + 
init_kvm_softmmu(vcpu, cpu_role); 5065 5063 } 5066 5064 EXPORT_SYMBOL_GPL(kvm_init_mmu); 5067 5065 ··· 5082 5074 * problem is swept under the rug; KVM's CPUID API is horrific and 5083 5075 * it's all but impossible to solve it without introducing a new API. 5084 5076 */ 5085 - vcpu->arch.root_mmu.mmu_role.ext.valid = 0; 5086 - vcpu->arch.guest_mmu.mmu_role.ext.valid = 0; 5087 - vcpu->arch.nested_mmu.mmu_role.ext.valid = 0; 5077 + vcpu->arch.root_mmu.root_role.word = 0; 5078 + vcpu->arch.guest_mmu.root_role.word = 0; 5079 + vcpu->arch.nested_mmu.root_role.word = 0; 5080 + vcpu->arch.root_mmu.cpu_role.ext.valid = 0; 5081 + vcpu->arch.guest_mmu.cpu_role.ext.valid = 0; 5082 + vcpu->arch.nested_mmu.cpu_role.ext.valid = 0; 5088 5083 kvm_mmu_reset_context(vcpu); 5089 5084 5090 5085 /* ··· 5108 5097 { 5109 5098 int r; 5110 5099 5111 - r = mmu_topup_memory_caches(vcpu, !vcpu->arch.mmu->direct_map); 5100 + r = mmu_topup_memory_caches(vcpu, !vcpu->arch.mmu->root_role.direct); 5112 5101 if (r) 5113 5102 goto out; 5114 5103 r = mmu_alloc_special_roots(vcpu); 5115 5104 if (r) 5116 5105 goto out; 5117 - if (vcpu->arch.mmu->direct_map) 5106 + if (vcpu->arch.mmu->root_role.direct) 5118 5107 r = mmu_alloc_direct_roots(vcpu); 5119 5108 else 5120 5109 r = mmu_alloc_shadow_roots(vcpu); ··· 5341 5330 5342 5331 ++vcpu->kvm->stat.mmu_pte_write; 5343 5332 5344 - for_each_gfn_indirect_valid_sp(vcpu->kvm, sp, gfn) { 5333 + for_each_gfn_valid_sp_with_gptes(vcpu->kvm, sp, gfn) { 5345 5334 if (detect_write_misaligned(sp, gpa, bytes) || 5346 5335 detect_write_flooding(sp)) { 5347 5336 kvm_mmu_prepare_zap_page(vcpu->kvm, sp, &invalid_list); ··· 5367 5356 write_unlock(&vcpu->kvm->mmu_lock); 5368 5357 } 5369 5358 5370 - int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code, 5359 + int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code, 5371 5360 void *insn, int insn_len) 5372 5361 { 5373 5362 int r, emulation_type = EMULTYPE_PF; 5374 - bool 
direct = vcpu->arch.mmu->direct_map; 5363 + bool direct = vcpu->arch.mmu->root_role.direct; 5375 5364 5376 5365 if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root.hpa))) 5377 5366 return RET_PF_RETRY; ··· 5402 5391 * paging in both guests. If true, we simply unprotect the page 5403 5392 * and resume the guest. 5404 5393 */ 5405 - if (vcpu->arch.mmu->direct_map && 5394 + if (vcpu->arch.mmu->root_role.direct && 5406 5395 (error_code & PFERR_NESTED_GUEST_PAGE) == PFERR_NESTED_GUEST_PAGE) { 5407 5396 kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2_or_gpa)); 5408 5397 return 1; ··· 5636 5625 if (!tdp_enabled) 5637 5626 set_memory_decrypted((unsigned long)mmu->pae_root, 1); 5638 5627 else 5639 - WARN_ON_ONCE(shadow_me_mask); 5628 + WARN_ON_ONCE(shadow_me_value); 5640 5629 5641 5630 for (i = 0; i < 4; ++i) 5642 5631 mmu->pae_root[i] = INVALID_PAE_ROOT; ··· 6298 6287 */ 6299 6288 BUILD_BUG_ON(sizeof(union kvm_mmu_page_role) != sizeof(u32)); 6300 6289 BUILD_BUG_ON(sizeof(union kvm_mmu_extended_role) != sizeof(u32)); 6301 - BUILD_BUG_ON(sizeof(union kvm_mmu_role) != sizeof(u64)); 6290 + BUILD_BUG_ON(sizeof(union kvm_cpu_role) != sizeof(u64)); 6302 6291 6303 6292 kvm_mmu_reset_all_pte_masks(); 6304 6293
+121 -2
arch/x86/kvm/mmu/mmu_internal.h
···
                              u64 start_gfn, u64 pages);
 unsigned int pte_list_count(struct kvm_rmap_head *rmap_head);
 
+extern int nx_huge_pages;
+static inline bool is_nx_huge_page_enabled(void)
+{
+        return READ_ONCE(nx_huge_pages);
+}
+
+struct kvm_page_fault {
+        /* arguments to kvm_mmu_do_page_fault.  */
+        const gpa_t addr;
+        const u32 error_code;
+        const bool prefetch;
+
+        /* Derived from error_code.  */
+        const bool exec;
+        const bool write;
+        const bool present;
+        const bool rsvd;
+        const bool user;
+
+        /* Derived from mmu and global state.  */
+        const bool is_tdp;
+        const bool nx_huge_page_workaround_enabled;
+
+        /*
+         * Whether a >4KB mapping can be created or is forbidden due to NX
+         * hugepages.
+         */
+        bool huge_page_disallowed;
+
+        /*
+         * Maximum page size that can be created for this fault; input to
+         * FNAME(fetch), __direct_map and kvm_tdp_mmu_map.
+         */
+        u8 max_level;
+
+        /*
+         * Page size that can be created based on the max_level and the
+         * page size used by the host mapping.
+         */
+        u8 req_level;
+
+        /*
+         * Page size that will be created based on the req_level and
+         * huge_page_disallowed.
+         */
+        u8 goal_level;
+
+        /* Shifted addr, or result of guest page table walk if addr is a gva.  */
+        gfn_t gfn;
+
+        /* The memslot containing gfn. May be NULL. */
+        struct kvm_memory_slot *slot;
+
+        /* Outputs of kvm_faultin_pfn.  */
+        kvm_pfn_t pfn;
+        hva_t hva;
+        bool map_writable;
+};
+
+int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
+
 /*
- * Return values of handle_mmio_page_fault, mmu.page_fault, and fast_page_fault().
+ * Return values of handle_mmio_page_fault(), mmu.page_fault(), fast_page_fault(),
+ * and of course kvm_mmu_do_page_fault().
  *
+ * RET_PF_CONTINUE: So far, so good, keep handling the page fault.
  * RET_PF_RETRY: let CPU fault again on the address.
  * RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
  * RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
···
  *
  * Any names added to this enum should be exported to userspace for use in
  * tracepoints via TRACE_DEFINE_ENUM() in mmutrace.h
+ *
+ * Note, all values must be greater than or equal to zero so as not to encroach
+ * on -errno return values.  Somewhat arbitrarily use '0' for CONTINUE, which
+ * will allow for efficient machine code when checking for CONTINUE, e.g.
+ * "TEST %rax, %rax, JNZ", as all "stop!" values are non-zero.
  */
 enum {
-        RET_PF_RETRY = 0,
+        RET_PF_CONTINUE = 0,
+        RET_PF_RETRY,
         RET_PF_EMULATE,
         RET_PF_INVALID,
         RET_PF_FIXED,
         RET_PF_SPURIOUS,
 };
+
+static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
+                                        u32 err, bool prefetch)
+{
+        struct kvm_page_fault fault = {
+                .addr = cr2_or_gpa,
+                .error_code = err,
+                .exec = err & PFERR_FETCH_MASK,
+                .write = err & PFERR_WRITE_MASK,
+                .present = err & PFERR_PRESENT_MASK,
+                .rsvd = err & PFERR_RSVD_MASK,
+                .user = err & PFERR_USER_MASK,
+                .prefetch = prefetch,
+                .is_tdp = likely(vcpu->arch.mmu->page_fault == kvm_tdp_page_fault),
+                .nx_huge_page_workaround_enabled = is_nx_huge_page_enabled(),
+
+                .max_level = KVM_MAX_HUGEPAGE_LEVEL,
+                .req_level = PG_LEVEL_4K,
+                .goal_level = PG_LEVEL_4K,
+        };
+        int r;
+
+        /*
+         * Async #PF "faults", a.k.a. prefetch faults, are not faults from the
+         * guest perspective and have already been counted at the time of the
+         * original fault.
+         */
+        if (!prefetch)
+                vcpu->stat.pf_taken++;
+
+        if (IS_ENABLED(CONFIG_RETPOLINE) && fault.is_tdp)
+                r = kvm_tdp_page_fault(vcpu, &fault);
+        else
+                r = vcpu->arch.mmu->page_fault(vcpu, &fault);
+
+        /*
+         * Similar to above, prefetch faults aren't truly spurious, and the
+         * async #PF path doesn't do emulation.  Do count faults that are fixed
+         * by the async #PF handler though, otherwise they'll never be counted.
+         */
+        if (r == RET_PF_FIXED)
+                vcpu->stat.pf_fixed++;
+        else if (prefetch)
+                ;
+        else if (r == RET_PF_EMULATE)
+                vcpu->stat.pf_emulate++;
+        else if (r == RET_PF_SPURIOUS)
+                vcpu->stat.pf_spurious++;
+        return r;
+}
 
 int kvm_mmu_max_mapping_level(struct kvm *kvm,
                               const struct kvm_memory_slot *slot, gfn_t gfn,
+1
arch/x86/kvm/mmu/mmutrace.h
···
         { PFERR_RSVD_MASK, "RSVD" },        \
         { PFERR_FETCH_MASK, "F" }
 
+TRACE_DEFINE_ENUM(RET_PF_CONTINUE);
 TRACE_DEFINE_ENUM(RET_PF_RETRY);
 TRACE_DEFINE_ENUM(RET_PF_EMULATE);
 TRACE_DEFINE_ENUM(RET_PF_INVALID);
+22 -49
arch/x86/kvm/mmu/paging_tmpl.h
···
 #define PT_LEVEL_BITS PT64_LEVEL_BITS
 #define PT_GUEST_DIRTY_SHIFT 9
 #define PT_GUEST_ACCESSED_SHIFT 8
-#define PT_HAVE_ACCESSED_DIRTY(mmu) ((mmu)->ept_ad)
+#define PT_HAVE_ACCESSED_DIRTY(mmu) (!(mmu)->cpu_role.base.ad_disabled)
 #ifdef CONFIG_X86_64
 #define CMPXCHG "cmpxchgq"
 #endif
···
                 FNAME(is_bad_mt_xwr)(&mmu->guest_rsvd_check, gpte);
 }
 
-static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
-                               pt_element_t __user *ptep_user, unsigned index,
-                               pt_element_t orig_pte, pt_element_t new_pte)
-{
-        signed char r;
-
-        if (!user_access_begin(ptep_user, sizeof(pt_element_t)))
-                return -EFAULT;
-
-#ifdef CMPXCHG
-        asm volatile("1:" LOCK_PREFIX CMPXCHG " %[new], %[ptr]\n"
-                     "setnz %b[r]\n"
-                     "2:"
-                     _ASM_EXTABLE_TYPE_REG(1b, 2b, EX_TYPE_EFAULT_REG, %k[r])
-                     : [ptr] "+m" (*ptep_user),
-                       [old] "+a" (orig_pte),
-                       [r] "=q" (r)
-                     : [new] "r" (new_pte)
-                     : "memory");
-#else
-        asm volatile("1:" LOCK_PREFIX "cmpxchg8b %[ptr]\n"
-                     "setnz %b[r]\n"
-                     "2:"
-                     _ASM_EXTABLE_TYPE_REG(1b, 2b, EX_TYPE_EFAULT_REG, %k[r])
-                     : [ptr] "+m" (*ptep_user),
-                       [old] "+A" (orig_pte),
-                       [r] "=q" (r)
-                     : [new_lo] "b" ((u32)new_pte),
-                       [new_hi] "c" ((u32)(new_pte >> 32))
-                     : "memory");
-#endif
-
-        user_access_end();
-        return r;
-}
-
 static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu,
                                          struct kvm_mmu_page *sp, u64 *spte,
                                          u64 gpte)
···
         if (!FNAME(is_present_gpte)(gpte))
                 goto no_present;
 
-        /* if accessed bit is not supported prefetch non accessed gpte */
+        /* Prefetch only accessed entries (unless A/D bits are disabled). */
         if (PT_HAVE_ACCESSED_DIRTY(vcpu->arch.mmu) &&
             !(gpte & PT_GUEST_ACCESSED_MASK))
                 goto no_present;
···
                 if (unlikely(!walker->pte_writable[level - 1]))
                         continue;
 
-                ret = FNAME(cmpxchg_gpte)(vcpu, mmu, ptep_user, index, orig_pte, pte);
+                ret = __try_cmpxchg_user(ptep_user, &orig_pte, pte, fault);
                 if (ret)
                         return ret;
···
          * is not reserved and does not indicate a large page at this level,
          * so clear PT_PAGE_SIZE_MASK in gpte if that is the case.
          */
-        gpte &= level - (PT32_ROOT_LEVEL + mmu->mmu_role.ext.cr4_pse);
+        gpte &= level - (PT32_ROOT_LEVEL + mmu->cpu_role.ext.cr4_pse);
 #endif
         /*
          * PG_LEVEL_4K always terminates.  The RHS has bit 7 set
···
 
         trace_kvm_mmu_pagetable_walk(addr, access);
 retry_walk:
-        walker->level = mmu->root_level;
+        walker->level = mmu->cpu_role.base.level;
         pte = mmu->get_guest_pgd(vcpu);
         have_ad = PT_HAVE_ACCESSED_DIRTY(mmu);
···
          * The other bits are set to 0.
          */
         if (!(errcode & PFERR_RSVD_MASK)) {
-                vcpu->arch.exit_qualification &= 0x180;
+                vcpu->arch.exit_qualification &= (EPT_VIOLATION_GVA_IS_VALID |
+                                                  EPT_VIOLATION_GVA_TRANSLATED);
                 if (write_fault)
                         vcpu->arch.exit_qualification |= EPT_VIOLATION_ACC_WRITE;
                 if (user_fault)
                         vcpu->arch.exit_qualification |= EPT_VIOLATION_ACC_READ;
                 if (fetch_fault)
                         vcpu->arch.exit_qualification |= EPT_VIOLATION_ACC_INSTR;
-                vcpu->arch.exit_qualification |= (pte_access & 0x7) << 3;
+
+                /*
+                 * Note, pte_access holds the raw RWX bits from the EPTE, not
+                 * ACC_*_MASK flags!
+                 */
+                vcpu->arch.exit_qualification |= (pte_access & VMX_EPT_RWX_MASK) <<
+                                                 EPT_VIOLATION_RWX_SHIFT;
         }
 #endif
         walker->fault.address = addr;
···
         WARN_ON_ONCE(gw->gfn != base_gfn);
         direct_access = gw->pte_access;
 
-        top_level = vcpu->arch.mmu->root_level;
+        top_level = vcpu->arch.mmu->cpu_role.base.level;
         if (top_level == PT32E_ROOT_LEVEL)
                 top_level = PT32_ROOT_LEVEL;
         /*
···
                 return ret;
 
         FNAME(pte_prefetch)(vcpu, gw, it.sptep);
-        ++vcpu->stat.pf_fixed;
         return ret;
 
 out_gpte_changed:
···
         mmu_seq = vcpu->kvm->mmu_notifier_seq;
         smp_rmb();
 
-        if (kvm_faultin_pfn(vcpu, fault, &r))
+        r = kvm_faultin_pfn(vcpu, fault);
+        if (r != RET_PF_CONTINUE)
                 return r;
 
-        if (handle_abnormal_pfn(vcpu, fault, walker.pte_access, &r))
+        r = handle_abnormal_pfn(vcpu, fault, walker.pte_access);
+        if (r != RET_PF_CONTINUE)
                 return r;
 
         /*
···
  */
 static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 {
-        union kvm_mmu_page_role mmu_role = vcpu->arch.mmu->mmu_role.base;
+        union kvm_mmu_page_role root_role = vcpu->arch.mmu->root_role;
         int i;
         bool host_writable;
         gpa_t first_pte_gpa;
···
                 .level = 0xf,
                 .access = 0x7,
                 .quadrant = 0x3,
+                .passthrough = 0x1,
         };
 
         /*
···
          * reserved bits checks will be wrong, etc...
          */
         if (WARN_ON_ONCE(sp->role.direct ||
-                         (sp->role.word ^ mmu_role.word) & ~sync_role_ign.word))
+                         (sp->role.word ^ root_role.word) & ~sync_role_ign.word))
                 return -1;
 
         first_pte_gpa = FNAME(get_level1_sp_gpa)(sp);
+21 -26
arch/x86/kvm/mmu/spte.c
···
 #include <asm/memtype.h>
 #include <asm/vmx.h>
 
-static bool __read_mostly enable_mmio_caching = true;
+bool __read_mostly enable_mmio_caching = true;
 module_param_named(mmio_caching, enable_mmio_caching, bool, 0444);
 
 u64 __read_mostly shadow_host_writable_mask;
···
 u64 __read_mostly shadow_mmio_mask;
 u64 __read_mostly shadow_mmio_access_mask;
 u64 __read_mostly shadow_present_mask;
+u64 __read_mostly shadow_me_value;
 u64 __read_mostly shadow_me_mask;
 u64 __read_mostly shadow_acc_track_mask;
···
         else
                 pte_access &= ~ACC_WRITE_MASK;
 
-        if (!kvm_is_mmio_pfn(pfn))
-                spte |= shadow_me_mask;
+        if (shadow_me_value && !kvm_is_mmio_pfn(pfn))
+                spte |= shadow_me_value;
 
         spte |= (u64)pfn << PAGE_SHIFT;
···
         u64 spte = SPTE_MMU_PRESENT_MASK;
 
         spte |= __pa(child_pt) | shadow_present_mask | PT_WRITABLE_MASK |
-                shadow_user_mask | shadow_x_mask | shadow_me_mask;
+                shadow_user_mask | shadow_x_mask | shadow_me_value;
 
         if (ad_disabled)
                 spte |= SPTE_TDP_AD_DISABLED_MASK;
···
                 new_spte = mark_spte_for_access_track(new_spte);
 
         return new_spte;
-}
-
-static u8 kvm_get_shadow_phys_bits(void)
-{
-        /*
-         * boot_cpu_data.x86_phys_bits is reduced when MKTME or SME are detected
-         * in CPU detection code, but the processor treats those reduced bits as
-         * 'keyID' thus they are not reserved bits.  Therefore KVM needs to look at
-         * the physical address bits reported by CPUID.
-         */
-        if (likely(boot_cpu_data.extended_cpuid_level >= 0x80000008))
-                return cpuid_eax(0x80000008) & 0xff;
-
-        /*
-         * Quite weird to have VMX or SVM but not MAXPHYADDR; probably a VM with
-         * custom CPUID.  Proceed with whatever the kernel found since these features
-         * aren't virtualizable (SME/SEV also require CPUIDs higher than 0x80000008).
-         */
-        return boot_cpu_data.x86_phys_bits;
 }
 
 u64 mark_spte_for_access_track(u64 spte)
···
             WARN_ON(mmio_value && (REMOVED_SPTE & mmio_mask) == mmio_value))
                 mmio_value = 0;
 
+        if (!mmio_value)
+                enable_mmio_caching = false;
+
         shadow_mmio_value = mmio_value;
         shadow_mmio_mask = mmio_mask;
         shadow_mmio_access_mask = access_mask;
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask);
+
+void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask)
+{
+        /* shadow_me_value must be a subset of shadow_me_mask */
+        if (WARN_ON(me_value & ~me_mask))
+                me_value = me_mask = 0;
+
+        shadow_me_value = me_value;
+        shadow_me_mask = me_mask;
+}
+EXPORT_SYMBOL_GPL(kvm_mmu_set_me_spte_mask);
 
 void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only)
 {
···
         shadow_x_mask = VMX_EPT_EXECUTABLE_MASK;
         shadow_present_mask = has_exec_only ? 0ull : VMX_EPT_READABLE_MASK;
         shadow_acc_track_mask = VMX_EPT_RWX_MASK;
-        shadow_me_mask = 0ull;
-
         shadow_host_writable_mask = EPT_SPTE_HOST_WRITABLE;
         shadow_mmu_writable_mask = EPT_SPTE_MMU_WRITABLE;
···
         shadow_x_mask = 0;
         shadow_present_mask = PT_PRESENT_MASK;
         shadow_acc_track_mask = 0;
-        shadow_me_mask = sme_me_mask;
+        shadow_me_mask = 0;
+        shadow_me_value = 0;
 
         shadow_host_writable_mask = DEFAULT_SPTE_HOST_WRITABLE;
         shadow_mmu_writable_mask = DEFAULT_SPTE_MMU_WRITABLE;
+15 -1
arch/x86/kvm/mmu/spte.h
···

 #include "mmu_internal.h"

+extern bool __read_mostly enable_mmio_caching;
+
 /*
  * A MMU present SPTE is backed by actual memory and may or may not be present
  * in hardware.  E.g. MMIO SPTEs are not considered present.  Use bit 11, as it
···
 extern u64 __read_mostly shadow_mmio_mask;
 extern u64 __read_mostly shadow_mmio_access_mask;
 extern u64 __read_mostly shadow_present_mask;
+extern u64 __read_mostly shadow_me_value;
 extern u64 __read_mostly shadow_me_mask;

 /*
···
 static inline bool is_mmio_spte(u64 spte)
 {
 	return (spte & shadow_mmio_mask) == shadow_mmio_value &&
-	       likely(shadow_mmio_value);
+	       likely(enable_mmio_caching);
 }

 static inline bool is_shadow_present_pte(u64 pte)
 {
 	return !!(pte & SPTE_MMU_PRESENT_MASK);
+}
+
+/*
+ * Returns true if A/D bits are supported in hardware and are enabled by KVM.
+ * When enabled, KVM uses A/D bits for all non-nested MMUs.  Because L1 can
+ * disable A/D bits in EPTP12, SP and SPTE variants are needed to handle the
+ * scenario where KVM is using A/D bits for L1, but not L2.
+ */
+static inline bool kvm_ad_enabled(void)
+{
+	return !!shadow_accessed_mask;
 }

 static inline bool sp_ad_disabled(struct kvm_mmu_page *sp)
+4 -10
arch/x86/kvm/mmu/tdp_mmu.c
···

 hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
 {
-	union kvm_mmu_page_role role = vcpu->arch.mmu->mmu_role.base;
+	union kvm_mmu_page_role role = vcpu->arch.mmu->root_role;
 	struct kvm *kvm = vcpu->kvm;
 	struct kvm_mmu_page *root;
···

 	/* If a MMIO SPTE is installed, the MMIO will need to be emulated. */
 	if (unlikely(is_mmio_spte(new_spte))) {
+		vcpu->stat.pf_mmio_spte_created++;
 		trace_mark_mmio_spte(rcu_dereference(iter->sptep), iter->gfn,
 				     new_spte);
 		ret = RET_PF_EMULATE;
···
 		trace_kvm_mmu_set_spte(iter->level, iter->gfn,
 				       rcu_dereference(iter->sptep));
 	}
-
-	/*
-	 * Increase pf_fixed in both RET_PF_EMULATE and RET_PF_FIXED to be
-	 * consistent with legacy MMU behavior.
-	 */
-	if (ret != RET_PF_SPURIOUS)
-		vcpu->stat.pf_fixed++;

 	return ret;
 }
···
 			  struct kvm_mmu_page *sp, bool account_nx,
 			  bool shared)
 {
-	u64 spte = make_nonleaf_spte(sp->spt, !shadow_accessed_mask);
+	u64 spte = make_nonleaf_spte(sp->spt, !kvm_ad_enabled());
 	int ret = 0;

 	if (shared) {
···
 	gfn_t gfn = addr >> PAGE_SHIFT;
 	int leaf = -1;

-	*root_level = vcpu->arch.mmu->shadow_root_level;
+	*root_level = vcpu->arch.mmu->root_role.level;

 	tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
 		leaf = iter.level;
+45 -21
arch/x86/kvm/pmu.c
···
  *        * AMD:   [0 .. AMD64_NUM_COUNTERS-1] <=> gp counters
  */

+static struct kvm_pmu_ops kvm_pmu_ops __read_mostly;
+
+#define KVM_X86_PMU_OP(func)					     \
+	DEFINE_STATIC_CALL_NULL(kvm_x86_pmu_##func,		     \
+				*(((struct kvm_pmu_ops *)0)->func));
+#define KVM_X86_PMU_OP_OPTIONAL KVM_X86_PMU_OP
+#include <asm/kvm-x86-pmu-ops.h>
+
+void kvm_pmu_ops_update(const struct kvm_pmu_ops *pmu_ops)
+{
+	memcpy(&kvm_pmu_ops, pmu_ops, sizeof(kvm_pmu_ops));
+
+#define __KVM_X86_PMU_OP(func) \
+	static_call_update(kvm_x86_pmu_##func, kvm_pmu_ops.func);
+#define KVM_X86_PMU_OP(func) \
+	WARN_ON(!kvm_pmu_ops.func); __KVM_X86_PMU_OP(func)
+#define KVM_X86_PMU_OP_OPTIONAL __KVM_X86_PMU_OP
+#include <asm/kvm-x86-pmu-ops.h>
+#undef __KVM_X86_PMU_OP
+}
+
+static inline bool pmc_is_enabled(struct kvm_pmc *pmc)
+{
+	return static_call(kvm_x86_pmu_pmc_is_enabled)(pmc);
+}
+
 static void kvm_pmi_trigger_fn(struct irq_work *irq_work)
 {
 	struct kvm_pmu *pmu = container_of(irq_work, struct kvm_pmu, irq_work);
···
 		      ARCH_PERFMON_EVENTSEL_CMASK |
 		      HSW_IN_TX |
 		      HSW_IN_TX_CHECKPOINTED))) {
-		config = kvm_x86_ops.pmu_ops->pmc_perf_hw_id(pmc);
+		config = static_call(kvm_x86_pmu_pmc_perf_hw_id)(pmc);
 		if (config != PERF_COUNT_HW_MAX)
 			type = PERF_TYPE_HARDWARE;
 	}
···

 	pmc->current_config = (u64)ctrl;
 	pmc_reprogram_counter(pmc, PERF_TYPE_HARDWARE,
-			      kvm_x86_ops.pmu_ops->pmc_perf_hw_id(pmc),
+			      static_call(kvm_x86_pmu_pmc_perf_hw_id)(pmc),
 			      !(en_field & 0x2), /* exclude user */
 			      !(en_field & 0x1), /* exclude kernel */
 			      pmi);
···

 void reprogram_counter(struct kvm_pmu *pmu, int pmc_idx)
 {
-	struct kvm_pmc *pmc = kvm_x86_ops.pmu_ops->pmc_idx_to_pmc(pmu, pmc_idx);
+	struct kvm_pmc *pmc = static_call(kvm_x86_pmu_pmc_idx_to_pmc)(pmu, pmc_idx);

 	if (!pmc)
 		return;
···
 	int bit;

 	for_each_set_bit(bit, pmu->reprogram_pmi, X86_PMC_IDX_MAX) {
-		struct kvm_pmc *pmc = kvm_x86_ops.pmu_ops->pmc_idx_to_pmc(pmu, bit);
+		struct kvm_pmc *pmc = static_call(kvm_x86_pmu_pmc_idx_to_pmc)(pmu, bit);

 		if (unlikely(!pmc || !pmc->perf_event)) {
 			clear_bit(bit, pmu->reprogram_pmi);
···
 /* check if idx is a valid index to access PMU */
 bool kvm_pmu_is_valid_rdpmc_ecx(struct kvm_vcpu *vcpu, unsigned int idx)
 {
-	return kvm_x86_ops.pmu_ops->is_valid_rdpmc_ecx(vcpu, idx);
+	return static_call(kvm_x86_pmu_is_valid_rdpmc_ecx)(vcpu, idx);
 }

 bool is_vmware_backdoor_pmc(u32 pmc_idx)
···
 	if (is_vmware_backdoor_pmc(idx))
 		return kvm_pmu_rdpmc_vmware(vcpu, idx, data);

-	pmc = kvm_x86_ops.pmu_ops->rdpmc_ecx_to_pmc(vcpu, idx, &mask);
+	pmc = static_call(kvm_x86_pmu_rdpmc_ecx_to_pmc)(vcpu, idx, &mask);
 	if (!pmc)
 		return 1;
···
 void kvm_pmu_deliver_pmi(struct kvm_vcpu *vcpu)
 {
 	if (lapic_in_kernel(vcpu)) {
-		if (kvm_x86_ops.pmu_ops->deliver_pmi)
-			kvm_x86_ops.pmu_ops->deliver_pmi(vcpu);
+		static_call_cond(kvm_x86_pmu_deliver_pmi)(vcpu);
 		kvm_apic_local_deliver(vcpu->arch.apic, APIC_LVTPC);
 	}
 }

 bool kvm_pmu_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
 {
-	return kvm_x86_ops.pmu_ops->msr_idx_to_pmc(vcpu, msr) ||
-		kvm_x86_ops.pmu_ops->is_valid_msr(vcpu, msr);
+	return static_call(kvm_x86_pmu_msr_idx_to_pmc)(vcpu, msr) ||
+		static_call(kvm_x86_pmu_is_valid_msr)(vcpu, msr);
 }

 static void kvm_pmu_mark_pmc_in_use(struct kvm_vcpu *vcpu, u32 msr)
 {
 	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
-	struct kvm_pmc *pmc = kvm_x86_ops.pmu_ops->msr_idx_to_pmc(vcpu, msr);
+	struct kvm_pmc *pmc = static_call(kvm_x86_pmu_msr_idx_to_pmc)(vcpu, msr);

 	if (pmc)
 		__set_bit(pmc->idx, pmu->pmc_in_use);
···

 int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 {
-	return kvm_x86_ops.pmu_ops->get_msr(vcpu, msr_info);
+	return static_call(kvm_x86_pmu_get_msr)(vcpu, msr_info);
 }

 int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 {
 	kvm_pmu_mark_pmc_in_use(vcpu, msr_info->index);
-	return kvm_x86_ops.pmu_ops->set_msr(vcpu, msr_info);
+	return static_call(kvm_x86_pmu_set_msr)(vcpu, msr_info);
 }

 /* refresh PMU settings. This function generally is called when underlying
···
  */
 void kvm_pmu_refresh(struct kvm_vcpu *vcpu)
 {
-	kvm_x86_ops.pmu_ops->refresh(vcpu);
+	static_call(kvm_x86_pmu_refresh)(vcpu);
 }

 void kvm_pmu_reset(struct kvm_vcpu *vcpu)
···
 	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);

 	irq_work_sync(&pmu->irq_work);
-	kvm_x86_ops.pmu_ops->reset(vcpu);
+	static_call(kvm_x86_pmu_reset)(vcpu);
 }

 void kvm_pmu_init(struct kvm_vcpu *vcpu)
···
 	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);

 	memset(pmu, 0, sizeof(*pmu));
-	kvm_x86_ops.pmu_ops->init(vcpu);
+	static_call(kvm_x86_pmu_init)(vcpu);
 	init_irq_work(&pmu->irq_work, kvm_pmi_trigger_fn);
 	pmu->event_count = 0;
 	pmu->need_cleanup = false;
···
 		      pmu->pmc_in_use, X86_PMC_IDX_MAX);

 	for_each_set_bit(i, bitmask, X86_PMC_IDX_MAX) {
-		pmc = kvm_x86_ops.pmu_ops->pmc_idx_to_pmc(pmu, i);
+		pmc = static_call(kvm_x86_pmu_pmc_idx_to_pmc)(pmu, i);

 		if (pmc && pmc->perf_event && !pmc_speculative_in_use(pmc))
 			pmc_stop_counter(pmc);
 	}

-	if (kvm_x86_ops.pmu_ops->cleanup)
-		kvm_x86_ops.pmu_ops->cleanup(vcpu);
+	static_call_cond(kvm_x86_pmu_cleanup)(vcpu);

 	bitmap_zero(pmu->pmc_in_use, X86_PMC_IDX_MAX);
 }
···
 	unsigned int config;

 	pmc->eventsel &= (ARCH_PERFMON_EVENTSEL_EVENT | ARCH_PERFMON_EVENTSEL_UMASK);
-	config = kvm_x86_ops.pmu_ops->pmc_perf_hw_id(pmc);
+	config = static_call(kvm_x86_pmu_pmc_perf_hw_id)(pmc);
 	pmc->eventsel = old_eventsel;
 	return config == perf_hw_id;
 }
···
 	int i;

 	for_each_set_bit(i, pmu->all_valid_pmc_idx, X86_PMC_IDX_MAX) {
-		pmc = kvm_x86_ops.pmu_ops->pmc_idx_to_pmc(pmu, i);
+		pmc = static_call(kvm_x86_pmu_pmc_idx_to_pmc)(pmu, i);

 		if (!pmc || !pmc_is_enabled(pmc) || !pmc_speculative_in_use(pmc))
 			continue;
+2 -5
arch/x86/kvm/pmu.h
···
 	void (*cleanup)(struct kvm_vcpu *vcpu);
 };

+void kvm_pmu_ops_update(const struct kvm_pmu_ops *pmu_ops);
+
 static inline u64 pmc_bitmask(struct kvm_pmc *pmc)
 {
 	struct kvm_pmu *pmu = pmc_to_pmu(pmc);
···
 static inline bool pmc_is_fixed(struct kvm_pmc *pmc)
 {
 	return pmc->type == KVM_PMC_FIXED;
-}
-
-static inline bool pmc_is_enabled(struct kvm_pmc *pmc)
-{
-	return kvm_x86_ops.pmu_ops->pmc_is_enabled(pmc);
 }

 static inline bool kvm_valid_perf_global_ctrl(struct kvm_pmu *pmu,
+78 -6
arch/x86/kvm/svm/avic.c
···
 	return err;
 }

-void avic_init_vmcb(struct vcpu_svm *svm)
+void avic_init_vmcb(struct vcpu_svm *svm, struct vmcb *vmcb)
 {
-	struct vmcb *vmcb = svm->vmcb;
 	struct kvm_svm *kvm_svm = to_kvm_svm(svm->vcpu.kvm);
 	phys_addr_t bpa = __sme_set(page_to_phys(svm->avic_backing_page));
 	phys_addr_t lpa = __sme_set(page_to_phys(kvm_svm->avic_logical_id_table_page));
···
 	put_cpu();
 }

-static void avic_kick_target_vcpus(struct kvm *kvm, struct kvm_lapic *source,
-				   u32 icrl, u32 icrh)
+/*
+ * A fast-path version of avic_kick_target_vcpus(), which attempts to match
+ * destination APIC ID to vCPU without looping through all vCPUs.
+ */
+static int avic_kick_target_vcpus_fast(struct kvm *kvm, struct kvm_lapic *source,
+				       u32 icrl, u32 icrh, u32 index)
 {
+	u32 dest, apic_id;
 	struct kvm_vcpu *vcpu;
+	int dest_mode = icrl & APIC_DEST_MASK;
+	int shorthand = icrl & APIC_SHORT_MASK;
+	struct kvm_svm *kvm_svm = to_kvm_svm(kvm);
+	u32 *avic_logical_id_table = page_address(kvm_svm->avic_logical_id_table_page);
+
+	if (shorthand != APIC_DEST_NOSHORT)
+		return -EINVAL;
+
+	/*
+	 * The AVIC incomplete IPI #vmexit info provides index into
+	 * the physical APIC ID table, which can be used to derive
+	 * guest physical APIC ID.
+	 */
+	if (dest_mode == APIC_DEST_PHYSICAL) {
+		apic_id = index;
+	} else {
+		if (!apic_x2apic_mode(source)) {
+			/* For xAPIC logical mode, the index is for logical APIC table. */
+			apic_id = avic_logical_id_table[index] & 0x1ff;
+		} else {
+			return -EINVAL;
+		}
+	}
+
+	/*
+	 * Assuming vcpu ID is the same as physical apic ID,
+	 * and use it to retrieve the target vCPU.
+	 */
+	vcpu = kvm_get_vcpu_by_id(kvm, apic_id);
+	if (!vcpu)
+		return -EINVAL;
+
+	if (apic_x2apic_mode(vcpu->arch.apic))
+		dest = icrh;
+	else
+		dest = GET_APIC_DEST_FIELD(icrh);
+
+	/*
+	 * Try matching the destination APIC ID with the vCPU.
+	 */
+	if (kvm_apic_match_dest(vcpu, source, shorthand, dest, dest_mode)) {
+		vcpu->arch.apic->irr_pending = true;
+		svm_complete_interrupt_delivery(vcpu,
+						icrl & APIC_MODE_MASK,
+						icrl & APIC_INT_LEVELTRIG,
+						icrl & APIC_VECTOR_MASK);
+		return 0;
+	}
+
+	return -EINVAL;
+}
+
+static void avic_kick_target_vcpus(struct kvm *kvm, struct kvm_lapic *source,
+				   u32 icrl, u32 icrh, u32 index)
+{
 	unsigned long i;
+	struct kvm_vcpu *vcpu;
+
+	if (!avic_kick_target_vcpus_fast(kvm, source, icrl, icrh, index))
+		return;
+
+	trace_kvm_avic_kick_vcpu_slowpath(icrh, icrl, index);

 	/*
 	 * Wake any target vCPUs that are blocking, i.e. waiting for a wake
···
 	u32 icrh = svm->vmcb->control.exit_info_1 >> 32;
 	u32 icrl = svm->vmcb->control.exit_info_1;
 	u32 id = svm->vmcb->control.exit_info_2 >> 32;
-	u32 index = svm->vmcb->control.exit_info_2 & 0xFF;
+	u32 index = svm->vmcb->control.exit_info_2 & 0x1FF;
 	struct kvm_lapic *apic = vcpu->arch.apic;

 	trace_kvm_avic_incomplete_ipi(vcpu->vcpu_id, icrh, icrl, id, index);
···
 		 * set the appropriate IRR bits on the valid target
 		 * vcpus. So, we just need to kick the appropriate vcpu.
 		 */
-		avic_kick_target_vcpus(vcpu->kvm, apic, icrl, icrh);
+		avic_kick_target_vcpus(vcpu->kvm, apic, icrl, icrh, index);
 		break;
 	case AVIC_IPI_FAILURE_INVALID_TARGET:
 		break;
···
 	}

 	return 1;
+}
+
+unsigned long avic_vcpu_get_apicv_inhibit_reasons(struct kvm_vcpu *vcpu)
+{
+	if (is_guest_mode(vcpu))
+		return APICV_INHIBIT_REASON_NESTED;
+	return 0;
 }

 static u32 *avic_get_logical_id_entry(struct kvm_vcpu *vcpu, u32 ldr, bool flat)
+199 -107
arch/x86/kvm/svm/nested.c
···
 				       struct x86_exception *fault)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
+	struct vmcb *vmcb = svm->vmcb;

-	if (svm->vmcb->control.exit_code != SVM_EXIT_NPF) {
+	if (vmcb->control.exit_code != SVM_EXIT_NPF) {
 		/*
 		 * TODO: track the cause of the nested page fault, and
 		 * correctly fill in the high bits of exit_info_1.
 		 */
-		svm->vmcb->control.exit_code = SVM_EXIT_NPF;
-		svm->vmcb->control.exit_code_hi = 0;
-		svm->vmcb->control.exit_info_1 = (1ULL << 32);
-		svm->vmcb->control.exit_info_2 = fault->address;
+		vmcb->control.exit_code = SVM_EXIT_NPF;
+		vmcb->control.exit_code_hi = 0;
+		vmcb->control.exit_info_1 = (1ULL << 32);
+		vmcb->control.exit_info_2 = fault->address;
 	}

-	svm->vmcb->control.exit_info_1 &= ~0xffffffffULL;
-	svm->vmcb->control.exit_info_1 |= fault->error_code;
+	vmcb->control.exit_info_1 &= ~0xffffffffULL;
+	vmcb->control.exit_info_1 |= fault->error_code;

 	nested_svm_vmexit(svm);
 }

-static void svm_inject_page_fault_nested(struct kvm_vcpu *vcpu, struct x86_exception *fault)
+static bool nested_svm_handle_page_fault_workaround(struct kvm_vcpu *vcpu,
+						    struct x86_exception *fault)
 {
-	struct vcpu_svm *svm = to_svm(vcpu);
-	WARN_ON(!is_guest_mode(vcpu));
+	struct vcpu_svm *svm = to_svm(vcpu);
+	struct vmcb *vmcb = svm->vmcb;
+
+	WARN_ON(!is_guest_mode(vcpu));

 	if (vmcb12_is_intercept(&svm->nested.ctl,
 				INTERCEPT_EXCEPTION_OFFSET + PF_VECTOR) &&
-	    !svm->nested.nested_run_pending) {
-		svm->vmcb->control.exit_code = SVM_EXIT_EXCP_BASE + PF_VECTOR;
-		svm->vmcb->control.exit_code_hi = 0;
-		svm->vmcb->control.exit_info_1 = fault->error_code;
-		svm->vmcb->control.exit_info_2 = fault->address;
-		nested_svm_vmexit(svm);
-	} else {
-		kvm_inject_page_fault(vcpu, fault);
-	}
+	    !WARN_ON_ONCE(svm->nested.nested_run_pending)) {
+		vmcb->control.exit_code = SVM_EXIT_EXCP_BASE + PF_VECTOR;
+		vmcb->control.exit_code_hi = 0;
+		vmcb->control.exit_info_1 = fault->error_code;
+		vmcb->control.exit_info_2 = fault->address;
+		nested_svm_vmexit(svm);
+		return true;
+	}
+
+	return false;
 }

 static u64 nested_svm_get_tdp_pdptr(struct kvm_vcpu *vcpu, int index)
···
 	vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
 }

+static bool nested_vmcb_needs_vls_intercept(struct vcpu_svm *svm)
+{
+	if (!svm->v_vmload_vmsave_enabled)
+		return true;
+
+	if (!nested_npt_enabled(svm))
+		return true;
+
+	if (!(svm->nested.ctl.virt_ext & VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK))
+		return true;
+
+	return false;
+}
+
 void recalc_intercepts(struct vcpu_svm *svm)
 {
 	struct vmcb_control_area *c, *h;
···
 	if (!intercept_smi)
 		vmcb_clr_intercept(c, INTERCEPT_SMI);

-	vmcb_set_intercept(c, INTERCEPT_VMLOAD);
-	vmcb_set_intercept(c, INTERCEPT_VMSAVE);
+	if (nested_vmcb_needs_vls_intercept(svm)) {
+		/*
+		 * If the virtual VMLOAD/VMSAVE is not enabled for the L2,
+		 * we must intercept these instructions to correctly
+		 * emulate them in case L1 doesn't intercept them.
+		 */
+		vmcb_set_intercept(c, INTERCEPT_VMLOAD);
+		vmcb_set_intercept(c, INTERCEPT_VMSAVE);
+	} else {
+		WARN_ON(!(c->virt_ext & VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK));
+	}
 }

 /*
···
 		 */
 		mask &= ~V_IRQ_MASK;
 	}
+
+	if (nested_vgif_enabled(svm))
+		mask |= V_GIF_MASK;
+
 	svm->nested.ctl.int_ctl        &= ~mask;
 	svm->nested.ctl.int_ctl        |= svm->vmcb->control.int_ctl & mask;
 }
···
 	}

 	vmcb12->control.exit_int_info = exit_int_info;
-}
-
-static inline bool nested_npt_enabled(struct vcpu_svm *svm)
-{
-	return svm->nested.ctl.nested_ctl & SVM_NESTED_CTL_NP_ENABLE;
 }

 static void nested_svm_transition_tlb_flush(struct kvm_vcpu *vcpu)
···
 static void nested_vmcb02_prepare_save(struct vcpu_svm *svm, struct vmcb *vmcb12)
 {
 	bool new_vmcb12 = false;
+	struct vmcb *vmcb01 = svm->vmcb01.ptr;
+	struct vmcb *vmcb02 = svm->nested.vmcb02.ptr;

 	nested_vmcb02_compute_g_pat(svm);

···
 	}

 	if (unlikely(new_vmcb12 || vmcb_is_dirty(vmcb12, VMCB_SEG))) {
-		svm->vmcb->save.es = vmcb12->save.es;
-		svm->vmcb->save.cs = vmcb12->save.cs;
-		svm->vmcb->save.ss = vmcb12->save.ss;
-		svm->vmcb->save.ds = vmcb12->save.ds;
-		svm->vmcb->save.cpl = vmcb12->save.cpl;
-		vmcb_mark_dirty(svm->vmcb, VMCB_SEG);
+		vmcb02->save.es = vmcb12->save.es;
+		vmcb02->save.cs = vmcb12->save.cs;
+		vmcb02->save.ss = vmcb12->save.ss;
+		vmcb02->save.ds = vmcb12->save.ds;
+		vmcb02->save.cpl = vmcb12->save.cpl;
+		vmcb_mark_dirty(vmcb02, VMCB_SEG);
 	}

 	if (unlikely(new_vmcb12 || vmcb_is_dirty(vmcb12, VMCB_DT))) {
-		svm->vmcb->save.gdtr = vmcb12->save.gdtr;
-		svm->vmcb->save.idtr = vmcb12->save.idtr;
-		vmcb_mark_dirty(svm->vmcb, VMCB_DT);
+		vmcb02->save.gdtr = vmcb12->save.gdtr;
+		vmcb02->save.idtr = vmcb12->save.idtr;
+		vmcb_mark_dirty(vmcb02, VMCB_DT);
 	}

 	kvm_set_rflags(&svm->vcpu, vmcb12->save.rflags | X86_EFLAGS_FIXED);
···
 	kvm_rip_write(&svm->vcpu, vmcb12->save.rip);

 	/* In case we don't even reach vcpu_run, the fields are not updated */
-	svm->vmcb->save.rax = vmcb12->save.rax;
-	svm->vmcb->save.rsp = vmcb12->save.rsp;
-	svm->vmcb->save.rip = vmcb12->save.rip;
+	vmcb02->save.rax = vmcb12->save.rax;
+	vmcb02->save.rsp = vmcb12->save.rsp;
+	vmcb02->save.rip = vmcb12->save.rip;

 	/* These bits will be set properly on the first execution when new_vmc12 is true */
 	if (unlikely(new_vmcb12 || vmcb_is_dirty(vmcb12, VMCB_DR))) {
-		svm->vmcb->save.dr7 = svm->nested.save.dr7 | DR7_FIXED_1;
+		vmcb02->save.dr7 = svm->nested.save.dr7 | DR7_FIXED_1;
 		svm->vcpu.arch.dr6  = svm->nested.save.dr6 | DR6_ACTIVE_LOW;
-		vmcb_mark_dirty(svm->vmcb, VMCB_DR);
+		vmcb_mark_dirty(vmcb02, VMCB_DR);
+	}
+
+	if (unlikely(svm->lbrv_enabled && (svm->nested.ctl.virt_ext & LBR_CTL_ENABLE_MASK))) {
+		/*
+		 * Reserved bits of DEBUGCTL are ignored.  Be consistent with
+		 * svm_set_msr's definition of reserved bits.
+		 */
+		svm_copy_lbrs(vmcb02, vmcb12);
+		vmcb02->save.dbgctl &= ~DEBUGCTL_RESERVED_BITS;
+		svm_update_lbrv(&svm->vcpu);
+
+	} else if (unlikely(vmcb01->control.virt_ext & LBR_CTL_ENABLE_MASK)) {
+		svm_copy_lbrs(vmcb02, vmcb01);
 	}
 }

 static void nested_vmcb02_prepare_control(struct vcpu_svm *svm)
 {
-	const u32 int_ctl_vmcb01_bits =
-		V_INTR_MASKING_MASK | V_GIF_MASK | V_GIF_ENABLE_MASK;
-
-	const u32 int_ctl_vmcb12_bits = V_TPR_MASK | V_IRQ_INJECTION_BITS_MASK;
+	u32 int_ctl_vmcb01_bits = V_INTR_MASKING_MASK;
+	u32 int_ctl_vmcb12_bits = V_TPR_MASK | V_IRQ_INJECTION_BITS_MASK;

 	struct kvm_vcpu *vcpu = &svm->vcpu;
+	struct vmcb *vmcb01 = svm->vmcb01.ptr;
+	struct vmcb *vmcb02 = svm->nested.vmcb02.ptr;

 	/*
 	 * Filled at exit: exit_code, exit_code_hi, exit_info_1, exit_info_2,
 	 * exit_int_info, exit_int_info_err, next_rip, insn_len, insn_bytes.
 	 */

-	/*
-	 * Also covers avic_vapic_bar, avic_backing_page, avic_logical_id,
-	 * avic_physical_id.
-	 */
-	WARN_ON(kvm_apicv_activated(svm->vcpu.kvm));
+	if (svm->vgif_enabled && (svm->nested.ctl.int_ctl & V_GIF_ENABLE_MASK))
+		int_ctl_vmcb12_bits |= (V_GIF_MASK | V_GIF_ENABLE_MASK);
+	else
+		int_ctl_vmcb01_bits |= (V_GIF_MASK | V_GIF_ENABLE_MASK);

 	/* Copied from vmcb01.  msrpm_base can be overwritten later.  */
-	svm->vmcb->control.nested_ctl = svm->vmcb01.ptr->control.nested_ctl;
-	svm->vmcb->control.iopm_base_pa = svm->vmcb01.ptr->control.iopm_base_pa;
-	svm->vmcb->control.msrpm_base_pa = svm->vmcb01.ptr->control.msrpm_base_pa;
+	vmcb02->control.nested_ctl = vmcb01->control.nested_ctl;
+	vmcb02->control.iopm_base_pa = vmcb01->control.iopm_base_pa;
+	vmcb02->control.msrpm_base_pa = vmcb01->control.msrpm_base_pa;

 	/* Done at vmrun: asid.  */

 	/* Also overwritten later if necessary.  */
-	svm->vmcb->control.tlb_ctl = TLB_CONTROL_DO_NOTHING;
+	vmcb02->control.tlb_ctl = TLB_CONTROL_DO_NOTHING;

 	/* nested_cr3.  */
 	if (nested_npt_enabled(svm))
···
 			svm->nested.ctl.tsc_offset,
 			svm->tsc_ratio_msr);

-	svm->vmcb->control.tsc_offset = vcpu->arch.tsc_offset;
+	vmcb02->control.tsc_offset = vcpu->arch.tsc_offset;

 	if (svm->tsc_ratio_msr != kvm_default_tsc_scaling_ratio) {
 		WARN_ON(!svm->tsc_scaling_enabled);
 		nested_svm_update_tsc_ratio_msr(vcpu);
 	}

-	svm->vmcb->control.int_ctl =
+	vmcb02->control.int_ctl =
 		(svm->nested.ctl.int_ctl & int_ctl_vmcb12_bits) |
-		(svm->vmcb01.ptr->control.int_ctl & int_ctl_vmcb01_bits);
+		(vmcb01->control.int_ctl & int_ctl_vmcb01_bits);

-	svm->vmcb->control.int_vector = svm->nested.ctl.int_vector;
-	svm->vmcb->control.int_state = svm->nested.ctl.int_state;
-	svm->vmcb->control.event_inj = svm->nested.ctl.event_inj;
-	svm->vmcb->control.event_inj_err = svm->nested.ctl.event_inj_err;
+	vmcb02->control.int_vector = svm->nested.ctl.int_vector;
+	vmcb02->control.int_state = svm->nested.ctl.int_state;
+	vmcb02->control.event_inj = svm->nested.ctl.event_inj;
+	vmcb02->control.event_inj_err = svm->nested.ctl.event_inj_err;
+
+	vmcb02->control.virt_ext = vmcb01->control.virt_ext &
+				   LBR_CTL_ENABLE_MASK;
+	if (svm->lbrv_enabled)
+		vmcb02->control.virt_ext  |=
+			(svm->nested.ctl.virt_ext & LBR_CTL_ENABLE_MASK);
+
+	if (!nested_vmcb_needs_vls_intercept(svm))
+		vmcb02->control.virt_ext |= VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK;
+
+	if (kvm_pause_in_guest(svm->vcpu.kvm)) {
+		/* use guest values since host doesn't use them */
+		vmcb02->control.pause_filter_count =
+				svm->pause_filter_enabled ?
+				svm->nested.ctl.pause_filter_count : 0;
+
+		vmcb02->control.pause_filter_thresh =
+				svm->pause_threshold_enabled ?
+				svm->nested.ctl.pause_filter_thresh : 0;
+
+	} else if (!vmcb12_is_intercept(&svm->nested.ctl, INTERCEPT_PAUSE)) {
+		/* use host values when guest doesn't use them */
+		vmcb02->control.pause_filter_count = vmcb01->control.pause_filter_count;
+		vmcb02->control.pause_filter_thresh = vmcb01->control.pause_filter_thresh;
+	} else {
+		/*
+		 * Intercept every PAUSE otherwise and
+		 * ignore both host and guest values
+		 */
+		vmcb02->control.pause_filter_count = 0;
+		vmcb02->control.pause_filter_thresh = 0;
+	}

 	nested_svm_transition_tlb_flush(vcpu);

···
 	if (ret)
 		return ret;

-	if (!npt_enabled)
-		vcpu->arch.mmu->inject_page_fault = svm_inject_page_fault_nested;
-
 	if (!from_vmrun)
 		kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);

 	svm_set_gif(svm, true);
+
+	if (kvm_vcpu_apicv_active(vcpu))
+		kvm_make_request(KVM_REQ_APICV_UPDATE, vcpu);

 	return 0;
 }
···
 	struct vmcb *vmcb12;
 	struct kvm_host_map map;
 	u64 vmcb12_gpa;
+	struct vmcb *vmcb01 = svm->vmcb01.ptr;

 	if (!svm->nested.hsave_msr) {
 		kvm_inject_gp(vcpu, 0);
···
 	 * Since vmcb01 is not in use, we can use it to store some of the L1
 	 * state.
 	 */
-	svm->vmcb01.ptr->save.efer   = vcpu->arch.efer;
-	svm->vmcb01.ptr->save.cr0    = kvm_read_cr0(vcpu);
-	svm->vmcb01.ptr->save.cr4    = vcpu->arch.cr4;
-	svm->vmcb01.ptr->save.rflags = kvm_get_rflags(vcpu);
-	svm->vmcb01.ptr->save.rip    = kvm_rip_read(vcpu);
+	vmcb01->save.efer   = vcpu->arch.efer;
+	vmcb01->save.cr0    = kvm_read_cr0(vcpu);
+	vmcb01->save.cr4    = vcpu->arch.cr4;
+	vmcb01->save.rflags = kvm_get_rflags(vcpu);
+	vmcb01->save.rip    = kvm_rip_read(vcpu);

 	if (!npt_enabled)
-		svm->vmcb01.ptr->save.cr3 = kvm_read_cr3(vcpu);
+		vmcb01->save.cr3 = kvm_read_cr3(vcpu);

 	svm->nested.nested_run_pending = 1;

···
 int nested_svm_vmexit(struct vcpu_svm *svm)
 {
 	struct kvm_vcpu *vcpu = &svm->vcpu;
+	struct vmcb *vmcb01 = svm->vmcb01.ptr;
+	struct vmcb *vmcb02 = svm->nested.vmcb02.ptr;
 	struct vmcb *vmcb12;
-	struct vmcb *vmcb = svm->vmcb;
 	struct kvm_host_map map;
 	int rc;
-
-	/* Triple faults in L2 should never escape. */
-	WARN_ON_ONCE(kvm_check_request(KVM_REQ_TRIPLE_FAULT, vcpu));

 	rc = kvm_vcpu_map(vcpu, gpa_to_gfn(svm->nested.vmcb12_gpa), &map);
 	if (rc) {
···

 	/* Give the current vmcb to the guest */

-	vmcb12->save.es     = vmcb->save.es;
-	vmcb12->save.cs     = vmcb->save.cs;
-	vmcb12->save.ss     = vmcb->save.ss;
-	vmcb12->save.ds     = vmcb->save.ds;
-	vmcb12->save.gdtr   = vmcb->save.gdtr;
-	vmcb12->save.idtr   = vmcb->save.idtr;
+	vmcb12->save.es     = vmcb02->save.es;
+	vmcb12->save.cs     = vmcb02->save.cs;
+	vmcb12->save.ss     = vmcb02->save.ss;
+	vmcb12->save.ds     = vmcb02->save.ds;
+	vmcb12->save.gdtr   = vmcb02->save.gdtr;
+	vmcb12->save.idtr   = vmcb02->save.idtr;
 	vmcb12->save.efer   = svm->vcpu.arch.efer;
 	vmcb12->save.cr0    = kvm_read_cr0(vcpu);
 	vmcb12->save.cr3    = kvm_read_cr3(vcpu);
-	vmcb12->save.cr2    = vmcb->save.cr2;
+	vmcb12->save.cr2    = vmcb02->save.cr2;
 	vmcb12->save.cr4    = svm->vcpu.arch.cr4;
 	vmcb12->save.rflags = kvm_get_rflags(vcpu);
 	vmcb12->save.rip    = kvm_rip_read(vcpu);
 	vmcb12->save.rsp    = kvm_rsp_read(vcpu);
 	vmcb12->save.rax    = kvm_rax_read(vcpu);
-	vmcb12->save.dr7    = vmcb->save.dr7;
+	vmcb12->save.dr7    = vmcb02->save.dr7;
 	vmcb12->save.dr6    = svm->vcpu.arch.dr6;
-	vmcb12->save.cpl    = vmcb->save.cpl;
+	vmcb12->save.cpl    = vmcb02->save.cpl;

-	vmcb12->control.int_state         = vmcb->control.int_state;
-	vmcb12->control.exit_code         = vmcb->control.exit_code;
-	vmcb12->control.exit_code_hi      = vmcb->control.exit_code_hi;
-	vmcb12->control.exit_info_1       = vmcb->control.exit_info_1;
-	vmcb12->control.exit_info_2       = vmcb->control.exit_info_2;
+	vmcb12->control.int_state         = vmcb02->control.int_state;
+	vmcb12->control.exit_code         = vmcb02->control.exit_code;
+	vmcb12->control.exit_code_hi      = vmcb02->control.exit_code_hi;
+	vmcb12->control.exit_info_1       = vmcb02->control.exit_info_1;
+	vmcb12->control.exit_info_2       = vmcb02->control.exit_info_2;

 	if (vmcb12->control.exit_code != SVM_EXIT_ERR)
 		nested_save_pending_event_to_vmcb12(svm, vmcb12);

 	if (svm->nrips_enabled)
-		vmcb12->control.next_rip  = vmcb->control.next_rip;
+		vmcb12->control.next_rip  = vmcb02->control.next_rip;

 	vmcb12->control.int_ctl           = svm->nested.ctl.int_ctl;
 	vmcb12->control.tlb_ctl           = svm->nested.ctl.tlb_ctl;
 	vmcb12->control.event_inj         = svm->nested.ctl.event_inj;
 	vmcb12->control.event_inj_err     = svm->nested.ctl.event_inj_err;

+	if (!kvm_pause_in_guest(vcpu->kvm) && vmcb02->control.pause_filter_count)
+		vmcb01->control.pause_filter_count = vmcb02->control.pause_filter_count;
+
 	nested_svm_copy_common_state(svm->nested.vmcb02.ptr, svm->vmcb01.ptr);

 	svm_switch_vmcb(svm, &svm->vmcb01);
+
+	if (unlikely(svm->lbrv_enabled && (svm->nested.ctl.virt_ext & LBR_CTL_ENABLE_MASK))) {
+		svm_copy_lbrs(vmcb12, vmcb02);
+		svm_update_lbrv(vcpu);
+	} else if (unlikely(vmcb01->control.virt_ext & LBR_CTL_ENABLE_MASK)) {
+		svm_copy_lbrs(vmcb01, vmcb02);
+		svm_update_lbrv(vcpu);
+	}

 	/*
 	 * On vmexit the GIF is set to false and
 	 * no event can be injected in L1.
 	 */
 	svm_set_gif(svm, false);
-	svm->vmcb->control.exit_int_info = 0;
+	vmcb01->control.exit_int_info = 0;

 	svm->vcpu.arch.tsc_offset = svm->vcpu.arch.l1_tsc_offset;
-	if (svm->vmcb->control.tsc_offset != svm->vcpu.arch.tsc_offset) {
-		svm->vmcb->control.tsc_offset = svm->vcpu.arch.tsc_offset;
-		vmcb_mark_dirty(svm->vmcb, VMCB_INTERCEPTS);
+	if (vmcb01->control.tsc_offset != svm->vcpu.arch.tsc_offset) {
+		vmcb01->control.tsc_offset = svm->vcpu.arch.tsc_offset;
+		vmcb_mark_dirty(vmcb01, VMCB_INTERCEPTS);
 	}

 	if (svm->tsc_ratio_msr != kvm_default_tsc_scaling_ratio) {
···
 	/*
 	 * Restore processor state that had been saved in vmcb01
 	 */
-	kvm_set_rflags(vcpu, svm->vmcb->save.rflags);
-	svm_set_efer(vcpu, svm->vmcb->save.efer);
-	svm_set_cr0(vcpu, svm->vmcb->save.cr0 | X86_CR0_PE);
-	svm_set_cr4(vcpu, svm->vmcb->save.cr4);
-	kvm_rax_write(vcpu, svm->vmcb->save.rax);
-	kvm_rsp_write(vcpu, svm->vmcb->save.rsp);
-	kvm_rip_write(vcpu, svm->vmcb->save.rip);
+	kvm_set_rflags(vcpu, vmcb01->save.rflags);
+	svm_set_efer(vcpu, vmcb01->save.efer);
+	svm_set_cr0(vcpu, vmcb01->save.cr0 | X86_CR0_PE);
+	svm_set_cr4(vcpu, vmcb01->save.cr4);
+	kvm_rax_write(vcpu, vmcb01->save.rax);
+	kvm_rsp_write(vcpu, vmcb01->save.rsp);
+	kvm_rip_write(vcpu, vmcb01->save.rip);

 	svm->vcpu.arch.dr7 = DR7_FIXED_1;
 	kvm_update_dr7(&svm->vcpu);
···

 	nested_svm_uninit_mmu_context(vcpu);

-	rc = nested_svm_load_cr3(vcpu, svm->vmcb->save.cr3, false, true);
+	rc = nested_svm_load_cr3(vcpu, vmcb01->save.cr3, false, true);
 	if (rc)
 		return 1;

···
 	 * right now so that it an be accounted for before we execute
 	 * L1's next instruction.
 	 */
-	if (unlikely(svm->vmcb->save.rflags & X86_EFLAGS_TF))
+	if (unlikely(vmcb01->save.rflags & X86_EFLAGS_TF))
 		kvm_queue_exception(&(svm->vcpu), DB_VECTOR);
+
+	/*
+	 * Un-inhibit the AVIC right away, so that other vCPUs can start
+	 * to benefit from it right away.
+	 */
+	if (kvm_apicv_activated(vcpu->kvm))
+		kvm_vcpu_update_apicv(vcpu);

 	return 0;
 }
···
 static void nested_svm_inject_exception_vmexit(struct vcpu_svm *svm)
 {
 	unsigned int nr = svm->vcpu.arch.exception.nr;
+	struct vmcb *vmcb = svm->vmcb;

-	svm->vmcb->control.exit_code = SVM_EXIT_EXCP_BASE + nr;
-	svm->vmcb->control.exit_code_hi = 0;
+	vmcb->control.exit_code = SVM_EXIT_EXCP_BASE + nr;
+	vmcb->control.exit_code_hi = 0;

 	if (svm->vcpu.arch.exception.has_error_code)
-		svm->vmcb->control.exit_info_1 = svm->vcpu.arch.exception.error_code;
+		vmcb->control.exit_info_1 = svm->vcpu.arch.exception.error_code;

 	/*
 	 * EXITINFO2 is undefined for all exception intercepts other
···
 	 */
 	if (nr == PF_VECTOR) {
 		if (svm->vcpu.arch.exception.nested_apf)
-			svm->vmcb->control.exit_info_2 = svm->vcpu.arch.apf.nested_apf_token;
+			vmcb->control.exit_info_2 = svm->vcpu.arch.apf.nested_apf_token;
 		else if (svm->vcpu.arch.exception.has_payload)
-			svm->vmcb->control.exit_info_2 = svm->vcpu.arch.exception.payload;
+			vmcb->control.exit_info_2 = svm->vcpu.arch.exception.payload;
 		else
-			svm->vmcb->control.exit_info_2 = svm->vcpu.arch.cr2;
+			vmcb->control.exit_info_2 = svm->vcpu.arch.cr2;
 	} else if (nr == DB_VECTOR) {
 		/* See inject_pending_event.
*/ 1276 1185 kvm_deliver_exception_payload(&svm->vcpu); ··· 1658 1567 struct kvm_x86_nested_ops svm_nested_ops = { 1659 1568 .leave_nested = svm_leave_nested, 1660 1569 .check_events = svm_check_nested_events, 1570 + .handle_page_fault_workaround = nested_svm_handle_page_fault_workaround, 1661 1571 .triple_fault = nested_svm_triple_fault, 1662 1572 .get_nested_state_pages = svm_get_nested_state_pages, 1663 1573 .get_state = svm_get_nested_state,
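The hunks above replace the ambiguous `vmcb` pointer with explicit `vmcb01`/`vmcb02` locals, making the direction of each copy on nested #VMEXIT obvious. A minimal userspace sketch of that copy direction follows; the struct layouts are simplified stand-ins for the kernel's real VMCB definitions in `arch/x86/include/asm/svm.h`, not the actual structures.

```c
#include <assert.h>
#include <string.h>

/* Simplified stand-ins for the kernel's VMCB save/control areas. */
struct save_area    { unsigned long cr2, dr7; unsigned char cpl; };
struct control_area { unsigned int exit_code, exit_info_1; };
struct vmcb         { struct save_area save; struct control_area control; };

/*
 * On nested #VMEXIT, guest-visible exit state flows from vmcb02 (the VMCB
 * L0 actually ran L2 with) into vmcb12 (the guest-owned VMCB image).
 * vmcb01 holds L1's own state and is never the source here -- which is
 * the point of spelling out vmcb02 in the diff above.
 */
static void save_to_vmcb12(struct vmcb *vmcb12, const struct vmcb *vmcb02)
{
    vmcb12->save.cr2 = vmcb02->save.cr2;
    vmcb12->save.dr7 = vmcb02->save.dr7;
    vmcb12->save.cpl = vmcb02->save.cpl;
    vmcb12->control.exit_code   = vmcb02->control.exit_code;
    vmcb12->control.exit_info_1 = vmcb02->control.exit_info_1;
}
```

The real function also folds in state cached in `svm->nested.ctl` and vCPU registers; this sketch only shows the vmcb02-to-vmcb12 leg.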
arch/x86/kvm/svm/pmu.c (+1 -1)

···
     }
 }
 
-struct kvm_pmu_ops amd_pmu_ops = {
+struct kvm_pmu_ops amd_pmu_ops __initdata = {
     .pmc_perf_hw_id = amd_pmc_perf_hw_id,
     .pmc_is_enabled = amd_pmc_is_enabled,
     .pmc_idx_to_pmc = amd_pmc_idx_to_pmc,
arch/x86/kvm/svm/sev.c (+20 -8)

···
     if (params.len > SEV_FW_BLOB_MAX_SIZE)
         return -EINVAL;
 
-    blob = kmalloc(params.len, GFP_KERNEL_ACCOUNT);
+    blob = kzalloc(params.len, GFP_KERNEL_ACCOUNT);
     if (!blob)
         return -ENOMEM;
 
···
     if (!IS_ALIGNED(dst_paddr, 16) ||
         !IS_ALIGNED(paddr, 16) ||
         !IS_ALIGNED(size, 16)) {
-        tpage = (void *)alloc_page(GFP_KERNEL);
+        tpage = (void *)alloc_page(GFP_KERNEL | __GFP_ZERO);
         if (!tpage)
             return -ENOMEM;
 
···
     if (params.len > SEV_FW_BLOB_MAX_SIZE)
         return -EINVAL;
 
-    blob = kmalloc(params.len, GFP_KERNEL_ACCOUNT);
+    blob = kzalloc(params.len, GFP_KERNEL_ACCOUNT);
     if (!blob)
         return -ENOMEM;
 
···
         return -EINVAL;
 
     /* allocate the memory to hold the session data blob */
-    session_data = kmalloc(params.session_len, GFP_KERNEL_ACCOUNT);
+    session_data = kzalloc(params.session_len, GFP_KERNEL_ACCOUNT);
     if (!session_data)
         return -ENOMEM;
 
···
 
     /* allocate memory for header and transport buffer */
     ret = -ENOMEM;
-    hdr = kmalloc(params.hdr_len, GFP_KERNEL_ACCOUNT);
+    hdr = kzalloc(params.hdr_len, GFP_KERNEL_ACCOUNT);
     if (!hdr)
         goto e_unpin;
 
-    trans_data = kmalloc(params.trans_len, GFP_KERNEL_ACCOUNT);
+    trans_data = kzalloc(params.trans_len, GFP_KERNEL_ACCOUNT);
     if (!trans_data)
         goto e_free_hdr;
 
···
         pr_info("SEV-ES guest requested termination: %#llx:%#llx\n",
             reason_set, reason_code);
 
-        ret = -EINVAL;
-        break;
+        vcpu->run->exit_reason = KVM_EXIT_SYSTEM_EVENT;
+        vcpu->run->system_event.type = KVM_SYSTEM_EVENT_SEV_TERM;
+        vcpu->run->system_event.ndata = 1;
+        vcpu->run->system_event.data[0] = control->ghcb_gpa;
+
+        return 0;
     }
     default:
         /* Error, keep GHCB MSR value as-is */
···
     set_msr_interception(vcpu, svm->msrpm, MSR_IA32_LASTBRANCHTOIP, 1, 1);
     set_msr_interception(vcpu, svm->msrpm, MSR_IA32_LASTINTFROMIP, 1, 1);
     set_msr_interception(vcpu, svm->msrpm, MSR_IA32_LASTINTTOIP, 1, 1);
+
+    if (boot_cpu_has(X86_FEATURE_V_TSC_AUX) &&
+        (guest_cpuid_has(&svm->vcpu, X86_FEATURE_RDTSCP) ||
+         guest_cpuid_has(&svm->vcpu, X86_FEATURE_RDPID))) {
+        set_msr_interception(vcpu, svm->msrpm, MSR_TSC_AUX, 1, 1);
+        if (guest_cpuid_has(&svm->vcpu, X86_FEATURE_RDTSCP))
+            svm_clr_intercept(svm, INTERCEPT_RDTSCP);
+    }
 }
 
 void sev_es_vcpu_reset(struct vcpu_svm *svm)
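The `kmalloc` to `kzalloc` hunks above harden buffers that are later copied to userspace: the SEV firmware may write fewer bytes than the caller-supplied length, and a zeroed allocation guarantees the unwritten tail cannot leak stale heap contents. A userspace analogy (not kernel code; `alloc_blob` is a hypothetical helper) sketches the same idea with `calloc`:

```c
#include <stdlib.h>
#include <string.h>

/*
 * calloc plays the role of kzalloc: if the "firmware" fills only part of
 * the buffer, the remainder the caller sees is guaranteed zero, never
 * whatever happened to be on the heap.
 */
static unsigned char *alloc_blob(size_t len, size_t firmware_written)
{
    unsigned char *blob = calloc(1, len);   /* kzalloc analogue */

    if (blob && firmware_written <= len)
        memset(blob, 0xAA, firmware_written); /* model a partial firmware write */
    return blob;
}
```

With plain `malloc` (the `kmalloc` analogue) the bytes past `firmware_written` would be indeterminate, which is exactly the information leak the diff closes.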
arch/x86/kvm/svm/svm.c (+164 -51)

···
 #define SEG_TYPE_LDT 2
 #define SEG_TYPE_BUSY_TSS16 3
 
-#define DEBUGCTL_RESERVED_BITS (~(0x3fULL))
-
 static bool erratum_383_found __read_mostly;
 
 u32 msrpm_offsets[MSRPM_OFFSETS] __read_mostly;
···
     { .index = MSR_EFER, .always = false },
     { .index = MSR_IA32_CR_PAT, .always = false },
     { .index = MSR_AMD64_SEV_ES_GHCB, .always = true },
+    { .index = MSR_TSC_AUX, .always = false },
     { .index = MSR_INVALID, .always = false },
 };
 
···
 module_param(vls, int, 0444);
 
 /* enable/disable Virtual GIF */
-static int vgif = true;
+int vgif = true;
 module_param(vgif, int, 0444);
 
 /* enable/disable LBR virtualization */
···
  */
 static bool avic;
 module_param(avic, bool, 0444);
+
+static bool force_avic;
+module_param_unsafe(force_avic, bool, 0444);
 
 bool __read_mostly dump_invalid_vmcb;
 module_param(dump_invalid_vmcb, bool, 0644);
···
     }
 }
 
+void svm_copy_lbrs(struct vmcb *to_vmcb, struct vmcb *from_vmcb)
+{
+    to_vmcb->save.dbgctl = from_vmcb->save.dbgctl;
+    to_vmcb->save.br_from = from_vmcb->save.br_from;
+    to_vmcb->save.br_to = from_vmcb->save.br_to;
+    to_vmcb->save.last_excp_from = from_vmcb->save.last_excp_from;
+    to_vmcb->save.last_excp_to = from_vmcb->save.last_excp_to;
+
+    vmcb_mark_dirty(to_vmcb, VMCB_LBR);
+}
+
 static void svm_enable_lbrv(struct kvm_vcpu *vcpu)
 {
     struct vcpu_svm *svm = to_svm(vcpu);
···
     set_msr_interception(vcpu, svm->msrpm, MSR_IA32_LASTBRANCHTOIP, 1, 1);
     set_msr_interception(vcpu, svm->msrpm, MSR_IA32_LASTINTFROMIP, 1, 1);
     set_msr_interception(vcpu, svm->msrpm, MSR_IA32_LASTINTTOIP, 1, 1);
+
+    /* Move the LBR msrs to the vmcb02 so that the guest can see them. */
+    if (is_guest_mode(vcpu))
+        svm_copy_lbrs(svm->vmcb, svm->vmcb01.ptr);
 }
 
 static void svm_disable_lbrv(struct kvm_vcpu *vcpu)
···
     set_msr_interception(vcpu, svm->msrpm, MSR_IA32_LASTBRANCHTOIP, 0, 0);
     set_msr_interception(vcpu, svm->msrpm, MSR_IA32_LASTINTFROMIP, 0, 0);
     set_msr_interception(vcpu, svm->msrpm, MSR_IA32_LASTINTTOIP, 0, 0);
+
+    /*
+     * Move the LBR msrs back to the vmcb01 to avoid copying them
+     * on nested guest entries.
+     */
+    if (is_guest_mode(vcpu))
+        svm_copy_lbrs(svm->vmcb01.ptr, svm->vmcb);
+}
+
+static int svm_get_lbr_msr(struct vcpu_svm *svm, u32 index)
+{
+    /*
+     * If the LBR virtualization is disabled, the LBR msrs are always
+     * kept in the vmcb01 to avoid copying them on nested guest entries.
+     *
+     * If nested, and the LBR virtualization is enabled/disabled, the msrs
+     * are moved between the vmcb01 and vmcb02 as needed.
+     */
+    struct vmcb *vmcb =
+        (svm->vmcb->control.virt_ext & LBR_CTL_ENABLE_MASK) ?
+            svm->vmcb : svm->vmcb01.ptr;
+
+    switch (index) {
+    case MSR_IA32_DEBUGCTLMSR:
+        return vmcb->save.dbgctl;
+    case MSR_IA32_LASTBRANCHFROMIP:
+        return vmcb->save.br_from;
+    case MSR_IA32_LASTBRANCHTOIP:
+        return vmcb->save.br_to;
+    case MSR_IA32_LASTINTFROMIP:
+        return vmcb->save.last_excp_from;
+    case MSR_IA32_LASTINTTOIP:
+        return vmcb->save.last_excp_to;
+    default:
+        KVM_BUG(false, svm->vcpu.kvm,
+            "%s: Unknown MSR 0x%x", __func__, index);
+        return 0;
+    }
+}
+
+void svm_update_lbrv(struct kvm_vcpu *vcpu)
+{
+    struct vcpu_svm *svm = to_svm(vcpu);
+
+    bool enable_lbrv = svm_get_lbr_msr(svm, MSR_IA32_DEBUGCTLMSR) &
+                DEBUGCTLMSR_LBR;
+
+    bool current_enable_lbrv = !!(svm->vmcb->control.virt_ext &
+                      LBR_CTL_ENABLE_MASK);
+
+    if (unlikely(is_guest_mode(vcpu) && svm->lbrv_enabled))
+        if (unlikely(svm->nested.ctl.virt_ext & LBR_CTL_ENABLE_MASK))
+            enable_lbrv = true;
+
+    if (enable_lbrv == current_enable_lbrv)
+        return;
+
+    if (enable_lbrv)
+        svm_enable_lbrv(vcpu);
+    else
+        svm_disable_lbrv(vcpu);
 }
 
 void disable_nmi_singlestep(struct vcpu_svm *svm)
···
     struct vmcb_control_area *control = &svm->vmcb->control;
     int old = control->pause_filter_count;
 
+    if (kvm_pause_in_guest(vcpu->kvm) || !old)
+        return;
+
     control->pause_filter_count = __grow_ple_window(old,
                         pause_filter_count,
                         pause_filter_count_grow,
···
     struct vcpu_svm *svm = to_svm(vcpu);
     struct vmcb_control_area *control = &svm->vmcb->control;
     int old = control->pause_filter_count;
+
+    if (kvm_pause_in_guest(vcpu->kvm) || !old)
+        return;
 
     control->pause_filter_count =
             __shrink_ple_window(old,
···
 
         set_msr_interception(vcpu, svm->msrpm, MSR_IA32_SYSENTER_EIP, 0, 0);
         set_msr_interception(vcpu, svm->msrpm, MSR_IA32_SYSENTER_ESP, 0, 0);
+
+        svm->v_vmload_vmsave_enabled = false;
     } else {
         /*
          * If hardware supports Virtual VMLOAD VMSAVE then enable it
···
 
 static void init_vmcb(struct kvm_vcpu *vcpu)
 {
     struct vcpu_svm *svm = to_svm(vcpu);
-    struct vmcb_control_area *control = &svm->vmcb->control;
-    struct vmcb_save_area *save = &svm->vmcb->save;
+    struct vmcb *vmcb = svm->vmcb01.ptr;
+    struct vmcb_control_area *control = &vmcb->control;
+    struct vmcb_save_area *save = &vmcb->save;
 
     svm_set_intercept(svm, INTERCEPT_CR0_READ);
     svm_set_intercept(svm, INTERCEPT_CR3_READ);
···
         set_msr_interception(vcpu, svm->msrpm, MSR_IA32_SPEC_CTRL, 1, 1);
 
     if (kvm_vcpu_apicv_active(vcpu))
-        avic_init_vmcb(svm);
+        avic_init_vmcb(svm, vmcb);
 
     if (vgif) {
         svm_clr_intercept(svm, INTERCEPT_STGI);
···
         }
     }
 
-    svm_hv_init_vmcb(svm->vmcb);
+    svm_hv_init_vmcb(vmcb);
     init_vmcb_after_set_cpuid(vcpu);
 
-    vmcb_mark_all_dirty(svm->vmcb);
+    vmcb_mark_all_dirty(vmcb);
 
     enable_gif(svm);
 }
···
     /*
      * The following fields are ignored when AVIC is enabled
      */
-    WARN_ON(kvm_apicv_activated(svm->vcpu.kvm));
+    WARN_ON(kvm_vcpu_apicv_activated(&svm->vcpu));
 
     svm_set_intercept(svm, INTERCEPT_VINTR);
 
···
          * Likewise, clear the VINTR intercept, we will set it
          * again while processing KVM_REQ_EVENT if needed.
          */
-        if (vgif_enabled(svm))
+        if (vgif)
             svm_clr_intercept(svm, INTERCEPT_STGI);
         if (svm_is_intercept(svm, INTERCEPT_VINTR))
             svm_clear_vintr(svm);
···
          * in use, we still rely on the VINTR intercept (rather than
          * STGI) to detect an open interrupt window.
          */
-        if (!vgif_enabled(svm))
+        if (!vgif)
             svm_clear_vintr(svm);
     }
 }
···
     case MSR_TSC_AUX:
         msr_info->data = svm->tsc_aux;
         break;
-    /*
-     * Nobody will change the following 5 values in the VMCB so we can
-     * safely return them on rdmsr. They will always be 0 until LBRV is
-     * implemented.
-     */
     case MSR_IA32_DEBUGCTLMSR:
-        msr_info->data = svm->vmcb->save.dbgctl;
-        break;
     case MSR_IA32_LASTBRANCHFROMIP:
-        msr_info->data = svm->vmcb->save.br_from;
-        break;
     case MSR_IA32_LASTBRANCHTOIP:
-        msr_info->data = svm->vmcb->save.br_to;
-        break;
     case MSR_IA32_LASTINTFROMIP:
-        msr_info->data = svm->vmcb->save.last_excp_from;
-        break;
     case MSR_IA32_LASTINTTOIP:
-        msr_info->data = svm->vmcb->save.last_excp_to;
+        msr_info->data = svm_get_lbr_msr(svm, msr_info->index);
         break;
     case MSR_VM_HSAVE_PA:
         msr_info->data = svm->nested.hsave_msr;
···
         if (data & DEBUGCTL_RESERVED_BITS)
             return 1;
 
-        svm->vmcb->save.dbgctl = data;
-        vmcb_mark_dirty(svm->vmcb, VMCB_LBR);
-        if (data & (1ULL<<0))
-            svm_enable_lbrv(vcpu);
+        if (svm->vmcb->control.virt_ext & LBR_CTL_ENABLE_MASK)
+            svm->vmcb->save.dbgctl = data;
         else
-            svm_disable_lbrv(vcpu);
+            svm->vmcb01.ptr->save.dbgctl = data;
+
+        svm_update_lbrv(vcpu);
+
         break;
     case MSR_VM_HSAVE_PA:
         /*
···
     svm_clear_vintr(to_svm(vcpu));
 
     /*
-     * For AVIC, the only reason to end up here is ExtINTs.
+     * If not running nested, for AVIC, the only reason to end up here is ExtINTs.
      * In this case AVIC was temporarily disabled for
      * requesting the IRQ window and we have to re-enable it.
+     *
+     * If running nested, still remove the VM wide AVIC inhibit to
+     * support the case in which the interrupt window was requested when
+     * the vCPU was not running nested.
+     *
+     * All vCPUs which still run nested will remain to have their
+     * AVIC still inhibited due to per-cpu AVIC inhibition.
      */
     kvm_clear_apicv_inhibit(vcpu->kvm, APICV_INHIBIT_REASON_IRQWIN);
 
···
 static int pause_interception(struct kvm_vcpu *vcpu)
 {
     bool in_kernel;
-
     /*
      * CPL is not made available for an SEV-ES guest, therefore
···
      */
     in_kernel = !sev_es_guest(vcpu->kvm) && svm_get_cpl(vcpu) == 0;
 
-    if (!kvm_pause_in_guest(vcpu->kvm))
-        grow_ple_window(vcpu);
+    grow_ple_window(vcpu);
 
     kvm_vcpu_on_spin(vcpu, in_kernel);
     return kvm_skip_emulated_instruction(vcpu);
···
      * enabled, the STGI interception will not occur. Enable the irq
      * window under the assumption that the hardware will set the GIF.
      */
-    if (vgif_enabled(svm) || gif_set(svm)) {
+    if (vgif || gif_set(svm)) {
         /*
          * IRQ window is not needed when AVIC is enabled,
          * unless we have pending ExtINT since it cannot be injected
-         * via AVIC. In such case, we need to temporarily disable AVIC,
+         * via AVIC. In such case, KVM needs to temporarily disable AVIC,
          * and fallback to injecting IRQ via V_IRQ.
+         *
+         * If running nested, AVIC is already locally inhibited
+         * on this vCPU, therefore there is no need to request
+         * the VM wide AVIC inhibition.
          */
-        kvm_set_apicv_inhibit(vcpu->kvm, APICV_INHIBIT_REASON_IRQWIN);
+        if (!is_guest_mode(vcpu))
+            kvm_set_apicv_inhibit(vcpu->kvm, APICV_INHIBIT_REASON_IRQWIN);
+
         svm_set_vintr(svm);
     }
 }
···
         return; /* IRET will cause a vm exit */
 
     if (!gif_set(svm)) {
-        if (vgif_enabled(svm))
+        if (vgif)
             svm_set_intercept(svm, INTERCEPT_STGI);
         return; /* STGI will cause a vm exit */
     }
···
         hv_track_root_tdp(vcpu, root_hpa);
 
         cr3 = vcpu->arch.cr3;
-    } else if (vcpu->arch.mmu->shadow_root_level >= PT64_ROOT_4LEVEL) {
+    } else if (vcpu->arch.mmu->root_role.level >= PT64_ROOT_4LEVEL) {
         cr3 = __sme_set(root_hpa) | kvm_get_active_pcid(vcpu);
     } else {
         /* PCID in the guest should be impossible with a 32-bit MMU. */
···
                 guest_cpuid_has(vcpu, X86_FEATURE_NRIPS);
 
     svm->tsc_scaling_enabled = tsc_scaling && guest_cpuid_has(vcpu, X86_FEATURE_TSCRATEMSR);
+    svm->lbrv_enabled = lbrv && guest_cpuid_has(vcpu, X86_FEATURE_LBRV);
+
+    svm->v_vmload_vmsave_enabled = vls && guest_cpuid_has(vcpu, X86_FEATURE_V_VMSAVE_VMLOAD);
+
+    svm->pause_filter_enabled = kvm_cpu_cap_has(X86_FEATURE_PAUSEFILTER) &&
+            guest_cpuid_has(vcpu, X86_FEATURE_PAUSEFILTER);
+
+    svm->pause_threshold_enabled = kvm_cpu_cap_has(X86_FEATURE_PFTHRESHOLD) &&
+            guest_cpuid_has(vcpu, X86_FEATURE_PFTHRESHOLD);
+
+    svm->vgif_enabled = vgif && guest_cpuid_has(vcpu, X86_FEATURE_VGIF);
 
     svm_recalc_instruction_intercepts(vcpu, svm);
···
      */
     if (guest_cpuid_has(vcpu, X86_FEATURE_X2APIC))
         kvm_set_apicv_inhibit(kvm, APICV_INHIBIT_REASON_X2APIC);
-
-    /*
-     * Currently, AVIC does not work with nested virtualization.
-     * So, we disable AVIC when cpuid for SVM is set in the L1 guest.
-     */
-    if (nested && guest_cpuid_has(vcpu, X86_FEATURE_SVM))
-        kvm_set_apicv_inhibit(kvm, APICV_INHIBIT_REASON_NESTED);
     }
     init_vmcb_after_set_cpuid(vcpu);
 }
···
     svm->vmcb->save.rsp = vcpu->arch.regs[VCPU_REGS_RSP];
     svm->vmcb->save.rip = vcpu->arch.regs[VCPU_REGS_RIP];
 
-    ret = nested_svm_vmexit(svm);
+    ret = nested_svm_simple_vmexit(svm, SVM_EXIT_SW);
     if (ret)
         return ret;
···
     struct vcpu_svm *svm = to_svm(vcpu);
 
     if (!gif_set(svm)) {
-        if (vgif_enabled(svm))
+        if (vgif)
             svm_set_intercept(svm, INTERCEPT_STGI);
         /* STGI will cause a vm exit */
     } else {
···
 
     .sched_in = svm_sched_in,
 
-    .pmu_ops = &amd_pmu_ops,
     .nested_ops = &svm_nested_ops,
 
     .deliver_interrupt = svm_deliver_interrupt,
···
     .complete_emulated_msr = svm_complete_emulated_msr,
 
     .vcpu_deliver_sipi_vector = svm_vcpu_deliver_sipi_vector,
+    .vcpu_get_apicv_inhibit_reasons = avic_vcpu_get_apicv_inhibit_reasons,
 };
 
 /*
···
 
     if (tsc_scaling)
         kvm_cpu_cap_set(X86_FEATURE_TSCRATEMSR);
+
+    if (vls)
+        kvm_cpu_cap_set(X86_FEATURE_V_VMSAVE_VMLOAD);
+    if (lbrv)
+        kvm_cpu_cap_set(X86_FEATURE_LBRV);
+
+    if (boot_cpu_has(X86_FEATURE_PAUSEFILTER))
+        kvm_cpu_cap_set(X86_FEATURE_PAUSEFILTER);
+
+    if (boot_cpu_has(X86_FEATURE_PFTHRESHOLD))
+        kvm_cpu_cap_set(X86_FEATURE_PFTHRESHOLD);
+
+    if (vgif)
+        kvm_cpu_cap_set(X86_FEATURE_VGIF);
 
     /* Nested VM can receive #VMEXIT instead of triggering #GP */
     kvm_cpu_cap_set(X86_FEATURE_SVME_ADDR_CHK);
···
              get_npt_level(), PG_LEVEL_1G);
     pr_info("kvm: Nested Paging %sabled\n", npt_enabled ? "en" : "dis");
 
+    /* Setup shadow_me_value and shadow_me_mask */
+    kvm_mmu_set_me_spte_mask(sme_me_mask, sme_me_mask);
+
     /* Note, SEV setup consumes npt_enabled. */
     sev_hardware_setup();
···
         nrips = false;
     }
 
-    enable_apicv = avic = avic && npt_enabled && boot_cpu_has(X86_FEATURE_AVIC);
+    enable_apicv = avic = avic && npt_enabled && (boot_cpu_has(X86_FEATURE_AVIC) || force_avic);
 
     if (enable_apicv) {
-        pr_info("AVIC enabled\n");
+        if (!boot_cpu_has(X86_FEATURE_AVIC)) {
+            pr_warn("AVIC is not supported in CPUID but force enabled");
+            pr_warn("Your system might crash and burn");
+        } else
+            pr_info("AVIC enabled\n");
 
         amd_iommu_register_ga_log_notifier(&avic_ga_log_notifier);
     } else {
         svm_x86_ops.vcpu_blocking = NULL;
         svm_x86_ops.vcpu_unblocking = NULL;
+        svm_x86_ops.vcpu_get_apicv_inhibit_reasons = NULL;
     }
 
     if (vls) {
···
     .check_processor_compatibility = svm_check_processor_compat,
 
     .runtime_ops = &svm_x86_ops,
+    .pmu_ops = &amd_pmu_ops,
 };
 
 static int __init svm_init(void)
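The new `svm_get_lbr_msr()` above reads LBR MSRs from whichever VMCB currently owns them: the current VMCB when LBR virtualization is on, otherwise vmcb01. A minimal sketch of that selection rule follows; the struct and the bit position are illustrative stand-ins (the kernel's `LBR_CTL_ENABLE_MASK` is defined in `asm/svm.h`).

```c
#include <assert.h>

/* Illustrative bit position, not the kernel's actual definition. */
#define LBR_CTL_ENABLE_MASK (1u << 0)

struct vmcb { unsigned int virt_ext; unsigned long dbgctl; };

/*
 * When LBR virtualization is enabled in the current VMCB the LBR MSRs
 * live there; otherwise they stay in vmcb01 so that nested guest entries
 * need not copy them back and forth.
 */
static const struct vmcb *lbr_msr_vmcb(const struct vmcb *current_vmcb,
                                       const struct vmcb *vmcb01)
{
    return (current_vmcb->virt_ext & LBR_CTL_ENABLE_MASK) ?
        current_vmcb : vmcb01;
}
```

`svm_update_lbrv()` then flips between `svm_enable_lbrv()` and `svm_disable_lbrv()`, which move the MSRs between the two VMCBs so this selection always finds them in the right place.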
arch/x86/kvm/svm/svm.h (+44 -11)

···
 #define IOPM_SIZE PAGE_SIZE * 3
 #define MSRPM_SIZE PAGE_SIZE * 2
 
-#define MAX_DIRECT_ACCESS_MSRS 20
+#define MAX_DIRECT_ACCESS_MSRS 21
 #define MSRPM_OFFSETS 16
 extern u32 msrpm_offsets[MSRPM_OFFSETS] __read_mostly;
 extern bool npt_enabled;
+extern int vgif;
 extern bool intercept_smi;
 
 /*
···
     unsigned int3_injected;
     unsigned long int3_rip;
 
-    /* cached guest cpuid flags for faster access */
+    /* optional nested SVM features that are enabled for this guest */
     bool nrips_enabled : 1;
     bool tsc_scaling_enabled : 1;
+    bool v_vmload_vmsave_enabled : 1;
+    bool lbrv_enabled : 1;
+    bool pause_filter_enabled : 1;
+    bool pause_threshold_enabled : 1;
+    bool vgif_enabled : 1;
 
     u32 ldr_reg;
     u32 dfr_reg;
···
     return vmcb_is_intercept(&svm->vmcb->control, bit);
 }
 
-static inline bool vgif_enabled(struct vcpu_svm *svm)
+static inline bool nested_vgif_enabled(struct vcpu_svm *svm)
 {
-    return !!(svm->vmcb->control.int_ctl & V_GIF_ENABLE_MASK);
+    return svm->vgif_enabled && (svm->nested.ctl.int_ctl & V_GIF_ENABLE_MASK);
+}
+
+static inline struct vmcb *get_vgif_vmcb(struct vcpu_svm *svm)
+{
+    if (!vgif)
+        return NULL;
+
+    if (is_guest_mode(&svm->vcpu) && !nested_vgif_enabled(svm))
+        return svm->nested.vmcb02.ptr;
+    else
+        return svm->vmcb01.ptr;
 }
 
 static inline void enable_gif(struct vcpu_svm *svm)
 {
-    if (vgif_enabled(svm))
-        svm->vmcb->control.int_ctl |= V_GIF_MASK;
+    struct vmcb *vmcb = get_vgif_vmcb(svm);
+
+    if (vmcb)
+        vmcb->control.int_ctl |= V_GIF_MASK;
     else
         svm->vcpu.arch.hflags |= HF_GIF_MASK;
 }
 
 static inline void disable_gif(struct vcpu_svm *svm)
 {
-    if (vgif_enabled(svm))
-        svm->vmcb->control.int_ctl &= ~V_GIF_MASK;
+    struct vmcb *vmcb = get_vgif_vmcb(svm);
+
+    if (vmcb)
+        vmcb->control.int_ctl &= ~V_GIF_MASK;
     else
         svm->vcpu.arch.hflags &= ~HF_GIF_MASK;
 }
 
 static inline bool gif_set(struct vcpu_svm *svm)
 {
-    if (vgif_enabled(svm))
-        return !!(svm->vmcb->control.int_ctl & V_GIF_MASK);
+    struct vmcb *vmcb = get_vgif_vmcb(svm);
+
+    if (vmcb)
+        return !!(vmcb->control.int_ctl & V_GIF_MASK);
     else
         return !!(svm->vcpu.arch.hflags & HF_GIF_MASK);
 }
 
+static inline bool nested_npt_enabled(struct vcpu_svm *svm)
+{
+    return svm->nested.ctl.nested_ctl & SVM_NESTED_CTL_NP_ENABLE;
+}
+
 /* svm.c */
 #define MSR_INVALID 0xffffffffU
+
+#define DEBUGCTL_RESERVED_BITS (~(0x3fULL))
 
 extern bool dump_invalid_vmcb;
 
···
 u32 *svm_vcpu_alloc_msrpm(void);
 void svm_vcpu_init_msrpm(struct kvm_vcpu *vcpu, u32 *msrpm);
 void svm_vcpu_free_msrpm(u32 *msrpm);
+void svm_copy_lbrs(struct vmcb *to_vmcb, struct vmcb *from_vmcb);
+void svm_update_lbrv(struct kvm_vcpu *vcpu);
 
 int svm_set_efer(struct kvm_vcpu *vcpu, u64 efer);
 void svm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0);
···
 int avic_ga_log_notifier(u32 ga_tag);
 void avic_vm_destroy(struct kvm *kvm);
 int avic_vm_init(struct kvm *kvm);
-void avic_init_vmcb(struct vcpu_svm *svm);
+void avic_init_vmcb(struct vcpu_svm *svm, struct vmcb *vmcb);
 int avic_incomplete_ipi_interception(struct kvm_vcpu *vcpu);
 int avic_unaccelerated_access_interception(struct kvm_vcpu *vcpu);
 int avic_init_vcpu(struct vcpu_svm *svm);
···
 void avic_vcpu_blocking(struct kvm_vcpu *vcpu);
 void avic_vcpu_unblocking(struct kvm_vcpu *vcpu);
 void avic_ring_doorbell(struct kvm_vcpu *vcpu);
+unsigned long avic_vcpu_get_apicv_inhibit_reasons(struct kvm_vcpu *vcpu);
 
 /* sev.c */
 
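The new `get_vgif_vmcb()` above routes GIF accesses to whichever VMCB owns the virtual GIF bit. A small userspace model of that routing follows; the boolean parameters stand in for `vgif`, `is_guest_mode()`, and `nested_vgif_enabled()`, and the `struct vmcb` is a placeholder.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct vmcb { int id; }; /* stand-in; only pointer identity matters here */

/*
 * Model of the selection in get_vgif_vmcb(): with vGIF disabled there is
 * no VMCB bit to consult (GIF lives in hflags instead); in guest mode
 * without nested vGIF, the host tracks L2's GIF in vmcb02; otherwise
 * L1's GIF lives in vmcb01.
 */
static struct vmcb *pick_vgif_vmcb(bool vgif, bool guest_mode,
                                   bool nested_vgif,
                                   struct vmcb *vmcb01, struct vmcb *vmcb02)
{
    if (!vgif)
        return NULL;

    if (guest_mode && !nested_vgif)
        return vmcb02;
    return vmcb01;
}
```

`enable_gif()`, `disable_gif()`, and `gif_set()` then all share this one routing decision instead of each testing `V_GIF_ENABLE_MASK` on `svm->vmcb` directly.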
arch/x86/kvm/trace.h (+20)

···
         __entry->vmid, __entry->vcpuid)
 );
 
+TRACE_EVENT(kvm_avic_kick_vcpu_slowpath,
+        TP_PROTO(u32 icrh, u32 icrl, u32 index),
+        TP_ARGS(icrh, icrl, index),
+
+    TP_STRUCT__entry(
+        __field(u32, icrh)
+        __field(u32, icrl)
+        __field(u32, index)
+    ),
+
+    TP_fast_assign(
+        __entry->icrh = icrh;
+        __entry->icrl = icrl;
+        __entry->index = index;
+    ),
+
+    TP_printk("icrh:icrl=%#08x:%08x, index=%u",
+          __entry->icrh, __entry->icrl, __entry->index)
+);
+
 TRACE_EVENT(kvm_hv_timer_state,
         TP_PROTO(unsigned int vcpu_id, unsigned int hv_timer_in_use),
         TP_ARGS(vcpu_id, hv_timer_in_use),
arch/x86/kvm/vmx/nested.c (+43 -20)

···
     return 0;
 }
 
-
-static void vmx_inject_page_fault_nested(struct kvm_vcpu *vcpu,
-                     struct x86_exception *fault)
+static bool nested_vmx_handle_page_fault_workaround(struct kvm_vcpu *vcpu,
+                            struct x86_exception *fault)
 {
     struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
 
     WARN_ON(!is_guest_mode(vcpu));
 
     if (nested_vmx_is_page_fault_vmexit(vmcs12, fault->error_code) &&
-        !to_vmx(vcpu)->nested.nested_run_pending) {
+        !WARN_ON_ONCE(to_vmx(vcpu)->nested.nested_run_pending)) {
         vmcs12->vm_exit_intr_error_code = fault->error_code;
         nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI,
                   PF_VECTOR | INTR_TYPE_HARD_EXCEPTION |
                   INTR_INFO_DELIVER_CODE_MASK | INTR_INFO_VALID_MASK,
                   fault->address);
-    } else {
-        kvm_inject_page_fault(vcpu, fault);
+        return true;
     }
+    return false;
 }
 
 static int nested_vmx_check_io_bitmap_controls(struct kvm_vcpu *vcpu,
···
         vmcs_write64(GUEST_PDPTR3, vmcs12->guest_pdptr3);
     }
 
-    if (!enable_ept)
-        vcpu->arch.walk_mmu->inject_page_fault = vmx_inject_page_fault_nested;
-
     if ((vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL) &&
         WARN_ON_ONCE(kvm_set_msr(vcpu, MSR_CORE_PERF_GLOBAL_CTRL,
                      vmcs12->guest_ia32_perf_global_ctrl))) {
···
     }
 }
 
 static void vmcs12_save_pending_event(struct kvm_vcpu *vcpu,
-                      struct vmcs12 *vmcs12)
+                      struct vmcs12 *vmcs12,
+                      u32 vm_exit_reason, u32 exit_intr_info)
 {
     u32 idt_vectoring;
     unsigned int nr;
 
-    if (vcpu->arch.exception.injected) {
+    /*
+     * Per the SDM, VM-Exits due to double and triple faults are never
+     * considered to occur during event delivery, even if the double/triple
+     * fault is the result of an escalating vectoring issue.
+     *
+     * Note, the SDM qualifies the double fault behavior with "The original
+     * event results in a double-fault exception". It's unclear why the
+     * qualification exists since exits due to double fault can occur only
+     * while vectoring a different exception (injected events are never
+     * subject to interception), i.e. there's _always_ an original event.
+     *
+     * The SDM also uses NMI as a confusing example for the "original event
+     * causes the VM exit directly" clause. NMI isn't special in any way,
+     * the same rule applies to all events that cause an exit directly.
+     * NMI is an odd choice for the example because NMIs can only occur on
+     * instruction boundaries, i.e. they _can't_ occur during vectoring.
+     */
+    if ((u16)vm_exit_reason == EXIT_REASON_TRIPLE_FAULT ||
+        ((u16)vm_exit_reason == EXIT_REASON_EXCEPTION_NMI &&
+         is_double_fault(exit_intr_info))) {
+        vmcs12->idt_vectoring_info_field = 0;
+    } else if (vcpu->arch.exception.injected) {
         nr = vcpu->arch.exception.nr;
         idt_vectoring = nr | VECTORING_INFO_VALID_MASK;
 
···
             idt_vectoring |= INTR_TYPE_EXT_INTR;
 
         vmcs12->idt_vectoring_info_field = idt_vectoring;
+    } else {
+        vmcs12->idt_vectoring_info_field = 0;
     }
 }
 
···
     if (to_vmx(vcpu)->exit_reason.enclave_mode)
         vmcs12->vm_exit_reason |= VMX_EXIT_REASONS_SGX_ENCLAVE_MODE;
     vmcs12->exit_qualification = exit_qualification;
-    vmcs12->vm_exit_intr_info = exit_intr_info;
 
-    vmcs12->idt_vectoring_info_field = 0;
-    vmcs12->vm_exit_instruction_len = vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
-    vmcs12->vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
-
+    /*
+     * On VM-Exit due to a failed VM-Entry, the VMCS isn't marked launched
+     * and only EXIT_REASON and EXIT_QUALIFICATION are updated, all other
+     * exit info fields are unmodified.
+     */
     if (!(vmcs12->vm_exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY)) {
         vmcs12->launch_state = 1;
 
···
          * Transfer the event that L0 or L1 may wanted to inject into
          * L2 to IDT_VECTORING_INFO_FIELD.
          */
-        vmcs12_save_pending_event(vcpu, vmcs12);
+        vmcs12_save_pending_event(vcpu, vmcs12,
+                      vm_exit_reason, exit_intr_info);
+
+        vmcs12->vm_exit_intr_info = exit_intr_info;
+        vmcs12->vm_exit_instruction_len = vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
+        vmcs12->vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
 
         /*
          * According to spec, there's no need to store the guest's
···
 
     /* trying to cancel vmlaunch/vmresume is a bug */
     WARN_ON_ONCE(vmx->nested.nested_run_pending);
-
-    /* Similarly, triple faults in L2 should never escape. */
-    WARN_ON_ONCE(kvm_check_request(KVM_REQ_TRIPLE_FAULT, vcpu));
 
     if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
         /*
···
 struct kvm_x86_nested_ops vmx_nested_ops = {
     .leave_nested = vmx_leave_nested,
     .check_events = vmx_check_nested_events,
+    .handle_page_fault_workaround = nested_vmx_handle_page_fault_workaround,
     .hv_timer_pending = nested_vmx_preemption_timer_pending,
     .triple_fault = nested_vmx_triple_fault,
     .get_state = vmx_get_nested_state,
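The `vmcs12_save_pending_event()` change above encodes the SDM rule that exits caused by a double or triple fault are never treated as occurring during event delivery, so no IDT-vectoring information is synthesized for them. A compact model of just that predicate follows; the two exit-reason values and `DF_VECTOR` match the architectural numbers, but the helper itself is illustrative.

```c
#include <assert.h>
#include <stdbool.h>

/* Architectural VMX basic exit reasons and the #DF vector. */
enum { EXIT_REASON_EXCEPTION_NMI = 0, EXIT_REASON_TRIPLE_FAULT = 2 };
#define DF_VECTOR 8

/*
 * Returns whether IDT-vectoring info should be reported to L1 for this
 * exit: never for triple faults or #DF exception exits, otherwise only
 * when an event was actually being injected/vectored.
 */
static bool report_idt_vectoring(unsigned short exit_reason,
                                 unsigned int vector,
                                 bool event_was_injected)
{
    if (exit_reason == EXIT_REASON_TRIPLE_FAULT)
        return false;
    if (exit_reason == EXIT_REASON_EXCEPTION_NMI && vector == DF_VECTOR)
        return false;
    return event_was_injected;
}
```

In the real code the `false` branches correspond to writing `vmcs12->idt_vectoring_info_field = 0`.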
+1 -1
arch/x86/kvm/vmx/pmu_intel.c
···
 	intel_pmu_release_guest_lbr_event(vcpu);
 }
 
-struct kvm_pmu_ops intel_pmu_ops = {
+struct kvm_pmu_ops intel_pmu_ops __initdata = {
 	.pmc_perf_hw_id = intel_pmc_perf_hw_id,
 	.pmc_is_enabled = intel_pmc_is_enabled,
 	.pmc_idx_to_pmc = intel_pmc_idx_to_pmc,
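Marking `intel_pmu_ops` `__initdata` is safe because the struct is only read once, when its contents are copied into a runtime table during init, after which the template memory can be reclaimed. A userspace miniature of that pattern, with hypothetical names (`template_ops`, `runtime_ops`, `ops_update`):

```c
#include <assert.h>
#include <string.h>

/* A vendor "ops" template that is copied into a runtime table exactly
 * once; afterwards the template itself is never dereferenced again,
 * which is the property the __initdata annotation asserts in-kernel. */
struct pmu_ops {
	int (*pmc_is_enabled)(int pmc);
};

static int intel_pmc_is_enabled(int pmc)
{
	return pmc >= 0;	/* toy implementation */
}

static struct pmu_ops template_ops = {	/* would be __initdata in-kernel */
	.pmc_is_enabled = intel_pmc_is_enabled,
};

static struct pmu_ops runtime_ops;	/* lives for the whole module lifetime */

static void ops_update(const struct pmu_ops *src)
{
	/* Snapshot the function pointers; the source may go away. */
	memcpy(&runtime_ops, src, sizeof(runtime_ops));
}
```

After `ops_update(&template_ops)` runs, zeroing or freeing `template_ops` must not affect callers of `runtime_ops`, which is why the copy has to happen before init memory is discarded.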
+6 -5
arch/x86/kvm/vmx/posted_intr.c
···
 void pi_wakeup_handler(void)
 {
 	int cpu = smp_processor_id();
+	struct list_head *wakeup_list = &per_cpu(wakeup_vcpus_on_cpu, cpu);
+	raw_spinlock_t *spinlock = &per_cpu(wakeup_vcpus_on_cpu_lock, cpu);
 	struct vcpu_vmx *vmx;
 
-	raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu));
-	list_for_each_entry(vmx, &per_cpu(wakeup_vcpus_on_cpu, cpu),
-			    pi_wakeup_list) {
+	raw_spin_lock(spinlock);
+	list_for_each_entry(vmx, wakeup_list, pi_wakeup_list) {
 
 		if (pi_test_on(&vmx->pi_desc))
 			kvm_vcpu_wake_up(&vmx->vcpu);
 	}
-	raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu));
+	raw_spin_unlock(spinlock);
 }
 
 void __init pi_init_cpu(int cpu)
···
 		continue;
 	}
 
-	vcpu_info.pi_desc_addr = __pa(&to_vmx(vcpu)->pi_desc);
+	vcpu_info.pi_desc_addr = __pa(vcpu_to_pi_desc(vcpu));
 	vcpu_info.vector = irq.vector;
 
 	trace_kvm_pi_irte_update(host_irq, vcpu->vcpu_id, e->gsi,
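The pi_wakeup_handler() rewrite is behavior-preserving: the two repeated `per_cpu()` lookups are hoisted into locals so the lock/unlock and iteration lines stay short and the lookups happen once. The shape of the handler — walk one per-CPU list and wake every entry whose "on" bit is set — can be sketched in plain C; the slot array, flags, and names below are toy stand-ins, not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins: one "per-CPU" wakeup list per slot, entries flagged
 * with an ON bit, mirroring the shape of pi_wakeup_handler(). */
struct vcpu {
	bool on;		/* stand-in for pi_test_on() */
	bool woken;
	struct vcpu *next;	/* stand-in for the pi_wakeup_list link */
};

#define NR_SLOTS 4
static struct vcpu *wakeup_vcpus_on_slot[NR_SLOTS];

static int wakeup_handler(int slot)
{
	/* Hoist the per-slot lookup once, as the patch does with per_cpu(). */
	struct vcpu *head = wakeup_vcpus_on_slot[slot];
	int woken = 0;

	for (struct vcpu *v = head; v; v = v->next) {
		if (v->on) {
			v->woken = true;	/* stand-in for kvm_vcpu_wake_up() */
			woken++;
		}
	}
	return woken;
}
```

In the real handler the walk additionally runs under a raw spinlock, since the list is mutated from the scheduling path of other vCPUs.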
+5
arch/x86/kvm/vmx/vmcs.h
···
 	return is_exception_n(intr_info, BP_VECTOR);
 }
 
+static inline bool is_double_fault(u32 intr_info)
+{
+	return is_exception_n(intr_info, DF_VECTOR);
+}
+
 static inline bool is_page_fault(u32 intr_info)
 {
 	return is_exception_n(intr_info, PF_VECTOR);
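These helpers all compare the vector field of the VM-exit interruption-information word. A self-contained sketch of the check, with the field layout transcribed from the SDM for illustration (bits 7:0 vector, bits 10:8 event type where 3 is a hardware exception, bit 31 valid) rather than lifted from vmcs.h:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Field layout of the VM-exit interruption-information word (per the
 * Intel SDM); constants transcribed here for illustration. */
#define INTR_INFO_VECTOR_MASK		0xffu
#define INTR_INFO_INTR_TYPE_MASK	(7u << 8)
#define INTR_TYPE_HARD_EXCEPTION	(3u << 8)
#define INTR_INFO_VALID_MASK		(1u << 31)
#define DF_VECTOR			8

static bool is_exception_n(uint32_t intr_info, uint8_t vector)
{
	uint32_t mask = INTR_INFO_VECTOR_MASK | INTR_INFO_INTR_TYPE_MASK |
			INTR_INFO_VALID_MASK;
	uint32_t want = vector | INTR_TYPE_HARD_EXCEPTION |
			INTR_INFO_VALID_MASK;

	/* Valid hardware exception with the requested vector? */
	return (intr_info & mask) == want;
}

static bool is_double_fault(uint32_t intr_info)
{
	return is_exception_n(intr_info, DF_VECTOR);
}
```

Note the type bits matter: an external interrupt or NMI with the same low vector value must not match, which is why the mask covers bits 10:8 as well as 7:0.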
+37 -8
arch/x86/kvm/vmx/vmx.c
···
 				      &_cpu_based_exec_control) < 0)
 		return -EIO;
 #ifdef CONFIG_X86_64
-	if ((_cpu_based_exec_control & CPU_BASED_TPR_SHADOW))
+	if (_cpu_based_exec_control & CPU_BASED_TPR_SHADOW)
 		_cpu_based_exec_control &= ~CPU_BASED_CR8_LOAD_EXITING &
 					   ~CPU_BASED_CR8_STORE_EXITING;
 #endif
···
 
 	if (enable_ept)
 		ept_sync_context(construct_eptp(vcpu, root_hpa,
-						mmu->shadow_root_level));
+						mmu->root_role.level));
 	else
 		vpid_sync_context(vmx_get_current_vpid(vcpu));
 }
···
 	if (cpu_has_secondary_exec_ctrls())
 		secondary_exec_controls_set(vmx, vmx_secondary_exec_control(vmx));
 
-	if (kvm_vcpu_apicv_active(&vmx->vcpu)) {
+	if (enable_apicv && lapic_in_kernel(&vmx->vcpu)) {
 		vmcs_write64(EOI_EXIT_BITMAP0, 0);
 		vmcs_write64(EOI_EXIT_BITMAP1, 0);
 		vmcs_write64(EOI_EXIT_BITMAP2, 0);
···
 	error_code |= (exit_qualification & EPT_VIOLATION_ACC_INSTR)
 		      ? PFERR_FETCH_MASK : 0;
 	/* ept page table entry is present? */
-	error_code |= (exit_qualification &
-		       (EPT_VIOLATION_READABLE | EPT_VIOLATION_WRITABLE |
-			EPT_VIOLATION_EXECUTABLE))
+	error_code |= (exit_qualification & EPT_VIOLATION_RWX_MASK)
 		      ? PFERR_PRESENT_MASK : 0;
 
 	error_code |= (exit_qualification & EPT_VIOLATION_GVA_TRANSLATED) != 0 ?
···
 	.cpu_dirty_log_size = PML_ENTITY_NUM,
 	.update_cpu_dirty_logging = vmx_update_cpu_dirty_logging,
 
-	.pmu_ops = &intel_pmu_ops,
 	.nested_ops = &vmx_nested_ops,
 
 	.pi_update_irte = vmx_pi_update_irte,
···
 	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
 
 	/* '0' on failure so that the !PT case can use a RET0 static call. */
-	if (!kvm_arch_pmi_in_guest(vcpu))
+	if (!vcpu || !kvm_handling_nmi_from_guest(vcpu))
 		return 0;
 
 	kvm_make_request(KVM_REQ_PMI, vcpu);
···
 
 	for (i = 0; i < ARRAY_SIZE(vmx_uret_msrs_list); ++i)
 		kvm_add_user_return_msr(vmx_uret_msrs_list[i]);
+}
+
+static void __init vmx_setup_me_spte_mask(void)
+{
+	u64 me_mask = 0;
+
+	/*
+	 * kvm_get_shadow_phys_bits() returns shadow_phys_bits.  Use
+	 * the former to avoid exposing shadow_phys_bits.
+	 *
+	 * On pre-MKTME system, boot_cpu_data.x86_phys_bits equals to
+	 * shadow_phys_bits.  On MKTME and/or TDX capable systems,
+	 * boot_cpu_data.x86_phys_bits holds the actual physical address
+	 * w/o the KeyID bits, and shadow_phys_bits equals to MAXPHYADDR
+	 * reported by CPUID.  Those bits between are KeyID bits.
+	 */
+	if (boot_cpu_data.x86_phys_bits != kvm_get_shadow_phys_bits())
+		me_mask = rsvd_bits(boot_cpu_data.x86_phys_bits,
+				    kvm_get_shadow_phys_bits() - 1);
+	/*
+	 * Unlike SME, host kernel doesn't support setting up any
+	 * MKTME KeyID on Intel platforms.  No memory encryption
+	 * bits should be included into the SPTE.
+	 */
+	kvm_mmu_set_me_spte_mask(0, me_mask);
 }
 
 static struct kvm_x86_init_ops vmx_init_ops __initdata;
···
 	kvm_mmu_set_ept_masks(enable_ept_ad_bits,
 			      cpu_has_vmx_ept_execute_only());
 
+	/*
+	 * Setup shadow_me_value/shadow_me_mask to include MKTME KeyID
+	 * bits to shadow_zero_check.
+	 */
+	vmx_setup_me_spte_mask();
+
 	kvm_configure_mmu(enable_ept, 0, vmx_get_max_tdp_level(),
 			  ept_caps_to_lpage_level(vmx_capability.ept));
 
···
 	.handle_intel_pt_intr = NULL,
 
 	.runtime_ops = &vmx_x86_ops,
+	.pmu_ops = &intel_pmu_ops,
 };
 
 static void vmx_cleanup_l1d_flush(void)
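vmx_setup_me_spte_mask() leans on rsvd_bits() to build a mask of the KeyID bits sitting between the usable physical width and the CPUID-reported MAXPHYADDR. The helper's usual shape — a mask of consecutive bits `[s, e]` inclusive — can be sketched and sanity-checked in isolation; the widths below (46 usable bits, MAXPHYADDR 52) are illustrative values, not claims about any particular CPU.

```c
#include <assert.h>
#include <stdint.h>

/* Consecutive-bit mask covering bits s..e inclusive, matching the shape
 * of the kernel's rsvd_bits() helper. */
static uint64_t rsvd_bits(int s, int e)
{
	if (e < s)
		return 0;
	/* Set bits 0..e, then clear bits 0..s-1. */
	return ((~0ULL) >> (63 - e)) & ~((1ULL << s) - 1);
}

/* KeyID-bit mask for a hypothetical host whose usable physical width is
 * phys_bits but whose CPUID-reported MAXPHYADDR is maxphyaddr. */
static uint64_t me_mask_for(int phys_bits, int maxphyaddr)
{
	if (phys_bits == maxphyaddr)
		return 0;	/* no MKTME/TDX KeyID bits stolen */
	return rsvd_bits(phys_bits, maxphyaddr - 1);
}
```

The resulting mask is what gets handed to kvm_mmu_set_me_spte_mask() as bits that must never appear set in an SPTE on Intel parts.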
+222 -151
arch/x86/kvm/x86.c
···
 
 const struct _kvm_stats_desc kvm_vcpu_stats_desc[] = {
 	KVM_GENERIC_VCPU_STATS(),
+	STATS_DESC_COUNTER(VCPU, pf_taken),
 	STATS_DESC_COUNTER(VCPU, pf_fixed),
+	STATS_DESC_COUNTER(VCPU, pf_emulate),
+	STATS_DESC_COUNTER(VCPU, pf_spurious),
+	STATS_DESC_COUNTER(VCPU, pf_fast),
+	STATS_DESC_COUNTER(VCPU, pf_mmio_spte_created),
 	STATS_DESC_COUNTER(VCPU, pf_guest),
 	STATS_DESC_COUNTER(VCPU, tlb_flush),
 	STATS_DESC_COUNTER(VCPU, invlpg),
···
 }
 EXPORT_SYMBOL_GPL(kvm_inject_page_fault);
 
+/* Returns true if the page fault was immediately morphed into a VM-Exit. */
 bool kvm_inject_emulated_page_fault(struct kvm_vcpu *vcpu,
 				    struct x86_exception *fault)
 {
···
 	kvm_mmu_invalidate_gva(vcpu, fault_mmu, fault->address,
 			       fault_mmu->root.hpa);
 
+	/*
+	 * A workaround for KVM's bad exception handling.  If KVM injected an
+	 * exception into L2, and L2 encountered a #PF while vectoring the
+	 * injected exception, manually check to see if L1 wants to intercept
+	 * #PF, otherwise queuing the #PF will lead to #DF or a lost exception.
+	 * In all other cases, defer the check to nested_ops->check_events(),
+	 * which will correctly handle priority (this does not).  Note, other
+	 * exceptions, e.g. #GP, are theoretically affected, #PF is simply the
+	 * most problematic, e.g. when L0 and L1 are both intercepting #PF for
+	 * shadow paging.
+	 *
+	 * TODO: Rewrite exception handling to track injected and pending
+	 *       (VM-Exit) exceptions separately.
+	 */
+	if (unlikely(vcpu->arch.exception.injected && is_guest_mode(vcpu)) &&
+	    kvm_x86_ops.nested_ops->handle_page_fault_workaround(vcpu, fault))
+		return true;
+
 	fault_mmu->inject_page_fault(vcpu, fault);
-	return fault->nested_page_fault;
+	return false;
 }
 EXPORT_SYMBOL_GPL(kvm_inject_emulated_page_fault);
···
 		wrmsrl(MSR_IA32_XSS, vcpu->arch.ia32_xss);
 	}
 
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
 	if (static_cpu_has(X86_FEATURE_PKU) &&
-	    (kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ||
-	     (vcpu->arch.xcr0 & XFEATURE_MASK_PKRU)) &&
-	    vcpu->arch.pkru != vcpu->arch.host_pkru)
+	    vcpu->arch.pkru != vcpu->arch.host_pkru &&
+	    ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
+	     kvm_read_cr4_bits(vcpu, X86_CR4_PKE)))
 		write_pkru(vcpu->arch.pkru);
+#endif /* CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS */
 }
 EXPORT_SYMBOL_GPL(kvm_load_guest_xsave_state);
···
 	if (vcpu->arch.guest_state_protected)
 		return;
 
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
 	if (static_cpu_has(X86_FEATURE_PKU) &&
-	    (kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ||
-	     (vcpu->arch.xcr0 & XFEATURE_MASK_PKRU))) {
+	    ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
+	     kvm_read_cr4_bits(vcpu, X86_CR4_PKE))) {
 		vcpu->arch.pkru = rdpkru();
 		if (vcpu->arch.pkru != vcpu->arch.host_pkru)
 			write_pkru(vcpu->arch.host_pkru);
 	}
+#endif /* CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS */
 
 	if (kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE)) {
 
···
 	kvm_make_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu);
 
 	/* we verify if the enable bit is set... */
-	vcpu->arch.pv_time_enabled = false;
-	if (!(system_time & 1))
-		return;
-
-	if (!kvm_gfn_to_hva_cache_init(vcpu->kvm,
-	       &vcpu->arch.pv_time, system_time & ~1ULL,
-	       sizeof(struct pvclock_vcpu_time_info)))
-		vcpu->arch.pv_time_enabled = true;
+	if (system_time & 1) {
+		kvm_gfn_to_pfn_cache_init(vcpu->kvm, &vcpu->arch.pv_time, vcpu,
+					  KVM_HOST_USES_PFN, system_time & ~1ULL,
+					  sizeof(struct pvclock_vcpu_time_info));
+	} else {
+		kvm_gfn_to_pfn_cache_destroy(vcpu->kvm, &vcpu->arch.pv_time);
+	}
 
 	return;
 }
···
 	return data.clock;
 }
 
-static void kvm_setup_pvclock_page(struct kvm_vcpu *v,
-				   struct gfn_to_hva_cache *cache,
-				   unsigned int offset)
+static void kvm_setup_guest_pvclock(struct kvm_vcpu *v,
+				    struct gfn_to_pfn_cache *gpc,
+				    unsigned int offset)
 {
 	struct kvm_vcpu_arch *vcpu = &v->arch;
-	struct pvclock_vcpu_time_info guest_hv_clock;
+	struct pvclock_vcpu_time_info *guest_hv_clock;
+	unsigned long flags;
 
-	if (unlikely(kvm_read_guest_offset_cached(v->kvm, cache,
-		&guest_hv_clock, offset, sizeof(guest_hv_clock))))
-		return;
+	read_lock_irqsave(&gpc->lock, flags);
+	while (!kvm_gfn_to_pfn_cache_check(v->kvm, gpc, gpc->gpa,
+					   offset + sizeof(*guest_hv_clock))) {
+		read_unlock_irqrestore(&gpc->lock, flags);
 
-	/* This VCPU is paused, but it's legal for a guest to read another
+		if (kvm_gfn_to_pfn_cache_refresh(v->kvm, gpc, gpc->gpa,
+						 offset + sizeof(*guest_hv_clock)))
+			return;
+
+		read_lock_irqsave(&gpc->lock, flags);
+	}
+
+	guest_hv_clock = (void *)(gpc->khva + offset);
+
+	/*
+	 * This VCPU is paused, but it's legal for a guest to read another
 	 * VCPU's kvmclock, so we really have to follow the specification where
 	 * it says that version is odd if data is being modified, and even after
 	 * it is consistent.
-	 *
-	 * Version field updates must be kept separate.  This is because
-	 * kvm_write_guest_cached might use a "rep movs" instruction, and
-	 * writes within a string instruction are weakly ordered.  So there
-	 * are three writes overall.
-	 *
-	 * As a small optimization, only write the version field in the first
-	 * and third write.  The vcpu->pv_time cache is still valid, because the
-	 * version field is the first in the struct.
 	 */
-	BUILD_BUG_ON(offsetof(struct pvclock_vcpu_time_info, version) != 0);
 
-	if (guest_hv_clock.version & 1)
-		++guest_hv_clock.version;  /* first time write, random junk */
-
-	vcpu->hv_clock.version = guest_hv_clock.version + 1;
-	kvm_write_guest_offset_cached(v->kvm, cache,
-				      &vcpu->hv_clock, offset,
-				      sizeof(vcpu->hv_clock.version));
-
+	guest_hv_clock->version = vcpu->hv_clock.version = (guest_hv_clock->version + 1) | 1;
 	smp_wmb();
 
 	/* retain PVCLOCK_GUEST_STOPPED if set in guest copy */
-	vcpu->hv_clock.flags |= (guest_hv_clock.flags & PVCLOCK_GUEST_STOPPED);
+	vcpu->hv_clock.flags |= (guest_hv_clock->flags & PVCLOCK_GUEST_STOPPED);
 
 	if (vcpu->pvclock_set_guest_stopped_request) {
 		vcpu->hv_clock.flags |= PVCLOCK_GUEST_STOPPED;
 		vcpu->pvclock_set_guest_stopped_request = false;
 	}
 
-	trace_kvm_pvclock_update(v->vcpu_id, &vcpu->hv_clock);
-
-	kvm_write_guest_offset_cached(v->kvm, cache,
-				      &vcpu->hv_clock, offset,
-				      sizeof(vcpu->hv_clock));
-
+	memcpy(guest_hv_clock, &vcpu->hv_clock, sizeof(*guest_hv_clock));
 	smp_wmb();
 
-	vcpu->hv_clock.version++;
-	kvm_write_guest_offset_cached(v->kvm, cache,
-				      &vcpu->hv_clock, offset,
-				      sizeof(vcpu->hv_clock.version));
+	guest_hv_clock->version = ++vcpu->hv_clock.version;
+
+	mark_page_dirty_in_slot(v->kvm, gpc->memslot, gpc->gpa >> PAGE_SHIFT);
+	read_unlock_irqrestore(&gpc->lock, flags);
+
+	trace_kvm_pvclock_update(v->vcpu_id, &vcpu->hv_clock);
 }
 
 static int kvm_guest_time_update(struct kvm_vcpu *v)
···
 
 	vcpu->hv_clock.flags = pvclock_flags;
 
-	if (vcpu->pv_time_enabled)
-		kvm_setup_pvclock_page(v, &vcpu->pv_time, 0);
-	if (vcpu->xen.vcpu_info_set)
-		kvm_setup_pvclock_page(v, &vcpu->xen.vcpu_info_cache,
-				       offsetof(struct compat_vcpu_info, time));
-	if (vcpu->xen.vcpu_time_info_set)
-		kvm_setup_pvclock_page(v, &vcpu->xen.vcpu_time_info_cache, 0);
+	if (vcpu->pv_time.active)
+		kvm_setup_guest_pvclock(v, &vcpu->pv_time, 0);
+	if (vcpu->xen.vcpu_info_cache.active)
+		kvm_setup_guest_pvclock(v, &vcpu->xen.vcpu_info_cache,
+					offsetof(struct compat_vcpu_info, time));
+	if (vcpu->xen.vcpu_time_info_cache.active)
+		kvm_setup_guest_pvclock(v, &vcpu->xen.vcpu_time_info_cache, 0);
 	kvm_hv_setup_tsc_page(v->kvm, &vcpu->hv_clock);
 	return 0;
 }
···
 
 static void kvmclock_reset(struct kvm_vcpu *vcpu)
 {
-	vcpu->arch.pv_time_enabled = false;
+	kvm_gfn_to_pfn_cache_destroy(vcpu->kvm, &vcpu->arch.pv_time);
 	vcpu->arch.time = 0;
 }
···
 		r = KVM_XEN_HVM_CONFIG_HYPERCALL_MSR |
 		    KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL |
 		    KVM_XEN_HVM_CONFIG_SHARED_INFO |
-		    KVM_XEN_HVM_CONFIG_EVTCHN_2LEVEL;
+		    KVM_XEN_HVM_CONFIG_EVTCHN_2LEVEL |
+		    KVM_XEN_HVM_CONFIG_EVTCHN_SEND;
 		if (sched_info_on())
 			r |= KVM_XEN_HVM_CONFIG_RUNSTATE;
 		break;
···
 		r = boot_cpu_has(X86_FEATURE_XSAVE);
 		break;
 	case KVM_CAP_TSC_CONTROL:
+	case KVM_CAP_VM_TSC_CONTROL:
 		r = kvm_has_tsc_control;
 		break;
 	case KVM_CAP_X2APIC_API:
···
 */
 static int kvm_set_guest_paused(struct kvm_vcpu *vcpu)
 {
-	if (!vcpu->arch.pv_time_enabled)
+	if (!vcpu->arch.pv_time.active)
 		return -EINVAL;
 	vcpu->arch.pvclock_set_guest_stopped_request = true;
 	kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
···
 
 	mutex_lock(&kvm->lock);
 	kvm_for_each_vcpu(i, vcpu, kvm) {
-		if (!vcpu->arch.pv_time_enabled)
+		if (!vcpu->arch.pv_time.active)
 			continue;
 
 		ret = kvm_set_guest_paused(vcpu);
···
 		r = kvm_xen_hvm_set_attr(kvm, &xha);
 		break;
 	}
+	case KVM_XEN_HVM_EVTCHN_SEND: {
+		struct kvm_irq_routing_xen_evtchn uxe;
+
+		r = -EFAULT;
+		if (copy_from_user(&uxe, argp, sizeof(uxe)))
+			goto out;
+		r = kvm_xen_hvm_evtchn_send(kvm, &uxe);
+		break;
+	}
 #endif
 	case KVM_SET_CLOCK:
 		r = kvm_vm_ioctl_set_clock(kvm, argp);
···
 	case KVM_GET_CLOCK:
 		r = kvm_vm_ioctl_get_clock(kvm, argp);
 		break;
+	case KVM_SET_TSC_KHZ: {
+		u32 user_tsc_khz;
+
+		r = -EINVAL;
+		user_tsc_khz = (u32)arg;
+
+		if (kvm_has_tsc_control &&
+		    user_tsc_khz >= kvm_max_guest_tsc_khz)
+			goto out;
+
+		if (user_tsc_khz == 0)
+			user_tsc_khz = tsc_khz;
+
+		WRITE_ONCE(kvm->arch.default_tsc_khz, user_tsc_khz);
+		r = 0;
+
+		goto out;
+	}
+	case KVM_GET_TSC_KHZ: {
+		r = READ_ONCE(kvm->arch.default_tsc_khz);
+		goto out;
+	}
 	case KVM_MEMORY_ENCRYPT_OP: {
 		r = -ENOTTY;
 		if (!kvm_x86_ops.mem_enc_ioctl)
···
 				   exception, &write_emultor);
 }
 
-#define CMPXCHG_TYPE(t, ptr, old, new) \
-	(cmpxchg((t *)(ptr), *(t *)(old), *(t *)(new)) == *(t *)(old))
-
-#ifdef CONFIG_X86_64
-#  define CMPXCHG64(ptr, old, new) CMPXCHG_TYPE(u64, ptr, old, new)
-#else
-#  define CMPXCHG64(ptr, old, new) \
-	(cmpxchg64((u64 *)(ptr), *(u64 *)(old), *(u64 *)(new)) == *(u64 *)(old))
-#endif
+#define emulator_try_cmpxchg_user(t, ptr, old, new) \
+	(__try_cmpxchg_user((t __user *)(ptr), (t *)(old), *(t *)(new), efault ## t))
 
 static int emulator_cmpxchg_emulated(struct x86_emulate_ctxt *ctxt,
 				     unsigned long addr,
···
 				     unsigned int bytes,
 				     struct x86_exception *exception)
 {
-	struct kvm_host_map map;
 	struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
 	u64 page_line_mask;
+	unsigned long hva;
 	gpa_t gpa;
-	char *kaddr;
-	bool exchanged;
+	int r;
 
 	/* guests cmpxchg8b have to be emulated atomically */
 	if (bytes > 8 || (bytes & (bytes - 1)))
···
 	if (((gpa + bytes - 1) & page_line_mask) != (gpa & page_line_mask))
 		goto emul_write;
 
-	if (kvm_vcpu_map(vcpu, gpa_to_gfn(gpa), &map))
+	hva = kvm_vcpu_gfn_to_hva(vcpu, gpa_to_gfn(gpa));
+	if (kvm_is_error_hva(hva))
 		goto emul_write;
 
-	kaddr = map.hva + offset_in_page(gpa);
+	hva += offset_in_page(gpa);
 
 	switch (bytes) {
 	case 1:
-		exchanged = CMPXCHG_TYPE(u8, kaddr, old, new);
+		r = emulator_try_cmpxchg_user(u8, hva, old, new);
 		break;
 	case 2:
-		exchanged = CMPXCHG_TYPE(u16, kaddr, old, new);
+		r = emulator_try_cmpxchg_user(u16, hva, old, new);
 		break;
 	case 4:
-		exchanged = CMPXCHG_TYPE(u32, kaddr, old, new);
+		r = emulator_try_cmpxchg_user(u32, hva, old, new);
 		break;
 	case 8:
-		exchanged = CMPXCHG64(kaddr, old, new);
+		r = emulator_try_cmpxchg_user(u64, hva, old, new);
 		break;
 	default:
 		BUG();
 	}
 
-	kvm_vcpu_unmap(vcpu, &map, true);
-
-	if (!exchanged)
+	if (r < 0)
+		return X86EMUL_UNHANDLEABLE;
+	if (r)
 		return X86EMUL_CMPXCHG_FAILED;
 
 	kvm_page_track_write(vcpu, gpa, new, bytes);
···
 	    WARN_ON_ONCE(!(emulation_type & EMULTYPE_PF)))
 		return false;
 
-	if (!vcpu->arch.mmu->direct_map) {
+	if (!vcpu->arch.mmu->root_role.direct) {
 		/*
 		 * Write permission should be allowed since only
 		 * write access need to be emulated.
···
 	kvm_release_pfn_clean(pfn);
 
 	/* The instructions are well-emulated on direct mmu. */
-	if (vcpu->arch.mmu->direct_map) {
+	if (vcpu->arch.mmu->root_role.direct) {
 		unsigned int indirect_shadow_pages;
 
 		write_lock(&vcpu->kvm->mmu_lock);
···
 	vcpu->arch.last_retry_eip = ctxt->eip;
 	vcpu->arch.last_retry_addr = cr2_or_gpa;
 
-	if (!vcpu->arch.mmu->direct_map)
+	if (!vcpu->arch.mmu->root_role.direct)
 		gpa = kvm_mmu_gva_to_gpa_write(vcpu, cr2_or_gpa, NULL);
 
 	kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
···
 }
 EXPORT_SYMBOL_GPL(kvm_skip_emulated_instruction);
 
-static bool kvm_vcpu_check_breakpoint(struct kvm_vcpu *vcpu, int *r)
+static bool kvm_vcpu_check_code_breakpoint(struct kvm_vcpu *vcpu, int *r)
 {
 	if (unlikely(vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP) &&
 	    (vcpu->arch.guest_debug_dr7 & DR7_BP_EN_MASK)) {
···
 }
 
 /*
- * Decode to be emulated instruction. Return EMULATION_OK if success.
+ * Decode an instruction for emulation.  The caller is responsible for handling
+ * code breakpoints.  Note, manually detecting code breakpoints is unnecessary
+ * (and wrong) when emulating on an intercepted fault-like exception[*], as
+ * code breakpoints have higher priority and thus have already been done by
+ * hardware.
+ *
+ * [*] Except #MC, which is higher priority, but KVM should never emulate in
+ *     response to a machine check.
  */
 int x86_decode_emulated_instruction(struct kvm_vcpu *vcpu, int emulation_type,
 				    void *insn, int insn_len)
 {
-	int r = EMULATION_OK;
 	struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
+	int r;
 
 	init_emulate_ctxt(vcpu);
-
-	/*
-	 * We will reenter on the same instruction since we do not set
-	 * complete_userspace_io. This does not handle watchpoints yet,
-	 * those would be handled in the emulate_ops.
-	 */
-	if (!(emulation_type & EMULTYPE_SKIP) &&
-	    kvm_vcpu_check_breakpoint(vcpu, &r))
-		return r;
 
 	r = x86_decode_insn(ctxt, insn, insn_len, emulation_type);
 
···
 
 	if (!(emulation_type & EMULTYPE_NO_DECODE)) {
 		kvm_clear_exception_queue(vcpu);
+
+		/*
+		 * Return immediately if RIP hits a code breakpoint, such #DBs
+		 * are fault-like and are higher priority than any faults on
+		 * the code fetch itself.
+		 */
+		if (!(emulation_type & EMULTYPE_SKIP) &&
+		    kvm_vcpu_check_code_breakpoint(vcpu, &r))
+			return r;
 
 		r = x86_decode_emulated_instruction(vcpu, emulation_type,
 						    insn, insn_len);
···
 	ctxt->exception.address = cr2_or_gpa;
 
 	/* With shadow page tables, cr2 contains a GVA or nGPA. */
-	if (vcpu->arch.mmu->direct_map) {
+	if (vcpu->arch.mmu->root_role.direct) {
 		ctxt->gpa_available = true;
 		ctxt->gpa_val = cr2_or_gpa;
 	}
···
 
 static void kvm_timer_init(void)
 {
-	max_tsc_khz = tsc_khz;
-
 	if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
-#ifdef CONFIG_CPU_FREQ
-		struct cpufreq_policy *policy;
-		int cpu;
+		max_tsc_khz = tsc_khz;
 
-		cpu = get_cpu();
-		policy = cpufreq_cpu_get(cpu);
-		if (policy) {
-			if (policy->cpuinfo.max_freq)
-				max_tsc_khz = policy->cpuinfo.max_freq;
-			cpufreq_cpu_put(policy);
+		if (IS_ENABLED(CONFIG_CPU_FREQ)) {
+			struct cpufreq_policy *policy;
+			int cpu;
+
+			cpu = get_cpu();
+			policy = cpufreq_cpu_get(cpu);
+			if (policy) {
+				if (policy->cpuinfo.max_freq)
+					max_tsc_khz = policy->cpuinfo.max_freq;
+				cpufreq_cpu_put(policy);
+			}
+			put_cpu();
 		}
-		put_cpu();
-#endif
 		cpufreq_register_notifier(&kvmclock_cpufreq_notifier_block,
 					  CPUFREQ_TRANSITION_NOTIFIER);
 	}
···
 }
 EXPORT_SYMBOL_GPL(kvm_apicv_activated);
 
+bool kvm_vcpu_apicv_activated(struct kvm_vcpu *vcpu)
+{
+	ulong vm_reasons = READ_ONCE(vcpu->kvm->arch.apicv_inhibit_reasons);
+	ulong vcpu_reasons = static_call(kvm_x86_vcpu_get_apicv_inhibit_reasons)(vcpu);
+
+	return (vm_reasons | vcpu_reasons) == 0;
+}
+EXPORT_SYMBOL_GPL(kvm_vcpu_apicv_activated);
 
 static void set_or_clear_apicv_inhibit(unsigned long *inhibits,
 				       enum kvm_apicv_inhibit reason, bool set)
···
 	struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
 	char instruction[3];
 	unsigned long rip = kvm_rip_read(vcpu);
+
+	/*
+	 * If the quirk is disabled, synthesize a #UD and let the guest pick up
+	 * the pieces.
+	 */
+	if (!kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_FIX_HYPERCALL_INSN)) {
+		ctxt->exception.error_code_valid = false;
+		ctxt->exception.vector = UD_VECTOR;
+		ctxt->have_exception = true;
+		return X86EMUL_PROPAGATE_FAULT;
+	}
 
 	static_call(kvm_x86_patch_hypercall)(vcpu, instruction);
 
···
 
 	down_read(&vcpu->kvm->arch.apicv_update_lock);
 
-	activate = kvm_apicv_activated(vcpu->kvm);
+	activate = kvm_vcpu_apicv_activated(vcpu);
+
 	if (vcpu->arch.apicv_active == activate)
 		goto out;
 
···
 	 * per-VM state, and responsing vCPUs must wait for the update
 	 * to complete before servicing KVM_REQ_APICV_UPDATE.
 	 */
-	WARN_ON_ONCE(kvm_apicv_activated(vcpu->kvm) != kvm_vcpu_apicv_active(vcpu));
+	WARN_ON_ONCE(kvm_vcpu_apicv_activated(vcpu) != kvm_vcpu_apicv_active(vcpu));
 
 	exit_fastpath = static_call(kvm_x86_vcpu_run)(vcpu);
 	if (likely(exit_fastpath != EXIT_FASTPATH_REENTER_GUEST))
···
 	 * acceptable for all known use cases.
 	 */
 	guest_timing_exit_irqoff();
-
-	if (lapic_in_kernel(vcpu)) {
-		s64 delta = vcpu->arch.apic->lapic_timer.advance_expire_delta;
-		if (delta != S64_MIN) {
-			trace_kvm_wait_lapic_expire(vcpu->vcpu_id, delta);
-			vcpu->arch.apic->lapic_timer.advance_expire_delta = S64_MIN;
-		}
-	}
 
 	local_irq_enable();
 	preempt_enable();
···
 			break;
 
 		kvm_clear_request(KVM_REQ_UNBLOCK, vcpu);
+		if (kvm_xen_has_pending_events(vcpu))
+			kvm_xen_inject_pending_events(vcpu);
+
 		if (kvm_cpu_has_pending_timer(vcpu))
 			kvm_inject_pending_timer_irqs(vcpu);
 
···
 
 	vcpu->arch.arch_capabilities = kvm_get_arch_capabilities();
 	vcpu->arch.msr_platform_info = MSR_PLATFORM_INFO_CPUID_FAULT;
+	kvm_xen_init_vcpu(vcpu);
 	kvm_vcpu_mtrr_init(vcpu);
 	vcpu_load(vcpu);
-	kvm_set_tsc_khz(vcpu, max_tsc_khz);
+	kvm_set_tsc_khz(vcpu, vcpu->kvm->arch.default_tsc_khz);
 	kvm_vcpu_reset(vcpu, false);
 	kvm_init_mmu(vcpu);
 	vcpu_put(vcpu);
···
 	free_cpumask_var(vcpu->arch.wbinvd_dirty_mask);
 	fpu_free_guest_fpstate(&vcpu->arch.guest_fpu);
 
+	kvm_xen_destroy_vcpu(vcpu);
 	kvm_hv_vcpu_uninit(vcpu);
 	kvm_pmu_destroy(vcpu);
 	kfree(vcpu->arch.mce_banks);
···
 	drop_user_return_notifiers();
 }
 
+static inline void kvm_ops_update(struct kvm_x86_init_ops *ops)
+{
+	memcpy(&kvm_x86_ops, ops->runtime_ops, sizeof(kvm_x86_ops));
+
+#define __KVM_X86_OP(func) \
+	static_call_update(kvm_x86_##func, kvm_x86_ops.func);
+#define KVM_X86_OP(func) \
+	WARN_ON(!kvm_x86_ops.func); __KVM_X86_OP(func)
+#define KVM_X86_OP_OPTIONAL __KVM_X86_OP
+#define KVM_X86_OP_OPTIONAL_RET0(func) \
+	static_call_update(kvm_x86_##func, (void *)kvm_x86_ops.func ? : \
+					   (void *)__static_call_return0);
+#include <asm/kvm-x86-ops.h>
+#undef __KVM_X86_OP
+
+	kvm_pmu_ops_update(ops->pmu_ops);
+}
+
 int kvm_arch_hardware_setup(void *opaque)
 {
 	struct kvm_x86_init_ops *ops = opaque;
···
 	if (r != 0)
 		return r;
 
-	memcpy(&kvm_x86_ops, ops->runtime_ops, sizeof(kvm_x86_ops));
-	kvm_ops_static_call_update();
+	kvm_ops_update(ops);
 
 	kvm_register_perf_callbacks(ops->handle_intel_pt_intr);
 
···
 	pvclock_update_vm_gtod_copy(kvm);
 	raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
 
+	kvm->arch.default_tsc_khz = max_tsc_khz ? : tsc_khz;
 	kvm->arch.guest_can_read_msr_platform_info = true;
 	kvm->arch.enable_pmu = enable_pmu;
 
···
 	vcpu_put(vcpu);
 }
 
-static void kvm_free_vcpus(struct kvm *kvm)
+static void kvm_unload_vcpu_mmus(struct kvm *kvm)
 {
 	unsigned long i;
 	struct kvm_vcpu *vcpu;
 
-	/*
-	 * Unpin any mmu pages first.
-	 */
 	kvm_for_each_vcpu(i, vcpu, kvm) {
 		kvm_clear_async_pf_completion_queue(vcpu);
 		kvm_unload_vcpu_mmu(vcpu);
 	}
-
-	kvm_destroy_vcpus(kvm);
 }
 
 void kvm_arch_sync_events(struct kvm *kvm)
···
 		__x86_set_memory_region(kvm, TSS_PRIVATE_MEMSLOT, 0, 0);
 		mutex_unlock(&kvm->slots_lock);
 	}
+	kvm_unload_vcpu_mmus(kvm);
 	static_call_cond(kvm_x86_vm_destroy)(kvm);
 	kvm_free_msr_filter(srcu_dereference_check(kvm->arch.msr_filter, &kvm->srcu, 1));
 	kvm_pic_destroy(kvm);
 	kvm_ioapic_destroy(kvm);
-	kvm_free_vcpus(kvm);
+	kvm_destroy_vcpus(kvm);
 	kvfree(rcu_dereference_check(kvm->arch.apic_map, 1));
 	kfree(srcu_dereference_check(kvm->arch.pmu_event_filter, &kvm->srcu, 1));
 	kvm_mmu_uninit_vm(kvm);
···
 	    kvm_x86_ops.nested_ops->hv_timer_pending(vcpu))
 		return true;
 
+	if (kvm_xen_has_pending_events(vcpu))
+		return true;
+
+	if (kvm_test_request(KVM_REQ_TRIPLE_FAULT, vcpu))
+		return true;
+
 	return false;
 }
···
 	kvm_make_request(KVM_REQ_EVENT, vcpu);
 }
 EXPORT_SYMBOL_GPL(kvm_set_rflags);
-
-void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
-{
-	int r;
-
-	if ((vcpu->arch.mmu->direct_map != work->arch.direct_map) ||
-	      work->wakeup_all)
-		return;
-
-	r = kvm_mmu_reload(vcpu);
-	if (unlikely(r))
-		return;
-
-	if (!vcpu->arch.mmu->direct_map &&
-	      work->arch.cr3 != vcpu->arch.mmu->get_guest_pgd(vcpu))
-		return;
-
-	kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
-}
 
 static inline u32 kvm_async_pf_hash_fn(gfn_t gfn)
 {
···
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_avic_unaccelerated_access);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_avic_incomplete_ipi);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_avic_ga_log);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_avic_kick_vcpu_slowpath);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_apicv_accept_irq);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_enter);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_exit);
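The rewritten kvm_setup_guest_pvclock() keeps the pvclock ABI's seqcount-like contract: the writer publishes an odd version before touching the data and an even one afterwards, with write barriers between the three stores, so a concurrently reading guest retries instead of seeing a torn update. A single-threaded userspace sketch of both sides (struct and names hypothetical, GCC/Clang `__atomic_thread_fence` standing in for smp_wmb()/smp_rmb()):

```c
#include <assert.h>
#include <stdint.h>

/* Miniature pvclock-style record: odd version => update in flight. */
struct clock_page {
	volatile uint32_t version;
	uint64_t tsc_timestamp;
};

static void clock_write(struct clock_page *pg, uint64_t tsc)
{
	/* Force the guest-visible version odd, even if it held junk. */
	pg->version = (pg->version + 1) | 1;
	__atomic_thread_fence(__ATOMIC_RELEASE);	/* smp_wmb() analogue */

	pg->tsc_timestamp = tsc;

	__atomic_thread_fence(__ATOMIC_RELEASE);
	pg->version++;					/* back to even: consistent */
}

static uint64_t clock_read(const struct clock_page *pg)
{
	uint32_t v;
	uint64_t tsc;

	do {
		v = pg->version;
		__atomic_thread_fence(__ATOMIC_ACQUIRE);
		tsc = pg->tsc_timestamp;
		__atomic_thread_fence(__ATOMIC_ACQUIRE);
	} while ((v & 1) || v != pg->version);	/* retry if mid-update */

	return tsc;
}
```

The `(version + 1) | 1` idiom is why the old three-write `kvm_write_guest_offset_cached()` dance could collapse to direct stores once the page is pinned via the pfn cache: with a mapped pointer, each store is a single machine write whose ordering the barriers control directly.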
+1022 -223
arch/x86/kvm/xen.c
···
 #include "x86.h"
 #include "xen.h"
 #include "hyperv.h"
+#include "lapic.h"
 
+#include <linux/eventfd.h>
 #include <linux/kvm_host.h>
 #include <linux/sched/stat.h>
 
 #include <trace/events/kvm.h>
 #include <xen/interface/xen.h>
 #include <xen/interface/vcpu.h>
+#include <xen/interface/version.h>
 #include <xen/interface/event_channel.h>
+#include <xen/interface/sched.h>
 
 #include "trace.h"
+
+static int kvm_xen_set_evtchn(struct kvm_xen_evtchn *xe, struct kvm *kvm);
+static int kvm_xen_setattr_evtchn(struct kvm *kvm, struct kvm_xen_hvm_attr *data);
+static bool kvm_xen_hcall_evtchn_send(struct kvm_vcpu *vcpu, u64 param, u64 *r);
 
 DEFINE_STATIC_KEY_DEFERRED_FALSE(kvm_xen_enabled, HZ);
 
···
 	return ret;
 }
 
+void kvm_xen_inject_timer_irqs(struct kvm_vcpu *vcpu)
+{
+	if (atomic_read(&vcpu->arch.xen.timer_pending) > 0) {
+		struct kvm_xen_evtchn e;
+
+		e.vcpu_id = vcpu->vcpu_id;
+		e.vcpu_idx = vcpu->vcpu_idx;
+		e.port = vcpu->arch.xen.timer_virq;
+		e.priority = KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL;
+
+		kvm_xen_set_evtchn(&e, vcpu->kvm);
+
+		vcpu->arch.xen.timer_expires = 0;
+		atomic_set(&vcpu->arch.xen.timer_pending, 0);
+	}
+}
+
+static enum hrtimer_restart xen_timer_callback(struct hrtimer *timer)
+{
+	struct kvm_vcpu *vcpu = container_of(timer, struct kvm_vcpu,
+					     arch.xen.timer);
+	if (atomic_read(&vcpu->arch.xen.timer_pending))
+		return HRTIMER_NORESTART;
+
+	atomic_inc(&vcpu->arch.xen.timer_pending);
+	kvm_make_request(KVM_REQ_UNBLOCK, vcpu);
+	kvm_vcpu_kick(vcpu);
+
+	return HRTIMER_NORESTART;
+}
+
+static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, s64 delta_ns)
+{
+	atomic_set(&vcpu->arch.xen.timer_pending, 0);
+	vcpu->arch.xen.timer_expires = guest_abs;
+
+	if (delta_ns <= 0) {
+		xen_timer_callback(&vcpu->arch.xen.timer);
+	} else {
+		ktime_t ktime_now = ktime_get();
+		hrtimer_start(&vcpu->arch.xen.timer,
+			      ktime_add_ns(ktime_now, delta_ns),
+			      HRTIMER_MODE_ABS_HARD);
+	}
+}
+
+static void kvm_xen_stop_timer(struct kvm_vcpu *vcpu)
+{
+	hrtimer_cancel(&vcpu->arch.xen.timer);
+	vcpu->arch.xen.timer_expires = 0;
+	atomic_set(&vcpu->arch.xen.timer_pending, 0);
+}
+
+static void kvm_xen_init_timer(struct kvm_vcpu *vcpu)
+{
+	hrtimer_init(&vcpu->arch.xen.timer, CLOCK_MONOTONIC,
+		     HRTIMER_MODE_ABS_HARD);
+	vcpu->arch.xen.timer.function = xen_timer_callback;
+}
+
 static void kvm_xen_update_runstate(struct kvm_vcpu *v, int state)
 {
 	struct kvm_vcpu_xen *vx = &v->arch.xen;
···
 void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, int state)
 {
 	struct kvm_vcpu_xen *vx = &v->arch.xen;
-	struct gfn_to_hva_cache *ghc = &vx->runstate_cache;
-	struct kvm_memslots *slots = kvm_memslots(v->kvm);
-	bool atomic = (state == RUNSTATE_runnable);
-	uint64_t state_entry_time;
-	int __user *user_state;
-	uint64_t __user *user_times;
+	struct gfn_to_pfn_cache *gpc = &vx->runstate_cache;
+	uint64_t *user_times;
+	unsigned long flags;
+	size_t user_len;
+	int *user_state;
 
 	kvm_xen_update_runstate(v, state);
 
-	if (!vx->runstate_set)
+	if (!vx->runstate_cache.active)
 		return;
 
-	if (unlikely(slots->generation != ghc->generation || kvm_is_error_hva(ghc->hva)) &&
-	    kvm_gfn_to_hva_cache_init(v->kvm, ghc, ghc->gpa, ghc->len))
-		return;
+	if (IS_ENABLED(CONFIG_64BIT) && v->kvm->arch.xen.long_mode)
+		user_len = sizeof(struct vcpu_runstate_info);
+	else
+		user_len = sizeof(struct compat_vcpu_runstate_info);
 
-	/* We made sure it fits in a single page */
-	BUG_ON(!ghc->memslot
152 + read_lock_irqsave(&gpc->lock, flags); 153 + while (!kvm_gfn_to_pfn_cache_check(v->kvm, gpc, gpc->gpa, 154 + user_len)) { 155 + read_unlock_irqrestore(&gpc->lock, flags); 222 156 223 - if (atomic) 224 - pagefault_disable(); 157 + /* When invoked from kvm_sched_out() we cannot sleep */ 158 + if (state == RUNSTATE_runnable) 159 + return; 160 + 161 + if (kvm_gfn_to_pfn_cache_refresh(v->kvm, gpc, gpc->gpa, user_len)) 162 + return; 163 + 164 + read_lock_irqsave(&gpc->lock, flags); 165 + } 225 166 226 167 /* 227 168 * The only difference between 32-bit and 64-bit versions of the ··· 244 167 */ 245 168 BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, state) != 0); 246 169 BUILD_BUG_ON(offsetof(struct compat_vcpu_runstate_info, state) != 0); 247 - user_state = (int __user *)ghc->hva; 248 - 249 170 BUILD_BUG_ON(sizeof(struct compat_vcpu_runstate_info) != 0x2c); 250 - 251 - user_times = (uint64_t __user *)(ghc->hva + 252 - offsetof(struct compat_vcpu_runstate_info, 253 - state_entry_time)); 254 171 #ifdef CONFIG_X86_64 255 172 BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, state_entry_time) != 256 173 offsetof(struct compat_vcpu_runstate_info, state_entry_time) + 4); 257 174 BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, time) != 258 175 offsetof(struct compat_vcpu_runstate_info, time) + 4); 259 - 260 - if (v->kvm->arch.xen.long_mode) 261 - user_times = (uint64_t __user *)(ghc->hva + 262 - offsetof(struct vcpu_runstate_info, 263 - state_entry_time)); 264 176 #endif 177 + 178 + user_state = gpc->khva; 179 + 180 + if (IS_ENABLED(CONFIG_64BIT) && v->kvm->arch.xen.long_mode) 181 + user_times = gpc->khva + offsetof(struct vcpu_runstate_info, 182 + state_entry_time); 183 + else 184 + user_times = gpc->khva + offsetof(struct compat_vcpu_runstate_info, 185 + state_entry_time); 186 + 265 187 /* 266 188 * First write the updated state_entry_time at the appropriate 267 189 * location determined by 'offset'. 
268 190 */ 269 - state_entry_time = vx->runstate_entry_time; 270 - state_entry_time |= XEN_RUNSTATE_UPDATE; 271 - 272 191 BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, state_entry_time) != 273 - sizeof(state_entry_time)); 192 + sizeof(user_times[0])); 274 193 BUILD_BUG_ON(sizeof_field(struct compat_vcpu_runstate_info, state_entry_time) != 275 - sizeof(state_entry_time)); 194 + sizeof(user_times[0])); 276 195 277 - if (__put_user(state_entry_time, user_times)) 278 - goto out; 196 + user_times[0] = vx->runstate_entry_time | XEN_RUNSTATE_UPDATE; 279 197 smp_wmb(); 280 198 281 199 /* ··· 284 212 BUILD_BUG_ON(sizeof_field(struct compat_vcpu_runstate_info, state) != 285 213 sizeof(vx->current_runstate)); 286 214 287 - if (__put_user(vx->current_runstate, user_state)) 288 - goto out; 215 + *user_state = vx->current_runstate; 289 216 290 217 /* 291 218 * Write the actual runstate times immediately after the ··· 299 228 BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, time) != 300 229 sizeof(vx->runstate_times)); 301 230 302 - if (__copy_to_user(user_times + 1, vx->runstate_times, sizeof(vx->runstate_times))) 303 - goto out; 231 + memcpy(user_times + 1, vx->runstate_times, sizeof(vx->runstate_times)); 304 232 smp_wmb(); 305 233 306 234 /* 307 235 * Finally, clear the XEN_RUNSTATE_UPDATE bit in the guest's 308 236 * runstate_entry_time field. 
309 237 */ 310 - state_entry_time &= ~XEN_RUNSTATE_UPDATE; 311 - __put_user(state_entry_time, user_times); 238 + user_times[0] &= ~XEN_RUNSTATE_UPDATE; 312 239 smp_wmb(); 313 240 314 - out: 315 - mark_page_dirty_in_slot(v->kvm, ghc->memslot, ghc->gpa >> PAGE_SHIFT); 241 + read_unlock_irqrestore(&gpc->lock, flags); 316 242 317 - if (atomic) 318 - pagefault_enable(); 243 + mark_page_dirty_in_slot(v->kvm, gpc->memslot, gpc->gpa >> PAGE_SHIFT); 244 + } 245 + 246 + static void kvm_xen_inject_vcpu_vector(struct kvm_vcpu *v) 247 + { 248 + struct kvm_lapic_irq irq = { }; 249 + int r; 250 + 251 + irq.dest_id = v->vcpu_id; 252 + irq.vector = v->arch.xen.upcall_vector; 253 + irq.dest_mode = APIC_DEST_PHYSICAL; 254 + irq.shorthand = APIC_DEST_NOSHORT; 255 + irq.delivery_mode = APIC_DM_FIXED; 256 + irq.level = 1; 257 + 258 + /* The fast version will always work for physical unicast */ 259 + WARN_ON_ONCE(!kvm_irq_delivery_to_apic_fast(v->kvm, NULL, &irq, &r, NULL)); 260 + } 261 + 262 + /* 263 + * On event channel delivery, the vcpu_info may not have been accessible. 264 + * In that case, there are bits in vcpu->arch.xen.evtchn_pending_sel which 265 + * need to be marked into the vcpu_info (and evtchn_upcall_pending set). 266 + * Do so now that we can sleep in the context of the vCPU to bring the 267 + * page in, and refresh the pfn cache for it. 268 + */ 269 + void kvm_xen_inject_pending_events(struct kvm_vcpu *v) 270 + { 271 + unsigned long evtchn_pending_sel = READ_ONCE(v->arch.xen.evtchn_pending_sel); 272 + struct gfn_to_pfn_cache *gpc = &v->arch.xen.vcpu_info_cache; 273 + unsigned long flags; 274 + 275 + if (!evtchn_pending_sel) 276 + return; 277 + 278 + /* 279 + * Yes, this is an open-coded loop. But that's just what put_user() 280 + * does anyway. Page it in and retry the instruction. We're just a 281 + * little more honest about it. 
282 + */ 283 + read_lock_irqsave(&gpc->lock, flags); 284 + while (!kvm_gfn_to_pfn_cache_check(v->kvm, gpc, gpc->gpa, 285 + sizeof(struct vcpu_info))) { 286 + read_unlock_irqrestore(&gpc->lock, flags); 287 + 288 + if (kvm_gfn_to_pfn_cache_refresh(v->kvm, gpc, gpc->gpa, 289 + sizeof(struct vcpu_info))) 290 + return; 291 + 292 + read_lock_irqsave(&gpc->lock, flags); 293 + } 294 + 295 + /* Now gpc->khva is a valid kernel address for the vcpu_info */ 296 + if (IS_ENABLED(CONFIG_64BIT) && v->kvm->arch.xen.long_mode) { 297 + struct vcpu_info *vi = gpc->khva; 298 + 299 + asm volatile(LOCK_PREFIX "orq %0, %1\n" 300 + "notq %0\n" 301 + LOCK_PREFIX "andq %0, %2\n" 302 + : "=r" (evtchn_pending_sel), 303 + "+m" (vi->evtchn_pending_sel), 304 + "+m" (v->arch.xen.evtchn_pending_sel) 305 + : "0" (evtchn_pending_sel)); 306 + WRITE_ONCE(vi->evtchn_upcall_pending, 1); 307 + } else { 308 + u32 evtchn_pending_sel32 = evtchn_pending_sel; 309 + struct compat_vcpu_info *vi = gpc->khva; 310 + 311 + asm volatile(LOCK_PREFIX "orl %0, %1\n" 312 + "notl %0\n" 313 + LOCK_PREFIX "andl %0, %2\n" 314 + : "=r" (evtchn_pending_sel32), 315 + "+m" (vi->evtchn_pending_sel), 316 + "+m" (v->arch.xen.evtchn_pending_sel) 317 + : "0" (evtchn_pending_sel32)); 318 + WRITE_ONCE(vi->evtchn_upcall_pending, 1); 319 + } 320 + read_unlock_irqrestore(&gpc->lock, flags); 321 + 322 + /* For the per-vCPU lapic vector, deliver it as MSI. 
*/ 323 + if (v->arch.xen.upcall_vector) 324 + kvm_xen_inject_vcpu_vector(v); 325 + 326 + mark_page_dirty_in_slot(v->kvm, gpc->memslot, gpc->gpa >> PAGE_SHIFT); 319 327 } 320 328 321 329 int __kvm_xen_has_interrupt(struct kvm_vcpu *v) 322 330 { 323 - unsigned long evtchn_pending_sel = READ_ONCE(v->arch.xen.evtchn_pending_sel); 324 - bool atomic = in_atomic() || !task_is_running(current); 325 - int err; 331 + struct gfn_to_pfn_cache *gpc = &v->arch.xen.vcpu_info_cache; 332 + unsigned long flags; 326 333 u8 rc = 0; 327 334 328 335 /* 329 336 * If the global upcall vector (HVMIRQ_callback_vector) is set and 330 337 * the vCPU's evtchn_upcall_pending flag is set, the IRQ is pending. 331 338 */ 332 - struct gfn_to_hva_cache *ghc = &v->arch.xen.vcpu_info_cache; 333 - struct kvm_memslots *slots = kvm_memslots(v->kvm); 334 - bool ghc_valid = slots->generation == ghc->generation && 335 - !kvm_is_error_hva(ghc->hva) && ghc->memslot; 336 - 337 - unsigned int offset = offsetof(struct vcpu_info, evtchn_upcall_pending); 338 339 339 340 /* No need for compat handling here */ 340 341 BUILD_BUG_ON(offsetof(struct vcpu_info, evtchn_upcall_pending) != ··· 416 273 BUILD_BUG_ON(sizeof(rc) != 417 274 sizeof_field(struct compat_vcpu_info, evtchn_upcall_pending)); 418 275 419 - /* 420 - * For efficiency, this mirrors the checks for using the valid 421 - * cache in kvm_read_guest_offset_cached(), but just uses 422 - * __get_user() instead. And falls back to the slow path. 
423 - */ 424 - if (!evtchn_pending_sel && ghc_valid) { 425 - /* Fast path */ 426 - pagefault_disable(); 427 - err = __get_user(rc, (u8 __user *)ghc->hva + offset); 428 - pagefault_enable(); 429 - if (!err) 430 - return rc; 431 - } 276 + read_lock_irqsave(&gpc->lock, flags); 277 + while (!kvm_gfn_to_pfn_cache_check(v->kvm, gpc, gpc->gpa, 278 + sizeof(struct vcpu_info))) { 279 + read_unlock_irqrestore(&gpc->lock, flags); 432 280 433 - /* Slow path */ 281 + /* 282 + * This function gets called from kvm_vcpu_block() after setting the 283 + * task to TASK_INTERRUPTIBLE, to see if it needs to wake immediately 284 + * from a HLT. So we really mustn't sleep. If the page ended up absent 285 + * at that point, just return 1 in order to trigger an immediate wake, 286 + * and we'll end up getting called again from a context where we *can* 287 + * fault in the page and wait for it. 288 + */ 289 + if (in_atomic() || !task_is_running(current)) 290 + return 1; 434 291 435 - /* 436 - * This function gets called from kvm_vcpu_block() after setting the 437 - * task to TASK_INTERRUPTIBLE, to see if it needs to wake immediately 438 - * from a HLT. So we really mustn't sleep. If the page ended up absent 439 - * at that point, just return 1 in order to trigger an immediate wake, 440 - * and we'll end up getting called again from a context where we *can* 441 - * fault in the page and wait for it. 442 - */ 443 - if (atomic) 444 - return 1; 445 - 446 - if (!ghc_valid) { 447 - err = kvm_gfn_to_hva_cache_init(v->kvm, ghc, ghc->gpa, ghc->len); 448 - if (err || !ghc->memslot) { 292 + if (kvm_gfn_to_pfn_cache_refresh(v->kvm, gpc, gpc->gpa, 293 + sizeof(struct vcpu_info))) { 449 294 /* 450 295 * If this failed, userspace has screwed up the 451 296 * vcpu_info mapping. No interrupts for you. 
452 297 */ 453 298 return 0; 454 299 } 300 + read_lock_irqsave(&gpc->lock, flags); 455 301 } 456 302 457 - /* 458 - * Now we have a valid (protected by srcu) userspace HVA in 459 - * ghc->hva which points to the struct vcpu_info. If there 460 - * are any bits in the in-kernel evtchn_pending_sel then 461 - * we need to write those to the guest vcpu_info and set 462 - * its evtchn_upcall_pending flag. If there aren't any bits 463 - * to add, we only want to *check* evtchn_upcall_pending. 464 - */ 465 - if (evtchn_pending_sel) { 466 - bool long_mode = v->kvm->arch.xen.long_mode; 467 - 468 - if (!user_access_begin((void __user *)ghc->hva, sizeof(struct vcpu_info))) 469 - return 0; 470 - 471 - if (IS_ENABLED(CONFIG_64BIT) && long_mode) { 472 - struct vcpu_info __user *vi = (void __user *)ghc->hva; 473 - 474 - /* Attempt to set the evtchn_pending_sel bits in the 475 - * guest, and if that succeeds then clear the same 476 - * bits in the in-kernel version. */ 477 - asm volatile("1:\t" LOCK_PREFIX "orq %0, %1\n" 478 - "\tnotq %0\n" 479 - "\t" LOCK_PREFIX "andq %0, %2\n" 480 - "2:\n" 481 - _ASM_EXTABLE_UA(1b, 2b) 482 - : "=r" (evtchn_pending_sel), 483 - "+m" (vi->evtchn_pending_sel), 484 - "+m" (v->arch.xen.evtchn_pending_sel) 485 - : "0" (evtchn_pending_sel)); 486 - } else { 487 - struct compat_vcpu_info __user *vi = (void __user *)ghc->hva; 488 - u32 evtchn_pending_sel32 = evtchn_pending_sel; 489 - 490 - /* Attempt to set the evtchn_pending_sel bits in the 491 - * guest, and if that succeeds then clear the same 492 - * bits in the in-kernel version. 
*/ 493 - asm volatile("1:\t" LOCK_PREFIX "orl %0, %1\n" 494 - "\tnotl %0\n" 495 - "\t" LOCK_PREFIX "andl %0, %2\n" 496 - "2:\n" 497 - _ASM_EXTABLE_UA(1b, 2b) 498 - : "=r" (evtchn_pending_sel32), 499 - "+m" (vi->evtchn_pending_sel), 500 - "+m" (v->arch.xen.evtchn_pending_sel) 501 - : "0" (evtchn_pending_sel32)); 502 - } 503 - rc = 1; 504 - unsafe_put_user(rc, (u8 __user *)ghc->hva + offset, err); 505 - 506 - err: 507 - user_access_end(); 508 - 509 - mark_page_dirty_in_slot(v->kvm, ghc->memslot, ghc->gpa >> PAGE_SHIFT); 510 - } else { 511 - __get_user(rc, (u8 __user *)ghc->hva + offset); 512 - } 513 - 303 + rc = ((struct vcpu_info *)gpc->khva)->evtchn_upcall_pending; 304 + read_unlock_irqrestore(&gpc->lock, flags); 514 305 return rc; 515 306 } 516 307 ··· 452 375 { 453 376 int r = -ENOENT; 454 377 455 - mutex_lock(&kvm->lock); 456 378 457 379 switch (data->type) { 458 380 case KVM_XEN_ATTR_TYPE_LONG_MODE: 459 381 if (!IS_ENABLED(CONFIG_64BIT) && data->u.long_mode) { 460 382 r = -EINVAL; 461 383 } else { 384 + mutex_lock(&kvm->lock); 462 385 kvm->arch.xen.long_mode = !!data->u.long_mode; 386 + mutex_unlock(&kvm->lock); 463 387 r = 0; 464 388 } 465 389 break; 466 390 467 391 case KVM_XEN_ATTR_TYPE_SHARED_INFO: 392 + mutex_lock(&kvm->lock); 468 393 r = kvm_xen_shared_info_init(kvm, data->u.shared_info.gfn); 394 + mutex_unlock(&kvm->lock); 469 395 break; 470 396 471 397 case KVM_XEN_ATTR_TYPE_UPCALL_VECTOR: 472 398 if (data->u.vector && data->u.vector < 0x10) 473 399 r = -EINVAL; 474 400 else { 401 + mutex_lock(&kvm->lock); 475 402 kvm->arch.xen.upcall_vector = data->u.vector; 403 + mutex_unlock(&kvm->lock); 476 404 r = 0; 477 405 } 406 + break; 407 + 408 + case KVM_XEN_ATTR_TYPE_EVTCHN: 409 + r = kvm_xen_setattr_evtchn(kvm, data); 410 + break; 411 + 412 + case KVM_XEN_ATTR_TYPE_XEN_VERSION: 413 + mutex_lock(&kvm->lock); 414 + kvm->arch.xen.xen_version = data->u.xen_version; 415 + mutex_unlock(&kvm->lock); 416 + r = 0; 478 417 break; 479 418 480 419 default: 481 420 
break; 482 421 } 483 422 484 - mutex_unlock(&kvm->lock); 485 423 return r; 486 424 } 487 425 ··· 525 433 r = 0; 526 434 break; 527 435 436 + case KVM_XEN_ATTR_TYPE_XEN_VERSION: 437 + data->u.xen_version = kvm->arch.xen.xen_version; 438 + r = 0; 439 + break; 440 + 528 441 default: 529 442 break; 530 443 } ··· 554 457 offsetof(struct compat_vcpu_info, time)); 555 458 556 459 if (data->u.gpa == GPA_INVALID) { 557 - vcpu->arch.xen.vcpu_info_set = false; 460 + kvm_gfn_to_pfn_cache_destroy(vcpu->kvm, &vcpu->arch.xen.vcpu_info_cache); 558 461 r = 0; 559 462 break; 560 463 } 561 464 562 - /* It must fit within a single page */ 563 - if ((data->u.gpa & ~PAGE_MASK) + sizeof(struct vcpu_info) > PAGE_SIZE) { 564 - r = -EINVAL; 565 - break; 566 - } 567 - 568 - r = kvm_gfn_to_hva_cache_init(vcpu->kvm, 465 + r = kvm_gfn_to_pfn_cache_init(vcpu->kvm, 569 466 &vcpu->arch.xen.vcpu_info_cache, 570 - data->u.gpa, 467 + NULL, KVM_HOST_USES_PFN, data->u.gpa, 571 468 sizeof(struct vcpu_info)); 572 - if (!r) { 573 - vcpu->arch.xen.vcpu_info_set = true; 469 + if (!r) 574 470 kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu); 575 - } 471 + 576 472 break; 577 473 578 474 case KVM_XEN_VCPU_ATTR_TYPE_VCPU_TIME_INFO: 579 475 if (data->u.gpa == GPA_INVALID) { 580 - vcpu->arch.xen.vcpu_time_info_set = false; 476 + kvm_gfn_to_pfn_cache_destroy(vcpu->kvm, 477 + &vcpu->arch.xen.vcpu_time_info_cache); 581 478 r = 0; 582 479 break; 583 480 } 584 481 585 - /* It must fit within a single page */ 586 - if ((data->u.gpa & ~PAGE_MASK) + sizeof(struct pvclock_vcpu_time_info) > PAGE_SIZE) { 587 - r = -EINVAL; 588 - break; 589 - } 590 - 591 - r = kvm_gfn_to_hva_cache_init(vcpu->kvm, 482 + r = kvm_gfn_to_pfn_cache_init(vcpu->kvm, 592 483 &vcpu->arch.xen.vcpu_time_info_cache, 593 - data->u.gpa, 484 + NULL, KVM_HOST_USES_PFN, data->u.gpa, 594 485 sizeof(struct pvclock_vcpu_time_info)); 595 - if (!r) { 596 - vcpu->arch.xen.vcpu_time_info_set = true; 486 + if (!r) 597 487 kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu); 
598 - } 599 488 break; 600 489 601 490 case KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADDR: ··· 590 507 break; 591 508 } 592 509 if (data->u.gpa == GPA_INVALID) { 593 - vcpu->arch.xen.runstate_set = false; 510 + kvm_gfn_to_pfn_cache_destroy(vcpu->kvm, 511 + &vcpu->arch.xen.runstate_cache); 594 512 r = 0; 595 513 break; 596 514 } 597 515 598 - /* It must fit within a single page */ 599 - if ((data->u.gpa & ~PAGE_MASK) + sizeof(struct vcpu_runstate_info) > PAGE_SIZE) { 600 - r = -EINVAL; 601 - break; 602 - } 603 - 604 - r = kvm_gfn_to_hva_cache_init(vcpu->kvm, 516 + r = kvm_gfn_to_pfn_cache_init(vcpu->kvm, 605 517 &vcpu->arch.xen.runstate_cache, 606 - data->u.gpa, 518 + NULL, KVM_HOST_USES_PFN, data->u.gpa, 607 519 sizeof(struct vcpu_runstate_info)); 608 - if (!r) { 609 - vcpu->arch.xen.runstate_set = true; 610 - } 611 520 break; 612 521 613 522 case KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_CURRENT: ··· 697 622 r = 0; 698 623 break; 699 624 625 + case KVM_XEN_VCPU_ATTR_TYPE_VCPU_ID: 626 + if (data->u.vcpu_id >= KVM_MAX_VCPUS) 627 + r = -EINVAL; 628 + else { 629 + vcpu->arch.xen.vcpu_id = data->u.vcpu_id; 630 + r = 0; 631 + } 632 + break; 633 + 634 + case KVM_XEN_VCPU_ATTR_TYPE_TIMER: 635 + if (data->u.timer.port) { 636 + if (data->u.timer.priority != KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL) { 637 + r = -EINVAL; 638 + break; 639 + } 640 + vcpu->arch.xen.timer_virq = data->u.timer.port; 641 + kvm_xen_init_timer(vcpu); 642 + 643 + /* Restart the timer if it's set */ 644 + if (data->u.timer.expires_ns) 645 + kvm_xen_start_timer(vcpu, data->u.timer.expires_ns, 646 + data->u.timer.expires_ns - 647 + get_kvmclock_ns(vcpu->kvm)); 648 + } else if (kvm_xen_timer_enabled(vcpu)) { 649 + kvm_xen_stop_timer(vcpu); 650 + vcpu->arch.xen.timer_virq = 0; 651 + } 652 + 653 + r = 0; 654 + break; 655 + 656 + case KVM_XEN_VCPU_ATTR_TYPE_UPCALL_VECTOR: 657 + if (data->u.vector && data->u.vector < 0x10) 658 + r = -EINVAL; 659 + else { 660 + vcpu->arch.xen.upcall_vector = data->u.vector; 661 + r = 0; 662 + } 
663 + break; 664 + 700 665 default: 701 666 break; 702 667 } ··· 754 639 755 640 switch (data->type) { 756 641 case KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO: 757 - if (vcpu->arch.xen.vcpu_info_set) 642 + if (vcpu->arch.xen.vcpu_info_cache.active) 758 643 data->u.gpa = vcpu->arch.xen.vcpu_info_cache.gpa; 759 644 else 760 645 data->u.gpa = GPA_INVALID; ··· 762 647 break; 763 648 764 649 case KVM_XEN_VCPU_ATTR_TYPE_VCPU_TIME_INFO: 765 - if (vcpu->arch.xen.vcpu_time_info_set) 650 + if (vcpu->arch.xen.vcpu_time_info_cache.active) 766 651 data->u.gpa = vcpu->arch.xen.vcpu_time_info_cache.gpa; 767 652 else 768 653 data->u.gpa = GPA_INVALID; ··· 774 659 r = -EOPNOTSUPP; 775 660 break; 776 661 } 777 - if (vcpu->arch.xen.runstate_set) { 662 + if (vcpu->arch.xen.runstate_cache.active) { 778 663 data->u.gpa = vcpu->arch.xen.runstate_cache.gpa; 779 664 r = 0; 780 665 } ··· 810 695 811 696 case KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST: 812 697 r = -EINVAL; 698 + break; 699 + 700 + case KVM_XEN_VCPU_ATTR_TYPE_VCPU_ID: 701 + data->u.vcpu_id = vcpu->arch.xen.vcpu_id; 702 + r = 0; 703 + break; 704 + 705 + case KVM_XEN_VCPU_ATTR_TYPE_TIMER: 706 + data->u.timer.port = vcpu->arch.xen.timer_virq; 707 + data->u.timer.priority = KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL; 708 + data->u.timer.expires_ns = vcpu->arch.xen.timer_expires; 709 + r = 0; 710 + break; 711 + 712 + case KVM_XEN_VCPU_ATTR_TYPE_UPCALL_VECTOR: 713 + data->u.vector = vcpu->arch.xen.upcall_vector; 714 + r = 0; 813 715 break; 814 716 815 717 default: ··· 909 777 910 778 int kvm_xen_hvm_config(struct kvm *kvm, struct kvm_xen_hvm_config *xhc) 911 779 { 912 - if (xhc->flags & ~KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL) 780 + /* Only some feature flags need to be *enabled* by userspace */ 781 + u32 permitted_flags = KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL | 782 + KVM_XEN_HVM_CONFIG_EVTCHN_SEND; 783 + 784 + if (xhc->flags & ~permitted_flags) 913 785 return -EINVAL; 914 786 915 787 /* ··· 938 802 return 0; 939 803 } 940 804 941 - void 
kvm_xen_init_vm(struct kvm *kvm) 942 - { 943 - } 944 - 945 - void kvm_xen_destroy_vm(struct kvm *kvm) 946 - { 947 - kvm_gfn_to_pfn_cache_destroy(kvm, &kvm->arch.xen.shinfo_cache); 948 - 949 - if (kvm->arch.xen_hvm_config.msr) 950 - static_branch_slow_dec_deferred(&kvm_xen_enabled); 951 - } 952 - 953 805 static int kvm_xen_hypercall_set_result(struct kvm_vcpu *vcpu, u64 result) 954 806 { 955 807 kvm_rax_write(vcpu, result); ··· 954 830 return kvm_xen_hypercall_set_result(vcpu, run->xen.u.hcall.result); 955 831 } 956 832 833 + static bool wait_pending_event(struct kvm_vcpu *vcpu, int nr_ports, 834 + evtchn_port_t *ports) 835 + { 836 + struct kvm *kvm = vcpu->kvm; 837 + struct gfn_to_pfn_cache *gpc = &kvm->arch.xen.shinfo_cache; 838 + unsigned long *pending_bits; 839 + unsigned long flags; 840 + bool ret = true; 841 + int idx, i; 842 + 843 + read_lock_irqsave(&gpc->lock, flags); 844 + idx = srcu_read_lock(&kvm->srcu); 845 + if (!kvm_gfn_to_pfn_cache_check(kvm, gpc, gpc->gpa, PAGE_SIZE)) 846 + goto out_rcu; 847 + 848 + ret = false; 849 + if (IS_ENABLED(CONFIG_64BIT) && kvm->arch.xen.long_mode) { 850 + struct shared_info *shinfo = gpc->khva; 851 + pending_bits = (unsigned long *)&shinfo->evtchn_pending; 852 + } else { 853 + struct compat_shared_info *shinfo = gpc->khva; 854 + pending_bits = (unsigned long *)&shinfo->evtchn_pending; 855 + } 856 + 857 + for (i = 0; i < nr_ports; i++) { 858 + if (test_bit(ports[i], pending_bits)) { 859 + ret = true; 860 + break; 861 + } 862 + } 863 + 864 + out_rcu: 865 + srcu_read_unlock(&kvm->srcu, idx); 866 + read_unlock_irqrestore(&gpc->lock, flags); 867 + 868 + return ret; 869 + } 870 + 871 + static bool kvm_xen_schedop_poll(struct kvm_vcpu *vcpu, bool longmode, 872 + u64 param, u64 *r) 873 + { 874 + int idx, i; 875 + struct sched_poll sched_poll; 876 + evtchn_port_t port, *ports; 877 + gpa_t gpa; 878 + 879 + if (!longmode || !lapic_in_kernel(vcpu) || 880 + !(vcpu->kvm->arch.xen_hvm_config.flags & KVM_XEN_HVM_CONFIG_EVTCHN_SEND)) 881 + 
return false; 882 + 883 + idx = srcu_read_lock(&vcpu->kvm->srcu); 884 + gpa = kvm_mmu_gva_to_gpa_system(vcpu, param, NULL); 885 + srcu_read_unlock(&vcpu->kvm->srcu, idx); 886 + 887 + if (!gpa || kvm_vcpu_read_guest(vcpu, gpa, &sched_poll, 888 + sizeof(sched_poll))) { 889 + *r = -EFAULT; 890 + return true; 891 + } 892 + 893 + if (unlikely(sched_poll.nr_ports > 1)) { 894 + /* Xen (unofficially) limits number of pollers to 128 */ 895 + if (sched_poll.nr_ports > 128) { 896 + *r = -EINVAL; 897 + return true; 898 + } 899 + 900 + ports = kmalloc_array(sched_poll.nr_ports, 901 + sizeof(*ports), GFP_KERNEL); 902 + if (!ports) { 903 + *r = -ENOMEM; 904 + return true; 905 + } 906 + } else 907 + ports = &port; 908 + 909 + for (i = 0; i < sched_poll.nr_ports; i++) { 910 + idx = srcu_read_lock(&vcpu->kvm->srcu); 911 + gpa = kvm_mmu_gva_to_gpa_system(vcpu, 912 + (gva_t)(sched_poll.ports + i), 913 + NULL); 914 + srcu_read_unlock(&vcpu->kvm->srcu, idx); 915 + 916 + if (!gpa || kvm_vcpu_read_guest(vcpu, gpa, 917 + &ports[i], sizeof(port))) { 918 + *r = -EFAULT; 919 + goto out; 920 + } 921 + } 922 + 923 + if (sched_poll.nr_ports == 1) 924 + vcpu->arch.xen.poll_evtchn = port; 925 + else 926 + vcpu->arch.xen.poll_evtchn = -1; 927 + 928 + set_bit(kvm_vcpu_get_idx(vcpu), vcpu->kvm->arch.xen.poll_mask); 929 + 930 + if (!wait_pending_event(vcpu, sched_poll.nr_ports, ports)) { 931 + vcpu->arch.mp_state = KVM_MP_STATE_HALTED; 932 + 933 + if (sched_poll.timeout) 934 + mod_timer(&vcpu->arch.xen.poll_timer, 935 + jiffies + nsecs_to_jiffies(sched_poll.timeout)); 936 + 937 + kvm_vcpu_halt(vcpu); 938 + 939 + if (sched_poll.timeout) 940 + del_timer(&vcpu->arch.xen.poll_timer); 941 + 942 + vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE; 943 + kvm_clear_request(KVM_REQ_UNHALT, vcpu); 944 + } 945 + 946 + vcpu->arch.xen.poll_evtchn = 0; 947 + *r = 0; 948 + out: 949 + /* Really, this is only needed in case of timeout */ 950 + clear_bit(kvm_vcpu_get_idx(vcpu), vcpu->kvm->arch.xen.poll_mask); 951 + 952 + if 
(unlikely(sched_poll.nr_ports > 1)) 953 + kfree(ports); 954 + return true; 955 + } 956 + 957 + static void cancel_evtchn_poll(struct timer_list *t) 958 + { 959 + struct kvm_vcpu *vcpu = from_timer(vcpu, t, arch.xen.poll_timer); 960 + 961 + kvm_make_request(KVM_REQ_UNBLOCK, vcpu); 962 + kvm_vcpu_kick(vcpu); 963 + } 964 + 965 + static bool kvm_xen_hcall_sched_op(struct kvm_vcpu *vcpu, bool longmode, 966 + int cmd, u64 param, u64 *r) 967 + { 968 + switch (cmd) { 969 + case SCHEDOP_poll: 970 + if (kvm_xen_schedop_poll(vcpu, longmode, param, r)) 971 + return true; 972 + fallthrough; 973 + case SCHEDOP_yield: 974 + kvm_vcpu_on_spin(vcpu, true); 975 + *r = 0; 976 + return true; 977 + default: 978 + break; 979 + } 980 + 981 + return false; 982 + } 983 + 984 + struct compat_vcpu_set_singleshot_timer { 985 + uint64_t timeout_abs_ns; 986 + uint32_t flags; 987 + } __attribute__((packed)); 988 + 989 + static bool kvm_xen_hcall_vcpu_op(struct kvm_vcpu *vcpu, bool longmode, int cmd, 990 + int vcpu_id, u64 param, u64 *r) 991 + { 992 + struct vcpu_set_singleshot_timer oneshot; 993 + s64 delta; 994 + gpa_t gpa; 995 + int idx; 996 + 997 + if (!kvm_xen_timer_enabled(vcpu)) 998 + return false; 999 + 1000 + switch (cmd) { 1001 + case VCPUOP_set_singleshot_timer: 1002 + if (vcpu->arch.xen.vcpu_id != vcpu_id) { 1003 + *r = -EINVAL; 1004 + return true; 1005 + } 1006 + idx = srcu_read_lock(&vcpu->kvm->srcu); 1007 + gpa = kvm_mmu_gva_to_gpa_system(vcpu, param, NULL); 1008 + srcu_read_unlock(&vcpu->kvm->srcu, idx); 1009 + 1010 + /* 1011 + * The only difference for 32-bit compat is the 4 bytes of 1012 + * padding after the interesting part of the structure. So 1013 + * for a faithful emulation of Xen we have to *try* to copy 1014 + * the padding and return -EFAULT if we can't. Otherwise we 1015 + * might as well just have copied the 12-byte 32-bit struct. 
1016 + */ 1017 + BUILD_BUG_ON(offsetof(struct compat_vcpu_set_singleshot_timer, timeout_abs_ns) != 1018 + offsetof(struct vcpu_set_singleshot_timer, timeout_abs_ns)); 1019 + BUILD_BUG_ON(sizeof_field(struct compat_vcpu_set_singleshot_timer, timeout_abs_ns) != 1020 + sizeof_field(struct vcpu_set_singleshot_timer, timeout_abs_ns)); 1021 + BUILD_BUG_ON(offsetof(struct compat_vcpu_set_singleshot_timer, flags) != 1022 + offsetof(struct vcpu_set_singleshot_timer, flags)); 1023 + BUILD_BUG_ON(sizeof_field(struct compat_vcpu_set_singleshot_timer, flags) != 1024 + sizeof_field(struct vcpu_set_singleshot_timer, flags)); 1025 + 1026 + if (!gpa || 1027 + kvm_vcpu_read_guest(vcpu, gpa, &oneshot, longmode ? sizeof(oneshot) : 1028 + sizeof(struct compat_vcpu_set_singleshot_timer))) { 1029 + *r = -EFAULT; 1030 + return true; 1031 + } 1032 + 1033 + delta = oneshot.timeout_abs_ns - get_kvmclock_ns(vcpu->kvm); 1034 + if ((oneshot.flags & VCPU_SSHOTTMR_future) && delta < 0) { 1035 + *r = -ETIME; 1036 + return true; 1037 + } 1038 + 1039 + kvm_xen_start_timer(vcpu, oneshot.timeout_abs_ns, delta); 1040 + *r = 0; 1041 + return true; 1042 + 1043 + case VCPUOP_stop_singleshot_timer: 1044 + if (vcpu->arch.xen.vcpu_id != vcpu_id) { 1045 + *r = -EINVAL; 1046 + return true; 1047 + } 1048 + kvm_xen_stop_timer(vcpu); 1049 + *r = 0; 1050 + return true; 1051 + } 1052 + 1053 + return false; 1054 + } 1055 + 1056 + static bool kvm_xen_hcall_set_timer_op(struct kvm_vcpu *vcpu, uint64_t timeout, 1057 + u64 *r) 1058 + { 1059 + if (!kvm_xen_timer_enabled(vcpu)) 1060 + return false; 1061 + 1062 + if (timeout) { 1063 + uint64_t guest_now = get_kvmclock_ns(vcpu->kvm); 1064 + int64_t delta = timeout - guest_now; 1065 + 1066 + /* Xen has a 'Linux workaround' in do_set_timer_op() which 1067 + * checks for negative absolute timeout values (caused by 1068 + * integer overflow), and for values about 13 days in the 1069 + * future (2^50ns) which would be caused by jiffies 1070 + * overflow. 
For those cases, it sets the timeout 100ms in 1071 + * the future (not *too* soon, since if a guest really did 1072 + * set a long timeout on purpose we don't want to keep 1073 + * churning CPU time by waking it up). 1074 + */ 1075 + if (unlikely((int64_t)timeout < 0 || 1076 + (delta > 0 && (uint32_t) (delta >> 50) != 0))) { 1077 + delta = 100 * NSEC_PER_MSEC; 1078 + timeout = guest_now + delta; 1079 + } 1080 + 1081 + kvm_xen_start_timer(vcpu, timeout, delta); 1082 + } else { 1083 + kvm_xen_stop_timer(vcpu); 1084 + } 1085 + 1086 + *r = 0; 1087 + return true; 1088 + } 1089 + 957 1090 int kvm_xen_hypercall(struct kvm_vcpu *vcpu) 958 1091 { 959 1092 bool longmode; 960 - u64 input, params[6]; 1093 + u64 input, params[6], r = -ENOSYS; 1094 + bool handled = false; 961 1095 962 1096 input = (u64)kvm_register_read(vcpu, VCPU_REGS_RAX); 963 1097 ··· 1246 864 trace_kvm_xen_hypercall(input, params[0], params[1], params[2], 1247 865 params[3], params[4], params[5]); 1248 866 867 + switch (input) { 868 + case __HYPERVISOR_xen_version: 869 + if (params[0] == XENVER_version && vcpu->kvm->arch.xen.xen_version) { 870 + r = vcpu->kvm->arch.xen.xen_version; 871 + handled = true; 872 + } 873 + break; 874 + case __HYPERVISOR_event_channel_op: 875 + if (params[0] == EVTCHNOP_send) 876 + handled = kvm_xen_hcall_evtchn_send(vcpu, params[1], &r); 877 + break; 878 + case __HYPERVISOR_sched_op: 879 + handled = kvm_xen_hcall_sched_op(vcpu, longmode, params[0], 880 + params[1], &r); 881 + break; 882 + case __HYPERVISOR_vcpu_op: 883 + handled = kvm_xen_hcall_vcpu_op(vcpu, longmode, params[0], params[1], 884 + params[2], &r); 885 + break; 886 + case __HYPERVISOR_set_timer_op: { 887 + u64 timeout = params[0]; 888 + /* In 32-bit mode, the 64-bit timeout is in two 32-bit params. 
*/ 889 + if (!longmode) 890 + timeout |= params[1] << 32; 891 + handled = kvm_xen_hcall_set_timer_op(vcpu, timeout, &r); 892 + break; 893 + } 894 + default: 895 + break; 896 + } 897 + 898 + if (handled) 899 + return kvm_xen_hypercall_set_result(vcpu, r); 900 + 1249 901 vcpu->run->exit_reason = KVM_EXIT_XEN; 1250 902 vcpu->run->xen.type = KVM_EXIT_XEN_HCALL; 1251 903 vcpu->run->xen.u.hcall.longmode = longmode; ··· 1306 890 return COMPAT_EVTCHN_2L_NR_CHANNELS; 1307 891 } 1308 892 893 + static void kvm_xen_check_poller(struct kvm_vcpu *vcpu, int port) 894 + { 895 + int poll_evtchn = vcpu->arch.xen.poll_evtchn; 896 + 897 + if ((poll_evtchn == port || poll_evtchn == -1) && 898 + test_and_clear_bit(kvm_vcpu_get_idx(vcpu), vcpu->kvm->arch.xen.poll_mask)) { 899 + kvm_make_request(KVM_REQ_UNBLOCK, vcpu); 900 + kvm_vcpu_kick(vcpu); 901 + } 902 + } 903 + 1309 904 /* 1310 - * This follows the kvm_set_irq() API, so it returns: 905 + * The return value from this function is propagated to kvm_set_irq() API, 906 + * so it returns: 1311 907 * < 0 Interrupt was ignored (masked or not delivered for other reasons) 1312 908 * = 0 Interrupt was coalesced (previous irq is still pending) 1313 909 * > 0 Number of CPUs interrupt was delivered to 910 + * 911 + * It is also called directly from kvm_arch_set_irq_inatomic(), where the 912 + * only check on its return value is a comparison with -EWOULDBLOCK'. 
1314 913 */ 1315 - int kvm_xen_set_evtchn_fast(struct kvm_kernel_irq_routing_entry *e, 1316 - struct kvm *kvm) 914 + int kvm_xen_set_evtchn_fast(struct kvm_xen_evtchn *xe, struct kvm *kvm) 1317 915 { 1318 916 struct gfn_to_pfn_cache *gpc = &kvm->arch.xen.shinfo_cache; 1319 917 struct kvm_vcpu *vcpu; ··· 1335 905 unsigned long flags; 1336 906 int port_word_bit; 1337 907 bool kick_vcpu = false; 1338 - int idx; 1339 - int rc; 908 + int vcpu_idx, idx, rc; 1340 909 1341 - vcpu = kvm_get_vcpu_by_id(kvm, e->xen_evtchn.vcpu); 1342 - if (!vcpu) 1343 - return -1; 910 + vcpu_idx = READ_ONCE(xe->vcpu_idx); 911 + if (vcpu_idx >= 0) 912 + vcpu = kvm_get_vcpu(kvm, vcpu_idx); 913 + else { 914 + vcpu = kvm_get_vcpu_by_id(kvm, xe->vcpu_id); 915 + if (!vcpu) 916 + return -EINVAL; 917 + WRITE_ONCE(xe->vcpu_idx, kvm_vcpu_get_idx(vcpu)); 918 + } 1344 919 1345 - if (!vcpu->arch.xen.vcpu_info_set) 1346 - return -1; 920 + if (!vcpu->arch.xen.vcpu_info_cache.active) 921 + return -EINVAL; 1347 922 1348 - if (e->xen_evtchn.port >= max_evtchn_port(kvm)) 1349 - return -1; 923 + if (xe->port >= max_evtchn_port(kvm)) 924 + return -EINVAL; 1350 925 1351 926 rc = -EWOULDBLOCK; 1352 - read_lock_irqsave(&gpc->lock, flags); 1353 927 1354 928 idx = srcu_read_lock(&kvm->srcu); 929 + 930 + read_lock_irqsave(&gpc->lock, flags); 1355 931 if (!kvm_gfn_to_pfn_cache_check(kvm, gpc, gpc->gpa, PAGE_SIZE)) 1356 932 goto out_rcu; 1357 933 ··· 1365 929 struct shared_info *shinfo = gpc->khva; 1366 930 pending_bits = (unsigned long *)&shinfo->evtchn_pending; 1367 931 mask_bits = (unsigned long *)&shinfo->evtchn_mask; 1368 - port_word_bit = e->xen_evtchn.port / 64; 932 + port_word_bit = xe->port / 64; 1369 933 } else { 1370 934 struct compat_shared_info *shinfo = gpc->khva; 1371 935 pending_bits = (unsigned long *)&shinfo->evtchn_pending; 1372 936 mask_bits = (unsigned long *)&shinfo->evtchn_mask; 1373 - port_word_bit = e->xen_evtchn.port / 32; 937 + port_word_bit = xe->port / 32; 1374 938 } 1375 939 1376 940 /* ··· 
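The deadline sanity check in the set_timer_op path earlier in this patch (negative deadlines, or deltas with bits set above 2^50 ns, get pulled in to "now + 100ms") can be modeled in isolation. A minimal sketch, using only the arithmetic shown in the hunk; the helper name is ours, not the kernel's:

```c
#include <stdint.h>

#define NSEC_PER_MSEC 1000000ULL

/*
 * Model of the set_timer_op sanity clamp: a deadline that is negative as
 * a signed value, or a positive delta with any bits at or above 2^50 ns
 * (roughly 13 days), is rewritten to fire 100ms from now instead of
 * being armed as-is.  Helper name is illustrative only.
 */
static void xen_clamp_timeout(uint64_t guest_now, uint64_t *timeout,
			      int64_t *delta)
{
	if ((int64_t)*timeout < 0 ||
	    (*delta > 0 && (uint32_t)(*delta >> 50) != 0)) {
		*delta = 100 * NSEC_PER_MSEC;
		*timeout = guest_now + *delta;
	}
}
```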
1380 944 * already set, then we kick the vCPU in question to write to the 1381 945 * *real* evtchn_pending_sel in its own guest vcpu_info struct. 1382 946 */ 1383 - if (test_and_set_bit(e->xen_evtchn.port, pending_bits)) { 947 + if (test_and_set_bit(xe->port, pending_bits)) { 1384 948 rc = 0; /* It was already raised */ 1385 - } else if (test_bit(e->xen_evtchn.port, mask_bits)) { 1386 - rc = -1; /* Masked */ 949 + } else if (test_bit(xe->port, mask_bits)) { 950 + rc = -ENOTCONN; /* Masked */ 951 + kvm_xen_check_poller(vcpu, xe->port); 1387 952 } else { 1388 - rc = 1; /* Delivered. But was the vCPU waking already? */ 1389 - if (!test_and_set_bit(port_word_bit, &vcpu->arch.xen.evtchn_pending_sel)) 1390 - kick_vcpu = true; 953 + rc = 1; /* Delivered to the bitmap in shared_info. */ 954 + /* Now switch to the vCPU's vcpu_info to set the index and pending_sel */ 955 + read_unlock_irqrestore(&gpc->lock, flags); 956 + gpc = &vcpu->arch.xen.vcpu_info_cache; 957 + 958 + read_lock_irqsave(&gpc->lock, flags); 959 + if (!kvm_gfn_to_pfn_cache_check(kvm, gpc, gpc->gpa, sizeof(struct vcpu_info))) { 960 + /* 961 + * Could not access the vcpu_info. Set the bit in-kernel 962 + * and prod the vCPU to deliver it for itself. 963 + */ 964 + if (!test_and_set_bit(port_word_bit, &vcpu->arch.xen.evtchn_pending_sel)) 965 + kick_vcpu = true; 966 + goto out_rcu; 967 + } 968 + 969 + if (IS_ENABLED(CONFIG_64BIT) && kvm->arch.xen.long_mode) { 970 + struct vcpu_info *vcpu_info = gpc->khva; 971 + if (!test_and_set_bit(port_word_bit, &vcpu_info->evtchn_pending_sel)) { 972 + WRITE_ONCE(vcpu_info->evtchn_upcall_pending, 1); 973 + kick_vcpu = true; 974 + } 975 + } else { 976 + struct compat_vcpu_info *vcpu_info = gpc->khva; 977 + if (!test_and_set_bit(port_word_bit, 978 + (unsigned long *)&vcpu_info->evtchn_pending_sel)) { 979 + WRITE_ONCE(vcpu_info->evtchn_upcall_pending, 1); 980 + kick_vcpu = true; 981 + } 982 + } 983 + 984 + /* For the per-vCPU lapic vector, deliver it as MSI. 
*/ 985 + if (kick_vcpu && vcpu->arch.xen.upcall_vector) { 986 + kvm_xen_inject_vcpu_vector(vcpu); 987 + kick_vcpu = false; 988 + } 1391 989 } 1392 990 1393 991 out_rcu: 1394 - srcu_read_unlock(&kvm->srcu, idx); 1395 992 read_unlock_irqrestore(&gpc->lock, flags); 993 + srcu_read_unlock(&kvm->srcu, idx); 1396 994 1397 995 if (kick_vcpu) { 1398 - kvm_make_request(KVM_REQ_EVENT, vcpu); 996 + kvm_make_request(KVM_REQ_UNBLOCK, vcpu); 1399 997 kvm_vcpu_kick(vcpu); 1400 998 } 1401 999 1402 1000 return rc; 1403 1001 } 1404 1002 1405 - /* This is the version called from kvm_set_irq() as the .set function */ 1406 - static int evtchn_set_fn(struct kvm_kernel_irq_routing_entry *e, struct kvm *kvm, 1407 - int irq_source_id, int level, bool line_status) 1003 + static int kvm_xen_set_evtchn(struct kvm_xen_evtchn *xe, struct kvm *kvm) 1408 1004 { 1409 1005 bool mm_borrowed = false; 1410 1006 int rc; 1411 1007 1412 - if (!level) 1413 - return -1; 1414 - 1415 - rc = kvm_xen_set_evtchn_fast(e, kvm); 1008 + rc = kvm_xen_set_evtchn_fast(xe, kvm); 1416 1009 if (rc != -EWOULDBLOCK) 1417 1010 return rc; 1418 1011 ··· 1485 1020 struct gfn_to_pfn_cache *gpc = &kvm->arch.xen.shinfo_cache; 1486 1021 int idx; 1487 1022 1488 - rc = kvm_xen_set_evtchn_fast(e, kvm); 1023 + rc = kvm_xen_set_evtchn_fast(xe, kvm); 1489 1024 if (rc != -EWOULDBLOCK) 1490 1025 break; 1491 1026 ··· 1502 1037 return rc; 1503 1038 } 1504 1039 1040 + /* This is the version called from kvm_set_irq() as the .set function */ 1041 + static int evtchn_set_fn(struct kvm_kernel_irq_routing_entry *e, struct kvm *kvm, 1042 + int irq_source_id, int level, bool line_status) 1043 + { 1044 + if (!level) 1045 + return -EINVAL; 1046 + 1047 + return kvm_xen_set_evtchn(&e->xen_evtchn, kvm); 1048 + } 1049 + 1050 + /* 1051 + * Set up an event channel interrupt from the KVM IRQ routing table. 1052 + * Used for e.g. PIRQ from passed through physical devices. 
1053 + */ 1505 1054 int kvm_xen_setup_evtchn(struct kvm *kvm, 1506 1055 struct kvm_kernel_irq_routing_entry *e, 1507 1056 const struct kvm_irq_routing_entry *ue) 1508 1057 1509 1058 { 1059 + struct kvm_vcpu *vcpu; 1060 + 1510 1061 if (ue->u.xen_evtchn.port >= max_evtchn_port(kvm)) 1511 1062 return -EINVAL; 1512 1063 ··· 1530 1049 if (ue->u.xen_evtchn.priority != KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL) 1531 1050 return -EINVAL; 1532 1051 1052 + /* 1053 + * Xen gives us interesting mappings from vCPU index to APIC ID, 1054 + * which means kvm_get_vcpu_by_id() has to iterate over all vCPUs 1055 + * to find it. Do that once at setup time, instead of every time. 1056 + * But beware that on live update / live migration, the routing 1057 + * table might be reinstated before the vCPU threads have finished 1058 + * recreating their vCPUs. 1059 + */ 1060 + vcpu = kvm_get_vcpu_by_id(kvm, ue->u.xen_evtchn.vcpu); 1061 + if (vcpu) 1062 + e->xen_evtchn.vcpu_idx = kvm_vcpu_get_idx(vcpu); 1063 + else 1064 + e->xen_evtchn.vcpu_idx = -1; 1065 + 1533 1066 e->xen_evtchn.port = ue->u.xen_evtchn.port; 1534 - e->xen_evtchn.vcpu = ue->u.xen_evtchn.vcpu; 1067 + e->xen_evtchn.vcpu_id = ue->u.xen_evtchn.vcpu; 1535 1068 e->xen_evtchn.priority = ue->u.xen_evtchn.priority; 1536 1069 e->set = evtchn_set_fn; 1537 1070 1538 1071 return 0; 1072 + } 1073 + 1074 + /* 1075 + * Explicit event sending from userspace with KVM_XEN_HVM_EVTCHN_SEND ioctl. 
1076 + */ 1077 + int kvm_xen_hvm_evtchn_send(struct kvm *kvm, struct kvm_irq_routing_xen_evtchn *uxe) 1078 + { 1079 + struct kvm_xen_evtchn e; 1080 + int ret; 1081 + 1082 + if (!uxe->port || uxe->port >= max_evtchn_port(kvm)) 1083 + return -EINVAL; 1084 + 1085 + /* We only support 2 level event channels for now */ 1086 + if (uxe->priority != KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL) 1087 + return -EINVAL; 1088 + 1089 + e.port = uxe->port; 1090 + e.vcpu_id = uxe->vcpu; 1091 + e.vcpu_idx = -1; 1092 + e.priority = uxe->priority; 1093 + 1094 + ret = kvm_xen_set_evtchn(&e, kvm); 1095 + 1096 + /* 1097 + * None of that 'return 1 if it actually got delivered' nonsense. 1098 + * We don't care if it was masked (-ENOTCONN) either. 1099 + */ 1100 + if (ret > 0 || ret == -ENOTCONN) 1101 + ret = 0; 1102 + 1103 + return ret; 1104 + } 1105 + 1106 + /* 1107 + * Support for *outbound* event channel events via the EVTCHNOP_send hypercall. 1108 + */ 1109 + struct evtchnfd { 1110 + u32 send_port; 1111 + u32 type; 1112 + union { 1113 + struct kvm_xen_evtchn port; 1114 + struct { 1115 + u32 port; /* zero */ 1116 + struct eventfd_ctx *ctx; 1117 + } eventfd; 1118 + } deliver; 1119 + }; 1120 + 1121 + /* 1122 + * Update target vCPU or priority for a registered sending channel. 
1123 + */ 1124 + static int kvm_xen_eventfd_update(struct kvm *kvm, 1125 + struct kvm_xen_hvm_attr *data) 1126 + { 1127 + u32 port = data->u.evtchn.send_port; 1128 + struct evtchnfd *evtchnfd; 1129 + 1130 + if (!port || port >= max_evtchn_port(kvm)) 1131 + return -EINVAL; 1132 + 1133 + mutex_lock(&kvm->lock); 1134 + evtchnfd = idr_find(&kvm->arch.xen.evtchn_ports, port); 1135 + mutex_unlock(&kvm->lock); 1136 + 1137 + if (!evtchnfd) 1138 + return -ENOENT; 1139 + 1140 + /* For an UPDATE, nothing may change except the priority/vcpu */ 1141 + if (evtchnfd->type != data->u.evtchn.type) 1142 + return -EINVAL; 1143 + 1144 + /* 1145 + * Port cannot change, and if it's zero that was an eventfd 1146 + * which can't be changed either. 1147 + */ 1148 + if (!evtchnfd->deliver.port.port || 1149 + evtchnfd->deliver.port.port != data->u.evtchn.deliver.port.port) 1150 + return -EINVAL; 1151 + 1152 + /* We only support 2 level event channels for now */ 1153 + if (data->u.evtchn.deliver.port.priority != KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL) 1154 + return -EINVAL; 1155 + 1156 + mutex_lock(&kvm->lock); 1157 + evtchnfd->deliver.port.priority = data->u.evtchn.deliver.port.priority; 1158 + if (evtchnfd->deliver.port.vcpu_id != data->u.evtchn.deliver.port.vcpu) { 1159 + evtchnfd->deliver.port.vcpu_id = data->u.evtchn.deliver.port.vcpu; 1160 + evtchnfd->deliver.port.vcpu_idx = -1; 1161 + } 1162 + mutex_unlock(&kvm->lock); 1163 + return 0; 1164 + } 1165 + 1166 + /* 1167 + * Configure the target (eventfd or local port delivery) for sending on 1168 + * a given event channel. 
1169 + */ 1170 + static int kvm_xen_eventfd_assign(struct kvm *kvm, 1171 + struct kvm_xen_hvm_attr *data) 1172 + { 1173 + u32 port = data->u.evtchn.send_port; 1174 + struct eventfd_ctx *eventfd = NULL; 1175 + struct evtchnfd *evtchnfd = NULL; 1176 + int ret = -EINVAL; 1177 + 1178 + if (!port || port >= max_evtchn_port(kvm)) 1179 + return -EINVAL; 1180 + 1181 + evtchnfd = kzalloc(sizeof(struct evtchnfd), GFP_KERNEL); 1182 + if (!evtchnfd) 1183 + return -ENOMEM; 1184 + 1185 + switch(data->u.evtchn.type) { 1186 + case EVTCHNSTAT_ipi: 1187 + /* IPI must map back to the same port# */ 1188 + if (data->u.evtchn.deliver.port.port != data->u.evtchn.send_port) 1189 + goto out; /* -EINVAL */ 1190 + break; 1191 + 1192 + case EVTCHNSTAT_interdomain: 1193 + if (data->u.evtchn.deliver.port.port) { 1194 + if (data->u.evtchn.deliver.port.port >= max_evtchn_port(kvm)) 1195 + goto out; /* -EINVAL */ 1196 + } else { 1197 + eventfd = eventfd_ctx_fdget(data->u.evtchn.deliver.eventfd.fd); 1198 + if (IS_ERR(eventfd)) { 1199 + ret = PTR_ERR(eventfd); 1200 + goto out; 1201 + } 1202 + } 1203 + break; 1204 + 1205 + case EVTCHNSTAT_virq: 1206 + case EVTCHNSTAT_closed: 1207 + case EVTCHNSTAT_unbound: 1208 + case EVTCHNSTAT_pirq: 1209 + default: /* Unknown event channel type */ 1210 + goto out; /* -EINVAL */ 1211 + } 1212 + 1213 + evtchnfd->send_port = data->u.evtchn.send_port; 1214 + evtchnfd->type = data->u.evtchn.type; 1215 + if (eventfd) { 1216 + evtchnfd->deliver.eventfd.ctx = eventfd; 1217 + } else { 1218 + /* We only support 2 level event channels for now */ 1219 + if (data->u.evtchn.deliver.port.priority != KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL) 1220 + goto out; /* -EINVAL; */ 1221 + 1222 + evtchnfd->deliver.port.port = data->u.evtchn.deliver.port.port; 1223 + evtchnfd->deliver.port.vcpu_id = data->u.evtchn.deliver.port.vcpu; 1224 + evtchnfd->deliver.port.vcpu_idx = -1; 1225 + evtchnfd->deliver.port.priority = data->u.evtchn.deliver.port.priority; 1226 + } 1227 + 1228 + 
mutex_lock(&kvm->lock); 1229 + ret = idr_alloc(&kvm->arch.xen.evtchn_ports, evtchnfd, port, port + 1, 1230 + GFP_KERNEL); 1231 + mutex_unlock(&kvm->lock); 1232 + if (ret >= 0) 1233 + return 0; 1234 + 1235 + if (ret == -ENOSPC) 1236 + ret = -EEXIST; 1237 + out: 1238 + if (eventfd) 1239 + eventfd_ctx_put(eventfd); 1240 + kfree(evtchnfd); 1241 + return ret; 1242 + } 1243 + 1244 + static int kvm_xen_eventfd_deassign(struct kvm *kvm, u32 port) 1245 + { 1246 + struct evtchnfd *evtchnfd; 1247 + 1248 + mutex_lock(&kvm->lock); 1249 + evtchnfd = idr_remove(&kvm->arch.xen.evtchn_ports, port); 1250 + mutex_unlock(&kvm->lock); 1251 + 1252 + if (!evtchnfd) 1253 + return -ENOENT; 1254 + 1255 + if (kvm) 1256 + synchronize_srcu(&kvm->srcu); 1257 + if (!evtchnfd->deliver.port.port) 1258 + eventfd_ctx_put(evtchnfd->deliver.eventfd.ctx); 1259 + kfree(evtchnfd); 1260 + return 0; 1261 + } 1262 + 1263 + static int kvm_xen_eventfd_reset(struct kvm *kvm) 1264 + { 1265 + struct evtchnfd *evtchnfd; 1266 + int i; 1267 + 1268 + mutex_lock(&kvm->lock); 1269 + idr_for_each_entry(&kvm->arch.xen.evtchn_ports, evtchnfd, i) { 1270 + idr_remove(&kvm->arch.xen.evtchn_ports, evtchnfd->send_port); 1271 + synchronize_srcu(&kvm->srcu); 1272 + if (!evtchnfd->deliver.port.port) 1273 + eventfd_ctx_put(evtchnfd->deliver.eventfd.ctx); 1274 + kfree(evtchnfd); 1275 + } 1276 + mutex_unlock(&kvm->lock); 1277 + 1278 + return 0; 1279 + } 1280 + 1281 + static int kvm_xen_setattr_evtchn(struct kvm *kvm, struct kvm_xen_hvm_attr *data) 1282 + { 1283 + u32 port = data->u.evtchn.send_port; 1284 + 1285 + if (data->u.evtchn.flags == KVM_XEN_EVTCHN_RESET) 1286 + return kvm_xen_eventfd_reset(kvm); 1287 + 1288 + if (!port || port >= max_evtchn_port(kvm)) 1289 + return -EINVAL; 1290 + 1291 + if (data->u.evtchn.flags == KVM_XEN_EVTCHN_DEASSIGN) 1292 + return kvm_xen_eventfd_deassign(kvm, port); 1293 + if (data->u.evtchn.flags == KVM_XEN_EVTCHN_UPDATE) 1294 + return kvm_xen_eventfd_update(kvm, data); 1295 + if 
(data->u.evtchn.flags) 1296 + return -EINVAL; 1297 + 1298 + return kvm_xen_eventfd_assign(kvm, data); 1299 + } 1300 + 1301 + static bool kvm_xen_hcall_evtchn_send(struct kvm_vcpu *vcpu, u64 param, u64 *r) 1302 + { 1303 + struct evtchnfd *evtchnfd; 1304 + struct evtchn_send send; 1305 + gpa_t gpa; 1306 + int idx; 1307 + 1308 + idx = srcu_read_lock(&vcpu->kvm->srcu); 1309 + gpa = kvm_mmu_gva_to_gpa_system(vcpu, param, NULL); 1310 + srcu_read_unlock(&vcpu->kvm->srcu, idx); 1311 + 1312 + if (!gpa || kvm_vcpu_read_guest(vcpu, gpa, &send, sizeof(send))) { 1313 + *r = -EFAULT; 1314 + return true; 1315 + } 1316 + 1317 + /* The evtchn_ports idr is protected by vcpu->kvm->srcu */ 1318 + evtchnfd = idr_find(&vcpu->kvm->arch.xen.evtchn_ports, send.port); 1319 + if (!evtchnfd) 1320 + return false; 1321 + 1322 + if (evtchnfd->deliver.port.port) { 1323 + int ret = kvm_xen_set_evtchn(&evtchnfd->deliver.port, vcpu->kvm); 1324 + if (ret < 0 && ret != -ENOTCONN) 1325 + return false; 1326 + } else { 1327 + eventfd_signal(evtchnfd->deliver.eventfd.ctx, 1); 1328 + } 1329 + 1330 + *r = 0; 1331 + return true; 1332 + } 1333 + 1334 + void kvm_xen_init_vcpu(struct kvm_vcpu *vcpu) 1335 + { 1336 + vcpu->arch.xen.vcpu_id = vcpu->vcpu_idx; 1337 + vcpu->arch.xen.poll_evtchn = 0; 1338 + timer_setup(&vcpu->arch.xen.poll_timer, cancel_evtchn_poll, 0); 1339 + } 1340 + 1341 + void kvm_xen_destroy_vcpu(struct kvm_vcpu *vcpu) 1342 + { 1343 + if (kvm_xen_timer_enabled(vcpu)) 1344 + kvm_xen_stop_timer(vcpu); 1345 + 1346 + kvm_gfn_to_pfn_cache_destroy(vcpu->kvm, 1347 + &vcpu->arch.xen.runstate_cache); 1348 + kvm_gfn_to_pfn_cache_destroy(vcpu->kvm, 1349 + &vcpu->arch.xen.vcpu_info_cache); 1350 + kvm_gfn_to_pfn_cache_destroy(vcpu->kvm, 1351 + &vcpu->arch.xen.vcpu_time_info_cache); 1352 + del_timer_sync(&vcpu->arch.xen.poll_timer); 1353 + } 1354 + 1355 + void kvm_xen_init_vm(struct kvm *kvm) 1356 + { 1357 + idr_init(&kvm->arch.xen.evtchn_ports); 1358 + } 1359 + 1360 + void kvm_xen_destroy_vm(struct kvm *kvm) 
1361 + { 1362 + struct evtchnfd *evtchnfd; 1363 + int i; 1364 + 1365 + kvm_gfn_to_pfn_cache_destroy(kvm, &kvm->arch.xen.shinfo_cache); 1366 + 1367 + idr_for_each_entry(&kvm->arch.xen.evtchn_ports, evtchnfd, i) { 1368 + if (!evtchnfd->deliver.port.port) 1369 + eventfd_ctx_put(evtchnfd->deliver.eventfd.ctx); 1370 + kfree(evtchnfd); 1371 + } 1372 + idr_destroy(&kvm->arch.xen.evtchn_ports); 1373 + 1374 + if (kvm->arch.xen_hvm_config.msr) 1375 + static_branch_slow_dec_deferred(&kvm_xen_enabled); 1539 1376 }
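The 2-level event channel delivery above summarises each word of the shared_info evtchn_pending bitmap with one bit in the vcpu_info evtchn_pending_sel, and the word covering a given port depends on the guest's word size (64-bit vs. compat). A standalone sketch of that indexing, matching the port_word_bit computation in kvm_xen_set_evtchn_fast(); the function name is ours:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Which evtchn_pending_sel bit summarises a given event channel port:
 * pending words are 64 bits wide for a 64-bit guest and 32 bits wide
 * for a compat guest, as in kvm_xen_set_evtchn_fast().
 */
static int evtchn_port_word_bit(uint32_t port, bool long_mode)
{
	return long_mode ? port / 64 : port / 32;
}
```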
+59 -3
arch/x86/kvm/xen.h
··· 15 15 extern struct static_key_false_deferred kvm_xen_enabled; 16 16 17 17 int __kvm_xen_has_interrupt(struct kvm_vcpu *vcpu); 18 + void kvm_xen_inject_pending_events(struct kvm_vcpu *vcpu); 18 19 int kvm_xen_vcpu_set_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data); 19 20 int kvm_xen_vcpu_get_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data); 20 21 int kvm_xen_hvm_set_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data); 21 22 int kvm_xen_hvm_get_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data); 23 + int kvm_xen_hvm_evtchn_send(struct kvm *kvm, struct kvm_irq_routing_xen_evtchn *evt); 22 24 int kvm_xen_write_hypercall_page(struct kvm_vcpu *vcpu, u64 data); 23 25 int kvm_xen_hvm_config(struct kvm *kvm, struct kvm_xen_hvm_config *xhc); 24 26 void kvm_xen_init_vm(struct kvm *kvm); 25 27 void kvm_xen_destroy_vm(struct kvm *kvm); 26 - 27 - int kvm_xen_set_evtchn_fast(struct kvm_kernel_irq_routing_entry *e, 28 + void kvm_xen_init_vcpu(struct kvm_vcpu *vcpu); 29 + void kvm_xen_destroy_vcpu(struct kvm_vcpu *vcpu); 30 + int kvm_xen_set_evtchn_fast(struct kvm_xen_evtchn *xe, 28 31 struct kvm *kvm); 29 32 int kvm_xen_setup_evtchn(struct kvm *kvm, 30 33 struct kvm_kernel_irq_routing_entry *e, ··· 49 46 static inline int kvm_xen_has_interrupt(struct kvm_vcpu *vcpu) 50 47 { 51 48 if (static_branch_unlikely(&kvm_xen_enabled.key) && 52 - vcpu->arch.xen.vcpu_info_set && vcpu->kvm->arch.xen.upcall_vector) 49 + vcpu->arch.xen.vcpu_info_cache.active && 50 + vcpu->kvm->arch.xen.upcall_vector) 53 51 return __kvm_xen_has_interrupt(vcpu); 54 52 55 53 return 0; 56 54 } 55 + 56 + static inline bool kvm_xen_has_pending_events(struct kvm_vcpu *vcpu) 57 + { 58 + return static_branch_unlikely(&kvm_xen_enabled.key) && 59 + vcpu->arch.xen.evtchn_pending_sel; 60 + } 61 + 62 + static inline bool kvm_xen_timer_enabled(struct kvm_vcpu *vcpu) 63 + { 64 + return !!vcpu->arch.xen.timer_virq; 65 + } 66 + 67 + static inline int kvm_xen_has_pending_timer(struct kvm_vcpu *vcpu) 
68 + { 69 + if (kvm_xen_hypercall_enabled(vcpu->kvm) && kvm_xen_timer_enabled(vcpu)) 70 + return atomic_read(&vcpu->arch.xen.timer_pending); 71 + 72 + return 0; 73 + } 74 + 75 + void kvm_xen_inject_timer_irqs(struct kvm_vcpu *vcpu); 57 76 #else 58 77 static inline int kvm_xen_write_hypercall_page(struct kvm_vcpu *vcpu, u64 data) 59 78 { ··· 87 62 } 88 63 89 64 static inline void kvm_xen_destroy_vm(struct kvm *kvm) 65 + { 66 + } 67 + 68 + static inline void kvm_xen_init_vcpu(struct kvm_vcpu *vcpu) 69 + { 70 + } 71 + 72 + static inline void kvm_xen_destroy_vcpu(struct kvm_vcpu *vcpu) 90 73 { 91 74 } 92 75 ··· 111 78 static inline int kvm_xen_has_interrupt(struct kvm_vcpu *vcpu) 112 79 { 113 80 return 0; 81 + } 82 + 83 + static inline void kvm_xen_inject_pending_events(struct kvm_vcpu *vcpu) 84 + { 85 + } 86 + 87 + static inline bool kvm_xen_has_pending_events(struct kvm_vcpu *vcpu) 88 + { 89 + return false; 90 + } 91 + 92 + static inline int kvm_xen_has_pending_timer(struct kvm_vcpu *vcpu) 93 + { 94 + return 0; 95 + } 96 + 97 + static inline void kvm_xen_inject_timer_irqs(struct kvm_vcpu *vcpu) 98 + { 99 + } 100 + 101 + static inline bool kvm_xen_timer_enabled(struct kvm_vcpu *vcpu) 102 + { 103 + return false; 114 104 } 115 105 #endif 116 106
+11
drivers/s390/char/Kconfig
··· 100 100 This option enables the Open-for-Business interface to the s390 101 101 Service Element. 102 102 103 + config S390_UV_UAPI 104 + def_tristate m 105 + prompt "Ultravisor userspace API" 106 + depends on S390 107 + help 108 + Selecting this option exposes parts of the UV interface to userspace 109 + by providing a misc character device at /dev/uv. 110 + Using IOCTLs, one can interact with the UV. 111 + The device is only available if the Ultravisor 112 + Facility (158) is present. 113 + 103 114 config S390_TAPE 104 115 def_tristate m 105 116 prompt "S/390 tape device support"
+1
drivers/s390/char/Makefile
··· 48 48 obj-$(CONFIG_MONWRITER) += monwriter.o 49 49 obj-$(CONFIG_S390_VMUR) += vmur.o 50 50 obj-$(CONFIG_CRASH_DUMP) += sclp_sdias.o zcore.o 51 + obj-$(CONFIG_S390_UV_UAPI) += uvdevice.o 51 52 52 53 hmcdrv-objs := hmcdrv_mod.o hmcdrv_dev.o hmcdrv_ftp.o hmcdrv_cache.o diag_ftp.o sclp_ftp.o 53 54 obj-$(CONFIG_HMC_DRV) += hmcdrv.o
+257
drivers/s390/char/uvdevice.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* 3 + * Copyright IBM Corp. 2022 4 + * Author(s): Steffen Eiden <seiden@linux.ibm.com> 5 + * 6 + * This file provides a Linux misc device to give userspace access to some 7 + * Ultravisor (UV) functions. The device only accepts IOCTLs and will only 8 + * be present if the Ultravisor facility (158) is present. 9 + * 10 + * When userspace sends a valid IOCTL, uvdevice will copy the input data to 11 + * kernel space and do some basic validity checks to avoid kernel/system 12 + * corruption. Any other check that the Ultravisor does will not be done by 13 + * the uvdevice to keep changes minimal when adding new functionalities 14 + * to existing UV-calls. 15 + * After the checks, uvdevice builds a corresponding 16 + * Ultravisor Call Control Block and sends the request to the Ultravisor. 17 + * Then, it copies the response, including the return codes, back to userspace. 18 + * It is the responsibility of userspace to check for any error issued 19 + * by UV and to interpret the UV response. The uvdevice acts as a communication 20 + * channel for userspace to the Ultravisor. 
21 + */ 22 + 23 + #include <linux/module.h> 24 + #include <linux/kernel.h> 25 + #include <linux/miscdevice.h> 26 + #include <linux/types.h> 27 + #include <linux/stddef.h> 28 + #include <linux/vmalloc.h> 29 + #include <linux/slab.h> 30 + 31 + #include <asm/uvdevice.h> 32 + #include <asm/uv.h> 33 + 34 + static int uvio_build_uvcb_attest(struct uv_cb_attest *uvcb_attest, u8 *arcb, 35 + u8 *meas, u8 *add_data, struct uvio_attest *uvio_attest) 36 + { 37 + void __user *user_buf_arcb = (void __user *)uvio_attest->arcb_addr; 38 + 39 + if (copy_from_user(arcb, user_buf_arcb, uvio_attest->arcb_len)) 40 + return -EFAULT; 41 + 42 + uvcb_attest->header.len = sizeof(*uvcb_attest); 43 + uvcb_attest->header.cmd = UVC_CMD_RETR_ATTEST; 44 + uvcb_attest->arcb_addr = (u64)arcb; 45 + uvcb_attest->cont_token = 0; 46 + uvcb_attest->user_data_len = uvio_attest->user_data_len; 47 + memcpy(uvcb_attest->user_data, uvio_attest->user_data, sizeof(uvcb_attest->user_data)); 48 + uvcb_attest->meas_len = uvio_attest->meas_len; 49 + uvcb_attest->meas_addr = (u64)meas; 50 + uvcb_attest->add_data_len = uvio_attest->add_data_len; 51 + uvcb_attest->add_data_addr = (u64)add_data; 52 + 53 + return 0; 54 + } 55 + 56 + static int uvio_copy_attest_result_to_user(struct uv_cb_attest *uvcb_attest, 57 + struct uvio_ioctl_cb *uv_ioctl, 58 + u8 *measurement, u8 *add_data, 59 + struct uvio_attest *uvio_attest) 60 + { 61 + struct uvio_attest __user *user_uvio_attest = (void __user *)uv_ioctl->argument_addr; 62 + void __user *user_buf_add = (void __user *)uvio_attest->add_data_addr; 63 + void __user *user_buf_meas = (void __user *)uvio_attest->meas_addr; 64 + void __user *user_buf_uid = &user_uvio_attest->config_uid; 65 + 66 + if (copy_to_user(user_buf_meas, measurement, uvio_attest->meas_len)) 67 + return -EFAULT; 68 + if (add_data && copy_to_user(user_buf_add, add_data, uvio_attest->add_data_len)) 69 + return -EFAULT; 70 + if (copy_to_user(user_buf_uid, uvcb_attest->config_uid, sizeof(uvcb_attest->config_uid))) 
71 + return -EFAULT; 72 + return 0; 73 + } 74 + 75 + static int get_uvio_attest(struct uvio_ioctl_cb *uv_ioctl, struct uvio_attest *uvio_attest) 76 + { 77 + u8 __user *user_arg_buf = (u8 __user *)uv_ioctl->argument_addr; 78 + 79 + if (copy_from_user(uvio_attest, user_arg_buf, sizeof(*uvio_attest))) 80 + return -EFAULT; 81 + 82 + if (uvio_attest->arcb_len > UVIO_ATT_ARCB_MAX_LEN) 83 + return -EINVAL; 84 + if (uvio_attest->arcb_len == 0) 85 + return -EINVAL; 86 + if (uvio_attest->meas_len > UVIO_ATT_MEASUREMENT_MAX_LEN) 87 + return -EINVAL; 88 + if (uvio_attest->meas_len == 0) 89 + return -EINVAL; 90 + if (uvio_attest->add_data_len > UVIO_ATT_ADDITIONAL_MAX_LEN) 91 + return -EINVAL; 92 + if (uvio_attest->reserved136) 93 + return -EINVAL; 94 + return 0; 95 + } 96 + 97 + /** 98 + * uvio_attestation() - Perform a Retrieve Attestation Measurement UVC. 99 + * 100 + * @uv_ioctl: ioctl control block 101 + * 102 + * uvio_attestation() does a Retrieve Attestation Measurement Ultravisor Call. 103 + * It verifies that the given userspace addresses are valid and request sizes 104 + * are sane. Every other check is made by the Ultravisor (UV) and won't result 105 + * in a negative return value. It copies the input to kernelspace, builds the 106 + * request, sends the UV-call, and copies the result to userspace. 107 + * 108 + * The Attestation Request has two inputs and two outputs. 109 + * ARCB and User Data are inputs for the UV generated by userspace. 110 + * Measurement and Additional Data are outputs for userspace generated by UV. 111 + * 112 + * The Attestation Request Control Block (ARCB) is a cryptographically verified 113 + * and secured request to UV and User Data is some plaintext data which is 114 + * going to be included in the Attestation Measurement calculation. 115 + * 116 + * Measurement is a cryptographic measurement of the caller's properties, 117 + * optional data configured by the ARCB and the user data. 
If specified by the 118 + * ARCB, UV will add some Additional Data to the measurement calculation. 119 + * This Additional Data is then returned as well. 120 + * 121 + * If the Retrieve Attestation Measurement UV facility is not present, 122 + * UV will return invalid command rc. This won't be fenced in the driver 123 + * and does not result in a negative return value. 124 + * 125 + * Context: might sleep 126 + * 127 + * Return: 0 on success or a negative error code on error. 128 + */ 129 + static int uvio_attestation(struct uvio_ioctl_cb *uv_ioctl) 130 + { 131 + struct uv_cb_attest *uvcb_attest = NULL; 132 + struct uvio_attest *uvio_attest = NULL; 133 + u8 *measurement = NULL; 134 + u8 *add_data = NULL; 135 + u8 *arcb = NULL; 136 + int ret; 137 + 138 + ret = -EINVAL; 139 + if (uv_ioctl->argument_len != sizeof(*uvio_attest)) 140 + goto out; 141 + 142 + ret = -ENOMEM; 143 + uvio_attest = kzalloc(sizeof(*uvio_attest), GFP_KERNEL); 144 + if (!uvio_attest) 145 + goto out; 146 + 147 + ret = get_uvio_attest(uv_ioctl, uvio_attest); 148 + if (ret) 149 + goto out; 150 + 151 + ret = -ENOMEM; 152 + arcb = kvzalloc(uvio_attest->arcb_len, GFP_KERNEL); 153 + measurement = kvzalloc(uvio_attest->meas_len, GFP_KERNEL); 154 + if (!arcb || !measurement) 155 + goto out; 156 + 157 + if (uvio_attest->add_data_len) { 158 + add_data = kvzalloc(uvio_attest->add_data_len, GFP_KERNEL); 159 + if (!add_data) 160 + goto out; 161 + } 162 + 163 + uvcb_attest = kzalloc(sizeof(*uvcb_attest), GFP_KERNEL); 164 + if (!uvcb_attest) 165 + goto out; 166 + 167 + ret = uvio_build_uvcb_attest(uvcb_attest, arcb, measurement, add_data, uvio_attest); 168 + if (ret) 169 + goto out; 170 + 171 + uv_call_sched(0, (u64)uvcb_attest); 172 + 173 + uv_ioctl->uv_rc = uvcb_attest->header.rc; 174 + uv_ioctl->uv_rrc = uvcb_attest->header.rrc; 175 + 176 + ret = uvio_copy_attest_result_to_user(uvcb_attest, uv_ioctl, measurement, add_data, 177 + uvio_attest); 178 + out: 179 + kvfree(arcb); 180 + kvfree(measurement); 181 + 
kvfree(add_data); 182 + kfree(uvio_attest); 183 + kfree(uvcb_attest); 184 + return ret; 185 + } 186 + 187 + static int uvio_copy_and_check_ioctl(struct uvio_ioctl_cb *ioctl, void __user *argp) 188 + { 189 + if (copy_from_user(ioctl, argp, sizeof(*ioctl))) 190 + return -EFAULT; 191 + if (ioctl->flags != 0) 192 + return -EINVAL; 193 + if (memchr_inv(ioctl->reserved14, 0, sizeof(ioctl->reserved14))) 194 + return -EINVAL; 195 + 196 + return 0; 197 + } 198 + 199 + /* 200 + * IOCTL entry point for the Ultravisor device. 201 + */ 202 + static long uvio_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) 203 + { 204 + void __user *argp = (void __user *)arg; 205 + struct uvio_ioctl_cb uv_ioctl = { }; 206 + long ret; 207 + 208 + switch (cmd) { 209 + case UVIO_IOCTL_ATT: 210 + ret = uvio_copy_and_check_ioctl(&uv_ioctl, argp); 211 + if (ret) 212 + return ret; 213 + ret = uvio_attestation(&uv_ioctl); 214 + break; 215 + default: 216 + ret = -ENOIOCTLCMD; 217 + break; 218 + } 219 + if (ret) 220 + return ret; 221 + 222 + if (copy_to_user(argp, &uv_ioctl, sizeof(uv_ioctl))) 223 + ret = -EFAULT; 224 + 225 + return ret; 226 + } 227 + 228 + static const struct file_operations uvio_dev_fops = { 229 + .owner = THIS_MODULE, 230 + .unlocked_ioctl = uvio_ioctl, 231 + .llseek = no_llseek, 232 + }; 233 + 234 + static struct miscdevice uvio_dev_miscdev = { 235 + .minor = MISC_DYNAMIC_MINOR, 236 + .name = UVIO_DEVICE_NAME, 237 + .fops = &uvio_dev_fops, 238 + }; 239 + 240 + static void __exit uvio_dev_exit(void) 241 + { 242 + misc_deregister(&uvio_dev_miscdev); 243 + } 244 + 245 + static int __init uvio_dev_init(void) 246 + { 247 + if (!test_facility(158)) 248 + return -ENXIO; 249 + return misc_register(&uvio_dev_miscdev); 250 + } 251 + 252 + module_init(uvio_dev_init); 253 + module_exit(uvio_dev_exit); 254 + 255 + MODULE_AUTHOR("IBM Corporation"); 256 + MODULE_LICENSE("GPL"); 257 + MODULE_DESCRIPTION("Ultravisor UAPI driver");
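The bounds checking that get_uvio_attest() applies before issuing the UV call can be mirrored outside the kernel. A sketch under the assumption of placeholder limits; the real values are the UVIO_ATT_*_MAX_LEN constants from asm/uvdevice.h, which are not shown in this patch:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed placeholder limits for illustration only; the real bounds are
 * UVIO_ATT_ARCB_MAX_LEN, UVIO_ATT_MEASUREMENT_MAX_LEN and
 * UVIO_ATT_ADDITIONAL_MAX_LEN from asm/uvdevice.h. */
#define ARCB_MAX_LEN 0x10000
#define MEAS_MAX_LEN 0x8000
#define ADD_MAX_LEN  0x8000

/*
 * Same shape as the get_uvio_attest() checks: ARCB and measurement must
 * be non-empty and within bounds, additional data may be empty, and the
 * reserved field must be zero.
 */
static bool uvio_attest_args_ok(uint32_t arcb_len, uint32_t meas_len,
				uint32_t add_data_len, uint64_t reserved)
{
	if (arcb_len == 0 || arcb_len > ARCB_MAX_LEN)
		return false;
	if (meas_len == 0 || meas_len > MEAS_MAX_LEN)
		return false;
	if (add_data_len > ADD_MAX_LEN)
		return false;
	return reserved == 0;
}
```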
-2
include/kvm/arm_arch_timer.h
··· 76 76 int kvm_arm_timer_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr); 77 77 int kvm_arm_timer_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr); 78 78 79 - bool kvm_timer_is_pending(struct kvm_vcpu *vcpu); 80 - 81 79 u64 kvm_phys_timer_read(void); 82 80 83 81 void kvm_timer_vcpu_load(struct kvm_vcpu *vcpu);
+8
include/kvm/arm_hypercalls.h
··· 40 40 vcpu_set_reg(vcpu, 3, a3); 41 41 } 42 42 43 + struct kvm_one_reg; 44 + 45 + void kvm_arm_init_hypercalls(struct kvm *kvm); 46 + int kvm_arm_get_fw_num_regs(struct kvm_vcpu *vcpu); 47 + int kvm_arm_copy_fw_reg_indices(struct kvm_vcpu *vcpu, u64 __user *uindices); 48 + int kvm_arm_get_fw_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg); 49 + int kvm_arm_set_fw_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg); 50 + 43 51 #endif
+32 -2
include/kvm/arm_pmu.h
··· 20 20 struct perf_event *perf_event; 21 21 }; 22 22 23 + struct kvm_pmu_events { 24 + u32 events_host; 25 + u32 events_guest; 26 + }; 27 + 23 28 struct kvm_pmu { 24 - int irq_num; 29 + struct irq_work overflow_work; 30 + struct kvm_pmu_events events; 25 31 struct kvm_pmc pmc[ARMV8_PMU_MAX_COUNTERS]; 26 32 DECLARE_BITMAP(chained, ARMV8_PMU_MAX_COUNTER_PAIRS); 33 + int irq_num; 27 34 bool created; 28 35 bool irq_level; 29 - struct irq_work overflow_work; 30 36 }; 31 37 32 38 struct arm_pmu_entry { ··· 72 66 int kvm_arm_pmu_v3_has_attr(struct kvm_vcpu *vcpu, 73 67 struct kvm_device_attr *attr); 74 68 int kvm_arm_pmu_v3_enable(struct kvm_vcpu *vcpu); 69 + 70 + struct kvm_pmu_events *kvm_get_pmu_events(void); 71 + void kvm_vcpu_pmu_restore_guest(struct kvm_vcpu *vcpu); 72 + void kvm_vcpu_pmu_restore_host(struct kvm_vcpu *vcpu); 73 + 74 + #define kvm_vcpu_has_pmu(vcpu) \ 75 + (test_bit(KVM_ARM_VCPU_PMU_V3, (vcpu)->arch.features)) 76 + 77 + /* 78 + * Updates the vcpu's view of the pmu events for this cpu. 79 + * Must be called before every vcpu run after disabling interrupts, to ensure 80 + * that an interrupt cannot fire and update the structure. 81 + */ 82 + #define kvm_pmu_update_vcpu_events(vcpu) \ 83 + do { \ 84 + if (!has_vhe() && kvm_vcpu_has_pmu(vcpu)) \ 85 + vcpu->arch.pmu.events = *kvm_get_pmu_events(); \ 86 + } while (0) 87 + 75 88 #else 76 89 struct kvm_pmu { 77 90 }; ··· 151 126 { 152 127 return 0; 153 128 } 129 + 130 + #define kvm_vcpu_has_pmu(vcpu) ({ false; }) 131 + static inline void kvm_pmu_update_vcpu_events(struct kvm_vcpu *vcpu) {} 132 + static inline void kvm_vcpu_pmu_restore_guest(struct kvm_vcpu *vcpu) {} 133 + static inline void kvm_vcpu_pmu_restore_host(struct kvm_vcpu *vcpu) {} 154 134 155 135 #endif 156 136
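The events_host/events_guest split added above gives entry/exit code a per-vCPU record of which perf events belong to each world. A hypothetical model of the bookkeeping: the struct mirrors the patch, but the two mask helpers and the exact enable/disable policy are our assumptions; the real PMU register switching lives in kvm_vcpu_pmu_restore_guest()/kvm_vcpu_pmu_restore_host(), which this patch only declares:

```c
#include <stdint.h>

/* Mirrors the struct introduced in the patch above. */
struct kvm_pmu_events {
	uint32_t events_host;
	uint32_t events_guest;
};

/*
 * Assumed model, not the kernel's code: on guest entry, events tracked
 * only for the host would be switched off and events tracked only for
 * the guest switched on; events present in both sets keep counting.
 */
static uint32_t pmu_entry_disable_mask(const struct kvm_pmu_events *e)
{
	return e->events_host & ~e->events_guest;
}

static uint32_t pmu_entry_enable_mask(const struct kvm_pmu_events *e)
{
	return e->events_guest & ~e->events_host;
}
```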
-7
include/kvm/arm_psci.h
··· 39 39 40 40 int kvm_psci_call(struct kvm_vcpu *vcpu); 41 41 42 - struct kvm_one_reg; 43 - 44 - int kvm_arm_get_fw_num_regs(struct kvm_vcpu *vcpu); 45 - int kvm_arm_copy_fw_reg_indices(struct kvm_vcpu *vcpu, u64 __user *uindices); 46 - int kvm_arm_get_fw_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg); 47 - int kvm_arm_set_fw_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg); 48 - 49 42 #endif /* __KVM_ARM_PSCI_H__ */
+6 -2
include/kvm/arm_vgic.h
··· 231 231 232 232 /* Implementation revision as reported in the GICD_IIDR */ 233 233 u32 implementation_rev; 234 + #define KVM_VGIC_IMP_REV_2 2 /* GICv2 restorable groups */ 235 + #define KVM_VGIC_IMP_REV_3 3 /* GICv3 GICR_CTLR.{IW,CES,RWP} */ 236 + #define KVM_VGIC_IMP_REV_LATEST KVM_VGIC_IMP_REV_3 234 237 235 238 /* Userspace can write to GICv2 IGROUPR */ 236 239 bool v2_groups_user_writable; ··· 347 344 struct vgic_io_device rd_iodev; 348 345 struct vgic_redist_region *rdreg; 349 346 u32 rdreg_index; 347 + atomic_t syncr_busy; 350 348 351 349 /* Contains the attributes and gpa of the LPI pending tables. */ 352 350 u64 pendbaser; 353 - 354 - bool lpis_enabled; 351 + /* GICR_CTLR.{ENABLE_LPIS,RWP} */ 352 + atomic_t ctlr; 355 353 356 354 /* Cache guest priority bits */ 357 355 u32 num_pri_bits;
+3 -1
include/linux/kvm_host.h
··· 614 614 615 615 struct kvm_xen_evtchn { 616 616 u32 port; 617 - u32 vcpu; 617 + u32 vcpu_id; 618 + int vcpu_idx; 618 619 u32 priority; 619 620 }; 620 621 ··· 728 727 * and is accessed atomically. 729 728 */ 730 729 atomic_t online_vcpus; 730 + int max_vcpus; 731 731 int created_vcpus; 732 732 int last_boosted_vcpu; 733 733 struct list_head vm_list;
+52 -2
include/uapi/linux/kvm.h
··· 444 444 #define KVM_SYSTEM_EVENT_SHUTDOWN 1 445 445 #define KVM_SYSTEM_EVENT_RESET 2 446 446 #define KVM_SYSTEM_EVENT_CRASH 3 447 + #define KVM_SYSTEM_EVENT_WAKEUP 4 448 + #define KVM_SYSTEM_EVENT_SUSPEND 5 449 + #define KVM_SYSTEM_EVENT_SEV_TERM 6 447 450 __u32 type; 448 451 __u32 ndata; 449 452 union { ··· 649 646 #define KVM_MP_STATE_OPERATING 7 650 647 #define KVM_MP_STATE_LOAD 8 651 648 #define KVM_MP_STATE_AP_RESET_HOLD 9 649 + #define KVM_MP_STATE_SUSPENDED 10 652 650 653 651 struct kvm_mp_state { 654 652 __u32 mp_state; ··· 1154 1150 #define KVM_CAP_S390_MEM_OP_EXTENSION 211 1155 1151 #define KVM_CAP_PMU_CAPABILITY 212 1156 1152 #define KVM_CAP_DISABLE_QUIRKS2 213 1157 - /* #define KVM_CAP_VM_TSC_CONTROL 214 */ 1153 + #define KVM_CAP_VM_TSC_CONTROL 214 1158 1154 #define KVM_CAP_SYSTEM_EVENT_DATA 215 1155 + #define KVM_CAP_ARM_SYSTEM_SUSPEND 216 1159 1156 1160 1157 #ifdef KVM_CAP_IRQ_ROUTING 1161 1158 ··· 1245 1240 #define KVM_XEN_HVM_CONFIG_SHARED_INFO (1 << 2) 1246 1241 #define KVM_XEN_HVM_CONFIG_RUNSTATE (1 << 3) 1247 1242 #define KVM_XEN_HVM_CONFIG_EVTCHN_2LEVEL (1 << 4) 1243 + #define KVM_XEN_HVM_CONFIG_EVTCHN_SEND (1 << 5) 1248 1244 1249 1245 struct kvm_xen_hvm_config { 1250 1246 __u32 flags; ··· 1484 1478 #define KVM_SET_PIT2 _IOW(KVMIO, 0xa0, struct kvm_pit_state2) 1485 1479 /* Available with KVM_CAP_PPC_GET_PVINFO */ 1486 1480 #define KVM_PPC_GET_PVINFO _IOW(KVMIO, 0xa1, struct kvm_ppc_pvinfo) 1487 - /* Available with KVM_CAP_TSC_CONTROL */ 1481 + /* Available with KVM_CAP_TSC_CONTROL for a vCPU, or with 1482 + * KVM_CAP_VM_TSC_CONTROL to set defaults for a VM */ 1488 1483 #define KVM_SET_TSC_KHZ _IO(KVMIO, 0xa2) 1489 1484 #define KVM_GET_TSC_KHZ _IO(KVMIO, 0xa3) 1490 1485 /* Available with KVM_CAP_PCI_2_3 */ ··· 1701 1694 struct { 1702 1695 __u64 gfn; 1703 1696 } shared_info; 1697 + struct { 1698 + __u32 send_port; 1699 + __u32 type; /* EVTCHNSTAT_ipi / EVTCHNSTAT_interdomain */ 1700 + __u32 flags; 1701 + #define KVM_XEN_EVTCHN_DEASSIGN (1 << 
0) 1702 + #define KVM_XEN_EVTCHN_UPDATE (1 << 1) 1703 + #define KVM_XEN_EVTCHN_RESET (1 << 2) 1704 + /* 1705 + * Events sent by the guest are either looped back to 1706 + * the guest itself (potentially on a different port#) 1707 + * or signalled via an eventfd. 1708 + */ 1709 + union { 1710 + struct { 1711 + __u32 port; 1712 + __u32 vcpu; 1713 + __u32 priority; 1714 + } port; 1715 + struct { 1716 + __u32 port; /* Zero for eventfd */ 1717 + __s32 fd; 1718 + } eventfd; 1719 + __u32 padding[4]; 1720 + } deliver; 1721 + } evtchn; 1722 + __u32 xen_version; 1704 1723 __u64 pad[8]; 1705 1724 } u; 1706 1725 }; ··· 1735 1702 #define KVM_XEN_ATTR_TYPE_LONG_MODE 0x0 1736 1703 #define KVM_XEN_ATTR_TYPE_SHARED_INFO 0x1 1737 1704 #define KVM_XEN_ATTR_TYPE_UPCALL_VECTOR 0x2 1705 + /* Available with KVM_CAP_XEN_HVM / KVM_XEN_HVM_CONFIG_EVTCHN_SEND */ 1706 + #define KVM_XEN_ATTR_TYPE_EVTCHN 0x3 1707 + #define KVM_XEN_ATTR_TYPE_XEN_VERSION 0x4 1738 1708 1739 1709 /* Per-vCPU Xen attributes */ 1740 1710 #define KVM_XEN_VCPU_GET_ATTR _IOWR(KVMIO, 0xca, struct kvm_xen_vcpu_attr) 1741 1711 #define KVM_XEN_VCPU_SET_ATTR _IOW(KVMIO, 0xcb, struct kvm_xen_vcpu_attr) 1712 + 1713 + /* Available with KVM_CAP_XEN_HVM / KVM_XEN_HVM_CONFIG_EVTCHN_SEND */ 1714 + #define KVM_XEN_HVM_EVTCHN_SEND _IOW(KVMIO, 0xd0, struct kvm_irq_routing_xen_evtchn) 1742 1715 1743 1716 #define KVM_GET_SREGS2 _IOR(KVMIO, 0xcc, struct kvm_sregs2) 1744 1717 #define KVM_SET_SREGS2 _IOW(KVMIO, 0xcd, struct kvm_sregs2) ··· 1763 1724 __u64 time_blocked; 1764 1725 __u64 time_offline; 1765 1726 } runstate; 1727 + __u32 vcpu_id; 1728 + struct { 1729 + __u32 port; 1730 + __u32 priority; 1731 + __u64 expires_ns; 1732 + } timer; 1733 + __u8 vector; 1766 1734 } u; 1767 1735 }; 1768 1736 ··· 1780 1734 #define KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_CURRENT 0x3 1781 1735 #define KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_DATA 0x4 1782 1736 #define KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST 0x5 1737 + /* Available with KVM_CAP_XEN_HVM / 
KVM_XEN_HVM_CONFIG_EVTCHN_SEND */ 1738 + #define KVM_XEN_VCPU_ATTR_TYPE_VCPU_ID 0x6 1739 + #define KVM_XEN_VCPU_ATTR_TYPE_TIMER 0x7 1740 + #define KVM_XEN_VCPU_ATTR_TYPE_UPCALL_VECTOR 0x8 1783 1741 1784 1742 /* Secure Encrypted Virtualization command */ 1785 1743 enum sev_cmd_id {
+5
init/Kconfig
··· 77 77 depends on CC_HAS_ASM_GOTO 78 78 def_bool $(success,echo 'int foo(int x) { asm goto ("": "=r"(x) ::: bar); return x; bar: return 0; }' | $(CC) -x c - -c -o /dev/null) 79 79 80 + config CC_HAS_ASM_GOTO_TIED_OUTPUT 81 + depends on CC_HAS_ASM_GOTO_OUTPUT 82 + # Detect buggy gcc and clang, fixed in gcc-11 clang-14. 83 + def_bool $(success,echo 'int foo(int *x) { asm goto (".long (%l[bar]) - .\n": "+m"(*x) ::: bar); return *x; bar: return 0; }' | $(CC) -x c - -c -o /dev/null) 84 + 80 85 config TOOLS_SUPPORT_RELR 81 86 def_bool $(success,env "CC=$(CC)" "LD=$(LD)" "NM=$(NM)" "OBJCOPY=$(OBJCOPY)" $(srctree)/scripts/tools-support-relr.sh) 82 87
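Note: no code is inserted for the Kconfig probe above since `asm goto` with tied outputs only compiles on gcc >= 11 / clang >= 14; the probe itself is the minimal reproducer, and `CC_HAS_ASM_GOTO_TIED_OUTPUT` simply records whether the installed compiler accepts a `"+m"` output operand inside `asm goto`.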
+2 -1
scripts/kallsyms.c
··· 111 111 ".L", /* local labels, .LBB,.Ltmpxxx,.L__unnamed_xx,.LASANPC, etc. */ 112 112 "__crc_", /* modversions */ 113 113 "__efistub_", /* arm64 EFI stub namespace */ 114 - "__kvm_nvhe_", /* arm64 non-VHE KVM namespace */ 114 + "__kvm_nvhe_$", /* arm64 local symbols in non-VHE KVM namespace */ 115 + "__kvm_nvhe_.L", /* arm64 local symbols in non-VHE KVM namespace */ 115 116 "__AArch64ADRPThunk_", /* arm64 lld */ 116 117 "__ARMV5PILongThunk_", /* arm lld */ 117 118 "__ARMV7PILongThunk_",
+193
tools/include/linux/arm-smccc.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0-only */ 2 + /* 3 + * Copyright (c) 2015, Linaro Limited 4 + */ 5 + #ifndef __LINUX_ARM_SMCCC_H 6 + #define __LINUX_ARM_SMCCC_H 7 + 8 + #include <linux/const.h> 9 + 10 + /* 11 + * This file provides common defines for ARM SMC Calling Convention as 12 + * specified in 13 + * https://developer.arm.com/docs/den0028/latest 14 + * 15 + * This code is up-to-date with version DEN 0028 C 16 + */ 17 + 18 + #define ARM_SMCCC_STD_CALL _AC(0,U) 19 + #define ARM_SMCCC_FAST_CALL _AC(1,U) 20 + #define ARM_SMCCC_TYPE_SHIFT 31 21 + 22 + #define ARM_SMCCC_SMC_32 0 23 + #define ARM_SMCCC_SMC_64 1 24 + #define ARM_SMCCC_CALL_CONV_SHIFT 30 25 + 26 + #define ARM_SMCCC_OWNER_MASK 0x3F 27 + #define ARM_SMCCC_OWNER_SHIFT 24 28 + 29 + #define ARM_SMCCC_FUNC_MASK 0xFFFF 30 + 31 + #define ARM_SMCCC_IS_FAST_CALL(smc_val) \ 32 + ((smc_val) & (ARM_SMCCC_FAST_CALL << ARM_SMCCC_TYPE_SHIFT)) 33 + #define ARM_SMCCC_IS_64(smc_val) \ 34 + ((smc_val) & (ARM_SMCCC_SMC_64 << ARM_SMCCC_CALL_CONV_SHIFT)) 35 + #define ARM_SMCCC_FUNC_NUM(smc_val) ((smc_val) & ARM_SMCCC_FUNC_MASK) 36 + #define ARM_SMCCC_OWNER_NUM(smc_val) \ 37 + (((smc_val) >> ARM_SMCCC_OWNER_SHIFT) & ARM_SMCCC_OWNER_MASK) 38 + 39 + #define ARM_SMCCC_CALL_VAL(type, calling_convention, owner, func_num) \ 40 + (((type) << ARM_SMCCC_TYPE_SHIFT) | \ 41 + ((calling_convention) << ARM_SMCCC_CALL_CONV_SHIFT) | \ 42 + (((owner) & ARM_SMCCC_OWNER_MASK) << ARM_SMCCC_OWNER_SHIFT) | \ 43 + ((func_num) & ARM_SMCCC_FUNC_MASK)) 44 + 45 + #define ARM_SMCCC_OWNER_ARCH 0 46 + #define ARM_SMCCC_OWNER_CPU 1 47 + #define ARM_SMCCC_OWNER_SIP 2 48 + #define ARM_SMCCC_OWNER_OEM 3 49 + #define ARM_SMCCC_OWNER_STANDARD 4 50 + #define ARM_SMCCC_OWNER_STANDARD_HYP 5 51 + #define ARM_SMCCC_OWNER_VENDOR_HYP 6 52 + #define ARM_SMCCC_OWNER_TRUSTED_APP 48 53 + #define ARM_SMCCC_OWNER_TRUSTED_APP_END 49 54 + #define ARM_SMCCC_OWNER_TRUSTED_OS 50 55 + #define ARM_SMCCC_OWNER_TRUSTED_OS_END 63 56 + 57 + #define ARM_SMCCC_FUNC_QUERY_CALL_UID 0xff01
58 + 59 + #define ARM_SMCCC_QUIRK_NONE 0 60 + #define ARM_SMCCC_QUIRK_QCOM_A6 1 /* Save/restore register a6 */ 61 + 62 + #define ARM_SMCCC_VERSION_1_0 0x10000 63 + #define ARM_SMCCC_VERSION_1_1 0x10001 64 + #define ARM_SMCCC_VERSION_1_2 0x10002 65 + #define ARM_SMCCC_VERSION_1_3 0x10003 66 + 67 + #define ARM_SMCCC_1_3_SVE_HINT 0x10000 68 + 69 + #define ARM_SMCCC_VERSION_FUNC_ID \ 70 + ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \ 71 + ARM_SMCCC_SMC_32, \ 72 + 0, 0) 73 + 74 + #define ARM_SMCCC_ARCH_FEATURES_FUNC_ID \ 75 + ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \ 76 + ARM_SMCCC_SMC_32, \ 77 + 0, 1) 78 + 79 + #define ARM_SMCCC_ARCH_SOC_ID \ 80 + ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \ 81 + ARM_SMCCC_SMC_32, \ 82 + 0, 2) 83 + 84 + #define ARM_SMCCC_ARCH_WORKAROUND_1 \ 85 + ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \ 86 + ARM_SMCCC_SMC_32, \ 87 + 0, 0x8000) 88 + 89 + #define ARM_SMCCC_ARCH_WORKAROUND_2 \ 90 + ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \ 91 + ARM_SMCCC_SMC_32, \ 92 + 0, 0x7fff) 93 + 94 + #define ARM_SMCCC_ARCH_WORKAROUND_3 \ 95 + ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \ 96 + ARM_SMCCC_SMC_32, \ 97 + 0, 0x3fff) 98 + 99 + #define ARM_SMCCC_VENDOR_HYP_CALL_UID_FUNC_ID \ 100 + ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \ 101 + ARM_SMCCC_SMC_32, \ 102 + ARM_SMCCC_OWNER_VENDOR_HYP, \ 103 + ARM_SMCCC_FUNC_QUERY_CALL_UID) 104 + 105 + /* KVM UID value: 28b46fb6-2ec5-11e9-a9ca-4b564d003a74 */ 106 + #define ARM_SMCCC_VENDOR_HYP_UID_KVM_REG_0 0xb66fb428U 107 + #define ARM_SMCCC_VENDOR_HYP_UID_KVM_REG_1 0xe911c52eU 108 + #define ARM_SMCCC_VENDOR_HYP_UID_KVM_REG_2 0x564bcaa9U 109 + #define ARM_SMCCC_VENDOR_HYP_UID_KVM_REG_3 0x743a004dU 110 + 111 + /* KVM "vendor specific" services */ 112 + #define ARM_SMCCC_KVM_FUNC_FEATURES 0 113 + #define ARM_SMCCC_KVM_FUNC_PTP 1 114 + #define ARM_SMCCC_KVM_FUNC_FEATURES_2 127 115 + #define ARM_SMCCC_KVM_NUM_FUNCS 128 116 + 117 + #define ARM_SMCCC_VENDOR_HYP_KVM_FEATURES_FUNC_ID \
118 + ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \ 119 + ARM_SMCCC_SMC_32, \ 120 + ARM_SMCCC_OWNER_VENDOR_HYP, \ 121 + ARM_SMCCC_KVM_FUNC_FEATURES) 122 + 123 + #define SMCCC_ARCH_WORKAROUND_RET_UNAFFECTED 1 124 + 125 + /* 126 + * ptp_kvm is a feature used for time sync between vm and host. 127 + * ptp_kvm module in guest kernel will get service from host using 128 + * this hypercall ID. 129 + */ 130 + #define ARM_SMCCC_VENDOR_HYP_KVM_PTP_FUNC_ID \ 131 + ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \ 132 + ARM_SMCCC_SMC_32, \ 133 + ARM_SMCCC_OWNER_VENDOR_HYP, \ 134 + ARM_SMCCC_KVM_FUNC_PTP) 135 + 136 + /* ptp_kvm counter type ID */ 137 + #define KVM_PTP_VIRT_COUNTER 0 138 + #define KVM_PTP_PHYS_COUNTER 1 139 + 140 + /* Paravirtualised time calls (defined by ARM DEN0057A) */ 141 + #define ARM_SMCCC_HV_PV_TIME_FEATURES \ 142 + ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \ 143 + ARM_SMCCC_SMC_64, \ 144 + ARM_SMCCC_OWNER_STANDARD_HYP, \ 145 + 0x20) 146 + 147 + #define ARM_SMCCC_HV_PV_TIME_ST \ 148 + ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \ 149 + ARM_SMCCC_SMC_64, \ 150 + ARM_SMCCC_OWNER_STANDARD_HYP, \ 151 + 0x21) 152 + 153 + /* TRNG entropy source calls (defined by ARM DEN0098) */ 154 + #define ARM_SMCCC_TRNG_VERSION \ 155 + ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \ 156 + ARM_SMCCC_SMC_32, \ 157 + ARM_SMCCC_OWNER_STANDARD, \ 158 + 0x50) 159 + 160 + #define ARM_SMCCC_TRNG_FEATURES \ 161 + ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \ 162 + ARM_SMCCC_SMC_32, \ 163 + ARM_SMCCC_OWNER_STANDARD, \ 164 + 0x51) 165 + 166 + #define ARM_SMCCC_TRNG_GET_UUID \ 167 + ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \ 168 + ARM_SMCCC_SMC_32, \ 169 + ARM_SMCCC_OWNER_STANDARD, \ 170 + 0x52) 171 + 172 + #define ARM_SMCCC_TRNG_RND32 \ 173 + ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \ 174 + ARM_SMCCC_SMC_32, \ 175 + ARM_SMCCC_OWNER_STANDARD, \ 176 + 0x53) 177 + 178 + #define ARM_SMCCC_TRNG_RND64 \ 179 + ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \ 180 + ARM_SMCCC_SMC_64, \ 181 + ARM_SMCCC_OWNER_STANDARD, \ 182 + 0x53) 183 +
184 + /* 185 + * Return codes defined in ARM DEN 0070A 186 + * ARM DEN 0070A is now merged/consolidated into ARM DEN 0028 C 187 + */ 188 + #define SMCCC_RET_SUCCESS 0 189 + #define SMCCC_RET_NOT_SUPPORTED -1 190 + #define SMCCC_RET_NOT_REQUIRED -2 191 + #define SMCCC_RET_INVALID_PARAMETER -3 192 + 193 + #endif /*__LINUX_ARM_SMCCC_H*/
+1
tools/testing/selftests/Makefile
··· 11 11 TARGETS += cpu-hotplug 12 12 TARGETS += damon 13 13 TARGETS += drivers/dma-buf 14 + TARGETS += drivers/s390x/uvdevice 14 15 TARGETS += efivarfs 15 16 TARGETS += exec 16 17 TARGETS += filesystems
+1
tools/testing/selftests/drivers/.gitignore
··· 1 1 # SPDX-License-Identifier: GPL-2.0-only 2 2 /dma-buf/udmabuf 3 + /s390x/uvdevice/test_uvdevice
+22
tools/testing/selftests/drivers/s390x/uvdevice/Makefile
··· 1 + include ../../../../../build/Build.include 2 + 3 + UNAME_M := $(shell uname -m) 4 + 5 + ifneq ($(UNAME_M),s390x) 6 + nothing: 7 + .PHONY: all clean run_tests install 8 + .SILENT: 9 + else 10 + 11 + TEST_GEN_PROGS := test_uvdevice 12 + 13 + top_srcdir ?= ../../../../../.. 14 + KSFT_KHDR_INSTALL := 1 15 + khdr_dir = $(top_srcdir)/usr/include 16 + LINUX_TOOL_ARCH_INCLUDE = $(top_srcdir)/tools/arch/$(ARCH)/include 17 + 18 + CFLAGS += -Wall -Werror -static -I$(khdr_dir) -I$(LINUX_TOOL_ARCH_INCLUDE) 19 + 20 + include ../../../lib.mk 21 + 22 + endif
+1
tools/testing/selftests/drivers/s390x/uvdevice/config
··· 1 + CONFIG_S390_UV_UAPI=y
+276
tools/testing/selftests/drivers/s390x/uvdevice/test_uvdevice.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* 3 + * selftest for the Ultravisor UAPI device 4 + * 5 + * Copyright IBM Corp. 2022 6 + * Author(s): Steffen Eiden <seiden@linux.ibm.com> 7 + */ 8 + 9 + #include <stdint.h> 10 + #include <fcntl.h> 11 + #include <errno.h> 12 + #include <sys/ioctl.h> 13 + #include <sys/mman.h> 14 + 15 + #include <asm/uvdevice.h> 16 + 17 + #include "../../../kselftest_harness.h" 18 + 19 + #define UV_PATH "/dev/uv" 20 + #define BUFFER_SIZE 0x200 21 + FIXTURE(uvio_fixture) { 22 + int uv_fd; 23 + struct uvio_ioctl_cb uvio_ioctl; 24 + uint8_t buffer[BUFFER_SIZE]; 25 + __u64 fault_page; 26 + }; 27 + 28 + FIXTURE_VARIANT(uvio_fixture) { 29 + unsigned long ioctl_cmd; 30 + uint32_t arg_size; 31 + }; 32 + 33 + FIXTURE_VARIANT_ADD(uvio_fixture, att) { 34 + .ioctl_cmd = UVIO_IOCTL_ATT, 35 + .arg_size = sizeof(struct uvio_attest), 36 + }; 37 + 38 + FIXTURE_SETUP(uvio_fixture) 39 + { 40 + self->uv_fd = open(UV_PATH, O_ACCMODE); 41 + 42 + self->uvio_ioctl.argument_addr = (__u64)self->buffer; 43 + self->uvio_ioctl.argument_len = variant->arg_size; 44 + self->fault_page = 45 + (__u64)mmap(NULL, (size_t)getpagesize(), PROT_NONE, MAP_ANONYMOUS, -1, 0); 46 + } 47 + 48 + FIXTURE_TEARDOWN(uvio_fixture) 49 + { 50 + if (self->uv_fd) 51 + close(self->uv_fd); 52 + munmap((void *)self->fault_page, (size_t)getpagesize()); 53 + } 54 + 55 + TEST_F(uvio_fixture, fault_ioctl_arg) 56 + { 57 + int rc, errno_cache; 58 + 59 + rc = ioctl(self->uv_fd, variant->ioctl_cmd, NULL); 60 + errno_cache = errno; 61 + ASSERT_EQ(rc, -1); 62 + ASSERT_EQ(errno_cache, EFAULT); 63 + 64 + rc = ioctl(self->uv_fd, variant->ioctl_cmd, self->fault_page); 65 + errno_cache = errno; 66 + ASSERT_EQ(rc, -1); 67 + ASSERT_EQ(errno_cache, EFAULT); 68 + } 69 + 70 + TEST_F(uvio_fixture, fault_uvio_arg) 71 + { 72 + int rc, errno_cache; 73 + 74 + self->uvio_ioctl.argument_addr = 0; 75 + rc = ioctl(self->uv_fd, variant->ioctl_cmd, &self->uvio_ioctl); 76 + errno_cache = errno; 77 + ASSERT_EQ(rc, -1); 78 + 
ASSERT_EQ(errno_cache, EFAULT); 79 + 80 + self->uvio_ioctl.argument_addr = self->fault_page; 81 + rc = ioctl(self->uv_fd, variant->ioctl_cmd, &self->uvio_ioctl); 82 + errno_cache = errno; 83 + ASSERT_EQ(rc, -1); 84 + ASSERT_EQ(errno_cache, EFAULT); 85 + } 86 + 87 + /* 88 + * Test to verify that IOCTLs with invalid values in the ioctl_control block 89 + * are rejected. 90 + */ 91 + TEST_F(uvio_fixture, inval_ioctl_cb) 92 + { 93 + int rc, errno_cache; 94 + 95 + self->uvio_ioctl.argument_len = 0; 96 + rc = ioctl(self->uv_fd, variant->ioctl_cmd, &self->uvio_ioctl); 97 + errno_cache = errno; 98 + ASSERT_EQ(rc, -1); 99 + ASSERT_EQ(errno_cache, EINVAL); 100 + 101 + self->uvio_ioctl.argument_len = (uint32_t)-1; 102 + rc = ioctl(self->uv_fd, variant->ioctl_cmd, &self->uvio_ioctl); 103 + errno_cache = errno; 104 + ASSERT_EQ(rc, -1); 105 + ASSERT_EQ(errno_cache, EINVAL); 106 + self->uvio_ioctl.argument_len = variant->arg_size; 107 + 108 + self->uvio_ioctl.flags = (uint32_t)-1; 109 + rc = ioctl(self->uv_fd, variant->ioctl_cmd, &self->uvio_ioctl); 110 + errno_cache = errno; 111 + ASSERT_EQ(rc, -1); 112 + ASSERT_EQ(errno_cache, EINVAL); 113 + self->uvio_ioctl.flags = 0; 114 + 115 + memset(self->uvio_ioctl.reserved14, 0xff, sizeof(self->uvio_ioctl.reserved14)); 116 + rc = ioctl(self->uv_fd, variant->ioctl_cmd, &self->uvio_ioctl); 117 + errno_cache = errno; 118 + ASSERT_EQ(rc, -1); 119 + ASSERT_EQ(errno_cache, EINVAL); 120 + 121 + memset(&self->uvio_ioctl, 0x11, sizeof(self->uvio_ioctl)); 122 + rc = ioctl(self->uv_fd, variant->ioctl_cmd, &self->uvio_ioctl); 123 + ASSERT_EQ(rc, -1); 124 + } 125 + 126 + TEST_F(uvio_fixture, inval_ioctl_cmd) 127 + { 128 + int rc, errno_cache; 129 + uint8_t nr = _IOC_NR(variant->ioctl_cmd); 130 + unsigned long cmds[] = { 131 + _IOWR('a', nr, struct uvio_ioctl_cb), 132 + _IOWR(UVIO_TYPE_UVC, nr, int), 133 + _IO(UVIO_TYPE_UVC, nr), 134 + _IOR(UVIO_TYPE_UVC, nr, struct uvio_ioctl_cb), 135 + _IOW(UVIO_TYPE_UVC, nr, struct uvio_ioctl_cb), 136 + }; 137 + 
138 + for (size_t i = 0; i < ARRAY_SIZE(cmds); i++) { 139 + rc = ioctl(self->uv_fd, cmds[i], &self->uvio_ioctl); 140 + errno_cache = errno; 141 + ASSERT_EQ(rc, -1); 142 + ASSERT_EQ(errno_cache, ENOTTY); 143 + } 144 + } 145 + 146 + struct test_attest_buffer { 147 + uint8_t arcb[0x180]; 148 + uint8_t meas[64]; 149 + uint8_t add[32]; 150 + }; 151 + 152 + FIXTURE(attest_fixture) { 153 + int uv_fd; 154 + struct uvio_ioctl_cb uvio_ioctl; 155 + struct uvio_attest uvio_attest; 156 + struct test_attest_buffer attest_buffer; 157 + __u64 fault_page; 158 + }; 159 + 160 + FIXTURE_SETUP(attest_fixture) 161 + { 162 + self->uv_fd = open(UV_PATH, O_ACCMODE); 163 + 164 + self->uvio_ioctl.argument_addr = (__u64)&self->uvio_attest; 165 + self->uvio_ioctl.argument_len = sizeof(self->uvio_attest); 166 + 167 + self->uvio_attest.arcb_addr = (__u64)&self->attest_buffer.arcb; 168 + self->uvio_attest.arcb_len = sizeof(self->attest_buffer.arcb); 169 + 170 + self->uvio_attest.meas_addr = (__u64)&self->attest_buffer.meas; 171 + self->uvio_attest.meas_len = sizeof(self->attest_buffer.meas); 172 + 173 + self->uvio_attest.add_data_addr = (__u64)&self->attest_buffer.add; 174 + self->uvio_attest.add_data_len = sizeof(self->attest_buffer.add); 175 + self->fault_page = 176 + (__u64)mmap(NULL, (size_t)getpagesize(), PROT_NONE, MAP_ANONYMOUS, -1, 0); 177 + } 178 + 179 + FIXTURE_TEARDOWN(attest_fixture) 180 + { 181 + if (self->uv_fd) 182 + close(self->uv_fd); 183 + munmap((void *)self->fault_page, (size_t)getpagesize()); 184 + } 185 + 186 + static void att_inval_sizes_test(uint32_t *size, uint32_t max_size, bool test_zero, 187 + struct __test_metadata *_metadata, 188 + FIXTURE_DATA(attest_fixture) *self) 189 + { 190 + int rc, errno_cache; 191 + uint32_t tmp = *size; 192 + 193 + if (test_zero) { 194 + *size = 0; 195 + rc = ioctl(self->uv_fd, UVIO_IOCTL_ATT, &self->uvio_ioctl); 196 + errno_cache = errno; 197 + ASSERT_EQ(rc, -1); 198 + ASSERT_EQ(errno_cache, EINVAL); 199 + } 200 + *size = max_size + 1; 201 
+ rc = ioctl(self->uv_fd, UVIO_IOCTL_ATT, &self->uvio_ioctl); 202 + errno_cache = errno; 203 + ASSERT_EQ(rc, -1); 204 + ASSERT_EQ(errno_cache, EINVAL); 205 + *size = tmp; 206 + } 207 + 208 + /* 209 + * Test to verify that attestation IOCTLs with invalid values in the UVIO 210 + * attestation control block are rejected. 211 + */ 212 + TEST_F(attest_fixture, att_inval_request) 213 + { 214 + int rc, errno_cache; 215 + 216 + att_inval_sizes_test(&self->uvio_attest.add_data_len, UVIO_ATT_ADDITIONAL_MAX_LEN, 217 + false, _metadata, self); 218 + att_inval_sizes_test(&self->uvio_attest.meas_len, UVIO_ATT_MEASUREMENT_MAX_LEN, 219 + true, _metadata, self); 220 + att_inval_sizes_test(&self->uvio_attest.arcb_len, UVIO_ATT_ARCB_MAX_LEN, 221 + true, _metadata, self); 222 + 223 + self->uvio_attest.reserved136 = (uint16_t)-1; 224 + rc = ioctl(self->uv_fd, UVIO_IOCTL_ATT, &self->uvio_ioctl); 225 + errno_cache = errno; 226 + ASSERT_EQ(rc, -1); 227 + ASSERT_EQ(errno_cache, EINVAL); 228 + 229 + memset(&self->uvio_attest, 0x11, sizeof(self->uvio_attest)); 230 + rc = ioctl(self->uv_fd, UVIO_IOCTL_ATT, &self->uvio_ioctl); 231 + ASSERT_EQ(rc, -1); 232 + } 233 + 234 + static void att_inval_addr_test(__u64 *addr, struct __test_metadata *_metadata, 235 + FIXTURE_DATA(attest_fixture) *self) 236 + { 237 + int rc, errno_cache; 238 + __u64 tmp = *addr; 239 + 240 + *addr = 0; 241 + rc = ioctl(self->uv_fd, UVIO_IOCTL_ATT, &self->uvio_ioctl); 242 + errno_cache = errno; 243 + ASSERT_EQ(rc, -1); 244 + ASSERT_EQ(errno_cache, EFAULT); 245 + *addr = self->fault_page; 246 + rc = ioctl(self->uv_fd, UVIO_IOCTL_ATT, &self->uvio_ioctl); 247 + errno_cache = errno; 248 + ASSERT_EQ(rc, -1); 249 + ASSERT_EQ(errno_cache, EFAULT); 250 + *addr = tmp; 251 + } 252 + 253 + TEST_F(attest_fixture, att_inval_addr) 254 + { 255 + att_inval_addr_test(&self->uvio_attest.arcb_addr, _metadata, self); 256 + att_inval_addr_test(&self->uvio_attest.add_data_addr, _metadata, self); 257 + 
att_inval_addr_test(&self->uvio_attest.meas_addr, _metadata, self); 258 + } 259 + 260 + static void __attribute__((constructor)) __constructor_order_last(void) 261 + { 262 + if (!__constructor_order) 263 + __constructor_order = _CONSTRUCTOR_ORDER_BACKWARD; 264 + } 265 + 266 + int main(int argc, char **argv) 267 + { 268 + int fd = open(UV_PATH, O_ACCMODE); 269 + 270 + if (fd < 0) 271 + ksft_exit_skip("No uv-device or cannot access " UV_PATH "\n" 272 + "Enable CONFIG_S390_UV_UAPI and check the access rights on " 273 + UV_PATH ".\n"); 274 + close(fd); 275 + return test_harness_run(argc, argv); 276 + }
+4 -2
tools/testing/selftests/kvm/.gitignore
··· 2 2 /aarch64/arch_timer 3 3 /aarch64/debug-exceptions 4 4 /aarch64/get-reg-list 5 - /aarch64/psci_cpu_on_test 5 + /aarch64/hypercalls 6 + /aarch64/psci_test 6 7 /aarch64/vcpu_width_config 7 8 /aarch64/vgic_init 8 9 /aarch64/vgic_irq ··· 17 16 /x86_64/debug_regs 18 17 /x86_64/evmcs_test 19 18 /x86_64/emulator_error_test 19 + /x86_64/fix_hypercall_test 20 20 /x86_64/get_msr_index_features 21 21 /x86_64/kvm_clock_test 22 22 /x86_64/kvm_pv_test ··· 55 53 /x86_64/xen_shinfo_test 56 54 /x86_64/xen_vmcall_test 57 55 /x86_64/xss_msr_test 58 - /x86_64/vmx_pmu_msrs_test 56 + /x86_64/vmx_pmu_caps_test 59 57 /access_tracking_perf_test 60 58 /demand_paging_test 61 59 /dirty_log_test
+5 -2
tools/testing/selftests/kvm/Makefile
··· 48 48 TEST_GEN_PROGS_x86_64 += x86_64/get_msr_index_features 49 49 TEST_GEN_PROGS_x86_64 += x86_64/evmcs_test 50 50 TEST_GEN_PROGS_x86_64 += x86_64/emulator_error_test 51 + TEST_GEN_PROGS_x86_64 += x86_64/fix_hypercall_test 51 52 TEST_GEN_PROGS_x86_64 += x86_64/hyperv_clock 52 53 TEST_GEN_PROGS_x86_64 += x86_64/hyperv_cpuid 53 54 TEST_GEN_PROGS_x86_64 += x86_64/hyperv_features ··· 66 65 TEST_GEN_PROGS_x86_64 += x86_64/vmx_preemption_timer_test 67 66 TEST_GEN_PROGS_x86_64 += x86_64/svm_vmcall_test 68 67 TEST_GEN_PROGS_x86_64 += x86_64/svm_int_ctl_test 68 + TEST_GEN_PROGS_x86_64 += x86_64/tsc_scaling_sync 69 69 TEST_GEN_PROGS_x86_64 += x86_64/sync_regs_test 70 70 TEST_GEN_PROGS_x86_64 += x86_64/userspace_io_test 71 71 TEST_GEN_PROGS_x86_64 += x86_64/userspace_msr_exit_test ··· 83 81 TEST_GEN_PROGS_x86_64 += x86_64/xss_msr_test 84 82 TEST_GEN_PROGS_x86_64 += x86_64/debug_regs 85 83 TEST_GEN_PROGS_x86_64 += x86_64/tsc_msrs_test 86 - TEST_GEN_PROGS_x86_64 += x86_64/vmx_pmu_msrs_test 84 + TEST_GEN_PROGS_x86_64 += x86_64/vmx_pmu_caps_test 87 85 TEST_GEN_PROGS_x86_64 += x86_64/xen_shinfo_test 88 86 TEST_GEN_PROGS_x86_64 += x86_64/xen_vmcall_test 89 87 TEST_GEN_PROGS_x86_64 += x86_64/sev_migrate_tests ··· 107 105 TEST_GEN_PROGS_aarch64 += aarch64/arch_timer 108 106 TEST_GEN_PROGS_aarch64 += aarch64/debug-exceptions 109 107 TEST_GEN_PROGS_aarch64 += aarch64/get-reg-list 110 - TEST_GEN_PROGS_aarch64 += aarch64/psci_cpu_on_test 108 + TEST_GEN_PROGS_aarch64 += aarch64/hypercalls 109 + TEST_GEN_PROGS_aarch64 += aarch64/psci_test 111 110 TEST_GEN_PROGS_aarch64 += aarch64/vcpu_width_config 112 111 TEST_GEN_PROGS_aarch64 += aarch64/vgic_init 113 112 TEST_GEN_PROGS_aarch64 += aarch64/vgic_irq
+8
tools/testing/selftests/kvm/aarch64/get-reg-list.c
··· 294 294 "%s: Unexpected bits set in FW reg id: 0x%llx", config_name(c), id); 295 295 printf("\tKVM_REG_ARM_FW_REG(%lld),\n", id & 0xffff); 296 296 break; 297 + case KVM_REG_ARM_FW_FEAT_BMAP: 298 + TEST_ASSERT(id == KVM_REG_ARM_FW_FEAT_BMAP_REG(id & 0xffff), 299 + "%s: Unexpected bits set in the bitmap feature FW reg id: 0x%llx", config_name(c), id); 300 + printf("\tKVM_REG_ARM_FW_FEAT_BMAP_REG(%lld),\n", id & 0xffff); 301 + break; 297 302 case KVM_REG_ARM64_SVE: 298 303 if (has_cap(c, KVM_CAP_ARM_SVE)) 299 304 printf("\t%s,\n", sve_id_to_str(c, id)); ··· 697 692 KVM_REG_ARM_FW_REG(1), /* KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1 */ 698 693 KVM_REG_ARM_FW_REG(2), /* KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2 */ 699 694 KVM_REG_ARM_FW_REG(3), /* KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_3 */ 695 + KVM_REG_ARM_FW_FEAT_BMAP_REG(0), /* KVM_REG_ARM_STD_BMAP */ 696 + KVM_REG_ARM_FW_FEAT_BMAP_REG(1), /* KVM_REG_ARM_STD_HYP_BMAP */ 697 + KVM_REG_ARM_FW_FEAT_BMAP_REG(2), /* KVM_REG_ARM_VENDOR_HYP_BMAP */ 700 698 ARM64_SYS_REG(3, 3, 14, 3, 1), /* CNTV_CTL_EL0 */ 701 699 ARM64_SYS_REG(3, 3, 14, 3, 2), /* CNTV_CVAL_EL0 */ 702 700 ARM64_SYS_REG(3, 3, 14, 0, 2),
+336
tools/testing/selftests/kvm/aarch64/hypercalls.c
··· 1 + // SPDX-License-Identifier: GPL-2.0-only 2 + 3 + /* hypercalls: Check the ARM64's psuedo-firmware bitmap register interface. 4 + * 5 + * The test validates the basic hypercall functionalities that are exposed 6 + * via the psuedo-firmware bitmap register. This includes the registers' 7 + * read/write behavior before and after the VM has started, and if the 8 + * hypercalls are properly masked or unmasked to the guest when disabled or 9 + * enabled from the KVM userspace, respectively. 10 + */ 11 + 12 + #include <errno.h> 13 + #include <linux/arm-smccc.h> 14 + #include <asm/kvm.h> 15 + #include <kvm_util.h> 16 + 17 + #include "processor.h" 18 + 19 + #define FW_REG_ULIMIT_VAL(max_feat_bit) (GENMASK(max_feat_bit, 0)) 20 + 21 + /* Last valid bits of the bitmapped firmware registers */ 22 + #define KVM_REG_ARM_STD_BMAP_BIT_MAX 0 23 + #define KVM_REG_ARM_STD_HYP_BMAP_BIT_MAX 0 24 + #define KVM_REG_ARM_VENDOR_HYP_BMAP_BIT_MAX 1 25 + 26 + struct kvm_fw_reg_info { 27 + uint64_t reg; /* Register definition */ 28 + uint64_t max_feat_bit; /* Bit that represents the upper limit of the feature-map */ 29 + }; 30 + 31 + #define FW_REG_INFO(r) \ 32 + { \ 33 + .reg = r, \ 34 + .max_feat_bit = r##_BIT_MAX, \ 35 + } 36 + 37 + static const struct kvm_fw_reg_info fw_reg_info[] = { 38 + FW_REG_INFO(KVM_REG_ARM_STD_BMAP), 39 + FW_REG_INFO(KVM_REG_ARM_STD_HYP_BMAP), 40 + FW_REG_INFO(KVM_REG_ARM_VENDOR_HYP_BMAP), 41 + }; 42 + 43 + enum test_stage { 44 + TEST_STAGE_REG_IFACE, 45 + TEST_STAGE_HVC_IFACE_FEAT_DISABLED, 46 + TEST_STAGE_HVC_IFACE_FEAT_ENABLED, 47 + TEST_STAGE_HVC_IFACE_FALSE_INFO, 48 + TEST_STAGE_END, 49 + }; 50 + 51 + static int stage = TEST_STAGE_REG_IFACE; 52 + 53 + struct test_hvc_info { 54 + uint32_t func_id; 55 + uint64_t arg1; 56 + }; 57 + 58 + #define TEST_HVC_INFO(f, a1) \ 59 + { \ 60 + .func_id = f, \ 61 + .arg1 = a1, \ 62 + } 63 + 64 + static const struct test_hvc_info hvc_info[] = { 65 + /* KVM_REG_ARM_STD_BMAP */ 66 + TEST_HVC_INFO(ARM_SMCCC_TRNG_VERSION, 0), 
67 + TEST_HVC_INFO(ARM_SMCCC_TRNG_FEATURES, ARM_SMCCC_TRNG_RND64), 68 + TEST_HVC_INFO(ARM_SMCCC_TRNG_GET_UUID, 0), 69 + TEST_HVC_INFO(ARM_SMCCC_TRNG_RND32, 0), 70 + TEST_HVC_INFO(ARM_SMCCC_TRNG_RND64, 0), 71 + 72 + /* KVM_REG_ARM_STD_HYP_BMAP */ 73 + TEST_HVC_INFO(ARM_SMCCC_ARCH_FEATURES_FUNC_ID, ARM_SMCCC_HV_PV_TIME_FEATURES), 74 + TEST_HVC_INFO(ARM_SMCCC_HV_PV_TIME_FEATURES, ARM_SMCCC_HV_PV_TIME_ST), 75 + TEST_HVC_INFO(ARM_SMCCC_HV_PV_TIME_ST, 0), 76 + 77 + /* KVM_REG_ARM_VENDOR_HYP_BMAP */ 78 + TEST_HVC_INFO(ARM_SMCCC_VENDOR_HYP_KVM_FEATURES_FUNC_ID, 79 + ARM_SMCCC_VENDOR_HYP_KVM_PTP_FUNC_ID), 80 + TEST_HVC_INFO(ARM_SMCCC_VENDOR_HYP_CALL_UID_FUNC_ID, 0), 81 + TEST_HVC_INFO(ARM_SMCCC_VENDOR_HYP_KVM_PTP_FUNC_ID, KVM_PTP_VIRT_COUNTER), 82 + }; 83 + 84 + /* Feed false hypercall info to test the KVM behavior */ 85 + static const struct test_hvc_info false_hvc_info[] = { 86 + /* Feature support check against a different family of hypercalls */ 87 + TEST_HVC_INFO(ARM_SMCCC_TRNG_FEATURES, ARM_SMCCC_VENDOR_HYP_KVM_PTP_FUNC_ID), 88 + TEST_HVC_INFO(ARM_SMCCC_ARCH_FEATURES_FUNC_ID, ARM_SMCCC_TRNG_RND64), 89 + TEST_HVC_INFO(ARM_SMCCC_HV_PV_TIME_FEATURES, ARM_SMCCC_TRNG_RND64), 90 + }; 91 + 92 + static void guest_test_hvc(const struct test_hvc_info *hc_info) 93 + { 94 + unsigned int i; 95 + struct arm_smccc_res res; 96 + unsigned int hvc_info_arr_sz; 97 + 98 + hvc_info_arr_sz = 99 + hc_info == hvc_info ? 
ARRAY_SIZE(hvc_info) : ARRAY_SIZE(false_hvc_info); 100 + 101 + for (i = 0; i < hvc_info_arr_sz; i++, hc_info++) { 102 + memset(&res, 0, sizeof(res)); 103 + smccc_hvc(hc_info->func_id, hc_info->arg1, 0, 0, 0, 0, 0, 0, &res); 104 + 105 + switch (stage) { 106 + case TEST_STAGE_HVC_IFACE_FEAT_DISABLED: 107 + case TEST_STAGE_HVC_IFACE_FALSE_INFO: 108 + GUEST_ASSERT_3(res.a0 == SMCCC_RET_NOT_SUPPORTED, 109 + res.a0, hc_info->func_id, hc_info->arg1); 110 + break; 111 + case TEST_STAGE_HVC_IFACE_FEAT_ENABLED: 112 + GUEST_ASSERT_3(res.a0 != SMCCC_RET_NOT_SUPPORTED, 113 + res.a0, hc_info->func_id, hc_info->arg1); 114 + break; 115 + default: 116 + GUEST_ASSERT_1(0, stage); 117 + } 118 + } 119 + } 120 + 121 + static void guest_code(void) 122 + { 123 + while (stage != TEST_STAGE_END) { 124 + switch (stage) { 125 + case TEST_STAGE_REG_IFACE: 126 + break; 127 + case TEST_STAGE_HVC_IFACE_FEAT_DISABLED: 128 + case TEST_STAGE_HVC_IFACE_FEAT_ENABLED: 129 + guest_test_hvc(hvc_info); 130 + break; 131 + case TEST_STAGE_HVC_IFACE_FALSE_INFO: 132 + guest_test_hvc(false_hvc_info); 133 + break; 134 + default: 135 + GUEST_ASSERT_1(0, stage); 136 + } 137 + 138 + GUEST_SYNC(stage); 139 + } 140 + 141 + GUEST_DONE(); 142 + } 143 + 144 + static int set_fw_reg(struct kvm_vm *vm, uint64_t id, uint64_t val) 145 + { 146 + struct kvm_one_reg reg = { 147 + .id = id, 148 + .addr = (uint64_t)&val, 149 + }; 150 + 151 + return _vcpu_ioctl(vm, 0, KVM_SET_ONE_REG, &reg); 152 + } 153 + 154 + static void get_fw_reg(struct kvm_vm *vm, uint64_t id, uint64_t *addr) 155 + { 156 + struct kvm_one_reg reg = { 157 + .id = id, 158 + .addr = (uint64_t)addr, 159 + }; 160 + 161 + vcpu_ioctl(vm, 0, KVM_GET_ONE_REG, &reg); 162 + } 163 + 164 + struct st_time { 165 + uint32_t rev; 166 + uint32_t attr; 167 + uint64_t st_time; 168 + }; 169 + 170 + #define STEAL_TIME_SIZE ((sizeof(struct st_time) + 63) & ~63) 171 + #define ST_GPA_BASE (1 << 30) 172 + 173 + static void steal_time_init(struct kvm_vm *vm) 174 + { 175 + uint64_t 
st_ipa = (ulong)ST_GPA_BASE; 176 + unsigned int gpages; 177 + struct kvm_device_attr dev = { 178 + .group = KVM_ARM_VCPU_PVTIME_CTRL, 179 + .attr = KVM_ARM_VCPU_PVTIME_IPA, 180 + .addr = (uint64_t)&st_ipa, 181 + }; 182 + 183 + gpages = vm_calc_num_guest_pages(VM_MODE_DEFAULT, STEAL_TIME_SIZE); 184 + vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS, ST_GPA_BASE, 1, gpages, 0); 185 + 186 + vcpu_ioctl(vm, 0, KVM_SET_DEVICE_ATTR, &dev); 187 + } 188 + 189 + static void test_fw_regs_before_vm_start(struct kvm_vm *vm) 190 + { 191 + uint64_t val; 192 + unsigned int i; 193 + int ret; 194 + 195 + for (i = 0; i < ARRAY_SIZE(fw_reg_info); i++) { 196 + const struct kvm_fw_reg_info *reg_info = &fw_reg_info[i]; 197 + 198 + /* First 'read' should be an upper limit of the features supported */ 199 + get_fw_reg(vm, reg_info->reg, &val); 200 + TEST_ASSERT(val == FW_REG_ULIMIT_VAL(reg_info->max_feat_bit), 201 + "Expected all the features to be set for reg: 0x%lx; expected: 0x%lx; read: 0x%lx\n", 202 + reg_info->reg, FW_REG_ULIMIT_VAL(reg_info->max_feat_bit), val); 203 + 204 + /* Test a 'write' by disabling all the features of the register map */ 205 + ret = set_fw_reg(vm, reg_info->reg, 0); 206 + TEST_ASSERT(ret == 0, 207 + "Failed to clear all the features of reg: 0x%lx; ret: %d\n", 208 + reg_info->reg, errno); 209 + 210 + get_fw_reg(vm, reg_info->reg, &val); 211 + TEST_ASSERT(val == 0, 212 + "Expected all the features to be cleared for reg: 0x%lx\n", reg_info->reg); 213 + 214 + /* 215 + * Test enabling a feature that's not supported. 216 + * Avoid this check if all the bits are occupied. 
217 + */ 218 + if (reg_info->max_feat_bit < 63) { 219 + ret = set_fw_reg(vm, reg_info->reg, BIT(reg_info->max_feat_bit + 1)); 220 + TEST_ASSERT(ret != 0 && errno == EINVAL, 221 + "Unexpected behavior or return value (%d) while setting an unsupported feature for reg: 0x%lx\n", 222 + errno, reg_info->reg); 223 + } 224 + } 225 + } 226 + 227 + static void test_fw_regs_after_vm_start(struct kvm_vm *vm) 228 + { 229 + uint64_t val; 230 + unsigned int i; 231 + int ret; 232 + 233 + for (i = 0; i < ARRAY_SIZE(fw_reg_info); i++) { 234 + const struct kvm_fw_reg_info *reg_info = &fw_reg_info[i]; 235 + 236 + /* 237 + * Before starting the VM, the test clears all the bits. 238 + * Check if that's still the case. 239 + */ 240 + get_fw_reg(vm, reg_info->reg, &val); 241 + TEST_ASSERT(val == 0, 242 + "Expected all the features to be cleared for reg: 0x%lx\n", 243 + reg_info->reg); 244 + 245 + /* 246 + * Since the VM has run at least once, KVM shouldn't allow modification of 247 + * the registers and should return EBUSY. Set the registers and check for 248 + * the expected errno. 
249 + */ 250 + ret = set_fw_reg(vm, reg_info->reg, FW_REG_ULIMIT_VAL(reg_info->max_feat_bit)); 251 + TEST_ASSERT(ret != 0 && errno == EBUSY, 252 + "Unexpected behavior or return value (%d) while setting a feature while VM is running for reg: 0x%lx\n", 253 + errno, reg_info->reg); 254 + } 255 + } 256 + 257 + static struct kvm_vm *test_vm_create(void) 258 + { 259 + struct kvm_vm *vm; 260 + 261 + vm = vm_create_default(0, 0, guest_code); 262 + 263 + ucall_init(vm, NULL); 264 + steal_time_init(vm); 265 + 266 + return vm; 267 + } 268 + 269 + static struct kvm_vm *test_guest_stage(struct kvm_vm *vm) 270 + { 271 + struct kvm_vm *ret_vm = vm; 272 + 273 + pr_debug("Stage: %d\n", stage); 274 + 275 + switch (stage) { 276 + case TEST_STAGE_REG_IFACE: 277 + test_fw_regs_after_vm_start(vm); 278 + break; 279 + case TEST_STAGE_HVC_IFACE_FEAT_DISABLED: 280 + /* Start a new VM so that all the features are now enabled by default */ 281 + kvm_vm_free(vm); 282 + ret_vm = test_vm_create(); 283 + break; 284 + case TEST_STAGE_HVC_IFACE_FEAT_ENABLED: 285 + case TEST_STAGE_HVC_IFACE_FALSE_INFO: 286 + break; 287 + default: 288 + TEST_FAIL("Unknown test stage: %d\n", stage); 289 + } 290 + 291 + stage++; 292 + sync_global_to_guest(vm, stage); 293 + 294 + return ret_vm; 295 + } 296 + 297 + static void test_run(void) 298 + { 299 + struct kvm_vm *vm; 300 + struct ucall uc; 301 + bool guest_done = false; 302 + 303 + vm = test_vm_create(); 304 + 305 + test_fw_regs_before_vm_start(vm); 306 + 307 + while (!guest_done) { 308 + vcpu_run(vm, 0); 309 + 310 + switch (get_ucall(vm, 0, &uc)) { 311 + case UCALL_SYNC: 312 + vm = test_guest_stage(vm); 313 + break; 314 + case UCALL_DONE: 315 + guest_done = true; 316 + break; 317 + case UCALL_ABORT: 318 + TEST_FAIL("%s at %s:%ld\n\tvalues: 0x%lx, 0x%lx; 0x%lx, stage: %u", 319 + (const char *)uc.args[0], __FILE__, uc.args[1], 320 + uc.args[2], uc.args[3], uc.args[4], stage); 321 + break; 322 + default: 323 + TEST_FAIL("Unexpected guest exit\n"); 324 + } 325 + } 
326 + 327 + kvm_vm_free(vm); 328 + } 329 + 330 + int main(void) 331 + { 332 + setbuf(stdout, NULL); 333 + 334 + test_run(); 335 + return 0; 336 + }
-121
tools/testing/selftests/kvm/aarch64/psci_cpu_on_test.c
··· 1 - // SPDX-License-Identifier: GPL-2.0-only 2 - /* 3 - * psci_cpu_on_test - Test that the observable state of a vCPU targeted by the 4 - * CPU_ON PSCI call matches what the caller requested. 5 - * 6 - * Copyright (c) 2021 Google LLC. 7 - * 8 - * This is a regression test for a race between KVM servicing the PSCI call and 9 - * userspace reading the vCPUs registers. 10 - */ 11 - 12 - #define _GNU_SOURCE 13 - 14 - #include <linux/psci.h> 15 - 16 - #include "kvm_util.h" 17 - #include "processor.h" 18 - #include "test_util.h" 19 - 20 - #define VCPU_ID_SOURCE 0 21 - #define VCPU_ID_TARGET 1 22 - 23 - #define CPU_ON_ENTRY_ADDR 0xfeedf00dul 24 - #define CPU_ON_CONTEXT_ID 0xdeadc0deul 25 - 26 - static uint64_t psci_cpu_on(uint64_t target_cpu, uint64_t entry_addr, 27 - uint64_t context_id) 28 - { 29 - register uint64_t x0 asm("x0") = PSCI_0_2_FN64_CPU_ON; 30 - register uint64_t x1 asm("x1") = target_cpu; 31 - register uint64_t x2 asm("x2") = entry_addr; 32 - register uint64_t x3 asm("x3") = context_id; 33 - 34 - asm("hvc #0" 35 - : "=r"(x0) 36 - : "r"(x0), "r"(x1), "r"(x2), "r"(x3) 37 - : "memory"); 38 - 39 - return x0; 40 - } 41 - 42 - static uint64_t psci_affinity_info(uint64_t target_affinity, 43 - uint64_t lowest_affinity_level) 44 - { 45 - register uint64_t x0 asm("x0") = PSCI_0_2_FN64_AFFINITY_INFO; 46 - register uint64_t x1 asm("x1") = target_affinity; 47 - register uint64_t x2 asm("x2") = lowest_affinity_level; 48 - 49 - asm("hvc #0" 50 - : "=r"(x0) 51 - : "r"(x0), "r"(x1), "r"(x2) 52 - : "memory"); 53 - 54 - return x0; 55 - } 56 - 57 - static void guest_main(uint64_t target_cpu) 58 - { 59 - GUEST_ASSERT(!psci_cpu_on(target_cpu, CPU_ON_ENTRY_ADDR, CPU_ON_CONTEXT_ID)); 60 - uint64_t target_state; 61 - 62 - do { 63 - target_state = psci_affinity_info(target_cpu, 0); 64 - 65 - GUEST_ASSERT((target_state == PSCI_0_2_AFFINITY_LEVEL_ON) || 66 - (target_state == PSCI_0_2_AFFINITY_LEVEL_OFF)); 67 - } while (target_state != PSCI_0_2_AFFINITY_LEVEL_ON); 68 - 69 - 
GUEST_DONE(); 70 - } 71 - 72 - int main(void) 73 - { 74 - uint64_t target_mpidr, obs_pc, obs_x0; 75 - struct kvm_vcpu_init init; 76 - struct kvm_vm *vm; 77 - struct ucall uc; 78 - 79 - vm = vm_create(VM_MODE_DEFAULT, DEFAULT_GUEST_PHY_PAGES, O_RDWR); 80 - kvm_vm_elf_load(vm, program_invocation_name); 81 - ucall_init(vm, NULL); 82 - 83 - vm_ioctl(vm, KVM_ARM_PREFERRED_TARGET, &init); 84 - init.features[0] |= (1 << KVM_ARM_VCPU_PSCI_0_2); 85 - 86 - aarch64_vcpu_add_default(vm, VCPU_ID_SOURCE, &init, guest_main); 87 - 88 - /* 89 - * make sure the target is already off when executing the test. 90 - */ 91 - init.features[0] |= (1 << KVM_ARM_VCPU_POWER_OFF); 92 - aarch64_vcpu_add_default(vm, VCPU_ID_TARGET, &init, guest_main); 93 - 94 - get_reg(vm, VCPU_ID_TARGET, KVM_ARM64_SYS_REG(SYS_MPIDR_EL1), &target_mpidr); 95 - vcpu_args_set(vm, VCPU_ID_SOURCE, 1, target_mpidr & MPIDR_HWID_BITMASK); 96 - vcpu_run(vm, VCPU_ID_SOURCE); 97 - 98 - switch (get_ucall(vm, VCPU_ID_SOURCE, &uc)) { 99 - case UCALL_DONE: 100 - break; 101 - case UCALL_ABORT: 102 - TEST_FAIL("%s at %s:%ld", (const char *)uc.args[0], __FILE__, 103 - uc.args[1]); 104 - break; 105 - default: 106 - TEST_FAIL("Unhandled ucall: %lu", uc.cmd); 107 - } 108 - 109 - get_reg(vm, VCPU_ID_TARGET, ARM64_CORE_REG(regs.pc), &obs_pc); 110 - get_reg(vm, VCPU_ID_TARGET, ARM64_CORE_REG(regs.regs[0]), &obs_x0); 111 - 112 - TEST_ASSERT(obs_pc == CPU_ON_ENTRY_ADDR, 113 - "unexpected target cpu pc: %lx (expected: %lx)", 114 - obs_pc, CPU_ON_ENTRY_ADDR); 115 - TEST_ASSERT(obs_x0 == CPU_ON_CONTEXT_ID, 116 - "unexpected target context id: %lx (expected: %lx)", 117 - obs_x0, CPU_ON_CONTEXT_ID); 118 - 119 - kvm_vm_free(vm); 120 - return 0; 121 - }
+213
tools/testing/selftests/kvm/aarch64/psci_test.c
··· 1 + // SPDX-License-Identifier: GPL-2.0-only 2 + /* 3 + * psci_cpu_on_test - Test that the observable state of a vCPU targeted by the 4 + * CPU_ON PSCI call matches what the caller requested. 5 + * 6 + * Copyright (c) 2021 Google LLC. 7 + * 8 + * This is a regression test for a race between KVM servicing the PSCI call and 9 + * userspace reading the vCPUs registers. 10 + */ 11 + 12 + #define _GNU_SOURCE 13 + 14 + #include <linux/psci.h> 15 + 16 + #include "kvm_util.h" 17 + #include "processor.h" 18 + #include "test_util.h" 19 + 20 + #define VCPU_ID_SOURCE 0 21 + #define VCPU_ID_TARGET 1 22 + 23 + #define CPU_ON_ENTRY_ADDR 0xfeedf00dul 24 + #define CPU_ON_CONTEXT_ID 0xdeadc0deul 25 + 26 + static uint64_t psci_cpu_on(uint64_t target_cpu, uint64_t entry_addr, 27 + uint64_t context_id) 28 + { 29 + struct arm_smccc_res res; 30 + 31 + smccc_hvc(PSCI_0_2_FN64_CPU_ON, target_cpu, entry_addr, context_id, 32 + 0, 0, 0, 0, &res); 33 + 34 + return res.a0; 35 + } 36 + 37 + static uint64_t psci_affinity_info(uint64_t target_affinity, 38 + uint64_t lowest_affinity_level) 39 + { 40 + struct arm_smccc_res res; 41 + 42 + smccc_hvc(PSCI_0_2_FN64_AFFINITY_INFO, target_affinity, lowest_affinity_level, 43 + 0, 0, 0, 0, 0, &res); 44 + 45 + return res.a0; 46 + } 47 + 48 + static uint64_t psci_system_suspend(uint64_t entry_addr, uint64_t context_id) 49 + { 50 + struct arm_smccc_res res; 51 + 52 + smccc_hvc(PSCI_1_0_FN64_SYSTEM_SUSPEND, entry_addr, context_id, 53 + 0, 0, 0, 0, 0, &res); 54 + 55 + return res.a0; 56 + } 57 + 58 + static uint64_t psci_features(uint32_t func_id) 59 + { 60 + struct arm_smccc_res res; 61 + 62 + smccc_hvc(PSCI_1_0_FN_PSCI_FEATURES, func_id, 0, 0, 0, 0, 0, 0, &res); 63 + 64 + return res.a0; 65 + } 66 + 67 + static void vcpu_power_off(struct kvm_vm *vm, uint32_t vcpuid) 68 + { 69 + struct kvm_mp_state mp_state = { 70 + .mp_state = KVM_MP_STATE_STOPPED, 71 + }; 72 + 73 + vcpu_set_mp_state(vm, vcpuid, &mp_state); 74 + } 75 + 76 + static struct kvm_vm 
*setup_vm(void *guest_code) 77 + { 78 + struct kvm_vcpu_init init; 79 + struct kvm_vm *vm; 80 + 81 + vm = vm_create(VM_MODE_DEFAULT, DEFAULT_GUEST_PHY_PAGES, O_RDWR); 82 + kvm_vm_elf_load(vm, program_invocation_name); 83 + ucall_init(vm, NULL); 84 + 85 + vm_ioctl(vm, KVM_ARM_PREFERRED_TARGET, &init); 86 + init.features[0] |= (1 << KVM_ARM_VCPU_PSCI_0_2); 87 + 88 + aarch64_vcpu_add_default(vm, VCPU_ID_SOURCE, &init, guest_code); 89 + aarch64_vcpu_add_default(vm, VCPU_ID_TARGET, &init, guest_code); 90 + 91 + return vm; 92 + } 93 + 94 + static void enter_guest(struct kvm_vm *vm, uint32_t vcpuid) 95 + { 96 + struct ucall uc; 97 + 98 + vcpu_run(vm, vcpuid); 99 + if (get_ucall(vm, vcpuid, &uc) == UCALL_ABORT) 100 + TEST_FAIL("%s at %s:%ld", (const char *)uc.args[0], __FILE__, 101 + uc.args[1]); 102 + } 103 + 104 + static void assert_vcpu_reset(struct kvm_vm *vm, uint32_t vcpuid) 105 + { 106 + uint64_t obs_pc, obs_x0; 107 + 108 + get_reg(vm, vcpuid, ARM64_CORE_REG(regs.pc), &obs_pc); 109 + get_reg(vm, vcpuid, ARM64_CORE_REG(regs.regs[0]), &obs_x0); 110 + 111 + TEST_ASSERT(obs_pc == CPU_ON_ENTRY_ADDR, 112 + "unexpected target cpu pc: %lx (expected: %lx)", 113 + obs_pc, CPU_ON_ENTRY_ADDR); 114 + TEST_ASSERT(obs_x0 == CPU_ON_CONTEXT_ID, 115 + "unexpected target context id: %lx (expected: %lx)", 116 + obs_x0, CPU_ON_CONTEXT_ID); 117 + } 118 + 119 + static void guest_test_cpu_on(uint64_t target_cpu) 120 + { 121 + uint64_t target_state; 122 + 123 + GUEST_ASSERT(!psci_cpu_on(target_cpu, CPU_ON_ENTRY_ADDR, CPU_ON_CONTEXT_ID)); 124 + 125 + do { 126 + target_state = psci_affinity_info(target_cpu, 0); 127 + 128 + GUEST_ASSERT((target_state == PSCI_0_2_AFFINITY_LEVEL_ON) || 129 + (target_state == PSCI_0_2_AFFINITY_LEVEL_OFF)); 130 + } while (target_state != PSCI_0_2_AFFINITY_LEVEL_ON); 131 + 132 + GUEST_DONE(); 133 + } 134 + 135 + static void host_test_cpu_on(void) 136 + { 137 + uint64_t target_mpidr; 138 + struct kvm_vm *vm; 139 + struct ucall uc; 140 + 141 + vm = 
setup_vm(guest_test_cpu_on); 142 + 143 + /* 144 + * make sure the target is already off when executing the test. 145 + */ 146 + vcpu_power_off(vm, VCPU_ID_TARGET); 147 + 148 + get_reg(vm, VCPU_ID_TARGET, KVM_ARM64_SYS_REG(SYS_MPIDR_EL1), &target_mpidr); 149 + vcpu_args_set(vm, VCPU_ID_SOURCE, 1, target_mpidr & MPIDR_HWID_BITMASK); 150 + enter_guest(vm, VCPU_ID_SOURCE); 151 + 152 + if (get_ucall(vm, VCPU_ID_SOURCE, &uc) != UCALL_DONE) 153 + TEST_FAIL("Unhandled ucall: %lu", uc.cmd); 154 + 155 + assert_vcpu_reset(vm, VCPU_ID_TARGET); 156 + kvm_vm_free(vm); 157 + } 158 + 159 + static void enable_system_suspend(struct kvm_vm *vm) 160 + { 161 + struct kvm_enable_cap cap = { 162 + .cap = KVM_CAP_ARM_SYSTEM_SUSPEND, 163 + }; 164 + 165 + vm_enable_cap(vm, &cap); 166 + } 167 + 168 + static void guest_test_system_suspend(void) 169 + { 170 + uint64_t ret; 171 + 172 + /* assert that SYSTEM_SUSPEND is discoverable */ 173 + GUEST_ASSERT(!psci_features(PSCI_1_0_FN_SYSTEM_SUSPEND)); 174 + GUEST_ASSERT(!psci_features(PSCI_1_0_FN64_SYSTEM_SUSPEND)); 175 + 176 + ret = psci_system_suspend(CPU_ON_ENTRY_ADDR, CPU_ON_CONTEXT_ID); 177 + GUEST_SYNC(ret); 178 + } 179 + 180 + static void host_test_system_suspend(void) 181 + { 182 + struct kvm_run *run; 183 + struct kvm_vm *vm; 184 + 185 + vm = setup_vm(guest_test_system_suspend); 186 + enable_system_suspend(vm); 187 + 188 + vcpu_power_off(vm, VCPU_ID_TARGET); 189 + run = vcpu_state(vm, VCPU_ID_SOURCE); 190 + 191 + enter_guest(vm, VCPU_ID_SOURCE); 192 + 193 + TEST_ASSERT(run->exit_reason == KVM_EXIT_SYSTEM_EVENT, 194 + "Unhandled exit reason: %u (%s)", 195 + run->exit_reason, exit_reason_str(run->exit_reason)); 196 + TEST_ASSERT(run->system_event.type == KVM_SYSTEM_EVENT_SUSPEND, 197 + "Unhandled system event: %u (expected: %u)", 198 + run->system_event.type, KVM_SYSTEM_EVENT_SUSPEND); 199 + 200 + kvm_vm_free(vm); 201 + } 202 + 203 + int main(void) 204 + { 205 + if (!kvm_check_cap(KVM_CAP_ARM_SYSTEM_SUSPEND)) { 206 + 
print_skip("KVM_CAP_ARM_SYSTEM_SUSPEND not supported"); 207 + exit(KSFT_SKIP); 208 + } 209 + 210 + host_test_cpu_on(); 211 + host_test_system_suspend(); 212 + return 0; 213 + }
+22
tools/testing/selftests/kvm/include/aarch64/processor.h
···
185 185 	asm volatile("msr daifset, #3" : : : "memory");
186 186 }
187 187 
188 + /**
189 +  * struct arm_smccc_res - Result from SMC/HVC call
190 +  * @a0-a3 result values from registers 0 to 3
191 +  */
192 + struct arm_smccc_res {
193 + 	unsigned long a0;
194 + 	unsigned long a1;
195 + 	unsigned long a2;
196 + 	unsigned long a3;
197 + };
198 + 
199 + /**
200 +  * smccc_hvc - Invoke a SMCCC function using the hvc conduit
201 +  * @function_id: the SMCCC function to be called
202 +  * @arg0-arg6: SMCCC function arguments, corresponding to registers x1-x7
203 +  * @res: pointer to write the return values from registers x0-x3
204 +  *
205 +  */
206 + void smccc_hvc(uint32_t function_id, uint64_t arg0, uint64_t arg1,
207 + 	       uint64_t arg2, uint64_t arg3, uint64_t arg4, uint64_t arg5,
208 + 	       uint64_t arg6, struct arm_smccc_res *res);
209 + 
188 210 #endif /* SELFTEST_KVM_PROCESSOR_H */
+5 -3
tools/testing/selftests/kvm/include/riscv/processor.h
···
119 119 #define SATP_ASID_SHIFT	44
120 120 #define SATP_ASID_MASK	_AC(0xFFFF, UL)
121 121 
122 - #define SBI_EXT_EXPERIMENTAL_START 0x08000000
123 - #define SBI_EXT_EXPERIMENTAL_END 0x08FFFFFF
122 + #define SBI_EXT_EXPERIMENTAL_START	0x08000000
123 + #define SBI_EXT_EXPERIMENTAL_END	0x08FFFFFF
124 124 
125 - #define KVM_RISCV_SELFTESTS_SBI_EXT SBI_EXT_EXPERIMENTAL_END
125 + #define KVM_RISCV_SELFTESTS_SBI_EXT	SBI_EXT_EXPERIMENTAL_END
126 + #define KVM_RISCV_SELFTESTS_SBI_UCALL	0
127 + #define KVM_RISCV_SELFTESTS_SBI_UNEXP	1
126 128 
127 129 struct sbiret {
128 130 	long error;
+25
tools/testing/selftests/kvm/lib/aarch64/processor.c
···
500 500 {
501 501 	guest_modes_append_default();
502 502 }
503 + 
504 + void smccc_hvc(uint32_t function_id, uint64_t arg0, uint64_t arg1,
505 + 	       uint64_t arg2, uint64_t arg3, uint64_t arg4, uint64_t arg5,
506 + 	       uint64_t arg6, struct arm_smccc_res *res)
507 + {
508 + 	asm volatile("mov w0, %w[function_id]\n"
509 + 		     "mov x1, %[arg0]\n"
510 + 		     "mov x2, %[arg1]\n"
511 + 		     "mov x3, %[arg2]\n"
512 + 		     "mov x4, %[arg3]\n"
513 + 		     "mov x5, %[arg4]\n"
514 + 		     "mov x6, %[arg5]\n"
515 + 		     "mov x7, %[arg6]\n"
516 + 		     "hvc #0\n"
517 + 		     "mov %[res0], x0\n"
518 + 		     "mov %[res1], x1\n"
519 + 		     "mov %[res2], x2\n"
520 + 		     "mov %[res3], x3\n"
521 + 		     : [res0] "=r"(res->a0), [res1] "=r"(res->a1),
522 + 		       [res2] "=r"(res->a2), [res3] "=r"(res->a3)
523 + 		     : [function_id] "r"(function_id), [arg0] "r"(arg0),
524 + 		       [arg1] "r"(arg1), [arg2] "r"(arg2), [arg3] "r"(arg3),
525 + 		       [arg4] "r"(arg4), [arg5] "r"(arg5), [arg6] "r"(arg6)
526 + 		     : "x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7");
527 + }
+6 -5
tools/testing/selftests/kvm/lib/riscv/processor.c
···
268 268 		core.regs.t3, core.regs.t4, core.regs.t5, core.regs.t6);
269 269 }
270 270 
271 - static void __aligned(16) guest_hang(void)
271 + static void __aligned(16) guest_unexp_trap(void)
272 272 {
273 - 	while (1)
274 - 		;
273 + 	sbi_ecall(KVM_RISCV_SELFTESTS_SBI_EXT,
274 + 		  KVM_RISCV_SELFTESTS_SBI_UNEXP,
275 + 		  0, 0, 0, 0, 0, 0);
275 276 }
276 277 
277 278 void vm_vcpu_add_default(struct kvm_vm *vm, uint32_t vcpuid, void *guest_code)
···
311 310 
312 311 	/* Setup default exception vector of guest */
313 312 	set_reg(vm, vcpuid, RISCV_CSR_REG(stvec),
314 - 		(unsigned long)guest_hang);
313 + 		(unsigned long)guest_unexp_trap);
315 314 }
316 315 
317 316 void vcpu_args_set(struct kvm_vm *vm, uint32_t vcpuid, unsigned int num, ...)
···
351 350 	case 7:
352 351 		id = RISCV_CORE_REG(regs.a7);
353 352 		break;
354 - 	};
353 + 	}
355 354 	set_reg(vm, vcpuid, id, va_arg(ap, uint64_t));
356 355 
+20 -9
tools/testing/selftests/kvm/lib/riscv/ucall.c
···
60 60 		uc.args[i] = va_arg(va, uint64_t);
61 61 	va_end(va);
62 62 
63 - 	sbi_ecall(KVM_RISCV_SELFTESTS_SBI_EXT, 0, (vm_vaddr_t)&uc,
64 - 		  0, 0, 0, 0, 0);
63 + 	sbi_ecall(KVM_RISCV_SELFTESTS_SBI_EXT,
64 + 		  KVM_RISCV_SELFTESTS_SBI_UCALL,
65 + 		  (vm_vaddr_t)&uc, 0, 0, 0, 0, 0);
65 66 }
66 67 
67 68 uint64_t get_ucall(struct kvm_vm *vm, uint32_t vcpu_id, struct ucall *uc)
···
74 73 	memset(uc, 0, sizeof(*uc));
75 74 
76 75 	if (run->exit_reason == KVM_EXIT_RISCV_SBI &&
77 - 	    run->riscv_sbi.extension_id == KVM_RISCV_SELFTESTS_SBI_EXT &&
78 - 	    run->riscv_sbi.function_id == 0) {
79 - 		memcpy(&ucall, addr_gva2hva(vm, run->riscv_sbi.args[0]),
80 - 		       sizeof(ucall));
76 + 	    run->riscv_sbi.extension_id == KVM_RISCV_SELFTESTS_SBI_EXT) {
77 + 		switch (run->riscv_sbi.function_id) {
78 + 		case KVM_RISCV_SELFTESTS_SBI_UCALL:
79 + 			memcpy(&ucall, addr_gva2hva(vm,
80 + 			       run->riscv_sbi.args[0]), sizeof(ucall));
81 81 
82 - 		vcpu_run_complete_io(vm, vcpu_id);
83 - 		if (uc)
84 - 			memcpy(uc, &ucall, sizeof(ucall));
82 + 			vcpu_run_complete_io(vm, vcpu_id);
83 + 			if (uc)
84 + 				memcpy(uc, &ucall, sizeof(ucall));
85 + 
86 + 			break;
87 + 		case KVM_RISCV_SELFTESTS_SBI_UNEXP:
88 + 			vcpu_dump(stderr, vm, vcpu_id, 2);
89 + 			TEST_ASSERT(0, "Unexpected trap taken by guest");
90 + 			break;
91 + 		default:
92 + 			break;
93 + 		}
85 94 	}
86 95 
87 96 	return ucall.cmd;
+45 -1
tools/testing/selftests/kvm/s390x/memop.c
··· 10 10 #include <string.h> 11 11 #include <sys/ioctl.h> 12 12 13 + #include <linux/bits.h> 14 + 13 15 #include "test_util.h" 14 16 #include "kvm_util.h" 15 17 ··· 196 194 #define SIDA_OFFSET(o) ._sida_offset = 1, .sida_offset = (o) 197 195 #define AR(a) ._ar = 1, .ar = (a) 198 196 #define KEY(a) .f_key = 1, .key = (a) 197 + #define INJECT .f_inject = 1 199 198 200 199 #define CHECK_N_DO(f, ...) ({ f(__VA_ARGS__, CHECK_ONLY); f(__VA_ARGS__); }) 201 200 ··· 433 430 TEST_ASSERT(rv == 4, "Should result in protection exception"); \ 434 431 }) 435 432 433 + static void guest_error_key(void) 434 + { 435 + GUEST_SYNC(STAGE_INITED); 436 + set_storage_key_range(mem1, PAGE_SIZE, 0x18); 437 + set_storage_key_range(mem1 + PAGE_SIZE, sizeof(mem1) - PAGE_SIZE, 0x98); 438 + GUEST_SYNC(STAGE_SKEYS_SET); 439 + GUEST_SYNC(STAGE_IDLED); 440 + } 441 + 436 442 static void test_errors_key(void) 437 443 { 438 - struct test_default t = test_default_init(guest_copy_key_fetch_prot); 444 + struct test_default t = test_default_init(guest_error_key); 439 445 440 446 HOST_SYNC(t.vcpu, STAGE_INITED); 441 447 HOST_SYNC(t.vcpu, STAGE_SKEYS_SET); ··· 454 442 CHECK_N_DO(ERR_PROT_MOP, t.vcpu, LOGICAL, READ, mem2, t.size, GADDR_V(mem2), KEY(2)); 455 443 CHECK_N_DO(ERR_PROT_MOP, t.vm, ABSOLUTE, WRITE, mem1, t.size, GADDR_V(mem1), KEY(2)); 456 444 CHECK_N_DO(ERR_PROT_MOP, t.vm, ABSOLUTE, READ, mem2, t.size, GADDR_V(mem2), KEY(2)); 445 + 446 + kvm_vm_free(t.kvm_vm); 447 + } 448 + 449 + static void test_termination(void) 450 + { 451 + struct test_default t = test_default_init(guest_error_key); 452 + uint64_t prefix; 453 + uint64_t teid; 454 + uint64_t teid_mask = BIT(63 - 56) | BIT(63 - 60) | BIT(63 - 61); 455 + uint64_t psw[2]; 456 + 457 + HOST_SYNC(t.vcpu, STAGE_INITED); 458 + HOST_SYNC(t.vcpu, STAGE_SKEYS_SET); 459 + 460 + /* vcpu, mismatching keys after first page */ 461 + ERR_PROT_MOP(t.vcpu, LOGICAL, WRITE, mem1, t.size, GADDR_V(mem1), KEY(1), INJECT); 462 + /* 463 + * The memop injected a program 
exception and the test needs to check the 464 + * Translation-Exception Identification (TEID). It is necessary to run 465 + * the guest in order to be able to read the TEID from guest memory. 466 + * Set the guest program new PSW, so the guest state is not clobbered. 467 + */ 468 + prefix = t.run->s.regs.prefix; 469 + psw[0] = t.run->psw_mask; 470 + psw[1] = t.run->psw_addr; 471 + MOP(t.vm, ABSOLUTE, WRITE, psw, sizeof(psw), GADDR(prefix + 464)); 472 + HOST_SYNC(t.vcpu, STAGE_IDLED); 473 + MOP(t.vm, ABSOLUTE, READ, &teid, sizeof(teid), GADDR(prefix + 168)); 474 + /* Bits 56, 60, 61 form a code, 0 being the only one allowing for termination */ 475 + ASSERT_EQ(teid & teid_mask, 0); 457 476 458 477 kvm_vm_free(t.kvm_vm); 459 478 } ··· 711 668 test_copy_key_fetch_prot(); 712 669 test_copy_key_fetch_prot_override(); 713 670 test_errors_key(); 671 + test_termination(); 714 672 test_errors_key_storage_prot_override(); 715 673 test_errors_key_fetch_prot_override_not_enabled(); 716 674 test_errors_key_fetch_prot_override_enabled();
+3 -10
tools/testing/selftests/kvm/steal_time.c
···
118 118 
119 119 static int64_t smccc(uint32_t func, uint64_t arg)
120 120 {
121 - 	unsigned long ret;
121 + 	struct arm_smccc_res res;
122 122 
123 - 	asm volatile(
124 - 		"mov w0, %w1\n"
125 - 		"mov x1, %2\n"
126 - 		"hvc #0\n"
127 - 		"mov %0, x0\n"
128 - 		: "=r" (ret) : "r" (func), "r" (arg) :
129 - 		"x0", "x1", "x2", "x3");
130 - 
131 - 	return ret;
123 + 	smccc_hvc(func, arg, 0, 0, 0, 0, 0, 0, &res);
124 + 	return res.a0;
132 125 }
133 126 
134 127 static void check_status(struct st_time *st)
+170
tools/testing/selftests/kvm/x86_64/fix_hypercall_test.c
··· 1 + // SPDX-License-Identifier: GPL-2.0-only 2 + /* 3 + * Copyright (C) 2020, Google LLC. 4 + * 5 + * Tests for KVM paravirtual feature disablement 6 + */ 7 + #include <asm/kvm_para.h> 8 + #include <linux/kvm_para.h> 9 + #include <linux/stringify.h> 10 + #include <stdint.h> 11 + 12 + #include "apic.h" 13 + #include "test_util.h" 14 + #include "kvm_util.h" 15 + #include "processor.h" 16 + 17 + #define VCPU_ID 0 18 + 19 + static bool ud_expected; 20 + 21 + static void guest_ud_handler(struct ex_regs *regs) 22 + { 23 + GUEST_ASSERT(ud_expected); 24 + GUEST_DONE(); 25 + } 26 + 27 + extern unsigned char svm_hypercall_insn; 28 + static uint64_t svm_do_sched_yield(uint8_t apic_id) 29 + { 30 + uint64_t ret; 31 + 32 + asm volatile("mov %1, %%rax\n\t" 33 + "mov %2, %%rbx\n\t" 34 + "svm_hypercall_insn:\n\t" 35 + "vmmcall\n\t" 36 + "mov %%rax, %0\n\t" 37 + : "=r"(ret) 38 + : "r"((uint64_t)KVM_HC_SCHED_YIELD), "r"((uint64_t)apic_id) 39 + : "rax", "rbx", "memory"); 40 + 41 + return ret; 42 + } 43 + 44 + extern unsigned char vmx_hypercall_insn; 45 + static uint64_t vmx_do_sched_yield(uint8_t apic_id) 46 + { 47 + uint64_t ret; 48 + 49 + asm volatile("mov %1, %%rax\n\t" 50 + "mov %2, %%rbx\n\t" 51 + "vmx_hypercall_insn:\n\t" 52 + "vmcall\n\t" 53 + "mov %%rax, %0\n\t" 54 + : "=r"(ret) 55 + : "r"((uint64_t)KVM_HC_SCHED_YIELD), "r"((uint64_t)apic_id) 56 + : "rax", "rbx", "memory"); 57 + 58 + return ret; 59 + } 60 + 61 + static void assert_hypercall_insn(unsigned char *exp_insn, unsigned char *obs_insn) 62 + { 63 + uint32_t exp = 0, obs = 0; 64 + 65 + memcpy(&exp, exp_insn, sizeof(exp)); 66 + memcpy(&obs, obs_insn, sizeof(obs)); 67 + 68 + GUEST_ASSERT_EQ(exp, obs); 69 + } 70 + 71 + static void guest_main(void) 72 + { 73 + unsigned char *native_hypercall_insn, *hypercall_insn; 74 + uint8_t apic_id; 75 + 76 + apic_id = GET_APIC_ID_FIELD(xapic_read_reg(APIC_ID)); 77 + 78 + if (is_intel_cpu()) { 79 + native_hypercall_insn = &vmx_hypercall_insn; 80 + hypercall_insn = 
&svm_hypercall_insn; 81 + svm_do_sched_yield(apic_id); 82 + } else if (is_amd_cpu()) { 83 + native_hypercall_insn = &svm_hypercall_insn; 84 + hypercall_insn = &vmx_hypercall_insn; 85 + vmx_do_sched_yield(apic_id); 86 + } else { 87 + GUEST_ASSERT(0); 88 + /* unreachable */ 89 + return; 90 + } 91 + 92 + GUEST_ASSERT(!ud_expected); 93 + assert_hypercall_insn(native_hypercall_insn, hypercall_insn); 94 + GUEST_DONE(); 95 + } 96 + 97 + static void setup_ud_vector(struct kvm_vm *vm) 98 + { 99 + vm_init_descriptor_tables(vm); 100 + vcpu_init_descriptor_tables(vm, VCPU_ID); 101 + vm_install_exception_handler(vm, UD_VECTOR, guest_ud_handler); 102 + } 103 + 104 + static void enter_guest(struct kvm_vm *vm) 105 + { 106 + struct kvm_run *run; 107 + struct ucall uc; 108 + 109 + run = vcpu_state(vm, VCPU_ID); 110 + 111 + vcpu_run(vm, VCPU_ID); 112 + switch (get_ucall(vm, VCPU_ID, &uc)) { 113 + case UCALL_SYNC: 114 + pr_info("%s: %016lx\n", (const char *)uc.args[2], uc.args[3]); 115 + break; 116 + case UCALL_DONE: 117 + return; 118 + case UCALL_ABORT: 119 + TEST_FAIL("%s at %s:%ld", (const char *)uc.args[0], __FILE__, uc.args[1]); 120 + default: 121 + TEST_FAIL("Unhandled ucall: %ld\nexit_reason: %u (%s)", 122 + uc.cmd, run->exit_reason, exit_reason_str(run->exit_reason)); 123 + } 124 + } 125 + 126 + static void test_fix_hypercall(void) 127 + { 128 + struct kvm_vm *vm; 129 + 130 + vm = vm_create_default(VCPU_ID, 0, guest_main); 131 + setup_ud_vector(vm); 132 + 133 + ud_expected = false; 134 + sync_global_to_guest(vm, ud_expected); 135 + 136 + virt_pg_map(vm, APIC_DEFAULT_GPA, APIC_DEFAULT_GPA); 137 + 138 + enter_guest(vm); 139 + } 140 + 141 + static void test_fix_hypercall_disabled(void) 142 + { 143 + struct kvm_enable_cap cap = {0}; 144 + struct kvm_vm *vm; 145 + 146 + vm = vm_create_default(VCPU_ID, 0, guest_main); 147 + setup_ud_vector(vm); 148 + 149 + cap.cap = KVM_CAP_DISABLE_QUIRKS2; 150 + cap.args[0] = KVM_X86_QUIRK_FIX_HYPERCALL_INSN; 151 + vm_enable_cap(vm, &cap); 152 + 
153 + ud_expected = true; 154 + sync_global_to_guest(vm, ud_expected); 155 + 156 + virt_pg_map(vm, APIC_DEFAULT_GPA, APIC_DEFAULT_GPA); 157 + 158 + enter_guest(vm); 159 + } 160 + 161 + int main(void) 162 + { 163 + if (!(kvm_check_cap(KVM_CAP_DISABLE_QUIRKS2) & KVM_X86_QUIRK_FIX_HYPERCALL_INSN)) { 164 + print_skip("KVM_X86_QUIRK_HYPERCALL_INSN not supported"); 165 + exit(KSFT_SKIP); 166 + } 167 + 168 + test_fix_hypercall(); 169 + test_fix_hypercall_disabled(); 170 + }
+119
tools/testing/selftests/kvm/x86_64/tsc_scaling_sync.c
··· 1 + // SPDX-License-Identifier: GPL-2.0-only 2 + /* 3 + * svm_vmcall_test 4 + * 5 + * Copyright © 2021 Amazon.com, Inc. or its affiliates. 6 + * 7 + * Xen shared_info / pvclock testing 8 + */ 9 + 10 + #include "test_util.h" 11 + #include "kvm_util.h" 12 + #include "processor.h" 13 + 14 + #include <stdint.h> 15 + #include <time.h> 16 + #include <sched.h> 17 + #include <signal.h> 18 + #include <pthread.h> 19 + 20 + #define NR_TEST_VCPUS 20 21 + 22 + static struct kvm_vm *vm; 23 + pthread_spinlock_t create_lock; 24 + 25 + #define TEST_TSC_KHZ 2345678UL 26 + #define TEST_TSC_OFFSET 200000000 27 + 28 + uint64_t tsc_sync; 29 + static void guest_code(void) 30 + { 31 + uint64_t start_tsc, local_tsc, tmp; 32 + 33 + start_tsc = rdtsc(); 34 + do { 35 + tmp = READ_ONCE(tsc_sync); 36 + local_tsc = rdtsc(); 37 + WRITE_ONCE(tsc_sync, local_tsc); 38 + if (unlikely(local_tsc < tmp)) 39 + GUEST_SYNC_ARGS(0, local_tsc, tmp, 0, 0); 40 + 41 + } while (local_tsc - start_tsc < 5000 * TEST_TSC_KHZ); 42 + 43 + GUEST_DONE(); 44 + } 45 + 46 + 47 + static void *run_vcpu(void *_cpu_nr) 48 + { 49 + unsigned long cpu = (unsigned long)_cpu_nr; 50 + unsigned long failures = 0; 51 + static bool first_cpu_done; 52 + 53 + /* The kernel is fine, but vm_vcpu_add_default() needs locking */ 54 + pthread_spin_lock(&create_lock); 55 + 56 + vm_vcpu_add_default(vm, cpu, guest_code); 57 + 58 + if (!first_cpu_done) { 59 + first_cpu_done = true; 60 + vcpu_set_msr(vm, cpu, MSR_IA32_TSC, TEST_TSC_OFFSET); 61 + } 62 + 63 + pthread_spin_unlock(&create_lock); 64 + 65 + for (;;) { 66 + volatile struct kvm_run *run = vcpu_state(vm, cpu); 67 + struct ucall uc; 68 + 69 + vcpu_run(vm, cpu); 70 + TEST_ASSERT(run->exit_reason == KVM_EXIT_IO, 71 + "Got exit_reason other than KVM_EXIT_IO: %u (%s)\n", 72 + run->exit_reason, 73 + exit_reason_str(run->exit_reason)); 74 + 75 + switch (get_ucall(vm, cpu, &uc)) { 76 + case UCALL_DONE: 77 + goto out; 78 + 79 + case UCALL_SYNC: 80 + printf("Guest %ld sync %lx %lx %ld\n", cpu, 
uc.args[2], uc.args[3], uc.args[2] - uc.args[3]); 81 + failures++; 82 + break; 83 + 84 + default: 85 + TEST_FAIL("Unknown ucall %lu", uc.cmd); 86 + } 87 + } 88 + out: 89 + return (void *)failures; 90 + } 91 + 92 + int main(int argc, char *argv[]) 93 + { 94 + if (!kvm_check_cap(KVM_CAP_VM_TSC_CONTROL)) { 95 + print_skip("KVM_CAP_VM_TSC_CONTROL not available"); 96 + exit(KSFT_SKIP); 97 + } 98 + 99 + vm = vm_create_default_with_vcpus(0, DEFAULT_STACK_PGS * NR_TEST_VCPUS, 0, guest_code, NULL); 100 + vm_ioctl(vm, KVM_SET_TSC_KHZ, (void *) TEST_TSC_KHZ); 101 + 102 + pthread_spin_init(&create_lock, PTHREAD_PROCESS_PRIVATE); 103 + pthread_t cpu_threads[NR_TEST_VCPUS]; 104 + unsigned long cpu; 105 + for (cpu = 0; cpu < NR_TEST_VCPUS; cpu++) 106 + pthread_create(&cpu_threads[cpu], NULL, run_vcpu, (void *)cpu); 107 + 108 + unsigned long failures = 0; 109 + for (cpu = 0; cpu < NR_TEST_VCPUS; cpu++) { 110 + void *this_cpu_failures; 111 + pthread_join(cpu_threads[cpu], &this_cpu_failures); 112 + failures += (unsigned long)this_cpu_failures; 113 + } 114 + 115 + TEST_ASSERT(!failures, "TSC sync failed"); 116 + pthread_spin_destroy(&create_lock); 117 + kvm_vm_free(vm); 118 + return 0; 119 + }
+10 -8
tools/testing/selftests/kvm/x86_64/vmx_pmu_msrs_test.c -> tools/testing/selftests/kvm/x86_64/vmx_pmu_caps_test.c
···
1 1 // SPDX-License-Identifier: GPL-2.0
2 2 /*
3 - * VMX-pmu related msrs test
3 + * Test for VMX-pmu perf capability msr
4 4 *
5 5 * Copyright (C) 2021 Intel Corporation
6 6 *
7 - * Test to check the effect of various CPUID settings
8 - * on the MSR_IA32_PERF_CAPABILITIES MSR, and check that
9 - * whatever we write with KVM_SET_MSR is _not_ modified
10 - * in the guest and test it can be retrieved with KVM_GET_MSR.
11 - *
12 - * Test to check that invalid LBR formats are rejected.
7 + * Test to check the effect of various CPUID settings on
8 + * MSR_IA32_PERF_CAPABILITIES MSR, and check that what
9 + * we write with KVM_SET_MSR is _not_ modified by the guest
10 + * and check it can be retrieved with KVM_GET_MSR, also test
11 + * the invalid LBR formats are rejected.
13 12 */
14 13 
15 14 #define _GNU_SOURCE /* for program_invocation_short_name */
···
106 107 	ASSERT_EQ(vcpu_get_msr(vm, VCPU_ID, MSR_IA32_PERF_CAPABILITIES), (u64)host_cap.lbr_format);
107 108 
108 109 	/* testcase 3, check invalid LBR format is rejected */
109 - 	ret = _vcpu_set_msr(vm, 0, MSR_IA32_PERF_CAPABILITIES, PMU_CAP_LBR_FMT);
110 + 	/* Note, on Arch LBR capable platforms, LBR_FMT in perf capability msr is 0x3f,
111 + 	 * to avoid the failure, use a true invalid format 0x30 for the test. */
112 + 	ret = _vcpu_set_msr(vm, 0, MSR_IA32_PERF_CAPABILITIES, 0x30);
110 113 	TEST_ASSERT(ret == 0, "Bad PERF_CAPABILITIES didn't fail.");
111 114 
115 + 	printf("Completed perf capability tests.\n");
112 116 	kvm_vm_free(vm);
113 117 }
+354 -12
tools/testing/selftests/kvm/x86_64/xen_shinfo_test.c
···

 #define EVTCHN_VECTOR	0x10

+#define EVTCHN_TEST1 15
+#define EVTCHN_TEST2 66
+#define EVTCHN_TIMER 13
+
 static struct kvm_vm *vm;

 #define XEN_HYPERCALL_MSR	0x40000000

 #define MIN_STEAL_TIME	50000
+
+#define __HYPERVISOR_set_timer_op	15
+#define __HYPERVISOR_sched_op		29
+#define __HYPERVISOR_event_channel_op	32
+
+#define SCHEDOP_poll			3
+
+#define EVTCHNOP_send			4
+
+#define EVTCHNSTAT_interdomain		2
+
+struct evtchn_send {
+	u32 port;
+};
+
+struct sched_poll {
+	u32 *ports;
+	unsigned int nr_ports;
+	u64 timeout;
+};

 struct pvclock_vcpu_time_info {
 	u32 version;
···
 	struct kvm_irq_routing_entry entries[2];
 } irq_routes;

+bool guest_saw_irq;
+
 static void evtchn_handler(struct ex_regs *regs)
 {
 	struct vcpu_info *vi = (void *)VCPU_INFO_VADDR;
 	vi->evtchn_upcall_pending = 0;
 	vi->evtchn_pending_sel = 0;
+	guest_saw_irq = true;

 	GUEST_SYNC(0x20);
+}
+
+static void guest_wait_for_irq(void)
+{
+	while (!guest_saw_irq)
+		__asm__ __volatile__ ("rep nop" : : : "memory");
+	guest_saw_irq = false;
 }

 static void guest_code(void)
···

 	/* Trigger an interrupt injection */
 	GUEST_SYNC(0);
+
+	guest_wait_for_irq();

 	/* Test having the host set runstates manually */
 	GUEST_SYNC(RUNSTATE_runnable);
···
 	/* Now deliver an *unmasked* interrupt */
 	GUEST_SYNC(8);

-	while (!si->evtchn_pending[1])
-		__asm__ __volatile__ ("rep nop" : : : "memory");
+	guest_wait_for_irq();

 	/* Change memslots and deliver an interrupt */
 	GUEST_SYNC(9);

-	for (;;)
-		__asm__ __volatile__ ("rep nop" : : : "memory");
+	guest_wait_for_irq();
+
+	/* Deliver event channel with KVM_XEN_HVM_EVTCHN_SEND */
+	GUEST_SYNC(10);
+
+	guest_wait_for_irq();
+
+	GUEST_SYNC(11);
+
+	/* Our turn. Deliver event channel (to ourselves) with
+	 * EVTCHNOP_send hypercall. */
+	unsigned long rax;
+	struct evtchn_send s = { .port = 127 };
+	__asm__ __volatile__ ("vmcall" :
+			      "=a" (rax) :
+			      "a" (__HYPERVISOR_event_channel_op),
+			      "D" (EVTCHNOP_send),
+			      "S" (&s));
+
+	GUEST_ASSERT(rax == 0);
+
+	guest_wait_for_irq();
+
+	GUEST_SYNC(12);
+
+	/* Deliver "outbound" event channel to an eventfd which
+	 * happens to be one of our own irqfds. */
+	s.port = 197;
+	__asm__ __volatile__ ("vmcall" :
+			      "=a" (rax) :
+			      "a" (__HYPERVISOR_event_channel_op),
+			      "D" (EVTCHNOP_send),
+			      "S" (&s));
+
+	GUEST_ASSERT(rax == 0);
+
+	guest_wait_for_irq();
+
+	GUEST_SYNC(13);
+
+	/* Set a timer 100ms in the future. */
+	__asm__ __volatile__ ("vmcall" :
+			      "=a" (rax) :
+			      "a" (__HYPERVISOR_set_timer_op),
+			      "D" (rs->state_entry_time + 100000000));
+	GUEST_ASSERT(rax == 0);
+
+	GUEST_SYNC(14);
+
+	/* Now wait for the timer */
+	guest_wait_for_irq();
+
+	GUEST_SYNC(15);
+
+	/* The host has 'restored' the timer. Just wait for it. */
+	guest_wait_for_irq();
+
+	GUEST_SYNC(16);
+
+	/* Poll for an event channel port which is already set */
+	u32 ports[1] = { EVTCHN_TIMER };
+	struct sched_poll p = {
+		.ports = ports,
+		.nr_ports = 1,
+		.timeout = 0,
+	};
+
+	__asm__ __volatile__ ("vmcall" :
+			      "=a" (rax) :
+			      "a" (__HYPERVISOR_sched_op),
+			      "D" (SCHEDOP_poll),
+			      "S" (&p));
+
+	GUEST_ASSERT(rax == 0);
+
+	GUEST_SYNC(17);
+
+	/* Poll for an unset port and wait for the timeout. */
+	p.timeout = 100000000;
+	__asm__ __volatile__ ("vmcall" :
+			      "=a" (rax) :
+			      "a" (__HYPERVISOR_sched_op),
+			      "D" (SCHEDOP_poll),
+			      "S" (&p));
+
+	GUEST_ASSERT(rax == 0);
+
+	GUEST_SYNC(18);
+
+	/* A timer will wake the masked port we're waiting on, while we poll */
+	p.timeout = 0;
+	__asm__ __volatile__ ("vmcall" :
+			      "=a" (rax) :
+			      "a" (__HYPERVISOR_sched_op),
+			      "D" (SCHEDOP_poll),
+			      "S" (&p));
+
+	GUEST_ASSERT(rax == 0);
+
+	GUEST_SYNC(19);
+
+	/* A timer wake an *unmasked* port which should wake us with an
+	 * actual interrupt, while we're polling on a different port. */
+	ports[0]++;
+	p.timeout = 0;
+	__asm__ __volatile__ ("vmcall" :
+			      "=a" (rax) :
+			      "a" (__HYPERVISOR_sched_op),
+			      "D" (SCHEDOP_poll),
+			      "S" (&p));
+
+	GUEST_ASSERT(rax == 0);
+
+	guest_wait_for_irq();
+
+	GUEST_SYNC(20);
+
+	/* Timer should have fired already */
+	guest_wait_for_irq();
+
+	GUEST_SYNC(21);
 }

 static int cmp_timespec(struct timespec *a, struct timespec *b)
···
 	else
 		return 0;
 }
+struct vcpu_info *vinfo;

 static void handle_alrm(int sig)
 {
+	if (vinfo)
+		printf("evtchn_upcall_pending 0x%x\n", vinfo->evtchn_upcall_pending);
+	vcpu_dump(stdout, vm, VCPU_ID, 0);
 	TEST_FAIL("IRQ delivery timed out");
 }
···

 	bool do_runstate_tests = !!(xen_caps & KVM_XEN_HVM_CONFIG_RUNSTATE);
 	bool do_eventfd_tests = !!(xen_caps & KVM_XEN_HVM_CONFIG_EVTCHN_2LEVEL);
+	bool do_evtchn_tests = do_eventfd_tests && !!(xen_caps & KVM_XEN_HVM_CONFIG_EVTCHN_SEND);

 	clock_gettime(CLOCK_REALTIME, &min_ts);
···
 		.flags = KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL,
 		.msr = XEN_HYPERCALL_MSR,
 	};
+
+	/* Let the kernel know that we *will* use it for sending all
+	 * event channels, which lets it intercept SCHEDOP_poll */
+	if (do_evtchn_tests)
+		hvmc.flags |= KVM_XEN_HVM_CONFIG_EVTCHN_SEND;
+
 	vm_ioctl(vm, KVM_XEN_HVM_CONFIG, &hvmc);

 	struct kvm_xen_hvm_attr lm = {
···

 	/* Unexpected, but not a KVM failure */
 	if (irq_fd[0] == -1 || irq_fd[1] == -1)
-		do_eventfd_tests = false;
+		do_evtchn_tests = do_eventfd_tests = false;
 	}

 	if (do_eventfd_tests) {
···
 		irq_routes.entries[0].gsi = 32;
 		irq_routes.entries[0].type = KVM_IRQ_ROUTING_XEN_EVTCHN;
-		irq_routes.entries[0].u.xen_evtchn.port = 15;
+		irq_routes.entries[0].u.xen_evtchn.port = EVTCHN_TEST1;
 		irq_routes.entries[0].u.xen_evtchn.vcpu = VCPU_ID;
 		irq_routes.entries[0].u.xen_evtchn.priority = KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL;

 		irq_routes.entries[1].gsi = 33;
 		irq_routes.entries[1].type = KVM_IRQ_ROUTING_XEN_EVTCHN;
-		irq_routes.entries[1].u.xen_evtchn.port = 66;
+		irq_routes.entries[1].u.xen_evtchn.port = EVTCHN_TEST2;
 		irq_routes.entries[1].u.xen_evtchn.vcpu = VCPU_ID;
 		irq_routes.entries[1].u.xen_evtchn.priority = KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL;
···
 		sigaction(SIGALRM, &sa, NULL);
 	}

-	struct vcpu_info *vinfo = addr_gpa2hva(vm, VCPU_INFO_VADDR);
+	struct kvm_xen_vcpu_attr tmr = {
+		.type = KVM_XEN_VCPU_ATTR_TYPE_TIMER,
+		.u.timer.port = EVTCHN_TIMER,
+		.u.timer.priority = KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL,
+		.u.timer.expires_ns = 0
+	};
+
+	if (do_evtchn_tests) {
+		struct kvm_xen_hvm_attr inj = {
+			.type = KVM_XEN_ATTR_TYPE_EVTCHN,
+			.u.evtchn.send_port = 127,
+			.u.evtchn.type = EVTCHNSTAT_interdomain,
+			.u.evtchn.flags = 0,
+			.u.evtchn.deliver.port.port = EVTCHN_TEST1,
+			.u.evtchn.deliver.port.vcpu = VCPU_ID + 1,
+			.u.evtchn.deliver.port.priority = KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL,
+		};
+		vm_ioctl(vm, KVM_XEN_HVM_SET_ATTR, &inj);
+
+		/* Test migration to a different vCPU */
+		inj.u.evtchn.flags = KVM_XEN_EVTCHN_UPDATE;
+		inj.u.evtchn.deliver.port.vcpu = VCPU_ID;
+		vm_ioctl(vm, KVM_XEN_HVM_SET_ATTR, &inj);
+
+		inj.u.evtchn.send_port = 197;
+		inj.u.evtchn.deliver.eventfd.port = 0;
+		inj.u.evtchn.deliver.eventfd.fd = irq_fd[1];
+		inj.u.evtchn.flags = 0;
+		vm_ioctl(vm, KVM_XEN_HVM_SET_ATTR, &inj);
+
+		vcpu_ioctl(vm, VCPU_ID, KVM_XEN_VCPU_SET_ATTR, &tmr);
+	}
+	vinfo = addr_gpa2hva(vm, VCPU_INFO_VADDR);
 	vinfo->evtchn_upcall_pending = 0;

 	struct vcpu_runstate_info *rs = addr_gpa2hva(vm, RUNSTATE_ADDR);
···
 				goto done;
 			if (verbose)
 				printf("Testing masked event channel\n");
-			shinfo->evtchn_mask[0] = 0x8000;
+			shinfo->evtchn_mask[0] = 1UL << EVTCHN_TEST1;
 			eventfd_write(irq_fd[0], 1UL);
 			alarm(1);
 			break;
···
 			break;

 		case 9:
+			TEST_ASSERT(!evtchn_irq_expected,
+				    "Expected event channel IRQ but it didn't happen");
+			shinfo->evtchn_pending[1] = 0;
 			if (verbose)
 				printf("Testing event channel after memslot change\n");
 			vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
···
 			alarm(1);
 			break;

+		case 10:
+			TEST_ASSERT(!evtchn_irq_expected,
+				    "Expected event channel IRQ but it didn't happen");
+			if (!do_evtchn_tests)
+				goto done;
+
+			shinfo->evtchn_pending[0] = 0;
+			if (verbose)
+				printf("Testing injection with KVM_XEN_HVM_EVTCHN_SEND\n");
+
+			struct kvm_irq_routing_xen_evtchn e;
+			e.port = EVTCHN_TEST2;
+			e.vcpu = VCPU_ID;
+			e.priority = KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL;
+
+			vm_ioctl(vm, KVM_XEN_HVM_EVTCHN_SEND, &e);
+			evtchn_irq_expected = true;
+			alarm(1);
+			break;
+
+		case 11:
+			TEST_ASSERT(!evtchn_irq_expected,
+				    "Expected event channel IRQ but it didn't happen");
+			shinfo->evtchn_pending[1] = 0;
+
+			if (verbose)
+				printf("Testing guest EVTCHNOP_send direct to evtchn\n");
+			evtchn_irq_expected = true;
+			alarm(1);
+			break;
+
+		case 12:
+			TEST_ASSERT(!evtchn_irq_expected,
+				    "Expected event channel IRQ but it didn't happen");
+			shinfo->evtchn_pending[0] = 0;
+
+			if (verbose)
+				printf("Testing guest EVTCHNOP_send to eventfd\n");
+			evtchn_irq_expected = true;
+			alarm(1);
+			break;
+
+		case 13:
+			TEST_ASSERT(!evtchn_irq_expected,
+				    "Expected event channel IRQ but it didn't happen");
+			shinfo->evtchn_pending[1] = 0;
+
+			if (verbose)
+				printf("Testing guest oneshot timer\n");
+			break;
+
+		case 14:
+			memset(&tmr, 0, sizeof(tmr));
+			tmr.type = KVM_XEN_VCPU_ATTR_TYPE_TIMER;
+			vcpu_ioctl(vm, VCPU_ID, KVM_XEN_VCPU_GET_ATTR, &tmr);
+			TEST_ASSERT(tmr.u.timer.port == EVTCHN_TIMER,
+				    "Timer port not returned");
+			TEST_ASSERT(tmr.u.timer.priority == KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL,
+				    "Timer priority not returned");
+			TEST_ASSERT(tmr.u.timer.expires_ns > rs->state_entry_time,
+				    "Timer expiry not returned");
+			evtchn_irq_expected = true;
+			alarm(1);
+			break;
+
+		case 15:
+			TEST_ASSERT(!evtchn_irq_expected,
+				    "Expected event channel IRQ but it didn't happen");
+			shinfo->evtchn_pending[0] = 0;
+
+			if (verbose)
+				printf("Testing restored oneshot timer\n");
+
+			tmr.u.timer.expires_ns = rs->state_entry_time + 100000000,
+			vcpu_ioctl(vm, VCPU_ID, KVM_XEN_VCPU_SET_ATTR, &tmr);
+			evtchn_irq_expected = true;
+			alarm(1);
+			break;
+
+		case 16:
+			TEST_ASSERT(!evtchn_irq_expected,
+				    "Expected event channel IRQ but it didn't happen");
+
+			if (verbose)
+				printf("Testing SCHEDOP_poll with already pending event\n");
+			shinfo->evtchn_pending[0] = shinfo->evtchn_mask[0] = 1UL << EVTCHN_TIMER;
+			alarm(1);
+			break;
+
+		case 17:
+			if (verbose)
+				printf("Testing SCHEDOP_poll timeout\n");
+			shinfo->evtchn_pending[0] = 0;
+			alarm(1);
+			break;
+
+		case 18:
+			if (verbose)
+				printf("Testing SCHEDOP_poll wake on masked event\n");
+
+			tmr.u.timer.expires_ns = rs->state_entry_time + 100000000,
+			vcpu_ioctl(vm, VCPU_ID, KVM_XEN_VCPU_SET_ATTR, &tmr);
+			alarm(1);
+			break;
+
+		case 19:
+			shinfo->evtchn_pending[0] = shinfo->evtchn_mask[0] = 0;
+			if (verbose)
+				printf("Testing SCHEDOP_poll wake on unmasked event\n");
+
+			evtchn_irq_expected = true;
+			tmr.u.timer.expires_ns = rs->state_entry_time + 100000000;
+			vcpu_ioctl(vm, VCPU_ID, KVM_XEN_VCPU_SET_ATTR, &tmr);
+
+			/* Read it back and check the pending time is reported correctly */
+			tmr.u.timer.expires_ns = 0;
+			vcpu_ioctl(vm, VCPU_ID, KVM_XEN_VCPU_GET_ATTR, &tmr);
+			TEST_ASSERT(tmr.u.timer.expires_ns == rs->state_entry_time + 100000000,
+				    "Timer not reported pending");
+			alarm(1);
+			break;
+
+		case 20:
+			TEST_ASSERT(!evtchn_irq_expected,
+				    "Expected event channel IRQ but it didn't happen");
+			/* Read timer and check it is no longer pending */
+			vcpu_ioctl(vm, VCPU_ID, KVM_XEN_VCPU_GET_ATTR, &tmr);
+			TEST_ASSERT(!tmr.u.timer.expires_ns, "Timer still reported pending");
+
+			shinfo->evtchn_pending[0] = 0;
+			if (verbose)
+				printf("Testing timer in the past\n");
+
+			evtchn_irq_expected = true;
+			tmr.u.timer.expires_ns = rs->state_entry_time - 100000000ULL;
+			vcpu_ioctl(vm, VCPU_ID, KVM_XEN_VCPU_SET_ATTR, &tmr);
+			alarm(1);
+			break;
+
+		case 21:
+			TEST_ASSERT(!evtchn_irq_expected,
+				    "Expected event channel IRQ but it didn't happen");
+			goto done;
+
 		case 0x20:
 			TEST_ASSERT(evtchn_irq_expected, "Unexpected event channel IRQ");
 			evtchn_irq_expected = false;
-			if (shinfo->evtchn_pending[1] &&
-			    shinfo->evtchn_pending[0])
-				goto done;
 			break;
 		}
 		break;
···
 	}

 done:
+	alarm(0);
 	clock_gettime(CLOCK_REALTIME, &max_ts);

 	/*
+2 -1
virt/kvm/kvm_main.c
···
 	spin_lock_init(&kvm->gpc_lock);

 	INIT_LIST_HEAD(&kvm->devices);
+	kvm->max_vcpus = KVM_MAX_VCPUS;

 	BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);
···
 		return -EINVAL;

 	mutex_lock(&kvm->lock);
-	if (kvm->created_vcpus == KVM_MAX_VCPUS) {
+	if (kvm->created_vcpus >= kvm->max_vcpus) {
 		mutex_unlock(&kvm->lock);
 		return -EINVAL;
 	}