Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
"ARM:
- support for chained PMU counters in guests
- improved SError handling
- handle Neoverse N1 erratum #1349291
- allow side-channel mitigation status to be migrated
- standardise most AArch64 system register accesses to msr_s/mrs_s
- fix host MPIDR corruption on 32bit
- selftests cleanups

x86:
- PMU event {white,black}listing
- ability for the guest to disable host-side interrupt polling
- fixes for enlightened VMCS (Hyper-V pv nested virtualization),
- new hypercall to yield to IPI target
- support for passing cstate MSRs through to the guest
- lots of cleanups and optimizations

Generic:
- Some txt->rST conversions for the documentation"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (128 commits)
Documentation: virtual: Add toctree hooks
Documentation: kvm: Convert cpuid.txt to .rst
Documentation: virtual: Convert paravirt_ops.txt to .rst
KVM: x86: Unconditionally enable irqs in guest context
KVM: x86: PMU Event Filter
kvm: x86: Fix -Wmissing-prototypes warnings
KVM: Properly check if "page" is valid in kvm_vcpu_unmap
KVM: arm/arm64: Initialise host's MPIDRs by reading the actual register
KVM: LAPIC: Retry tune per-vCPU timer_advance_ns if adaptive tuning goes insane
kvm: LAPIC: write down valid APIC registers
KVM: arm64: Migrate _elx sysreg accessors to msr_s/mrs_s
KVM: doc: Add API documentation on the KVM_REG_ARM_WORKAROUNDS register
KVM: arm/arm64: Add save/restore support for firmware workaround state
arm64: KVM: Propagate full Spectre v2 workaround state to KVM guests
KVM: arm/arm64: Support chained PMU counters
KVM: arm/arm64: Remove pmc->bitmask
KVM: arm/arm64: Re-create event when setting counter value
KVM: arm/arm64: Extract duplicated code to own function
KVM: arm/arm64: Rename kvm_pmu_{enable/disable}_counter functions
KVM: LAPIC: ARBPRI is a reserved register for x2APIC
...

+2717 -1456
+2
Documentation/arm64/silicon-errata.rst
··· 86 86 +----------------+-----------------+-----------------+-----------------------------+ 87 87 | ARM | Neoverse-N1 | #1188873,1418040| ARM64_ERRATUM_1418040 | 88 88 +----------------+-----------------+-----------------+-----------------------------+ 89 + | ARM | Neoverse-N1 | #1349291 | N/A | 90 + +----------------+-----------------+-----------------+-----------------------------+ 89 91 | ARM | MMU-500 | #841119,826419 | N/A | 90 92 +----------------+-----------------+-----------------+-----------------------------+ 91 93 +----------------+-----------------+-----------------+-----------------------------+
+18
Documentation/virtual/index.rst
··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + ============================ 4 + Linux Virtualization Support 5 + ============================ 6 + 7 + .. toctree:: 8 + :maxdepth: 2 9 + 10 + kvm/index 11 + paravirt_ops 12 + 13 + .. only:: html and subproject 14 + 15 + Indices 16 + ======= 17 + 18 + * :ref:`genindex`
+28
Documentation/virtual/kvm/api.txt
··· 4081 4081 See KVM_ARM_VCPU_INIT for details of vcpu features that require finalization 4082 4082 using this ioctl. 4083 4083 4084 + 4.120 KVM_SET_PMU_EVENT_FILTER 4085 + 4086 + Capability: KVM_CAP_PMU_EVENT_FILTER 4087 + Architectures: x86 4088 + Type: vm ioctl 4089 + Parameters: struct kvm_pmu_event_filter (in) 4090 + Returns: 0 on success, -1 on error 4091 + 4092 + struct kvm_pmu_event_filter { 4093 + __u32 action; 4094 + __u32 nevents; 4095 + __u64 events[0]; 4096 + }; 4097 + 4098 + This ioctl restricts the set of PMU events that the guest can program. 4099 + The argument holds a list of events which will be allowed or denied. 4100 + The eventsel+umask of each event the guest attempts to program is compared 4101 + against the events field to determine whether the guest should have access. 4102 + This only affects general purpose counters; fixed purpose counters can 4103 + be disabled by changing the perfmon CPUID leaf. 4104 + 4105 + Valid values for 'action': 4106 + #define KVM_PMU_EVENT_ALLOW 0 4107 + #define KVM_PMU_EVENT_DENY 1 4108 + 4109 + 4084 4110 5. The kvm_run structure 4085 4111 ------------------------ 4086 4112 ··· 4935 4909 4936 4910 #define KVM_X86_DISABLE_EXITS_MWAIT (1 << 0) 4937 4911 #define KVM_X86_DISABLE_EXITS_HLT (1 << 1) 4912 + #define KVM_X86_DISABLE_EXITS_PAUSE (1 << 2) 4913 + #define KVM_X86_DISABLE_EXITS_CSTATE (1 << 3) 4938 4914 4939 4915 Enabling this capability on a VM provides userspace with a way to no 4940 4916 longer intercept some instructions for improved latency in some
+31
Documentation/virtual/kvm/arm/psci.txt
··· 28 28 - Allows any PSCI version implemented by KVM and compatible with 29 29 v0.2 to be set with SET_ONE_REG 30 30 - Affects the whole VM (even if the register view is per-vcpu) 31 + 32 + * KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1: 33 + Holds the state of the firmware support to mitigate CVE-2017-5715, as 34 + offered by KVM to the guest via a HVC call. The workaround is described 35 + under SMCCC_ARCH_WORKAROUND_1 in [1]. 36 + Accepted values are: 37 + KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_NOT_AVAIL: KVM does not offer 38 + firmware support for the workaround. The mitigation status for the 39 + guest is unknown. 40 + KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_AVAIL: The workaround HVC call is 41 + available to the guest and required for the mitigation. 42 + KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_NOT_REQUIRED: The workaround HVC call 43 + is available to the guest, but it is not needed on this VCPU. 44 + 45 + * KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2: 46 + Holds the state of the firmware support to mitigate CVE-2018-3639, as 47 + offered by KVM to the guest via a HVC call. The workaround is described 48 + under SMCCC_ARCH_WORKAROUND_2 in [1]. 49 + Accepted values are: 50 + KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_NOT_AVAIL: A workaround is not 51 + available. KVM does not offer firmware support for the workaround. 52 + KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_UNKNOWN: The workaround state is 53 + unknown. KVM does not offer firmware support for the workaround. 54 + KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_AVAIL: The workaround is available, 55 + and can be disabled by a vCPU. If 56 + KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_ENABLED is set, it is active for 57 + this vCPU. 58 + KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_NOT_REQUIRED: The workaround is 59 + always active on this vCPU or it is not needed. 60 + 61 + [1] https://developer.arm.com/-/media/developer/pdf/ARM_DEN_0070A_Firmware_interfaces_for_mitigating_CVE-2017-5715.pdf
+107
Documentation/virtual/kvm/cpuid.rst
··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + ============== 4 + KVM CPUID bits 5 + ============== 6 + 7 + :Author: Glauber Costa <glommer@gmail.com> 8 + 9 + A guest running on a kvm host, can check some of its features using 10 + cpuid. This is not always guaranteed to work, since userspace can 11 + mask-out some, or even all KVM-related cpuid features before launching 12 + a guest. 13 + 14 + KVM cpuid functions are: 15 + 16 + function: KVM_CPUID_SIGNATURE (0x40000000) 17 + 18 + returns:: 19 + 20 + eax = 0x40000001 21 + ebx = 0x4b4d564b 22 + ecx = 0x564b4d56 23 + edx = 0x4d 24 + 25 + Note that this value in ebx, ecx and edx corresponds to the string "KVMKVMKVM". 26 + The value in eax corresponds to the maximum cpuid function present in this leaf, 27 + and will be updated if more functions are added in the future. 28 + Note also that old hosts set eax value to 0x0. This should 29 + be interpreted as if the value was 0x40000001. 30 + This function queries the presence of KVM cpuid leafs. 
31 + 32 + function: define KVM_CPUID_FEATURES (0x40000001) 33 + 34 + returns:: 35 + 36 + ebx, ecx 37 + eax = an OR'ed group of (1 << flag) 38 + 39 + where ``flag`` is defined as below: 40 + 41 + ================================= =========== ================================ 42 + flag value meaning 43 + ================================= =========== ================================ 44 + KVM_FEATURE_CLOCKSOURCE 0 kvmclock available at msrs 45 + 0x11 and 0x12 46 + 47 + KVM_FEATURE_NOP_IO_DELAY 1 not necessary to perform delays 48 + on PIO operations 49 + 50 + KVM_FEATURE_MMU_OP 2 deprecated 51 + 52 + KVM_FEATURE_CLOCKSOURCE2 3 kvmclock available at msrs 53 + 54 + 0x4b564d00 and 0x4b564d01 55 + KVM_FEATURE_ASYNC_PF 4 async pf can be enabled by 56 + writing to msr 0x4b564d02 57 + 58 + KVM_FEATURE_STEAL_TIME 5 steal time can be enabled by 59 + writing to msr 0x4b564d03 60 + 61 + KVM_FEATURE_PV_EOI 6 paravirtualized end of interrupt 62 + handler can be enabled by 63 + writing to msr 0x4b564d04 64 + 65 + KVM_FEATURE_PV_UNHALT 7 guest checks this feature bit 66 + before enabling paravirtualized 67 + spinlock support 68 + 69 + KVM_FEATURE_PV_TLB_FLUSH 9 guest checks this feature bit 70 + before enabling paravirtualized 71 + tlb flush 72 + 73 + KVM_FEATURE_ASYNC_PF_VMEXIT 10 paravirtualized async PF VM EXIT 74 + can be enabled by setting bit 2 75 + when writing to msr 0x4b564d02 76 + 77 + KVM_FEATURE_PV_SEND_IPI 11 guest checks this feature bit 78 + before enabling paravirtualized 79 + send IPIs 80 + 81 + KVM_FEATURE_PV_POLL_CONTROL 12 host-side polling on HLT can 82 + be disabled by writing 83 + to msr 0x4b564d05. 84 + 85 + KVM_FEATURE_PV_SCHED_YIELD 13 guest checks this feature bit 86 + before using paravirtualized 87 + sched yield. 
88 + 89 + KVM_FEATURE_CLOCKSOURCE_STABLE_BIT 24 host will warn if no guest-side 90 + per-cpu warps are expected in 91 + kvmclock 92 + ================================= =========== ================================ 93 + 94 + :: 95 + 96 + edx = an OR'ed group of (1 << flag) 97 + 98 + Where ``flag`` here is defined as below: 99 + 100 + ================== ============ ================================= 101 + flag value meaning 102 + ================== ============ ================================= 103 + KVM_HINTS_REALTIME 0 guest checks this feature bit to 104 + determine that vCPUs are never 105 + preempted for an unlimited time 106 + allowing optimizations 107 + ================== ============ =================================
-83
Documentation/virtual/kvm/cpuid.txt
··· 1 - KVM CPUID bits 2 - Glauber Costa <glommer@redhat.com>, Red Hat Inc, 2010 3 - ===================================================== 4 - 5 - A guest running on a kvm host, can check some of its features using 6 - cpuid. This is not always guaranteed to work, since userspace can 7 - mask-out some, or even all KVM-related cpuid features before launching 8 - a guest. 9 - 10 - KVM cpuid functions are: 11 - 12 - function: KVM_CPUID_SIGNATURE (0x40000000) 13 - returns : eax = 0x40000001, 14 - ebx = 0x4b4d564b, 15 - ecx = 0x564b4d56, 16 - edx = 0x4d. 17 - Note that this value in ebx, ecx and edx corresponds to the string "KVMKVMKVM". 18 - The value in eax corresponds to the maximum cpuid function present in this leaf, 19 - and will be updated if more functions are added in the future. 20 - Note also that old hosts set eax value to 0x0. This should 21 - be interpreted as if the value was 0x40000001. 22 - This function queries the presence of KVM cpuid leafs. 23 - 24 - 25 - function: define KVM_CPUID_FEATURES (0x40000001) 26 - returns : ebx, ecx 27 - eax = an OR'ed group of (1 << flag), where each flags is: 28 - 29 - 30 - flag || value || meaning 31 - ============================================================================= 32 - KVM_FEATURE_CLOCKSOURCE || 0 || kvmclock available at msrs 33 - || || 0x11 and 0x12. 34 - ------------------------------------------------------------------------------ 35 - KVM_FEATURE_NOP_IO_DELAY || 1 || not necessary to perform delays 36 - || || on PIO operations. 37 - ------------------------------------------------------------------------------ 38 - KVM_FEATURE_MMU_OP || 2 || deprecated. 
39 - ------------------------------------------------------------------------------ 40 - KVM_FEATURE_CLOCKSOURCE2 || 3 || kvmclock available at msrs 41 - || || 0x4b564d00 and 0x4b564d01 42 - ------------------------------------------------------------------------------ 43 - KVM_FEATURE_ASYNC_PF || 4 || async pf can be enabled by 44 - || || writing to msr 0x4b564d02 45 - ------------------------------------------------------------------------------ 46 - KVM_FEATURE_STEAL_TIME || 5 || steal time can be enabled by 47 - || || writing to msr 0x4b564d03. 48 - ------------------------------------------------------------------------------ 49 - KVM_FEATURE_PV_EOI || 6 || paravirtualized end of interrupt 50 - || || handler can be enabled by writing 51 - || || to msr 0x4b564d04. 52 - ------------------------------------------------------------------------------ 53 - KVM_FEATURE_PV_UNHALT || 7 || guest checks this feature bit 54 - || || before enabling paravirtualized 55 - || || spinlock support. 56 - ------------------------------------------------------------------------------ 57 - KVM_FEATURE_PV_TLB_FLUSH || 9 || guest checks this feature bit 58 - || || before enabling paravirtualized 59 - || || tlb flush. 60 - ------------------------------------------------------------------------------ 61 - KVM_FEATURE_ASYNC_PF_VMEXIT || 10 || paravirtualized async PF VM exit 62 - || || can be enabled by setting bit 2 63 - || || when writing to msr 0x4b564d02 64 - ------------------------------------------------------------------------------ 65 - KVM_FEATURE_PV_SEND_IPI || 11 || guest checks this feature bit 66 - || || before using paravirtualized 67 - || || send IPIs. 68 - ------------------------------------------------------------------------------ 69 - KVM_FEATURE_CLOCKSOURCE_STABLE_BIT || 24 || host will warn if no guest-side 70 - || || per-cpu warps are expected in 71 - || || kvmclock. 
72 - ------------------------------------------------------------------------------ 73 - 74 - edx = an OR'ed group of (1 << flag), where each flags is: 75 - 76 - 77 - flag || value || meaning 78 - ================================================================================== 79 - KVM_HINTS_REALTIME || 0 || guest checks this feature bit to 80 - || || determine that vCPUs are never 81 - || || preempted for an unlimited time, 82 - || || allowing optimizations 83 - ----------------------------------------------------------------------------------
+11
Documentation/virtual/kvm/hypercalls.txt
··· 141 141 corresponds to the APIC ID a2+1, and so on. 142 142 143 143 Returns the number of CPUs to which the IPIs were delivered successfully. 144 + 145 + 7. KVM_HC_SCHED_YIELD 146 + ------------------------ 147 + Architecture: x86 148 + Status: active 149 + Purpose: Hypercall used to yield if the IPI target vCPU is preempted 150 + 151 + a0: destination APIC ID 152 + 153 + Usage example: When sending a call-function IPI-many to vCPUs, yield if 154 + any of the IPI target vCPUs was preempted.
+11
Documentation/virtual/kvm/index.rst
··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + === 4 + KVM 5 + === 6 + 7 + .. toctree:: 8 + :maxdepth: 2 9 + 10 + amd-memory-encryption 11 + cpuid
+1 -3
Documentation/virtual/kvm/locking.txt
··· 15 15 16 16 On x86, vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock. 17 17 18 - For spinlocks, kvm_lock is taken outside kvm->mmu_lock. 19 - 20 18 Everything else is a leaf: no other lock is taken inside the critical 21 19 sections. 22 20 ··· 167 169 ------------ 168 170 169 171 Name: kvm_lock 170 - Type: spinlock_t 172 + Type: mutex 171 173 Arch: any 172 174 Protects: - vm_list 173 175
+9
Documentation/virtual/kvm/msr.txt
··· 273 273 guest must both read the least significant bit in the memory area and 274 274 clear it using a single CPU instruction, such as test and clear, or 275 275 compare and exchange. 276 + 277 + MSR_KVM_POLL_CONTROL: 0x4b564d05 278 + Control host-side polling. 279 + 280 + data: Bit 0 enables (1) or disables (0) host-side HLT polling logic. 281 + 282 + KVM guests can request the host not to poll on HLT, for example if 283 + they are performing polling themselves. 284 +
+11 -8
Documentation/virtual/paravirt_ops.txt Documentation/virtual/paravirt_ops.rst
··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + ============ 1 4 Paravirt_ops 2 5 ============ 3 6 ··· 21 18 pv_ops operations are classified into three categories: 22 19 23 20 - simple indirect call 24 - These operations correspond to high level functionality where it is 25 - known that the overhead of indirect call isn't very important. 21 + These operations correspond to high level functionality where it is 22 + known that the overhead of indirect call isn't very important. 26 23 27 24 - indirect call which allows optimization with binary patch 28 - Usually these operations correspond to low level critical instructions. They 29 - are called frequently and are performance critical. The overhead is 30 - very important. 25 + Usually these operations correspond to low level critical instructions. They 26 + are called frequently and are performance critical. The overhead is 27 + very important. 31 28 32 29 - a set of macros for hand written assembly code 33 - Hand written assembly codes (.S files) also need paravirtualization 34 - because they include sensitive instructions or some of code paths in 35 - them are very performance critical. 30 + Hand written assembly codes (.S files) also need paravirtualization 31 + because they include sensitive instructions or some of code paths in 32 + them are very performance critical.
+10
arch/arm/include/asm/kvm_emulate.h
··· 271 271 return vcpu_cp15(vcpu, c0_MPIDR) & MPIDR_HWID_BITMASK; 272 272 } 273 273 274 + static inline bool kvm_arm_get_vcpu_workaround_2_flag(struct kvm_vcpu *vcpu) 275 + { 276 + return false; 277 + } 278 + 279 + static inline void kvm_arm_set_vcpu_workaround_2_flag(struct kvm_vcpu *vcpu, 280 + bool flag) 281 + { 282 + } 283 + 274 284 static inline void kvm_vcpu_set_be(struct kvm_vcpu *vcpu) 275 285 { 276 286 *vcpu_cpsr(vcpu) |= PSR_E_BIT;
+11 -7
arch/arm/include/asm/kvm_host.h
··· 15 15 #include <asm/kvm_asm.h> 16 16 #include <asm/kvm_mmio.h> 17 17 #include <asm/fpstate.h> 18 - #include <asm/smp_plat.h> 19 18 #include <kvm/arm_arch_timer.h> 20 19 21 20 #define __KVM_HAVE_ARCH_INTC_INITIALIZED ··· 146 147 147 148 typedef struct kvm_host_data kvm_host_data_t; 148 149 149 - static inline void kvm_init_host_cpu_context(struct kvm_cpu_context *cpu_ctxt, 150 - int cpu) 150 + static inline void kvm_init_host_cpu_context(struct kvm_cpu_context *cpu_ctxt) 151 151 { 152 152 /* The host's MPIDR is immutable, so let's set it up at boot time */ 153 - cpu_ctxt->cp15[c0_MPIDR] = cpu_logical_map(cpu); 153 + cpu_ctxt->cp15[c0_MPIDR] = read_cpuid_mpidr(); 154 154 } 155 155 156 156 struct vcpu_reset_state { ··· 360 362 static inline void kvm_arm_vhe_guest_enter(void) {} 361 363 static inline void kvm_arm_vhe_guest_exit(void) {} 362 364 363 - static inline bool kvm_arm_harden_branch_predictor(void) 365 + #define KVM_BP_HARDEN_UNKNOWN -1 366 + #define KVM_BP_HARDEN_WA_NEEDED 0 367 + #define KVM_BP_HARDEN_NOT_REQUIRED 1 368 + 369 + static inline int kvm_arm_harden_branch_predictor(void) 364 370 { 365 371 switch(read_cpuid_part()) { 366 372 #ifdef CONFIG_HARDEN_BRANCH_PREDICTOR ··· 372 370 case ARM_CPU_PART_CORTEX_A12: 373 371 case ARM_CPU_PART_CORTEX_A15: 374 372 case ARM_CPU_PART_CORTEX_A17: 375 - return true; 373 + return KVM_BP_HARDEN_WA_NEEDED; 376 374 #endif 375 + case ARM_CPU_PART_CORTEX_A7: 376 + return KVM_BP_HARDEN_NOT_REQUIRED; 377 377 default: 378 - return false; 378 + return KVM_BP_HARDEN_UNKNOWN; 379 379 } 380 380 } 381 381
+7 -6
arch/arm/include/asm/kvm_hyp.h
··· 82 82 #define VFP_FPEXC __ACCESS_VFP(FPEXC) 83 83 84 84 /* AArch64 compatibility macros, only for the timer so far */ 85 - #define read_sysreg_el0(r) read_sysreg(r##_el0) 86 - #define write_sysreg_el0(v, r) write_sysreg(v, r##_el0) 85 + #define read_sysreg_el0(r) read_sysreg(r##_EL0) 86 + #define write_sysreg_el0(v, r) write_sysreg(v, r##_EL0) 87 87 88 - #define cntp_ctl_el0 CNTP_CTL 89 - #define cntp_cval_el0 CNTP_CVAL 90 - #define cntv_ctl_el0 CNTV_CTL 91 - #define cntv_cval_el0 CNTV_CVAL 88 + #define SYS_CNTP_CTL_EL0 CNTP_CTL 89 + #define SYS_CNTP_CVAL_EL0 CNTP_CVAL 90 + #define SYS_CNTV_CTL_EL0 CNTV_CTL 91 + #define SYS_CNTV_CVAL_EL0 CNTV_CVAL 92 + 92 93 #define cntvoff_el2 CNTVOFF 93 94 #define cnthctl_el2 CNTHCTL 94 95
+12
arch/arm/include/uapi/asm/kvm.h
··· 214 214 #define KVM_REG_ARM_FW_REG(r) (KVM_REG_ARM | KVM_REG_SIZE_U64 | \ 215 215 KVM_REG_ARM_FW | ((r) & 0xffff)) 216 216 #define KVM_REG_ARM_PSCI_VERSION KVM_REG_ARM_FW_REG(0) 217 + #define KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1 KVM_REG_ARM_FW_REG(1) 218 + /* Higher values mean better protection. */ 219 + #define KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_NOT_AVAIL 0 220 + #define KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_AVAIL 1 221 + #define KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_NOT_REQUIRED 2 222 + #define KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2 KVM_REG_ARM_FW_REG(2) 223 + /* Higher values mean better protection. */ 224 + #define KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_NOT_AVAIL 0 225 + #define KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_UNKNOWN 1 226 + #define KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_AVAIL 2 227 + #define KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_NOT_REQUIRED 3 228 + #define KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_ENABLED (1U << 4) 217 229 218 230 /* Device Control API: ARM VGIC */ 219 231 #define KVM_DEV_ARM_VGIC_GRP_ADDR 0
+4
arch/arm64/include/asm/assembler.h
··· 96 96 * RAS Error Synchronization barrier 97 97 */ 98 98 .macro esb 99 + #ifdef CONFIG_ARM64_RAS_EXTN 99 100 hint #16 101 + #else 102 + nop 103 + #endif 100 104 .endm 101 105 102 106 /*
+6
arch/arm64/include/asm/cpufeature.h
··· 620 620 system_uses_irq_prio_masking(); 621 621 } 622 622 623 + #define ARM64_BP_HARDEN_UNKNOWN -1 624 + #define ARM64_BP_HARDEN_WA_NEEDED 0 625 + #define ARM64_BP_HARDEN_NOT_REQUIRED 1 626 + 627 + int get_spectre_v2_workaround_state(void); 628 + 623 629 #define ARM64_SSBD_UNKNOWN -1 624 630 #define ARM64_SSBD_FORCE_DISABLE 0 625 631 #define ARM64_SSBD_KERNEL 1
+6
arch/arm64/include/asm/kvm_asm.h
··· 30 30 {ARM_EXCEPTION_TRAP, "TRAP" }, \ 31 31 {ARM_EXCEPTION_HYP_GONE, "HYP_GONE" } 32 32 33 + /* 34 + * Size of the HYP vectors preamble. kvm_patch_vector_branch() generates code 35 + * that jumps over this. 36 + */ 37 + #define KVM_VECTOR_PREAMBLE (2 * AARCH64_INSN_SIZE) 38 + 33 39 #ifndef __ASSEMBLY__ 34 40 35 41 #include <linux/mm.h>
+22 -8
arch/arm64/include/asm/kvm_emulate.h
··· 126 126 static inline unsigned long vcpu_read_elr_el1(const struct kvm_vcpu *vcpu) 127 127 { 128 128 if (vcpu->arch.sysregs_loaded_on_cpu) 129 - return read_sysreg_el1(elr); 129 + return read_sysreg_el1(SYS_ELR); 130 130 else 131 131 return *__vcpu_elr_el1(vcpu); 132 132 } ··· 134 134 static inline void vcpu_write_elr_el1(const struct kvm_vcpu *vcpu, unsigned long v) 135 135 { 136 136 if (vcpu->arch.sysregs_loaded_on_cpu) 137 - write_sysreg_el1(v, elr); 137 + write_sysreg_el1(v, SYS_ELR); 138 138 else 139 139 *__vcpu_elr_el1(vcpu) = v; 140 140 } ··· 186 186 return vcpu_read_spsr32(vcpu); 187 187 188 188 if (vcpu->arch.sysregs_loaded_on_cpu) 189 - return read_sysreg_el1(spsr); 189 + return read_sysreg_el1(SYS_SPSR); 190 190 else 191 191 return vcpu_gp_regs(vcpu)->spsr[KVM_SPSR_EL1]; 192 192 } ··· 199 199 } 200 200 201 201 if (vcpu->arch.sysregs_loaded_on_cpu) 202 - write_sysreg_el1(v, spsr); 202 + write_sysreg_el1(v, SYS_SPSR); 203 203 else 204 204 vcpu_gp_regs(vcpu)->spsr[KVM_SPSR_EL1] = v; 205 205 } ··· 353 353 return vcpu_read_sys_reg(vcpu, MPIDR_EL1) & MPIDR_HWID_BITMASK; 354 354 } 355 355 356 + static inline bool kvm_arm_get_vcpu_workaround_2_flag(struct kvm_vcpu *vcpu) 357 + { 358 + return vcpu->arch.workaround_flags & VCPU_WORKAROUND_2_FLAG; 359 + } 360 + 361 + static inline void kvm_arm_set_vcpu_workaround_2_flag(struct kvm_vcpu *vcpu, 362 + bool flag) 363 + { 364 + if (flag) 365 + vcpu->arch.workaround_flags |= VCPU_WORKAROUND_2_FLAG; 366 + else 367 + vcpu->arch.workaround_flags &= ~VCPU_WORKAROUND_2_FLAG; 368 + } 369 + 356 370 static inline void kvm_vcpu_set_be(struct kvm_vcpu *vcpu) 357 371 { 358 372 if (vcpu_mode_is_32bit(vcpu)) { ··· 465 451 */ 466 452 static inline void __hyp_text __kvm_skip_instr(struct kvm_vcpu *vcpu) 467 453 { 468 - *vcpu_pc(vcpu) = read_sysreg_el2(elr); 469 - vcpu->arch.ctxt.gp_regs.regs.pstate = read_sysreg_el2(spsr); 454 + *vcpu_pc(vcpu) = read_sysreg_el2(SYS_ELR); 455 + vcpu->arch.ctxt.gp_regs.regs.pstate = 
read_sysreg_el2(SYS_SPSR); 470 456 471 457 kvm_skip_instr(vcpu, kvm_vcpu_trap_il_is32bit(vcpu)); 472 458 473 - write_sysreg_el2(vcpu->arch.ctxt.gp_regs.regs.pstate, spsr); 474 - write_sysreg_el2(*vcpu_pc(vcpu), elr); 459 + write_sysreg_el2(vcpu->arch.ctxt.gp_regs.regs.pstate, SYS_SPSR); 460 + write_sysreg_el2(*vcpu_pc(vcpu), SYS_ELR); 475 461 } 476 462 477 463 #endif /* __ARM64_KVM_EMULATE_H__ */
+17 -6
arch/arm64/include/asm/kvm_host.h
··· 19 19 #include <asm/arch_gicv3.h> 20 20 #include <asm/barrier.h> 21 21 #include <asm/cpufeature.h> 22 + #include <asm/cputype.h> 22 23 #include <asm/daifflags.h> 23 24 #include <asm/fpsimd.h> 24 25 #include <asm/kvm.h> 25 26 #include <asm/kvm_asm.h> 26 27 #include <asm/kvm_mmio.h> 27 - #include <asm/smp_plat.h> 28 28 #include <asm/thread_info.h> 29 29 30 30 #define __KVM_HAVE_ARCH_INTC_INITIALIZED ··· 484 484 485 485 DECLARE_PER_CPU(kvm_host_data_t, kvm_host_data); 486 486 487 - static inline void kvm_init_host_cpu_context(struct kvm_cpu_context *cpu_ctxt, 488 - int cpu) 487 + static inline void kvm_init_host_cpu_context(struct kvm_cpu_context *cpu_ctxt) 489 488 { 490 489 /* The host's MPIDR is immutable, so let's set it up at boot time */ 491 - cpu_ctxt->sys_regs[MPIDR_EL1] = cpu_logical_map(cpu); 490 + cpu_ctxt->sys_regs[MPIDR_EL1] = read_cpuid_mpidr(); 492 491 } 493 492 494 493 void __kvm_enable_ssbs(void); ··· 620 621 isb(); 621 622 } 622 623 623 - static inline bool kvm_arm_harden_branch_predictor(void) 624 + #define KVM_BP_HARDEN_UNKNOWN -1 625 + #define KVM_BP_HARDEN_WA_NEEDED 0 626 + #define KVM_BP_HARDEN_NOT_REQUIRED 1 627 + 628 + static inline int kvm_arm_harden_branch_predictor(void) 624 629 { 625 - return cpus_have_const_cap(ARM64_HARDEN_BRANCH_PREDICTOR); 630 + switch (get_spectre_v2_workaround_state()) { 631 + case ARM64_BP_HARDEN_WA_NEEDED: 632 + return KVM_BP_HARDEN_WA_NEEDED; 633 + case ARM64_BP_HARDEN_NOT_REQUIRED: 634 + return KVM_BP_HARDEN_NOT_REQUIRED; 635 + case ARM64_BP_HARDEN_UNKNOWN: 636 + default: 637 + return KVM_BP_HARDEN_UNKNOWN; 638 + } 626 639 } 627 640 628 641 #define KVM_SSBD_UNKNOWN -1
+5 -45
arch/arm64/include/asm/kvm_hyp.h
··· 18 18 #define read_sysreg_elx(r,nvh,vh) \ 19 19 ({ \ 20 20 u64 reg; \ 21 - asm volatile(ALTERNATIVE("mrs %0, " __stringify(r##nvh),\ 21 + asm volatile(ALTERNATIVE(__mrs_s("%0", r##nvh), \ 22 22 __mrs_s("%0", r##vh), \ 23 23 ARM64_HAS_VIRT_HOST_EXTN) \ 24 24 : "=r" (reg)); \ ··· 28 28 #define write_sysreg_elx(v,r,nvh,vh) \ 29 29 do { \ 30 30 u64 __val = (u64)(v); \ 31 - asm volatile(ALTERNATIVE("msr " __stringify(r##nvh) ", %x0",\ 31 + asm volatile(ALTERNATIVE(__msr_s(r##nvh, "%x0"), \ 32 32 __msr_s(r##vh, "%x0"), \ 33 33 ARM64_HAS_VIRT_HOST_EXTN) \ 34 34 : : "rZ" (__val)); \ ··· 37 37 /* 38 38 * Unified accessors for registers that have a different encoding 39 39 * between VHE and non-VHE. They must be specified without their "ELx" 40 - * encoding. 40 + * encoding, but with the SYS_ prefix, as defined in asm/sysreg.h. 41 41 */ 42 - #define read_sysreg_el2(r) \ 43 - ({ \ 44 - u64 reg; \ 45 - asm volatile(ALTERNATIVE("mrs %0, " __stringify(r##_EL2),\ 46 - "mrs %0, " __stringify(r##_EL1),\ 47 - ARM64_HAS_VIRT_HOST_EXTN) \ 48 - : "=r" (reg)); \ 49 - reg; \ 50 - }) 51 - 52 - #define write_sysreg_el2(v,r) \ 53 - do { \ 54 - u64 __val = (u64)(v); \ 55 - asm volatile(ALTERNATIVE("msr " __stringify(r##_EL2) ", %x0",\ 56 - "msr " __stringify(r##_EL1) ", %x0",\ 57 - ARM64_HAS_VIRT_HOST_EXTN) \ 58 - : : "rZ" (__val)); \ 59 - } while (0) 60 42 61 43 #define read_sysreg_el0(r) read_sysreg_elx(r, _EL0, _EL02) 62 44 #define write_sysreg_el0(v,r) write_sysreg_elx(v, r, _EL0, _EL02) 63 45 #define read_sysreg_el1(r) read_sysreg_elx(r, _EL1, _EL12) 64 46 #define write_sysreg_el1(v,r) write_sysreg_elx(v, r, _EL1, _EL12) 65 - 66 - /* The VHE specific system registers and their encoding */ 67 - #define sctlr_EL12 sys_reg(3, 5, 1, 0, 0) 68 - #define cpacr_EL12 sys_reg(3, 5, 1, 0, 2) 69 - #define ttbr0_EL12 sys_reg(3, 5, 2, 0, 0) 70 - #define ttbr1_EL12 sys_reg(3, 5, 2, 0, 1) 71 - #define tcr_EL12 sys_reg(3, 5, 2, 0, 2) 72 - #define afsr0_EL12 sys_reg(3, 5, 5, 1, 0) 73 - #define 
afsr1_EL12 sys_reg(3, 5, 5, 1, 1) 74 - #define esr_EL12 sys_reg(3, 5, 5, 2, 0) 75 - #define far_EL12 sys_reg(3, 5, 6, 0, 0) 76 - #define mair_EL12 sys_reg(3, 5, 10, 2, 0) 77 - #define amair_EL12 sys_reg(3, 5, 10, 3, 0) 78 - #define vbar_EL12 sys_reg(3, 5, 12, 0, 0) 79 - #define contextidr_EL12 sys_reg(3, 5, 13, 0, 1) 80 - #define cntkctl_EL12 sys_reg(3, 5, 14, 1, 0) 81 - #define cntp_tval_EL02 sys_reg(3, 5, 14, 2, 0) 82 - #define cntp_ctl_EL02 sys_reg(3, 5, 14, 2, 1) 83 - #define cntp_cval_EL02 sys_reg(3, 5, 14, 2, 2) 84 - #define cntv_tval_EL02 sys_reg(3, 5, 14, 3, 0) 85 - #define cntv_ctl_EL02 sys_reg(3, 5, 14, 3, 1) 86 - #define cntv_cval_EL02 sys_reg(3, 5, 14, 3, 2) 87 - #define spsr_EL12 sys_reg(3, 5, 4, 0, 0) 88 - #define elr_EL12 sys_reg(3, 5, 4, 0, 1) 47 + #define read_sysreg_el2(r) read_sysreg_elx(r, _EL2, _EL1) 48 + #define write_sysreg_el2(v,r) write_sysreg_elx(v, r, _EL2, _EL1) 89 49 90 50 /** 91 51 * hyp_alternate_select - Generates patchable code sequences that are
+33 -2
arch/arm64/include/asm/sysreg.h
··· 191 191 #define SYS_APGAKEYLO_EL1 sys_reg(3, 0, 2, 3, 0) 192 192 #define SYS_APGAKEYHI_EL1 sys_reg(3, 0, 2, 3, 1) 193 193 194 + #define SYS_SPSR_EL1 sys_reg(3, 0, 4, 0, 0) 195 + #define SYS_ELR_EL1 sys_reg(3, 0, 4, 0, 1) 196 + 194 197 #define SYS_ICC_PMR_EL1 sys_reg(3, 0, 4, 6, 0) 195 198 196 199 #define SYS_AFSR0_EL1 sys_reg(3, 0, 5, 1, 0) ··· 385 382 #define SYS_CNTP_CTL_EL0 sys_reg(3, 3, 14, 2, 1) 386 383 #define SYS_CNTP_CVAL_EL0 sys_reg(3, 3, 14, 2, 2) 387 384 385 + #define SYS_CNTV_CTL_EL0 sys_reg(3, 3, 14, 3, 1) 386 + #define SYS_CNTV_CVAL_EL0 sys_reg(3, 3, 14, 3, 2) 387 + 388 388 #define SYS_AARCH32_CNTP_TVAL sys_reg(0, 0, 14, 2, 0) 389 389 #define SYS_AARCH32_CNTP_CTL sys_reg(0, 0, 14, 2, 1) 390 390 #define SYS_AARCH32_CNTP_CVAL sys_reg(0, 2, 0, 14, 0) ··· 398 392 #define __TYPER_CRm(n) (0xc | (((n) >> 3) & 0x3)) 399 393 #define SYS_PMEVTYPERn_EL0(n) sys_reg(3, 3, 14, __TYPER_CRm(n), __PMEV_op2(n)) 400 394 401 - #define SYS_PMCCFILTR_EL0 sys_reg (3, 3, 14, 15, 7) 395 + #define SYS_PMCCFILTR_EL0 sys_reg(3, 3, 14, 15, 7) 402 396 403 397 #define SYS_ZCR_EL2 sys_reg(3, 4, 1, 2, 0) 404 - 405 398 #define SYS_DACR32_EL2 sys_reg(3, 4, 3, 0, 0) 399 + #define SYS_SPSR_EL2 sys_reg(3, 4, 4, 0, 0) 400 + #define SYS_ELR_EL2 sys_reg(3, 4, 4, 0, 1) 406 401 #define SYS_IFSR32_EL2 sys_reg(3, 4, 5, 0, 1) 402 + #define SYS_ESR_EL2 sys_reg(3, 4, 5, 2, 0) 407 403 #define SYS_VSESR_EL2 sys_reg(3, 4, 5, 2, 3) 408 404 #define SYS_FPEXC32_EL2 sys_reg(3, 4, 5, 3, 0) 405 + #define SYS_FAR_EL2 sys_reg(3, 4, 6, 0, 0) 409 406 410 407 #define SYS_VDISR_EL2 sys_reg(3, 4, 12, 1, 1) 411 408 #define __SYS__AP0Rx_EL2(x) sys_reg(3, 4, 12, 8, x) ··· 453 444 #define SYS_ICH_LR15_EL2 __SYS__LR8_EL2(7) 454 445 455 446 /* VHE encodings for architectural EL0/1 system registers */ 447 + #define SYS_SCTLR_EL12 sys_reg(3, 5, 1, 0, 0) 448 + #define SYS_CPACR_EL12 sys_reg(3, 5, 1, 0, 2) 456 449 #define SYS_ZCR_EL12 sys_reg(3, 5, 1, 2, 0) 450 + #define SYS_TTBR0_EL12 sys_reg(3, 5, 2, 0, 0) 451 + 
#define SYS_TTBR1_EL12 sys_reg(3, 5, 2, 0, 1) 452 + #define SYS_TCR_EL12 sys_reg(3, 5, 2, 0, 2) 453 + #define SYS_SPSR_EL12 sys_reg(3, 5, 4, 0, 0) 454 + #define SYS_ELR_EL12 sys_reg(3, 5, 4, 0, 1) 455 + #define SYS_AFSR0_EL12 sys_reg(3, 5, 5, 1, 0) 456 + #define SYS_AFSR1_EL12 sys_reg(3, 5, 5, 1, 1) 457 + #define SYS_ESR_EL12 sys_reg(3, 5, 5, 2, 0) 458 + #define SYS_FAR_EL12 sys_reg(3, 5, 6, 0, 0) 459 + #define SYS_MAIR_EL12 sys_reg(3, 5, 10, 2, 0) 460 + #define SYS_AMAIR_EL12 sys_reg(3, 5, 10, 3, 0) 461 + #define SYS_VBAR_EL12 sys_reg(3, 5, 12, 0, 0) 462 + #define SYS_CONTEXTIDR_EL12 sys_reg(3, 5, 13, 0, 1) 463 + #define SYS_CNTKCTL_EL12 sys_reg(3, 5, 14, 1, 0) 464 + #define SYS_CNTP_TVAL_EL02 sys_reg(3, 5, 14, 2, 0) 465 + #define SYS_CNTP_CTL_EL02 sys_reg(3, 5, 14, 2, 1) 466 + #define SYS_CNTP_CVAL_EL02 sys_reg(3, 5, 14, 2, 2) 467 + #define SYS_CNTV_TVAL_EL02 sys_reg(3, 5, 14, 3, 0) 468 + #define SYS_CNTV_CTL_EL02 sys_reg(3, 5, 14, 3, 1) 469 + #define SYS_CNTV_CVAL_EL02 sys_reg(3, 5, 14, 3, 2) 457 470 458 471 /* Common SCTLR_ELx flags. */ 459 472 #define SCTLR_ELx_DSSBS (_BITUL(44))
+10
arch/arm64/include/uapi/asm/kvm.h
··· 229 229 #define KVM_REG_ARM_FW_REG(r) (KVM_REG_ARM64 | KVM_REG_SIZE_U64 | \ 230 230 KVM_REG_ARM_FW | ((r) & 0xffff)) 231 231 #define KVM_REG_ARM_PSCI_VERSION KVM_REG_ARM_FW_REG(0) 232 + #define KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1 KVM_REG_ARM_FW_REG(1) 233 + #define KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_NOT_AVAIL 0 234 + #define KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_AVAIL 1 235 + #define KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_NOT_REQUIRED 2 236 + #define KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2 KVM_REG_ARM_FW_REG(2) 237 + #define KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_NOT_AVAIL 0 238 + #define KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_UNKNOWN 1 239 + #define KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_AVAIL 2 240 + #define KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_NOT_REQUIRED 3 241 + #define KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_ENABLED (1U << 4) 232 242 233 243 /* SVE registers */ 234 244 #define KVM_REG_ARM64_SVE (0x15 << KVM_REG_ARM_COPROC_SHIFT)
+18 -5
arch/arm64/kernel/cpu_errata.c
··· 554 554 static bool __hardenbp_enab = true; 555 555 static bool __spectrev2_safe = true; 556 556 557 + int get_spectre_v2_workaround_state(void) 558 + { 559 + if (__spectrev2_safe) 560 + return ARM64_BP_HARDEN_NOT_REQUIRED; 561 + 562 + if (!__hardenbp_enab) 563 + return ARM64_BP_HARDEN_UNKNOWN; 564 + 565 + return ARM64_BP_HARDEN_WA_NEEDED; 566 + } 567 + 557 568 /* 558 569 * List of CPUs that do not need any Spectre-v2 mitigation at all. 559 570 */ ··· 865 854 ssize_t cpu_show_spectre_v2(struct device *dev, struct device_attribute *attr, 866 855 char *buf) 867 856 { 868 - if (__spectrev2_safe) 857 + switch (get_spectre_v2_workaround_state()) { 858 + case ARM64_BP_HARDEN_NOT_REQUIRED: 869 859 return sprintf(buf, "Not affected\n"); 870 - 871 - if (__hardenbp_enab) 860 + case ARM64_BP_HARDEN_WA_NEEDED: 872 861 return sprintf(buf, "Mitigation: Branch predictor hardening\n"); 873 - 874 - return sprintf(buf, "Vulnerable\n"); 862 + case ARM64_BP_HARDEN_UNKNOWN: 863 + default: 864 + return sprintf(buf, "Vulnerable\n"); 865 + } 875 866 } 876 867 877 868 ssize_t cpu_show_spec_store_bypass(struct device *dev,
+4
arch/arm64/kernel/traps.c
··· 871 871 /* 872 872 * The CPU can't make progress. The exception may have 873 873 * been imprecise. 874 + * 875 + * Neoverse-N1 #1349291 means a non-KVM SError reported as 876 + * Unrecoverable should be treated as Uncontainable. We 877 + * call arm64_serror_panic() in both cases. 874 878 */ 875 879 return true; 876 880
+29 -7
arch/arm64/kvm/hyp/entry.S
··· 6 6 7 7 #include <linux/linkage.h> 8 8 9 + #include <asm/alternative.h> 9 10 #include <asm/asm-offsets.h> 10 11 #include <asm/assembler.h> 11 12 #include <asm/fpsimdmacros.h> ··· 53 52 // Store the host regs 54 53 save_callee_saved_regs x1 55 54 55 + // Now the host state is stored if we have a pending RAS SError it must 56 + // affect the host. If any asynchronous exception is pending we defer 57 + // the guest entry. The DSB isn't necessary before v8.2 as any SError 58 + // would be fatal. 59 + alternative_if ARM64_HAS_RAS_EXTN 60 + dsb nshst 61 + isb 62 + alternative_else_nop_endif 63 + mrs x1, isr_el1 64 + cbz x1, 1f 65 + mov x0, #ARM_EXCEPTION_IRQ 66 + ret 67 + 68 + 1: 56 69 add x18, x0, #VCPU_CONTEXT 57 70 58 71 // Macro ptrauth_switch_to_guest format: ··· 142 127 143 128 alternative_if ARM64_HAS_RAS_EXTN 144 129 // If we have the RAS extensions we can consume a pending error 145 - // without an unmask-SError and isb. 146 - esb 130 + // without an unmask-SError and isb. The ESB-instruction consumed any 131 + // pending guest error when we took the exception from the guest. 147 132 mrs_s x2, SYS_DISR_EL1 148 133 str x2, [x1, #(VCPU_FAULT_DISR - VCPU_CONTEXT)] 149 134 cbz x2, 1f ··· 151 136 orr x0, x0, #(1<<ARM_EXIT_WITH_SERROR_BIT) 152 137 1: ret 153 138 alternative_else 154 - // If we have a pending asynchronous abort, now is the 155 - // time to find out. From your VAXorcist book, page 666: 139 + dsb sy // Synchronize against in-flight ld/st 140 + isb // Prevent an early read of side-effect free ISR 141 + mrs x2, isr_el1 142 + tbnz x2, #8, 2f // ISR_EL1.A 143 + ret 144 + nop 145 + 2: 146 + alternative_endif 147 + // We know we have a pending asynchronous abort, now is the 148 + // time to flush it out. From your VAXorcist book, page 666: 156 149 // "Threaten me not, oh Evil one! For I speak with 157 150 // the power of DEC, and I command thee to show thyself!" 
158 151 mrs x2, elr_el2 ··· 168 145 mrs x4, spsr_el2 169 146 mov x5, x0 170 147 171 - dsb sy // Synchronize against in-flight ld/st 172 - nop 173 148 msr daifclr, #4 // Unmask aborts 174 - alternative_endif 175 149 176 150 // This is our single instruction exception window. A pending 177 151 // SError is guaranteed to occur at the earliest when we unmask ··· 180 160 181 161 .global abort_guest_exit_end 182 162 abort_guest_exit_end: 163 + 164 + msr daifset, #4 // Mask aborts 183 165 184 166 // If the exception took place, restore the EL1 exception 185 167 // context so that we can report some information.
+25 -5
arch/arm64/kvm/hyp/hyp-entry.S
··· 216 216 217 217 .align 11 218 218 219 + .macro check_preamble_length start, end 220 + /* kvm_patch_vector_branch() generates code that jumps over the preamble. */ 221 + .if ((\end-\start) != KVM_VECTOR_PREAMBLE) 222 + .error "KVM vector preamble length mismatch" 223 + .endif 224 + .endm 225 + 219 226 .macro valid_vect target 220 227 .align 7 228 + 661: 229 + esb 221 230 stp x0, x1, [sp, #-16]! 231 + 662: 222 232 b \target 233 + 234 + check_preamble_length 661b, 662b 223 235 .endm 224 236 225 237 .macro invalid_vect target 226 238 .align 7 239 + 661: 227 240 b \target 241 + nop 242 + 662: 228 243 ldp x0, x1, [sp], #16 229 244 b \target 245 + 246 + check_preamble_length 661b, 662b 230 247 .endm 231 248 232 249 ENTRY(__kvm_hyp_vector) ··· 271 254 #ifdef CONFIG_KVM_INDIRECT_VECTORS 272 255 .macro hyp_ventry 273 256 .align 7 274 - 1: .rept 27 257 + 1: esb 258 + .rept 26 275 259 nop 276 260 .endr 277 261 /* 278 262 * The default sequence is to directly branch to the KVM vectors, 279 263 * using the computed offset. This applies for VHE as well as 280 - * !ARM64_HARDEN_EL2_VECTORS. 264 + * !ARM64_HARDEN_EL2_VECTORS. The first vector must always run the preamble. 281 265 * 282 266 * For ARM64_HARDEN_EL2_VECTORS configurations, this gets replaced 283 267 * with: ··· 289 271 * movk x0, #((addr >> 32) & 0xffff), lsl #32 290 272 * br x0 291 273 * 292 - * Where addr = kern_hyp_va(__kvm_hyp_vector) + vector-offset + 4. 274 + * Where: 275 + * addr = kern_hyp_va(__kvm_hyp_vector) + vector-offset + KVM_VECTOR_PREAMBLE. 293 276 * See kvm_patch_vector_branch for details. 294 277 */ 295 278 alternative_cb kvm_patch_vector_branch 296 - b __kvm_hyp_vector + (1b - 0b) 297 - nop 279 + stp x0, x1, [sp, #-16]! 280 + b __kvm_hyp_vector + (1b - 0b + KVM_VECTOR_PREAMBLE) 298 281 nop 299 282 nop 300 283 nop ··· 320 301 .popsection 321 302 322 303 ENTRY(__smccc_workaround_1_smc_start) 304 + esb 323 305 sub sp, sp, #(8 * 4) 324 306 stp x2, x3, [sp, #(8 * 0)] 325 307 stp x0, x1, [sp, #(8 * 2)]
+7 -7
arch/arm64/kvm/hyp/switch.c
··· 284 284 if (ec != ESR_ELx_EC_DABT_LOW && ec != ESR_ELx_EC_IABT_LOW) 285 285 return true; 286 286 287 - far = read_sysreg_el2(far); 287 + far = read_sysreg_el2(SYS_FAR); 288 288 289 289 /* 290 290 * The HPFAR can be invalid if the stage 2 fault did not ··· 401 401 static bool __hyp_text fixup_guest_exit(struct kvm_vcpu *vcpu, u64 *exit_code) 402 402 { 403 403 if (ARM_EXCEPTION_CODE(*exit_code) != ARM_EXCEPTION_IRQ) 404 - vcpu->arch.fault.esr_el2 = read_sysreg_el2(esr); 404 + vcpu->arch.fault.esr_el2 = read_sysreg_el2(SYS_ESR); 405 405 406 406 /* 407 407 * We're using the raw exception code in order to only process ··· 697 697 asm volatile("ldr %0, =__hyp_panic_string" : "=r" (str_va)); 698 698 699 699 __hyp_do_panic(str_va, 700 - spsr, elr, 701 - read_sysreg(esr_el2), read_sysreg_el2(far), 700 + spsr, elr, 701 + read_sysreg(esr_el2), read_sysreg_el2(SYS_FAR), 702 702 read_sysreg(hpfar_el2), par, vcpu); 703 703 } 704 704 ··· 713 713 714 714 panic(__hyp_panic_string, 715 715 spsr, elr, 716 - read_sysreg_el2(esr), read_sysreg_el2(far), 716 + read_sysreg_el2(SYS_ESR), read_sysreg_el2(SYS_FAR), 717 717 read_sysreg(hpfar_el2), par, vcpu); 718 718 } 719 719 NOKPROBE_SYMBOL(__hyp_call_panic_vhe); 720 720 721 721 void __hyp_text __noreturn hyp_panic(struct kvm_cpu_context *host_ctxt) 722 722 { 723 - u64 spsr = read_sysreg_el2(spsr); 724 - u64 elr = read_sysreg_el2(elr); 723 + u64 spsr = read_sysreg_el2(SYS_SPSR); 724 + u64 elr = read_sysreg_el2(SYS_ELR); 725 725 u64 par = read_sysreg(par_el1); 726 726 727 727 if (!has_vhe())
+39 -39
arch/arm64/kvm/hyp/sysreg-sr.c
··· 43 43 static void __hyp_text __sysreg_save_el1_state(struct kvm_cpu_context *ctxt) 44 44 { 45 45 ctxt->sys_regs[CSSELR_EL1] = read_sysreg(csselr_el1); 46 - ctxt->sys_regs[SCTLR_EL1] = read_sysreg_el1(sctlr); 46 + ctxt->sys_regs[SCTLR_EL1] = read_sysreg_el1(SYS_SCTLR); 47 47 ctxt->sys_regs[ACTLR_EL1] = read_sysreg(actlr_el1); 48 - ctxt->sys_regs[CPACR_EL1] = read_sysreg_el1(cpacr); 49 - ctxt->sys_regs[TTBR0_EL1] = read_sysreg_el1(ttbr0); 50 - ctxt->sys_regs[TTBR1_EL1] = read_sysreg_el1(ttbr1); 51 - ctxt->sys_regs[TCR_EL1] = read_sysreg_el1(tcr); 52 - ctxt->sys_regs[ESR_EL1] = read_sysreg_el1(esr); 53 - ctxt->sys_regs[AFSR0_EL1] = read_sysreg_el1(afsr0); 54 - ctxt->sys_regs[AFSR1_EL1] = read_sysreg_el1(afsr1); 55 - ctxt->sys_regs[FAR_EL1] = read_sysreg_el1(far); 56 - ctxt->sys_regs[MAIR_EL1] = read_sysreg_el1(mair); 57 - ctxt->sys_regs[VBAR_EL1] = read_sysreg_el1(vbar); 58 - ctxt->sys_regs[CONTEXTIDR_EL1] = read_sysreg_el1(contextidr); 59 - ctxt->sys_regs[AMAIR_EL1] = read_sysreg_el1(amair); 60 - ctxt->sys_regs[CNTKCTL_EL1] = read_sysreg_el1(cntkctl); 48 + ctxt->sys_regs[CPACR_EL1] = read_sysreg_el1(SYS_CPACR); 49 + ctxt->sys_regs[TTBR0_EL1] = read_sysreg_el1(SYS_TTBR0); 50 + ctxt->sys_regs[TTBR1_EL1] = read_sysreg_el1(SYS_TTBR1); 51 + ctxt->sys_regs[TCR_EL1] = read_sysreg_el1(SYS_TCR); 52 + ctxt->sys_regs[ESR_EL1] = read_sysreg_el1(SYS_ESR); 53 + ctxt->sys_regs[AFSR0_EL1] = read_sysreg_el1(SYS_AFSR0); 54 + ctxt->sys_regs[AFSR1_EL1] = read_sysreg_el1(SYS_AFSR1); 55 + ctxt->sys_regs[FAR_EL1] = read_sysreg_el1(SYS_FAR); 56 + ctxt->sys_regs[MAIR_EL1] = read_sysreg_el1(SYS_MAIR); 57 + ctxt->sys_regs[VBAR_EL1] = read_sysreg_el1(SYS_VBAR); 58 + ctxt->sys_regs[CONTEXTIDR_EL1] = read_sysreg_el1(SYS_CONTEXTIDR); 59 + ctxt->sys_regs[AMAIR_EL1] = read_sysreg_el1(SYS_AMAIR); 60 + ctxt->sys_regs[CNTKCTL_EL1] = read_sysreg_el1(SYS_CNTKCTL); 61 61 ctxt->sys_regs[PAR_EL1] = read_sysreg(par_el1); 62 62 ctxt->sys_regs[TPIDR_EL1] = read_sysreg(tpidr_el1); 63 63 64 64 
ctxt->gp_regs.sp_el1 = read_sysreg(sp_el1); 65 - ctxt->gp_regs.elr_el1 = read_sysreg_el1(elr); 66 - ctxt->gp_regs.spsr[KVM_SPSR_EL1]= read_sysreg_el1(spsr); 65 + ctxt->gp_regs.elr_el1 = read_sysreg_el1(SYS_ELR); 66 + ctxt->gp_regs.spsr[KVM_SPSR_EL1]= read_sysreg_el1(SYS_SPSR); 67 67 } 68 68 69 69 static void __hyp_text __sysreg_save_el2_return_state(struct kvm_cpu_context *ctxt) 70 70 { 71 - ctxt->gp_regs.regs.pc = read_sysreg_el2(elr); 72 - ctxt->gp_regs.regs.pstate = read_sysreg_el2(spsr); 71 + ctxt->gp_regs.regs.pc = read_sysreg_el2(SYS_ELR); 72 + ctxt->gp_regs.regs.pstate = read_sysreg_el2(SYS_SPSR); 73 73 74 74 if (cpus_have_const_cap(ARM64_HAS_RAS_EXTN)) 75 75 ctxt->sys_regs[DISR_EL1] = read_sysreg_s(SYS_VDISR_EL2); ··· 109 109 110 110 static void __hyp_text __sysreg_restore_user_state(struct kvm_cpu_context *ctxt) 111 111 { 112 - write_sysreg(ctxt->sys_regs[TPIDR_EL0], tpidr_el0); 113 - write_sysreg(ctxt->sys_regs[TPIDRRO_EL0], tpidrro_el0); 112 + write_sysreg(ctxt->sys_regs[TPIDR_EL0], tpidr_el0); 113 + write_sysreg(ctxt->sys_regs[TPIDRRO_EL0], tpidrro_el0); 114 114 } 115 115 116 116 static void __hyp_text __sysreg_restore_el1_state(struct kvm_cpu_context *ctxt) 117 117 { 118 118 write_sysreg(ctxt->sys_regs[MPIDR_EL1], vmpidr_el2); 119 119 write_sysreg(ctxt->sys_regs[CSSELR_EL1], csselr_el1); 120 - write_sysreg_el1(ctxt->sys_regs[SCTLR_EL1], sctlr); 121 - write_sysreg(ctxt->sys_regs[ACTLR_EL1], actlr_el1); 122 - write_sysreg_el1(ctxt->sys_regs[CPACR_EL1], cpacr); 123 - write_sysreg_el1(ctxt->sys_regs[TTBR0_EL1], ttbr0); 124 - write_sysreg_el1(ctxt->sys_regs[TTBR1_EL1], ttbr1); 125 - write_sysreg_el1(ctxt->sys_regs[TCR_EL1], tcr); 126 - write_sysreg_el1(ctxt->sys_regs[ESR_EL1], esr); 127 - write_sysreg_el1(ctxt->sys_regs[AFSR0_EL1], afsr0); 128 - write_sysreg_el1(ctxt->sys_regs[AFSR1_EL1], afsr1); 129 - write_sysreg_el1(ctxt->sys_regs[FAR_EL1], far); 130 - write_sysreg_el1(ctxt->sys_regs[MAIR_EL1], mair); 131 - write_sysreg_el1(ctxt->sys_regs[VBAR_EL1], 
vbar); 132 - write_sysreg_el1(ctxt->sys_regs[CONTEXTIDR_EL1],contextidr); 133 - write_sysreg_el1(ctxt->sys_regs[AMAIR_EL1], amair); 134 - write_sysreg_el1(ctxt->sys_regs[CNTKCTL_EL1], cntkctl); 120 + write_sysreg_el1(ctxt->sys_regs[SCTLR_EL1], SYS_SCTLR); 121 + write_sysreg(ctxt->sys_regs[ACTLR_EL1], actlr_el1); 122 + write_sysreg_el1(ctxt->sys_regs[CPACR_EL1], SYS_CPACR); 123 + write_sysreg_el1(ctxt->sys_regs[TTBR0_EL1], SYS_TTBR0); 124 + write_sysreg_el1(ctxt->sys_regs[TTBR1_EL1], SYS_TTBR1); 125 + write_sysreg_el1(ctxt->sys_regs[TCR_EL1], SYS_TCR); 126 + write_sysreg_el1(ctxt->sys_regs[ESR_EL1], SYS_ESR); 127 + write_sysreg_el1(ctxt->sys_regs[AFSR0_EL1], SYS_AFSR0); 128 + write_sysreg_el1(ctxt->sys_regs[AFSR1_EL1], SYS_AFSR1); 129 + write_sysreg_el1(ctxt->sys_regs[FAR_EL1], SYS_FAR); 130 + write_sysreg_el1(ctxt->sys_regs[MAIR_EL1], SYS_MAIR); 131 + write_sysreg_el1(ctxt->sys_regs[VBAR_EL1], SYS_VBAR); 132 + write_sysreg_el1(ctxt->sys_regs[CONTEXTIDR_EL1],SYS_CONTEXTIDR); 133 + write_sysreg_el1(ctxt->sys_regs[AMAIR_EL1], SYS_AMAIR); 134 + write_sysreg_el1(ctxt->sys_regs[CNTKCTL_EL1], SYS_CNTKCTL); 135 135 write_sysreg(ctxt->sys_regs[PAR_EL1], par_el1); 136 136 write_sysreg(ctxt->sys_regs[TPIDR_EL1], tpidr_el1); 137 137 138 138 write_sysreg(ctxt->gp_regs.sp_el1, sp_el1); 139 - write_sysreg_el1(ctxt->gp_regs.elr_el1, elr); 140 - write_sysreg_el1(ctxt->gp_regs.spsr[KVM_SPSR_EL1],spsr); 139 + write_sysreg_el1(ctxt->gp_regs.elr_el1, SYS_ELR); 140 + write_sysreg_el1(ctxt->gp_regs.spsr[KVM_SPSR_EL1],SYS_SPSR); 141 141 } 142 142 143 143 static void __hyp_text ··· 160 160 if (!(mode & PSR_MODE32_BIT) && mode >= PSR_MODE_EL2t) 161 161 pstate = PSR_MODE_EL2h | PSR_IL_BIT; 162 162 163 - write_sysreg_el2(ctxt->gp_regs.regs.pc, elr); 164 - write_sysreg_el2(pstate, spsr); 163 + write_sysreg_el2(ctxt->gp_regs.regs.pc, SYS_ELR); 164 + write_sysreg_el2(pstate, SYS_SPSR); 165 165 166 166 if (cpus_have_const_cap(ARM64_HAS_RAS_EXTN)) 167 167 write_sysreg_s(ctxt->sys_regs[DISR_EL1], 
SYS_VDISR_EL2);
+6 -6
arch/arm64/kvm/hyp/tlb.c
··· 33 33 * in the TCR_EL1 register. We also need to prevent it to 34 34 * allocate IPA->PA walks, so we enable the S1 MMU... 35 35 */ 36 - val = cxt->tcr = read_sysreg_el1(tcr); 36 + val = cxt->tcr = read_sysreg_el1(SYS_TCR); 37 37 val |= TCR_EPD1_MASK | TCR_EPD0_MASK; 38 - write_sysreg_el1(val, tcr); 39 - val = cxt->sctlr = read_sysreg_el1(sctlr); 38 + write_sysreg_el1(val, SYS_TCR); 39 + val = cxt->sctlr = read_sysreg_el1(SYS_SCTLR); 40 40 val |= SCTLR_ELx_M; 41 - write_sysreg_el1(val, sctlr); 41 + write_sysreg_el1(val, SYS_SCTLR); 42 42 } 43 43 44 44 /* ··· 85 85 86 86 if (cpus_have_const_cap(ARM64_WORKAROUND_1165522)) { 87 87 /* Restore the registers to what they were */ 88 - write_sysreg_el1(cxt->tcr, tcr); 89 - write_sysreg_el1(cxt->sctlr, sctlr); 88 + write_sysreg_el1(cxt->tcr, SYS_TCR); 89 + write_sysreg_el1(cxt->sctlr, SYS_SCTLR); 90 90 } 91 91 92 92 local_irq_restore(cxt->flags);
+1 -1
arch/arm64/kvm/hyp/vgic-v2-cpuif-proxy.c
··· 16 16 static bool __hyp_text __is_be(struct kvm_vcpu *vcpu) 17 17 { 18 18 if (vcpu_mode_is_32bit(vcpu)) 19 - return !!(read_sysreg_el2(spsr) & PSR_AA32_E_BIT); 19 + return !!(read_sysreg_el2(SYS_SPSR) & PSR_AA32_E_BIT); 20 20 21 21 return !!(read_sysreg(SCTLR_EL1) & SCTLR_ELx_EE); 22 22 }
+2 -2
arch/arm64/kvm/regmap.c
··· 152 152 153 153 switch (spsr_idx) { 154 154 case KVM_SPSR_SVC: 155 - return read_sysreg_el1(spsr); 155 + return read_sysreg_el1(SYS_SPSR); 156 156 case KVM_SPSR_ABT: 157 157 return read_sysreg(spsr_abt); 158 158 case KVM_SPSR_UND: ··· 177 177 178 178 switch (spsr_idx) { 179 179 case KVM_SPSR_SVC: 180 - write_sysreg_el1(v, spsr); 180 + write_sysreg_el1(v, SYS_SPSR); 181 181 case KVM_SPSR_ABT: 182 182 write_sysreg(v, spsr_abt); 183 183 case KVM_SPSR_UND:
+30 -30
arch/arm64/kvm/sys_regs.c
··· 81 81 */ 82 82 switch (reg) { 83 83 case CSSELR_EL1: return read_sysreg_s(SYS_CSSELR_EL1); 84 - case SCTLR_EL1: return read_sysreg_s(sctlr_EL12); 84 + case SCTLR_EL1: return read_sysreg_s(SYS_SCTLR_EL12); 85 85 case ACTLR_EL1: return read_sysreg_s(SYS_ACTLR_EL1); 86 - case CPACR_EL1: return read_sysreg_s(cpacr_EL12); 87 - case TTBR0_EL1: return read_sysreg_s(ttbr0_EL12); 88 - case TTBR1_EL1: return read_sysreg_s(ttbr1_EL12); 89 - case TCR_EL1: return read_sysreg_s(tcr_EL12); 90 - case ESR_EL1: return read_sysreg_s(esr_EL12); 91 - case AFSR0_EL1: return read_sysreg_s(afsr0_EL12); 92 - case AFSR1_EL1: return read_sysreg_s(afsr1_EL12); 93 - case FAR_EL1: return read_sysreg_s(far_EL12); 94 - case MAIR_EL1: return read_sysreg_s(mair_EL12); 95 - case VBAR_EL1: return read_sysreg_s(vbar_EL12); 96 - case CONTEXTIDR_EL1: return read_sysreg_s(contextidr_EL12); 86 + case CPACR_EL1: return read_sysreg_s(SYS_CPACR_EL12); 87 + case TTBR0_EL1: return read_sysreg_s(SYS_TTBR0_EL12); 88 + case TTBR1_EL1: return read_sysreg_s(SYS_TTBR1_EL12); 89 + case TCR_EL1: return read_sysreg_s(SYS_TCR_EL12); 90 + case ESR_EL1: return read_sysreg_s(SYS_ESR_EL12); 91 + case AFSR0_EL1: return read_sysreg_s(SYS_AFSR0_EL12); 92 + case AFSR1_EL1: return read_sysreg_s(SYS_AFSR1_EL12); 93 + case FAR_EL1: return read_sysreg_s(SYS_FAR_EL12); 94 + case MAIR_EL1: return read_sysreg_s(SYS_MAIR_EL12); 95 + case VBAR_EL1: return read_sysreg_s(SYS_VBAR_EL12); 96 + case CONTEXTIDR_EL1: return read_sysreg_s(SYS_CONTEXTIDR_EL12); 97 97 case TPIDR_EL0: return read_sysreg_s(SYS_TPIDR_EL0); 98 98 case TPIDRRO_EL0: return read_sysreg_s(SYS_TPIDRRO_EL0); 99 99 case TPIDR_EL1: return read_sysreg_s(SYS_TPIDR_EL1); 100 - case AMAIR_EL1: return read_sysreg_s(amair_EL12); 101 - case CNTKCTL_EL1: return read_sysreg_s(cntkctl_EL12); 100 + case AMAIR_EL1: return read_sysreg_s(SYS_AMAIR_EL12); 101 + case CNTKCTL_EL1: return read_sysreg_s(SYS_CNTKCTL_EL12); 102 102 case PAR_EL1: return read_sysreg_s(SYS_PAR_EL1); 103 103 
case DACR32_EL2: return read_sysreg_s(SYS_DACR32_EL2); 104 104 case IFSR32_EL2: return read_sysreg_s(SYS_IFSR32_EL2); ··· 124 124 */ 125 125 switch (reg) { 126 126 case CSSELR_EL1: write_sysreg_s(val, SYS_CSSELR_EL1); return; 127 - case SCTLR_EL1: write_sysreg_s(val, sctlr_EL12); return; 127 + case SCTLR_EL1: write_sysreg_s(val, SYS_SCTLR_EL12); return; 128 128 case ACTLR_EL1: write_sysreg_s(val, SYS_ACTLR_EL1); return; 129 - case CPACR_EL1: write_sysreg_s(val, cpacr_EL12); return; 130 - case TTBR0_EL1: write_sysreg_s(val, ttbr0_EL12); return; 131 - case TTBR1_EL1: write_sysreg_s(val, ttbr1_EL12); return; 132 - case TCR_EL1: write_sysreg_s(val, tcr_EL12); return; 133 - case ESR_EL1: write_sysreg_s(val, esr_EL12); return; 134 - case AFSR0_EL1: write_sysreg_s(val, afsr0_EL12); return; 135 - case AFSR1_EL1: write_sysreg_s(val, afsr1_EL12); return; 136 - case FAR_EL1: write_sysreg_s(val, far_EL12); return; 137 - case MAIR_EL1: write_sysreg_s(val, mair_EL12); return; 138 - case VBAR_EL1: write_sysreg_s(val, vbar_EL12); return; 139 - case CONTEXTIDR_EL1: write_sysreg_s(val, contextidr_EL12); return; 129 + case CPACR_EL1: write_sysreg_s(val, SYS_CPACR_EL12); return; 130 + case TTBR0_EL1: write_sysreg_s(val, SYS_TTBR0_EL12); return; 131 + case TTBR1_EL1: write_sysreg_s(val, SYS_TTBR1_EL12); return; 132 + case TCR_EL1: write_sysreg_s(val, SYS_TCR_EL12); return; 133 + case ESR_EL1: write_sysreg_s(val, SYS_ESR_EL12); return; 134 + case AFSR0_EL1: write_sysreg_s(val, SYS_AFSR0_EL12); return; 135 + case AFSR1_EL1: write_sysreg_s(val, SYS_AFSR1_EL12); return; 136 + case FAR_EL1: write_sysreg_s(val, SYS_FAR_EL12); return; 137 + case MAIR_EL1: write_sysreg_s(val, SYS_MAIR_EL12); return; 138 + case VBAR_EL1: write_sysreg_s(val, SYS_VBAR_EL12); return; 139 + case CONTEXTIDR_EL1: write_sysreg_s(val, SYS_CONTEXTIDR_EL12); return; 140 140 case TPIDR_EL0: write_sysreg_s(val, SYS_TPIDR_EL0); return; 141 141 case TPIDRRO_EL0: write_sysreg_s(val, SYS_TPIDRRO_EL0); return; 142 142 case 
TPIDR_EL1: write_sysreg_s(val, SYS_TPIDR_EL1); return; 143 - case AMAIR_EL1: write_sysreg_s(val, amair_EL12); return; 144 - case CNTKCTL_EL1: write_sysreg_s(val, cntkctl_EL12); return; 143 + case AMAIR_EL1: write_sysreg_s(val, SYS_AMAIR_EL12); return; 144 + case CNTKCTL_EL1: write_sysreg_s(val, SYS_CNTKCTL_EL12); return; 145 145 case PAR_EL1: write_sysreg_s(val, SYS_PAR_EL1); return; 146 146 case DACR32_EL2: write_sysreg_s(val, SYS_DACR32_EL2); return; 147 147 case IFSR32_EL2: write_sysreg_s(val, SYS_IFSR32_EL2); return; ··· 865 865 if (r->Op2 & 0x1) { 866 866 /* accessing PMCNTENSET_EL0 */ 867 867 __vcpu_sys_reg(vcpu, PMCNTENSET_EL0) |= val; 868 - kvm_pmu_enable_counter(vcpu, val); 868 + kvm_pmu_enable_counter_mask(vcpu, val); 869 869 kvm_vcpu_pmu_restore_guest(vcpu); 870 870 } else { 871 871 /* accessing PMCNTENCLR_EL0 */ 872 872 __vcpu_sys_reg(vcpu, PMCNTENSET_EL0) &= ~val; 873 - kvm_pmu_disable_counter(vcpu, val); 873 + kvm_pmu_disable_counter_mask(vcpu, val); 874 874 } 875 875 } else { 876 876 p->regval = __vcpu_sys_reg(vcpu, PMCNTENSET_EL0) & mask;
+3 -4
arch/arm64/kvm/va_layout.c
··· 170 170 addr |= ((u64)origptr & GENMASK_ULL(10, 7)); 171 171 172 172 /* 173 - * Branch to the second instruction in the vectors in order to 174 - * avoid the initial store on the stack (which we already 175 - * perform in the hardening vectors). 173 + * Branch over the preamble in order to avoid the initial store on 174 + * the stack (which we already perform in the hardening vectors). 176 175 */ 177 - addr += AARCH64_INSN_SIZE; 176 + addr += KVM_VECTOR_PREAMBLE; 178 177 179 178 /* stp x0, x1, [sp, #-16]! */ 180 179 insn = aarch64_insn_gen_load_store_pair(AARCH64_INSN_REG_0,
+2 -2
arch/mips/kvm/mips.c
··· 123 123 return 0; 124 124 } 125 125 126 - void kvm_arch_check_processor_compat(void *rtn) 126 + int kvm_arch_check_processor_compat(void) 127 127 { 128 - *(int *)rtn = 0; 128 + return 0; 129 129 } 130 130 131 131 int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
+2 -2
arch/powerpc/kvm/powerpc.c
··· 414 414 return 0; 415 415 } 416 416 417 - void kvm_arch_check_processor_compat(void *rtn) 417 + int kvm_arch_check_processor_compat(void) 418 418 { 419 - *(int *)rtn = kvmppc_core_check_processor_compat(); 419 + return kvmppc_core_check_processor_compat(); 420 420 } 421 421 422 422 int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
-1
arch/s390/include/asm/kvm_host.h
··· 912 912 extern int kvm_s390_gisc_unregister(struct kvm *kvm, u32 gisc); 913 913 914 914 static inline void kvm_arch_hardware_disable(void) {} 915 - static inline void kvm_arch_check_processor_compat(void *rtn) {} 916 915 static inline void kvm_arch_sync_events(struct kvm *kvm) {} 917 916 static inline void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) {} 918 917 static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
+7 -2
arch/s390/kvm/kvm-s390.c
··· 227 227 return 0; 228 228 } 229 229 230 + int kvm_arch_check_processor_compat(void) 231 + { 232 + return 0; 233 + } 234 + 230 235 static void kvm_gmap_notifier(struct gmap *gmap, unsigned long start, 231 236 unsigned long end); 232 237 ··· 2423 2418 kvm->arch.sca = (struct bsca_block *) get_zeroed_page(alloc_flags); 2424 2419 if (!kvm->arch.sca) 2425 2420 goto out_err; 2426 - spin_lock(&kvm_lock); 2421 + mutex_lock(&kvm_lock); 2427 2422 sca_offset += 16; 2428 2423 if (sca_offset + sizeof(struct bsca_block) > PAGE_SIZE) 2429 2424 sca_offset = 0; 2430 2425 kvm->arch.sca = (struct bsca_block *) 2431 2426 ((char *) kvm->arch.sca + sca_offset); 2432 - spin_unlock(&kvm_lock); 2427 + mutex_unlock(&kvm_lock); 2433 2428 2434 2429 sprintf(debug_name, "kvm-%u", current->pid); 2435 2430
+8 -3
arch/x86/include/asm/kvm_host.h
··· 686 686 u32 virtual_tsc_mult; 687 687 u32 virtual_tsc_khz; 688 688 s64 ia32_tsc_adjust_msr; 689 + u64 msr_ia32_power_ctl; 689 690 u64 tsc_scaling_ratio; 690 691 691 692 atomic_t nmi_queued; /* unprocessed asynchronous NMIs */ ··· 752 751 u64 msr_val; 753 752 struct gfn_to_hva_cache data; 754 753 } pv_eoi; 754 + 755 + u64 msr_kvm_poll_control; 755 756 756 757 /* 757 758 * Indicate whether the access faults on its page table in guest ··· 882 879 bool mwait_in_guest; 883 880 bool hlt_in_guest; 884 881 bool pause_in_guest; 882 + bool cstate_in_guest; 885 883 886 884 unsigned long irq_sources_bitmap; 887 885 s64 kvmclock_offset; ··· 930 926 931 927 bool guest_can_read_msr_platform_info; 932 928 bool exception_payload_enabled; 929 + 930 + struct kvm_pmu_event_filter *pmu_event_filter; 933 931 }; 934 932 935 933 struct kvm_vm_stat { ··· 1002 996 int (*disabled_by_bios)(void); /* __init */ 1003 997 int (*hardware_enable)(void); 1004 998 void (*hardware_disable)(void); 1005 - void (*check_processor_compatibility)(void *rtn); 999 + int (*check_processor_compatibility)(void);/* __init */ 1006 1000 int (*hardware_setup)(void); /* __init */ 1007 1001 void (*hardware_unsetup)(void); /* __exit */ 1008 1002 bool (*cpu_has_accelerated_tpr)(void); ··· 1116 1110 int (*check_intercept)(struct kvm_vcpu *vcpu, 1117 1111 struct x86_instruction_info *info, 1118 1112 enum x86_intercept_stage stage); 1119 - void (*handle_external_intr)(struct kvm_vcpu *vcpu); 1113 + void (*handle_exit_irqoff)(struct kvm_vcpu *vcpu); 1120 1114 bool (*mpx_supported)(void); 1121 1115 bool (*xsaves_supported)(void); 1122 1116 bool (*umip_emulated)(void); ··· 1535 1529 unsigned long ipi_bitmap_high, u32 min, 1536 1530 unsigned long icr, int op_64_bit); 1537 1531 1538 - u64 kvm_get_arch_capabilities(void); 1539 1532 void kvm_define_shared_msr(unsigned index, u32 msr); 1540 1533 int kvm_set_shared_msr(unsigned index, u64 val, u64 mask); 1541 1534
+15 -4
arch/x86/include/uapi/asm/kvm.h
··· 378 378 struct kvm_vcpu_events events; 379 379 }; 380 380 381 - #define KVM_X86_QUIRK_LINT0_REENABLED (1 << 0) 382 - #define KVM_X86_QUIRK_CD_NW_CLEARED (1 << 1) 383 - #define KVM_X86_QUIRK_LAPIC_MMIO_HOLE (1 << 2) 384 - #define KVM_X86_QUIRK_OUT_7E_INC_RIP (1 << 3) 381 + #define KVM_X86_QUIRK_LINT0_REENABLED (1 << 0) 382 + #define KVM_X86_QUIRK_CD_NW_CLEARED (1 << 1) 383 + #define KVM_X86_QUIRK_LAPIC_MMIO_HOLE (1 << 2) 384 + #define KVM_X86_QUIRK_OUT_7E_INC_RIP (1 << 3) 385 + #define KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT (1 << 4) 385 386 386 387 #define KVM_STATE_NESTED_FORMAT_VMX 0 387 388 #define KVM_STATE_NESTED_FORMAT_SVM 1 /* unused */ ··· 432 431 struct kvm_vmx_nested_state_data vmx[0]; 433 432 } data; 434 433 }; 434 + 435 + /* for KVM_CAP_PMU_EVENT_FILTER */ 436 + struct kvm_pmu_event_filter { 437 + __u32 action; 438 + __u32 nevents; 439 + __u64 events[0]; 440 + }; 441 + 442 + #define KVM_PMU_EVENT_ALLOW 0 443 + #define KVM_PMU_EVENT_DENY 1 435 444 436 445 #endif /* _ASM_X86_KVM_H */
+3
arch/x86/include/uapi/asm/kvm_para.h
··· 29 29 #define KVM_FEATURE_PV_TLB_FLUSH 9 30 30 #define KVM_FEATURE_ASYNC_PF_VMEXIT 10 31 31 #define KVM_FEATURE_PV_SEND_IPI 11 32 + #define KVM_FEATURE_POLL_CONTROL 12 33 + #define KVM_FEATURE_PV_SCHED_YIELD 13 32 34 33 35 #define KVM_HINTS_REALTIME 0 34 36 ··· 49 47 #define MSR_KVM_ASYNC_PF_EN 0x4b564d02 50 48 #define MSR_KVM_STEAL_TIME 0x4b564d03 51 49 #define MSR_KVM_PV_EOI_EN 0x4b564d04 50 + #define MSR_KVM_POLL_CONTROL 0x4b564d05 52 51 53 52 struct kvm_steal_time { 54 53 __u64 steal;
-1
arch/x86/include/uapi/asm/vmx.h
··· 146 146 147 147 #define VMX_ABORT_SAVE_GUEST_MSR_FAIL 1 148 148 #define VMX_ABORT_LOAD_HOST_PDPTE_FAIL 2 149 - #define VMX_ABORT_VMCS_CORRUPTED 3 150 149 #define VMX_ABORT_LOAD_HOST_MSR_FAIL 4 151 150 152 151 #endif /* _UAPIVMX_H */
+21
arch/x86/kernel/kvm.c
··· 527 527 pr_info("KVM setup pv IPIs\n"); 528 528 } 529 529 530 + static void kvm_smp_send_call_func_ipi(const struct cpumask *mask) 531 + { 532 + int cpu; 533 + 534 + native_send_call_func_ipi(mask); 535 + 536 + /* Make sure other vCPUs get a chance to run if they need to. */ 537 + for_each_cpu(cpu, mask) { 538 + if (vcpu_is_preempted(cpu)) { 539 + kvm_hypercall1(KVM_HC_SCHED_YIELD, per_cpu(x86_cpu_to_apicid, cpu)); 540 + break; 541 + } 542 + } 543 + } 544 + 530 545 static void __init kvm_smp_prepare_cpus(unsigned int max_cpus) 531 546 { 532 547 native_smp_prepare_cpus(max_cpus); ··· 653 638 #ifdef CONFIG_SMP 654 639 smp_ops.smp_prepare_cpus = kvm_smp_prepare_cpus; 655 640 smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu; 641 + if (kvm_para_has_feature(KVM_FEATURE_PV_SCHED_YIELD) && 642 + !kvm_para_has_hint(KVM_HINTS_REALTIME) && 643 + kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) { 644 + smp_ops.send_call_func_ipi = kvm_smp_send_call_func_ipi; 645 + pr_info("KVM setup pv sched yield\n"); 646 + } 656 647 if (cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN, "x86/kvm:online", 657 648 kvm_cpu_online, kvm_cpu_down_prepare) < 0) 658 649 pr_err("kvm_guest: Failed to install cpu hotplug callbacks\n");
+1
arch/x86/kvm/Kconfig
··· 41 41 select PERF_EVENTS 42 42 select HAVE_KVM_MSI 43 43 select HAVE_KVM_CPU_RELAX_INTERCEPT 44 + select HAVE_KVM_NO_POLL 44 45 select KVM_GENERIC_DIRTYLOG_READ_PROTECT 45 46 select KVM_VFIO 46 47 select SRCU
+143 -104
arch/x86/kvm/cpuid.c
··· 134 134 (best->eax & (1 << KVM_FEATURE_PV_UNHALT))) 135 135 best->eax &= ~(1 << KVM_FEATURE_PV_UNHALT); 136 136 137 + if (!kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT)) { 138 + best = kvm_find_cpuid_entry(vcpu, 0x1, 0); 139 + if (best) { 140 + if (vcpu->arch.ia32_misc_enable_msr & MSR_IA32_MISC_ENABLE_MWAIT) 141 + best->ecx |= F(MWAIT); 142 + else 143 + best->ecx &= ~F(MWAIT); 144 + } 145 + } 146 + 137 147 /* Update physical-address width */ 138 148 vcpu->arch.maxphyaddr = cpuid_query_maxphyaddr(vcpu); 139 149 kvm_mmu_reset_context(vcpu); ··· 286 276 *word &= boot_cpu_data.x86_capability[wordnum]; 287 277 } 288 278 289 - static void do_cpuid_1_ent(struct kvm_cpuid_entry2 *entry, u32 function, 279 + static void do_host_cpuid(struct kvm_cpuid_entry2 *entry, u32 function, 290 280 u32 index) 291 281 { 292 282 entry->function = function; 293 283 entry->index = index; 284 + entry->flags = 0; 285 + 294 286 cpuid_count(entry->function, entry->index, 295 287 &entry->eax, &entry->ebx, &entry->ecx, &entry->edx); 296 - entry->flags = 0; 288 + 289 + switch (function) { 290 + case 2: 291 + entry->flags |= KVM_CPUID_FLAG_STATEFUL_FUNC; 292 + break; 293 + case 4: 294 + case 7: 295 + case 0xb: 296 + case 0xd: 297 + case 0x14: 298 + case 0x8000001d: 299 + entry->flags |= KVM_CPUID_FLAG_SIGNIFCANT_INDEX; 300 + break; 301 + } 297 302 } 298 303 299 - static int __do_cpuid_ent_emulated(struct kvm_cpuid_entry2 *entry, 300 - u32 func, u32 index, int *nent, int maxnent) 304 + static int __do_cpuid_func_emulated(struct kvm_cpuid_entry2 *entry, 305 + u32 func, int *nent, int maxnent) 301 306 { 307 + entry->function = func; 308 + entry->index = 0; 309 + entry->flags = 0; 310 + 302 311 switch (func) { 303 312 case 0: 304 313 entry->eax = 7; ··· 329 300 break; 330 301 case 7: 331 302 entry->flags |= KVM_CPUID_FLAG_SIGNIFCANT_INDEX; 332 - if (index == 0) 333 - entry->ecx = F(RDPID); 303 + entry->eax = 0; 304 + entry->ecx = F(RDPID); 334 305 ++*nent; 335 306 default: 336 
307 break; 337 308 } 338 309 339 - entry->function = func; 340 - entry->index = index; 341 - 342 310 return 0; 343 311 } 344 312 345 - static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function, 346 - u32 index, int *nent, int maxnent) 313 + static inline void do_cpuid_7_mask(struct kvm_cpuid_entry2 *entry, int index) 314 + { 315 + unsigned f_invpcid = kvm_x86_ops->invpcid_supported() ? F(INVPCID) : 0; 316 + unsigned f_mpx = kvm_mpx_supported() ? F(MPX) : 0; 317 + unsigned f_umip = kvm_x86_ops->umip_emulated() ? F(UMIP) : 0; 318 + unsigned f_intel_pt = kvm_x86_ops->pt_supported() ? F(INTEL_PT) : 0; 319 + unsigned f_la57; 320 + 321 + /* cpuid 7.0.ebx */ 322 + const u32 kvm_cpuid_7_0_ebx_x86_features = 323 + F(FSGSBASE) | F(BMI1) | F(HLE) | F(AVX2) | F(SMEP) | 324 + F(BMI2) | F(ERMS) | f_invpcid | F(RTM) | f_mpx | F(RDSEED) | 325 + F(ADX) | F(SMAP) | F(AVX512IFMA) | F(AVX512F) | F(AVX512PF) | 326 + F(AVX512ER) | F(AVX512CD) | F(CLFLUSHOPT) | F(CLWB) | F(AVX512DQ) | 327 + F(SHA_NI) | F(AVX512BW) | F(AVX512VL) | f_intel_pt; 328 + 329 + /* cpuid 7.0.ecx*/ 330 + const u32 kvm_cpuid_7_0_ecx_x86_features = 331 + F(AVX512VBMI) | F(LA57) | F(PKU) | 0 /*OSPKE*/ | 332 + F(AVX512_VPOPCNTDQ) | F(UMIP) | F(AVX512_VBMI2) | F(GFNI) | 333 + F(VAES) | F(VPCLMULQDQ) | F(AVX512_VNNI) | F(AVX512_BITALG) | 334 + F(CLDEMOTE) | F(MOVDIRI) | F(MOVDIR64B); 335 + 336 + /* cpuid 7.0.edx*/ 337 + const u32 kvm_cpuid_7_0_edx_x86_features = 338 + F(AVX512_4VNNIW) | F(AVX512_4FMAPS) | F(SPEC_CTRL) | 339 + F(SPEC_CTRL_SSBD) | F(ARCH_CAPABILITIES) | F(INTEL_STIBP) | 340 + F(MD_CLEAR); 341 + 342 + switch (index) { 343 + case 0: 344 + entry->eax = 0; 345 + entry->ebx &= kvm_cpuid_7_0_ebx_x86_features; 346 + cpuid_mask(&entry->ebx, CPUID_7_0_EBX); 347 + /* TSC_ADJUST is emulated */ 348 + entry->ebx |= F(TSC_ADJUST); 349 + 350 + entry->ecx &= kvm_cpuid_7_0_ecx_x86_features; 351 + f_la57 = entry->ecx & F(LA57); 352 + cpuid_mask(&entry->ecx, CPUID_7_ECX); 353 + /* Set LA57 based on 
hardware capability. */ 354 + entry->ecx |= f_la57; 355 + entry->ecx |= f_umip; 356 + /* PKU is not yet implemented for shadow paging. */ 357 + if (!tdp_enabled || !boot_cpu_has(X86_FEATURE_OSPKE)) 358 + entry->ecx &= ~F(PKU); 359 + 360 + entry->edx &= kvm_cpuid_7_0_edx_x86_features; 361 + cpuid_mask(&entry->edx, CPUID_7_EDX); 362 + /* 363 + * We emulate ARCH_CAPABILITIES in software even 364 + * if the host doesn't support it. 365 + */ 366 + entry->edx |= F(ARCH_CAPABILITIES); 367 + break; 368 + default: 369 + WARN_ON_ONCE(1); 370 + entry->eax = 0; 371 + entry->ebx = 0; 372 + entry->ecx = 0; 373 + entry->edx = 0; 374 + break; 375 + } 376 + } 377 + 378 + static inline int __do_cpuid_func(struct kvm_cpuid_entry2 *entry, u32 function, 379 + int *nent, int maxnent) 347 380 { 348 381 int r; 349 382 unsigned f_nx = is_efer_nx() ? F(NX) : 0; ··· 418 327 unsigned f_lm = 0; 419 328 #endif 420 329 unsigned f_rdtscp = kvm_x86_ops->rdtscp_supported() ? F(RDTSCP) : 0; 421 - unsigned f_invpcid = kvm_x86_ops->invpcid_supported() ? F(INVPCID) : 0; 422 - unsigned f_mpx = kvm_mpx_supported() ? F(MPX) : 0; 423 330 unsigned f_xsaves = kvm_x86_ops->xsaves_supported() ? F(XSAVES) : 0; 424 - unsigned f_umip = kvm_x86_ops->umip_emulated() ? F(UMIP) : 0; 425 331 unsigned f_intel_pt = kvm_x86_ops->pt_supported() ? 
F(INTEL_PT) : 0; 426 - unsigned f_la57 = 0; 427 332 428 333 /* cpuid 1.edx */ 429 334 const u32 kvm_cpuid_1_edx_x86_features = ··· 464 377 /* cpuid 0x80000008.ebx */ 465 378 const u32 kvm_cpuid_8000_0008_ebx_x86_features = 466 379 F(WBNOINVD) | F(AMD_IBPB) | F(AMD_IBRS) | F(AMD_SSBD) | F(VIRT_SSBD) | 467 - F(AMD_SSB_NO) | F(AMD_STIBP); 380 + F(AMD_SSB_NO) | F(AMD_STIBP) | F(AMD_STIBP_ALWAYS_ON); 468 381 469 382 /* cpuid 0xC0000001.edx */ 470 383 const u32 kvm_cpuid_C000_0001_edx_x86_features = ··· 472 385 F(ACE2) | F(ACE2_EN) | F(PHE) | F(PHE_EN) | 473 386 F(PMM) | F(PMM_EN); 474 387 475 - /* cpuid 7.0.ebx */ 476 - const u32 kvm_cpuid_7_0_ebx_x86_features = 477 - F(FSGSBASE) | F(BMI1) | F(HLE) | F(AVX2) | F(SMEP) | 478 - F(BMI2) | F(ERMS) | f_invpcid | F(RTM) | f_mpx | F(RDSEED) | 479 - F(ADX) | F(SMAP) | F(AVX512IFMA) | F(AVX512F) | F(AVX512PF) | 480 - F(AVX512ER) | F(AVX512CD) | F(CLFLUSHOPT) | F(CLWB) | F(AVX512DQ) | 481 - F(SHA_NI) | F(AVX512BW) | F(AVX512VL) | f_intel_pt; 482 - 483 388 /* cpuid 0xD.1.eax */ 484 389 const u32 kvm_cpuid_D_1_eax_x86_features = 485 390 F(XSAVEOPT) | F(XSAVEC) | F(XGETBV1) | f_xsaves; 486 - 487 - /* cpuid 7.0.ecx*/ 488 - const u32 kvm_cpuid_7_0_ecx_x86_features = 489 - F(AVX512VBMI) | F(LA57) | F(PKU) | 0 /*OSPKE*/ | 490 - F(AVX512_VPOPCNTDQ) | F(UMIP) | F(AVX512_VBMI2) | F(GFNI) | 491 - F(VAES) | F(VPCLMULQDQ) | F(AVX512_VNNI) | F(AVX512_BITALG) | 492 - F(CLDEMOTE) | F(MOVDIRI) | F(MOVDIR64B); 493 - 494 - /* cpuid 7.0.edx*/ 495 - const u32 kvm_cpuid_7_0_edx_x86_features = 496 - F(AVX512_4VNNIW) | F(AVX512_4FMAPS) | F(SPEC_CTRL) | 497 - F(SPEC_CTRL_SSBD) | F(ARCH_CAPABILITIES) | F(INTEL_STIBP) | 498 - F(MD_CLEAR); 499 391 500 392 /* all calls to cpuid_count() should be made on the same cpu */ 501 393 get_cpu(); ··· 484 418 if (*nent >= maxnent) 485 419 goto out; 486 420 487 - do_cpuid_1_ent(entry, function, index); 421 + do_host_cpuid(entry, function, 0); 488 422 ++*nent; 489 423 490 424 switch (function) { 491 425 case 0: 492 - 
entry->eax = min(entry->eax, (u32)(f_intel_pt ? 0x14 : 0xd)); 426 + /* Limited to the highest leaf implemented in KVM. */ 427 + entry->eax = min(entry->eax, 0x1fU); 493 428 break; 494 429 case 1: 495 430 entry->edx &= kvm_cpuid_1_edx_x86_features; ··· 508 441 case 2: { 509 442 int t, times = entry->eax & 0xff; 510 443 511 - entry->flags |= KVM_CPUID_FLAG_STATEFUL_FUNC; 512 444 entry->flags |= KVM_CPUID_FLAG_STATE_READ_NEXT; 513 445 for (t = 1; t < times; ++t) { 514 446 if (*nent >= maxnent) 515 447 goto out; 516 448 517 - do_cpuid_1_ent(&entry[t], function, 0); 518 - entry[t].flags |= KVM_CPUID_FLAG_STATEFUL_FUNC; 449 + do_host_cpuid(&entry[t], function, 0); 519 450 ++*nent; 520 451 } 521 452 break; ··· 523 458 case 0x8000001d: { 524 459 int i, cache_type; 525 460 526 - entry->flags |= KVM_CPUID_FLAG_SIGNIFCANT_INDEX; 527 461 /* read more entries until cache_type is zero */ 528 462 for (i = 1; ; ++i) { 529 463 if (*nent >= maxnent) ··· 531 467 cache_type = entry[i - 1].eax & 0x1f; 532 468 if (!cache_type) 533 469 break; 534 - do_cpuid_1_ent(&entry[i], function, i); 535 - entry[i].flags |= 536 - KVM_CPUID_FLAG_SIGNIFCANT_INDEX; 470 + do_host_cpuid(&entry[i], function, i); 537 471 ++*nent; 538 472 } 539 473 break; ··· 542 480 entry->ecx = 0; 543 481 entry->edx = 0; 544 482 break; 483 + /* function 7 has additional index. */ 545 484 case 7: { 546 - entry->flags |= KVM_CPUID_FLAG_SIGNIFCANT_INDEX; 547 - /* Mask ebx against host capability word 9 */ 548 - if (index == 0) { 549 - entry->ebx &= kvm_cpuid_7_0_ebx_x86_features; 550 - cpuid_mask(&entry->ebx, CPUID_7_0_EBX); 551 - // TSC_ADJUST is emulated 552 - entry->ebx |= F(TSC_ADJUST); 553 - entry->ecx &= kvm_cpuid_7_0_ecx_x86_features; 554 - f_la57 = entry->ecx & F(LA57); 555 - cpuid_mask(&entry->ecx, CPUID_7_ECX); 556 - /* Set LA57 based on hardware capability. */ 557 - entry->ecx |= f_la57; 558 - entry->ecx |= f_umip; 559 - /* PKU is not yet implemented for shadow paging. 
*/ 560 - if (!tdp_enabled || !boot_cpu_has(X86_FEATURE_OSPKE)) 561 - entry->ecx &= ~F(PKU); 562 - entry->edx &= kvm_cpuid_7_0_edx_x86_features; 563 - cpuid_mask(&entry->edx, CPUID_7_EDX); 564 - /* 565 - * We emulate ARCH_CAPABILITIES in software even 566 - * if the host doesn't support it. 567 - */ 568 - entry->edx |= F(ARCH_CAPABILITIES); 569 - } else { 570 - entry->ebx = 0; 571 - entry->ecx = 0; 572 - entry->edx = 0; 485 + int i; 486 + 487 + for (i = 0; ; ) { 488 + do_cpuid_7_mask(&entry[i], i); 489 + if (i == entry->eax) 490 + break; 491 + if (*nent >= maxnent) 492 + goto out; 493 + 494 + ++i; 495 + do_host_cpuid(&entry[i], function, i); 496 + ++*nent; 573 497 } 574 - entry->eax = 0; 575 498 break; 576 499 } 577 500 case 9: ··· 590 543 entry->edx = edx.full; 591 544 break; 592 545 } 593 - /* function 0xb has additional index. */ 546 + /* 547 + * Per Intel's SDM, the 0x1f is a superset of 0xb, 548 + * thus they can be handled by common code. 549 + */ 550 + case 0x1f: 594 551 case 0xb: { 595 552 int i, level_type; 596 553 597 - entry->flags |= KVM_CPUID_FLAG_SIGNIFCANT_INDEX; 598 554 /* read more entries until level_type is zero */ 599 555 for (i = 1; ; ++i) { 600 556 if (*nent >= maxnent) ··· 606 556 level_type = entry[i - 1].ecx & 0xff00; 607 557 if (!level_type) 608 558 break; 609 - do_cpuid_1_ent(&entry[i], function, i); 610 - entry[i].flags |= 611 - KVM_CPUID_FLAG_SIGNIFCANT_INDEX; 559 + do_host_cpuid(&entry[i], function, i); 612 560 ++*nent; 613 561 } 614 562 break; ··· 619 571 entry->ebx = xstate_required_size(supported, false); 620 572 entry->ecx = entry->ebx; 621 573 entry->edx &= supported >> 32; 622 - entry->flags |= KVM_CPUID_FLAG_SIGNIFCANT_INDEX; 623 574 if (!supported) 624 575 break; 625 576 ··· 627 580 if (*nent >= maxnent) 628 581 goto out; 629 582 630 - do_cpuid_1_ent(&entry[i], function, idx); 583 + do_host_cpuid(&entry[i], function, idx); 631 584 if (idx == 1) { 632 585 entry[i].eax &= kvm_cpuid_D_1_eax_x86_features; 633 586 
cpuid_mask(&entry[i].eax, CPUID_D_1_EAX); ··· 644 597 } 645 598 entry[i].ecx = 0; 646 599 entry[i].edx = 0; 647 - entry[i].flags |= 648 - KVM_CPUID_FLAG_SIGNIFCANT_INDEX; 649 600 ++*nent; 650 601 ++i; 651 602 } ··· 656 611 if (!f_intel_pt) 657 612 break; 658 613 659 - entry->flags |= KVM_CPUID_FLAG_SIGNIFCANT_INDEX; 660 614 for (t = 1; t <= times; ++t) { 661 615 if (*nent >= maxnent) 662 616 goto out; 663 - do_cpuid_1_ent(&entry[t], function, t); 664 - entry[t].flags |= KVM_CPUID_FLAG_SIGNIFCANT_INDEX; 617 + do_host_cpuid(&entry[t], function, t); 665 618 ++*nent; 666 619 } 667 620 break; ··· 683 640 (1 << KVM_FEATURE_PV_UNHALT) | 684 641 (1 << KVM_FEATURE_PV_TLB_FLUSH) | 685 642 (1 << KVM_FEATURE_ASYNC_PF_VMEXIT) | 686 - (1 << KVM_FEATURE_PV_SEND_IPI); 643 + (1 << KVM_FEATURE_PV_SEND_IPI) | 644 + (1 << KVM_FEATURE_POLL_CONTROL) | 645 + (1 << KVM_FEATURE_PV_SCHED_YIELD); 687 646 688 647 if (sched_info_on()) 689 648 entry->eax |= (1 << KVM_FEATURE_STEAL_TIME); ··· 775 730 return r; 776 731 } 777 732 778 - static int do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 func, 779 - u32 idx, int *nent, int maxnent, unsigned int type) 733 + static int do_cpuid_func(struct kvm_cpuid_entry2 *entry, u32 func, 734 + int *nent, int maxnent, unsigned int type) 780 735 { 781 736 if (type == KVM_GET_EMULATED_CPUID) 782 - return __do_cpuid_ent_emulated(entry, func, idx, nent, maxnent); 737 + return __do_cpuid_func_emulated(entry, func, nent, maxnent); 783 738 784 - return __do_cpuid_ent(entry, func, idx, nent, maxnent); 739 + return __do_cpuid_func(entry, func, nent, maxnent); 785 740 } 786 741 787 742 #undef F 788 743 789 744 struct kvm_cpuid_param { 790 745 u32 func; 791 - u32 idx; 792 - bool has_leaf_count; 793 746 bool (*qualifier)(const struct kvm_cpuid_param *param); 794 747 }; 795 748 ··· 831 788 int limit, nent = 0, r = -E2BIG, i; 832 789 u32 func; 833 790 static const struct kvm_cpuid_param param[] = { 834 - { .func = 0, .has_leaf_count = true }, 835 - { .func = 0x80000000, 
.has_leaf_count = true }, 836 - { .func = 0xC0000000, .qualifier = is_centaur_cpu, .has_leaf_count = true }, 791 + { .func = 0 }, 792 + { .func = 0x80000000 }, 793 + { .func = 0xC0000000, .qualifier = is_centaur_cpu }, 837 794 { .func = KVM_CPUID_SIGNATURE }, 838 - { .func = KVM_CPUID_FEATURES }, 839 795 }; 840 796 841 797 if (cpuid->nent < 1) ··· 858 816 if (ent->qualifier && !ent->qualifier(ent)) 859 817 continue; 860 818 861 - r = do_cpuid_ent(&cpuid_entries[nent], ent->func, ent->idx, 862 - &nent, cpuid->nent, type); 819 + r = do_cpuid_func(&cpuid_entries[nent], ent->func, 820 + &nent, cpuid->nent, type); 863 821 864 822 if (r) 865 823 goto out_free; 866 824 867 - if (!ent->has_leaf_count) 868 - continue; 869 - 870 825 limit = cpuid_entries[nent - 1].eax; 871 826 for (func = ent->func + 1; func <= limit && nent < cpuid->nent && r == 0; ++func) 872 - r = do_cpuid_ent(&cpuid_entries[nent], func, ent->idx, 873 - &nent, cpuid->nent, type); 827 + r = do_cpuid_func(&cpuid_entries[nent], func, 828 + &nent, cpuid->nent, type); 874 829 875 830 if (r) 876 831 goto out_free;
+1 -1
arch/x86/kvm/emulate.c
··· 4258 4258 ulong dr6; 4259 4259 4260 4260 ctxt->ops->get_dr(ctxt, 6, &dr6); 4261 - dr6 &= ~15; 4261 + dr6 &= ~DR_TRAP_BITS; 4262 4262 dr6 |= DR6_BD | DR6_RTM; 4263 4263 ctxt->ops->set_dr(ctxt, 6, dr6); 4264 4264 return emulate_db(ctxt);
-1
arch/x86/kvm/irq.h
··· 102 102 return mode != KVM_IRQCHIP_NONE; 103 103 } 104 104 105 - bool kvm_arch_irqfd_allowed(struct kvm *kvm, struct kvm_irqfd *args); 106 105 void kvm_inject_pending_timer_irqs(struct kvm_vcpu *vcpu); 107 106 void kvm_inject_apic_timer_irqs(struct kvm_vcpu *vcpu); 108 107 void kvm_apic_nmi_wd_deliver(struct kvm_vcpu *vcpu);
+1 -1
arch/x86/kvm/irq_comm.c
··· 75 75 if (r < 0) 76 76 r = 0; 77 77 r += kvm_apic_set_irq(vcpu, irq, dest_map); 78 - } else if (kvm_lapic_enabled(vcpu)) { 78 + } else if (kvm_apic_sw_enabled(vcpu->arch.apic)) { 79 79 if (!kvm_vector_hashing_enabled()) { 80 80 if (!lowest) 81 81 lowest = vcpu;
+77 -44
arch/x86/kvm/lapic.c
··· 69 69 #define X2APIC_BROADCAST 0xFFFFFFFFul 70 70 71 71 #define LAPIC_TIMER_ADVANCE_ADJUST_DONE 100 72 + #define LAPIC_TIMER_ADVANCE_ADJUST_INIT 1000 72 73 /* step-by-step approximation to mitigate fluctuation */ 73 74 #define LAPIC_TIMER_ADVANCE_ADJUST_STEP 8 74 75 ··· 84 83 85 84 return apic_test_vector(vector, apic->regs + APIC_ISR) || 86 85 apic_test_vector(vector, apic->regs + APIC_IRR); 87 - } 88 - 89 - static inline void apic_clear_vector(int vec, void *bitmap) 90 - { 91 - clear_bit(VEC_POS(vec), (bitmap) + REG_POS(vec)); 92 86 } 93 87 94 88 static inline int __apic_test_and_set_vector(int vec, void *bitmap) ··· 439 443 440 444 if (unlikely(vcpu->arch.apicv_active)) { 441 445 /* need to update RVI */ 442 - apic_clear_vector(vec, apic->regs + APIC_IRR); 446 + kvm_lapic_clear_vector(vec, apic->regs + APIC_IRR); 443 447 kvm_x86_ops->hwapic_irr_update(vcpu, 444 448 apic_find_highest_irr(apic)); 445 449 } else { 446 450 apic->irr_pending = false; 447 - apic_clear_vector(vec, apic->regs + APIC_IRR); 451 + kvm_lapic_clear_vector(vec, apic->regs + APIC_IRR); 448 452 if (apic_search_irr(apic) != -1) 449 453 apic->irr_pending = true; 450 454 } ··· 1049 1053 1050 1054 if (apic_test_vector(vector, apic->regs + APIC_TMR) != !!trig_mode) { 1051 1055 if (trig_mode) 1052 - kvm_lapic_set_vector(vector, apic->regs + APIC_TMR); 1056 + kvm_lapic_set_vector(vector, 1057 + apic->regs + APIC_TMR); 1053 1058 else 1054 - apic_clear_vector(vector, apic->regs + APIC_TMR); 1059 + kvm_lapic_clear_vector(vector, 1060 + apic->regs + APIC_TMR); 1055 1061 } 1056 1062 1057 1063 if (vcpu->arch.apicv_active) ··· 1311 1313 return container_of(dev, struct kvm_lapic, dev); 1312 1314 } 1313 1315 1316 + #define APIC_REG_MASK(reg) (1ull << ((reg) >> 4)) 1317 + #define APIC_REGS_MASK(first, count) \ 1318 + (APIC_REG_MASK(first) * ((1ull << (count)) - 1)) 1319 + 1314 1320 int kvm_lapic_reg_read(struct kvm_lapic *apic, u32 offset, int len, 1315 1321 void *data) 1316 1322 { 1317 1323 unsigned char 
alignment = offset & 0xf; 1318 1324 u32 result; 1319 1325 /* this bitmask has a bit cleared for each reserved register */ 1320 - static const u64 rmask = 0x43ff01ffffffe70cULL; 1326 + u64 valid_reg_mask = 1327 + APIC_REG_MASK(APIC_ID) | 1328 + APIC_REG_MASK(APIC_LVR) | 1329 + APIC_REG_MASK(APIC_TASKPRI) | 1330 + APIC_REG_MASK(APIC_PROCPRI) | 1331 + APIC_REG_MASK(APIC_LDR) | 1332 + APIC_REG_MASK(APIC_DFR) | 1333 + APIC_REG_MASK(APIC_SPIV) | 1334 + APIC_REGS_MASK(APIC_ISR, APIC_ISR_NR) | 1335 + APIC_REGS_MASK(APIC_TMR, APIC_ISR_NR) | 1336 + APIC_REGS_MASK(APIC_IRR, APIC_ISR_NR) | 1337 + APIC_REG_MASK(APIC_ESR) | 1338 + APIC_REG_MASK(APIC_ICR) | 1339 + APIC_REG_MASK(APIC_ICR2) | 1340 + APIC_REG_MASK(APIC_LVTT) | 1341 + APIC_REG_MASK(APIC_LVTTHMR) | 1342 + APIC_REG_MASK(APIC_LVTPC) | 1343 + APIC_REG_MASK(APIC_LVT0) | 1344 + APIC_REG_MASK(APIC_LVT1) | 1345 + APIC_REG_MASK(APIC_LVTERR) | 1346 + APIC_REG_MASK(APIC_TMICT) | 1347 + APIC_REG_MASK(APIC_TMCCT) | 1348 + APIC_REG_MASK(APIC_TDCR); 1321 1349 1322 - if ((alignment + len) > 4) { 1323 - apic_debug("KVM_APIC_READ: alignment error %x %d\n", 1324 - offset, len); 1325 - return 1; 1326 - } 1350 + /* ARBPRI is not valid on x2APIC */ 1351 + if (!apic_x2apic_mode(apic)) 1352 + valid_reg_mask |= APIC_REG_MASK(APIC_ARBPRI); 1327 1353 1328 - if (offset > 0x3f0 || !(rmask & (1ULL << (offset >> 4)))) { 1354 + if (offset > 0x3f0 || !(valid_reg_mask & APIC_REG_MASK(offset))) { 1329 1355 apic_debug("KVM_APIC_READ: read reserved register %x\n", 1330 1356 offset); 1331 1357 return 1; ··· 1521 1499 } 1522 1500 } 1523 1501 1524 - void wait_lapic_expire(struct kvm_vcpu *vcpu) 1502 + static inline void adjust_lapic_timer_advance(struct kvm_vcpu *vcpu, 1503 + s64 advance_expire_delta) 1525 1504 { 1526 1505 struct kvm_lapic *apic = vcpu->arch.apic; 1527 1506 u32 timer_advance_ns = apic->lapic_timer.timer_advance_ns; 1528 - u64 guest_tsc, tsc_deadline, ns; 1507 + u64 ns; 1508 + 1509 + /* too early */ 1510 + if (advance_expire_delta < 0) { 
1511 + ns = -advance_expire_delta * 1000000ULL; 1512 + do_div(ns, vcpu->arch.virtual_tsc_khz); 1513 + timer_advance_ns -= min((u32)ns, 1514 + timer_advance_ns / LAPIC_TIMER_ADVANCE_ADJUST_STEP); 1515 + } else { 1516 + /* too late */ 1517 + ns = advance_expire_delta * 1000000ULL; 1518 + do_div(ns, vcpu->arch.virtual_tsc_khz); 1519 + timer_advance_ns += min((u32)ns, 1520 + timer_advance_ns / LAPIC_TIMER_ADVANCE_ADJUST_STEP); 1521 + } 1522 + 1523 + if (abs(advance_expire_delta) < LAPIC_TIMER_ADVANCE_ADJUST_DONE) 1524 + apic->lapic_timer.timer_advance_adjust_done = true; 1525 + if (unlikely(timer_advance_ns > 5000)) { 1526 + timer_advance_ns = LAPIC_TIMER_ADVANCE_ADJUST_INIT; 1527 + apic->lapic_timer.timer_advance_adjust_done = false; 1528 + } 1529 + apic->lapic_timer.timer_advance_ns = timer_advance_ns; 1530 + } 1531 + 1532 + void kvm_wait_lapic_expire(struct kvm_vcpu *vcpu) 1533 + { 1534 + struct kvm_lapic *apic = vcpu->arch.apic; 1535 + u64 guest_tsc, tsc_deadline; 1529 1536 1530 1537 if (apic->lapic_timer.expired_tscdeadline == 0) 1531 1538 return; ··· 1565 1514 tsc_deadline = apic->lapic_timer.expired_tscdeadline; 1566 1515 apic->lapic_timer.expired_tscdeadline = 0; 1567 1516 guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc()); 1568 - trace_kvm_wait_lapic_expire(vcpu->vcpu_id, guest_tsc - tsc_deadline); 1517 + apic->lapic_timer.advance_expire_delta = guest_tsc - tsc_deadline; 1569 1518 1570 1519 if (guest_tsc < tsc_deadline) 1571 1520 __wait_lapic_expire(vcpu, tsc_deadline - guest_tsc); 1572 1521 1573 - if (!apic->lapic_timer.timer_advance_adjust_done) { 1574 - /* too early */ 1575 - if (guest_tsc < tsc_deadline) { 1576 - ns = (tsc_deadline - guest_tsc) * 1000000ULL; 1577 - do_div(ns, vcpu->arch.virtual_tsc_khz); 1578 - timer_advance_ns -= min((u32)ns, 1579 - timer_advance_ns / LAPIC_TIMER_ADVANCE_ADJUST_STEP); 1580 - } else { 1581 - /* too late */ 1582 - ns = (guest_tsc - tsc_deadline) * 1000000ULL; 1583 - do_div(ns, vcpu->arch.virtual_tsc_khz); 1584 - timer_advance_ns += 
min((u32)ns, 1585 - timer_advance_ns / LAPIC_TIMER_ADVANCE_ADJUST_STEP); 1586 - } 1587 - if (abs(guest_tsc - tsc_deadline) < LAPIC_TIMER_ADVANCE_ADJUST_DONE) 1588 - apic->lapic_timer.timer_advance_adjust_done = true; 1589 - if (unlikely(timer_advance_ns > 5000)) { 1590 - timer_advance_ns = 0; 1591 - apic->lapic_timer.timer_advance_adjust_done = true; 1592 - } 1593 - apic->lapic_timer.timer_advance_ns = timer_advance_ns; 1594 - } 1522 + if (unlikely(!apic->lapic_timer.timer_advance_adjust_done)) 1523 + adjust_lapic_timer_advance(vcpu, apic->lapic_timer.advance_expire_delta); 1595 1524 } 1525 + EXPORT_SYMBOL_GPL(kvm_wait_lapic_expire); 1596 1526 1597 1527 static void start_sw_tscdeadline(struct kvm_lapic *apic) 1598 1528 { ··· 2046 2014 apic_debug("%s: offset 0x%x with length 0x%x, and value is " 2047 2015 "0x%x\n", __func__, offset, len, val); 2048 2016 2049 - kvm_lapic_reg_write(apic, offset & 0xff0, val); 2017 + kvm_lapic_reg_write(apic, offset, val); 2050 2018 2051 2019 return 0; 2052 2020 } ··· 2343 2311 HRTIMER_MODE_ABS_PINNED); 2344 2312 apic->lapic_timer.timer.function = apic_timer_fn; 2345 2313 if (timer_advance_ns == -1) { 2346 - apic->lapic_timer.timer_advance_ns = 1000; 2314 + apic->lapic_timer.timer_advance_ns = LAPIC_TIMER_ADVANCE_ADJUST_INIT; 2347 2315 apic->lapic_timer.timer_advance_adjust_done = false; 2348 2316 } else { 2349 2317 apic->lapic_timer.timer_advance_ns = timer_advance_ns; ··· 2353 2321 2354 2322 /* 2355 2323 * APIC is created enabled. This will prevent kvm_lapic_set_base from 2356 - * thinking that APIC satet has changed. 2324 + * thinking that APIC state has changed. 2357 2325 */ 2358 2326 vcpu->arch.apic_base = MSR_IA32_APICBASE_ENABLE; 2359 2327 static_key_slow_inc(&apic_sw_disabled.key); /* sw disabled at reset */ ··· 2362 2330 return 0; 2363 2331 nomem_free_apic: 2364 2332 kfree(apic); 2333 + vcpu->arch.apic = NULL; 2365 2334 nomem: 2366 2335 return -ENOMEM; 2367 2336 }
+7 -1
arch/x86/kvm/lapic.h
··· 32 32 u64 tscdeadline; 33 33 u64 expired_tscdeadline; 34 34 u32 timer_advance_ns; 35 + s64 advance_expire_delta; 35 36 atomic_t pending; /* accumulated triggered timers */ 36 37 bool hv_timer_in_use; 37 38 bool timer_advance_adjust_done; ··· 130 129 #define VEC_POS(v) ((v) & (32 - 1)) 131 130 #define REG_POS(v) (((v) >> 5) << 4) 132 131 132 + static inline void kvm_lapic_clear_vector(int vec, void *bitmap) 133 + { 134 + clear_bit(VEC_POS(vec), (bitmap) + REG_POS(vec)); 135 + } 136 + 133 137 static inline void kvm_lapic_set_vector(int vec, void *bitmap) 134 138 { 135 139 set_bit(VEC_POS(vec), (bitmap) + REG_POS(vec)); ··· 225 219 226 220 bool kvm_apic_pending_eoi(struct kvm_vcpu *vcpu, int vector); 227 221 228 - void wait_lapic_expire(struct kvm_vcpu *vcpu); 222 + void kvm_wait_lapic_expire(struct kvm_vcpu *vcpu); 229 223 230 224 bool kvm_intr_is_single_vcpu_fast(struct kvm *kvm, struct kvm_lapic_irq *irq, 231 225 struct kvm_vcpu **dest_vcpu);
+108 -74
arch/x86/kvm/mmu.c
··· 140 140 141 141 #include <trace/events/kvm.h> 142 142 143 - #define CREATE_TRACE_POINTS 144 - #include "mmutrace.h" 145 - 146 143 #define SPTE_HOST_WRITEABLE (1ULL << PT_FIRST_AVAIL_BITS_SHIFT) 147 144 #define SPTE_MMU_WRITEABLE (1ULL << (PT_FIRST_AVAIL_BITS_SHIFT + 1)) 148 145 ··· 256 259 */ 257 260 static u64 __read_mostly shadow_nonpresent_or_rsvd_lower_gfn_mask; 258 261 262 + /* 263 + * The number of non-reserved physical address bits irrespective of features 264 + * that repurpose legal bits, e.g. MKTME. 265 + */ 266 + static u8 __read_mostly shadow_phys_bits; 259 267 260 268 static void mmu_spte_set(u64 *sptep, u64 spte); 269 + static bool is_executable_pte(u64 spte); 261 270 static union kvm_mmu_page_role 262 271 kvm_mmu_calc_root_page_role(struct kvm_vcpu *vcpu); 272 + 273 + #define CREATE_TRACE_POINTS 274 + #include "mmutrace.h" 263 275 264 276 265 277 static inline bool kvm_available_flush_tlb_with_range(void) ··· 474 468 } 475 469 EXPORT_SYMBOL_GPL(kvm_mmu_set_mask_ptes); 476 470 471 + static u8 kvm_get_shadow_phys_bits(void) 472 + { 473 + /* 474 + * boot_cpu_data.x86_phys_bits is reduced when MKTME is detected 475 + * in CPU detection code, but MKTME treats those reduced bits as 476 + * 'keyID' thus they are not reserved bits. Therefore for MKTME 477 + * we should still return physical address bits reported by CPUID. 
478 + */ 479 + if (!boot_cpu_has(X86_FEATURE_TME) || 480 + WARN_ON_ONCE(boot_cpu_data.extended_cpuid_level < 0x80000008)) 481 + return boot_cpu_data.x86_phys_bits; 482 + 483 + return cpuid_eax(0x80000008) & 0xff; 484 + } 485 + 477 486 static void kvm_mmu_reset_all_pte_masks(void) 478 487 { 479 488 u8 low_phys_bits; ··· 501 480 shadow_mmio_mask = 0; 502 481 shadow_present_mask = 0; 503 482 shadow_acc_track_mask = 0; 483 + 484 + shadow_phys_bits = kvm_get_shadow_phys_bits(); 504 485 505 486 /* 506 487 * If the CPU has 46 or less physical address bits, then set an ··· 1096 1073 1097 1074 static void kvm_mmu_page_set_gfn(struct kvm_mmu_page *sp, int index, gfn_t gfn) 1098 1075 { 1099 - if (sp->role.direct) 1100 - BUG_ON(gfn != kvm_mmu_page_get_gfn(sp, index)); 1101 - else 1076 + if (!sp->role.direct) { 1102 1077 sp->gfns[index] = gfn; 1078 + return; 1079 + } 1080 + 1081 + if (WARN_ON(gfn != kvm_mmu_page_get_gfn(sp, index))) 1082 + pr_err_ratelimited("gfn mismatch under direct page %llx " 1083 + "(expected %llx, got %llx)\n", 1084 + sp->gfn, 1085 + kvm_mmu_page_get_gfn(sp, index), gfn); 1103 1086 } 1104 1087 1105 1088 /* ··· 3084 3055 ret = RET_PF_EMULATE; 3085 3056 3086 3057 pgprintk("%s: setting spte %llx\n", __func__, *sptep); 3087 - pgprintk("instantiating %s PTE (%s) at %llx (%llx) addr %p\n", 3088 - is_large_pte(*sptep)? "2MB" : "4kB", 3089 - *sptep & PT_WRITABLE_MASK ? 
"RW" : "R", gfn, 3090 - *sptep, sptep); 3058 + trace_kvm_mmu_set_spte(level, gfn, sptep); 3091 3059 if (!was_rmapped && is_large_pte(*sptep)) 3092 3060 ++vcpu->kvm->stat.lpages; 3093 3061 ··· 3095 3069 rmap_recycle(vcpu, sptep, gfn); 3096 3070 } 3097 3071 } 3098 - 3099 - kvm_release_pfn_clean(pfn); 3100 3072 3101 3073 return ret; 3102 3074 } ··· 3130 3106 if (ret <= 0) 3131 3107 return -1; 3132 3108 3133 - for (i = 0; i < ret; i++, gfn++, start++) 3109 + for (i = 0; i < ret; i++, gfn++, start++) { 3134 3110 mmu_set_spte(vcpu, start, access, 0, sp->role.level, gfn, 3135 3111 page_to_pfn(pages[i]), true, true); 3112 + put_page(pages[i]); 3113 + } 3136 3114 3137 3115 return 0; 3138 3116 } ··· 3182 3156 __direct_pte_prefetch(vcpu, sp, sptep); 3183 3157 } 3184 3158 3185 - static int __direct_map(struct kvm_vcpu *vcpu, int write, int map_writable, 3186 - int level, gfn_t gfn, kvm_pfn_t pfn, bool prefault) 3159 + static int __direct_map(struct kvm_vcpu *vcpu, gpa_t gpa, int write, 3160 + int map_writable, int level, kvm_pfn_t pfn, 3161 + bool prefault) 3187 3162 { 3188 - struct kvm_shadow_walk_iterator iterator; 3163 + struct kvm_shadow_walk_iterator it; 3189 3164 struct kvm_mmu_page *sp; 3190 - int emulate = 0; 3191 - gfn_t pseudo_gfn; 3165 + int ret; 3166 + gfn_t gfn = gpa >> PAGE_SHIFT; 3167 + gfn_t base_gfn = gfn; 3192 3168 3193 3169 if (!VALID_PAGE(vcpu->arch.mmu->root_hpa)) 3194 - return 0; 3170 + return RET_PF_RETRY; 3195 3171 3196 - for_each_shadow_entry(vcpu, (u64)gfn << PAGE_SHIFT, iterator) { 3197 - if (iterator.level == level) { 3198 - emulate = mmu_set_spte(vcpu, iterator.sptep, ACC_ALL, 3199 - write, level, gfn, pfn, prefault, 3200 - map_writable); 3201 - direct_pte_prefetch(vcpu, iterator.sptep); 3202 - ++vcpu->stat.pf_fixed; 3172 + trace_kvm_mmu_spte_requested(gpa, level, pfn); 3173 + for_each_shadow_entry(vcpu, gpa, it) { 3174 + base_gfn = gfn & ~(KVM_PAGES_PER_HPAGE(it.level) - 1); 3175 + if (it.level == level) 3203 3176 break; 3204 - } 3205 3177 3206 - 
drop_large_spte(vcpu, iterator.sptep); 3207 - if (!is_shadow_present_pte(*iterator.sptep)) { 3208 - u64 base_addr = iterator.addr; 3178 + drop_large_spte(vcpu, it.sptep); 3179 + if (!is_shadow_present_pte(*it.sptep)) { 3180 + sp = kvm_mmu_get_page(vcpu, base_gfn, it.addr, 3181 + it.level - 1, true, ACC_ALL); 3209 3182 3210 - base_addr &= PT64_LVL_ADDR_MASK(iterator.level); 3211 - pseudo_gfn = base_addr >> PAGE_SHIFT; 3212 - sp = kvm_mmu_get_page(vcpu, pseudo_gfn, iterator.addr, 3213 - iterator.level - 1, 1, ACC_ALL); 3214 - 3215 - link_shadow_page(vcpu, iterator.sptep, sp); 3183 + link_shadow_page(vcpu, it.sptep, sp); 3216 3184 } 3217 3185 } 3218 - return emulate; 3186 + 3187 + ret = mmu_set_spte(vcpu, it.sptep, ACC_ALL, 3188 + write, level, base_gfn, pfn, prefault, 3189 + map_writable); 3190 + direct_pte_prefetch(vcpu, it.sptep); 3191 + ++vcpu->stat.pf_fixed; 3192 + return ret; 3219 3193 } 3220 3194 3221 3195 static void kvm_send_hwpoison_signal(unsigned long address, struct task_struct *tsk) ··· 3242 3216 } 3243 3217 3244 3218 static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu, 3245 - gfn_t *gfnp, kvm_pfn_t *pfnp, 3219 + gfn_t gfn, kvm_pfn_t *pfnp, 3246 3220 int *levelp) 3247 3221 { 3248 3222 kvm_pfn_t pfn = *pfnp; 3249 - gfn_t gfn = *gfnp; 3250 3223 int level = *levelp; 3251 3224 3252 3225 /* ··· 3272 3247 mask = KVM_PAGES_PER_HPAGE(level) - 1; 3273 3248 VM_BUG_ON((gfn & mask) != (pfn & mask)); 3274 3249 if (pfn & mask) { 3275 - gfn &= ~mask; 3276 - *gfnp = gfn; 3277 3250 kvm_release_pfn_clean(pfn); 3278 3251 pfn &= ~mask; 3279 3252 kvm_get_pfn(pfn); ··· 3528 3505 if (handle_abnormal_pfn(vcpu, v, gfn, pfn, ACC_ALL, &r)) 3529 3506 return r; 3530 3507 3508 + r = RET_PF_RETRY; 3531 3509 spin_lock(&vcpu->kvm->mmu_lock); 3532 3510 if (mmu_notifier_retry(vcpu->kvm, mmu_seq)) 3533 3511 goto out_unlock; 3534 3512 if (make_mmu_pages_available(vcpu) < 0) 3535 3513 goto out_unlock; 3536 3514 if (likely(!force_pt_level)) 3537 - transparent_hugepage_adjust(vcpu, 
&gfn, &pfn, &level); 3538 - r = __direct_map(vcpu, write, map_writable, level, gfn, pfn, prefault); 3539 - spin_unlock(&vcpu->kvm->mmu_lock); 3540 - 3541 - return r; 3542 - 3515 + transparent_hugepage_adjust(vcpu, gfn, &pfn, &level); 3516 + r = __direct_map(vcpu, v, write, map_writable, level, pfn, prefault); 3543 3517 out_unlock: 3544 3518 spin_unlock(&vcpu->kvm->mmu_lock); 3545 3519 kvm_release_pfn_clean(pfn); 3546 - return RET_PF_RETRY; 3520 + return r; 3547 3521 } 3548 3522 3549 3523 static void mmu_free_root_page(struct kvm *kvm, hpa_t *root_hpa, ··· 4035 4015 return kvm_setup_async_pf(vcpu, gva, kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch); 4036 4016 } 4037 4017 4038 - bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu) 4039 - { 4040 - if (unlikely(!lapic_in_kernel(vcpu) || 4041 - kvm_event_needs_reinjection(vcpu) || 4042 - vcpu->arch.exception.pending)) 4043 - return false; 4044 - 4045 - if (!vcpu->arch.apf.delivery_as_pf_vmexit && is_guest_mode(vcpu)) 4046 - return false; 4047 - 4048 - return kvm_x86_ops->interrupt_allowed(vcpu); 4049 - } 4050 - 4051 4018 static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn, 4052 4019 gva_t gva, kvm_pfn_t *pfn, bool write, bool *writable) 4053 4020 { ··· 4154 4147 if (handle_abnormal_pfn(vcpu, 0, gfn, pfn, ACC_ALL, &r)) 4155 4148 return r; 4156 4149 4150 + r = RET_PF_RETRY; 4157 4151 spin_lock(&vcpu->kvm->mmu_lock); 4158 4152 if (mmu_notifier_retry(vcpu->kvm, mmu_seq)) 4159 4153 goto out_unlock; 4160 4154 if (make_mmu_pages_available(vcpu) < 0) 4161 4155 goto out_unlock; 4162 4156 if (likely(!force_pt_level)) 4163 - transparent_hugepage_adjust(vcpu, &gfn, &pfn, &level); 4164 - r = __direct_map(vcpu, write, map_writable, level, gfn, pfn, prefault); 4165 - spin_unlock(&vcpu->kvm->mmu_lock); 4166 - 4167 - return r; 4168 - 4157 + transparent_hugepage_adjust(vcpu, gfn, &pfn, &level); 4158 + r = __direct_map(vcpu, gpa, write, map_writable, level, pfn, prefault); 4169 4159 out_unlock: 4170 4160 
spin_unlock(&vcpu->kvm->mmu_lock); 4171 4161 kvm_release_pfn_clean(pfn); 4172 - return RET_PF_RETRY; 4162 + return r; 4173 4163 } 4174 4164 4175 4165 static void nonpaging_init_context(struct kvm_vcpu *vcpu, ··· 4498 4494 */ 4499 4495 shadow_zero_check = &context->shadow_zero_check; 4500 4496 __reset_rsvds_bits_mask(vcpu, shadow_zero_check, 4501 - boot_cpu_data.x86_phys_bits, 4497 + shadow_phys_bits, 4502 4498 context->shadow_root_level, uses_nx, 4503 4499 guest_cpuid_has(vcpu, X86_FEATURE_GBPAGES), 4504 4500 is_pse(vcpu), true); ··· 4535 4531 4536 4532 if (boot_cpu_is_amd()) 4537 4533 __reset_rsvds_bits_mask(vcpu, shadow_zero_check, 4538 - boot_cpu_data.x86_phys_bits, 4534 + shadow_phys_bits, 4539 4535 context->shadow_root_level, false, 4540 4536 boot_cpu_has(X86_FEATURE_GBPAGES), 4541 4537 true, true); 4542 4538 else 4543 4539 __reset_rsvds_bits_mask_ept(shadow_zero_check, 4544 - boot_cpu_data.x86_phys_bits, 4540 + shadow_phys_bits, 4545 4541 false); 4546 4542 4547 4543 if (!shadow_me_mask) ··· 4562 4558 struct kvm_mmu *context, bool execonly) 4563 4559 { 4564 4560 __reset_rsvds_bits_mask_ept(&context->shadow_zero_check, 4565 - boot_cpu_data.x86_phys_bits, execonly); 4561 + shadow_phys_bits, execonly); 4566 4562 } 4567 4563 4568 4564 #define BYTE_MASK(access) \ ··· 5939 5935 int nr_to_scan = sc->nr_to_scan; 5940 5936 unsigned long freed = 0; 5941 5937 5942 - spin_lock(&kvm_lock); 5938 + mutex_lock(&kvm_lock); 5943 5939 5944 5940 list_for_each_entry(kvm, &vm_list, vm_list) { 5945 5941 int idx; ··· 5981 5977 break; 5982 5978 } 5983 5979 5984 - spin_unlock(&kvm_lock); 5980 + mutex_unlock(&kvm_lock); 5985 5981 return freed; 5986 5982 } 5987 5983 ··· 6003 5999 kmem_cache_destroy(mmu_page_header_cache); 6004 6000 } 6005 6001 6002 + static void kvm_set_mmio_spte_mask(void) 6003 + { 6004 + u64 mask; 6005 + 6006 + /* 6007 + * Set the reserved bits and the present bit of an paging-structure 6008 + * entry to generate page fault with PFER.RSV = 1. 
6009 + */ 6010 + 6011 + /* 6012 + * Mask the uppermost physical address bit, which would be reserved as 6013 + * long as the supported physical address width is less than 52. 6014 + */ 6015 + mask = 1ull << 51; 6016 + 6017 + /* Set the present bit. */ 6018 + mask |= 1ull; 6019 + 6020 + /* 6021 + * If reserved bit is not supported, clear the present bit to disable 6022 + * mmio page fault. 6023 + */ 6024 + if (IS_ENABLED(CONFIG_X86_64) && shadow_phys_bits == 52) 6025 + mask &= ~1ull; 6026 + 6027 + kvm_mmu_set_mmio_spte_mask(mask, mask); 6028 + } 6029 + 6006 6030 int kvm_mmu_module_init(void) 6007 6031 { 6008 6032 int ret = -ENOMEM; ··· 6046 6014 BUILD_BUG_ON(sizeof(union kvm_mmu_role) != sizeof(u64)); 6047 6015 6048 6016 kvm_mmu_reset_all_pte_masks(); 6017 + 6018 + kvm_set_mmio_spte_mask(); 6049 6019 6050 6020 pte_list_desc_cache = kmem_cache_create("pte_list_desc", 6051 6021 sizeof(struct pte_list_desc),
+59
arch/x86/kvm/mmutrace.h
··· 301 301 __entry->kvm_gen == __entry->spte_gen 302 302 ) 303 303 ); 304 + 305 + TRACE_EVENT( 306 + kvm_mmu_set_spte, 307 + TP_PROTO(int level, gfn_t gfn, u64 *sptep), 308 + TP_ARGS(level, gfn, sptep), 309 + 310 + TP_STRUCT__entry( 311 + __field(u64, gfn) 312 + __field(u64, spte) 313 + __field(u64, sptep) 314 + __field(u8, level) 315 + /* These depend on page entry type, so compute them now. */ 316 + __field(bool, r) 317 + __field(bool, x) 318 + __field(u8, u) 319 + ), 320 + 321 + TP_fast_assign( 322 + __entry->gfn = gfn; 323 + __entry->spte = *sptep; 324 + __entry->sptep = virt_to_phys(sptep); 325 + __entry->level = level; 326 + __entry->r = shadow_present_mask || (__entry->spte & PT_PRESENT_MASK); 327 + __entry->x = is_executable_pte(__entry->spte); 328 + __entry->u = shadow_user_mask ? !!(__entry->spte & shadow_user_mask) : -1; 329 + ), 330 + 331 + TP_printk("gfn %llx spte %llx (%s%s%s%s) level %d at %llx", 332 + __entry->gfn, __entry->spte, 333 + __entry->r ? "r" : "-", 334 + __entry->spte & PT_WRITABLE_MASK ? "w" : "-", 335 + __entry->x ? "x" : "-", 336 + __entry->u == -1 ? "" : (__entry->u ? "u" : "-"), 337 + __entry->level, __entry->sptep 338 + ) 339 + ); 340 + 341 + TRACE_EVENT( 342 + kvm_mmu_spte_requested, 343 + TP_PROTO(gpa_t addr, int level, kvm_pfn_t pfn), 344 + TP_ARGS(addr, level, pfn), 345 + 346 + TP_STRUCT__entry( 347 + __field(u64, gfn) 348 + __field(u64, pfn) 349 + __field(u8, level) 350 + ), 351 + 352 + TP_fast_assign( 353 + __entry->gfn = addr >> PAGE_SHIFT; 354 + __entry->pfn = pfn | (__entry->gfn & (KVM_PAGES_PER_HPAGE(level) - 1)); 355 + __entry->level = level; 356 + ), 357 + 358 + TP_printk("gfn %llx pfn %llx level %d", 359 + __entry->gfn, __entry->pfn, __entry->level 360 + ) 361 + ); 362 + 304 363 #endif /* _TRACE_KVMMMU_H */ 305 364 306 365 #undef TRACE_INCLUDE_PATH
+20 -22
arch/x86/kvm/paging_tmpl.h
··· 540 540 mmu_set_spte(vcpu, spte, pte_access, 0, PT_PAGE_TABLE_LEVEL, gfn, pfn, 541 541 true, true); 542 542 543 + kvm_release_pfn_clean(pfn); 543 544 return true; 544 545 } 545 546 ··· 620 619 struct kvm_shadow_walk_iterator it; 621 620 unsigned direct_access, access = gw->pt_access; 622 621 int top_level, ret; 622 + gfn_t base_gfn; 623 623 624 624 direct_access = gw->pte_access; 625 625 ··· 665 663 link_shadow_page(vcpu, it.sptep, sp); 666 664 } 667 665 668 - for (; 669 - shadow_walk_okay(&it) && it.level > hlevel; 670 - shadow_walk_next(&it)) { 671 - gfn_t direct_gfn; 666 + base_gfn = gw->gfn; 672 667 668 + trace_kvm_mmu_spte_requested(addr, gw->level, pfn); 669 + 670 + for (; shadow_walk_okay(&it); shadow_walk_next(&it)) { 673 671 clear_sp_write_flooding_count(it.sptep); 672 + base_gfn = gw->gfn & ~(KVM_PAGES_PER_HPAGE(it.level) - 1); 673 + if (it.level == hlevel) 674 + break; 675 + 674 676 validate_direct_spte(vcpu, it.sptep, direct_access); 675 677 676 678 drop_large_spte(vcpu, it.sptep); 677 679 678 - if (is_shadow_present_pte(*it.sptep)) 679 - continue; 680 - 681 - direct_gfn = gw->gfn & ~(KVM_PAGES_PER_HPAGE(it.level) - 1); 682 - 683 - sp = kvm_mmu_get_page(vcpu, direct_gfn, addr, it.level-1, 684 - true, direct_access); 685 - link_shadow_page(vcpu, it.sptep, sp); 680 + if (!is_shadow_present_pte(*it.sptep)) { 681 + sp = kvm_mmu_get_page(vcpu, base_gfn, addr, 682 + it.level - 1, true, direct_access); 683 + link_shadow_page(vcpu, it.sptep, sp); 684 + } 686 685 } 687 686 688 - clear_sp_write_flooding_count(it.sptep); 689 687 ret = mmu_set_spte(vcpu, it.sptep, gw->pte_access, write_fault, 690 - it.level, gw->gfn, pfn, prefault, map_writable); 688 + it.level, base_gfn, pfn, prefault, map_writable); 691 689 FNAME(pte_prefetch)(vcpu, gw, it.sptep); 692 - 690 + ++vcpu->stat.pf_fixed; 693 691 return ret; 694 692 695 693 out_gpte_changed: 696 - kvm_release_pfn_clean(pfn); 697 694 return RET_PF_RETRY; 698 695 } 699 696 ··· 840 839 walker.pte_access &= 
~ACC_EXEC_MASK; 841 840 } 842 841 842 + r = RET_PF_RETRY; 843 843 spin_lock(&vcpu->kvm->mmu_lock); 844 844 if (mmu_notifier_retry(vcpu->kvm, mmu_seq)) 845 845 goto out_unlock; ··· 849 847 if (make_mmu_pages_available(vcpu) < 0) 850 848 goto out_unlock; 851 849 if (!force_pt_level) 852 - transparent_hugepage_adjust(vcpu, &walker.gfn, &pfn, &level); 850 + transparent_hugepage_adjust(vcpu, walker.gfn, &pfn, &level); 853 851 r = FNAME(fetch)(vcpu, addr, &walker, write_fault, 854 852 level, pfn, map_writable, prefault); 855 - ++vcpu->stat.pf_fixed; 856 853 kvm_mmu_audit(vcpu, AUDIT_POST_PAGE_FAULT); 857 - spin_unlock(&vcpu->kvm->mmu_lock); 858 - 859 - return r; 860 854 861 855 out_unlock: 862 856 spin_unlock(&vcpu->kvm->mmu_lock); 863 857 kvm_release_pfn_clean(pfn); 864 - return RET_PF_RETRY; 858 + return r; 865 859 } 866 860 867 861 static gpa_t FNAME(get_level1_sp_gpa)(struct kvm_mmu_page *sp)
+63
arch/x86/kvm/pmu.c
··· 19 19 #include "lapic.h" 20 20 #include "pmu.h" 21 21 22 + /* This keeps the total size of the filter under 4k. */ 23 + #define KVM_PMU_EVENT_FILTER_MAX_EVENTS 63 24 + 22 25 /* NOTE: 23 26 * - Each perf counter is defined as "struct kvm_pmc"; 24 27 * - There are two types of perf counters: general purpose (gp) and fixed. ··· 144 141 { 145 142 unsigned config, type = PERF_TYPE_RAW; 146 143 u8 event_select, unit_mask; 144 + struct kvm *kvm = pmc->vcpu->kvm; 145 + struct kvm_pmu_event_filter *filter; 146 + int i; 147 + bool allow_event = true; 147 148 148 149 if (eventsel & ARCH_PERFMON_EVENTSEL_PIN_CONTROL) 149 150 printk_once("kvm pmu: pin control bit is ignored\n"); ··· 157 150 pmc_stop_counter(pmc); 158 151 159 152 if (!(eventsel & ARCH_PERFMON_EVENTSEL_ENABLE) || !pmc_is_enabled(pmc)) 153 + return; 154 + 155 + filter = srcu_dereference(kvm->arch.pmu_event_filter, &kvm->srcu); 156 + if (filter) { 157 + for (i = 0; i < filter->nevents; i++) 158 + if (filter->events[i] == 159 + (eventsel & AMD64_RAW_EVENT_MASK_NB)) 160 + break; 161 + if (filter->action == KVM_PMU_EVENT_ALLOW && 162 + i == filter->nevents) 163 + allow_event = false; 164 + if (filter->action == KVM_PMU_EVENT_DENY && 165 + i < filter->nevents) 166 + allow_event = false; 167 + } 168 + if (!allow_event) 160 169 return; 161 170 162 171 event_select = eventsel & ARCH_PERFMON_EVENTSEL_EVENT; ··· 370 347 void kvm_pmu_destroy(struct kvm_vcpu *vcpu) 371 348 { 372 349 kvm_pmu_reset(vcpu); 350 + } 351 + 352 + int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp) 353 + { 354 + struct kvm_pmu_event_filter tmp, *filter; 355 + size_t size; 356 + int r; 357 + 358 + if (copy_from_user(&tmp, argp, sizeof(tmp))) 359 + return -EFAULT; 360 + 361 + if (tmp.action != KVM_PMU_EVENT_ALLOW && 362 + tmp.action != KVM_PMU_EVENT_DENY) 363 + return -EINVAL; 364 + 365 + if (tmp.nevents > KVM_PMU_EVENT_FILTER_MAX_EVENTS) 366 + return -E2BIG; 367 + 368 + size = struct_size(filter, events, tmp.nevents); 369 + 
filter = kmalloc(size, GFP_KERNEL_ACCOUNT); 370 + if (!filter) 371 + return -ENOMEM; 372 + 373 + r = -EFAULT; 374 + if (copy_from_user(filter, argp, size)) 375 + goto cleanup; 376 + 377 + /* Ensure nevents can't be changed between the user copies. */ 378 + *filter = tmp; 379 + 380 + mutex_lock(&kvm->lock); 381 + rcu_swap_protected(kvm->arch.pmu_event_filter, filter, 382 + mutex_is_locked(&kvm->lock)); 383 + mutex_unlock(&kvm->lock); 384 + 385 + synchronize_srcu_expedited(&kvm->srcu); 386 + r = 0; 387 + cleanup: 388 + kfree(filter); 389 + return r; 373 390 }
+1
arch/x86/kvm/pmu.h
··· 118 118 void kvm_pmu_reset(struct kvm_vcpu *vcpu); 119 119 void kvm_pmu_init(struct kvm_vcpu *vcpu); 120 120 void kvm_pmu_destroy(struct kvm_vcpu *vcpu); 121 + int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp); 121 122 122 123 bool is_vmware_backdoor_pmc(u32 pmc_idx); 123 124
+31 -20
arch/x86/kvm/svm.c
··· 364 364 module_param(avic, int, S_IRUGO); 365 365 #endif 366 366 367 + /* enable/disable Next RIP Save */ 368 + static int nrips = true; 369 + module_param(nrips, int, 0444); 370 + 367 371 /* enable/disable Virtual VMLOAD VMSAVE */ 368 372 static int vls = true; 369 373 module_param(vls, int, 0444); ··· 774 770 { 775 771 struct vcpu_svm *svm = to_svm(vcpu); 776 772 777 - if (svm->vmcb->control.next_rip != 0) { 773 + if (nrips && svm->vmcb->control.next_rip != 0) { 778 774 WARN_ON_ONCE(!static_cpu_has(X86_FEATURE_NRIPS)); 779 775 svm->next_rip = svm->vmcb->control.next_rip; 780 776 } ··· 811 807 812 808 kvm_deliver_exception_payload(&svm->vcpu); 813 809 814 - if (nr == BP_VECTOR && !static_cpu_has(X86_FEATURE_NRIPS)) { 810 + if (nr == BP_VECTOR && !nrips) { 815 811 unsigned long rip, old_rip = kvm_rip_read(&svm->vcpu); 816 812 817 813 /* ··· 1367 1363 kvm_enable_tdp(); 1368 1364 } else 1369 1365 kvm_disable_tdp(); 1366 + 1367 + if (nrips) { 1368 + if (!boot_cpu_has(X86_FEATURE_NRIPS)) 1369 + nrips = false; 1370 + } 1370 1371 1371 1372 if (avic) { 1372 1373 if (!npt_enabled || ··· 3299 3290 vmcb->control.exit_int_info_err, 3300 3291 KVM_ISA_SVM); 3301 3292 3302 - rc = kvm_vcpu_map(&svm->vcpu, gfn_to_gpa(svm->nested.vmcb), &map); 3293 + rc = kvm_vcpu_map(&svm->vcpu, gpa_to_gfn(svm->nested.vmcb), &map); 3303 3294 if (rc) { 3304 3295 if (rc == -EINVAL) 3305 3296 kvm_inject_gp(&svm->vcpu, 0); ··· 3589 3580 3590 3581 vmcb_gpa = svm->vmcb->save.rax; 3591 3582 3592 - rc = kvm_vcpu_map(&svm->vcpu, gfn_to_gpa(vmcb_gpa), &map); 3583 + rc = kvm_vcpu_map(&svm->vcpu, gpa_to_gfn(vmcb_gpa), &map); 3593 3584 if (rc) { 3594 3585 if (rc == -EINVAL) 3595 3586 kvm_inject_gp(&svm->vcpu, 0); ··· 3944 3935 { 3945 3936 int err; 3946 3937 3947 - if (!static_cpu_has(X86_FEATURE_NRIPS)) 3938 + if (!nrips) 3948 3939 return emulate_on_interception(svm); 3949 3940 3950 3941 err = kvm_rdpmc(&svm->vcpu); ··· 5169 5160 kvm_lapic_set_irr(vec, vcpu->arch.apic); 5170 5161 smp_mb__after_atomic(); 
5171 5162 5172 - if (avic_vcpu_is_running(vcpu)) 5173 - wrmsrl(SVM_AVIC_DOORBELL, 5174 - kvm_cpu_get_apicid(vcpu->cpu)); 5175 - else 5163 + if (avic_vcpu_is_running(vcpu)) { 5164 + int cpuid = vcpu->cpu; 5165 + 5166 + if (cpuid != get_cpu()) 5167 + wrmsrl(SVM_AVIC_DOORBELL, kvm_cpu_get_apicid(cpuid)); 5168 + put_cpu(); 5169 + } else 5176 5170 kvm_vcpu_wake_up(vcpu); 5177 5171 } 5178 5172 ··· 5652 5640 clgi(); 5653 5641 kvm_load_guest_xcr0(vcpu); 5654 5642 5643 + if (lapic_in_kernel(vcpu) && 5644 + vcpu->arch.apic->lapic_timer.timer_advance_ns) 5645 + kvm_wait_lapic_expire(vcpu); 5646 + 5655 5647 /* 5656 5648 * If this vCPU has touched SPEC_CTRL, restore the guest's value if 5657 5649 * it's non-zero. Since vmentry is serialising on affected CPUs, there ··· 5877 5861 hypercall[2] = 0xd9; 5878 5862 } 5879 5863 5880 - static void svm_check_processor_compat(void *rtn) 5864 + static int __init svm_check_processor_compat(void) 5881 5865 { 5882 - *(int *)rtn = 0; 5866 + return 0; 5883 5867 } 5884 5868 5885 5869 static bool svm_cpu_has_accelerated_tpr(void) ··· 5891 5875 { 5892 5876 switch (index) { 5893 5877 case MSR_IA32_MCG_EXT_CTL: 5878 + case MSR_IA32_VMX_BASIC ... MSR_IA32_VMX_VMFUNC: 5894 5879 return false; 5895 5880 default: 5896 5881 break; ··· 6179 6162 return ret; 6180 6163 } 6181 6164 6182 - static void svm_handle_external_intr(struct kvm_vcpu *vcpu) 6165 + static void svm_handle_exit_irqoff(struct kvm_vcpu *vcpu) 6183 6166 { 6184 - local_irq_enable(); 6185 - /* 6186 - * We must have an instruction with interrupts enabled, so 6187 - * the timer interrupt isn't delayed by the interrupt shadow. 
6188 - */ 6189 - asm("nop"); 6190 - local_irq_disable(); 6167 + 6191 6168 } 6192 6169 6193 6170 static void svm_sched_in(struct kvm_vcpu *vcpu, int cpu) ··· 7267 7256 .set_tdp_cr3 = set_tdp_cr3, 7268 7257 7269 7258 .check_intercept = svm_check_intercept, 7270 - .handle_external_intr = svm_handle_external_intr, 7259 + .handle_exit_irqoff = svm_handle_exit_irqoff, 7271 7260 7272 7261 .request_immediate_exit = __kvm_request_immediate_exit, 7273 7262
+1 -1
arch/x86/kvm/trace.h
··· 1365 1365 __entry->vcpu_id = vcpu_id; 1366 1366 __entry->hv_timer_in_use = hv_timer_in_use; 1367 1367 ), 1368 - TP_printk("vcpu_id %x hv_timer %x\n", 1368 + TP_printk("vcpu_id %x hv_timer %x", 1369 1369 __entry->vcpu_id, 1370 1370 __entry->hv_timer_in_use) 1371 1371 );
+18
arch/x86/kvm/vmx/evmcs.c
··· 3 3 #include <linux/errno.h> 4 4 #include <linux/smp.h> 5 5 6 + #include "../hyperv.h" 6 7 #include "evmcs.h" 7 8 #include "vmcs.h" 8 9 #include "vmx.h" ··· 313 312 314 313 } 315 314 #endif 315 + 316 + bool nested_enlightened_vmentry(struct kvm_vcpu *vcpu, u64 *evmcs_gpa) 317 + { 318 + struct hv_vp_assist_page assist_page; 319 + 320 + *evmcs_gpa = -1ull; 321 + 322 + if (unlikely(!kvm_hv_get_assist_page(vcpu, &assist_page))) 323 + return false; 324 + 325 + if (unlikely(!assist_page.enlighten_vmentry)) 326 + return false; 327 + 328 + *evmcs_gpa = assist_page.current_nested_vmcs; 329 + 330 + return true; 331 + } 316 332 317 333 uint16_t nested_get_evmcs_version(struct kvm_vcpu *vcpu) 318 334 {
+1
arch/x86/kvm/vmx/evmcs.h
··· 195 195 static inline void evmcs_touch_msr_bitmap(void) {} 196 196 #endif /* IS_ENABLED(CONFIG_HYPERV) */ 197 197 198 + bool nested_enlightened_vmentry(struct kvm_vcpu *vcpu, u64 *evmcs_gpa); 198 199 uint16_t nested_get_evmcs_version(struct kvm_vcpu *vcpu); 199 200 int nested_enable_evmcs(struct kvm_vcpu *vcpu, 200 201 uint16_t *vmcs_version);
+475 -288
arch/x86/kvm/vmx/nested.c
··· 41 41 #define vmx_vmread_bitmap (vmx_bitmap[VMX_VMREAD_BITMAP]) 42 42 #define vmx_vmwrite_bitmap (vmx_bitmap[VMX_VMWRITE_BITMAP]) 43 43 44 - static u16 shadow_read_only_fields[] = { 45 - #define SHADOW_FIELD_RO(x) x, 44 + struct shadow_vmcs_field { 45 + u16 encoding; 46 + u16 offset; 47 + }; 48 + static struct shadow_vmcs_field shadow_read_only_fields[] = { 49 + #define SHADOW_FIELD_RO(x, y) { x, offsetof(struct vmcs12, y) }, 46 50 #include "vmcs_shadow_fields.h" 47 51 }; 48 52 static int max_shadow_read_only_fields = 49 53 ARRAY_SIZE(shadow_read_only_fields); 50 54 51 - static u16 shadow_read_write_fields[] = { 52 - #define SHADOW_FIELD_RW(x) x, 55 + static struct shadow_vmcs_field shadow_read_write_fields[] = { 56 + #define SHADOW_FIELD_RW(x, y) { x, offsetof(struct vmcs12, y) }, 53 57 #include "vmcs_shadow_fields.h" 54 58 }; 55 59 static int max_shadow_read_write_fields = ··· 67 63 memset(vmx_vmwrite_bitmap, 0xff, PAGE_SIZE); 68 64 69 65 for (i = j = 0; i < max_shadow_read_only_fields; i++) { 70 - u16 field = shadow_read_only_fields[i]; 66 + struct shadow_vmcs_field entry = shadow_read_only_fields[i]; 67 + u16 field = entry.encoding; 71 68 72 69 if (vmcs_field_width(field) == VMCS_FIELD_WIDTH_U64 && 73 70 (i + 1 == max_shadow_read_only_fields || 74 - shadow_read_only_fields[i + 1] != field + 1)) 71 + shadow_read_only_fields[i + 1].encoding != field + 1)) 75 72 pr_err("Missing field from shadow_read_only_field %x\n", 76 73 field + 1); 77 74 78 75 clear_bit(field, vmx_vmread_bitmap); 79 - #ifdef CONFIG_X86_64 80 76 if (field & 1) 77 + #ifdef CONFIG_X86_64 81 78 continue; 79 + #else 80 + entry.offset += sizeof(u32); 82 81 #endif 83 - if (j < i) 84 - shadow_read_only_fields[j] = field; 85 - j++; 82 + shadow_read_only_fields[j++] = entry; 86 83 } 87 84 max_shadow_read_only_fields = j; 88 85 89 86 for (i = j = 0; i < max_shadow_read_write_fields; i++) { 90 - u16 field = shadow_read_write_fields[i]; 87 + struct shadow_vmcs_field entry = shadow_read_write_fields[i]; 
88 + u16 field = entry.encoding; 91 89 92 90 if (vmcs_field_width(field) == VMCS_FIELD_WIDTH_U64 && 93 91 (i + 1 == max_shadow_read_write_fields || 94 - shadow_read_write_fields[i + 1] != field + 1)) 92 + shadow_read_write_fields[i + 1].encoding != field + 1)) 95 93 pr_err("Missing field from shadow_read_write_field %x\n", 96 94 field + 1); 95 + 96 + WARN_ONCE(field >= GUEST_ES_AR_BYTES && 97 + field <= GUEST_TR_AR_BYTES, 98 + "Update vmcs12_write_any() to drop reserved bits from AR_BYTES"); 97 99 98 100 /* 99 101 * PML and the preemption timer can be emulated, but the ··· 125 115 126 116 clear_bit(field, vmx_vmwrite_bitmap); 127 117 clear_bit(field, vmx_vmread_bitmap); 128 - #ifdef CONFIG_X86_64 129 118 if (field & 1) 119 + #ifdef CONFIG_X86_64 130 120 continue; 121 + #else 122 + entry.offset += sizeof(u32); 131 123 #endif 132 - if (j < i) 133 - shadow_read_write_fields[j] = field; 134 - j++; 124 + shadow_read_write_fields[j++] = entry; 135 125 } 136 126 max_shadow_read_write_fields = j; 137 127 } ··· 192 182 193 183 static void vmx_disable_shadow_vmcs(struct vcpu_vmx *vmx) 194 184 { 195 - vmcs_clear_bits(SECONDARY_VM_EXEC_CONTROL, SECONDARY_EXEC_SHADOW_VMCS); 185 + secondary_exec_controls_clearbit(vmx, SECONDARY_EXEC_SHADOW_VMCS); 196 186 vmcs_write64(VMCS_LINK_POINTER, -1ull); 197 187 } 198 188 ··· 248 238 free_loaded_vmcs(&vmx->nested.vmcs02); 249 239 } 250 240 241 + static void vmx_sync_vmcs_host_state(struct vcpu_vmx *vmx, 242 + struct loaded_vmcs *prev) 243 + { 244 + struct vmcs_host_state *dest, *src; 245 + 246 + if (unlikely(!vmx->guest_state_loaded)) 247 + return; 248 + 249 + src = &prev->host_state; 250 + dest = &vmx->loaded_vmcs->host_state; 251 + 252 + vmx_set_host_fs_gs(dest, src->fs_sel, src->gs_sel, src->fs_base, src->gs_base); 253 + dest->ldt_sel = src->ldt_sel; 254 + #ifdef CONFIG_X86_64 255 + dest->ds_sel = src->ds_sel; 256 + dest->es_sel = src->es_sel; 257 + #endif 258 + } 259 + 251 260 static void vmx_switch_vmcs(struct kvm_vcpu *vcpu, struct 
loaded_vmcs *vmcs) 252 261 { 253 262 struct vcpu_vmx *vmx = to_vmx(vcpu); 263 + struct loaded_vmcs *prev; 254 264 int cpu; 255 265 256 266 if (vmx->loaded_vmcs == vmcs) 257 267 return; 258 268 259 269 cpu = get_cpu(); 260 - vmx_vcpu_put(vcpu); 270 + prev = vmx->loaded_vmcs; 261 271 vmx->loaded_vmcs = vmcs; 262 - vmx_vcpu_load(vcpu, cpu); 272 + vmx_vcpu_load_vmcs(vcpu, cpu); 273 + vmx_sync_vmcs_host_state(vmx, prev); 263 274 put_cpu(); 264 275 265 - vm_entry_controls_reset_shadow(vmx); 266 - vm_exit_controls_reset_shadow(vmx); 267 276 vmx_segment_cache_clear(vmx); 268 277 } 269 278 ··· 959 930 * If PAE paging and EPT are both on, CR3 is not used by the CPU and 960 931 * must not be dereferenced. 961 932 */ 962 - if (!is_long_mode(vcpu) && is_pae(vcpu) && is_paging(vcpu) && 963 - !nested_ept) { 933 + if (is_pae_paging(vcpu) && !nested_ept) { 964 934 if (!load_pdptrs(vcpu, vcpu->arch.walk_mmu, cr3)) { 965 935 *entry_failure_code = ENTRY_FAIL_PDPTE; 966 936 return -EINVAL; ··· 1133 1105 vmx->nested.msrs.misc_low = data; 1134 1106 vmx->nested.msrs.misc_high = data >> 32; 1135 1107 1136 - /* 1137 - * If L1 has read-only VM-exit information fields, use the 1138 - * less permissive vmx_vmwrite_bitmap to specify write 1139 - * permissions for the shadow VMCS. 1140 - */ 1141 - if (enable_shadow_vmcs && !nested_cpu_has_vmwrite_any_field(&vmx->vcpu)) 1142 - vmcs_write64(VMWRITE_BITMAP, __pa(vmx_vmwrite_bitmap)); 1143 - 1144 1108 return 0; 1145 1109 } 1146 1110 ··· 1234 1214 case MSR_IA32_VMX_VMCS_ENUM: 1235 1215 vmx->nested.msrs.vmcs_enum = data; 1236 1216 return 0; 1217 + case MSR_IA32_VMX_VMFUNC: 1218 + if (data & ~vmx->nested.msrs.vmfunc_controls) 1219 + return -EINVAL; 1220 + vmx->nested.msrs.vmfunc_controls = data; 1221 + return 0; 1237 1222 default: 1238 1223 /* 1239 1224 * The rest of the VMX capability MSRs do not support restore. 
··· 1326 1301 } 1327 1302 1328 1303 /* 1329 - * Copy the writable VMCS shadow fields back to the VMCS12, in case 1330 - * they have been modified by the L1 guest. Note that the "read-only" 1331 - * VM-exit information fields are actually writable if the vCPU is 1332 - * configured to support "VMWRITE to any supported field in the VMCS." 1304 + * Copy the writable VMCS shadow fields back to the VMCS12, in case they have 1305 + * been modified by the L1 guest. Note, "writable" in this context means 1306 + * "writable by the guest", i.e. tagged SHADOW_FIELD_RW; the set of 1307 + * fields tagged SHADOW_FIELD_RO may or may not align with the "read-only" 1308 + * VM-exit information fields (which are actually writable if the vCPU is 1309 + * configured to support "VMWRITE to any supported field in the VMCS"). 1333 1310 */ 1334 1311 static void copy_shadow_to_vmcs12(struct vcpu_vmx *vmx) 1335 1312 { 1336 - const u16 *fields[] = { 1337 - shadow_read_write_fields, 1338 - shadow_read_only_fields 1339 - }; 1340 - const int max_fields[] = { 1341 - max_shadow_read_write_fields, 1342 - max_shadow_read_only_fields 1343 - }; 1344 - int i, q; 1345 - unsigned long field; 1346 - u64 field_value; 1347 1313 struct vmcs *shadow_vmcs = vmx->vmcs01.shadow_vmcs; 1314 + struct vmcs12 *vmcs12 = get_vmcs12(&vmx->vcpu); 1315 + struct shadow_vmcs_field field; 1316 + unsigned long val; 1317 + int i; 1348 1318 1349 1319 preempt_disable(); 1350 1320 1351 1321 vmcs_load(shadow_vmcs); 1352 1322 1353 - for (q = 0; q < ARRAY_SIZE(fields); q++) { 1354 - for (i = 0; i < max_fields[q]; i++) { 1355 - field = fields[q][i]; 1356 - field_value = __vmcs_readl(field); 1357 - vmcs12_write_any(get_vmcs12(&vmx->vcpu), field, field_value); 1358 - } 1359 - /* 1360 - * Skip the VM-exit information fields if they are read-only. 
1361 - */ 1362 - if (!nested_cpu_has_vmwrite_any_field(&vmx->vcpu)) 1363 - break; 1323 + for (i = 0; i < max_shadow_read_write_fields; i++) { 1324 + field = shadow_read_write_fields[i]; 1325 + val = __vmcs_readl(field.encoding); 1326 + vmcs12_write_any(vmcs12, field.encoding, field.offset, val); 1364 1327 } 1365 1328 1366 1329 vmcs_clear(shadow_vmcs); ··· 1359 1346 1360 1347 static void copy_vmcs12_to_shadow(struct vcpu_vmx *vmx) 1361 1348 { 1362 - const u16 *fields[] = { 1349 + const struct shadow_vmcs_field *fields[] = { 1363 1350 shadow_read_write_fields, 1364 1351 shadow_read_only_fields 1365 1352 }; ··· 1367 1354 max_shadow_read_write_fields, 1368 1355 max_shadow_read_only_fields 1369 1356 }; 1370 - int i, q; 1371 - unsigned long field; 1372 - u64 field_value = 0; 1373 1357 struct vmcs *shadow_vmcs = vmx->vmcs01.shadow_vmcs; 1358 + struct vmcs12 *vmcs12 = get_vmcs12(&vmx->vcpu); 1359 + struct shadow_vmcs_field field; 1360 + unsigned long val; 1361 + int i, q; 1374 1362 1375 1363 vmcs_load(shadow_vmcs); 1376 1364 1377 1365 for (q = 0; q < ARRAY_SIZE(fields); q++) { 1378 1366 for (i = 0; i < max_fields[q]; i++) { 1379 1367 field = fields[q][i]; 1380 - vmcs12_read_any(get_vmcs12(&vmx->vcpu), field, &field_value); 1381 - __vmcs_writel(field, field_value); 1368 + val = vmcs12_read_any(vmcs12, field.encoding, 1369 + field.offset); 1370 + __vmcs_writel(field.encoding, val); 1382 1371 } 1383 1372 } 1384 1373 ··· 1638 1623 * evmcs->host_gdtr_base = vmcs12->host_gdtr_base; 1639 1624 * evmcs->host_idtr_base = vmcs12->host_idtr_base; 1640 1625 * evmcs->host_rsp = vmcs12->host_rsp; 1641 - * sync_vmcs12() doesn't read these: 1626 + * sync_vmcs02_to_vmcs12() doesn't read these: 1642 1627 * evmcs->io_bitmap_a = vmcs12->io_bitmap_a; 1643 1628 * evmcs->io_bitmap_b = vmcs12->io_bitmap_b; 1644 1629 * evmcs->msr_bitmap = vmcs12->msr_bitmap; ··· 1783 1768 bool from_launch) 1784 1769 { 1785 1770 struct vcpu_vmx *vmx = to_vmx(vcpu); 1786 - struct hv_vp_assist_page assist_page; 1771 + 
bool evmcs_gpa_changed = false; 1772 + u64 evmcs_gpa; 1787 1773 1788 1774 if (likely(!vmx->nested.enlightened_vmcs_enabled)) 1789 1775 return 1; 1790 1776 1791 - if (unlikely(!kvm_hv_get_assist_page(vcpu, &assist_page))) 1777 + if (!nested_enlightened_vmentry(vcpu, &evmcs_gpa)) 1792 1778 return 1; 1793 1779 1794 - if (unlikely(!assist_page.enlighten_vmentry)) 1795 - return 1; 1796 - 1797 - if (unlikely(assist_page.current_nested_vmcs != 1798 - vmx->nested.hv_evmcs_vmptr)) { 1799 - 1780 + if (unlikely(evmcs_gpa != vmx->nested.hv_evmcs_vmptr)) { 1800 1781 if (!vmx->nested.hv_evmcs) 1801 1782 vmx->nested.current_vmptr = -1ull; 1802 1783 1803 1784 nested_release_evmcs(vcpu); 1804 1785 1805 - if (kvm_vcpu_map(vcpu, gpa_to_gfn(assist_page.current_nested_vmcs), 1786 + if (kvm_vcpu_map(vcpu, gpa_to_gfn(evmcs_gpa), 1806 1787 &vmx->nested.hv_evmcs_map)) 1807 1788 return 0; 1808 1789 ··· 1833 1822 } 1834 1823 1835 1824 vmx->nested.dirty_vmcs12 = true; 1836 - /* 1837 - * As we keep L2 state for one guest only 'hv_clean_fields' mask 1838 - * can't be used when we switch between them. Reset it here for 1839 - * simplicity. 1840 - */ 1841 - vmx->nested.hv_evmcs->hv_clean_fields &= 1842 - ~HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL; 1843 - vmx->nested.hv_evmcs_vmptr = assist_page.current_nested_vmcs; 1825 + vmx->nested.hv_evmcs_vmptr = evmcs_gpa; 1844 1826 1827 + evmcs_gpa_changed = true; 1845 1828 /* 1846 1829 * Unlike normal vmcs12, enlightened vmcs12 is not fully 1847 1830 * reloaded from guest's memory (read only fields, fields not ··· 1849 1844 } 1850 1845 1851 1846 } 1847 + 1848 + /* 1849 + * Clean fields data can't de used on VMLAUNCH and when we switch 1850 + * between different L2 guests as KVM keeps a single VMCS12 per L1. 
1851 + */ 1852 + if (from_launch || evmcs_gpa_changed) 1853 + vmx->nested.hv_evmcs->hv_clean_fields &= 1854 + ~HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL; 1855 + 1852 1856 return 1; 1853 1857 } 1854 1858 1855 - void nested_sync_from_vmcs12(struct kvm_vcpu *vcpu) 1859 + void nested_sync_vmcs12_to_shadow(struct kvm_vcpu *vcpu) 1856 1860 { 1857 1861 struct vcpu_vmx *vmx = to_vmx(vcpu); 1858 1862 ··· 1882 1868 copy_vmcs12_to_shadow(vmx); 1883 1869 } 1884 1870 1885 - vmx->nested.need_vmcs12_sync = false; 1871 + vmx->nested.need_vmcs12_to_shadow_sync = false; 1886 1872 } 1887 1873 1888 1874 static enum hrtimer_restart vmx_preemption_timer_fn(struct hrtimer *timer) ··· 1962 1948 if (cpu_has_vmx_msr_bitmap()) 1963 1949 vmcs_write64(MSR_BITMAP, __pa(vmx->nested.vmcs02.msr_bitmap)); 1964 1950 1965 - if (enable_pml) 1951 + /* 1952 + * The PML address never changes, so it is constant in vmcs02. 1953 + * Conceptually we want to copy the PML index from vmcs01 here, 1954 + * and then back to vmcs01 on nested vmexit. But since we flush 1955 + * the log and reset GUEST_PML_INDEX on each vmexit, the PML 1956 + * index is also effectively constant in vmcs02. 1957 + */ 1958 + if (enable_pml) { 1966 1959 vmcs_write64(PML_ADDRESS, page_to_phys(vmx->pml_pg)); 1960 + vmcs_write16(GUEST_PML_INDEX, PML_ENTITY_NUM - 1); 1961 + } 1962 + 1963 + if (cpu_has_vmx_encls_vmexit()) 1964 + vmcs_write64(ENCLS_EXITING_BITMAP, -1ull); 1967 1965 1968 1966 /* 1969 1967 * Set the MSR load/store lists to match L0's settings. 
Only the ··· 1989 1963 vmx_set_constant_host_state(vmx); 1990 1964 } 1991 1965 1992 - static void prepare_vmcs02_early_full(struct vcpu_vmx *vmx, 1966 + static void prepare_vmcs02_early_rare(struct vcpu_vmx *vmx, 1993 1967 struct vmcs12 *vmcs12) 1994 1968 { 1995 1969 prepare_vmcs02_constant_state(vmx); ··· 2010 1984 u64 guest_efer = nested_vmx_calc_efer(vmx, vmcs12); 2011 1985 2012 1986 if (vmx->nested.dirty_vmcs12 || vmx->nested.hv_evmcs) 2013 - prepare_vmcs02_early_full(vmx, vmcs12); 1987 + prepare_vmcs02_early_rare(vmx, vmcs12); 2014 1988 2015 1989 /* 2016 1990 * PIN CONTROLS 2017 1991 */ 2018 - exec_control = vmcs12->pin_based_vm_exec_control; 2019 - 2020 - /* Preemption timer setting is computed directly in vmx_vcpu_run. */ 2021 - exec_control |= vmcs_config.pin_based_exec_ctrl; 2022 - exec_control &= ~PIN_BASED_VMX_PREEMPTION_TIMER; 2023 - vmx->loaded_vmcs->hv_timer_armed = false; 1992 + exec_control = vmx_pin_based_exec_ctrl(vmx); 1993 + exec_control |= (vmcs12->pin_based_vm_exec_control & 1994 + ~PIN_BASED_VMX_PREEMPTION_TIMER); 2024 1995 2025 1996 /* Posted interrupts setting is only taken from vmcs12. */ 2026 1997 if (nested_cpu_has_posted_intr(vmcs12)) { ··· 2026 2003 } else { 2027 2004 exec_control &= ~PIN_BASED_POSTED_INTR; 2028 2005 } 2029 - vmcs_write32(PIN_BASED_VM_EXEC_CONTROL, exec_control); 2006 + pin_controls_set(vmx, exec_control); 2030 2007 2031 2008 /* 2032 2009 * EXEC CONTROLS ··· 2037 2014 exec_control &= ~CPU_BASED_TPR_SHADOW; 2038 2015 exec_control |= vmcs12->cpu_based_vm_exec_control; 2039 2016 2040 - /* 2041 - * Write an illegal value to VIRTUAL_APIC_PAGE_ADDR. Later, if 2042 - * nested_get_vmcs12_pages can't fix it up, the illegal value 2043 - * will result in a VM entry failure. 
2044 - */ 2045 - if (exec_control & CPU_BASED_TPR_SHADOW) { 2046 - vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, -1ull); 2017 + if (exec_control & CPU_BASED_TPR_SHADOW) 2047 2018 vmcs_write32(TPR_THRESHOLD, vmcs12->tpr_threshold); 2048 - } else { 2049 2019 #ifdef CONFIG_X86_64 2020 + else 2050 2021 exec_control |= CPU_BASED_CR8_LOAD_EXITING | 2051 2022 CPU_BASED_CR8_STORE_EXITING; 2052 2023 #endif 2053 - } 2054 2024 2055 2025 /* 2056 2026 * A vmexit (to either L1 hypervisor or L0 userspace) is always needed 2057 2027 * for I/O port accesses. 2058 2028 */ 2059 - exec_control &= ~CPU_BASED_USE_IO_BITMAPS; 2060 2029 exec_control |= CPU_BASED_UNCOND_IO_EXITING; 2061 - vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, exec_control); 2030 + exec_control &= ~CPU_BASED_USE_IO_BITMAPS; 2031 + 2032 + /* 2033 + * This bit will be computed in nested_get_vmcs12_pages, because 2034 + * we do not have access to L1's MSR bitmap yet. For now, keep 2035 + * the same bit as before, hoping to avoid multiple VMWRITEs that 2036 + * only set/clear this bit. 2037 + */ 2038 + exec_control &= ~CPU_BASED_USE_MSR_BITMAPS; 2039 + exec_control |= exec_controls_get(vmx) & CPU_BASED_USE_MSR_BITMAPS; 2040 + 2041 + exec_controls_set(vmx, exec_control); 2062 2042 2063 2043 /* 2064 2044 * SECONDARY EXEC CONTROLS ··· 2087 2061 /* VMCS shadowing for L2 is emulated for now */ 2088 2062 exec_control &= ~SECONDARY_EXEC_SHADOW_VMCS; 2089 2063 2064 + /* 2065 + * Preset *DT exiting when emulating UMIP, so that vmx_set_cr4() 2066 + * will not have to rewrite the controls just for this bit. 2067 + */ 2068 + if (!boot_cpu_has(X86_FEATURE_UMIP) && vmx_umip_emulated() && 2069 + (vmcs12->guest_cr4 & X86_CR4_UMIP)) 2070 + exec_control |= SECONDARY_EXEC_DESC; 2071 + 2090 2072 if (exec_control & SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY) 2091 2073 vmcs_write16(GUEST_INTR_STATUS, 2092 2074 vmcs12->guest_intr_status); 2093 2075 2094 - /* 2095 - * Write an illegal value to APIC_ACCESS_ADDR. 
Later, 2096 - * nested_get_vmcs12_pages will either fix it up or 2097 - * remove the VM execution control. 2098 - */ 2099 - if (exec_control & SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES) 2100 - vmcs_write64(APIC_ACCESS_ADDR, -1ull); 2101 - 2102 - if (exec_control & SECONDARY_EXEC_ENCLS_EXITING) 2103 - vmcs_write64(ENCLS_EXITING_BITMAP, -1ull); 2104 - 2105 - vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control); 2076 + secondary_exec_controls_set(vmx, exec_control); 2106 2077 } 2107 2078 2108 2079 /* ··· 2118 2095 if (guest_efer != host_efer) 2119 2096 exec_control |= VM_ENTRY_LOAD_IA32_EFER; 2120 2097 } 2121 - vm_entry_controls_init(vmx, exec_control); 2098 + vm_entry_controls_set(vmx, exec_control); 2122 2099 2123 2100 /* 2124 2101 * EXIT CONTROLS ··· 2130 2107 exec_control = vmx_vmexit_ctrl(); 2131 2108 if (cpu_has_load_ia32_efer() && guest_efer != host_efer) 2132 2109 exec_control |= VM_EXIT_LOAD_IA32_EFER; 2133 - vm_exit_controls_init(vmx, exec_control); 2134 - 2135 - /* 2136 - * Conceptually we want to copy the PML address and index from 2137 - * vmcs01 here, and then back to vmcs01 on nested vmexit. But, 2138 - * since we always flush the log on each vmexit and never change 2139 - * the PML address (once set), this happens to be equivalent to 2140 - * simply resetting the index in vmcs02. 
2141 - */ 2142 - if (enable_pml) 2143 - vmcs_write16(GUEST_PML_INDEX, PML_ENTITY_NUM - 1); 2110 + vm_exit_controls_set(vmx, exec_control); 2144 2111 2145 2112 /* 2146 2113 * Interrupt/Exception Fields ··· 2151 2138 } 2152 2139 } 2153 2140 2154 - static void prepare_vmcs02_full(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12) 2141 + static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12) 2155 2142 { 2156 2143 struct hv_enlightened_vmcs *hv_evmcs = vmx->nested.hv_evmcs; 2157 2144 ··· 2175 2162 vmcs_write32(GUEST_TR_LIMIT, vmcs12->guest_tr_limit); 2176 2163 vmcs_write32(GUEST_GDTR_LIMIT, vmcs12->guest_gdtr_limit); 2177 2164 vmcs_write32(GUEST_IDTR_LIMIT, vmcs12->guest_idtr_limit); 2165 + vmcs_write32(GUEST_CS_AR_BYTES, vmcs12->guest_cs_ar_bytes); 2166 + vmcs_write32(GUEST_SS_AR_BYTES, vmcs12->guest_ss_ar_bytes); 2178 2167 vmcs_write32(GUEST_ES_AR_BYTES, vmcs12->guest_es_ar_bytes); 2179 2168 vmcs_write32(GUEST_DS_AR_BYTES, vmcs12->guest_ds_ar_bytes); 2180 2169 vmcs_write32(GUEST_FS_AR_BYTES, vmcs12->guest_fs_ar_bytes); ··· 2213 2198 vmcs_write64(GUEST_PDPTR2, vmcs12->guest_pdptr2); 2214 2199 vmcs_write64(GUEST_PDPTR3, vmcs12->guest_pdptr3); 2215 2200 } 2201 + 2202 + if (kvm_mpx_supported() && vmx->nested.nested_run_pending && 2203 + (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS)) 2204 + vmcs_write64(GUEST_BNDCFGS, vmcs12->guest_bndcfgs); 2216 2205 } 2217 2206 2218 2207 if (nested_cpu_has_xsaves(vmcs12)) ··· 2252 2233 vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, vmx->msr_autoload.guest.nr); 2253 2234 2254 2235 set_cr4_guest_host_mask(vmx); 2255 - 2256 - if (kvm_mpx_supported()) { 2257 - if (vmx->nested.nested_run_pending && 2258 - (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS)) 2259 - vmcs_write64(GUEST_BNDCFGS, vmcs12->guest_bndcfgs); 2260 - else 2261 - vmcs_write64(GUEST_BNDCFGS, vmx->nested.vmcs01_guest_bndcfgs); 2262 - } 2263 2236 } 2264 2237 2265 2238 /* ··· 2270 2259 { 2271 2260 struct vcpu_vmx *vmx = to_vmx(vcpu); 2272 2261 struct 
hv_enlightened_vmcs *hv_evmcs = vmx->nested.hv_evmcs; 2262 + bool load_guest_pdptrs_vmcs12 = false; 2273 2263 2274 - if (vmx->nested.dirty_vmcs12 || vmx->nested.hv_evmcs) { 2275 - prepare_vmcs02_full(vmx, vmcs12); 2264 + if (vmx->nested.dirty_vmcs12 || hv_evmcs) { 2265 + prepare_vmcs02_rare(vmx, vmcs12); 2276 2266 vmx->nested.dirty_vmcs12 = false; 2277 - } 2278 2267 2279 - /* 2280 - * First, the fields that are shadowed. This must be kept in sync 2281 - * with vmcs_shadow_fields.h. 2282 - */ 2283 - if (!hv_evmcs || !(hv_evmcs->hv_clean_fields & 2284 - HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2)) { 2285 - vmcs_write32(GUEST_CS_AR_BYTES, vmcs12->guest_cs_ar_bytes); 2286 - vmcs_write32(GUEST_SS_AR_BYTES, vmcs12->guest_ss_ar_bytes); 2268 + load_guest_pdptrs_vmcs12 = !hv_evmcs || 2269 + !(hv_evmcs->hv_clean_fields & 2270 + HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1); 2287 2271 } 2288 2272 2289 2273 if (vmx->nested.nested_run_pending && ··· 2289 2283 kvm_set_dr(vcpu, 7, vcpu->arch.dr7); 2290 2284 vmcs_write64(GUEST_IA32_DEBUGCTL, vmx->nested.vmcs01_debugctl); 2291 2285 } 2286 + if (kvm_mpx_supported() && (!vmx->nested.nested_run_pending || 2287 + !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))) 2288 + vmcs_write64(GUEST_BNDCFGS, vmx->nested.vmcs01_guest_bndcfgs); 2292 2289 vmx_set_rflags(vcpu, vmcs12->guest_rflags); 2293 2290 2294 2291 /* EXCEPTION_BITMAP and CR0_GUEST_HOST_MASK should basically be the ··· 2380 2371 if (nested_vmx_load_cr3(vcpu, vmcs12->guest_cr3, nested_cpu_has_ept(vmcs12), 2381 2372 entry_failure_code)) 2382 2373 return -EINVAL; 2374 + 2375 + /* Late preparation of GUEST_PDPTRs now that EFER and CRs are set. 
*/ 2376 + if (load_guest_pdptrs_vmcs12 && nested_cpu_has_ept(vmcs12) && 2377 + is_pae_paging(vcpu)) { 2378 + vmcs_write64(GUEST_PDPTR0, vmcs12->guest_pdptr0); 2379 + vmcs_write64(GUEST_PDPTR1, vmcs12->guest_pdptr1); 2380 + vmcs_write64(GUEST_PDPTR2, vmcs12->guest_pdptr2); 2381 + vmcs_write64(GUEST_PDPTR3, vmcs12->guest_pdptr3); 2382 + } 2383 2383 2384 2384 if (!enable_ept) 2385 2385 vcpu->arch.walk_mmu->inject_page_fault = vmx_inject_page_fault_nested; ··· 2627 2609 !kvm_pat_valid(vmcs12->host_ia32_pat)) 2628 2610 return -EINVAL; 2629 2611 2612 + ia32e = (vmcs12->vm_exit_controls & 2613 + VM_EXIT_HOST_ADDR_SPACE_SIZE) != 0; 2614 + 2615 + if (vmcs12->host_cs_selector & (SEGMENT_RPL_MASK | SEGMENT_TI_MASK) || 2616 + vmcs12->host_ss_selector & (SEGMENT_RPL_MASK | SEGMENT_TI_MASK) || 2617 + vmcs12->host_ds_selector & (SEGMENT_RPL_MASK | SEGMENT_TI_MASK) || 2618 + vmcs12->host_es_selector & (SEGMENT_RPL_MASK | SEGMENT_TI_MASK) || 2619 + vmcs12->host_fs_selector & (SEGMENT_RPL_MASK | SEGMENT_TI_MASK) || 2620 + vmcs12->host_gs_selector & (SEGMENT_RPL_MASK | SEGMENT_TI_MASK) || 2621 + vmcs12->host_tr_selector & (SEGMENT_RPL_MASK | SEGMENT_TI_MASK) || 2622 + vmcs12->host_cs_selector == 0 || 2623 + vmcs12->host_tr_selector == 0 || 2624 + (vmcs12->host_ss_selector == 0 && !ia32e)) 2625 + return -EINVAL; 2626 + 2627 + #ifdef CONFIG_X86_64 2628 + if (is_noncanonical_address(vmcs12->host_fs_base, vcpu) || 2629 + is_noncanonical_address(vmcs12->host_gs_base, vcpu) || 2630 + is_noncanonical_address(vmcs12->host_gdtr_base, vcpu) || 2631 + is_noncanonical_address(vmcs12->host_idtr_base, vcpu) || 2632 + is_noncanonical_address(vmcs12->host_tr_base, vcpu)) 2633 + return -EINVAL; 2634 + #endif 2635 + 2630 2636 /* 2631 2637 * If the load IA32_EFER VM-exit control is 1, bits reserved in the 2632 2638 * IA32_EFER MSR must be 0 in the field for that register. In addition, ··· 2658 2616 * the host address-space size VM-exit control. 
2659 2617 */ 2660 2618 if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_EFER) { 2661 - ia32e = (vmcs12->vm_exit_controls & 2662 - VM_EXIT_HOST_ADDR_SPACE_SIZE) != 0; 2663 2619 if (!kvm_valid_efer(vcpu, vmcs12->host_ia32_efer) || 2664 2620 ia32e != !!(vmcs12->host_ia32_efer & EFER_LMA) || 2665 2621 ia32e != !!(vmcs12->host_ia32_efer & EFER_LME)) ··· 2821 2781 [launched]"i"(offsetof(struct loaded_vmcs, launched)), 2822 2782 [host_state_rsp]"i"(offsetof(struct loaded_vmcs, host_state.rsp)), 2823 2783 [wordsize]"i"(sizeof(ulong)) 2824 - : "cc", "memory" 2784 + : "memory" 2825 2785 ); 2826 2786 2827 2787 if (vmx->msr_autoload.host.nr) ··· 2891 2851 hpa = page_to_phys(vmx->nested.apic_access_page); 2892 2852 vmcs_write64(APIC_ACCESS_ADDR, hpa); 2893 2853 } else { 2894 - vmcs_clear_bits(SECONDARY_VM_EXEC_CONTROL, 2895 - SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES); 2854 + secondary_exec_controls_clearbit(vmx, 2855 + SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES); 2896 2856 } 2897 2857 } 2898 2858 2899 2859 if (nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW)) { 2900 2860 map = &vmx->nested.virtual_apic_map; 2901 2861 2902 - /* 2903 - * If translation failed, VM entry will fail because 2904 - * prepare_vmcs02 set VIRTUAL_APIC_PAGE_ADDR to -1ull. 2905 - */ 2906 2862 if (!kvm_vcpu_map(vcpu, gpa_to_gfn(vmcs12->virtual_apic_page_addr), map)) { 2907 2863 vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, pfn_to_hpa(map->pfn)); 2908 2864 } else if (nested_cpu_has(vmcs12, CPU_BASED_CR8_LOAD_EXITING) && ··· 2912 2876 * _not_ what the processor does but it's basically the 2913 2877 * only possibility we have. 2914 2878 */ 2915 - vmcs_clear_bits(CPU_BASED_VM_EXEC_CONTROL, 2916 - CPU_BASED_TPR_SHADOW); 2879 + exec_controls_clearbit(vmx, CPU_BASED_TPR_SHADOW); 2917 2880 } else { 2918 - printk("bad virtual-APIC page address\n"); 2919 - dump_vmcs(); 2881 + /* 2882 + * Write an illegal value to VIRTUAL_APIC_PAGE_ADDR to 2883 + * force VM-Entry to fail. 
2884 + */ 2885 + vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, -1ull); 2920 2886 } 2921 2887 } 2922 2888 ··· 2934 2896 } 2935 2897 } 2936 2898 if (nested_vmx_prepare_msr_bitmap(vcpu, vmcs12)) 2937 - vmcs_set_bits(CPU_BASED_VM_EXEC_CONTROL, 2938 - CPU_BASED_USE_MSR_BITMAPS); 2899 + exec_controls_setbit(vmx, CPU_BASED_USE_MSR_BITMAPS); 2939 2900 else 2940 - vmcs_clear_bits(CPU_BASED_VM_EXEC_CONTROL, 2941 - CPU_BASED_USE_MSR_BITMAPS); 2901 + exec_controls_clearbit(vmx, CPU_BASED_USE_MSR_BITMAPS); 2942 2902 } 2943 2903 2944 2904 /* ··· 2989 2953 u32 exit_reason = EXIT_REASON_INVALID_STATE; 2990 2954 u32 exit_qual; 2991 2955 2992 - evaluate_pending_interrupts = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL) & 2956 + evaluate_pending_interrupts = exec_controls_get(vmx) & 2993 2957 (CPU_BASED_VIRTUAL_INTR_PENDING | CPU_BASED_VIRTUAL_NMI_PENDING); 2994 2958 if (likely(!evaluate_pending_interrupts) && kvm_vcpu_apicv_active(vcpu)) 2995 2959 evaluate_pending_interrupts |= vmx_has_apicv_interrupt(vcpu); ··· 2999 2963 if (kvm_mpx_supported() && 3000 2964 !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS)) 3001 2965 vmx->nested.vmcs01_guest_bndcfgs = vmcs_read64(GUEST_BNDCFGS); 2966 + 2967 + /* 2968 + * Overwrite vmcs01.GUEST_CR3 with L1's CR3 if EPT is disabled *and* 2969 + * nested early checks are disabled. In the event of a "late" VM-Fail, 2970 + * i.e. a VM-Fail detected by hardware but not KVM, KVM must unwind its 2971 + * software model to the pre-VMEntry host state. When EPT is disabled, 2972 + * GUEST_CR3 holds KVM's shadow CR3, not L1's "real" CR3, which causes 2973 + * nested_vmx_restore_host_state() to corrupt vcpu->arch.cr3. Stuffing 2974 + * vmcs01.GUEST_CR3 results in the unwind naturally setting arch.cr3 to 2975 + * the correct value. Smashing vmcs01.GUEST_CR3 is safe because nested 2976 + * VM-Exits, and the unwind, reset KVM's MMU, i.e. vmcs01.GUEST_CR3 is 2977 + * guaranteed to be overwritten with a shadow CR3 prior to re-entering 2978 + * L1. 
Don't stuff vmcs01.GUEST_CR3 when using nested early checks as 2979 + * KVM modifies vcpu->arch.cr3 if and only if the early hardware checks 2980 + * pass, and early VM-Fails do not reset KVM's MMU, i.e. the VM-Fail 2981 + * path would need to manually save/restore vmcs01.GUEST_CR3. 2982 + */ 2983 + if (!enable_ept && !nested_early_check) 2984 + vmcs_writel(GUEST_CR3, vcpu->arch.cr3); 3002 2985 3003 2986 vmx_switch_vmcs(vcpu, &vmx->nested.vmcs02); 3004 2987 ··· 3114 3059 vmcs12->vm_exit_reason = exit_reason | VMX_EXIT_REASONS_FAILED_VMENTRY; 3115 3060 vmcs12->exit_qualification = exit_qual; 3116 3061 if (enable_shadow_vmcs || vmx->nested.hv_evmcs) 3117 - vmx->nested.need_vmcs12_sync = true; 3062 + vmx->nested.need_vmcs12_to_shadow_sync = true; 3118 3063 return 1; 3119 3064 } 3120 3065 ··· 3132 3077 if (!nested_vmx_check_permission(vcpu)) 3133 3078 return 1; 3134 3079 3135 - if (!nested_vmx_handle_enlightened_vmptrld(vcpu, true)) 3080 + if (!nested_vmx_handle_enlightened_vmptrld(vcpu, launch)) 3136 3081 return 1; 3137 3082 3138 3083 if (!vmx->nested.hv_evmcs && vmx->nested.current_vmptr == -1ull) ··· 3448 3393 return value >> VMX_MISC_EMULATED_PREEMPTION_TIMER_RATE; 3449 3394 } 3450 3395 3451 - /* 3452 - * Update the guest state fields of vmcs12 to reflect changes that 3453 - * occurred while L2 was running. (The "IA-32e mode guest" bit of the 3454 - * VM-entry controls is also updated, since this is really a guest 3455 - * state bit.) 
3456 - */ 3457 - static void sync_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) 3396 + static bool is_vmcs12_ext_field(unsigned long field) 3458 3397 { 3459 - vmcs12->guest_cr0 = vmcs12_guest_cr0(vcpu, vmcs12); 3460 - vmcs12->guest_cr4 = vmcs12_guest_cr4(vcpu, vmcs12); 3398 + switch (field) { 3399 + case GUEST_ES_SELECTOR: 3400 + case GUEST_CS_SELECTOR: 3401 + case GUEST_SS_SELECTOR: 3402 + case GUEST_DS_SELECTOR: 3403 + case GUEST_FS_SELECTOR: 3404 + case GUEST_GS_SELECTOR: 3405 + case GUEST_LDTR_SELECTOR: 3406 + case GUEST_TR_SELECTOR: 3407 + case GUEST_ES_LIMIT: 3408 + case GUEST_CS_LIMIT: 3409 + case GUEST_SS_LIMIT: 3410 + case GUEST_DS_LIMIT: 3411 + case GUEST_FS_LIMIT: 3412 + case GUEST_GS_LIMIT: 3413 + case GUEST_LDTR_LIMIT: 3414 + case GUEST_TR_LIMIT: 3415 + case GUEST_GDTR_LIMIT: 3416 + case GUEST_IDTR_LIMIT: 3417 + case GUEST_ES_AR_BYTES: 3418 + case GUEST_DS_AR_BYTES: 3419 + case GUEST_FS_AR_BYTES: 3420 + case GUEST_GS_AR_BYTES: 3421 + case GUEST_LDTR_AR_BYTES: 3422 + case GUEST_TR_AR_BYTES: 3423 + case GUEST_ES_BASE: 3424 + case GUEST_CS_BASE: 3425 + case GUEST_SS_BASE: 3426 + case GUEST_DS_BASE: 3427 + case GUEST_FS_BASE: 3428 + case GUEST_GS_BASE: 3429 + case GUEST_LDTR_BASE: 3430 + case GUEST_TR_BASE: 3431 + case GUEST_GDTR_BASE: 3432 + case GUEST_IDTR_BASE: 3433 + case GUEST_PENDING_DBG_EXCEPTIONS: 3434 + case GUEST_BNDCFGS: 3435 + return true; 3436 + default: 3437 + break; 3438 + } 3461 3439 3462 - vmcs12->guest_rsp = kvm_rsp_read(vcpu); 3463 - vmcs12->guest_rip = kvm_rip_read(vcpu); 3464 - vmcs12->guest_rflags = vmcs_readl(GUEST_RFLAGS); 3440 + return false; 3441 + } 3442 + 3443 + static void sync_vmcs02_to_vmcs12_rare(struct kvm_vcpu *vcpu, 3444 + struct vmcs12 *vmcs12) 3445 + { 3446 + struct vcpu_vmx *vmx = to_vmx(vcpu); 3465 3447 3466 3448 vmcs12->guest_es_selector = vmcs_read16(GUEST_ES_SELECTOR); 3467 3449 vmcs12->guest_cs_selector = vmcs_read16(GUEST_CS_SELECTOR); ··· 3519 3427 vmcs12->guest_gdtr_limit = vmcs_read32(GUEST_GDTR_LIMIT); 
3520 3428 vmcs12->guest_idtr_limit = vmcs_read32(GUEST_IDTR_LIMIT); 3521 3429 vmcs12->guest_es_ar_bytes = vmcs_read32(GUEST_ES_AR_BYTES); 3522 - vmcs12->guest_cs_ar_bytes = vmcs_read32(GUEST_CS_AR_BYTES); 3523 - vmcs12->guest_ss_ar_bytes = vmcs_read32(GUEST_SS_AR_BYTES); 3524 3430 vmcs12->guest_ds_ar_bytes = vmcs_read32(GUEST_DS_AR_BYTES); 3525 3431 vmcs12->guest_fs_ar_bytes = vmcs_read32(GUEST_FS_AR_BYTES); 3526 3432 vmcs12->guest_gs_ar_bytes = vmcs_read32(GUEST_GS_AR_BYTES); ··· 3534 3444 vmcs12->guest_tr_base = vmcs_readl(GUEST_TR_BASE); 3535 3445 vmcs12->guest_gdtr_base = vmcs_readl(GUEST_GDTR_BASE); 3536 3446 vmcs12->guest_idtr_base = vmcs_readl(GUEST_IDTR_BASE); 3447 + vmcs12->guest_pending_dbg_exceptions = 3448 + vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS); 3449 + if (kvm_mpx_supported()) 3450 + vmcs12->guest_bndcfgs = vmcs_read64(GUEST_BNDCFGS); 3451 + 3452 + vmx->nested.need_sync_vmcs02_to_vmcs12_rare = false; 3453 + } 3454 + 3455 + static void copy_vmcs02_to_vmcs12_rare(struct kvm_vcpu *vcpu, 3456 + struct vmcs12 *vmcs12) 3457 + { 3458 + struct vcpu_vmx *vmx = to_vmx(vcpu); 3459 + int cpu; 3460 + 3461 + if (!vmx->nested.need_sync_vmcs02_to_vmcs12_rare) 3462 + return; 3463 + 3464 + 3465 + WARN_ON_ONCE(vmx->loaded_vmcs != &vmx->vmcs01); 3466 + 3467 + cpu = get_cpu(); 3468 + vmx->loaded_vmcs = &vmx->nested.vmcs02; 3469 + vmx_vcpu_load(&vmx->vcpu, cpu); 3470 + 3471 + sync_vmcs02_to_vmcs12_rare(vcpu, vmcs12); 3472 + 3473 + vmx->loaded_vmcs = &vmx->vmcs01; 3474 + vmx_vcpu_load(&vmx->vcpu, cpu); 3475 + put_cpu(); 3476 + } 3477 + 3478 + /* 3479 + * Update the guest state fields of vmcs12 to reflect changes that 3480 + * occurred while L2 was running. (The "IA-32e mode guest" bit of the 3481 + * VM-entry controls is also updated, since this is really a guest 3482 + * state bit.) 
3483 + */ 3484 + static void sync_vmcs02_to_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) 3485 + { 3486 + struct vcpu_vmx *vmx = to_vmx(vcpu); 3487 + 3488 + if (vmx->nested.hv_evmcs) 3489 + sync_vmcs02_to_vmcs12_rare(vcpu, vmcs12); 3490 + 3491 + vmx->nested.need_sync_vmcs02_to_vmcs12_rare = !vmx->nested.hv_evmcs; 3492 + 3493 + vmcs12->guest_cr0 = vmcs12_guest_cr0(vcpu, vmcs12); 3494 + vmcs12->guest_cr4 = vmcs12_guest_cr4(vcpu, vmcs12); 3495 + 3496 + vmcs12->guest_rsp = kvm_rsp_read(vcpu); 3497 + vmcs12->guest_rip = kvm_rip_read(vcpu); 3498 + vmcs12->guest_rflags = vmcs_readl(GUEST_RFLAGS); 3499 + 3500 + vmcs12->guest_cs_ar_bytes = vmcs_read32(GUEST_CS_AR_BYTES); 3501 + vmcs12->guest_ss_ar_bytes = vmcs_read32(GUEST_SS_AR_BYTES); 3502 + 3503 + vmcs12->guest_sysenter_cs = vmcs_read32(GUEST_SYSENTER_CS); 3504 + vmcs12->guest_sysenter_esp = vmcs_readl(GUEST_SYSENTER_ESP); 3505 + vmcs12->guest_sysenter_eip = vmcs_readl(GUEST_SYSENTER_EIP); 3537 3506 3538 3507 vmcs12->guest_interruptibility_info = 3539 3508 vmcs_read32(GUEST_INTERRUPTIBILITY_INFO); 3540 - vmcs12->guest_pending_dbg_exceptions = 3541 - vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS); 3509 + 3542 3510 if (vcpu->arch.mp_state == KVM_MP_STATE_HALTED) 3543 3511 vmcs12->guest_activity_state = GUEST_ACTIVITY_HLT; 3544 3512 else ··· 3617 3469 */ 3618 3470 if (enable_ept) { 3619 3471 vmcs12->guest_cr3 = vmcs_readl(GUEST_CR3); 3620 - vmcs12->guest_pdptr0 = vmcs_read64(GUEST_PDPTR0); 3621 - vmcs12->guest_pdptr1 = vmcs_read64(GUEST_PDPTR1); 3622 - vmcs12->guest_pdptr2 = vmcs_read64(GUEST_PDPTR2); 3623 - vmcs12->guest_pdptr3 = vmcs_read64(GUEST_PDPTR3); 3472 + if (nested_cpu_has_ept(vmcs12) && is_pae_paging(vcpu)) { 3473 + vmcs12->guest_pdptr0 = vmcs_read64(GUEST_PDPTR0); 3474 + vmcs12->guest_pdptr1 = vmcs_read64(GUEST_PDPTR1); 3475 + vmcs12->guest_pdptr2 = vmcs_read64(GUEST_PDPTR2); 3476 + vmcs12->guest_pdptr3 = vmcs_read64(GUEST_PDPTR3); 3477 + } 3624 3478 } 3625 3479 3626 3480 vmcs12->guest_linear_address = 
vmcs_readl(GUEST_LINEAR_ADDRESS); ··· 3634 3484 (vmcs12->vm_entry_controls & ~VM_ENTRY_IA32E_MODE) | 3635 3485 (vm_entry_controls_get(to_vmx(vcpu)) & VM_ENTRY_IA32E_MODE); 3636 3486 3637 - if (vmcs12->vm_exit_controls & VM_EXIT_SAVE_DEBUG_CONTROLS) { 3487 + if (vmcs12->vm_exit_controls & VM_EXIT_SAVE_DEBUG_CONTROLS) 3638 3488 kvm_get_dr(vcpu, 7, (unsigned long *)&vmcs12->guest_dr7); 3639 - vmcs12->guest_ia32_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL); 3640 - } 3641 3489 3642 - /* TODO: These cannot have changed unless we have MSR bitmaps and 3643 - * the relevant bit asks not to trap the change */ 3644 - if (vmcs12->vm_exit_controls & VM_EXIT_SAVE_IA32_PAT) 3645 - vmcs12->guest_ia32_pat = vmcs_read64(GUEST_IA32_PAT); 3646 3490 if (vmcs12->vm_exit_controls & VM_EXIT_SAVE_IA32_EFER) 3647 3491 vmcs12->guest_ia32_efer = vcpu->arch.efer; 3648 - vmcs12->guest_sysenter_cs = vmcs_read32(GUEST_SYSENTER_CS); 3649 - vmcs12->guest_sysenter_esp = vmcs_readl(GUEST_SYSENTER_ESP); 3650 - vmcs12->guest_sysenter_eip = vmcs_readl(GUEST_SYSENTER_EIP); 3651 - if (kvm_mpx_supported()) 3652 - vmcs12->guest_bndcfgs = vmcs_read64(GUEST_BNDCFGS); 3653 3492 } 3654 3493 3655 3494 /* ··· 3656 3517 u32 exit_reason, u32 exit_intr_info, 3657 3518 unsigned long exit_qualification) 3658 3519 { 3659 - /* update guest state fields: */ 3660 - sync_vmcs12(vcpu, vmcs12); 3661 - 3662 3520 /* update exit information fields: */ 3663 - 3664 3521 vmcs12->vm_exit_reason = exit_reason; 3665 3522 vmcs12->exit_qualification = exit_qualification; 3666 3523 vmcs12->vm_exit_intr_info = exit_intr_info; ··· 3910 3775 vmx_set_cr4(vcpu, vmcs_readl(CR4_READ_SHADOW)); 3911 3776 3912 3777 nested_ept_uninit_mmu_context(vcpu); 3913 - 3914 - /* 3915 - * This is only valid if EPT is in use, otherwise the vmcs01 GUEST_CR3 3916 - * points to shadow pages! Fortunately we only get here after a WARN_ON 3917 - * if EPT is disabled, so a VMabort is perfectly fine. 
3918 - */ 3919 - if (enable_ept) { 3920 - vcpu->arch.cr3 = vmcs_readl(GUEST_CR3); 3921 - __set_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail); 3922 - } else { 3923 - nested_vmx_abort(vcpu, VMX_ABORT_VMCS_CORRUPTED); 3924 - } 3778 + vcpu->arch.cr3 = vmcs_readl(GUEST_CR3); 3779 + __set_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail); 3925 3780 3926 3781 /* 3927 3782 * Use ept_save_pdptrs(vcpu) to load the MMU's cached PDPTRs ··· 3919 3794 * VMFail, like everything else we just need to ensure our 3920 3795 * software model is up-to-date. 3921 3796 */ 3922 - ept_save_pdptrs(vcpu); 3797 + if (enable_ept) 3798 + ept_save_pdptrs(vcpu); 3923 3799 3924 3800 kvm_mmu_reset_context(vcpu); 3925 3801 ··· 4008 3882 vcpu->arch.tsc_offset -= vmcs12->tsc_offset; 4009 3883 4010 3884 if (likely(!vmx->fail)) { 4011 - if (exit_reason == -1) 4012 - sync_vmcs12(vcpu, vmcs12); 4013 - else 3885 + sync_vmcs02_to_vmcs12(vcpu, vmcs12); 3886 + 3887 + if (exit_reason != -1) 4014 3888 prepare_vmcs12(vcpu, vmcs12, exit_reason, exit_intr_info, 4015 3889 exit_qualification); 4016 3890 4017 3891 /* 4018 - * Must happen outside of sync_vmcs12() as it will 3892 + * Must happen outside of sync_vmcs02_to_vmcs12() as it will 4019 3893 * also be used to capture vmcs12 cache as part of 4020 3894 * capturing nVMX state for snapshot (migration). 4021 3895 * ··· 4071 3945 kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu); 4072 3946 4073 3947 if ((exit_reason != -1) && (enable_shadow_vmcs || vmx->nested.hv_evmcs)) 4074 - vmx->nested.need_vmcs12_sync = true; 3948 + vmx->nested.need_vmcs12_to_shadow_sync = true; 4075 3949 4076 3950 /* in case we halted in L2 */ 4077 3951 vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE; ··· 4134 4008 * #UD or #GP. 
4135 4009 */ 4136 4010 int get_vmx_mem_address(struct kvm_vcpu *vcpu, unsigned long exit_qualification, 4137 - u32 vmx_instruction_info, bool wr, gva_t *ret) 4011 + u32 vmx_instruction_info, bool wr, int len, gva_t *ret) 4138 4012 { 4139 4013 gva_t off; 4140 4014 bool exn; ··· 4241 4115 */ 4242 4116 if (!(s.base == 0 && s.limit == 0xffffffff && 4243 4117 ((s.type & 8) || !(s.type & 4)))) 4244 - exn = exn || (off + sizeof(u64) > s.limit); 4118 + exn = exn || ((u64)off + len - 1 > s.limit); 4245 4119 } 4246 4120 if (exn) { 4247 4121 kvm_queue_exception_e(vcpu, ··· 4260 4134 struct x86_exception e; 4261 4135 4262 4136 if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION), 4263 - vmcs_read32(VMX_INSTRUCTION_INFO), false, &gva)) 4137 + vmcs_read32(VMX_INSTRUCTION_INFO), false, 4138 + sizeof(*vmpointer), &gva)) 4264 4139 return 1; 4265 4140 4266 4141 if (kvm_read_guest_virt(vcpu, gva, vmpointer, sizeof(*vmpointer), &e)) { ··· 4427 4300 if (vmx->nested.current_vmptr == -1ull) 4428 4301 return; 4429 4302 4303 + copy_vmcs02_to_vmcs12_rare(vcpu, get_vmcs12(vcpu)); 4304 + 4430 4305 if (enable_shadow_vmcs) { 4431 4306 /* copy to memory all shadowed fields in case 4432 4307 they were modified */ 4433 4308 copy_shadow_to_vmcs12(vmx); 4434 - vmx->nested.need_vmcs12_sync = false; 4309 + vmx->nested.need_vmcs12_to_shadow_sync = false; 4435 4310 vmx_disable_shadow_vmcs(vmx); 4436 4311 } 4437 4312 vmx->nested.posted_intr_nv = -1; ··· 4463 4334 struct vcpu_vmx *vmx = to_vmx(vcpu); 4464 4335 u32 zero = 0; 4465 4336 gpa_t vmptr; 4337 + u64 evmcs_gpa; 4466 4338 4467 4339 if (!nested_vmx_check_permission(vcpu)) 4468 4340 return 1; ··· 4479 4349 return nested_vmx_failValid(vcpu, 4480 4350 VMXERR_VMCLEAR_VMXON_POINTER); 4481 4351 4482 - if (vmx->nested.hv_evmcs_map.hva) { 4483 - if (vmptr == vmx->nested.hv_evmcs_vmptr) 4484 - nested_release_evmcs(vcpu); 4485 - } else { 4352 + /* 4353 + * When Enlightened VMEntry is enabled on the calling CPU we treat 4354 + * memory area pointer by 
vmptr as Enlightened VMCS (as there's no good 4355 + * way to distinguish it from VMCS12) and we must not corrupt it by 4356 + * writing to the non-existent 'launch_state' field. The area doesn't 4357 + * have to be the currently active EVMCS on the calling CPU and there's 4358 + * nothing KVM has to do to transition it from 'active' to 'non-active' 4359 + * state. It is possible that the area will stay mapped as 4360 + * vmx->nested.hv_evmcs but this shouldn't be a problem. 4361 + */ 4362 + if (likely(!vmx->nested.enlightened_vmcs_enabled || 4363 + !nested_enlightened_vmentry(vcpu, &evmcs_gpa))) { 4486 4364 if (vmptr == vmx->nested.current_vmptr) 4487 4365 nested_release_vmcs12(vcpu); 4488 4366 ··· 4524 4386 u64 field_value; 4525 4387 unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION); 4526 4388 u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO); 4389 + int len; 4527 4390 gva_t gva = 0; 4528 4391 struct vmcs12 *vmcs12; 4392 + short offset; 4529 4393 4530 4394 if (!nested_vmx_check_permission(vcpu)) 4531 4395 return 1; ··· 4549 4409 4550 4410 /* Decode instruction info and find the field to read */ 4551 4411 field = kvm_register_readl(vcpu, (((vmx_instruction_info) >> 28) & 0xf)); 4552 - /* Read the field, zero-extended to a u64 field_value */ 4553 - if (vmcs12_read_any(vmcs12, field, &field_value) < 0) 4412 + 4413 + offset = vmcs_field_to_offset(field); 4414 + if (offset < 0) 4554 4415 return nested_vmx_failValid(vcpu, 4555 4416 VMXERR_UNSUPPORTED_VMCS_COMPONENT); 4417 + 4418 + if (!is_guest_mode(vcpu) && is_vmcs12_ext_field(field)) 4419 + copy_vmcs02_to_vmcs12_rare(vcpu, vmcs12); 4420 + 4421 + /* Read the field, zero-extended to a u64 field_value */ 4422 + field_value = vmcs12_read_any(vmcs12, field, offset); 4556 4423 4557 4424 /* 4558 4425 * Now copy part of this value to register or memory, as requested. 
··· 4570 4423 kvm_register_writel(vcpu, (((vmx_instruction_info) >> 3) & 0xf), 4571 4424 field_value); 4572 4425 } else { 4426 + len = is_64_bit_mode(vcpu) ? 8 : 4; 4573 4427 if (get_vmx_mem_address(vcpu, exit_qualification, 4574 - vmx_instruction_info, true, &gva)) 4428 + vmx_instruction_info, true, len, &gva)) 4575 4429 return 1; 4576 4430 /* _system ok, nested_vmx_check_permission has verified cpl=0 */ 4577 - kvm_write_guest_virt_system(vcpu, gva, &field_value, 4578 - (is_long_mode(vcpu) ? 8 : 4), NULL); 4431 + kvm_write_guest_virt_system(vcpu, gva, &field_value, len, NULL); 4579 4432 } 4580 4433 4581 4434 return nested_vmx_succeed(vcpu); 4582 4435 } 4583 4436 4437 + static bool is_shadow_field_rw(unsigned long field) 4438 + { 4439 + switch (field) { 4440 + #define SHADOW_FIELD_RW(x, y) case x: 4441 + #include "vmcs_shadow_fields.h" 4442 + return true; 4443 + default: 4444 + break; 4445 + } 4446 + return false; 4447 + } 4448 + 4449 + static bool is_shadow_field_ro(unsigned long field) 4450 + { 4451 + switch (field) { 4452 + #define SHADOW_FIELD_RO(x, y) case x: 4453 + #include "vmcs_shadow_fields.h" 4454 + return true; 4455 + default: 4456 + break; 4457 + } 4458 + return false; 4459 + } 4584 4460 4585 4461 static int handle_vmwrite(struct kvm_vcpu *vcpu) 4586 4462 { 4587 4463 unsigned long field; 4464 + int len; 4588 4465 gva_t gva; 4589 4466 struct vcpu_vmx *vmx = to_vmx(vcpu); 4590 4467 unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION); ··· 4623 4452 u64 field_value = 0; 4624 4453 struct x86_exception e; 4625 4454 struct vmcs12 *vmcs12; 4455 + short offset; 4626 4456 4627 4457 if (!nested_vmx_check_permission(vcpu)) 4628 4458 return 1; ··· 4635 4463 field_value = kvm_register_readl(vcpu, 4636 4464 (((vmx_instruction_info) >> 3) & 0xf)); 4637 4465 else { 4466 + len = is_64_bit_mode(vcpu) ? 
8 : 4; 4638 4467 if (get_vmx_mem_address(vcpu, exit_qualification, 4639 - vmx_instruction_info, false, &gva)) 4468 + vmx_instruction_info, false, len, &gva)) 4640 4469 return 1; 4641 - if (kvm_read_guest_virt(vcpu, gva, &field_value, 4642 - (is_64_bit_mode(vcpu) ? 8 : 4), &e)) { 4470 + if (kvm_read_guest_virt(vcpu, gva, &field_value, len, &e)) { 4643 4471 kvm_inject_page_fault(vcpu, &e); 4644 4472 return 1; 4645 4473 } ··· 4656 4484 return nested_vmx_failValid(vcpu, 4657 4485 VMXERR_VMWRITE_READ_ONLY_VMCS_COMPONENT); 4658 4486 4659 - if (!is_guest_mode(vcpu)) 4487 + if (!is_guest_mode(vcpu)) { 4660 4488 vmcs12 = get_vmcs12(vcpu); 4661 - else { 4489 + 4490 + /* 4491 + * Ensure vmcs12 is up-to-date before any VMWRITE that dirties 4492 + * vmcs12, else we may crush a field or consume a stale value. 4493 + */ 4494 + if (!is_shadow_field_rw(field)) 4495 + copy_vmcs02_to_vmcs12_rare(vcpu, vmcs12); 4496 + } else { 4662 4497 /* 4663 4498 * When vmcs->vmcs_link_pointer is -1ull, any VMWRITE 4664 4499 * to shadowed-field sets the ALU flags for VMfailInvalid. ··· 4675 4496 vmcs12 = get_shadow_vmcs12(vcpu); 4676 4497 } 4677 4498 4678 - if (vmcs12_write_any(vmcs12, field, field_value) < 0) 4499 + offset = vmcs_field_to_offset(field); 4500 + if (offset < 0) 4679 4501 return nested_vmx_failValid(vcpu, 4680 4502 VMXERR_UNSUPPORTED_VMCS_COMPONENT); 4681 4503 4682 4504 /* 4683 - * Do not track vmcs12 dirty-state if in guest-mode 4684 - * as we actually dirty shadow vmcs12 instead of vmcs12. 4505 + * Some Intel CPUs intentionally drop the reserved bits of the AR byte 4506 + * fields on VMWRITE. Emulate this behavior to ensure consistent KVM 4507 + * behavior regardless of the underlying hardware, e.g. if an AR_BYTE 4508 + * field is intercepted for VMWRITE but not VMREAD (in L1), then VMREAD 4509 + * from L1 will return a different value than VMREAD from L2 (L1 sees 4510 + * the stripped down value, L2 sees the full value as stored by KVM). 
4685 4511 */ 4686 - if (!is_guest_mode(vcpu)) { 4687 - switch (field) { 4688 - #define SHADOW_FIELD_RW(x) case x: 4689 - #include "vmcs_shadow_fields.h" 4690 - /* 4691 - * The fields that can be updated by L1 without a vmexit are 4692 - * always updated in the vmcs02, the others go down the slow 4693 - * path of prepare_vmcs02. 4694 - */ 4695 - break; 4696 - default: 4697 - vmx->nested.dirty_vmcs12 = true; 4698 - break; 4512 + if (field >= GUEST_ES_AR_BYTES && field <= GUEST_TR_AR_BYTES) 4513 + field_value &= 0x1f0ff; 4514 + 4515 + vmcs12_write_any(vmcs12, field, offset, field_value); 4516 + 4517 + /* 4518 + * Do not track vmcs12 dirty-state if in guest-mode as we actually 4519 + * dirty shadow vmcs12 instead of vmcs12. Fields that can be updated 4520 + * by L1 without a vmexit are always updated in the vmcs02, i.e. don't 4521 + * "dirty" vmcs12, all others go down the prepare_vmcs02() slow path. 4522 + */ 4523 + if (!is_guest_mode(vcpu) && !is_shadow_field_rw(field)) { 4524 + /* 4525 + * L1 can read these fields without exiting, ensure the 4526 + * shadow VMCS is up-to-date. 
4527 + */ 4528 + if (enable_shadow_vmcs && is_shadow_field_ro(field)) { 4529 + preempt_disable(); 4530 + vmcs_load(vmx->vmcs01.shadow_vmcs); 4531 + 4532 + __vmcs_writel(field, field_value); 4533 + 4534 + vmcs_clear(vmx->vmcs01.shadow_vmcs); 4535 + vmcs_load(vmx->loaded_vmcs->vmcs); 4536 + preempt_enable(); 4699 4537 } 4538 + vmx->nested.dirty_vmcs12 = true; 4700 4539 } 4701 4540 4702 4541 return nested_vmx_succeed(vcpu); ··· 4724 4527 { 4725 4528 vmx->nested.current_vmptr = vmptr; 4726 4529 if (enable_shadow_vmcs) { 4727 - vmcs_set_bits(SECONDARY_VM_EXEC_CONTROL, 4728 - SECONDARY_EXEC_SHADOW_VMCS); 4530 + secondary_exec_controls_setbit(vmx, SECONDARY_EXEC_SHADOW_VMCS); 4729 4531 vmcs_write64(VMCS_LINK_POINTER, 4730 4532 __pa(vmx->vmcs01.shadow_vmcs)); 4731 - vmx->nested.need_vmcs12_sync = true; 4533 + vmx->nested.need_vmcs12_to_shadow_sync = true; 4732 4534 } 4733 4535 vmx->nested.dirty_vmcs12 = true; 4734 4536 } ··· 4811 4615 if (unlikely(to_vmx(vcpu)->nested.hv_evmcs)) 4812 4616 return 1; 4813 4617 4814 - if (get_vmx_mem_address(vcpu, exit_qual, instr_info, true, &gva)) 4618 + if (get_vmx_mem_address(vcpu, exit_qual, instr_info, 4619 + true, sizeof(gpa_t), &gva)) 4815 4620 return 1; 4816 4621 /* *_system ok, nested_vmx_check_permission has verified cpl=0 */ 4817 4622 if (kvm_write_guest_virt_system(vcpu, gva, (void *)&current_vmptr, ··· 4858 4661 * operand is read even if it isn't needed (e.g., for type==global) 4859 4662 */ 4860 4663 if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION), 4861 - vmx_instruction_info, false, &gva)) 4664 + vmx_instruction_info, false, sizeof(operand), &gva)) 4862 4665 return 1; 4863 4666 if (kvm_read_guest_virt(vcpu, gva, &operand, sizeof(operand), &e)) { 4864 4667 kvm_inject_page_fault(vcpu, &e); ··· 4867 4670 4868 4671 switch (type) { 4869 4672 case VMX_EPT_EXTENT_GLOBAL: 4870 - /* 4871 - * TODO: track mappings and invalidate 4872 - * single context requests appropriately 4873 - */ 4874 4673 case VMX_EPT_EXTENT_CONTEXT: 
4875 - kvm_mmu_sync_roots(vcpu); 4876 - kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu); 4674 + /* 4675 + * TODO: Sync the necessary shadow EPT roots here, rather than 4676 + * at the next emulated VM-entry. 4677 + */ 4877 4678 break; 4878 4679 default: 4879 4680 BUG_ON(1); ··· 4918 4723 * operand is read even if it isn't needed (e.g., for type==global) 4919 4724 */ 4920 4725 if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION), 4921 - vmx_instruction_info, false, &gva)) 4726 + vmx_instruction_info, false, sizeof(operand), &gva)) 4922 4727 return 1; 4923 4728 if (kvm_read_guest_virt(vcpu, gva, &operand, sizeof(operand), &e)) { 4924 4729 kvm_inject_page_fault(vcpu, &e); ··· 5479 5284 * When running L2, the authoritative vmcs12 state is in the 5480 5285 * vmcs02. When running L1, the authoritative vmcs12 state is 5481 5286 * in the shadow or enlightened vmcs linked to vmcs01, unless 5482 - * need_vmcs12_sync is set, in which case, the authoritative 5287 + * need_vmcs12_to_shadow_sync is set, in which case, the authoritative 5483 5288 * vmcs12 state is in the vmcs12 already. 5484 5289 */ 5485 5290 if (is_guest_mode(vcpu)) { 5486 - sync_vmcs12(vcpu, vmcs12); 5487 - } else if (!vmx->nested.need_vmcs12_sync) { 5291 + sync_vmcs02_to_vmcs12(vcpu, vmcs12); 5292 + sync_vmcs02_to_vmcs12_rare(vcpu, vmcs12); 5293 + } else if (!vmx->nested.need_vmcs12_to_shadow_sync) { 5488 5294 if (vmx->nested.hv_evmcs) 5489 5295 copy_enlightened_to_vmcs12(vmx); 5490 5296 else if (enable_shadow_vmcs) ··· 5617 5421 * Sync eVMCS upon entry as we may not have 5618 5422 * HV_X64_MSR_VP_ASSIST_PAGE set up yet. 
5619 5423 */ 5620 - vmx->nested.need_vmcs12_sync = true; 5424 + vmx->nested.need_vmcs12_to_shadow_sync = true; 5621 5425 } else { 5622 5426 return -EINVAL; 5623 5427 } ··· 5685 5489 void nested_vmx_vcpu_setup(void) 5686 5490 { 5687 5491 if (enable_shadow_vmcs) { 5688 - /* 5689 - * At vCPU creation, "VMWRITE to any supported field 5690 - * in the VMCS" is supported, so use the more 5691 - * permissive vmx_vmread_bitmap to specify both read 5692 - * and write permissions for the shadow VMCS. 5693 - */ 5694 5492 vmcs_write64(VMREAD_BITMAP, __pa(vmx_vmread_bitmap)); 5695 - vmcs_write64(VMWRITE_BITMAP, __pa(vmx_vmread_bitmap)); 5493 + vmcs_write64(VMWRITE_BITMAP, __pa(vmx_vmwrite_bitmap)); 5696 5494 } 5697 5495 } 5698 5496 ··· 5816 5626 msrs->secondary_ctls_low = 0; 5817 5627 msrs->secondary_ctls_high &= 5818 5628 SECONDARY_EXEC_DESC | 5629 + SECONDARY_EXEC_RDTSCP | 5819 5630 SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE | 5631 + SECONDARY_EXEC_WBINVD_EXITING | 5820 5632 SECONDARY_EXEC_APIC_REGISTER_VIRT | 5821 5633 SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY | 5822 - SECONDARY_EXEC_WBINVD_EXITING; 5634 + SECONDARY_EXEC_RDRAND_EXITING | 5635 + SECONDARY_EXEC_ENABLE_INVPCID | 5636 + SECONDARY_EXEC_RDSEED_EXITING | 5637 + SECONDARY_EXEC_XSAVES; 5823 5638 5824 5639 /* 5825 5640 * We can emulate "VMCS shadowing," even if the hardware ··· 5943 5748 __init int nested_vmx_hardware_setup(int (*exit_handlers[])(struct kvm_vcpu *)) 5944 5749 { 5945 5750 int i; 5946 - 5947 - /* 5948 - * Without EPT it is not possible to restore L1's CR3 and PDPTR on 5949 - * VMfail, because they are not available in vmcs01. Just always 5950 - * use hardware checks. 5951 - */ 5952 - if (!enable_ept) 5953 - nested_early_check = 1; 5954 5751 5955 5752 if (!cpu_has_vmx_shadow_vmcs()) 5956 5753 enable_shadow_vmcs = 0;
+2 -2
arch/x86/kvm/vmx/nested.h
··· 17 17 bool nested_vmx_exit_reflected(struct kvm_vcpu *vcpu, u32 exit_reason); 18 18 void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason, 19 19 u32 exit_intr_info, unsigned long exit_qualification); 20 - void nested_sync_from_vmcs12(struct kvm_vcpu *vcpu); 20 + void nested_sync_vmcs12_to_shadow(struct kvm_vcpu *vcpu); 21 21 int vmx_set_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data); 22 22 int vmx_get_vmx_msr(struct nested_vmx_msrs *msrs, u32 msr_index, u64 *pdata); 23 23 int get_vmx_mem_address(struct kvm_vcpu *vcpu, unsigned long exit_qualification, 24 - u32 vmx_instruction_info, bool wr, gva_t *ret); 24 + u32 vmx_instruction_info, bool wr, int len, gva_t *ret); 25 25 26 26 static inline struct vmcs12 *get_vmcs12(struct kvm_vcpu *vcpu) 27 27 {
-1
arch/x86/kvm/vmx/ops.h
··· 146 146 147 147 __vmcs_writel(field, value); 148 148 #ifndef CONFIG_X86_64 149 - asm volatile (""); 150 149 __vmcs_writel(field+1, value >> 32); 151 150 #endif 152 151 }
+16 -1
arch/x86/kvm/vmx/vmcs.h
··· 42 42 #endif 43 43 }; 44 44 45 + struct vmcs_controls_shadow { 46 + u32 vm_entry; 47 + u32 vm_exit; 48 + u32 pin; 49 + u32 exec; 50 + u32 secondary_exec; 51 + }; 52 + 45 53 /* 46 54 * Track a VMCS that may be loaded on a certain CPU. If it is (cpu!=-1), also 47 55 * remember whether it was VMLAUNCHed, and maintain a linked list of all VMCSs ··· 61 53 int cpu; 62 54 bool launched; 63 55 bool nmi_known_unmasked; 64 - bool hv_timer_armed; 56 + bool hv_timer_soft_disabled; 65 57 /* Support for vnmi-less CPUs */ 66 58 int soft_vnmi_blocked; 67 59 ktime_t entry_time; ··· 69 61 unsigned long *msr_bitmap; 70 62 struct list_head loaded_vmcss_on_cpu_link; 71 63 struct vmcs_host_state host_state; 64 + struct vmcs_controls_shadow controls_shadow; 72 65 }; 73 66 74 67 static inline bool is_exception_n(u32 intr_info, u8 vector) ··· 122 113 { 123 114 return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK)) 124 115 == (INTR_TYPE_NMI_INTR | INTR_INFO_VALID_MASK); 116 + } 117 + 118 + static inline bool is_external_intr(u32 intr_info) 119 + { 120 + return (intr_info & (INTR_INFO_VALID_MASK | INTR_INFO_INTR_TYPE_MASK)) 121 + == (INTR_INFO_VALID_MASK | INTR_TYPE_EXT_INTR); 125 122 } 126 123 127 124 enum vmcs_field_width {
+18 -39
arch/x86/kvm/vmx/vmcs12.h
··· 395 395 396 396 #undef ROL16 397 397 398 - /* 399 - * Read a vmcs12 field. Since these can have varying lengths and we return 400 - * one type, we chose the biggest type (u64) and zero-extend the return value 401 - * to that size. Note that the caller, handle_vmread, might need to use only 402 - * some of the bits we return here (e.g., on 32-bit guests, only 32 bits of 403 - * 64-bit fields are to be returned). 404 - */ 405 - static inline int vmcs12_read_any(struct vmcs12 *vmcs12, 406 - unsigned long field, u64 *ret) 398 + static inline u64 vmcs12_read_any(struct vmcs12 *vmcs12, unsigned long field, 399 + u16 offset) 407 400 { 408 - short offset = vmcs_field_to_offset(field); 409 - char *p; 410 - 411 - if (offset < 0) 412 - return offset; 413 - 414 - p = (char *)vmcs12 + offset; 401 + char *p = (char *)vmcs12 + offset; 415 402 416 403 switch (vmcs_field_width(field)) { 417 404 case VMCS_FIELD_WIDTH_NATURAL_WIDTH: 418 - *ret = *((natural_width *)p); 419 - return 0; 405 + return *((natural_width *)p); 420 406 case VMCS_FIELD_WIDTH_U16: 421 - *ret = *((u16 *)p); 422 - return 0; 407 + return *((u16 *)p); 423 408 case VMCS_FIELD_WIDTH_U32: 424 - *ret = *((u32 *)p); 425 - return 0; 409 + return *((u32 *)p); 426 410 case VMCS_FIELD_WIDTH_U64: 427 - *ret = *((u64 *)p); 428 - return 0; 411 + return *((u64 *)p); 429 412 default: 430 - WARN_ON(1); 431 - return -ENOENT; 413 + WARN_ON_ONCE(1); 414 + return -1; 432 415 } 433 416 } 434 417 435 - static inline int vmcs12_write_any(struct vmcs12 *vmcs12, 436 - unsigned long field, u64 field_value){ 437 - short offset = vmcs_field_to_offset(field); 418 + static inline void vmcs12_write_any(struct vmcs12 *vmcs12, unsigned long field, 419 + u16 offset, u64 field_value) 420 + { 438 421 char *p = (char *)vmcs12 + offset; 439 - 440 - if (offset < 0) 441 - return offset; 442 422 443 423 switch (vmcs_field_width(field)) { 444 424 case VMCS_FIELD_WIDTH_U16: 445 425 *(u16 *)p = field_value; 446 - return 0; 426 + break; 447 427 case 
VMCS_FIELD_WIDTH_U32: 448 428 *(u32 *)p = field_value; 449 - return 0; 429 + break; 450 430 case VMCS_FIELD_WIDTH_U64: 451 431 *(u64 *)p = field_value; 452 - return 0; 432 + break; 453 433 case VMCS_FIELD_WIDTH_NATURAL_WIDTH: 454 434 *(natural_width *)p = field_value; 455 - return 0; 435 + break; 456 436 default: 457 - WARN_ON(1); 458 - return -ENOENT; 437 + WARN_ON_ONCE(1); 438 + break; 459 439 } 460 - 461 440 } 462 441 463 442 #endif /* __KVM_X86_VMX_VMCS12_H */
+42 -37
arch/x86/kvm/vmx/vmcs_shadow_fields.h
··· 1 + #if !defined(SHADOW_FIELD_RO) && !defined(SHADOW_FIELD_RW) 2 + BUILD_BUG_ON(1) 3 + #endif 4 + 1 5 #ifndef SHADOW_FIELD_RO 2 - #define SHADOW_FIELD_RO(x) 6 + #define SHADOW_FIELD_RO(x, y) 3 7 #endif 4 8 #ifndef SHADOW_FIELD_RW 5 - #define SHADOW_FIELD_RW(x) 9 + #define SHADOW_FIELD_RW(x, y) 6 10 #endif 7 11 8 12 /* ··· 32 28 */ 33 29 34 30 /* 16-bits */ 35 - SHADOW_FIELD_RW(GUEST_INTR_STATUS) 36 - SHADOW_FIELD_RW(GUEST_PML_INDEX) 37 - SHADOW_FIELD_RW(HOST_FS_SELECTOR) 38 - SHADOW_FIELD_RW(HOST_GS_SELECTOR) 31 + SHADOW_FIELD_RW(GUEST_INTR_STATUS, guest_intr_status) 32 + SHADOW_FIELD_RW(GUEST_PML_INDEX, guest_pml_index) 33 + SHADOW_FIELD_RW(HOST_FS_SELECTOR, host_fs_selector) 34 + SHADOW_FIELD_RW(HOST_GS_SELECTOR, host_gs_selector) 39 35 40 36 /* 32-bits */ 41 - SHADOW_FIELD_RO(VM_EXIT_REASON) 42 - SHADOW_FIELD_RO(VM_EXIT_INTR_INFO) 43 - SHADOW_FIELD_RO(VM_EXIT_INSTRUCTION_LEN) 44 - SHADOW_FIELD_RO(IDT_VECTORING_INFO_FIELD) 45 - SHADOW_FIELD_RO(IDT_VECTORING_ERROR_CODE) 46 - SHADOW_FIELD_RO(VM_EXIT_INTR_ERROR_CODE) 47 - SHADOW_FIELD_RW(CPU_BASED_VM_EXEC_CONTROL) 48 - SHADOW_FIELD_RW(EXCEPTION_BITMAP) 49 - SHADOW_FIELD_RW(VM_ENTRY_EXCEPTION_ERROR_CODE) 50 - SHADOW_FIELD_RW(VM_ENTRY_INTR_INFO_FIELD) 51 - SHADOW_FIELD_RW(VM_ENTRY_INSTRUCTION_LEN) 52 - SHADOW_FIELD_RW(TPR_THRESHOLD) 53 - SHADOW_FIELD_RW(GUEST_CS_AR_BYTES) 54 - SHADOW_FIELD_RW(GUEST_SS_AR_BYTES) 55 - SHADOW_FIELD_RW(GUEST_INTERRUPTIBILITY_INFO) 56 - SHADOW_FIELD_RW(VMX_PREEMPTION_TIMER_VALUE) 37 + SHADOW_FIELD_RO(VM_EXIT_REASON, vm_exit_reason) 38 + SHADOW_FIELD_RO(VM_EXIT_INTR_INFO, vm_exit_intr_info) 39 + SHADOW_FIELD_RO(VM_EXIT_INSTRUCTION_LEN, vm_exit_instruction_len) 40 + SHADOW_FIELD_RO(IDT_VECTORING_INFO_FIELD, idt_vectoring_info_field) 41 + SHADOW_FIELD_RO(IDT_VECTORING_ERROR_CODE, idt_vectoring_error_code) 42 + SHADOW_FIELD_RO(VM_EXIT_INTR_ERROR_CODE, vm_exit_intr_error_code) 43 + SHADOW_FIELD_RO(GUEST_CS_AR_BYTES, guest_cs_ar_bytes) 44 + SHADOW_FIELD_RO(GUEST_SS_AR_BYTES, 
guest_ss_ar_bytes) 45 + SHADOW_FIELD_RW(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control) 46 + SHADOW_FIELD_RW(PIN_BASED_VM_EXEC_CONTROL, pin_based_vm_exec_control) 47 + SHADOW_FIELD_RW(EXCEPTION_BITMAP, exception_bitmap) 48 + SHADOW_FIELD_RW(VM_ENTRY_EXCEPTION_ERROR_CODE, vm_entry_exception_error_code) 49 + SHADOW_FIELD_RW(VM_ENTRY_INTR_INFO_FIELD, vm_entry_intr_info_field) 50 + SHADOW_FIELD_RW(VM_ENTRY_INSTRUCTION_LEN, vm_entry_instruction_len) 51 + SHADOW_FIELD_RW(TPR_THRESHOLD, tpr_threshold) 52 + SHADOW_FIELD_RW(GUEST_INTERRUPTIBILITY_INFO, guest_interruptibility_info) 53 + SHADOW_FIELD_RW(VMX_PREEMPTION_TIMER_VALUE, vmx_preemption_timer_value) 57 54 58 55 /* Natural width */ 59 - SHADOW_FIELD_RO(EXIT_QUALIFICATION) 60 - SHADOW_FIELD_RO(GUEST_LINEAR_ADDRESS) 61 - SHADOW_FIELD_RW(GUEST_RIP) 62 - SHADOW_FIELD_RW(GUEST_RSP) 63 - SHADOW_FIELD_RW(GUEST_CR0) 64 - SHADOW_FIELD_RW(GUEST_CR3) 65 - SHADOW_FIELD_RW(GUEST_CR4) 66 - SHADOW_FIELD_RW(GUEST_RFLAGS) 67 - SHADOW_FIELD_RW(CR0_GUEST_HOST_MASK) 68 - SHADOW_FIELD_RW(CR0_READ_SHADOW) 69 - SHADOW_FIELD_RW(CR4_READ_SHADOW) 70 - SHADOW_FIELD_RW(HOST_FS_BASE) 71 - SHADOW_FIELD_RW(HOST_GS_BASE) 56 + SHADOW_FIELD_RO(EXIT_QUALIFICATION, exit_qualification) 57 + SHADOW_FIELD_RO(GUEST_LINEAR_ADDRESS, guest_linear_address) 58 + SHADOW_FIELD_RW(GUEST_RIP, guest_rip) 59 + SHADOW_FIELD_RW(GUEST_RSP, guest_rsp) 60 + SHADOW_FIELD_RW(GUEST_CR0, guest_cr0) 61 + SHADOW_FIELD_RW(GUEST_CR3, guest_cr3) 62 + SHADOW_FIELD_RW(GUEST_CR4, guest_cr4) 63 + SHADOW_FIELD_RW(GUEST_RFLAGS, guest_rflags) 64 + SHADOW_FIELD_RW(CR0_GUEST_HOST_MASK, cr0_guest_host_mask) 65 + SHADOW_FIELD_RW(CR0_READ_SHADOW, cr0_read_shadow) 66 + SHADOW_FIELD_RW(CR4_READ_SHADOW, cr4_read_shadow) 67 + SHADOW_FIELD_RW(HOST_FS_BASE, host_fs_base) 68 + SHADOW_FIELD_RW(HOST_GS_BASE, host_gs_base) 72 69 73 70 /* 64-bit */ 74 - SHADOW_FIELD_RO(GUEST_PHYSICAL_ADDRESS) 75 - SHADOW_FIELD_RO(GUEST_PHYSICAL_ADDRESS_HIGH) 71 + SHADOW_FIELD_RO(GUEST_PHYSICAL_ADDRESS, 
guest_physical_address) 72 + SHADOW_FIELD_RO(GUEST_PHYSICAL_ADDRESS_HIGH, guest_physical_address) 76 73 77 74 #undef SHADOW_FIELD_RO 78 75 #undef SHADOW_FIELD_RW
+253 -198
arch/x86/kvm/vmx/vmx.c
··· 389 389 }; 390 390 391 391 u64 host_efer; 392 + static unsigned long host_idt_base; 392 393 393 394 /* 394 395 * Though SYSCALL is only supported in 64-bit mode on Intel CPUs, kvm ··· 1036 1035 wrmsrl(MSR_IA32_RTIT_CTL, vmx->pt_desc.host.ctl); 1037 1036 } 1038 1037 1038 + void vmx_set_host_fs_gs(struct vmcs_host_state *host, u16 fs_sel, u16 gs_sel, 1039 + unsigned long fs_base, unsigned long gs_base) 1040 + { 1041 + if (unlikely(fs_sel != host->fs_sel)) { 1042 + if (!(fs_sel & 7)) 1043 + vmcs_write16(HOST_FS_SELECTOR, fs_sel); 1044 + else 1045 + vmcs_write16(HOST_FS_SELECTOR, 0); 1046 + host->fs_sel = fs_sel; 1047 + } 1048 + if (unlikely(gs_sel != host->gs_sel)) { 1049 + if (!(gs_sel & 7)) 1050 + vmcs_write16(HOST_GS_SELECTOR, gs_sel); 1051 + else 1052 + vmcs_write16(HOST_GS_SELECTOR, 0); 1053 + host->gs_sel = gs_sel; 1054 + } 1055 + if (unlikely(fs_base != host->fs_base)) { 1056 + vmcs_writel(HOST_FS_BASE, fs_base); 1057 + host->fs_base = fs_base; 1058 + } 1059 + if (unlikely(gs_base != host->gs_base)) { 1060 + vmcs_writel(HOST_GS_BASE, gs_base); 1061 + host->gs_base = gs_base; 1062 + } 1063 + } 1064 + 1039 1065 void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu) 1040 1066 { 1041 1067 struct vcpu_vmx *vmx = to_vmx(vcpu); ··· 1081 1053 * when guest state is loaded. This happens when guest transitions 1082 1054 * to/from long-mode by setting MSR_EFER.LMA. 
1083 1055 */ 1084 - if (!vmx->loaded_cpu_state || vmx->guest_msrs_dirty) { 1085 - vmx->guest_msrs_dirty = false; 1056 + if (!vmx->guest_msrs_ready) { 1057 + vmx->guest_msrs_ready = true; 1086 1058 for (i = 0; i < vmx->save_nmsrs; ++i) 1087 1059 kvm_set_shared_msr(vmx->guest_msrs[i].index, 1088 1060 vmx->guest_msrs[i].data, 1089 1061 vmx->guest_msrs[i].mask); 1090 1062 1091 1063 } 1092 - 1093 - if (vmx->loaded_cpu_state) 1064 + if (vmx->guest_state_loaded) 1094 1065 return; 1095 1066 1096 - vmx->loaded_cpu_state = vmx->loaded_vmcs; 1097 - host_state = &vmx->loaded_cpu_state->host_state; 1067 + host_state = &vmx->loaded_vmcs->host_state; 1098 1068 1099 1069 /* 1100 1070 * Set host fs and gs selectors. Unfortunately, 22.2.3 does not ··· 1126 1100 gs_base = segment_base(gs_sel); 1127 1101 #endif 1128 1102 1129 - if (unlikely(fs_sel != host_state->fs_sel)) { 1130 - if (!(fs_sel & 7)) 1131 - vmcs_write16(HOST_FS_SELECTOR, fs_sel); 1132 - else 1133 - vmcs_write16(HOST_FS_SELECTOR, 0); 1134 - host_state->fs_sel = fs_sel; 1135 - } 1136 - if (unlikely(gs_sel != host_state->gs_sel)) { 1137 - if (!(gs_sel & 7)) 1138 - vmcs_write16(HOST_GS_SELECTOR, gs_sel); 1139 - else 1140 - vmcs_write16(HOST_GS_SELECTOR, 0); 1141 - host_state->gs_sel = gs_sel; 1142 - } 1143 - if (unlikely(fs_base != host_state->fs_base)) { 1144 - vmcs_writel(HOST_FS_BASE, fs_base); 1145 - host_state->fs_base = fs_base; 1146 - } 1147 - if (unlikely(gs_base != host_state->gs_base)) { 1148 - vmcs_writel(HOST_GS_BASE, gs_base); 1149 - host_state->gs_base = gs_base; 1150 - } 1103 + vmx_set_host_fs_gs(host_state, fs_sel, gs_sel, fs_base, gs_base); 1104 + vmx->guest_state_loaded = true; 1151 1105 } 1152 1106 1153 1107 static void vmx_prepare_switch_to_host(struct vcpu_vmx *vmx) 1154 1108 { 1155 1109 struct vmcs_host_state *host_state; 1156 1110 1157 - if (!vmx->loaded_cpu_state) 1111 + if (!vmx->guest_state_loaded) 1158 1112 return; 1159 1113 1160 - WARN_ON_ONCE(vmx->loaded_cpu_state != vmx->loaded_vmcs); 1161 - 
host_state = &vmx->loaded_cpu_state->host_state; 1114 + host_state = &vmx->loaded_vmcs->host_state; 1162 1115 1163 1116 ++vmx->vcpu.stat.host_state_reload; 1164 - vmx->loaded_cpu_state = NULL; 1165 1117 1166 1118 #ifdef CONFIG_X86_64 1167 1119 rdmsrl(MSR_KERNEL_GS_BASE, vmx->msr_guest_kernel_gs_base); ··· 1165 1161 wrmsrl(MSR_KERNEL_GS_BASE, vmx->msr_host_kernel_gs_base); 1166 1162 #endif 1167 1163 load_fixmap_gdt(raw_smp_processor_id()); 1164 + vmx->guest_state_loaded = false; 1165 + vmx->guest_msrs_ready = false; 1168 1166 } 1169 1167 1170 1168 #ifdef CONFIG_X86_64 1171 1169 static u64 vmx_read_guest_kernel_gs_base(struct vcpu_vmx *vmx) 1172 1170 { 1173 1171 preempt_disable(); 1174 - if (vmx->loaded_cpu_state) 1172 + if (vmx->guest_state_loaded) 1175 1173 rdmsrl(MSR_KERNEL_GS_BASE, vmx->msr_guest_kernel_gs_base); 1176 1174 preempt_enable(); 1177 1175 return vmx->msr_guest_kernel_gs_base; ··· 1182 1176 static void vmx_write_guest_kernel_gs_base(struct vcpu_vmx *vmx, u64 data) 1183 1177 { 1184 1178 preempt_disable(); 1185 - if (vmx->loaded_cpu_state) 1179 + if (vmx->guest_state_loaded) 1186 1180 wrmsrl(MSR_KERNEL_GS_BASE, data); 1187 1181 preempt_enable(); 1188 1182 vmx->msr_guest_kernel_gs_base = data; ··· 1231 1225 pi_set_on(pi_desc); 1232 1226 } 1233 1227 1234 - /* 1235 - * Switches to specified vcpu, until a matching vcpu_put(), but assumes 1236 - * vcpu mutex is already taken. 1237 - */ 1238 - void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu) 1228 + void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu) 1239 1229 { 1240 1230 struct vcpu_vmx *vmx = to_vmx(vcpu); 1241 1231 bool already_loaded = vmx->loaded_vmcs->cpu == cpu; ··· 1292 1290 if (kvm_has_tsc_control && 1293 1291 vmx->current_tsc_ratio != vcpu->arch.tsc_scaling_ratio) 1294 1292 decache_tsc_multiplier(vmx); 1293 + } 1294 + 1295 + /* 1296 + * Switches to specified vcpu, until a matching vcpu_put(), but assumes 1297 + * vcpu mutex is already taken. 
1298 + */ 1299 + void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu) 1300 + { 1301 + struct vcpu_vmx *vmx = to_vmx(vcpu); 1302 + 1303 + vmx_vcpu_load_vmcs(vcpu, cpu); 1295 1304 1296 1305 vmx_vcpu_pi_load(vcpu, cpu); 1306 + 1297 1307 vmx->host_pkru = read_pkru(); 1298 1308 vmx->host_debugctlmsr = get_debugctlmsr(); 1299 1309 } ··· 1324 1310 pi_set_sn(pi_desc); 1325 1311 } 1326 1312 1327 - void vmx_vcpu_put(struct kvm_vcpu *vcpu) 1313 + static void vmx_vcpu_put(struct kvm_vcpu *vcpu) 1328 1314 { 1329 1315 vmx_vcpu_pi_put(vcpu); 1330 1316 ··· 1593 1579 move_msr_up(vmx, index, save_nmsrs++); 1594 1580 1595 1581 vmx->save_nmsrs = save_nmsrs; 1596 - vmx->guest_msrs_dirty = true; 1582 + vmx->guest_msrs_ready = false; 1597 1583 1598 1584 if (cpu_has_vmx_msr_bitmap()) 1599 1585 vmx_update_msr_bitmap(&vmx->vcpu); ··· 1706 1692 case MSR_IA32_SYSENTER_ESP: 1707 1693 msr_info->data = vmcs_readl(GUEST_SYSENTER_ESP); 1708 1694 break; 1709 - case MSR_IA32_POWER_CTL: 1710 - msr_info->data = vmx->msr_ia32_power_ctl; 1711 - break; 1712 1695 case MSR_IA32_BNDCFGS: 1713 1696 if (!kvm_mpx_supported() || 1714 1697 (!msr_info->host_initiated && ··· 1729 1718 return vmx_get_vmx_msr(&vmx->nested.msrs, msr_info->index, 1730 1719 &msr_info->data); 1731 1720 case MSR_IA32_XSS: 1732 - if (!vmx_xsaves_supported()) 1721 + if (!vmx_xsaves_supported() || 1722 + (!msr_info->host_initiated && 1723 + !(guest_cpuid_has(vcpu, X86_FEATURE_XSAVE) && 1724 + guest_cpuid_has(vcpu, X86_FEATURE_XSAVES)))) 1733 1725 return 1; 1734 1726 msr_info->data = vcpu->arch.ia32_xss; 1735 1727 break; ··· 1831 1817 break; 1832 1818 #endif 1833 1819 case MSR_IA32_SYSENTER_CS: 1820 + if (is_guest_mode(vcpu)) 1821 + get_vmcs12(vcpu)->guest_sysenter_cs = data; 1834 1822 vmcs_write32(GUEST_SYSENTER_CS, data); 1835 1823 break; 1836 1824 case MSR_IA32_SYSENTER_EIP: 1825 + if (is_guest_mode(vcpu)) 1826 + get_vmcs12(vcpu)->guest_sysenter_eip = data; 1837 1827 vmcs_writel(GUEST_SYSENTER_EIP, data); 1838 1828 break; 1839 1829 case 
MSR_IA32_SYSENTER_ESP: 1830 + if (is_guest_mode(vcpu)) 1831 + get_vmcs12(vcpu)->guest_sysenter_esp = data; 1840 1832 vmcs_writel(GUEST_SYSENTER_ESP, data); 1841 1833 break; 1842 - case MSR_IA32_POWER_CTL: 1843 - vmx->msr_ia32_power_ctl = data; 1834 + case MSR_IA32_DEBUGCTLMSR: 1835 + if (is_guest_mode(vcpu) && get_vmcs12(vcpu)->vm_exit_controls & 1836 + VM_EXIT_SAVE_DEBUG_CONTROLS) 1837 + get_vmcs12(vcpu)->guest_ia32_debugctl = data; 1838 + 1839 + ret = kvm_set_msr_common(vcpu, msr_info); 1844 1840 break; 1841 + 1845 1842 case MSR_IA32_BNDCFGS: 1846 1843 if (!kvm_mpx_supported() || 1847 1844 (!msr_info->host_initiated && ··· 1921 1896 MSR_TYPE_W); 1922 1897 break; 1923 1898 case MSR_IA32_CR_PAT: 1899 + if (!kvm_pat_valid(data)) 1900 + return 1; 1901 + 1902 + if (is_guest_mode(vcpu) && 1903 + get_vmcs12(vcpu)->vm_exit_controls & VM_EXIT_SAVE_IA32_PAT) 1904 + get_vmcs12(vcpu)->guest_ia32_pat = data; 1905 + 1924 1906 if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) { 1925 - if (!kvm_pat_valid(data)) 1926 - return 1; 1927 1907 vmcs_write64(GUEST_IA32_PAT, data); 1928 1908 vcpu->arch.pat = data; 1929 1909 break; ··· 1962 1932 return 1; 1963 1933 return vmx_set_vmx_msr(vcpu, msr_index, data); 1964 1934 case MSR_IA32_XSS: 1965 - if (!vmx_xsaves_supported()) 1935 + if (!vmx_xsaves_supported() || 1936 + (!msr_info->host_initiated && 1937 + !(guest_cpuid_has(vcpu, X86_FEATURE_XSAVE) && 1938 + guest_cpuid_has(vcpu, X86_FEATURE_XSAVES)))) 1966 1939 return 1; 1967 1940 /* 1968 1941 * The only supported bit as of Skylake is bit 8, but ··· 2468 2435 return -ENOMEM; 2469 2436 2470 2437 loaded_vmcs->shadow_vmcs = NULL; 2438 + loaded_vmcs->hv_timer_soft_disabled = false; 2471 2439 loaded_vmcs_init(loaded_vmcs); 2472 2440 2473 2441 if (cpu_has_vmx_msr_bitmap()) { ··· 2489 2455 } 2490 2456 2491 2457 memset(&loaded_vmcs->host_state, 0, sizeof(struct vmcs_host_state)); 2458 + memset(&loaded_vmcs->controls_shadow, 0, 2459 + sizeof(struct vmcs_controls_shadow)); 2492 2460 2493 2461 
return 0; 2494 2462 ··· 2773 2737 (unsigned long *)&vcpu->arch.regs_dirty)) 2774 2738 return; 2775 2739 2776 - if (is_paging(vcpu) && is_pae(vcpu) && !is_long_mode(vcpu)) { 2740 + if (is_pae_paging(vcpu)) { 2777 2741 vmcs_write64(GUEST_PDPTR0, mmu->pdptrs[0]); 2778 2742 vmcs_write64(GUEST_PDPTR1, mmu->pdptrs[1]); 2779 2743 vmcs_write64(GUEST_PDPTR2, mmu->pdptrs[2]); ··· 2785 2749 { 2786 2750 struct kvm_mmu *mmu = vcpu->arch.walk_mmu; 2787 2751 2788 - if (is_paging(vcpu) && is_pae(vcpu) && !is_long_mode(vcpu)) { 2752 + if (is_pae_paging(vcpu)) { 2789 2753 mmu->pdptrs[0] = vmcs_read64(GUEST_PDPTR0); 2790 2754 mmu->pdptrs[1] = vmcs_read64(GUEST_PDPTR1); 2791 2755 mmu->pdptrs[2] = vmcs_read64(GUEST_PDPTR2); ··· 2802 2766 unsigned long cr0, 2803 2767 struct kvm_vcpu *vcpu) 2804 2768 { 2769 + struct vcpu_vmx *vmx = to_vmx(vcpu); 2770 + 2805 2771 if (!test_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail)) 2806 2772 vmx_decache_cr3(vcpu); 2807 2773 if (!(cr0 & X86_CR0_PG)) { 2808 2774 /* From paging/starting to nonpaging */ 2809 - vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, 2810 - vmcs_read32(CPU_BASED_VM_EXEC_CONTROL) | 2811 - (CPU_BASED_CR3_LOAD_EXITING | 2812 - CPU_BASED_CR3_STORE_EXITING)); 2775 + exec_controls_setbit(vmx, CPU_BASED_CR3_LOAD_EXITING | 2776 + CPU_BASED_CR3_STORE_EXITING); 2813 2777 vcpu->arch.cr0 = cr0; 2814 2778 vmx_set_cr4(vcpu, kvm_read_cr4(vcpu)); 2815 2779 } else if (!is_paging(vcpu)) { 2816 2780 /* From nonpaging to paging */ 2817 - vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, 2818 - vmcs_read32(CPU_BASED_VM_EXEC_CONTROL) & 2819 - ~(CPU_BASED_CR3_LOAD_EXITING | 2820 - CPU_BASED_CR3_STORE_EXITING)); 2781 + exec_controls_clearbit(vmx, CPU_BASED_CR3_LOAD_EXITING | 2782 + CPU_BASED_CR3_STORE_EXITING); 2821 2783 vcpu->arch.cr0 = cr0; 2822 2784 vmx_set_cr4(vcpu, kvm_read_cr4(vcpu)); 2823 2785 } ··· 2915 2881 2916 2882 int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4) 2917 2883 { 2884 + struct vcpu_vmx *vmx = to_vmx(vcpu); 2918 2885 /* 2919 2886 * 
Pass through host's Machine Check Enable value to hw_cr4, which 2920 2887 * is in force while we are in guest mode. Do not let guests control ··· 2926 2891 hw_cr4 = (cr4_read_shadow() & X86_CR4_MCE) | (cr4 & ~X86_CR4_MCE); 2927 2892 if (enable_unrestricted_guest) 2928 2893 hw_cr4 |= KVM_VM_CR4_ALWAYS_ON_UNRESTRICTED_GUEST; 2929 - else if (to_vmx(vcpu)->rmode.vm86_active) 2894 + else if (vmx->rmode.vm86_active) 2930 2895 hw_cr4 |= KVM_RMODE_VM_CR4_ALWAYS_ON; 2931 2896 else 2932 2897 hw_cr4 |= KVM_PMODE_VM_CR4_ALWAYS_ON; 2933 2898 2934 2899 if (!boot_cpu_has(X86_FEATURE_UMIP) && vmx_umip_emulated()) { 2935 2900 if (cr4 & X86_CR4_UMIP) { 2936 - vmcs_set_bits(SECONDARY_VM_EXEC_CONTROL, 2937 - SECONDARY_EXEC_DESC); 2901 + secondary_exec_controls_setbit(vmx, SECONDARY_EXEC_DESC); 2938 2902 hw_cr4 &= ~X86_CR4_UMIP; 2939 2903 } else if (!is_guest_mode(vcpu) || 2940 - !nested_cpu_has2(get_vmcs12(vcpu), SECONDARY_EXEC_DESC)) 2941 - vmcs_clear_bits(SECONDARY_VM_EXEC_CONTROL, 2942 - SECONDARY_EXEC_DESC); 2904 + !nested_cpu_has2(get_vmcs12(vcpu), SECONDARY_EXEC_DESC)) { 2905 + secondary_exec_controls_clearbit(vmx, SECONDARY_EXEC_DESC); 2906 + } 2943 2907 } 2944 2908 2945 2909 if (cr4 & X86_CR4_VMXE) { ··· 2953 2919 return 1; 2954 2920 } 2955 2921 2956 - if (to_vmx(vcpu)->nested.vmxon && !nested_cr4_valid(vcpu, cr4)) 2922 + if (vmx->nested.vmxon && !nested_cr4_valid(vcpu, cr4)) 2957 2923 return 1; 2958 2924 2959 2925 vcpu->arch.cr4 = cr4; ··· 3571 3537 u8 mode = 0; 3572 3538 3573 3539 if (cpu_has_secondary_exec_ctrls() && 3574 - (vmcs_read32(SECONDARY_VM_EXEC_CONTROL) & 3540 + (secondary_exec_controls_get(to_vmx(vcpu)) & 3575 3541 SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE)) { 3576 3542 mode |= MSR_BITMAP_MODE_X2APIC; 3577 3543 if (enable_apicv && kvm_vcpu_apicv_active(vcpu)) ··· 3765 3731 { 3766 3732 u32 low32, high32; 3767 3733 unsigned long tmpl; 3768 - struct desc_ptr dt; 3769 3734 unsigned long cr0, cr3, cr4; 3770 3735 3771 3736 cr0 = read_cr0(); ··· 3800 3767 
vmcs_write16(HOST_SS_SELECTOR, __KERNEL_DS); /* 22.2.4 */ 3801 3768 vmcs_write16(HOST_TR_SELECTOR, GDT_ENTRY_TSS*8); /* 22.2.4 */ 3802 3769 3803 - store_idt(&dt); 3804 - vmcs_writel(HOST_IDTR_BASE, dt.address); /* 22.2.4 */ 3805 - vmx->host_idt_base = dt.address; 3770 + vmcs_writel(HOST_IDTR_BASE, host_idt_base); /* 22.2.4 */ 3806 3771 3807 3772 vmcs_writel(HOST_RIP, (unsigned long)vmx_vmexit); /* 22.2.5 */ 3808 3773 ··· 3829 3798 vmcs_writel(CR4_GUEST_HOST_MASK, ~vmx->vcpu.arch.cr4_guest_owned_bits); 3830 3799 } 3831 3800 3832 - static u32 vmx_pin_based_exec_ctrl(struct vcpu_vmx *vmx) 3801 + u32 vmx_pin_based_exec_ctrl(struct vcpu_vmx *vmx) 3833 3802 { 3834 3803 u32 pin_based_exec_ctrl = vmcs_config.pin_based_exec_ctrl; 3835 3804 ··· 3839 3808 if (!enable_vnmi) 3840 3809 pin_based_exec_ctrl &= ~PIN_BASED_VIRTUAL_NMIS; 3841 3810 3842 - /* Enable the preemption timer dynamically */ 3843 - pin_based_exec_ctrl &= ~PIN_BASED_VMX_PREEMPTION_TIMER; 3811 + if (!enable_preemption_timer) 3812 + pin_based_exec_ctrl &= ~PIN_BASED_VMX_PREEMPTION_TIMER; 3813 + 3844 3814 return pin_based_exec_ctrl; 3845 3815 } 3846 3816 ··· 3849 3817 { 3850 3818 struct vcpu_vmx *vmx = to_vmx(vcpu); 3851 3819 3852 - vmcs_write32(PIN_BASED_VM_EXEC_CONTROL, vmx_pin_based_exec_ctrl(vmx)); 3820 + pin_controls_set(vmx, vmx_pin_based_exec_ctrl(vmx)); 3853 3821 if (cpu_has_secondary_exec_ctrls()) { 3854 3822 if (kvm_vcpu_apicv_active(vcpu)) 3855 - vmcs_set_bits(SECONDARY_VM_EXEC_CONTROL, 3823 + secondary_exec_controls_setbit(vmx, 3856 3824 SECONDARY_EXEC_APIC_REGISTER_VIRT | 3857 3825 SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY); 3858 3826 else 3859 - vmcs_clear_bits(SECONDARY_VM_EXEC_CONTROL, 3827 + secondary_exec_controls_clearbit(vmx, 3860 3828 SECONDARY_EXEC_APIC_REGISTER_VIRT | 3861 3829 SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY); 3862 3830 } ··· 4047 4015 vmcs_write64(VMCS_LINK_POINTER, -1ull); /* 22.3.1.5 */ 4048 4016 4049 4017 /* Control */ 4050 - vmcs_write32(PIN_BASED_VM_EXEC_CONTROL, 
vmx_pin_based_exec_ctrl(vmx)); 4018 + pin_controls_set(vmx, vmx_pin_based_exec_ctrl(vmx)); 4051 4019 vmx->hv_deadline_tsc = -1; 4052 4020 4053 - vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, vmx_exec_control(vmx)); 4021 + exec_controls_set(vmx, vmx_exec_control(vmx)); 4054 4022 4055 4023 if (cpu_has_secondary_exec_ctrls()) { 4056 4024 vmx_compute_secondary_exec_control(vmx); 4057 - vmcs_write32(SECONDARY_VM_EXEC_CONTROL, 4058 - vmx->secondary_exec_control); 4025 + secondary_exec_controls_set(vmx, vmx->secondary_exec_control); 4059 4026 } 4060 4027 4061 4028 if (kvm_vcpu_apicv_active(&vmx->vcpu)) { ··· 4112 4081 ++vmx->nmsrs; 4113 4082 } 4114 4083 4115 - vm_exit_controls_init(vmx, vmx_vmexit_ctrl()); 4084 + vm_exit_controls_set(vmx, vmx_vmexit_ctrl()); 4116 4085 4117 4086 /* 22.2.1, 20.8.1 */ 4118 - vm_entry_controls_init(vmx, vmx_vmentry_ctrl()); 4087 + vm_entry_controls_set(vmx, vmx_vmentry_ctrl()); 4119 4088 4120 4089 vmx->vcpu.arch.cr0_guest_owned_bits = X86_CR0_TS; 4121 4090 vmcs_writel(CR0_GUEST_HOST_MASK, ~X86_CR0_TS); ··· 4239 4208 4240 4209 static void enable_irq_window(struct kvm_vcpu *vcpu) 4241 4210 { 4242 - vmcs_set_bits(CPU_BASED_VM_EXEC_CONTROL, 4243 - CPU_BASED_VIRTUAL_INTR_PENDING); 4211 + exec_controls_setbit(to_vmx(vcpu), CPU_BASED_VIRTUAL_INTR_PENDING); 4244 4212 } 4245 4213 4246 4214 static void enable_nmi_window(struct kvm_vcpu *vcpu) ··· 4250 4220 return; 4251 4221 } 4252 4222 4253 - vmcs_set_bits(CPU_BASED_VM_EXEC_CONTROL, 4254 - CPU_BASED_VIRTUAL_NMI_PENDING); 4223 + exec_controls_setbit(to_vmx(vcpu), CPU_BASED_VIRTUAL_NMI_PENDING); 4255 4224 } 4256 4225 4257 4226 static void vmx_inject_irq(struct kvm_vcpu *vcpu) ··· 4471 4442 4472 4443 static int handle_machine_check(struct kvm_vcpu *vcpu) 4473 4444 { 4474 - /* already handled by vcpu_run */ 4445 + /* handled by vmx_vcpu_run() */ 4475 4446 return 1; 4476 4447 } 4477 4448 4478 - static int handle_exception(struct kvm_vcpu *vcpu) 4449 + static int handle_exception_nmi(struct kvm_vcpu *vcpu) 4479 
4450 { 4480 4451 struct vcpu_vmx *vmx = to_vmx(vcpu); 4481 4452 struct kvm_run *kvm_run = vcpu->run; ··· 4487 4458 vect_info = vmx->idt_vectoring_info; 4488 4459 intr_info = vmx->exit_intr_info; 4489 4460 4490 - if (is_machine_check(intr_info)) 4491 - return handle_machine_check(vcpu); 4492 - 4493 - if (is_nmi(intr_info)) 4494 - return 1; /* already handled by vmx_vcpu_run() */ 4461 + if (is_machine_check(intr_info) || is_nmi(intr_info)) 4462 + return 1; /* handled by handle_exception_nmi_irqoff() */ 4495 4463 4496 4464 if (is_invalid_opcode(intr_info)) 4497 4465 return handle_ud(vcpu); ··· 4544 4518 dr6 = vmcs_readl(EXIT_QUALIFICATION); 4545 4519 if (!(vcpu->guest_debug & 4546 4520 (KVM_GUESTDBG_SINGLESTEP | KVM_GUESTDBG_USE_HW_BP))) { 4547 - vcpu->arch.dr6 &= ~15; 4521 + vcpu->arch.dr6 &= ~DR_TRAP_BITS; 4548 4522 vcpu->arch.dr6 |= dr6 | DR6_RTM; 4549 4523 if (is_icebp(intr_info)) 4550 4524 skip_emulated_instruction(vcpu); ··· 4789 4763 vcpu->run->exit_reason = KVM_EXIT_DEBUG; 4790 4764 return 0; 4791 4765 } else { 4792 - vcpu->arch.dr6 &= ~15; 4766 + vcpu->arch.dr6 &= ~DR_TRAP_BITS; 4793 4767 vcpu->arch.dr6 |= DR6_BD | DR6_RTM; 4794 4768 kvm_queue_exception(vcpu, DB_VECTOR); 4795 4769 return 1; ··· 4797 4771 } 4798 4772 4799 4773 if (vcpu->guest_debug == 0) { 4800 - vmcs_clear_bits(CPU_BASED_VM_EXEC_CONTROL, 4801 - CPU_BASED_MOV_DR_EXITING); 4774 + exec_controls_clearbit(to_vmx(vcpu), CPU_BASED_MOV_DR_EXITING); 4802 4775 4803 4776 /* 4804 4777 * No more DR vmexits; force a reload of the debug registers ··· 4841 4816 vcpu->arch.dr7 = vmcs_readl(GUEST_DR7); 4842 4817 4843 4818 vcpu->arch.switch_db_regs &= ~KVM_DEBUGREG_WONT_EXIT; 4844 - vmcs_set_bits(CPU_BASED_VM_EXEC_CONTROL, CPU_BASED_MOV_DR_EXITING); 4819 + exec_controls_setbit(to_vmx(vcpu), CPU_BASED_MOV_DR_EXITING); 4845 4820 } 4846 4821 4847 4822 static void vmx_set_dr7(struct kvm_vcpu *vcpu, unsigned long val) ··· 4901 4876 4902 4877 static int handle_interrupt_window(struct kvm_vcpu *vcpu) 4903 4878 { 4904 
- vmcs_clear_bits(CPU_BASED_VM_EXEC_CONTROL, 4905 - CPU_BASED_VIRTUAL_INTR_PENDING); 4879 + exec_controls_clearbit(to_vmx(vcpu), CPU_BASED_VIRTUAL_INTR_PENDING); 4906 4880 4907 4881 kvm_make_request(KVM_REQ_EVENT, vcpu); 4908 4882 ··· 5155 5131 static int handle_nmi_window(struct kvm_vcpu *vcpu) 5156 5132 { 5157 5133 WARN_ON_ONCE(!enable_vnmi); 5158 - vmcs_clear_bits(CPU_BASED_VM_EXEC_CONTROL, 5159 - CPU_BASED_VIRTUAL_NMI_PENDING); 5134 + exec_controls_clearbit(to_vmx(vcpu), CPU_BASED_VIRTUAL_NMI_PENDING); 5160 5135 ++vcpu->stat.nmi_window_exits; 5161 5136 kvm_make_request(KVM_REQ_EVENT, vcpu); 5162 5137 ··· 5167 5144 struct vcpu_vmx *vmx = to_vmx(vcpu); 5168 5145 enum emulation_result err = EMULATE_DONE; 5169 5146 int ret = 1; 5170 - u32 cpu_exec_ctrl; 5171 5147 bool intr_window_requested; 5172 5148 unsigned count = 130; 5173 5149 ··· 5177 5155 */ 5178 5156 WARN_ON_ONCE(vmx->emulation_required && vmx->nested.nested_run_pending); 5179 5157 5180 - cpu_exec_ctrl = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL); 5181 - intr_window_requested = cpu_exec_ctrl & CPU_BASED_VIRTUAL_INTR_PENDING; 5158 + intr_window_requested = exec_controls_get(vmx) & 5159 + CPU_BASED_VIRTUAL_INTR_PENDING; 5182 5160 5183 5161 while (vmx->emulation_required && count-- != 0) { 5184 5162 if (intr_window_requested && vmx_interrupt_allowed(vcpu)) ··· 5364 5342 * is read even if it isn't needed (e.g., for type==all) 5365 5343 */ 5366 5344 if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION), 5367 - vmx_instruction_info, false, &gva)) 5345 + vmx_instruction_info, false, 5346 + sizeof(operand), &gva)) 5368 5347 return 1; 5369 5348 5370 5349 if (kvm_read_guest_virt(vcpu, gva, &operand, sizeof(operand), &e)) { ··· 5460 5437 5461 5438 static int handle_preemption_timer(struct kvm_vcpu *vcpu) 5462 5439 { 5463 - if (!to_vmx(vcpu)->req_immediate_exit) 5440 + struct vcpu_vmx *vmx = to_vmx(vcpu); 5441 + 5442 + if (!vmx->req_immediate_exit && 5443 + !unlikely(vmx->loaded_vmcs->hv_timer_soft_disabled)) 5464 
5444 kvm_lapic_expired_hv_timer(vcpu); 5445 + 5465 5446 return 1; 5466 5447 } 5467 5448 ··· 5496 5469 * to be done to userspace and return 0. 5497 5470 */ 5498 5471 static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = { 5499 - [EXIT_REASON_EXCEPTION_NMI] = handle_exception, 5472 + [EXIT_REASON_EXCEPTION_NMI] = handle_exception_nmi, 5500 5473 [EXIT_REASON_EXTERNAL_INTERRUPT] = handle_external_interrupt, 5501 5474 [EXIT_REASON_TRIPLE_FAULT] = handle_triple_fault, 5502 5475 [EXIT_REASON_NMI_WINDOW] = handle_nmi_window, ··· 5979 5952 5980 5953 void vmx_set_virtual_apic_mode(struct kvm_vcpu *vcpu) 5981 5954 { 5955 + struct vcpu_vmx *vmx = to_vmx(vcpu); 5982 5956 u32 sec_exec_control; 5983 5957 5984 5958 if (!lapic_in_kernel(vcpu)) ··· 5991 5963 5992 5964 /* Postpone execution until vmcs01 is the current VMCS. */ 5993 5965 if (is_guest_mode(vcpu)) { 5994 - to_vmx(vcpu)->nested.change_vmcs01_virtual_apic_mode = true; 5966 + vmx->nested.change_vmcs01_virtual_apic_mode = true; 5995 5967 return; 5996 5968 } 5997 5969 5998 - sec_exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL); 5970 + sec_exec_control = secondary_exec_controls_get(vmx); 5999 5971 sec_exec_control &= ~(SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES | 6000 5972 SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE); 6001 5973 ··· 6017 5989 SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE; 6018 5990 break; 6019 5991 } 6020 - vmcs_write32(SECONDARY_VM_EXEC_CONTROL, sec_exec_control); 5992 + secondary_exec_controls_set(vmx, sec_exec_control); 6021 5993 6022 5994 vmx_update_msr_bitmap(vcpu); 6023 5995 } ··· 6135 6107 memset(vmx->pi_desc.pir, 0, sizeof(vmx->pi_desc.pir)); 6136 6108 } 6137 6109 6138 - static void vmx_complete_atomic_exit(struct vcpu_vmx *vmx) 6110 + static void handle_exception_nmi_irqoff(struct vcpu_vmx *vmx) 6139 6111 { 6140 - u32 exit_intr_info = 0; 6141 - u16 basic_exit_reason = (u16)vmx->exit_reason; 6142 - 6143 - if (!(basic_exit_reason == EXIT_REASON_MCE_DURING_VMENTRY 6144 - || basic_exit_reason == 
EXIT_REASON_EXCEPTION_NMI)) 6145 - return; 6146 - 6147 - if (!(vmx->exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY)) 6148 - exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO); 6149 - vmx->exit_intr_info = exit_intr_info; 6112 + vmx->exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO); 6150 6113 6151 6114 /* if exit due to PF check for async PF */ 6152 - if (is_page_fault(exit_intr_info)) 6115 + if (is_page_fault(vmx->exit_intr_info)) 6153 6116 vmx->vcpu.arch.apf.host_apf_reason = kvm_read_and_reset_pf_reason(); 6154 6117 6155 6118 /* Handle machine checks before interrupts are enabled */ 6156 - if (basic_exit_reason == EXIT_REASON_MCE_DURING_VMENTRY || 6157 - is_machine_check(exit_intr_info)) 6119 + if (is_machine_check(vmx->exit_intr_info)) 6158 6120 kvm_machine_check(); 6159 6121 6160 6122 /* We need to handle NMIs before interrupts are enabled */ 6161 - if (is_nmi(exit_intr_info)) { 6123 + if (is_nmi(vmx->exit_intr_info)) { 6162 6124 kvm_before_interrupt(&vmx->vcpu); 6163 6125 asm("int $2"); 6164 6126 kvm_after_interrupt(&vmx->vcpu); 6165 6127 } 6166 6128 } 6167 6129 6168 - static void vmx_handle_external_intr(struct kvm_vcpu *vcpu) 6130 + static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu) 6169 6131 { 6170 - u32 exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO); 6132 + unsigned int vector; 6133 + unsigned long entry; 6134 + #ifdef CONFIG_X86_64 6135 + unsigned long tmp; 6136 + #endif 6137 + gate_desc *desc; 6138 + u32 intr_info; 6171 6139 6172 - if ((exit_intr_info & (INTR_INFO_VALID_MASK | INTR_INFO_INTR_TYPE_MASK)) 6173 - == (INTR_INFO_VALID_MASK | INTR_TYPE_EXT_INTR)) { 6174 - unsigned int vector; 6175 - unsigned long entry; 6176 - gate_desc *desc; 6177 - struct vcpu_vmx *vmx = to_vmx(vcpu); 6178 - #ifdef CONFIG_X86_64 6179 - unsigned long tmp; 6180 - #endif 6140 + intr_info = vmcs_read32(VM_EXIT_INTR_INFO); 6141 + if (WARN_ONCE(!is_external_intr(intr_info), 6142 + "KVM: unexpected VM-Exit interrupt info: 0x%x", intr_info)) 6143 + return; 6181 6144 
6182 - vector = exit_intr_info & INTR_INFO_VECTOR_MASK; 6183 - desc = (gate_desc *)vmx->host_idt_base + vector; 6184 - entry = gate_offset(desc); 6185 - asm volatile( 6145 + vector = intr_info & INTR_INFO_VECTOR_MASK; 6146 + desc = (gate_desc *)host_idt_base + vector; 6147 + entry = gate_offset(desc); 6148 + 6149 + kvm_before_interrupt(vcpu); 6150 + 6151 + asm volatile( 6186 6152 #ifdef CONFIG_X86_64 6187 - "mov %%" _ASM_SP ", %[sp]\n\t" 6188 - "and $0xfffffffffffffff0, %%" _ASM_SP "\n\t" 6189 - "push $%c[ss]\n\t" 6190 - "push %[sp]\n\t" 6153 + "mov %%" _ASM_SP ", %[sp]\n\t" 6154 + "and $0xfffffffffffffff0, %%" _ASM_SP "\n\t" 6155 + "push $%c[ss]\n\t" 6156 + "push %[sp]\n\t" 6191 6157 #endif 6192 - "pushf\n\t" 6193 - __ASM_SIZE(push) " $%c[cs]\n\t" 6194 - CALL_NOSPEC 6195 - : 6158 + "pushf\n\t" 6159 + __ASM_SIZE(push) " $%c[cs]\n\t" 6160 + CALL_NOSPEC 6161 + : 6196 6162 #ifdef CONFIG_X86_64 6197 - [sp]"=&r"(tmp), 6163 + [sp]"=&r"(tmp), 6198 6164 #endif 6199 - ASM_CALL_CONSTRAINT 6200 - : 6201 - THUNK_TARGET(entry), 6202 - [ss]"i"(__KERNEL_DS), 6203 - [cs]"i"(__KERNEL_CS) 6204 - ); 6205 - } 6165 + ASM_CALL_CONSTRAINT 6166 + : 6167 + THUNK_TARGET(entry), 6168 + [ss]"i"(__KERNEL_DS), 6169 + [cs]"i"(__KERNEL_CS) 6170 + ); 6171 + 6172 + kvm_after_interrupt(vcpu); 6206 6173 } 6207 - STACK_FRAME_NON_STANDARD(vmx_handle_external_intr); 6174 + STACK_FRAME_NON_STANDARD(handle_external_interrupt_irqoff); 6175 + 6176 + static void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu) 6177 + { 6178 + struct vcpu_vmx *vmx = to_vmx(vcpu); 6179 + 6180 + if (vmx->exit_reason == EXIT_REASON_EXTERNAL_INTERRUPT) 6181 + handle_external_interrupt_irqoff(vcpu); 6182 + else if (vmx->exit_reason == EXIT_REASON_EXCEPTION_NMI) 6183 + handle_exception_nmi_irqoff(vmx); 6184 + } 6208 6185 6209 6186 static bool vmx_has_emulated_msr(int index) 6210 6187 { ··· 6220 6187 * real mode. 6221 6188 */ 6222 6189 return enable_unrestricted_guest || emulate_invalid_guest_state; 6190 + case MSR_IA32_VMX_BASIC ... 
MSR_IA32_VMX_VMFUNC: 6191 + return nested; 6223 6192 case MSR_AMD64_VIRT_SPEC_CTRL: 6224 6193 /* This is AMD only. */ 6225 6194 return false; ··· 6367 6332 msrs[i].host, false); 6368 6333 } 6369 6334 6370 - static void vmx_arm_hv_timer(struct vcpu_vmx *vmx, u32 val) 6371 - { 6372 - vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, val); 6373 - if (!vmx->loaded_vmcs->hv_timer_armed) 6374 - vmcs_set_bits(PIN_BASED_VM_EXEC_CONTROL, 6375 - PIN_BASED_VMX_PREEMPTION_TIMER); 6376 - vmx->loaded_vmcs->hv_timer_armed = true; 6377 - } 6378 - 6379 6335 static void vmx_update_hv_timer(struct kvm_vcpu *vcpu) 6380 6336 { 6381 6337 struct vcpu_vmx *vmx = to_vmx(vcpu); ··· 6374 6348 u32 delta_tsc; 6375 6349 6376 6350 if (vmx->req_immediate_exit) { 6377 - vmx_arm_hv_timer(vmx, 0); 6378 - return; 6379 - } 6380 - 6381 - if (vmx->hv_deadline_tsc != -1) { 6351 + vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, 0); 6352 + vmx->loaded_vmcs->hv_timer_soft_disabled = false; 6353 + } else if (vmx->hv_deadline_tsc != -1) { 6382 6354 tscl = rdtsc(); 6383 6355 if (vmx->hv_deadline_tsc > tscl) 6384 6356 /* set_hv_timer ensures the delta fits in 32-bits */ ··· 6385 6361 else 6386 6362 delta_tsc = 0; 6387 6363 6388 - vmx_arm_hv_timer(vmx, delta_tsc); 6389 - return; 6364 + vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, delta_tsc); 6365 + vmx->loaded_vmcs->hv_timer_soft_disabled = false; 6366 + } else if (!vmx->loaded_vmcs->hv_timer_soft_disabled) { 6367 + vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, -1); 6368 + vmx->loaded_vmcs->hv_timer_soft_disabled = true; 6390 6369 } 6391 - 6392 - if (vmx->loaded_vmcs->hv_timer_armed) 6393 - vmcs_clear_bits(PIN_BASED_VM_EXEC_CONTROL, 6394 - PIN_BASED_VMX_PREEMPTION_TIMER); 6395 - vmx->loaded_vmcs->hv_timer_armed = false; 6396 6370 } 6397 6371 6398 6372 void vmx_update_host_rsp(struct vcpu_vmx *vmx, unsigned long host_rsp) ··· 6423 6401 vmcs_write32(PLE_WINDOW, vmx->ple_window); 6424 6402 } 6425 6403 6426 - if (vmx->nested.need_vmcs12_sync) 6427 - nested_sync_from_vmcs12(vcpu); 6404 + if 
(vmx->nested.need_vmcs12_to_shadow_sync) 6405 + nested_sync_vmcs12_to_shadow(vcpu); 6428 6406 6429 6407 if (test_bit(VCPU_REGS_RSP, (unsigned long *)&vcpu->arch.regs_dirty)) 6430 6408 vmcs_writel(GUEST_RSP, vcpu->arch.regs[VCPU_REGS_RSP]); ··· 6462 6440 6463 6441 atomic_switch_perf_msrs(vmx); 6464 6442 6465 - vmx_update_hv_timer(vcpu); 6443 + if (enable_preemption_timer) 6444 + vmx_update_hv_timer(vcpu); 6445 + 6446 + if (lapic_in_kernel(vcpu) && 6447 + vcpu->arch.apic->lapic_timer.timer_advance_ns) 6448 + kvm_wait_lapic_expire(vcpu); 6466 6449 6467 6450 /* 6468 6451 * If this vCPU has touched SPEC_CTRL, restore the guest's value if ··· 6560 6533 vmx->idt_vectoring_info = 0; 6561 6534 6562 6535 vmx->exit_reason = vmx->fail ? 0xdead : vmcs_read32(VM_EXIT_REASON); 6536 + if ((u16)vmx->exit_reason == EXIT_REASON_MCE_DURING_VMENTRY) 6537 + kvm_machine_check(); 6538 + 6563 6539 if (vmx->fail || (vmx->exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY)) 6564 6540 return; 6565 6541 6566 6542 vmx->loaded_vmcs->launched = 1; 6567 6543 vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD); 6568 6544 6569 - vmx_complete_atomic_exit(vmx); 6570 6545 vmx_recover_nmi_blocking(vmx); 6571 6546 vmx_complete_interrupts(vmx); 6572 6547 } ··· 6659 6630 vmx_disable_intercept_for_msr(msr_bitmap, MSR_IA32_SYSENTER_CS, MSR_TYPE_RW); 6660 6631 vmx_disable_intercept_for_msr(msr_bitmap, MSR_IA32_SYSENTER_ESP, MSR_TYPE_RW); 6661 6632 vmx_disable_intercept_for_msr(msr_bitmap, MSR_IA32_SYSENTER_EIP, MSR_TYPE_RW); 6633 + if (kvm_cstate_in_guest(kvm)) { 6634 + vmx_disable_intercept_for_msr(msr_bitmap, MSR_CORE_C1_RES, MSR_TYPE_R); 6635 + vmx_disable_intercept_for_msr(msr_bitmap, MSR_CORE_C3_RESIDENCY, MSR_TYPE_R); 6636 + vmx_disable_intercept_for_msr(msr_bitmap, MSR_CORE_C6_RESIDENCY, MSR_TYPE_R); 6637 + vmx_disable_intercept_for_msr(msr_bitmap, MSR_CORE_C7_RESIDENCY, MSR_TYPE_R); 6638 + } 6662 6639 vmx->msr_bitmap_mode = 0; 6663 6640 6664 6641 vmx->loaded_vmcs = &vmx->vmcs01; ··· 6761 6726 
return 0; 6762 6727 } 6763 6728 6764 - static void __init vmx_check_processor_compat(void *rtn) 6729 + static int __init vmx_check_processor_compat(void) 6765 6730 { 6766 6731 struct vmcs_config vmcs_conf; 6767 6732 struct vmx_capability vmx_cap; 6768 6733 6769 - *(int *)rtn = 0; 6770 6734 if (setup_vmcs_config(&vmcs_conf, &vmx_cap) < 0) 6771 - *(int *)rtn = -EIO; 6735 + return -EIO; 6772 6736 if (nested) 6773 6737 nested_vmx_setup_ctls_msrs(&vmcs_conf.nested, vmx_cap.ept, 6774 6738 enable_apicv); 6775 6739 if (memcmp(&vmcs_config, &vmcs_conf, sizeof(struct vmcs_config)) != 0) { 6776 6740 printk(KERN_ERR "kvm: CPU %d feature inconsistency!\n", 6777 6741 smp_processor_id()); 6778 - *(int *)rtn = -EIO; 6742 + return -EIO; 6779 6743 } 6744 + return 0; 6780 6745 } 6781 6746 6782 6747 static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) ··· 6830 6795 return PT_PDPE_LEVEL; 6831 6796 } 6832 6797 6833 - static void vmcs_set_secondary_exec_control(u32 new_ctl) 6798 + static void vmcs_set_secondary_exec_control(struct vcpu_vmx *vmx) 6834 6799 { 6835 6800 /* 6836 6801 * These bits in the secondary execution controls field ··· 6844 6809 SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES | 6845 6810 SECONDARY_EXEC_DESC; 6846 6811 6847 - u32 cur_ctl = vmcs_read32(SECONDARY_VM_EXEC_CONTROL); 6812 + u32 new_ctl = vmx->secondary_exec_control; 6813 + u32 cur_ctl = secondary_exec_controls_get(vmx); 6848 6814 6849 - vmcs_write32(SECONDARY_VM_EXEC_CONTROL, 6850 - (new_ctl & ~mask) | (cur_ctl & mask)); 6815 + secondary_exec_controls_set(vmx, (new_ctl & ~mask) | (cur_ctl & mask)); 6851 6816 } 6852 6817 6853 6818 /* ··· 6985 6950 6986 6951 if (cpu_has_secondary_exec_ctrls()) { 6987 6952 vmx_compute_secondary_exec_control(vmx); 6988 - vmcs_set_secondary_exec_control(vmx->secondary_exec_control); 6953 + vmcs_set_secondary_exec_control(vmx); 6989 6954 } 6990 6955 6991 6956 if (nested_vmx_allowed(vcpu)) ··· 7459 7424 static __init int hardware_setup(void) 7460 7425 { 7461 7426 
unsigned long host_bndcfgs; 7427 + struct desc_ptr dt; 7462 7428 int r, i; 7463 7429 7464 7430 rdmsrl_safe(MSR_EFER, &host_efer); 7431 + 7432 + store_idt(&dt); 7433 + host_idt_base = dt.address; 7465 7434 7466 7435 for (i = 0; i < ARRAY_SIZE(vmx_msr_index); ++i) 7467 7436 kvm_define_shared_msr(i, vmx_msr_index[i]); ··· 7570 7531 } 7571 7532 7572 7533 if (!cpu_has_vmx_preemption_timer()) 7573 - kvm_x86_ops->request_immediate_exit = __kvm_request_immediate_exit; 7534 + enable_preemption_timer = false; 7574 7535 7575 - if (cpu_has_vmx_preemption_timer() && enable_preemption_timer) { 7536 + if (enable_preemption_timer) { 7537 + u64 use_timer_freq = 5000ULL * 1000 * 1000; 7576 7538 u64 vmx_msr; 7577 7539 7578 7540 rdmsrl(MSR_IA32_VMX_MISC, vmx_msr); 7579 7541 cpu_preemption_timer_multi = 7580 7542 vmx_msr & VMX_MISC_PREEMPTION_TIMER_RATE_MASK; 7581 - } else { 7543 + 7544 + if (tsc_khz) 7545 + use_timer_freq = (u64)tsc_khz * 1000; 7546 + use_timer_freq >>= cpu_preemption_timer_multi; 7547 + 7548 + /* 7549 + * KVM "disables" the preemption timer by setting it to its max 7550 + * value. Don't use the timer if it might cause spurious exits 7551 + * at a rate faster than 0.1 Hz (of uninterrupted guest time). 7552 + */ 7553 + if (use_timer_freq > 0xffffffffu / 10) 7554 + enable_preemption_timer = false; 7555 + } 7556 + 7557 + if (!enable_preemption_timer) { 7582 7558 kvm_x86_ops->set_hv_timer = NULL; 7583 7559 kvm_x86_ops->cancel_hv_timer = NULL; 7560 + kvm_x86_ops->request_immediate_exit = __kvm_request_immediate_exit; 7584 7561 } 7585 7562 7586 7563 kvm_set_posted_intr_wakeup_handler(wakeup_handler); ··· 7738 7683 .set_tdp_cr3 = vmx_set_cr3, 7739 7684 7740 7685 .check_intercept = vmx_check_intercept, 7741 - .handle_external_intr = vmx_handle_external_intr, 7686 + .handle_exit_irqoff = vmx_handle_exit_irqoff, 7742 7687 .mpx_supported = vmx_mpx_supported, 7743 7688 .xsaves_supported = vmx_xsaves_supported, 7744 7689 .umip_emulated = vmx_umip_emulated,
+49 -75
arch/x86/kvm/vmx/vmx.h
··· 109 109 * to guest memory during VM exit. 110 110 */ 111 111 struct vmcs12 *cached_shadow_vmcs12; 112 + 112 113 /* 113 114 * Indicates if the shadow vmcs or enlightened vmcs must be updated 114 115 * with the data held by struct vmcs12. 115 116 */ 116 - bool need_vmcs12_sync; 117 + bool need_vmcs12_to_shadow_sync; 117 118 bool dirty_vmcs12; 119 + 120 + /* 121 + * Indicates lazily loaded guest state has not yet been decached from 122 + * vmcs02. 123 + */ 124 + bool need_sync_vmcs02_to_vmcs12_rare; 118 125 119 126 /* 120 127 * vmcs02 has been initialized, i.e. state that is constant for ··· 187 180 struct kvm_vcpu vcpu; 188 181 u8 fail; 189 182 u8 msr_bitmap_mode; 183 + 184 + /* 185 + * If true, host state has been stored in vmx->loaded_vmcs for 186 + * the CPU registers that only need to be switched when transitioning 187 + * to/from the kernel, and the registers have been loaded with guest 188 + * values. If false, host state is loaded in the CPU registers 189 + * and vmx->loaded_vmcs->host_state is invalid. 190 + */ 191 + bool guest_state_loaded; 192 + 190 193 u32 exit_intr_info; 191 194 u32 idt_vectoring_info; 192 195 ulong rflags; 196 + 193 197 struct shared_msr_entry *guest_msrs; 194 198 int nmsrs; 195 199 int save_nmsrs; 196 - bool guest_msrs_dirty; 197 - unsigned long host_idt_base; 200 + bool guest_msrs_ready; 198 201 #ifdef CONFIG_X86_64 199 202 u64 msr_host_kernel_gs_base; 200 203 u64 msr_guest_kernel_gs_base; ··· 212 195 213 196 u64 spec_ctrl; 214 197 215 - u32 vm_entry_controls_shadow; 216 - u32 vm_exit_controls_shadow; 217 198 u32 secondary_exec_control; 218 199 219 200 /* 220 201 * loaded_vmcs points to the VMCS currently used in this vcpu. For a 221 202 * non-nested (L1) guest, it always points to vmcs01. For a nested 222 - * guest (L2), it points to a different VMCS. 
loaded_cpu_state points 223 - * to the VMCS whose state is loaded into the CPU registers that only 224 - * need to be switched when transitioning to/from the kernel; a NULL 225 - * value indicates that host state is loaded. 203 + * guest (L2), it points to a different VMCS. 226 204 */ 227 205 struct loaded_vmcs vmcs01; 228 206 struct loaded_vmcs *loaded_vmcs; 229 - struct loaded_vmcs *loaded_cpu_state; 230 207 231 208 struct msr_autoload { 232 209 struct vmx_msrs guest; ··· 271 260 272 261 unsigned long host_debugctlmsr; 273 262 274 - u64 msr_ia32_power_ctl; 275 - 276 263 /* 277 264 * Only bits masked by msr_ia32_feature_control_valid_bits can be set in 278 265 * msr_ia32_feature_control. FEATURE_CONTROL_LOCKED is always included ··· 301 292 }; 302 293 303 294 bool nested_vmx_allowed(struct kvm_vcpu *vcpu); 295 + void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu); 304 296 void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu); 305 - void vmx_vcpu_put(struct kvm_vcpu *vcpu); 306 297 int allocate_vpid(void); 307 298 void free_vpid(int vpid); 308 299 void vmx_set_constant_host_state(struct vcpu_vmx *vmx); 309 300 void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu); 301 + void vmx_set_host_fs_gs(struct vmcs_host_state *host, u16 fs_sel, u16 gs_sel, 302 + unsigned long fs_base, unsigned long gs_base); 310 303 int vmx_get_cpl(struct kvm_vcpu *vcpu); 311 304 unsigned long vmx_get_rflags(struct kvm_vcpu *vcpu); 312 305 void vmx_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags); ··· 387 376 return vmcs_read16(GUEST_INTR_STATUS) & 0xff; 388 377 } 389 378 390 - static inline void vm_entry_controls_reset_shadow(struct vcpu_vmx *vmx) 391 - { 392 - vmx->vm_entry_controls_shadow = vmcs_read32(VM_ENTRY_CONTROLS); 379 + #define BUILD_CONTROLS_SHADOW(lname, uname) \ 380 + static inline void lname##_controls_set(struct vcpu_vmx *vmx, u32 val) \ 381 + { \ 382 + if (vmx->loaded_vmcs->controls_shadow.lname != val) { \ 383 + vmcs_write32(uname, val); \ 384 + 
vmx->loaded_vmcs->controls_shadow.lname = val; \ 385 + } \ 386 + } \ 387 + static inline u32 lname##_controls_get(struct vcpu_vmx *vmx) \ 388 + { \ 389 + return vmx->loaded_vmcs->controls_shadow.lname; \ 390 + } \ 391 + static inline void lname##_controls_setbit(struct vcpu_vmx *vmx, u32 val) \ 392 + { \ 393 + lname##_controls_set(vmx, lname##_controls_get(vmx) | val); \ 394 + } \ 395 + static inline void lname##_controls_clearbit(struct vcpu_vmx *vmx, u32 val) \ 396 + { \ 397 + lname##_controls_set(vmx, lname##_controls_get(vmx) & ~val); \ 393 398 } 394 - 395 - static inline void vm_entry_controls_init(struct vcpu_vmx *vmx, u32 val) 396 - { 397 - vmcs_write32(VM_ENTRY_CONTROLS, val); 398 - vmx->vm_entry_controls_shadow = val; 399 - } 400 - 401 - static inline void vm_entry_controls_set(struct vcpu_vmx *vmx, u32 val) 402 - { 403 - if (vmx->vm_entry_controls_shadow != val) 404 - vm_entry_controls_init(vmx, val); 405 - } 406 - 407 - static inline u32 vm_entry_controls_get(struct vcpu_vmx *vmx) 408 - { 409 - return vmx->vm_entry_controls_shadow; 410 - } 411 - 412 - static inline void vm_entry_controls_setbit(struct vcpu_vmx *vmx, u32 val) 413 - { 414 - vm_entry_controls_set(vmx, vm_entry_controls_get(vmx) | val); 415 - } 416 - 417 - static inline void vm_entry_controls_clearbit(struct vcpu_vmx *vmx, u32 val) 418 - { 419 - vm_entry_controls_set(vmx, vm_entry_controls_get(vmx) & ~val); 420 - } 421 - 422 - static inline void vm_exit_controls_reset_shadow(struct vcpu_vmx *vmx) 423 - { 424 - vmx->vm_exit_controls_shadow = vmcs_read32(VM_EXIT_CONTROLS); 425 - } 426 - 427 - static inline void vm_exit_controls_init(struct vcpu_vmx *vmx, u32 val) 428 - { 429 - vmcs_write32(VM_EXIT_CONTROLS, val); 430 - vmx->vm_exit_controls_shadow = val; 431 - } 432 - 433 - static inline void vm_exit_controls_set(struct vcpu_vmx *vmx, u32 val) 434 - { 435 - if (vmx->vm_exit_controls_shadow != val) 436 - vm_exit_controls_init(vmx, val); 437 - } 438 - 439 - static inline u32 
vm_exit_controls_get(struct vcpu_vmx *vmx) 440 - { 441 - return vmx->vm_exit_controls_shadow; 442 - } 443 - 444 - static inline void vm_exit_controls_setbit(struct vcpu_vmx *vmx, u32 val) 445 - { 446 - vm_exit_controls_set(vmx, vm_exit_controls_get(vmx) | val); 447 - } 448 - 449 - static inline void vm_exit_controls_clearbit(struct vcpu_vmx *vmx, u32 val) 450 - { 451 - vm_exit_controls_set(vmx, vm_exit_controls_get(vmx) & ~val); 452 - } 399 + BUILD_CONTROLS_SHADOW(vm_entry, VM_ENTRY_CONTROLS) 400 + BUILD_CONTROLS_SHADOW(vm_exit, VM_EXIT_CONTROLS) 401 + BUILD_CONTROLS_SHADOW(pin, PIN_BASED_VM_EXEC_CONTROL) 402 + BUILD_CONTROLS_SHADOW(exec, CPU_BASED_VM_EXEC_CONTROL) 403 + BUILD_CONTROLS_SHADOW(secondary_exec, SECONDARY_VM_EXEC_CONTROL) 453 404 454 405 static inline void vmx_segment_cache_clear(struct vcpu_vmx *vmx) 455 406 { ··· 441 468 } 442 469 443 470 u32 vmx_exec_control(struct vcpu_vmx *vmx); 471 + u32 vmx_pin_based_exec_ctrl(struct vcpu_vmx *vmx); 444 472 445 473 static inline struct kvm_vmx *to_kvm_vmx(struct kvm *kvm) 446 474 {
+167 -62
arch/x86/kvm/x86.c
··· 717 717 gfn_t gfn; 718 718 int r; 719 719 720 - if (is_long_mode(vcpu) || !is_pae(vcpu) || !is_paging(vcpu)) 720 + if (!is_pae_paging(vcpu)) 721 721 return false; 722 722 723 723 if (!test_bit(VCPU_EXREG_PDPTR, ··· 960 960 if (is_long_mode(vcpu) && 961 961 (cr3 & rsvd_bits(cpuid_maxphyaddr(vcpu), 63))) 962 962 return 1; 963 - else if (is_pae(vcpu) && is_paging(vcpu) && 964 - !load_pdptrs(vcpu, vcpu->arch.walk_mmu, cr3)) 963 + else if (is_pae_paging(vcpu) && 964 + !load_pdptrs(vcpu, vcpu->arch.walk_mmu, cr3)) 965 965 return 1; 966 966 967 967 kvm_mmu_new_cr3(vcpu, cr3, skip_tlb_flush); ··· 1174 1174 MSR_AMD64_VIRT_SPEC_CTRL, 1175 1175 MSR_IA32_POWER_CTL, 1176 1176 1177 + /* 1178 + * The following list leaves out MSRs whose values are determined 1179 + * by arch/x86/kvm/vmx/nested.c based on CPUID or other MSRs. 1180 + * We always support the "true" VMX control MSRs, even if the host 1181 + * processor does not, so I am putting these registers here rather 1182 + * than in msrs_to_save. 
1183 + */ 1184 + MSR_IA32_VMX_BASIC, 1185 + MSR_IA32_VMX_TRUE_PINBASED_CTLS, 1186 + MSR_IA32_VMX_TRUE_PROCBASED_CTLS, 1187 + MSR_IA32_VMX_TRUE_EXIT_CTLS, 1188 + MSR_IA32_VMX_TRUE_ENTRY_CTLS, 1189 + MSR_IA32_VMX_MISC, 1190 + MSR_IA32_VMX_CR0_FIXED0, 1191 + MSR_IA32_VMX_CR4_FIXED0, 1192 + MSR_IA32_VMX_VMCS_ENUM, 1193 + MSR_IA32_VMX_PROCBASED_CTLS2, 1194 + MSR_IA32_VMX_EPT_VPID_CAP, 1195 + MSR_IA32_VMX_VMFUNC, 1196 + 1177 1197 MSR_K7_HWCR, 1198 + MSR_KVM_POLL_CONTROL, 1178 1199 }; 1179 1200 1180 1201 static unsigned num_emulated_msrs; ··· 1231 1210 1232 1211 static unsigned int num_msr_based_features; 1233 1212 1234 - u64 kvm_get_arch_capabilities(void) 1213 + static u64 kvm_get_arch_capabilities(void) 1235 1214 { 1236 - u64 data; 1215 + u64 data = 0; 1237 1216 1238 - rdmsrl_safe(MSR_IA32_ARCH_CAPABILITIES, &data); 1217 + if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES)) 1218 + rdmsrl(MSR_IA32_ARCH_CAPABILITIES, data); 1239 1219 1240 1220 /* 1241 1221 * If we're doing cache flushes (either "always" or "cond") ··· 1252 1230 1253 1231 return data; 1254 1232 } 1255 - EXPORT_SYMBOL_GPL(kvm_get_arch_capabilities); 1256 1233 1257 1234 static int kvm_get_msr_feature(struct kvm_msr_entry *msr) 1258 1235 { ··· 2566 2545 } 2567 2546 break; 2568 2547 case MSR_IA32_MISC_ENABLE: 2569 - vcpu->arch.ia32_misc_enable_msr = data; 2548 + if (!kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT) && 2549 + ((vcpu->arch.ia32_misc_enable_msr ^ data) & MSR_IA32_MISC_ENABLE_MWAIT)) { 2550 + if (!guest_cpuid_has(vcpu, X86_FEATURE_XMM3)) 2551 + return 1; 2552 + vcpu->arch.ia32_misc_enable_msr = data; 2553 + kvm_update_cpuid(vcpu); 2554 + } else { 2555 + vcpu->arch.ia32_misc_enable_msr = data; 2556 + } 2570 2557 break; 2571 2558 case MSR_IA32_SMBASE: 2572 2559 if (!msr_info->host_initiated) 2573 2560 return 1; 2574 2561 vcpu->arch.smbase = data; 2562 + break; 2563 + case MSR_IA32_POWER_CTL: 2564 + vcpu->arch.msr_ia32_power_ctl = data; 2575 2565 break; 2576 2566 case MSR_IA32_TSC: 
2577 2567 kvm_write_tsc(vcpu, msr_info); ··· 2656 2624 case MSR_KVM_PV_EOI_EN: 2657 2625 if (kvm_lapic_enable_pv_eoi(vcpu, data, sizeof(u8))) 2658 2626 return 1; 2627 + break; 2628 + 2629 + case MSR_KVM_POLL_CONTROL: 2630 + /* only enable bit supported */ 2631 + if (data & (-1ULL << 1)) 2632 + return 1; 2633 + 2634 + vcpu->arch.msr_kvm_poll_control = data; 2659 2635 break; 2660 2636 2661 2637 case MSR_IA32_MCG_CTL: ··· 2843 2803 return 1; 2844 2804 msr_info->data = vcpu->arch.arch_capabilities; 2845 2805 break; 2806 + case MSR_IA32_POWER_CTL: 2807 + msr_info->data = vcpu->arch.msr_ia32_power_ctl; 2808 + break; 2846 2809 case MSR_IA32_TSC: 2847 2810 msr_info->data = kvm_scale_tsc(vcpu, rdtsc()) + vcpu->arch.tsc_offset; 2848 2811 break; ··· 2917 2874 break; 2918 2875 case MSR_KVM_PV_EOI_EN: 2919 2876 msr_info->data = vcpu->arch.pv_eoi.msr_val; 2877 + break; 2878 + case MSR_KVM_POLL_CONTROL: 2879 + msr_info->data = vcpu->arch.msr_kvm_poll_control; 2920 2880 break; 2921 2881 case MSR_IA32_P5_MC_ADDR: 2922 2882 case MSR_IA32_P5_MC_TYPE: ··· 3130 3084 case KVM_CAP_SET_BOOT_CPU_ID: 3131 3085 case KVM_CAP_SPLIT_IRQCHIP: 3132 3086 case KVM_CAP_IMMEDIATE_EXIT: 3087 + case KVM_CAP_PMU_EVENT_FILTER: 3133 3088 case KVM_CAP_GET_MSR_FEATURES: 3134 3089 case KVM_CAP_MSR_PLATFORM_INFO: 3135 3090 case KVM_CAP_EXCEPTION_PAYLOAD: ··· 3143 3096 r = KVM_CLOCK_TSC_STABLE; 3144 3097 break; 3145 3098 case KVM_CAP_X86_DISABLE_EXITS: 3146 - r |= KVM_X86_DISABLE_EXITS_HLT | KVM_X86_DISABLE_EXITS_PAUSE; 3099 + r |= KVM_X86_DISABLE_EXITS_HLT | KVM_X86_DISABLE_EXITS_PAUSE | 3100 + KVM_X86_DISABLE_EXITS_CSTATE; 3147 3101 if(kvm_can_mwait_in_guest()) 3148 3102 r |= KVM_X86_DISABLE_EXITS_MWAIT; 3149 3103 break; ··· 4661 4613 kvm->arch.hlt_in_guest = true; 4662 4614 if (cap->args[0] & KVM_X86_DISABLE_EXITS_PAUSE) 4663 4615 kvm->arch.pause_in_guest = true; 4616 + if (cap->args[0] & KVM_X86_DISABLE_EXITS_CSTATE) 4617 + kvm->arch.cstate_in_guest = true; 4664 4618 r = 0; 4665 4619 break; 4666 4620 case 
KVM_CAP_MSR_PLATFORM_INFO: ··· 4977 4927 r = kvm_vm_ioctl_hv_eventfd(kvm, &hvevfd); 4978 4928 break; 4979 4929 } 4930 + case KVM_SET_PMU_EVENT_FILTER: 4931 + r = kvm_vm_ioctl_set_pmu_event_filter(kvm, argp); 4932 + break; 4980 4933 default: 4981 4934 r = -ENOTTY; 4982 4935 } ··· 6432 6379 vcpu->arch.db); 6433 6380 6434 6381 if (dr6 != 0) { 6435 - vcpu->arch.dr6 &= ~15; 6382 + vcpu->arch.dr6 &= ~DR_TRAP_BITS; 6436 6383 vcpu->arch.dr6 |= dr6 | DR6_RTM; 6437 6384 kvm_queue_exception(vcpu, DB_VECTOR); 6438 6385 *r = EMULATE_DONE; ··· 6759 6706 struct kvm_vcpu *vcpu; 6760 6707 int cpu; 6761 6708 6762 - spin_lock(&kvm_lock); 6709 + mutex_lock(&kvm_lock); 6763 6710 list_for_each_entry(kvm, &vm_list, vm_list) 6764 6711 kvm_make_mclock_inprogress_request(kvm); 6765 6712 ··· 6785 6732 6786 6733 spin_unlock(&ka->pvclock_gtod_sync_lock); 6787 6734 } 6788 - spin_unlock(&kvm_lock); 6735 + mutex_unlock(&kvm_lock); 6789 6736 } 6790 6737 #endif 6791 6738 ··· 6836 6783 6837 6784 smp_call_function_single(cpu, tsc_khz_changed, freq, 1); 6838 6785 6839 - spin_lock(&kvm_lock); 6786 + mutex_lock(&kvm_lock); 6840 6787 list_for_each_entry(kvm, &vm_list, vm_list) { 6841 6788 kvm_for_each_vcpu(i, vcpu, kvm) { 6842 6789 if (vcpu->cpu != cpu) 6843 6790 continue; 6844 6791 kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu); 6845 - if (vcpu->cpu != smp_processor_id()) 6792 + if (vcpu->cpu != raw_smp_processor_id()) 6846 6793 send_ipi = 1; 6847 6794 } 6848 6795 } 6849 - spin_unlock(&kvm_lock); 6796 + mutex_unlock(&kvm_lock); 6850 6797 6851 6798 if (freq->old < freq->new && send_ipi) { 6852 6799 /* ··· 6961 6908 .handle_intel_pt_intr = kvm_handle_intel_pt_intr, 6962 6909 }; 6963 6910 6964 - static void kvm_set_mmio_spte_mask(void) 6965 - { 6966 - u64 mask; 6967 - int maxphyaddr = boot_cpu_data.x86_phys_bits; 6968 - 6969 - /* 6970 - * Set the reserved bits and the present bit of an paging-structure 6971 - * entry to generate page fault with PFER.RSV = 1. 
6972 - */ 6973 - 6974 - /* 6975 - * Mask the uppermost physical address bit, which would be reserved as 6976 - * long as the supported physical address width is less than 52. 6977 - */ 6978 - mask = 1ull << 51; 6979 - 6980 - /* Set the present bit. */ 6981 - mask |= 1ull; 6982 - 6983 - /* 6984 - * If reserved bit is not supported, clear the present bit to disable 6985 - * mmio page fault. 6986 - */ 6987 - if (IS_ENABLED(CONFIG_X86_64) && maxphyaddr == 52) 6988 - mask &= ~1ull; 6989 - 6990 - kvm_mmu_set_mmio_spte_mask(mask, mask); 6991 - } 6992 - 6993 6911 #ifdef CONFIG_X86_64 6994 6912 static void pvclock_gtod_update_fn(struct work_struct *work) 6995 6913 { ··· 6969 6945 struct kvm_vcpu *vcpu; 6970 6946 int i; 6971 6947 6972 - spin_lock(&kvm_lock); 6948 + mutex_lock(&kvm_lock); 6973 6949 list_for_each_entry(kvm, &vm_list, vm_list) 6974 6950 kvm_for_each_vcpu(i, vcpu, kvm) 6975 6951 kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu); 6976 6952 atomic_set(&kvm_guest_has_master_clock, 0); 6977 - spin_unlock(&kvm_lock); 6953 + mutex_unlock(&kvm_lock); 6978 6954 } 6979 6955 6980 6956 static DECLARE_WORK(pvclock_gtod_work, pvclock_gtod_update_fn); ··· 7056 7032 r = kvm_mmu_module_init(); 7057 7033 if (r) 7058 7034 goto out_free_percpu; 7059 - 7060 - kvm_set_mmio_spte_mask(); 7061 7035 7062 7036 kvm_x86_ops = ops; 7063 7037 ··· 7195 7173 kvm_x86_ops->refresh_apicv_exec_ctrl(vcpu); 7196 7174 } 7197 7175 7176 + static void kvm_sched_yield(struct kvm *kvm, unsigned long dest_id) 7177 + { 7178 + struct kvm_vcpu *target = NULL; 7179 + struct kvm_apic_map *map; 7180 + 7181 + rcu_read_lock(); 7182 + map = rcu_dereference(kvm->arch.apic_map); 7183 + 7184 + if (likely(map) && dest_id <= map->max_apic_id && map->phys_map[dest_id]) 7185 + target = map->phys_map[dest_id]->vcpu; 7186 + 7187 + rcu_read_unlock(); 7188 + 7189 + if (target) 7190 + kvm_vcpu_yield_to(target); 7191 + } 7192 + 7198 7193 int kvm_emulate_hypercall(struct kvm_vcpu *vcpu) 7199 7194 { 7200 7195 unsigned long nr, 
a0, a1, a2, a3, ret; ··· 7257 7218 #endif 7258 7219 case KVM_HC_SEND_IPI: 7259 7220 ret = kvm_pv_send_ipi(vcpu->kvm, a0, a1, a2, a3, op_64_bit); 7221 + break; 7222 + case KVM_HC_SCHED_YIELD: 7223 + kvm_sched_yield(vcpu->kvm, a0); 7224 + ret = 0; 7260 7225 break; 7261 7226 default: 7262 7227 ret = -KVM_ENOSYS; ··· 7994 7951 } 7995 7952 7996 7953 trace_kvm_entry(vcpu->vcpu_id); 7997 - if (lapic_in_kernel(vcpu) && 7998 - vcpu->arch.apic->lapic_timer.timer_advance_ns) 7999 - wait_lapic_expire(vcpu); 8000 7954 guest_enter_irqoff(); 8001 7955 8002 7956 fpregs_assert_state_consistent(); ··· 8042 8002 vcpu->mode = OUTSIDE_GUEST_MODE; 8043 8003 smp_wmb(); 8044 8004 8005 + kvm_x86_ops->handle_exit_irqoff(vcpu); 8006 + 8007 + /* 8008 + * Consume any pending interrupts, including the possible source of 8009 + * VM-Exit on SVM and any ticks that occur between VM-Exit and now. 8010 + * An instruction is required after local_irq_enable() to fully unblock 8011 + * interrupts on processors that implement an interrupt shadow, the 8012 + * stat.exits increment will do nicely. 
8013 + */ 8045 8014 kvm_before_interrupt(vcpu); 8046 - kvm_x86_ops->handle_external_intr(vcpu); 8015 + local_irq_enable(); 8016 + ++vcpu->stat.exits; 8017 + local_irq_disable(); 8047 8018 kvm_after_interrupt(vcpu); 8048 8019 8049 - ++vcpu->stat.exits; 8050 - 8051 8020 guest_exit_irqoff(); 8021 + if (lapic_in_kernel(vcpu)) { 8022 + s64 delta = vcpu->arch.apic->lapic_timer.advance_expire_delta; 8023 + if (delta != S64_MIN) { 8024 + trace_kvm_wait_lapic_expire(vcpu->vcpu_id, delta); 8025 + vcpu->arch.apic->lapic_timer.advance_expire_delta = S64_MIN; 8026 + } 8027 + } 8052 8028 8053 8029 local_irq_enable(); 8054 8030 preempt_enable(); ··· 8650 8594 kvm_update_cpuid(vcpu); 8651 8595 8652 8596 idx = srcu_read_lock(&vcpu->kvm->srcu); 8653 - if (!is_long_mode(vcpu) && is_pae(vcpu) && is_paging(vcpu)) { 8597 + if (is_pae_paging(vcpu)) { 8654 8598 load_pdptrs(vcpu, vcpu->arch.walk_mmu, kvm_read_cr3(vcpu)); 8655 8599 mmu_reset_needed = 1; 8656 8600 } ··· 8931 8875 msr.host_initiated = true; 8932 8876 kvm_write_tsc(vcpu, &msr); 8933 8877 vcpu_put(vcpu); 8878 + 8879 + /* poll control enabled by default */ 8880 + vcpu->arch.msr_kvm_poll_control = 1; 8881 + 8934 8882 mutex_unlock(&vcpu->mutex); 8935 8883 8936 8884 if (!kvmclock_periodic_sync) ··· 9167 9107 kvm_x86_ops->hardware_unsetup(); 9168 9108 } 9169 9109 9170 - void kvm_arch_check_processor_compat(void *rtn) 9110 + int kvm_arch_check_processor_compat(void) 9171 9111 { 9172 - kvm_x86_ops->check_processor_compatibility(rtn); 9112 + return kvm_x86_ops->check_processor_compatibility(); 9173 9113 } 9174 9114 9175 9115 bool kvm_vcpu_is_reset_bsp(struct kvm_vcpu *vcpu) ··· 9441 9381 kvm_ioapic_destroy(kvm); 9442 9382 kvm_free_vcpus(kvm); 9443 9383 kvfree(rcu_dereference_check(kvm->arch.apic_map, 1)); 9384 + kfree(srcu_dereference_check(kvm->arch.pmu_event_filter, &kvm->srcu, 1)); 9444 9385 kvm_mmu_uninit_vm(kvm); 9445 9386 kvm_page_track_cleanup(kvm); 9446 9387 kvm_hv_destroy_vm(kvm); ··· 9850 9789 sizeof(u32)); 9851 9790 } 9852 
9791 9792 + static bool kvm_can_deliver_async_pf(struct kvm_vcpu *vcpu) 9793 + { 9794 + if (!vcpu->arch.apf.delivery_as_pf_vmexit && is_guest_mode(vcpu)) 9795 + return false; 9796 + 9797 + if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED) || 9798 + (vcpu->arch.apf.send_user_only && 9799 + kvm_x86_ops->get_cpl(vcpu) == 0)) 9800 + return false; 9801 + 9802 + return true; 9803 + } 9804 + 9805 + bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu) 9806 + { 9807 + if (unlikely(!lapic_in_kernel(vcpu) || 9808 + kvm_event_needs_reinjection(vcpu) || 9809 + vcpu->arch.exception.pending)) 9810 + return false; 9811 + 9812 + if (kvm_hlt_in_guest(vcpu->kvm) && !kvm_can_deliver_async_pf(vcpu)) 9813 + return false; 9814 + 9815 + /* 9816 + * If interrupts are off we cannot even use an artificial 9817 + * halt state. 9818 + */ 9819 + return kvm_x86_ops->interrupt_allowed(vcpu); 9820 + } 9821 + 9853 9822 void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu, 9854 9823 struct kvm_async_pf *work) 9855 9824 { ··· 9888 9797 trace_kvm_async_pf_not_present(work->arch.token, work->gva); 9889 9798 kvm_add_async_pf_gfn(vcpu, work->arch.gfn); 9890 9799 9891 - if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED) || 9892 - (vcpu->arch.apf.send_user_only && 9893 - kvm_x86_ops->get_cpl(vcpu) == 0)) 9894 - kvm_make_request(KVM_REQ_APF_HALT, vcpu); 9895 - else if (!apf_put_user(vcpu, KVM_PV_REASON_PAGE_NOT_PRESENT)) { 9800 + if (kvm_can_deliver_async_pf(vcpu) && 9801 + !apf_put_user(vcpu, KVM_PV_REASON_PAGE_NOT_PRESENT)) { 9896 9802 fault.vector = PF_VECTOR; 9897 9803 fault.error_code_valid = true; 9898 9804 fault.error_code = 0; ··· 9897 9809 fault.address = work->arch.token; 9898 9810 fault.async_page_fault = true; 9899 9811 kvm_inject_page_fault(vcpu, &fault); 9812 + } else { 9813 + /* 9814 + * It is not possible to deliver a paravirtualized asynchronous 9815 + * page fault, but putting the guest in an artificial halt state 9816 + * can be beneficial nevertheless: if an interrupt arrives, we 
9817 + * can deliver it timely and perhaps the guest will schedule 9818 + * another process. When the instruction that triggered a page 9819 + * fault is retried, hopefully the page will be ready in the host. 9820 + */ 9821 + kvm_make_request(KVM_REQ_APF_HALT, vcpu); 9900 9822 } 9901 9823 } 9902 9824 ··· 10046 9948 return vector_hashing; 10047 9949 } 10048 9950 EXPORT_SYMBOL_GPL(kvm_vector_hashing_enabled); 9951 + 9952 + bool kvm_arch_no_poll(struct kvm_vcpu *vcpu) 9953 + { 9954 + return (vcpu->arch.msr_kvm_poll_control & 1) == 0; 9955 + } 9956 + EXPORT_SYMBOL_GPL(kvm_arch_no_poll); 9957 + 10049 9958 10050 9959 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit); 10051 9960 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio);
+10
arch/x86/kvm/x86.h
··· 139 139 return likely(kvm_read_cr0_bits(vcpu, X86_CR0_PG)); 140 140 } 141 141 142 + static inline bool is_pae_paging(struct kvm_vcpu *vcpu) 143 + { 144 + return !is_long_mode(vcpu) && is_pae(vcpu) && is_paging(vcpu); 145 + } 146 + 142 147 static inline u32 bit(int bitno) 143 148 { 144 149 return 1 << (bitno & 31); ··· 336 331 static inline bool kvm_pause_in_guest(struct kvm *kvm) 337 332 { 338 333 return kvm->arch.pause_in_guest; 334 + } 335 + 336 + static inline bool kvm_cstate_in_guest(struct kvm *kvm) 337 + { 338 + return kvm->arch.cstate_in_guest; 339 339 } 340 340 341 341 DECLARE_PER_CPU(struct kvm_vcpu *, current_vcpu);
+6 -5
include/kvm/arm_pmu.h
··· 11 11 #include <asm/perf_event.h> 12 12 13 13 #define ARMV8_PMU_CYCLE_IDX (ARMV8_PMU_MAX_COUNTERS - 1) 14 + #define ARMV8_PMU_MAX_COUNTER_PAIRS ((ARMV8_PMU_MAX_COUNTERS + 1) >> 1) 14 15 15 16 #ifdef CONFIG_KVM_ARM_PMU 16 17 17 18 struct kvm_pmc { 18 19 u8 idx; /* index into the pmu->pmc array */ 19 20 struct perf_event *perf_event; 20 - u64 bitmask; 21 21 }; 22 22 23 23 struct kvm_pmu { 24 24 int irq_num; 25 25 struct kvm_pmc pmc[ARMV8_PMU_MAX_COUNTERS]; 26 + DECLARE_BITMAP(chained, ARMV8_PMU_MAX_COUNTER_PAIRS); 26 27 bool ready; 27 28 bool created; 28 29 bool irq_level; ··· 36 35 u64 kvm_pmu_valid_counter_mask(struct kvm_vcpu *vcpu); 37 36 void kvm_pmu_vcpu_reset(struct kvm_vcpu *vcpu); 38 37 void kvm_pmu_vcpu_destroy(struct kvm_vcpu *vcpu); 39 - void kvm_pmu_disable_counter(struct kvm_vcpu *vcpu, u64 val); 40 - void kvm_pmu_enable_counter(struct kvm_vcpu *vcpu, u64 val); 38 + void kvm_pmu_disable_counter_mask(struct kvm_vcpu *vcpu, u64 val); 39 + void kvm_pmu_enable_counter_mask(struct kvm_vcpu *vcpu, u64 val); 41 40 void kvm_pmu_flush_hwstate(struct kvm_vcpu *vcpu); 42 41 void kvm_pmu_sync_hwstate(struct kvm_vcpu *vcpu); 43 42 bool kvm_pmu_should_notify_user(struct kvm_vcpu *vcpu); ··· 73 72 } 74 73 static inline void kvm_pmu_vcpu_reset(struct kvm_vcpu *vcpu) {} 75 74 static inline void kvm_pmu_vcpu_destroy(struct kvm_vcpu *vcpu) {} 76 - static inline void kvm_pmu_disable_counter(struct kvm_vcpu *vcpu, u64 val) {} 77 - static inline void kvm_pmu_enable_counter(struct kvm_vcpu *vcpu, u64 val) {} 75 + static inline void kvm_pmu_disable_counter_mask(struct kvm_vcpu *vcpu, u64 val) {} 76 + static inline void kvm_pmu_enable_counter_mask(struct kvm_vcpu *vcpu, u64 val) {} 78 77 static inline void kvm_pmu_flush_hwstate(struct kvm_vcpu *vcpu) {} 79 78 static inline void kvm_pmu_sync_hwstate(struct kvm_vcpu *vcpu) {} 80 79 static inline bool kvm_pmu_should_notify_user(struct kvm_vcpu *vcpu)
+3 -2
include/linux/kvm_host.h
··· 159 159 160 160 extern struct kmem_cache *kvm_vcpu_cache; 161 161 162 - extern spinlock_t kvm_lock; 162 + extern struct mutex kvm_lock; 163 163 extern struct list_head vm_list; 164 164 165 165 struct kvm_io_range { ··· 867 867 void kvm_arch_hardware_disable(void); 868 868 int kvm_arch_hardware_setup(void); 869 869 void kvm_arch_hardware_unsetup(void); 870 - void kvm_arch_check_processor_compat(void *rtn); 870 + int kvm_arch_check_processor_compat(void); 871 871 int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu); 872 872 bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu); 873 873 int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu); ··· 990 990 struct kvm_irq_ack_notifier *kian); 991 991 int kvm_request_irq_source_id(struct kvm *kvm); 992 992 void kvm_free_irq_source_id(struct kvm *kvm, int irq_source_id); 993 + bool kvm_arch_irqfd_allowed(struct kvm *kvm, struct kvm_irqfd *args); 993 994 994 995 /* 995 996 * search_memslots() and __gfn_to_memslot() are here because they are
+6 -1
include/uapi/linux/kvm.h
··· 696 696 #define KVM_X86_DISABLE_EXITS_MWAIT (1 << 0) 697 697 #define KVM_X86_DISABLE_EXITS_HLT (1 << 1) 698 698 #define KVM_X86_DISABLE_EXITS_PAUSE (1 << 2) 699 + #define KVM_X86_DISABLE_EXITS_CSTATE (1 << 3) 699 700 #define KVM_X86_DISABLE_VALID_EXITS (KVM_X86_DISABLE_EXITS_MWAIT | \ 700 701 KVM_X86_DISABLE_EXITS_HLT | \ 701 - KVM_X86_DISABLE_EXITS_PAUSE) 702 + KVM_X86_DISABLE_EXITS_PAUSE | \ 703 + KVM_X86_DISABLE_EXITS_CSTATE) 702 704 703 705 /* for KVM_ENABLE_CAP */ 704 706 struct kvm_enable_cap { ··· 995 993 #define KVM_CAP_ARM_SVE 170 996 994 #define KVM_CAP_ARM_PTRAUTH_ADDRESS 171 997 995 #define KVM_CAP_ARM_PTRAUTH_GENERIC 172 996 + #define KVM_CAP_PMU_EVENT_FILTER 173 998 997 999 998 #ifdef KVM_CAP_IRQ_ROUTING 1000 999 ··· 1330 1327 #define KVM_PPC_GET_RMMU_INFO _IOW(KVMIO, 0xb0, struct kvm_ppc_rmmu_info) 1331 1328 /* Available with KVM_CAP_PPC_GET_CPU_CHAR */ 1332 1329 #define KVM_PPC_GET_CPU_CHAR _IOR(KVMIO, 0xb1, struct kvm_ppc_cpu_char) 1330 + /* Available with KVM_CAP_PMU_EVENT_FILTER */ 1331 + #define KVM_SET_PMU_EVENT_FILTER _IOW(KVMIO, 0xb2, struct kvm_pmu_event_filter) 1333 1332 1334 1333 /* ioctl for vm fd */ 1335 1334 #define KVM_CREATE_DEVICE _IOWR(KVMIO, 0xe0, struct kvm_create_device)
+1
include/uapi/linux/kvm_para.h
··· 28 28 #define KVM_HC_MIPS_CONSOLE_OUTPUT 8 29 29 #define KVM_HC_CLOCK_PAIRING 9 30 30 #define KVM_HC_SEND_IPI 10 31 + #define KVM_HC_SCHED_YIELD 11 31 32 32 33 /* 33 34 * hypercalls use architecture specific
+3 -1
tools/include/uapi/linux/kvm.h
··· 696 696 #define KVM_X86_DISABLE_EXITS_MWAIT (1 << 0) 697 697 #define KVM_X86_DISABLE_EXITS_HLT (1 << 1) 698 698 #define KVM_X86_DISABLE_EXITS_PAUSE (1 << 2) 699 + #define KVM_X86_DISABLE_EXITS_CSTATE (1 << 3) 699 700 #define KVM_X86_DISABLE_VALID_EXITS (KVM_X86_DISABLE_EXITS_MWAIT | \ 700 701 KVM_X86_DISABLE_EXITS_HLT | \ 701 - KVM_X86_DISABLE_EXITS_PAUSE) 702 + KVM_X86_DISABLE_EXITS_PAUSE | \ 703 + KVM_X86_DISABLE_EXITS_CSTATE) 702 704 703 705 /* for KVM_ENABLE_CAP */ 704 706 struct kvm_enable_cap {
+1 -2
tools/testing/selftests/kvm/dirty_log_test.c
··· 121 121 uint64_t *guest_array; 122 122 uint64_t pages_count = 0; 123 123 struct kvm_run *run; 124 - struct ucall uc; 125 124 126 125 run = vcpu_state(vm, VCPU_ID); 127 126 ··· 131 132 /* Let the guest dirty the random pages */ 132 133 ret = _vcpu_run(vm, VCPU_ID); 133 134 TEST_ASSERT(ret == 0, "vcpu_run failed: %d\n", ret); 134 - if (get_ucall(vm, VCPU_ID, &uc) == UCALL_SYNC) { 135 + if (get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC) { 135 136 pages_count += TEST_PAGES_PER_LOOP; 136 137 generate_random_array(guest_array, TEST_PAGES_PER_LOOP); 137 138 } else {
+4
tools/testing/selftests/kvm/include/aarch64/processor.h
··· 52 52 vcpu_ioctl(vm, vcpuid, KVM_SET_ONE_REG, &reg); 53 53 } 54 54 55 + void aarch64_vcpu_setup(struct kvm_vm *vm, int vcpuid, struct kvm_vcpu_init *init); 56 + void aarch64_vcpu_add_default(struct kvm_vm *vm, uint32_t vcpuid, 57 + struct kvm_vcpu_init *init, void *guest_code); 58 + 55 59 #endif /* SELFTEST_KVM_PROCESSOR_H */
+1 -2
tools/testing/selftests/kvm/include/kvm_util.h
··· 86 86 void *arg); 87 87 void vm_ioctl(struct kvm_vm *vm, unsigned long ioctl, void *arg); 88 88 void vm_mem_region_set_flags(struct kvm_vm *vm, uint32_t slot, uint32_t flags); 89 - void vm_vcpu_add(struct kvm_vm *vm, uint32_t vcpuid, int pgd_memslot, 90 - int gdt_memslot); 89 + void vm_vcpu_add(struct kvm_vm *vm, uint32_t vcpuid); 91 90 vm_vaddr_t vm_vaddr_alloc(struct kvm_vm *vm, size_t sz, vm_vaddr_t vaddr_min, 92 91 uint32_t data_memslot, uint32_t pgd_memslot); 93 92 void virt_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr,
+33 -19
tools/testing/selftests/kvm/lib/aarch64/processor.c
··· 235 235 return vm; 236 236 } 237 237 238 - void vm_vcpu_add_default(struct kvm_vm *vm, uint32_t vcpuid, void *guest_code) 238 + void aarch64_vcpu_setup(struct kvm_vm *vm, int vcpuid, struct kvm_vcpu_init *init) 239 239 { 240 - size_t stack_size = vm->page_size == 4096 ? 241 - DEFAULT_STACK_PGS * vm->page_size : 242 - vm->page_size; 243 - uint64_t stack_vaddr = vm_vaddr_alloc(vm, stack_size, 244 - DEFAULT_ARM64_GUEST_STACK_VADDR_MIN, 0, 0); 245 - 246 - vm_vcpu_add(vm, vcpuid, 0, 0); 247 - 248 - set_reg(vm, vcpuid, ARM64_CORE_REG(sp_el1), stack_vaddr + stack_size); 249 - set_reg(vm, vcpuid, ARM64_CORE_REG(regs.pc), (uint64_t)guest_code); 250 - } 251 - 252 - void vcpu_setup(struct kvm_vm *vm, int vcpuid, int pgd_memslot, int gdt_memslot) 253 - { 254 - struct kvm_vcpu_init init; 240 + struct kvm_vcpu_init default_init = { .target = -1, }; 255 241 uint64_t sctlr_el1, tcr_el1; 256 242 257 - memset(&init, 0, sizeof(init)); 258 - init.target = KVM_ARM_TARGET_GENERIC_V8; 259 - vcpu_ioctl(vm, vcpuid, KVM_ARM_VCPU_INIT, &init); 243 + if (!init) 244 + init = &default_init; 245 + 246 + if (init->target == -1) { 247 + struct kvm_vcpu_init preferred; 248 + vm_ioctl(vm, KVM_ARM_PREFERRED_TARGET, &preferred); 249 + init->target = preferred.target; 250 + } 251 + 252 + vcpu_ioctl(vm, vcpuid, KVM_ARM_VCPU_INIT, init); 260 253 261 254 /* 262 255 * Enable FP/ASIMD to avoid trapping when accessing Q0-Q15 ··· 308 315 309 316 fprintf(stream, "%*spstate: 0x%.16lx pc: 0x%.16lx\n", 310 317 indent, "", pstate, pc); 318 + } 319 + 320 + void aarch64_vcpu_add_default(struct kvm_vm *vm, uint32_t vcpuid, 321 + struct kvm_vcpu_init *init, void *guest_code) 322 + { 323 + size_t stack_size = vm->page_size == 4096 ? 
324 + DEFAULT_STACK_PGS * vm->page_size : 325 + vm->page_size; 326 + uint64_t stack_vaddr = vm_vaddr_alloc(vm, stack_size, 327 + DEFAULT_ARM64_GUEST_STACK_VADDR_MIN, 0, 0); 328 + 329 + vm_vcpu_add(vm, vcpuid); 330 + aarch64_vcpu_setup(vm, vcpuid, init); 331 + 332 + set_reg(vm, vcpuid, ARM64_CORE_REG(sp_el1), stack_vaddr + stack_size); 333 + set_reg(vm, vcpuid, ARM64_CORE_REG(regs.pc), (uint64_t)guest_code); 334 + } 335 + 336 + void vm_vcpu_add_default(struct kvm_vm *vm, uint32_t vcpuid, void *guest_code) 337 + { 338 + aarch64_vcpu_add_default(vm, vcpuid, NULL, guest_code); 311 339 }
+3 -6
tools/testing/selftests/kvm/lib/kvm_util.c
··· 763 763 * 764 764 * Return: None 765 765 * 766 - * Creates and adds to the VM specified by vm and virtual CPU with 767 - * the ID given by vcpuid. 766 + * Adds a virtual CPU to the VM specified by vm with the ID given by vcpuid. 767 + * No additional VCPU setup is done. 768 768 */ 769 - void vm_vcpu_add(struct kvm_vm *vm, uint32_t vcpuid, int pgd_memslot, 770 - int gdt_memslot) 769 + void vm_vcpu_add(struct kvm_vm *vm, uint32_t vcpuid) 771 770 { 772 771 struct vcpu *vcpu; 773 772 ··· 800 801 vm->vcpu_head->prev = vcpu; 801 802 vcpu->next = vm->vcpu_head; 802 803 vm->vcpu_head = vcpu; 803 - 804 - vcpu_setup(vm, vcpuid, pgd_memslot, gdt_memslot); 805 804 } 806 805 807 806 /*
-2
tools/testing/selftests/kvm/lib/kvm_util_internal.h
··· 64 64 }; 65 65 66 66 struct vcpu *vcpu_find(struct kvm_vm *vm, uint32_t vcpuid); 67 - void vcpu_setup(struct kvm_vm *vm, int vcpuid, int pgd_memslot, 68 - int gdt_memslot); 69 67 void virt_dump(FILE *stream, struct kvm_vm *vm, uint8_t indent); 70 68 void regs_dump(FILE *stream, struct kvm_regs *regs, uint8_t indent); 71 69 void sregs_dump(FILE *stream, struct kvm_sregs *sregs, uint8_t indent);
+13 -6
tools/testing/selftests/kvm/lib/ucall.c
··· 125 125 uint64_t get_ucall(struct kvm_vm *vm, uint32_t vcpu_id, struct ucall *uc) 126 126 { 127 127 struct kvm_run *run = vcpu_state(vm, vcpu_id); 128 - 129 - memset(uc, 0, sizeof(*uc)); 128 + struct ucall ucall = {}; 129 + bool got_ucall = false; 130 130 131 131 #ifdef __x86_64__ 132 132 if (ucall_type == UCALL_PIO && run->exit_reason == KVM_EXIT_IO && 133 133 run->io.port == UCALL_PIO_PORT) { 134 134 struct kvm_regs regs; 135 135 vcpu_regs_get(vm, vcpu_id, &regs); 136 - memcpy(uc, addr_gva2hva(vm, (vm_vaddr_t)regs.rdi), sizeof(*uc)); 137 - return uc->cmd; 136 + memcpy(&ucall, addr_gva2hva(vm, (vm_vaddr_t)regs.rdi), sizeof(ucall)); 137 + got_ucall = true; 138 138 } 139 139 #endif 140 140 if (ucall_type == UCALL_MMIO && run->exit_reason == KVM_EXIT_MMIO && ··· 143 143 TEST_ASSERT(run->mmio.is_write && run->mmio.len == 8, 144 144 "Unexpected ucall exit mmio address access"); 145 145 memcpy(&gva, run->mmio.data, sizeof(gva)); 146 - memcpy(uc, addr_gva2hva(vm, gva), sizeof(*uc)); 146 + memcpy(&ucall, addr_gva2hva(vm, gva), sizeof(ucall)); 147 + got_ucall = true; 147 148 } 148 149 149 - return uc->cmd; 150 + if (got_ucall) { 151 + vcpu_run_complete_io(vm, vcpu_id); 152 + if (uc) 153 + memcpy(uc, &ucall, sizeof(ucall)); 154 + } 155 + 156 + return ucall.cmd; 150 157 }
+3 -2
tools/testing/selftests/kvm/lib/x86_64/processor.c
··· 609 609 kvm_seg_fill_gdt_64bit(vm, segp); 610 610 } 611 611 612 - void vcpu_setup(struct kvm_vm *vm, int vcpuid, int pgd_memslot, int gdt_memslot) 612 + static void vcpu_setup(struct kvm_vm *vm, int vcpuid, int pgd_memslot, int gdt_memslot) 613 613 { 614 614 struct kvm_sregs sregs; 615 615 ··· 655 655 DEFAULT_GUEST_STACK_VADDR_MIN, 0, 0); 656 656 657 657 /* Create VCPU */ 658 - vm_vcpu_add(vm, vcpuid, 0, 0); 658 + vm_vcpu_add(vm, vcpuid); 659 + vcpu_setup(vm, vcpuid, 0, 0); 659 660 660 661 /* Setup guest general purpose registers */ 661 662 vcpu_regs_get(vm, vcpuid, &regs);
+1 -1
tools/testing/selftests/kvm/x86_64/evmcs_test.c
··· 144 144 145 145 /* Restore state in a new VM. */ 146 146 kvm_vm_restart(vm, O_RDWR); 147 - vm_vcpu_add(vm, VCPU_ID, 0, 0); 147 + vm_vcpu_add(vm, VCPU_ID); 148 148 vcpu_set_cpuid(vm, VCPU_ID, kvm_get_supported_cpuid()); 149 149 vcpu_ioctl(vm, VCPU_ID, KVM_ENABLE_CAP, &enable_evmcs_cap); 150 150 vcpu_load_state(vm, VCPU_ID, state);
+1 -1
tools/testing/selftests/kvm/x86_64/kvm_create_max_vcpus.c
··· 33 33 int vcpu_id = first_vcpu_id + i; 34 34 35 35 /* This asserts that the vCPU was created. */ 36 - vm_vcpu_add(vm, vcpu_id, 0, 0); 36 + vm_vcpu_add(vm, vcpu_id); 37 37 } 38 38 39 39 kvm_vm_free(vm);
+1 -1
tools/testing/selftests/kvm/x86_64/smm_test.c
··· 144 144 state = vcpu_save_state(vm, VCPU_ID); 145 145 kvm_vm_release(vm); 146 146 kvm_vm_restart(vm, O_RDWR); 147 - vm_vcpu_add(vm, VCPU_ID, 0, 0); 147 + vm_vcpu_add(vm, VCPU_ID); 148 148 vcpu_set_cpuid(vm, VCPU_ID, kvm_get_supported_cpuid()); 149 149 vcpu_load_state(vm, VCPU_ID, state); 150 150 run = vcpu_state(vm, VCPU_ID);
+1 -1
tools/testing/selftests/kvm/x86_64/state_test.c
··· 176 176 177 177 /* Restore state in a new VM. */ 178 178 kvm_vm_restart(vm, O_RDWR); 179 - vm_vcpu_add(vm, VCPU_ID, 0, 0); 179 + vm_vcpu_add(vm, VCPU_ID); 180 180 vcpu_set_cpuid(vm, VCPU_ID, kvm_get_supported_cpuid()); 181 181 vcpu_load_state(vm, VCPU_ID, state); 182 182 run = vcpu_state(vm, VCPU_ID);
+12 -12
virt/kvm/arm/arch_timer.c
··· 237 237 238 238 switch (index) { 239 239 case TIMER_VTIMER: 240 - cnt_ctl = read_sysreg_el0(cntv_ctl); 240 + cnt_ctl = read_sysreg_el0(SYS_CNTV_CTL); 241 241 break; 242 242 case TIMER_PTIMER: 243 - cnt_ctl = read_sysreg_el0(cntp_ctl); 243 + cnt_ctl = read_sysreg_el0(SYS_CNTP_CTL); 244 244 break; 245 245 case NR_KVM_TIMERS: 246 246 /* GCC is braindead */ ··· 350 350 351 351 switch (index) { 352 352 case TIMER_VTIMER: 353 - ctx->cnt_ctl = read_sysreg_el0(cntv_ctl); 354 - ctx->cnt_cval = read_sysreg_el0(cntv_cval); 353 + ctx->cnt_ctl = read_sysreg_el0(SYS_CNTV_CTL); 354 + ctx->cnt_cval = read_sysreg_el0(SYS_CNTV_CVAL); 355 355 356 356 /* Disable the timer */ 357 - write_sysreg_el0(0, cntv_ctl); 357 + write_sysreg_el0(0, SYS_CNTV_CTL); 358 358 isb(); 359 359 360 360 break; 361 361 case TIMER_PTIMER: 362 - ctx->cnt_ctl = read_sysreg_el0(cntp_ctl); 363 - ctx->cnt_cval = read_sysreg_el0(cntp_cval); 362 + ctx->cnt_ctl = read_sysreg_el0(SYS_CNTP_CTL); 363 + ctx->cnt_cval = read_sysreg_el0(SYS_CNTP_CVAL); 364 364 365 365 /* Disable the timer */ 366 - write_sysreg_el0(0, cntp_ctl); 366 + write_sysreg_el0(0, SYS_CNTP_CTL); 367 367 isb(); 368 368 369 369 break; ··· 429 429 430 430 switch (index) { 431 431 case TIMER_VTIMER: 432 - write_sysreg_el0(ctx->cnt_cval, cntv_cval); 432 + write_sysreg_el0(ctx->cnt_cval, SYS_CNTV_CVAL); 433 433 isb(); 434 - write_sysreg_el0(ctx->cnt_ctl, cntv_ctl); 434 + write_sysreg_el0(ctx->cnt_ctl, SYS_CNTV_CTL); 435 435 break; 436 436 case TIMER_PTIMER: 437 - write_sysreg_el0(ctx->cnt_cval, cntp_cval); 437 + write_sysreg_el0(ctx->cnt_cval, SYS_CNTP_CVAL); 438 438 isb(); 439 - write_sysreg_el0(ctx->cnt_ctl, cntp_ctl); 439 + write_sysreg_el0(ctx->cnt_ctl, SYS_CNTP_CTL); 440 440 break; 441 441 case NR_KVM_TIMERS: 442 442 BUG();
+4 -3
virt/kvm/arm/arm.c
··· 93 93 return 0; 94 94 } 95 95 96 - void kvm_arch_check_processor_compat(void *rtn) 96 + int kvm_arch_check_processor_compat(void) 97 97 { 98 - *(int *)rtn = 0; 98 + return 0; 99 99 } 100 100 101 101 ··· 1332 1332 1333 1333 static void cpu_hyp_reinit(void) 1334 1334 { 1335 + kvm_init_host_cpu_context(&this_cpu_ptr(&kvm_host_data)->host_ctxt); 1336 + 1335 1337 cpu_hyp_reset(); 1336 1338 1337 1339 if (is_kernel_in_hyp_mode()) ··· 1571 1569 kvm_host_data_t *cpu_data; 1572 1570 1573 1571 cpu_data = per_cpu_ptr(&kvm_host_data, cpu); 1574 - kvm_init_host_cpu_context(&cpu_data->host_ctxt, cpu); 1575 1572 err = create_hyp_mappings(cpu_data, cpu_data + 1, PAGE_HYP); 1576 1573 1577 1574 if (err) {
+303 -87
virt/kvm/arm/pmu.c
··· 13 13 #include <kvm/arm_pmu.h> 14 14 #include <kvm/arm_vgic.h> 15 15 16 + static void kvm_pmu_create_perf_event(struct kvm_vcpu *vcpu, u64 select_idx); 17 + 18 + #define PERF_ATTR_CFG1_KVM_PMU_CHAINED 0x1 19 + 20 + /** 21 + * kvm_pmu_idx_is_64bit - determine if select_idx is a 64bit counter 22 + * @vcpu: The vcpu pointer 23 + * @select_idx: The counter index 24 + */ 25 + static bool kvm_pmu_idx_is_64bit(struct kvm_vcpu *vcpu, u64 select_idx) 26 + { 27 + return (select_idx == ARMV8_PMU_CYCLE_IDX && 28 + __vcpu_sys_reg(vcpu, PMCR_EL0) & ARMV8_PMU_PMCR_LC); 29 + } 30 + 31 + static struct kvm_vcpu *kvm_pmc_to_vcpu(struct kvm_pmc *pmc) 32 + { 33 + struct kvm_pmu *pmu; 34 + struct kvm_vcpu_arch *vcpu_arch; 35 + 36 + pmc -= pmc->idx; 37 + pmu = container_of(pmc, struct kvm_pmu, pmc[0]); 38 + vcpu_arch = container_of(pmu, struct kvm_vcpu_arch, pmu); 39 + return container_of(vcpu_arch, struct kvm_vcpu, arch); 40 + } 41 + 42 + /** 43 + * kvm_pmu_pmc_is_chained - determine if the pmc is chained 44 + * @pmc: The PMU counter pointer 45 + */ 46 + static bool kvm_pmu_pmc_is_chained(struct kvm_pmc *pmc) 47 + { 48 + struct kvm_vcpu *vcpu = kvm_pmc_to_vcpu(pmc); 49 + 50 + return test_bit(pmc->idx >> 1, vcpu->arch.pmu.chained); 51 + } 52 + 53 + /** 54 + * kvm_pmu_idx_is_high_counter - determine if select_idx is a high/low counter 55 + * @select_idx: The counter index 56 + */ 57 + static bool kvm_pmu_idx_is_high_counter(u64 select_idx) 58 + { 59 + return select_idx & 0x1; 60 + } 61 + 62 + /** 63 + * kvm_pmu_get_canonical_pmc - obtain the canonical pmc 64 + * @pmc: The PMU counter pointer 65 + * 66 + * When a pair of PMCs are chained together we use the low counter (canonical) 67 + * to hold the underlying perf event. 
68 + */ 69 + static struct kvm_pmc *kvm_pmu_get_canonical_pmc(struct kvm_pmc *pmc) 70 + { 71 + if (kvm_pmu_pmc_is_chained(pmc) && 72 + kvm_pmu_idx_is_high_counter(pmc->idx)) 73 + return pmc - 1; 74 + 75 + return pmc; 76 + } 77 + 78 + /** 79 + * kvm_pmu_idx_has_chain_evtype - determine if the event type is chain 80 + * @vcpu: The vcpu pointer 81 + * @select_idx: The counter index 82 + */ 83 + static bool kvm_pmu_idx_has_chain_evtype(struct kvm_vcpu *vcpu, u64 select_idx) 84 + { 85 + u64 eventsel, reg; 86 + 87 + select_idx |= 0x1; 88 + 89 + if (select_idx == ARMV8_PMU_CYCLE_IDX) 90 + return false; 91 + 92 + reg = PMEVTYPER0_EL0 + select_idx; 93 + eventsel = __vcpu_sys_reg(vcpu, reg) & ARMV8_PMU_EVTYPE_EVENT; 94 + 95 + return eventsel == ARMV8_PMUV3_PERFCTR_CHAIN; 96 + } 97 + 98 + /** 99 + * kvm_pmu_get_pair_counter_value - get PMU counter value 100 + * @vcpu: The vcpu pointer 101 + * @pmc: The PMU counter pointer 102 + */ 103 + static u64 kvm_pmu_get_pair_counter_value(struct kvm_vcpu *vcpu, 104 + struct kvm_pmc *pmc) 105 + { 106 + u64 counter, counter_high, reg, enabled, running; 107 + 108 + if (kvm_pmu_pmc_is_chained(pmc)) { 109 + pmc = kvm_pmu_get_canonical_pmc(pmc); 110 + reg = PMEVCNTR0_EL0 + pmc->idx; 111 + 112 + counter = __vcpu_sys_reg(vcpu, reg); 113 + counter_high = __vcpu_sys_reg(vcpu, reg + 1); 114 + 115 + counter = lower_32_bits(counter) | (counter_high << 32); 116 + } else { 117 + reg = (pmc->idx == ARMV8_PMU_CYCLE_IDX) 118 + ? PMCCNTR_EL0 : PMEVCNTR0_EL0 + pmc->idx; 119 + counter = __vcpu_sys_reg(vcpu, reg); 120 + } 121 + 122 + /* 123 + * The real counter value is equal to the value of counter register plus 124 + * the value perf event counts. 
125 + */ 126 + if (pmc->perf_event) 127 + counter += perf_event_read_value(pmc->perf_event, &enabled, 128 + &running); 129 + 130 + return counter; 131 + } 132 + 16 133 /** 17 134 * kvm_pmu_get_counter_value - get PMU counter value 18 135 * @vcpu: The vcpu pointer ··· 137 20 */ 138 21 u64 kvm_pmu_get_counter_value(struct kvm_vcpu *vcpu, u64 select_idx) 139 22 { 140 - u64 counter, reg, enabled, running; 23 + u64 counter; 141 24 struct kvm_pmu *pmu = &vcpu->arch.pmu; 142 25 struct kvm_pmc *pmc = &pmu->pmc[select_idx]; 143 26 144 - reg = (select_idx == ARMV8_PMU_CYCLE_IDX) 145 - ? PMCCNTR_EL0 : PMEVCNTR0_EL0 + select_idx; 146 - counter = __vcpu_sys_reg(vcpu, reg); 27 + counter = kvm_pmu_get_pair_counter_value(vcpu, pmc); 147 28 148 - /* The real counter value is equal to the value of counter register plus 149 - * the value perf event counts. 150 - */ 151 - if (pmc->perf_event) 152 - counter += perf_event_read_value(pmc->perf_event, &enabled, 153 - &running); 29 + if (kvm_pmu_pmc_is_chained(pmc) && 30 + kvm_pmu_idx_is_high_counter(select_idx)) 31 + counter = upper_32_bits(counter); 154 32 155 - return counter & pmc->bitmask; 33 + else if (!kvm_pmu_idx_is_64bit(vcpu, select_idx)) 34 + counter = lower_32_bits(counter); 35 + 36 + return counter; 156 37 } 157 38 158 39 /** ··· 166 51 reg = (select_idx == ARMV8_PMU_CYCLE_IDX) 167 52 ? 
PMCCNTR_EL0 : PMEVCNTR0_EL0 + select_idx; 168 53 __vcpu_sys_reg(vcpu, reg) += (s64)val - kvm_pmu_get_counter_value(vcpu, select_idx); 54 + 55 + /* Recreate the perf event to reflect the updated sample_period */ 56 + kvm_pmu_create_perf_event(vcpu, select_idx); 57 + } 58 + 59 + /** 60 + * kvm_pmu_release_perf_event - remove the perf event 61 + * @pmc: The PMU counter pointer 62 + */ 63 + static void kvm_pmu_release_perf_event(struct kvm_pmc *pmc) 64 + { 65 + pmc = kvm_pmu_get_canonical_pmc(pmc); 66 + if (pmc->perf_event) { 67 + perf_event_disable(pmc->perf_event); 68 + perf_event_release_kernel(pmc->perf_event); 69 + pmc->perf_event = NULL; 70 + } 169 71 } 170 72 171 73 /** ··· 195 63 { 196 64 u64 counter, reg; 197 65 198 - if (pmc->perf_event) { 199 - counter = kvm_pmu_get_counter_value(vcpu, pmc->idx); 66 + pmc = kvm_pmu_get_canonical_pmc(pmc); 67 + if (!pmc->perf_event) 68 + return; 69 + 70 + counter = kvm_pmu_get_pair_counter_value(vcpu, pmc); 71 + 72 + if (kvm_pmu_pmc_is_chained(pmc)) { 73 + reg = PMEVCNTR0_EL0 + pmc->idx; 74 + __vcpu_sys_reg(vcpu, reg) = lower_32_bits(counter); 75 + __vcpu_sys_reg(vcpu, reg + 1) = upper_32_bits(counter); 76 + } else { 200 77 reg = (pmc->idx == ARMV8_PMU_CYCLE_IDX) 201 78 ? 
PMCCNTR_EL0 : PMEVCNTR0_EL0 + pmc->idx; 202 - __vcpu_sys_reg(vcpu, reg) = counter; 203 - perf_event_disable(pmc->perf_event); 204 - perf_event_release_kernel(pmc->perf_event); 205 - pmc->perf_event = NULL; 79 + __vcpu_sys_reg(vcpu, reg) = lower_32_bits(counter); 206 80 } 81 + 82 + kvm_pmu_release_perf_event(pmc); 207 83 } 208 84 209 85 /** ··· 227 87 for (i = 0; i < ARMV8_PMU_MAX_COUNTERS; i++) { 228 88 kvm_pmu_stop_counter(vcpu, &pmu->pmc[i]); 229 89 pmu->pmc[i].idx = i; 230 - pmu->pmc[i].bitmask = 0xffffffffUL; 231 90 } 91 + 92 + bitmap_zero(vcpu->arch.pmu.chained, ARMV8_PMU_MAX_COUNTER_PAIRS); 232 93 } 233 94 234 95 /** ··· 242 101 int i; 243 102 struct kvm_pmu *pmu = &vcpu->arch.pmu; 244 103 245 - for (i = 0; i < ARMV8_PMU_MAX_COUNTERS; i++) { 246 - struct kvm_pmc *pmc = &pmu->pmc[i]; 247 - 248 - if (pmc->perf_event) { 249 - perf_event_disable(pmc->perf_event); 250 - perf_event_release_kernel(pmc->perf_event); 251 - pmc->perf_event = NULL; 252 - } 253 - } 104 + for (i = 0; i < ARMV8_PMU_MAX_COUNTERS; i++) 105 + kvm_pmu_release_perf_event(&pmu->pmc[i]); 254 106 } 255 107 256 108 u64 kvm_pmu_valid_counter_mask(struct kvm_vcpu *vcpu) ··· 258 124 } 259 125 260 126 /** 261 - * kvm_pmu_enable_counter - enable selected PMU counter 127 + * kvm_pmu_enable_counter_mask - enable selected PMU counters 262 128 * @vcpu: The vcpu pointer 263 129 * @val: the value guest writes to PMCNTENSET register 264 130 * 265 131 * Call perf_event_enable to start counting the perf event 266 132 */ 267 - void kvm_pmu_enable_counter(struct kvm_vcpu *vcpu, u64 val) 133 + void kvm_pmu_enable_counter_mask(struct kvm_vcpu *vcpu, u64 val) 268 134 { 269 135 int i; 270 136 struct kvm_pmu *pmu = &vcpu->arch.pmu; ··· 278 144 continue; 279 145 280 146 pmc = &pmu->pmc[i]; 147 + 148 + /* 149 + * For high counters of chained events we must recreate the 150 + * perf event with the long (64bit) attribute set. 
151 + */ 152 + if (kvm_pmu_pmc_is_chained(pmc) && 153 + kvm_pmu_idx_is_high_counter(i)) { 154 + kvm_pmu_create_perf_event(vcpu, i); 155 + continue; 156 + } 157 + 158 + /* At this point, pmc must be the canonical */ 281 159 if (pmc->perf_event) { 282 160 perf_event_enable(pmc->perf_event); 283 161 if (pmc->perf_event->state != PERF_EVENT_STATE_ACTIVE) ··· 299 153 } 300 154 301 155 /** 302 - * kvm_pmu_disable_counter - disable selected PMU counter 156 + * kvm_pmu_disable_counter_mask - disable selected PMU counters 303 157 * @vcpu: The vcpu pointer 304 158 * @val: the value guest writes to PMCNTENCLR register 305 159 * 306 160 * Call perf_event_disable to stop counting the perf event 307 161 */ 308 - void kvm_pmu_disable_counter(struct kvm_vcpu *vcpu, u64 val) 162 + void kvm_pmu_disable_counter_mask(struct kvm_vcpu *vcpu, u64 val) 309 163 { 310 164 int i; 311 165 struct kvm_pmu *pmu = &vcpu->arch.pmu; ··· 319 173 continue; 320 174 321 175 pmc = &pmu->pmc[i]; 176 + 177 + /* 178 + * For high counters of chained events we must recreate the 179 + * perf event with the long (64bit) attribute unset. 180 + */ 181 + if (kvm_pmu_pmc_is_chained(pmc) && 182 + kvm_pmu_idx_is_high_counter(i)) { 183 + kvm_pmu_create_perf_event(vcpu, i); 184 + continue; 185 + } 186 + 187 + /* At this point, pmc must be the canonical */ 322 188 if (pmc->perf_event) 323 189 perf_event_disable(pmc->perf_event); 324 190 } ··· 420 262 kvm_pmu_update_state(vcpu); 421 263 } 422 264 423 - static inline struct kvm_vcpu *kvm_pmc_to_vcpu(struct kvm_pmc *pmc) 424 - { 425 - struct kvm_pmu *pmu; 426 - struct kvm_vcpu_arch *vcpu_arch; 427 - 428 - pmc -= pmc->idx; 429 - pmu = container_of(pmc, struct kvm_pmu, pmc[0]); 430 - vcpu_arch = container_of(pmu, struct kvm_vcpu_arch, pmu); 431 - return container_of(vcpu_arch, struct kvm_vcpu, arch); 432 - } 433 - 434 265 /** 435 266 * When the perf event overflows, set the overflow status and inform the vcpu. 
436 267 */ ··· 476 329 */ 477 330 void kvm_pmu_handle_pmcr(struct kvm_vcpu *vcpu, u64 val) 478 331 { 479 - struct kvm_pmu *pmu = &vcpu->arch.pmu; 480 - struct kvm_pmc *pmc; 481 332 u64 mask; 482 333 int i; 483 334 484 335 mask = kvm_pmu_valid_counter_mask(vcpu); 485 336 if (val & ARMV8_PMU_PMCR_E) { 486 - kvm_pmu_enable_counter(vcpu, 337 + kvm_pmu_enable_counter_mask(vcpu, 487 338 __vcpu_sys_reg(vcpu, PMCNTENSET_EL0) & mask); 488 339 } else { 489 - kvm_pmu_disable_counter(vcpu, mask); 340 + kvm_pmu_disable_counter_mask(vcpu, mask); 490 341 } 491 342 492 343 if (val & ARMV8_PMU_PMCR_C) ··· 494 349 for (i = 0; i < ARMV8_PMU_CYCLE_IDX; i++) 495 350 kvm_pmu_set_counter_value(vcpu, i, 0); 496 351 } 497 - 498 - if (val & ARMV8_PMU_PMCR_LC) { 499 - pmc = &pmu->pmc[ARMV8_PMU_CYCLE_IDX]; 500 - pmc->bitmask = 0xffffffffffffffffUL; 501 - } 502 352 } 503 353 504 354 static bool kvm_pmu_counter_is_enabled(struct kvm_vcpu *vcpu, u64 select_idx) 505 355 { 506 356 return (__vcpu_sys_reg(vcpu, PMCR_EL0) & ARMV8_PMU_PMCR_E) && 507 357 (__vcpu_sys_reg(vcpu, PMCNTENSET_EL0) & BIT(select_idx)); 358 + } 359 + 360 + /** 361 + * kvm_pmu_create_perf_event - create a perf event for a counter 362 + * @vcpu: The vcpu pointer 363 + * @select_idx: The number of selected counter 364 + */ 365 + static void kvm_pmu_create_perf_event(struct kvm_vcpu *vcpu, u64 select_idx) 366 + { 367 + struct kvm_pmu *pmu = &vcpu->arch.pmu; 368 + struct kvm_pmc *pmc; 369 + struct perf_event *event; 370 + struct perf_event_attr attr; 371 + u64 eventsel, counter, reg, data; 372 + 373 + /* 374 + * For chained counters the event type and filtering attributes are 375 + * obtained from the low/even counter. We also use this counter to 376 + * determine if the event is enabled/disabled. 377 + */ 378 + pmc = kvm_pmu_get_canonical_pmc(&pmu->pmc[select_idx]); 379 + 380 + reg = (pmc->idx == ARMV8_PMU_CYCLE_IDX) 381 + ? 
PMCCFILTR_EL0 : PMEVTYPER0_EL0 + pmc->idx; 382 + data = __vcpu_sys_reg(vcpu, reg); 383 + 384 + kvm_pmu_stop_counter(vcpu, pmc); 385 + eventsel = data & ARMV8_PMU_EVTYPE_EVENT; 386 + 387 + /* Software increment event does't need to be backed by a perf event */ 388 + if (eventsel == ARMV8_PMUV3_PERFCTR_SW_INCR && 389 + pmc->idx != ARMV8_PMU_CYCLE_IDX) 390 + return; 391 + 392 + memset(&attr, 0, sizeof(struct perf_event_attr)); 393 + attr.type = PERF_TYPE_RAW; 394 + attr.size = sizeof(attr); 395 + attr.pinned = 1; 396 + attr.disabled = !kvm_pmu_counter_is_enabled(vcpu, pmc->idx); 397 + attr.exclude_user = data & ARMV8_PMU_EXCLUDE_EL0 ? 1 : 0; 398 + attr.exclude_kernel = data & ARMV8_PMU_EXCLUDE_EL1 ? 1 : 0; 399 + attr.exclude_hv = 1; /* Don't count EL2 events */ 400 + attr.exclude_host = 1; /* Don't count host events */ 401 + attr.config = (pmc->idx == ARMV8_PMU_CYCLE_IDX) ? 402 + ARMV8_PMUV3_PERFCTR_CPU_CYCLES : eventsel; 403 + 404 + counter = kvm_pmu_get_pair_counter_value(vcpu, pmc); 405 + 406 + if (kvm_pmu_idx_has_chain_evtype(vcpu, pmc->idx)) { 407 + /** 408 + * The initial sample period (overflow count) of an event. For 409 + * chained counters we only support overflow interrupts on the 410 + * high counter. 411 + */ 412 + attr.sample_period = (-counter) & GENMASK(63, 0); 413 + event = perf_event_create_kernel_counter(&attr, -1, current, 414 + kvm_pmu_perf_overflow, 415 + pmc + 1); 416 + 417 + if (kvm_pmu_counter_is_enabled(vcpu, pmc->idx + 1)) 418 + attr.config1 |= PERF_ATTR_CFG1_KVM_PMU_CHAINED; 419 + } else { 420 + /* The initial sample period (overflow count) of an event. 
*/ 421 + if (kvm_pmu_idx_is_64bit(vcpu, pmc->idx)) 422 + attr.sample_period = (-counter) & GENMASK(63, 0); 423 + else 424 + attr.sample_period = (-counter) & GENMASK(31, 0); 425 + 426 + event = perf_event_create_kernel_counter(&attr, -1, current, 427 + kvm_pmu_perf_overflow, pmc); 428 + } 429 + 430 + if (IS_ERR(event)) { 431 + pr_err_once("kvm: pmu event creation failed %ld\n", 432 + PTR_ERR(event)); 433 + return; 434 + } 435 + 436 + pmc->perf_event = event; 437 + } 438 + 439 + /** 440 + * kvm_pmu_update_pmc_chained - update chained bitmap 441 + * @vcpu: The vcpu pointer 442 + * @select_idx: The number of selected counter 443 + * 444 + * Update the chained bitmap based on the event type written in the 445 + * typer register. 446 + */ 447 + static void kvm_pmu_update_pmc_chained(struct kvm_vcpu *vcpu, u64 select_idx) 448 + { 449 + struct kvm_pmu *pmu = &vcpu->arch.pmu; 450 + struct kvm_pmc *pmc = &pmu->pmc[select_idx]; 451 + 452 + if (kvm_pmu_idx_has_chain_evtype(vcpu, pmc->idx)) { 453 + /* 454 + * During promotion from !chained to chained we must ensure 455 + * the adjacent counter is stopped and its event destroyed 456 + */ 457 + if (!kvm_pmu_pmc_is_chained(pmc)) 458 + kvm_pmu_stop_counter(vcpu, pmc); 459 + 460 + set_bit(pmc->idx >> 1, vcpu->arch.pmu.chained); 461 + } else { 462 + clear_bit(pmc->idx >> 1, vcpu->arch.pmu.chained); 463 + } 508 464 } 509 465 510 466 /** ··· 621 375 void kvm_pmu_set_counter_event_type(struct kvm_vcpu *vcpu, u64 data, 622 376 u64 select_idx) 623 377 { 624 - struct kvm_pmu *pmu = &vcpu->arch.pmu; 625 - struct kvm_pmc *pmc = &pmu->pmc[select_idx]; 626 - struct perf_event *event; 627 - struct perf_event_attr attr; 628 - u64 eventsel, counter; 378 + u64 reg, event_type = data & ARMV8_PMU_EVTYPE_MASK; 629 379 630 - kvm_pmu_stop_counter(vcpu, pmc); 631 - eventsel = data & ARMV8_PMU_EVTYPE_EVENT; 380 + reg = (select_idx == ARMV8_PMU_CYCLE_IDX) 381 + ? 
PMCCFILTR_EL0 : PMEVTYPER0_EL0 + select_idx; 632 382 633 - /* Software increment event does't need to be backed by a perf event */ 634 - if (eventsel == ARMV8_PMUV3_PERFCTR_SW_INCR && 635 - select_idx != ARMV8_PMU_CYCLE_IDX) 636 - return; 383 + __vcpu_sys_reg(vcpu, reg) = event_type; 637 384 638 - memset(&attr, 0, sizeof(struct perf_event_attr)); 639 - attr.type = PERF_TYPE_RAW; 640 - attr.size = sizeof(attr); 641 - attr.pinned = 1; 642 - attr.disabled = !kvm_pmu_counter_is_enabled(vcpu, select_idx); 643 - attr.exclude_user = data & ARMV8_PMU_EXCLUDE_EL0 ? 1 : 0; 644 - attr.exclude_kernel = data & ARMV8_PMU_EXCLUDE_EL1 ? 1 : 0; 645 - attr.exclude_hv = 1; /* Don't count EL2 events */ 646 - attr.exclude_host = 1; /* Don't count host events */ 647 - attr.config = (select_idx == ARMV8_PMU_CYCLE_IDX) ? 648 - ARMV8_PMUV3_PERFCTR_CPU_CYCLES : eventsel; 649 - 650 - counter = kvm_pmu_get_counter_value(vcpu, select_idx); 651 - /* The initial sample period (overflow count) of an event. */ 652 - attr.sample_period = (-counter) & pmc->bitmask; 653 - 654 - event = perf_event_create_kernel_counter(&attr, -1, current, 655 - kvm_pmu_perf_overflow, pmc); 656 - if (IS_ERR(event)) { 657 - pr_err_once("kvm: pmu event creation failed %ld\n", 658 - PTR_ERR(event)); 659 - return; 660 - } 661 - 662 - pmc->perf_event = event; 385 + kvm_pmu_update_pmc_chained(vcpu, select_idx); 386 + kvm_pmu_create_perf_event(vcpu, select_idx); 663 387 } 664 388 665 389 bool kvm_arm_support_pmu_v3(void)
+136 -19
virt/kvm/arm/psci.c
··· 401 401 feature = smccc_get_arg1(vcpu); 402 402 switch(feature) { 403 403 case ARM_SMCCC_ARCH_WORKAROUND_1: 404 - if (kvm_arm_harden_branch_predictor()) 404 + switch (kvm_arm_harden_branch_predictor()) { 405 + case KVM_BP_HARDEN_UNKNOWN: 406 + break; 407 + case KVM_BP_HARDEN_WA_NEEDED: 405 408 val = SMCCC_RET_SUCCESS; 409 + break; 410 + case KVM_BP_HARDEN_NOT_REQUIRED: 411 + val = SMCCC_RET_NOT_REQUIRED; 412 + break; 413 + } 406 414 break; 407 415 case ARM_SMCCC_ARCH_WORKAROUND_2: 408 416 switch (kvm_arm_have_ssbd()) { ··· 438 430 439 431 int kvm_arm_get_fw_num_regs(struct kvm_vcpu *vcpu) 440 432 { 441 - return 1; /* PSCI version */ 433 + return 3; /* PSCI version and two workaround registers */ 442 434 } 443 435 444 436 int kvm_arm_copy_fw_reg_indices(struct kvm_vcpu *vcpu, u64 __user *uindices) 445 437 { 446 - if (put_user(KVM_REG_ARM_PSCI_VERSION, uindices)) 438 + if (put_user(KVM_REG_ARM_PSCI_VERSION, uindices++)) 439 + return -EFAULT; 440 + 441 + if (put_user(KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1, uindices++)) 442 + return -EFAULT; 443 + 444 + if (put_user(KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2, uindices++)) 447 445 return -EFAULT; 448 446 449 447 return 0; 450 448 } 451 449 452 - int kvm_arm_get_fw_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg) 450 + #define KVM_REG_FEATURE_LEVEL_WIDTH 4 451 + #define KVM_REG_FEATURE_LEVEL_MASK (BIT(KVM_REG_FEATURE_LEVEL_WIDTH) - 1) 452 + 453 + /* 454 + * Convert the workaround level into an easy-to-compare number, where higher 455 + * values mean better protection. 
456 + */ 457 + static int get_kernel_wa_level(u64 regid) 453 458 { 454 - if (reg->id == KVM_REG_ARM_PSCI_VERSION) { 455 - void __user *uaddr = (void __user *)(long)reg->addr; 456 - u64 val; 457 - 458 - val = kvm_psci_version(vcpu, vcpu->kvm); 459 - if (copy_to_user(uaddr, &val, KVM_REG_SIZE(reg->id))) 460 - return -EFAULT; 461 - 462 - return 0; 459 + switch (regid) { 460 + case KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1: 461 + switch (kvm_arm_harden_branch_predictor()) { 462 + case KVM_BP_HARDEN_UNKNOWN: 463 + return KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_NOT_AVAIL; 464 + case KVM_BP_HARDEN_WA_NEEDED: 465 + return KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_AVAIL; 466 + case KVM_BP_HARDEN_NOT_REQUIRED: 467 + return KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_NOT_REQUIRED; 468 + } 469 + return KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_NOT_AVAIL; 470 + case KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2: 471 + switch (kvm_arm_have_ssbd()) { 472 + case KVM_SSBD_FORCE_DISABLE: 473 + return KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_NOT_AVAIL; 474 + case KVM_SSBD_KERNEL: 475 + return KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_AVAIL; 476 + case KVM_SSBD_FORCE_ENABLE: 477 + case KVM_SSBD_MITIGATED: 478 + return KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_NOT_REQUIRED; 479 + case KVM_SSBD_UNKNOWN: 480 + default: 481 + return KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_UNKNOWN; 482 + } 463 483 } 464 484 465 485 return -EINVAL; 466 486 } 467 487 488 + int kvm_arm_get_fw_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg) 489 + { 490 + void __user *uaddr = (void __user *)(long)reg->addr; 491 + u64 val; 492 + 493 + switch (reg->id) { 494 + case KVM_REG_ARM_PSCI_VERSION: 495 + val = kvm_psci_version(vcpu, vcpu->kvm); 496 + break; 497 + case KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1: 498 + val = get_kernel_wa_level(reg->id) & KVM_REG_FEATURE_LEVEL_MASK; 499 + break; 500 + case KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2: 501 + val = get_kernel_wa_level(reg->id) & KVM_REG_FEATURE_LEVEL_MASK; 502 + 503 + if (val == KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_AVAIL && 
504 + kvm_arm_get_vcpu_workaround_2_flag(vcpu)) 505 + val |= KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_ENABLED; 506 + break; 507 + default: 508 + return -ENOENT; 509 + } 510 + 511 + if (copy_to_user(uaddr, &val, KVM_REG_SIZE(reg->id))) 512 + return -EFAULT; 513 + 514 + return 0; 515 + } 516 + 468 517 int kvm_arm_set_fw_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg) 469 518 { 470 - if (reg->id == KVM_REG_ARM_PSCI_VERSION) { 471 - void __user *uaddr = (void __user *)(long)reg->addr; 472 - bool wants_02; 473 - u64 val; 519 + void __user *uaddr = (void __user *)(long)reg->addr; 520 + u64 val; 521 + int wa_level; 474 522 475 - if (copy_from_user(&val, uaddr, KVM_REG_SIZE(reg->id))) 476 - return -EFAULT; 523 + if (copy_from_user(&val, uaddr, KVM_REG_SIZE(reg->id))) 524 + return -EFAULT; 525 + 526 + switch (reg->id) { 527 + case KVM_REG_ARM_PSCI_VERSION: 528 + { 529 + bool wants_02; 477 530 478 531 wants_02 = test_bit(KVM_ARM_VCPU_PSCI_0_2, vcpu->arch.features); 479 532 ··· 551 482 vcpu->kvm->arch.psci_version = val; 552 483 return 0; 553 484 } 485 + break; 486 + } 487 + 488 + case KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1: 489 + if (val & ~KVM_REG_FEATURE_LEVEL_MASK) 490 + return -EINVAL; 491 + 492 + if (get_kernel_wa_level(reg->id) < val) 493 + return -EINVAL; 494 + 495 + return 0; 496 + 497 + case KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2: 498 + if (val & ~(KVM_REG_FEATURE_LEVEL_MASK | 499 + KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_ENABLED)) 500 + return -EINVAL; 501 + 502 + wa_level = val & KVM_REG_FEATURE_LEVEL_MASK; 503 + 504 + if (get_kernel_wa_level(reg->id) < wa_level) 505 + return -EINVAL; 506 + 507 + /* The enabled bit must not be set unless the level is AVAIL. */ 508 + if (wa_level != KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_AVAIL && 509 + wa_level != val) 510 + return -EINVAL; 511 + 512 + /* Are we finished or do we need to check the enable bit ? 
*/ 513 + if (kvm_arm_have_ssbd() != KVM_SSBD_KERNEL) 514 + return 0; 515 + 516 + /* 517 + * If this kernel supports the workaround to be switched on 518 + * or off, make sure it matches the requested setting. 519 + */ 520 + switch (wa_level) { 521 + case KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_AVAIL: 522 + kvm_arm_set_vcpu_workaround_2_flag(vcpu, 523 + val & KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_ENABLED); 524 + break; 525 + case KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_NOT_REQUIRED: 526 + kvm_arm_set_vcpu_workaround_2_flag(vcpu, true); 527 + break; 528 + } 529 + 530 + return 0; 531 + default: 532 + return -ENOENT; 554 533 } 555 534 556 535 return -EINVAL;
+1 -3
virt/kvm/irqchip.c
··· 184 184 185 185 nr_rt_entries += 1; 186 186 187 - new = kzalloc(sizeof(*new) + (nr_rt_entries * sizeof(struct hlist_head)), 188 - GFP_KERNEL_ACCOUNT); 189 - 187 + new = kzalloc(struct_size(new, map, nr_rt_entries), GFP_KERNEL_ACCOUNT); 190 188 if (!new) 191 189 return -ENOMEM; 192 190
+22 -19
virt/kvm/kvm_main.c
··· 95 95 * kvm->lock --> kvm->slots_lock --> kvm->irq_lock 96 96 */ 97 97 98 - DEFINE_SPINLOCK(kvm_lock); 98 + DEFINE_MUTEX(kvm_lock); 99 99 static DEFINE_RAW_SPINLOCK(kvm_count_lock); 100 100 LIST_HEAD(vm_list); 101 101 ··· 680 680 if (r) 681 681 goto out_err; 682 682 683 - spin_lock(&kvm_lock); 683 + mutex_lock(&kvm_lock); 684 684 list_add(&kvm->vm_list, &vm_list); 685 - spin_unlock(&kvm_lock); 685 + mutex_unlock(&kvm_lock); 686 686 687 687 preempt_notifier_inc(); 688 688 ··· 728 728 kvm_uevent_notify_change(KVM_EVENT_DESTROY_VM, kvm); 729 729 kvm_destroy_vm_debugfs(kvm); 730 730 kvm_arch_sync_events(kvm); 731 - spin_lock(&kvm_lock); 731 + mutex_lock(&kvm_lock); 732 732 list_del(&kvm->vm_list); 733 - spin_unlock(&kvm_lock); 733 + mutex_unlock(&kvm_lock); 734 734 kvm_free_irq_routing(kvm); 735 735 for (i = 0; i < KVM_NR_BUSES; i++) { 736 736 struct kvm_io_bus *bus = kvm_get_bus(kvm, i); ··· 1790 1790 if (!map->hva) 1791 1791 return; 1792 1792 1793 - if (map->page) 1793 + if (map->page != KVM_UNMAPPED_PAGE) 1794 1794 kunmap(map->page); 1795 1795 #ifdef CONFIG_HAS_IOMEM 1796 1796 else ··· 4031 4031 u64 tmp_val; 4032 4032 4033 4033 *val = 0; 4034 - spin_lock(&kvm_lock); 4034 + mutex_lock(&kvm_lock); 4035 4035 list_for_each_entry(kvm, &vm_list, vm_list) { 4036 4036 stat_tmp.kvm = kvm; 4037 4037 vm_stat_get_per_vm((void *)&stat_tmp, &tmp_val); 4038 4038 *val += tmp_val; 4039 4039 } 4040 - spin_unlock(&kvm_lock); 4040 + mutex_unlock(&kvm_lock); 4041 4041 return 0; 4042 4042 } 4043 4043 ··· 4050 4050 if (val) 4051 4051 return -EINVAL; 4052 4052 4053 - spin_lock(&kvm_lock); 4053 + mutex_lock(&kvm_lock); 4054 4054 list_for_each_entry(kvm, &vm_list, vm_list) { 4055 4055 stat_tmp.kvm = kvm; 4056 4056 vm_stat_clear_per_vm((void *)&stat_tmp, 0); 4057 4057 } 4058 - spin_unlock(&kvm_lock); 4058 + mutex_unlock(&kvm_lock); 4059 4059 4060 4060 return 0; 4061 4061 } ··· 4070 4070 u64 tmp_val; 4071 4071 4072 4072 *val = 0; 4073 - spin_lock(&kvm_lock); 4073 + mutex_lock(&kvm_lock); 
4074 4074 list_for_each_entry(kvm, &vm_list, vm_list) { 4075 4075 stat_tmp.kvm = kvm; 4076 4076 vcpu_stat_get_per_vm((void *)&stat_tmp, &tmp_val); 4077 4077 *val += tmp_val; 4078 4078 } 4079 - spin_unlock(&kvm_lock); 4079 + mutex_unlock(&kvm_lock); 4080 4080 return 0; 4081 4081 } 4082 4082 ··· 4089 4089 if (val) 4090 4090 return -EINVAL; 4091 4091 4092 - spin_lock(&kvm_lock); 4092 + mutex_lock(&kvm_lock); 4093 4093 list_for_each_entry(kvm, &vm_list, vm_list) { 4094 4094 stat_tmp.kvm = kvm; 4095 4095 vcpu_stat_clear_per_vm((void *)&stat_tmp, 0); 4096 4096 } 4097 - spin_unlock(&kvm_lock); 4097 + mutex_unlock(&kvm_lock); 4098 4098 4099 4099 return 0; 4100 4100 } ··· 4115 4115 if (!kvm_dev.this_device || !kvm) 4116 4116 return; 4117 4117 4118 - spin_lock(&kvm_lock); 4118 + mutex_lock(&kvm_lock); 4119 4119 if (type == KVM_EVENT_CREATE_VM) { 4120 4120 kvm_createvm_count++; 4121 4121 kvm_active_vms++; ··· 4124 4124 } 4125 4125 created = kvm_createvm_count; 4126 4126 active = kvm_active_vms; 4127 - spin_unlock(&kvm_lock); 4127 + mutex_unlock(&kvm_lock); 4128 4128 4129 4129 env = kzalloc(sizeof(*env), GFP_KERNEL_ACCOUNT); 4130 4130 if (!env) ··· 4221 4221 kvm_arch_vcpu_put(vcpu); 4222 4222 } 4223 4223 4224 + static void check_processor_compat(void *rtn) 4225 + { 4226 + *(int *)rtn = kvm_arch_check_processor_compat(); 4227 + } 4228 + 4224 4229 int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align, 4225 4230 struct module *module) 4226 4231 { ··· 4257 4252 goto out_free_0a; 4258 4253 4259 4254 for_each_online_cpu(cpu) { 4260 - smp_call_function_single(cpu, 4261 - kvm_arch_check_processor_compat, 4262 - &r, 1); 4255 + smp_call_function_single(cpu, check_processor_compat, &r, 1); 4263 4256 if (r < 0) 4264 4257 goto out_free_1; 4265 4258 }