Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

KVM: x86: Provide a capability to disable APERF/MPERF read intercepts

Allow a guest to read the physical IA32_APERF and IA32_MPERF MSRs
without interception.

The IA32_APERF and IA32_MPERF MSRs are not virtualized. Writes are not
handled at all. The MSR values are not zeroed on vCPU creation, saved
on suspend, or restored on resume. No accommodation is made for
processor migration or for sharing a logical processor with other
tasks. No adjustments are made for non-unit TSC multipliers. The MSRs
do not account for time the same way as the comparable PMU events,
whether the PMU is virtualized by the traditional emulation method or
the new mediated pass-through approach.

Nonetheless, in a properly constrained environment, this capability
can be combined with a guest CPUID table that advertises support for
CPUID.6:ECX.APERFMPERF[bit 0] to induce a Linux guest to report the
effective physical CPU frequency in /proc/cpuinfo. Moreover, there is
no performance cost for this capability.

Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250530185239.2335185-3-jmattson@google.com
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250626001225.744268-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

Authored by Jim Mattson, committed by Sean Christopherson (a7cec208, 6fbef861)

9 files changed, 53 insertions(+), 2 deletions(-)
Documentation/virt/kvm/api.rst (+23)

```diff
  #define KVM_X86_DISABLE_EXITS_HLT              (1 << 1)
  #define KVM_X86_DISABLE_EXITS_PAUSE            (1 << 2)
  #define KVM_X86_DISABLE_EXITS_CSTATE           (1 << 3)
+ #define KVM_X86_DISABLE_EXITS_APERFMPERF       (1 << 4)

  Enabling this capability on a VM provides userspace with a way to no
  longer intercept some instructions for improved latency in some
  ...
  all such vmexits.

  Do not enable KVM_FEATURE_PV_UNHALT if you disable HLT exits.
+
+ Virtualizing the ``IA32_APERF`` and ``IA32_MPERF`` MSRs requires more
+ than just disabling APERF/MPERF exits. While both Intel and AMD
+ document strict usage conditions for these MSRs--emphasizing that only
+ the ratio of their deltas over a time interval (T0 to T1) is
+ architecturally defined--simply passing through the MSRs can still
+ produce an incorrect ratio.
+
+ This erroneous ratio can occur if, between T0 and T1:
+
+ 1. The vCPU thread migrates between logical processors.
+ 2. Live migration or suspend/resume operations take place.
+ 3. Another task shares the vCPU's logical processor.
+ 4. C-states lower than C0 are emulated (e.g., via HLT interception).
+ 5. The guest TSC frequency doesn't match the host TSC frequency.
+
+ Due to these complexities, KVM does not automatically associate this
+ passthrough capability with the guest CPUID bit,
+ ``CPUID.6:ECX.APERFMPERF[bit 0]``. Userspace VMMs that deem this
+ mechanism adequate for virtualizing the ``IA32_APERF`` and
+ ``IA32_MPERF`` MSRs must set the guest CPUID bit explicitly.
+

  7.14 KVM_CAP_S390_HPAGE_1M
  --------------------------
```
arch/x86/kvm/svm/nested.c (+3 -1)

```diff
  * Hardcode the capacity of the array based on the maximum number of _offsets_.
  * MSRs are batched together, so there are fewer offsets than MSRs.
  */
-static int nested_svm_msrpm_merge_offsets[6] __ro_after_init;
+static int nested_svm_msrpm_merge_offsets[7] __ro_after_init;
 static int nested_svm_nr_msrpm_merge_offsets __ro_after_init;
 typedef unsigned long nsvm_msrpm_merge_t;
...
 	MSR_IA32_SPEC_CTRL,
 	MSR_IA32_PRED_CMD,
 	MSR_IA32_FLUSH_CMD,
+	MSR_IA32_APERF,
+	MSR_IA32_MPERF,
 	MSR_IA32_LASTBRANCHFROMIP,
 	MSR_IA32_LASTBRANCHTOIP,
 	MSR_IA32_LASTINTFROMIP,
```
arch/x86/kvm/svm/svm.c (+5)

```diff
 	svm_set_intercept_for_msr(vcpu, MSR_IA32_SYSENTER_ESP, MSR_TYPE_RW,
 				  guest_cpuid_is_intel_compatible(vcpu));

+	if (kvm_aperfmperf_in_guest(vcpu->kvm)) {
+		svm_disable_intercept_for_msr(vcpu, MSR_IA32_APERF, MSR_TYPE_R);
+		svm_disable_intercept_for_msr(vcpu, MSR_IA32_MPERF, MSR_TYPE_R);
+	}
+
 	if (sev_es_guest(vcpu->kvm))
 		sev_es_recalc_msr_intercepts(vcpu);
```
arch/x86/kvm/vmx/nested.c (+6)

```diff
 	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
 					 MSR_IA32_FLUSH_CMD, MSR_TYPE_W);

+	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+					 MSR_IA32_APERF, MSR_TYPE_R);
+
+	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+					 MSR_IA32_MPERF, MSR_TYPE_R);
+
 	kvm_vcpu_unmap(vcpu, &map);

 	vmx->nested.force_msr_bitmap_recalc = false;
```
arch/x86/kvm/vmx/vmx.c (+4)

```diff
 		vmx_disable_intercept_for_msr(vcpu, MSR_CORE_C6_RESIDENCY, MSR_TYPE_R);
 		vmx_disable_intercept_for_msr(vcpu, MSR_CORE_C7_RESIDENCY, MSR_TYPE_R);
 	}
+	if (kvm_aperfmperf_in_guest(vcpu->kvm)) {
+		vmx_disable_intercept_for_msr(vcpu, MSR_IA32_APERF, MSR_TYPE_R);
+		vmx_disable_intercept_for_msr(vcpu, MSR_IA32_MPERF, MSR_TYPE_R);
+	}

 	/* PT MSRs can be passed through iff PT is exposed to the guest. */
 	if (vmx_pt_mode_is_host_guest())
```
arch/x86/kvm/x86.c (+5 -1)

```diff
 {
 	u64 r = KVM_X86_DISABLE_EXITS_PAUSE;

+	if (boot_cpu_has(X86_FEATURE_APERFMPERF))
+		r |= KVM_X86_DISABLE_EXITS_APERFMPERF;
+
 	if (!mitigate_smt_rsb) {
 		r |= KVM_X86_DISABLE_EXITS_HLT |
 		     KVM_X86_DISABLE_EXITS_CSTATE;
...
 	if (!mitigate_smt_rsb && boot_cpu_has_bug(X86_BUG_SMT_RSB) &&
 	    cpu_smt_possible() &&
-	    (cap->args[0] & ~KVM_X86_DISABLE_EXITS_PAUSE))
+	    (cap->args[0] & ~(KVM_X86_DISABLE_EXITS_PAUSE |
+			      KVM_X86_DISABLE_EXITS_APERFMPERF)))
 		pr_warn_once(SMT_RSB_MSG);

 	kvm_disable_exits(kvm, cap->args[0]);
```
arch/x86/kvm/x86.h (+5)

```diff
 	return kvm->arch.disabled_exits & KVM_X86_DISABLE_EXITS_CSTATE;
 }

+static inline bool kvm_aperfmperf_in_guest(struct kvm *kvm)
+{
+	return kvm->arch.disabled_exits & KVM_X86_DISABLE_EXITS_APERFMPERF;
+}
+
 static inline bool kvm_notify_vmexit_enabled(struct kvm *kvm)
 {
 	return kvm->arch.notify_vmexit_flags & KVM_X86_NOTIFY_VMEXIT_ENABLED;
```
include/uapi/linux/kvm.h (+1)

```diff
 #define KVM_X86_DISABLE_EXITS_HLT            (1 << 1)
 #define KVM_X86_DISABLE_EXITS_PAUSE          (1 << 2)
 #define KVM_X86_DISABLE_EXITS_CSTATE         (1 << 3)
+#define KVM_X86_DISABLE_EXITS_APERFMPERF     (1 << 4)

 /* for KVM_ENABLE_CAP */
 struct kvm_enable_cap {
```
tools/include/uapi/linux/kvm.h (+1)

```diff
 #define KVM_X86_DISABLE_EXITS_HLT            (1 << 1)
 #define KVM_X86_DISABLE_EXITS_PAUSE          (1 << 2)
 #define KVM_X86_DISABLE_EXITS_CSTATE         (1 << 3)
+#define KVM_X86_DISABLE_EXITS_APERFMPERF     (1 << 4)

 /* for KVM_ENABLE_CAP */
 struct kvm_enable_cap {
```