Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

kvm: better MWAIT emulation for guests

Guests that are heavy on futexes end up IPI'ing each other a lot. That
can lead to significant slowdowns and increased latency for those guests
when running within KVM.

If only a single guest is needed on a host, we have a lot of spare host
CPU time we can throw at the problem. Modern CPUs implement a feature
called "MWAIT" which allows guests to wake up sleeping remote CPUs without
an IPI - thus without an exit - at the expense of never going out of guest
context.

The decision whether this is something sensible to use should be up to the
VM admin, i.e. to user space. We can however allow MWAIT execution on systems
whose hardware supports it properly.

This patch adds a CAP to user space and a KVM cpuid leaf to indicate
availability of native MWAIT execution. With that enabled, the worst a
guest can do is waste as many cycles as a "jmp ." would do, so it's not
a privilege problem.

We consciously do *not* expose the feature in our CPUID bitmap, as most
people will want to benefit from sleeping vCPUs to allow for overcommit.

Reported-by: "Gabriel L. Somlo" <gsomlo@gmail.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
[agraf: fix amd, change commit message]
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

Authored by Michael S. Tsirkin and committed by Paolo Bonzini
668fffa3 db2336a8

+58 -4
+9
Documentation/virtual/kvm/api.txt
···
     2: MIPS64 or microMIPS64 with access to all address segments.
        Both registers and addresses are 64-bits wide.
        It will be possible to run 64-bit or 32-bit guest code.
+
+8.8 KVM_CAP_X86_GUEST_MWAIT
+
+Architectures: x86
+
+This capability indicates that guest using memory monitoring instructions
+(MWAIT/MWAITX) to stop the virtual CPU will not cause a VM exit. As such time
+spent while virtual CPU is halted in this way will then be accounted for as
+guest running time on the host (as opposed to e.g. HLT).
+5 -2
arch/x86/kvm/svm.c
···
 	set_intercept(svm, INTERCEPT_CLGI);
 	set_intercept(svm, INTERCEPT_SKINIT);
 	set_intercept(svm, INTERCEPT_WBINVD);
-	set_intercept(svm, INTERCEPT_MONITOR);
-	set_intercept(svm, INTERCEPT_MWAIT);
 	set_intercept(svm, INTERCEPT_XSETBV);
+
+	if (!kvm_mwait_in_guest()) {
+		set_intercept(svm, INTERCEPT_MONITOR);
+		set_intercept(svm, INTERCEPT_MWAIT);
+	}
 
 	control->iopm_base_pa = iopm_base;
 	control->msrpm_base_pa = __pa(svm->msrpm);
+4 -2
arch/x86/kvm/vmx.c
···
 	      CPU_BASED_USE_IO_BITMAPS |
 	      CPU_BASED_MOV_DR_EXITING |
 	      CPU_BASED_USE_TSC_OFFSETING |
-	      CPU_BASED_MWAIT_EXITING |
-	      CPU_BASED_MONITOR_EXITING |
 	      CPU_BASED_INVLPG_EXITING |
 	      CPU_BASED_RDPMC_EXITING;
+
+	if (!kvm_mwait_in_guest())
+		min |= CPU_BASED_MWAIT_EXITING |
+			CPU_BASED_MONITOR_EXITING;
 
 	opt = CPU_BASED_TPR_SHADOW |
 	      CPU_BASED_USE_MSR_BITMAPS |
+3
arch/x86/kvm/x86.c
···
 	case KVM_CAP_ADJUST_CLOCK:
 		r = KVM_CLOCK_TSC_STABLE;
 		break;
+	case KVM_CAP_X86_GUEST_MWAIT:
+		r = kvm_mwait_in_guest();
+		break;
 	case KVM_CAP_X86_SMM:
 		/* SMBASE is usually relocated above 1M on modern chipsets,
 		 * and SMM handlers might indeed rely on 4G segment limits,
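For illustration, user space could probe the new capability with the standard KVM_CHECK_EXTENSION ioctl on /dev/kvm. This is a minimal sketch, not part of the patch: the helper name guest_mwait_available() is hypothetical, and the fallback value 143 is copied from the include/uapi/linux/kvm.h hunk of this commit for builds against older headers.

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/kvm.h>

/*
 * Fallback for headers that predate this patch; the value comes from
 * the include/uapi/linux/kvm.h hunk of this commit.
 */
#ifndef KVM_CAP_X86_GUEST_MWAIT
#define KVM_CAP_X86_GUEST_MWAIT 143
#endif

/*
 * Hypothetical helper, not part of the patch.
 *
 * Returns  1 if the host reports that guests can run MWAIT natively,
 *          0 if the capability is absent or reported as unavailable,
 *         -1 if /dev/kvm could not be opened or queried.
 */
int guest_mwait_available(void)
{
	int fd = open("/dev/kvm", O_RDWR);
	int r;

	if (fd < 0)
		return -1;

	/* KVM_CHECK_EXTENSION returns 0 for unsupported capabilities. */
	r = ioctl(fd, KVM_CHECK_EXTENSION, KVM_CAP_X86_GUEST_MWAIT);
	close(fd);

	if (r < 0)
		return -1;
	return r > 0 ? 1 : 0;
}
```

A VMM would call this once at startup and, only when it returns 1 and the admin has opted in, consider advertising MWAIT to the guest. Whether to do so remains a user-space policy decision, as the commit message states.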
+36
arch/x86/kvm/x86.h
···
 #ifndef ARCH_X86_KVM_X86_H
 #define ARCH_X86_KVM_X86_H
 
+#include <asm/processor.h>
+#include <asm/mwait.h>
 #include <linux/kvm_host.h>
 #include <asm/pvclock.h>
 #include "kvm_cache_regs.h"
···
 	    n = __quot;						\
 	    __rem;						\
 	 })
+
+static inline bool kvm_mwait_in_guest(void)
+{
+	unsigned int eax, ebx, ecx, edx;
+
+	if (!cpu_has(&boot_cpu_data, X86_FEATURE_MWAIT))
+		return false;
+
+	switch (boot_cpu_data.x86_vendor) {
+	case X86_VENDOR_AMD:
+		/* All AMD CPUs have a working MWAIT implementation */
+		return true;
+	case X86_VENDOR_INTEL:
+		/* Handle Intel below */
+		break;
+	default:
+		return false;
+	}
+
+	/*
+	 * Intel CPUs without CPUID5_ECX_INTERRUPT_BREAK are problematic as
+	 * they would allow guest to stop the CPU completely by disabling
+	 * interrupts then invoking MWAIT.
+	 */
+	if (boot_cpu_data.cpuid_level < CPUID_MWAIT_LEAF)
+		return false;
+
+	cpuid(CPUID_MWAIT_LEAF, &eax, &ebx, &ecx, &edx);
+
+	if (!(ecx & CPUID5_ECX_INTERRUPT_BREAK))
+		return false;
+
+	return true;
+}
 
 #endif
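The decision logic in kvm_mwait_in_guest() can be mirrored in user space for experimentation. The sketch below is an approximation under stated assumptions, not the kernel's code: all names (mwait_safe_for_guest, host_mwait_safe_for_guest, enum cpu_vendor and the macros) are hypothetical, and it reads raw CPUID through the compiler's <cpuid.h> rather than the kernel's boot_cpu_data, so its result can differ from the kernel's (for example inside a VM that hides MWAIT, or when the kernel has cleared the feature bit).

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Values mirror CPUID_MWAIT_LEAF and CPUID5_ECX_INTERRUPT_BREAK in <asm/mwait.h>. */
#define MWAIT_LEAF		5u
#define LEAF5_ECX_INT_BREAK	(1u << 1)
/* CPUID.01H:ECX bit 3 is the MONITOR/MWAIT feature flag. */
#define LEAF1_ECX_MONITOR	(1u << 3)

/* Hypothetical stand-in for boot_cpu_data.x86_vendor. */
enum cpu_vendor { VENDOR_INTEL, VENDOR_AMD, VENDOR_OTHER };

/* Pure decision function following the structure of kvm_mwait_in_guest(). */
bool mwait_safe_for_guest(enum cpu_vendor vendor, bool has_mwait,
			  uint32_t max_leaf, uint32_t leaf5_ecx)
{
	if (!has_mwait)
		return false;

	switch (vendor) {
	case VENDOR_AMD:
		return true;	/* the patch treats AMD with MWAIT as safe */
	case VENDOR_INTEL:
		break;		/* needs the interrupt-break check below */
	default:
		return false;
	}

	if (max_leaf < MWAIT_LEAF)
		return false;

	/* Without interrupt break, MWAIT with IRQs off could stall the CPU. */
	return (leaf5_ecx & LEAF5_ECX_INT_BREAK) != 0;
}

#if defined(__x86_64__) || defined(__i386__)
/* Gathers the inputs from raw CPUID on the CPU this code runs on. */
bool host_mwait_safe_for_guest(void)
{
	unsigned int eax, ebx, ecx, edx, max_leaf;
	enum cpu_vendor vendor = VENDOR_OTHER;
	uint32_t leaf5_ecx = 0;
	bool has_mwait;

	if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
		return false;
	max_leaf = eax;

	if (ebx == signature_INTEL_ebx && edx == signature_INTEL_edx &&
	    ecx == signature_INTEL_ecx)
		vendor = VENDOR_INTEL;
	else if (ebx == signature_AMD_ebx && edx == signature_AMD_edx &&
		 ecx == signature_AMD_ecx)
		vendor = VENDOR_AMD;

	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return false;
	has_mwait = (ecx & LEAF1_ECX_MONITOR) != 0;

	if (__get_cpuid(MWAIT_LEAF, &eax, &ebx, &ecx, &edx))
		leaf5_ecx = ecx;

	return mwait_safe_for_guest(vendor, has_mwait, max_leaf, leaf5_ecx);
}
#endif
```

Splitting the pure decision from the CPUID reads keeps the policy testable on any machine: for instance, mwait_safe_for_guest(VENDOR_INTEL, true, 5, 0x1) is false because the interrupt-break bit is clear, while the same call with ECX = 0x3 is true.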
+1
include/uapi/linux/kvm.h
···
 #define KVM_CAP_S390_GS 140
 #define KVM_CAP_S390_AIS 141
 #define KVM_CAP_SPAPR_TCE_VFIO 142
+#define KVM_CAP_X86_GUEST_MWAIT 143
 
 #ifdef KVM_CAP_IRQ_ROUTING