x86/kvm: Provide optimized version of vcpu_is_preempted() for x86-64

It was found that, when running a fio sequential write test with an
XFS ramdisk on a KVM guest running on a 2-socket x86-64 system, the
%CPU times as reported by perf were as follows:

  69.75%  0.59%  fio  [k] down_write
  69.15%  0.01%  fio  [k] call_rwsem_down_write_failed
  67.12%  1.12%  fio  [k] rwsem_down_write_failed
  63.48% 52.77%  fio  [k] osq_lock
   9.46%  7.88%  fio  [k] __raw_callee_save___kvm_vcpu_is_preempt
   3.93%  3.93%  fio  [k] __kvm_vcpu_is_preempted
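
The hot vcpu_is_preempted() callers are the optimistic-spinning paths
shown above: osq_lock() (and the mutex/rwsem spinners built on it)
polls the preemption state of the lock holder's CPU while spinning.
A simplified, hypothetical sketch of such a spin loop is shown below;
spin_on_owner_sketch() and its arguments are illustrative only, the
real code lives in kernel/locking/osq_lock.c:

  /*
   * Hypothetical sketch (kernel context) of a spin-wait loop that
   * polls vcpu_is_preempted(); illustrative only.
   */
  #include <linux/atomic.h>
  #include <linux/sched.h>

  static bool spin_on_owner_sketch(int owner_cpu, atomic_t *locked)
  {
  	while (!atomic_read(locked)) {
  		/*
  		 * Stop spinning once the vCPU holding the lock has been
  		 * preempted by the host; spinning only burns cycles.
  		 */
  		if (vcpu_is_preempted(owner_cpu))
  			return false;
  		cpu_relax();
  	}
  	return true;
  }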

Making vcpu_is_preempted() a callee-save function has a relatively
high cost on x86-64, primarily because the generic thunk saves and
restores eight 64-bit registers to and from the stack (at least one
more cacheline of data access) and adds one more level of function
call.
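
For reference, the generic PV_CALLEE_SAVE_REGS_THUNK() wrapper expands
on x86-64 to roughly the code below. This is only a sketch (the label
name is made up and the exact push/pop order comes from
PV_SAVE_ALL_CALLER_REGS); the point is that all eight caller-saved
registers other than %rax are spilled around the call:

  /* Sketch of the generic callee-save thunk on x86-64. */
  asm(
  ".pushsection .text;"
  "__thunk_sketch___kvm_vcpu_is_preempted:"	/* hypothetical label */
  "push %rcx; push %rdx; push %rsi; push %rdi;"
  "push %r8; push %r9; push %r10; push %r11;"
  "call __kvm_vcpu_is_preempted;"
  "pop %r11; pop %r10; pop %r9; pop %r8;"
  "pop %rdi; pop %rsi; pop %rdx; pop %rcx;"
  "ret;"
  ".popsection");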

To reduce this performance overhead, an optimized assembly version of
the __raw_callee_save___kvm_vcpu_is_preempted() function is provided
for x86-64.
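
The hand-written version only needs %rdi (the CPU number argument) and
%rax (the return value), so nothing has to be spilled. In C terms,
inside kvm.c the four instructions evaluate roughly the following
(a sketch; vcpu_preempted_sketch() is a hypothetical name):

  /*
   * C-level view of the hand-written asm: add the target CPU's per-cpu
   * offset to the address of the steal_time per-cpu variable and test
   * its 'preempted' byte.
   */
  static bool vcpu_preempted_sketch(long cpu)
  {
  	unsigned long base = __per_cpu_offset[cpu];	/* movq ..., %rax */
  	struct kvm_steal_time *st =
  		(void *)(base + (unsigned long)&steal_time);

  	return st->preempted != 0;			/* cmpb + setne  */
  }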

With this patch applied on a KVM guest on a 2-socket 16-core 32-thread
system with 16 parallel jobs (8 on each socket), the aggregate
bandwidth of the fio test on an XFS ramdisk was as follows:

  I/O Type      w/o patch    with patch
  --------      ---------    ----------
  random read   8141.2 MB/s  8497.1 MB/s
  seq read      8229.4 MB/s  8304.2 MB/s
  random write  1675.5 MB/s  1701.5 MB/s
  seq write     1681.3 MB/s  1699.9 MB/s

The patch produces a small but consistent increase in aggregate
bandwidth (roughly 1-4%) for all four I/O types.

The perf data now became:

  70.78%  0.58%  fio  [k] down_write
  70.20%  0.01%  fio  [k] call_rwsem_down_write_failed
  69.70%  1.17%  fio  [k] rwsem_down_write_failed
  59.91% 55.42%  fio  [k] osq_lock
  10.14% 10.14%  fio  [k] __kvm_vcpu_is_preempted

The assembly code was verified by using a test kernel module that
compared the output of the C __kvm_vcpu_is_preempted() function with
that of the assembly __raw_callee_save___kvm_vcpu_is_preempted()
function and confirmed that they matched.
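
A minimal sketch of such a test module is shown below, assuming a test
build where both the C and the assembly implementations are compiled
in and reachable (the module actually used for the verification is not
part of this patch). Note that the preemption state can change between
the two reads, so an occasional mismatch reported under load is not
necessarily a real bug:

  /* Hypothetical test module: check that the C and asm versions agree. */
  #include <linux/module.h>
  #include <linux/kernel.h>
  #include <linux/cpumask.h>

  extern bool __kvm_vcpu_is_preempted(long cpu);
  extern bool __raw_callee_save___kvm_vcpu_is_preempted(long cpu);

  static int __init preempt_check_init(void)
  {
  	int cpu;

  	for_each_possible_cpu(cpu) {
  		bool c_ret   = __kvm_vcpu_is_preempted(cpu);
  		bool asm_ret = __raw_callee_save___kvm_vcpu_is_preempted(cpu);

  		WARN(c_ret != asm_ret, "cpu %d: C=%d asm=%d\n",
  		     cpu, c_ret, asm_ret);
  	}
  	return 0;
  }

  static void __exit preempt_check_exit(void)
  {
  }

  module_init(preempt_check_init);
  module_exit(preempt_check_exit);
  MODULE_LICENSE("GPL");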

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

---
 arch/x86/kernel/asm-offsets_64.c |  9 +++++++++
 arch/x86/kernel/kvm.c            | 24 ++++++++++++++++++++++++
 2 files changed, 33 insertions(+)
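
The asm-offsets_64.c change below exports
offsetof(struct kvm_steal_time, preempted) to assembly code as the
KVM_STEAL_TIME_preempted constant used by the new asm in kvm.c. The
generated asm-offsets.h then gains a line roughly like the following
(the numeric value is illustrative only):

  #define KVM_STEAL_TIME_preempted 16 /* offsetof(struct kvm_steal_time, preempted) */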

--- a/arch/x86/kernel/asm-offsets_64.c
+++ b/arch/x86/kernel/asm-offsets_64.c
@@ -13,12 +13,21 @@
 #include <asm/syscalls_32.h>
 };

+#if defined(CONFIG_KVM_GUEST) && defined(CONFIG_PARAVIRT_SPINLOCKS)
+#include <asm/kvm_para.h>
+#endif
+
 int main(void)
 {
 #ifdef CONFIG_PARAVIRT
 	OFFSET(PV_IRQ_adjust_exception_frame, pv_irq_ops, adjust_exception_frame);
 	OFFSET(PV_CPU_usergs_sysret64, pv_cpu_ops, usergs_sysret64);
 	OFFSET(PV_CPU_swapgs, pv_cpu_ops, swapgs);
+	BLANK();
+#endif
+
+#if defined(CONFIG_KVM_GUEST) && defined(CONFIG_PARAVIRT_SPINLOCKS)
+	OFFSET(KVM_STEAL_TIME_preempted, kvm_steal_time, preempted);
 	BLANK();
 #endif

--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -589,6 +589,7 @@
 	local_irq_restore(flags);
 }

+#ifdef CONFIG_X86_32
 __visible bool __kvm_vcpu_is_preempted(long cpu)
 {
 	struct kvm_steal_time *src = &per_cpu(steal_time, cpu);
@@ -596,6 +597,29 @@
 	return !!src->preempted;
 }
 PV_CALLEE_SAVE_REGS_THUNK(__kvm_vcpu_is_preempted);
+
+#else
+
+#include <asm/asm-offsets.h>
+
+extern bool __raw_callee_save___kvm_vcpu_is_preempted(long);
+
+/*
+ * Hand-optimize version for x86-64 to avoid 8 64-bit register saving and
+ * restoring to/from the stack.
+ */
+asm(
+".pushsection .text;"
+".global __raw_callee_save___kvm_vcpu_is_preempted;"
+".type __raw_callee_save___kvm_vcpu_is_preempted, @function;"
+"__raw_callee_save___kvm_vcpu_is_preempted:"
+"movq __per_cpu_offset(,%rdi,8), %rax;"
+"cmpb $0, " __stringify(KVM_STEAL_TIME_preempted) "+steal_time(%rax);"
+"setne %al;"
+"ret;"
+".popsection");
+
+#endif

 /*
  * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present.