sched: improve sched_clock() performance

In scheduler-intense workloads, native_read_tsc() overhead accounts for
20% of the system overhead:

  659567 system_call      41222.9375
  686796 schedule           435.7843
  718382 __switch_to        665.1685
  823875 switch_mm         4526.7857
 1883122 native_read_tsc  55385.9412
 9761990 total                2.8468

This is in large part due to the rdtsc_barrier() that is done before
and after reading the TSC.

But sched_clock() is not a precise clock in the GTOD sense, so using such
barriers there is completely pointless. Remove the barriers from
__native_read_tsc() and use them only in vget_cycles().
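
For illustration, a minimal user-space sketch of the two read styles, using
compiler intrinsics instead of the kernel's helpers (the function names below
are invented for the example; the real rdtsc_barrier() selects MFENCE or
LFENCE via CPU-feature alternatives, while the sketch simply uses LFENCE):

#include <stdint.h>
#include <x86intrin.h>

/*
 * Unfenced read: cheapest possible, but the CPU may reorder the RDTSC with
 * surrounding instructions. Good enough for a jitter-tolerant clock such
 * as sched_clock().
 */
static inline uint64_t read_tsc_unfenced(void)
{
	return __rdtsc();
}

/*
 * Fenced read: the fences keep the RDTSC from drifting across them, which
 * the GTOD/vDSO path wants, at a measurable extra cost per read.
 */
static inline uint64_t read_tsc_fenced(void)
{
	uint64_t cycles;

	_mm_lfence();
	cycles = __rdtsc();
	_mm_lfence();

	return cycles;
}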

This improves lat_ctx performance by about 5%.

Signed-off-by: Ingo Molnar <mingo@elte.hu>

 arch/x86/include/asm/msr.h |  2 --
 arch/x86/include/asm/tsc.h |  8 +++++++-
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
--- a/arch/x86/include/asm/msr.h
+++ b/arch/x86/include/asm/msr.h
@@ -108,9 +108,7 @@
 {
 	DECLARE_ARGS(val, low, high);
 
-	rdtsc_barrier();
 	asm volatile("rdtsc" : EAX_EDX_RET(val, low, high));
-	rdtsc_barrier();
 
 	return EAX_EDX_VAL(val, low, high);
 }
diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -34,6 +34,8 @@
 
 static __always_inline cycles_t vget_cycles(void)
 {
+	cycles_t cycles;
+
 	/*
 	 * We only do VDSOs on TSC capable CPUs, so this shouldnt
 	 * access boot_cpu_data (which is not VDSO-safe):
@@ -44,7 +46,11 @@
 	if (!cpu_has_tsc)
 		return 0;
 #endif
-	return (cycles_t)__native_read_tsc();
+	rdtsc_barrier();
+	cycles = (cycles_t)__native_read_tsc();
+	rdtsc_barrier();
+
+	return cycles;
 }
 
 extern void tsc_init(void);
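
To tie the patch back to the "not a precise clock" argument, a hedged sketch
of a sched_clock()-style consumer of the now-unfenced read. This is not the
kernel's code; the constant and helper names are made up, but the fixed-point
scaling mirrors the general cycles-to-nanoseconds idea. A reordered RDTSC
shifts the result by at most a few nanoseconds, which a scheduler clock
tolerates, whereas the vDSO time path compares its reading against a
consistent snapshot and therefore keeps the barriers in vget_cycles().

#include <stdint.h>
#include <x86intrin.h>

/* ns = cycles * cyc2ns_scale >> CYC2NS_SHIFT (fixed point, illustrative) */
#define CYC2NS_SHIFT	10

static uint64_t cyc2ns_scale;

static void set_cyc2ns_scale(uint64_t cpu_khz)
{
	/* ns per cycle is 10^6 / cpu_khz; keep it in fixed point */
	cyc2ns_scale = (1000000ULL << CYC2NS_SHIFT) / cpu_khz;
}

static uint64_t sketch_sched_clock(void)
{
	/* unfenced TSC read: cheap, possibly reordered, good enough here */
	uint64_t cycles = __rdtsc();

	return (cycles * cyc2ns_scale) >> CYC2NS_SHIFT;
}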