Linux kernel mirror (for testing): git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

sched/x86: Optimize switch_mm() for multi-threaded workloads

Dick Fowles, Don Zickus and Joe Mario have been working on
improvements to perf, and noticed heavy cache line contention
on the mm_cpumask while running linpack on a 60 core / 120 thread
system.

The cause turned out to be unnecessary atomic accesses to the
mm_cpumask. When in lazy TLB mode, the CPU is only removed from
the mm_cpumask if there is a TLB flush event.

Most of the time, no such TLB flush happens, and the kernel
skips the TLB reload. It can also skip the atomic test-and-set
on the mm_cpumask, doing a plain read instead and only setting
the bit when it is not already set.
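To make the pattern concrete, here is a minimal user-space C sketch of
the before/after behaviour. It is illustrative only: mark_cpu_slow(),
mark_cpu_fast() and the single atomic_ulong word standing in for the
multi-word cpumask are hypothetical, not the kernel helpers.

#include <stdatomic.h>
#include <stdbool.h>

/*
 * Old behaviour: an unconditional atomic read-modify-write, which
 * takes the cache line exclusive on every context switch and bounces
 * it between CPUs.
 */
static bool mark_cpu_slow(atomic_ulong *mask, unsigned int cpu)
{
	return atomic_fetch_or(mask, 1UL << cpu) & (1UL << cpu);
}

/*
 * New behaviour: a plain read first. In the common case the bit is
 * already set, the cache line can stay in shared state, and no write
 * is issued. Testing and setting in two steps is only safe because
 * concurrent modifiers are excluded; in switch_mm() irqs are blocked
 * during schedule.
 */
static bool mark_cpu_fast(atomic_ulong *mask, unsigned int cpu)
{
	if (atomic_load_explicit(mask, memory_order_relaxed) & (1UL << cpu))
		return true;
	atomic_fetch_or(mask, 1UL << cpu);
	return false;
}

Both versions report whether the bit was already set (the caller
reloads CR3 when it was not); the fast version simply avoids dirtying
the shared cache line in the common case.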

Here is a summary of Joe's test results:

* The __schedule function dropped from 24% of all program cycles down
to 5.5%.

* The cacheline contention/hotness for accesses to that bitmask went
from being the 1st/2nd hottest down to the 84th hottest (0.3% of
all shared misses, which is now quite cold).

* The average load latency for the bit-test-n-set instruction in
__schedule dropped from 10k-15k cycles down to around 600 cycles.

* The linpack program results improved from 133 GFlops to 144 GFlops.
Peak GFlops rose from 133 to 153.

Reported-by: Don Zickus <dzickus@redhat.com>
Reported-by: Joe Mario <jmario@redhat.com>
Tested-by: Joe Mario <jmario@redhat.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Paul Turner <pjt@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20130731221421.616d3d20@annuminas.surriel.com
[ Made the comments consistent around the modified code. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>

Authored by Rik van Riel and committed by Ingo Molnar
8f898fbb 46591962

+13 -7
arch/x86/include/asm/mmu_context.h
@@ -45,22 +45,28 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 		/* Re-load page tables */
 		load_cr3(next->pgd);
 
-		/* stop flush ipis for the previous mm */
+		/* Stop flush ipis for the previous mm */
 		cpumask_clear_cpu(cpu, mm_cpumask(prev));
 
-		/*
-		 * load the LDT, if the LDT is different:
-		 */
+		/* Load the LDT, if the LDT is different: */
 		if (unlikely(prev->context.ldt != next->context.ldt))
 			load_LDT_nolock(&next->context);
 	}
 #ifdef CONFIG_SMP
-	else {
+	  else {
 		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
 		BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);
 
-		if (!cpumask_test_and_set_cpu(cpu, mm_cpumask(next))) {
-			/* We were in lazy tlb mode and leave_mm disabled
+		if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
+			/*
+			 * On established mms, the mm_cpumask is only changed
+			 * from irq context, from ptep_clear_flush() while in
+			 * lazy tlb mode, and here. Irqs are blocked during
+			 * schedule, protecting us from simultaneous changes.
+			 */
+			cpumask_set_cpu(cpu, mm_cpumask(next));
+			/*
+			 * We were in lazy tlb mode and leave_mm disabled
 			 * tlb flush IPI delivery. We must reload CR3
 			 * to make sure to use no freed page tables.
 			 */