Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

powerpc: Speed up clear_page by unrolling it

Unroll clear_page 8 times. A simple microbenchmark which
allocates and frees a zeroed page:

    for (i = 0; i < iterations; i++) {
        unsigned long p = __get_free_page(GFP_KERNEL | __GFP_ZERO);
        free_page(p);
    }

improves 20% on POWER8.

This assumes cacheline sizes won't grow beyond 512 bytes or
page sizes won't drop below 1kB, which is unlikely, but we could
add a runtime check during early init if it makes people nervous.
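
The runtime check mentioned above could look roughly like the sketch
below. It is an illustration only, not code from this patch:
struct cache_geometry, CLEAR_PAGE_UNROLL and clear_page_unroll_ok() are
hypothetical names, and the struct merely stands in for the two
ppc64_caches fields that clear_page reads. The sketch assumes the real
requirement is that the number of cache lines per page is a non-zero
multiple of 8, which is what the unrolled loop appears to rely on.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two ppc64_caches fields the patch reads. */
struct cache_geometry {
	unsigned long dline_size;      /* bytes per data cache line */
	unsigned long dlines_per_page; /* data cache lines per page */
};

/* Unroll factor used by the patch. */
#define CLEAR_PAGE_UNROLL 8UL

/*
 * Return true when a loop that clears 8 lines per iteration covers
 * exactly one page: the page must hold at least 8 lines and a whole
 * number of 8-line groups.
 */
static bool clear_page_unroll_ok(const struct cache_geometry *c)
{
	if (c->dline_size == 0 || c->dlines_per_page < CLEAR_PAGE_UNROLL)
		return false;
	return (c->dlines_per_page % CLEAR_PAGE_UNROLL) == 0;
}
```

An early-init caller could warn or fall back to a non-unrolled loop
when the check fails; which of the two is appropriate is left open here.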

Michael found that some versions of gcc produce quite bad code
(all multiplies), so we give gcc a hand by using shifts and adds.
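
To make the shift-and-add trick concrete, here is a portable sketch of
the offset arithmetic and of the loop shape. It is a model, not the
kernel code: line_offsets() and clear_page_model() are hypothetical
names, and memset() stands in for dcbz, which zeroes one whole cache
line at the given address.

```c
#include <stddef.h>
#include <string.h>

/*
 * Fill off[0..7] with the byte offsets of eight consecutive cache
 * lines using only shifts, adds and one subtract, mirroring the
 * operands the patch passes to the inline asm.
 */
static void line_offsets(unsigned long line_size, unsigned long off[8])
{
	unsigned long onex = line_size;
	unsigned long twox = onex << 1;
	unsigned long fourx = onex << 2;
	unsigned long eightx = onex << 3;

	off[0] = 0;
	off[1] = onex;
	off[2] = twox;
	off[3] = twox + onex;
	off[4] = fourx;
	off[5] = fourx + onex;
	off[6] = twox + fourx;
	off[7] = eightx - onex;
}

/*
 * Portable model of the unrolled loop: zero eight lines per iteration,
 * then advance the base pointer by eight line sizes.
 * Assumes lines_per_page is a multiple of 8.
 */
static void clear_page_model(void *addr, unsigned long line_size,
			     unsigned long lines_per_page)
{
	unsigned long off[8];
	unsigned long iterations = lines_per_page / 8;
	unsigned char *p = addr;

	line_offsets(line_size, off);
	while (iterations--) {
		for (int i = 0; i < 8; i++)
			memset(p + off[i], 0, line_size);
		p += line_size << 3; /* same stride as eightx */
	}
}
```

Computing every offset once, outside the loop, leaves a single add per
iteration for the base pointer, which matches the structure of the asm
in the diff.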

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

Authored by Anton Blanchard and committed by Michael Ellerman
e35735b9 2013add4

+29 -9
arch/powerpc/include/asm/page_64.h
@@ -42,20 +42,40 @@
 
 typedef unsigned long pte_basic_t;
 
-static __inline__ void clear_page(void *addr)
+static inline void clear_page(void *addr)
 {
-	unsigned long lines, line_size;
+	unsigned long iterations;
+	unsigned long onex, twox, fourx, eightx;
 
-	line_size = ppc64_caches.dline_size;
-	lines = ppc64_caches.dlines_per_page;
+	iterations = ppc64_caches.dlines_per_page / 8;
 
-	__asm__ __volatile__(
+	/*
+	 * Some verisions of gcc use multiply instructions to
+	 * calculate the offsets so lets give it a hand to
+	 * do better.
+	 */
+	onex = ppc64_caches.dline_size;
+	twox = onex << 1;
+	fourx = onex << 2;
+	eightx = onex << 3;
+
+	asm volatile(
 	"mtctr	%1	# clear_page\n\
-1:	dcbz	0,%0\n\
-	add	%0,%0,%3\n\
+	.balign	16\n\
+1:	dcbz	0,%0\n\
+	dcbz	%3,%0\n\
+	dcbz	%4,%0\n\
+	dcbz	%5,%0\n\
+	dcbz	%6,%0\n\
+	dcbz	%7,%0\n\
+	dcbz	%8,%0\n\
+	dcbz	%9,%0\n\
+	add	%0,%0,%10\n\
 	bdnz+	1b"
-	: "=r" (addr)
-	: "r" (lines), "0" (addr), "r" (line_size)
+	: "=&r" (addr)
+	: "r" (iterations), "0" (addr), "b" (onex), "b" (twox),
+		"b" (twox+onex), "b" (fourx), "b" (fourx+onex),
+		"b" (twox+fourx), "b" (eightx-onex), "r" (eightx)
 	: "ctr", "memory");
 }
 