Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

powerpc: Optimise the 64bit optimised __clear_user

I blame Mikey for this. He elevated my slightly dubious testcase:

to benchmark status. And naturally we need to be number 1 at creating
zeros. So let's improve __clear_user some more.

As Paul suggests, we can use dcbz for large lengths. This patch gets
the destination cacheline aligned then uses dcbz on whole cachelines.

Before:
10485760000 bytes (10 GB) copied, 0.414744 s, 25.3 GB/s

After:
10485760000 bytes (10 GB) copied, 0.268597 s, 39.0 GB/s

39 GB/s, a new record.

Signed-off-by: Anton Blanchard <anton@samba.org>
Tested-by: Olof Johansson <olof@lixom.net>
Acked-by: Olof Johansson <olof@lixom.net>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>

authored by Anton Blanchard and committed by Benjamin Herrenschmidt
cf8fb553 b4c3a872

+62 -1
arch/powerpc/lib/string_64.S
···
  */

 #include <asm/ppc_asm.h>
+#include <asm/asm-offsets.h>
+
+	.section	".toc","aw"
+PPC64_CACHES:
+	.tc		ppc64_caches[TC],ppc64_caches
+	.section	".text"

 /**
  * __clear_user: - Zero a block of memory in user space, with less checking.
···
 	addi	r3,r3,4

 3:	sub	r4,r4,r6
-	srdi	r6,r4,5
+
 	cmpdi	r4,32
+	cmpdi	cr1,r4,512
 	blt	.Lshort_clear
+	bgt	cr1,.Llong_clear
+
+.Lmedium_clear:
+	srdi	r6,r4,5
 	mtctr	r6

 	/* Do 32 byte chunks */
···
 10:	li	r3,0
 	blr
+
+.Llong_clear:
+	ld	r5,PPC64_CACHES@toc(r2)
+
+	bf	cr7*4+0,11f
+err2;	std	r0,0(r3)
+	addi	r3,r3,8
+	addi	r4,r4,-8
+
+	/* Destination is 16 byte aligned, need to get it cacheline aligned */
+11:	lwz	r7,DCACHEL1LOGLINESIZE(r5)
+	lwz	r9,DCACHEL1LINESIZE(r5)
+
+	/*
+	 * With worst case alignment the long clear loop takes a minimum
+	 * of 1 byte less than 2 cachelines.
+	 */
+	sldi	r10,r9,2
+	cmpd	r4,r10
+	blt	.Lmedium_clear
+
+	neg	r6,r3
+	addi	r10,r9,-1
+	and.	r5,r6,r10
+	beq	13f
+
+	srdi	r6,r5,4
+	mtctr	r6
+	mr	r8,r3
+12:
+err1;	std	r0,0(r3)
+err1;	std	r0,8(r3)
+	addi	r3,r3,16
+	bdnz	12b
+
+	sub	r4,r4,r5
+
+13:	srd	r6,r4,r7
+	mtctr	r6
+	mr	r8,r3
+14:
+err1;	dcbz	r0,r3
+	add	r3,r3,r9
+	bdnz	14b
+
+	and	r4,r4,r10
+
+	cmpdi	r4,32
+	blt	.Lshort_clear
+	b	.Lmedium_clear