Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

dm writecache: improve performance on DDR persistent memory (Optane)

When testing the dm-writecache target on a real DDR persistent memory
(Intel Optane), it turned out that explicit cache flushing using the
clflushopt instruction performs better than non-temporal stores for
block sizes 1k, 2k and 4k.

The dm-writecache target is singlethreaded (all the copying is done
while holding the writecache lock), so it benefits from clwb, see:
http://lore.kernel.org/r/alpine.LRH.2.02.2004160411460.7833@file01.intranet.prod.int.rdu2.redhat.com

Add a new function, memcpy_flushcache_optimized(), that tests whether
clflushopt is available; if it is, it is used instead of
memcpy_flushcache().

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

Authored by Mikulas Patocka and committed by Mike Snitzer
48338daa 499c1804

+37 -1
drivers/md/dm-writecache.c
--- a/drivers/md/dm-writecache.c
+++ b/drivers/md/dm-writecache.c
@@ -1139,6 +1139,42 @@
 	return r;
 }
 
+static void memcpy_flushcache_optimized(void *dest, void *source, size_t size)
+{
+	/*
+	 * clflushopt performs better with block size 1024, 2048, 4096
+	 * non-temporal stores perform better with block size 512
+	 *
+	 * block size	512		1024	2048	4096
+	 * movnti	496 MB/s	642 MB/s	725 MB/s	744 MB/s
+	 * clflushopt	373 MB/s	688 MB/s	1.1 GB/s	1.2 GB/s
+	 *
+	 * We see that movnti performs better for 512-byte blocks, and
+	 * clflushopt performs better for 1024-byte and larger blocks. So, we
+	 * prefer clflushopt for sizes >= 768.
+	 *
+	 * NOTE: this happens to be the case now (with dm-writecache's single
+	 * threaded model) but re-evaluate this once memcpy_flushcache() is
+	 * enabled to use movdir64b which might invalidate this performance
+	 * advantage seen with cache-allocating-writes plus flushing.
+	 */
+#ifdef CONFIG_X86
+	if (static_cpu_has(X86_FEATURE_CLFLUSHOPT) &&
+	    likely(boot_cpu_data.x86_clflush_size == 64) &&
+	    likely(size >= 768)) {
+		do {
+			memcpy((void *)dest, (void *)source, 64);
+			clflushopt((void *)dest);
+			dest += 64;
+			source += 64;
+			size -= 64;
+		} while (size >= 64);
+		return;
+	}
+#endif
+	memcpy_flushcache(dest, source, size);
+}
+
 static void bio_copy_block(struct dm_writecache *wc, struct bio *bio, void *data)
 {
 	void *buf;
@@ -1164,7 +1200,7 @@
 		}
 	} else {
 		flush_dcache_page(bio_page(bio));
-		memcpy_flushcache(data, buf, size);
+		memcpy_flushcache_optimized(data, buf, size);
 	}
 
 	bvec_kunmap_irq(buf, &flags);