x86/uaccess/64: Handle the caching of 4-byte nocache copies properly in __copy_user_nocache()

Data corruption issues were observed in tests that initiated
a system crash/reset while accessing BTT devices. The problem
is reproducible.

The BTT driver calls pmem_rw_bytes() to update data in pmem
devices. This interface calls __copy_user_nocache(), which
uses non-temporal stores so that the stores to pmem are
persistent.
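
For illustration, the pattern __copy_user_nocache() relies on can
be sketched with user-space SSE2 intrinsics (a minimal sketch, not
the kernel code; the function name is made up):

    /*
     * A non-temporal store (movnti) bypasses the CPU cache, and the
     * following sfence orders it, so the written data does not
     * linger in a dirty cache line that a crash/reset would discard.
     */
    #include <emmintrin.h>    /* _mm_stream_si32(), _mm_sfence() */

    static void store_nocache_4b(int *dst, int val)
    {
            _mm_stream_si32(dst, val);  /* movnti: cache-bypassing store */
            _mm_sfence();               /* order the non-temporal store */
    }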

__copy_user_nocache() uses non-temporal stores when a request
size is 8 bytes or larger (and the destination is 8-byte
aligned). The BTT driver updates the BTT map table, whose
entries are 4 bytes in size. Therefore, updates to the map
table entries remain cached, and are not written to pmem
after a crash.
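
The failing condition can be modeled in C (a toy model of the
pre-patch rule only; uses_movnti_prepatch() and the addresses are
made up for illustration):

    #include <stdio.h>
    #include <stddef.h>

    /* Pre-patch rule: movnti is used only for size >= 8 with an
     * 8-byte aligned destination; everything smaller falls through
     * to a cached byte copy.
     */
    static int uses_movnti_prepatch(unsigned long dst, size_t size)
    {
            return size >= 8 && (dst & 7) == 0;
    }

    int main(void)
    {
            /* 4-byte BTT map entry update: cached, not persistent */
            printf("%d\n", uses_movnti_prepatch(0x1004, 4));  /* 0 */
            /* aligned 64-byte data copy: non-temporal, persistent */
            printf("%d\n", uses_movnti_prepatch(0x1000, 64)); /* 1 */
            return 0;
    }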

Change __copy_user_nocache() to use a non-temporal store when
a request size is 4 bytes (and the destination is 4-byte
aligned). The change extends the existing byte-copy path for
requests smaller than 8 bytes, and adds no overhead to the
regular path.
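
In C terms, the patched copy behaves roughly like the model below
(a hedged sketch with SSE2 intrinsics; copy_nocache_model() is
illustrative and simplified, the authoritative version is the
assembly diff below):

    #include <emmintrin.h>
    #include <stdint.h>
    #include <string.h>

    static void copy_nocache_model(void *dst, const void *src, size_t n)
    {
            char *d = dst;
            const char *s = src;

            /* Pre-existing path: 8-byte movnti loop */
            if (((uintptr_t)d & 7) == 0) {
                    while (n >= 8) {
                            long long v;
                            memcpy(&v, s, 8);
                            _mm_stream_si64((long long *)d, v);
                            d += 8; s += 8; n -= 8;
                    }
            }
            /* New path: one 4-byte movnti when at least 4 bytes are
             * left and the destination is 4-byte aligned
             */
            if (n >= 4 && ((uintptr_t)d & 3) == 0) {
                    int v;
                    memcpy(&v, s, 4);
                    _mm_stream_si32((int *)d, v);
                    d += 4; s += 4; n -= 4;
            }
            /* Pre-existing path: byte "cache" copy for the remainder */
            while (n--)
                    *d++ = *s++;
            _mm_sfence();    /* fence all non-temporal stores */
    }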

Reported-and-tested-by: Micah Parrish <micah.parrish@hpe.com>
Reported-and-tested-by: Brian Boylston <brian.boylston@hpe.com>
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: <stable@vger.kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: linux-nvdimm@lists.01.org
Link: http://lkml.kernel.org/r/1455225857-12039-3-git-send-email-toshi.kani@hpe.com
[ Small readability edits. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>


Changed files:

 arch/x86/lib/copy_user_64.S | 36 ++++++++++++++++++++++++++++++++----
 1 file changed, 32 insertions(+), 4 deletions(-)
--- a/arch/x86/lib/copy_user_64.S
+++ b/arch/x86/lib/copy_user_64.S
@@ -237,13 +237,14 @@ ENTRY(copy_user_enhanced_fast_string)
  * Note: Cached memory copy is used when destination or size is not
  * naturally aligned. That is:
  *  - Require 8-byte alignment when size is 8 bytes or larger.
+ *  - Require 4-byte alignment when size is 4 bytes.
  */
 ENTRY(__copy_user_nocache)
 	ASM_STAC
 
-	/* If size is less than 8 bytes, go to byte copy */
+	/* If size is less than 8 bytes, go to 4-byte copy */
 	cmpl $8,%edx
-	jb .L_1b_cache_copy_entry
+	jb .L_4b_nocache_copy_entry
 
 	/* If destination is not 8-byte aligned, "cache" copy to align it */
 	ALIGN_DESTINATION
@@ -282,7 +283,7 @@ ENTRY(__copy_user_nocache)
 	movl %edx,%ecx
 	andl $7,%edx
 	shrl $3,%ecx
-	jz .L_1b_cache_copy_entry	/* jump if count is 0 */
+	jz .L_4b_nocache_copy_entry	/* jump if count is 0 */
 
 	/* Perform 8-byte nocache loop-copy */
 .L_8b_nocache_copy_loop:
@@ -294,11 +295,33 @@ ENTRY(__copy_user_nocache)
 	jnz .L_8b_nocache_copy_loop
 
 	/* If no byte left, we're done */
-.L_1b_cache_copy_entry:
+.L_4b_nocache_copy_entry:
+	andl %edx,%edx
+	jz .L_finish_copy
+
+	/* If destination is not 4-byte aligned, go to byte copy: */
+	movl %edi,%ecx
+	andl $3,%ecx
+	jnz .L_1b_cache_copy_entry
+
+	/* Set 4-byte copy count (1 or 0) and remainder */
+	movl %edx,%ecx
+	andl $3,%edx
+	shrl $2,%ecx
+	jz .L_1b_cache_copy_entry	/* jump if count is 0 */
+
+	/* Perform 4-byte nocache copy: */
+30:	movl (%rsi),%r8d
+31:	movnti %r8d,(%rdi)
+	leaq 4(%rsi),%rsi
+	leaq 4(%rdi),%rdi
+
+	/* If no bytes left, we're done: */
 	andl %edx,%edx
 	jz .L_finish_copy
 
 	/* Perform byte "cache" loop-copy for the remainder */
+.L_1b_cache_copy_entry:
 	movl %edx,%ecx
 .L_1b_cache_copy_loop:
 40:	movb (%rsi),%al
@@ -322,6 +345,9 @@ ENTRY(__copy_user_nocache)
 	jmp .L_fixup_handle_tail
 .L_fixup_8b_copy:
 	lea (%rdx,%rcx,8),%rdx
+	jmp .L_fixup_handle_tail
+.L_fixup_4b_copy:
+	lea (%rdx,%rcx,4),%rdx
 	jmp .L_fixup_handle_tail
 .L_fixup_1b_copy:
 	movl %ecx,%edx
@@ -348,6 +374,8 @@ ENTRY(__copy_user_nocache)
 	_ASM_EXTABLE(16b,.L_fixup_4x8b_copy)
 	_ASM_EXTABLE(20b,.L_fixup_8b_copy)
 	_ASM_EXTABLE(21b,.L_fixup_8b_copy)
+	_ASM_EXTABLE(30b,.L_fixup_4b_copy)
+	_ASM_EXTABLE(31b,.L_fixup_4b_copy)
 	_ASM_EXTABLE(40b,.L_fixup_1b_copy)
 	_ASM_EXTABLE(41b,.L_fixup_1b_copy)
 ENDPROC(__copy_user_nocache)
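
Note on the fixup path: the new .L_fixup_4b_copy mirrors the 8-byte
fixup, scaled by the element size. In C terms (an illustrative
rendering of "lea (%rdx,%rcx,4),%rdx"; the function name is made up):

    #include <stddef.h>

    /* On a fault in the 4-byte nocache copy, the not-copied length
     * reported to the caller is the byte remainder (%edx) plus the
     * outstanding 4-byte words (%ecx) times 4.
     */
    static size_t fixup_4b_remaining(size_t tail_bytes, size_t words_left)
    {
            return tail_bytes + 4 * words_left;
    }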