x86, pmem: fix broken __copy_user_nocache cache-bypass assumptions

Before we rework the "pmem api" to stop abusing __copy_user_nocache()
for memcpy_to_pmem() we need to fix cases where we may strand dirty data
in the cpu cache. The problem occurs when copy_from_iter_pmem() is used
for arbitrary data transfers from userspace. There is no guarantee that
these transfers, performed by dax_iomap_actor(), will have aligned
destinations or aligned transfer lengths. Backstop the usage of
__copy_user_nocache() with explicit cache management in these unaligned
cases.

Yes, copy_from_iter_pmem() is now too big for an inline, but addressing
that is saved for a later patch that moves the entirety of the "pmem
api" into the pmem driver directly.

Fixes: 5de490daec8b ("pmem: add copy_from_iter_pmem() and clear_pmem()")
Cc: <stable@vger.kernel.org>
Cc: <x86@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

+31 -11
arch/x86/include/asm/pmem.h
···
  * @size: number of bytes to write back
  *
  * Write back a cache range using the CLWB (cache line write back)
- * instruction.
+ * instruction. Note that @size is internally rounded up to be cache
+ * line size aligned.
  */
 static inline void arch_wb_cache_pmem(void *addr, size_t size)
 {
···
 	for (p = (void *)((unsigned long)addr & ~clflush_mask);
 	     p < vend; p += x86_clflush_size)
 		clwb(p);
-}
-
-/*
- * copy_from_iter_nocache() on x86 only uses non-temporal stores for iovec
- * iterators, so for other types (bvec & kvec) we must do a cache write-back.
- */
-static inline bool __iter_needs_pmem_wb(struct iov_iter *i)
-{
-	return iter_is_iovec(i) == false;
 }

 /**
···
 	/* TODO: skip the write-back by always using non-temporal stores */
 	len = copy_from_iter_nocache(addr, bytes, i);

-	if (__iter_needs_pmem_wb(i))
+	/*
+	 * In the iovec case on x86_64 copy_from_iter_nocache() uses
+	 * non-temporal stores for the bulk of the transfer, but we need
+	 * to manually flush if the transfer is unaligned. A cached
+	 * memory copy is used when destination or size is not naturally
+	 * aligned. That is:
+	 *   - Require 8-byte alignment when size is 8 bytes or larger.
+	 *   - Require 4-byte alignment when size is 4 bytes.
+	 *
+	 * In the non-iovec case the entire destination needs to be
+	 * flushed.
+	 */
+	if (iter_is_iovec(i)) {
+		unsigned long flushed, dest = (unsigned long) addr;
+
+		if (bytes < 8) {
+			if (!IS_ALIGNED(dest, 4) || (bytes != 4))
+				arch_wb_cache_pmem(addr, 1);
+		} else {
+			if (!IS_ALIGNED(dest, 8)) {
+				dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
+				arch_wb_cache_pmem(addr, 1);
+			}
+
+			flushed = dest - (unsigned long) addr;
+			if (bytes > flushed && !IS_ALIGNED(bytes - flushed, 8))
+				arch_wb_cache_pmem(addr + bytes - 1, 1);
+		}
+	} else
 		arch_wb_cache_pmem(addr, bytes);

 	return len;
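
For readers who want to check the alignment rules by hand, the iovec branch above can be modeled in user space. The following is only a sketch, not kernel code: struct wb_plan, plan_iovec_writeback(), ALIGN_UP() and the fixed 64-byte CACHE_LINE are hypothetical names and values chosen for illustration (the patch itself reads boot_cpu_data.x86_clflush_size and calls arch_wb_cache_pmem()). It reports whether the cache line holding the first byte and the cache line holding the last byte would be written back.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines. The kernel uses x86_clflush_size. */
#define CACHE_LINE 64UL

/* User-space stand-ins for the kernel's IS_ALIGNED() and ALIGN(). */
#define IS_ALIGNED(x, a) (((x) & ((a) - 1)) == 0)
#define ALIGN_UP(x, a)   (((x) + ((a) - 1)) & ~((a) - 1))

/* Hypothetical result type: which edge cache lines need a write-back. */
struct wb_plan {
	bool head; /* line containing the first destination byte */
	bool tail; /* line containing the last destination byte */
};

/*
 * Mirror of the iovec branch in the patch: decide which edges of the
 * destination may have been written with cached stores.
 */
static struct wb_plan plan_iovec_writeback(uintptr_t addr, size_t bytes)
{
	struct wb_plan plan = { false, false };
	uintptr_t dest = addr;
	size_t flushed;

	if (bytes < 8) {
		/* Small copies bypass the cache only for an aligned 4-byte store. */
		if (!IS_ALIGNED(dest, 4) || bytes != 4)
			plan.head = true;
	} else {
		/* A destination that is not 8-byte aligned gets a cached head. */
		if (!IS_ALIGNED(dest, 8)) {
			dest = ALIGN_UP(dest, CACHE_LINE);
			plan.head = true;
		}

		/* Bytes already covered by the head line's write-back. */
		flushed = dest - addr;
		if (bytes > flushed && !IS_ALIGNED(bytes - flushed, 8))
			plan.tail = true;
	}

	return plan;
}
```

With these assumptions, an 8-byte-aligned 64-byte copy needs no write-back at all, a 13-byte copy to an aligned address needs only the tail line, and a 16-byte copy to an address ending in 0x4 needs only the head line because the whole transfer fits inside the line that is already flushed.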