
x86/mm, mm/hwpoison: Don't unconditionally unmap kernel 1:1 pages

In the following commit:

ce0fa3e56ad2 ("x86/mm, mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages")

... we added code to memory_failure() to unmap the page from the
kernel 1:1 virtual address space, to avoid speculative access to the
page from logging additional errors.

But memory_failure() may not always succeed in taking the page offline,
especially if the page belongs to the kernel. This can happen if
there are too many corrected errors on a page and either mcelog(8)
or drivers/ras/cec.c asks to take a page offline.

Since we remove the 1:1 mapping early in memory_failure(), we can
end up with the page unmapped, but still in use. On the next access
the kernel crashes :-(

There are also various debug paths that call memory_failure() to simulate
the occurrence of an error. Since there is no actual error in memory, we
don't need to map out the page for those cases.

Revert most of the previous attempt and keep the solution local to
arch/x86/kernel/cpu/mcheck/mce.c. Unmap the page only when:

1) there is a real error
2) memory_failure() succeeds.

All of this only applies to 64-bit systems. A 32-bit kernel doesn't map
all of memory into kernel space, and it isn't worth adding the code to
unmap the piece that is mapped, because nobody would run a 32-bit kernel
on a machine that has recoverable machine checks.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave <dave.hansen@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Robert (Persistent Memory) <elliott@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Cc: stable@vger.kernel.org #v4.14
Fixes: ce0fa3e56ad2 ("x86/mm, mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages")
Signed-off-by: Ingo Molnar <mingo@kernel.org>

Authored by: Tony Luck
Committed by: Ingo Molnar
Commit: fd0e786d (parent: 01684e72)

26 insertions(+), 18 deletions(-)

arch/x86/include/asm/page_64.h | 4 deletions(-)

@@ -52,10 +52,6 @@
 
 void copy_page(void *to, void *from);
 
-#ifdef CONFIG_X86_MCE
-#define arch_unmap_kpfn arch_unmap_kpfn
-#endif
-
 #endif	/* !__ASSEMBLY__ */
 
 #ifdef CONFIG_X86_VSYSCALL_EMULATION
arch/x86/kernel/cpu/mcheck/mce-internal.h | 15 insertions(+)

@@ -115,4 +115,19 @@
 
 extern struct mca_config mca_cfg;
 
+#ifndef CONFIG_X86_64
+/*
+ * On 32-bit systems it would be difficult to safely unmap a poison page
+ * from the kernel 1:1 map because there are no non-canonical addresses that
+ * we can use to refer to the address without risking a speculative access.
+ * However, this isn't much of an issue because:
+ * 1) Few unmappable pages are in the 1:1 map. Most are in HIGHMEM which
+ *    are only mapped into the kernel as needed
+ * 2) Few people would run a 32-bit kernel on a machine that supports
+ *    recoverable errors because they have too much memory to boot 32-bit.
+ */
+static inline void mce_unmap_kpfn(unsigned long pfn) {}
+#define mce_unmap_kpfn mce_unmap_kpfn
+#endif
+
 #endif /* __X86_MCE_INTERNAL_H__ */
arch/x86/kernel/cpu/mcheck/mce.c | 11 insertions(+), 6 deletions(-)

@@ -105,6 +105,10 @@
 
 static void (*quirk_no_way_out)(int bank, struct mce *m, struct pt_regs *regs);
 
+#ifndef mce_unmap_kpfn
+static void mce_unmap_kpfn(unsigned long pfn);
+#endif
+
 /*
  * CPU/chipset specific EDAC code can register a notifier call here to print
  * MCE errors in a human-readable form.
@@ -594,7 +590,8 @@
 
 	if (mce_usable_address(mce) && (mce->severity == MCE_AO_SEVERITY)) {
 		pfn = mce->addr >> PAGE_SHIFT;
-		memory_failure(pfn, 0);
+		if (!memory_failure(pfn, 0))
+			mce_unmap_kpfn(pfn);
 	}
 
 	return NOTIFY_OK;
@@ -1062,12 +1057,13 @@
 	ret = memory_failure(m->addr >> PAGE_SHIFT, flags);
 	if (ret)
 		pr_err("Memory error not recovered");
+	else
+		mce_unmap_kpfn(m->addr >> PAGE_SHIFT);
 	return ret;
 }
 
-#if defined(arch_unmap_kpfn) && defined(CONFIG_MEMORY_FAILURE)
-
-void arch_unmap_kpfn(unsigned long pfn)
+#ifndef mce_unmap_kpfn
+static void mce_unmap_kpfn(unsigned long pfn)
 {
 	unsigned long decoy_addr;
 
@@ -1079,7 +1073,7 @@
 	 * We would like to just call:
 	 *	set_memory_np((unsigned long)pfn_to_kaddr(pfn), 1);
 	 * but doing that would radically increase the odds of a
-	 * speculative access to the posion page because we'd have
+	 * speculative access to the poison page because we'd have
 	 * the virtual address of the kernel 1:1 mapping sitting
 	 * around in registers.
	 * Instead we get tricky. We create a non-canonical address
@@ -1104,7 +1098,6 @@
 
 	if (set_memory_np(decoy_addr, 1))
 		pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
-
 }
 #endif
include/linux/mm_inline.h | 6 deletions(-)

@@ -127,10 +127,4 @@
 
 #define lru_to_page(head) (list_entry((head)->prev, struct page, lru))
 
-#ifdef arch_unmap_kpfn
-extern void arch_unmap_kpfn(unsigned long pfn);
-#else
-static __always_inline void arch_unmap_kpfn(unsigned long pfn) { }
-#endif
-
 #endif
mm/memory-failure.c | 2 deletions(-)

@@ -1139,8 +1139,6 @@
 		return 0;
 	}
 
-	arch_unmap_kpfn(pfn);
-
 	orig_head = hpage = compound_head(p);
 	num_poisoned_pages_inc();