Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

powerpc/64s/hash: Make hash faults work in NMI context

Hash faults are not resolved in NMI context; the access is instead made
to fail. This is done because perf interrupts can take backtraces
including walking the user stack, and taking a hash fault on those could
deadlock on the HPTE lock if the perf interrupt hits while the same HPTE
lock is being held by the hash fault code. The user-access for the stack
walking will notice the access failed and deal with that in the perf
code.

The reason for allowing perf interrupts in was to better profile hash faults.

The problem with this is that any hash fault taken on a kernel access in
NMI context will crash, because kernel accesses must not fail.

Hard lockup watchdogs, system reset, and machine check handlers that
access vmalloc space (including modules, stack backtracing and symbol
lookup in modules, per-cpu data, etc.) could all run into this problem.

Fix this by disallowing perf interrupts in the hash fault code (the
direct hash fault is covered by MSR[EE]=0 so the PMI disable just needs
to extend to the preload case). This simplifies the tricky logic in hash
faults and perf, at the cost of reduced profiling of hash faults.

perf can still latch addresses when interrupts are disabled; it just
won't get the stack trace at that point. It will therefore still find
hot spots, just sometimes with confusing stack chains.

An alternative would be to allow perf interrupts here but always do the
slowpath stack walk when in NMI context. However, that slows down all
perf interrupt stack walking on hash MMU systems, and it does not remove
as much tricky code.

Reported-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Tested-by: Laurent Dufour <ldufour@linux.ibm.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220204035348.545435-1-npiggin@gmail.com

Authored by Nicholas Piggin, committed by Michael Ellerman
8b91cee5 406a8c1d

+10 -82
+1 -1
arch/powerpc/include/asm/interrupt.h
@@ -612,7 +612,7 @@
 DECLARE_INTERRUPT_HANDLER(do_bad_segment_interrupt);
 
 /* hash_utils.c */
-DECLARE_INTERRUPT_HANDLER_RAW(do_hash_fault);
+DECLARE_INTERRUPT_HANDLER(do_hash_fault);
 
 /* fault.c */
 DECLARE_INTERRUPT_HANDLER(do_page_fault);
+8 -46
arch/powerpc/mm/book3s64/hash_utils.c
@@ -1621,8 +1621,7 @@
 }
 EXPORT_SYMBOL_GPL(hash_page);
 
-DECLARE_INTERRUPT_HANDLER(__do_hash_fault);
-DEFINE_INTERRUPT_HANDLER(__do_hash_fault)
+DEFINE_INTERRUPT_HANDLER(do_hash_fault)
 {
 	unsigned long ea = regs->dar;
 	unsigned long dsisr = regs->dsisr;
@@ -1678,35 +1679,6 @@
 	} else if (err) {
 		hash__do_page_fault(regs);
 	}
-}
-
-/*
- * The _RAW interrupt entry checks for the in_nmi() case before
- * running the full handler.
- */
-DEFINE_INTERRUPT_HANDLER_RAW(do_hash_fault)
-{
-	/*
-	 * If we are in an "NMI" (e.g., an interrupt when soft-disabled), then
-	 * don't call hash_page, just fail the fault. This is required to
-	 * prevent re-entrancy problems in the hash code, namely perf
-	 * interrupts hitting while something holds H_PAGE_BUSY, and taking a
-	 * hash fault. See the comment in hash_preload().
-	 *
-	 * We come here as a result of a DSI at a point where we don't want
-	 * to call hash_page, such as when we are accessing memory (possibly
-	 * user memory) inside a PMU interrupt that occurred while interrupts
-	 * were soft-disabled. We want to invoke the exception handler for
-	 * the access, or panic if there isn't a handler.
-	 */
-	if (unlikely(in_nmi())) {
-		do_bad_page_fault_segv(regs);
-		return 0;
-	}
-
-	__do_hash_fault(regs);
-
-	return 0;
 }
 
 #ifdef CONFIG_PPC_MM_SLICES
@@ -1746,26 +1776,18 @@
 #endif /* CONFIG_PPC_64K_PAGES */
 
 	/*
-	 * __hash_page_* must run with interrupts off, as it sets the
-	 * H_PAGE_BUSY bit. It's possible for perf interrupts to hit at any
-	 * time and may take a hash fault reading the user stack, see
-	 * read_user_stack_slow() in the powerpc/perf code.
+	 * __hash_page_* must run with interrupts off, including PMI interrupts
+	 * off, as it sets the H_PAGE_BUSY bit.
 	 *
-	 * If that takes a hash fault on the same page as we lock here, it
-	 * will bail out when seeing H_PAGE_BUSY set, and retry the access
-	 * leading to an infinite loop.
-	 *
-	 * Disabling interrupts here does not prevent perf interrupts, but it
-	 * will prevent them taking hash faults (see the NMI test in
-	 * do_hash_page), then read_user_stack's copy_from_user_nofault will
-	 * fail and perf will fall back to read_user_stack_slow(), which
-	 * walks the Linux page tables.
+	 * It's otherwise possible for perf interrupts to hit at any time and
+	 * may take a hash fault reading the user stack, which could take a
+	 * hash miss and deadlock on the same H_PAGE_BUSY bit.
 	 *
 	 * Interrupts must also be off for the duration of the
 	 * mm_is_thread_local test and update, to prevent preempt running the
 	 * mm on another CPU (XXX: this may be racy vs kthread_use_mm).
 	 */
-	local_irq_save(flags);
+	powerpc_local_irq_pmu_save(flags);
 
 	/* Is that local to this CPU ? */
 	if (mm_is_thread_local(mm))
@@ -1782,7 +1820,7 @@
 		  mm_ctx_user_psize(&mm->context),
 		  pte_val(*ptep));
 
-	local_irq_restore(flags);
+	powerpc_local_irq_pmu_restore(flags);
 }
 
 /*
+1 -8
arch/powerpc/perf/callchain.h
@@ -2,7 +2,6 @@
 #ifndef _POWERPC_PERF_CALLCHAIN_H
 #define _POWERPC_PERF_CALLCHAIN_H
 
-int read_user_stack_slow(const void __user *ptr, void *buf, int nb);
 void perf_callchain_user_64(struct perf_callchain_entry_ctx *entry,
			    struct pt_regs *regs);
 void perf_callchain_user_32(struct perf_callchain_entry_ctx *entry,
@@ -25,17 +26,11 @@
					  size_t size)
 {
 	unsigned long addr = (unsigned long)ptr;
-	int rc;
 
 	if (addr > TASK_SIZE - size || (addr & (size - 1)))
 		return -EFAULT;
 
-	rc = copy_from_user_nofault(ret, ptr, size);
-
-	if (IS_ENABLED(CONFIG_PPC64) && !radix_enabled() && rc)
-		return read_user_stack_slow(ptr, ret, size);
-
-	return rc;
+	return copy_from_user_nofault(ret, ptr, size);
 }
 
 #endif /* _POWERPC_PERF_CALLCHAIN_H */
-27
arch/powerpc/perf/callchain_64.c
@@ -18,33 +18,6 @@
 
 #include "callchain.h"
 
-/*
- * On 64-bit we don't want to invoke hash_page on user addresses from
- * interrupt context, so if the access faults, we read the page tables
- * to find which page (if any) is mapped and access it directly. Radix
- * has no need for this so it doesn't use read_user_stack_slow.
- */
-int read_user_stack_slow(const void __user *ptr, void *buf, int nb)
-{
-
-	unsigned long addr = (unsigned long) ptr;
-	unsigned long offset;
-	struct page *page;
-	void *kaddr;
-
-	if (get_user_page_fast_only(addr, FOLL_WRITE, &page)) {
-		kaddr = page_address(page);
-
-		/* align address to page boundary */
-		offset = addr & ~PAGE_MASK;
-
-		memcpy(buf, kaddr + offset, nb);
-		put_page(page);
-		return 0;
-	}
-	return -EFAULT;
-}
-
 static int read_user_stack_64(const unsigned long __user *ptr, unsigned long *ret)
 {
 	return __read_user_stack(ptr, ret, sizeof(*ret));