Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

powerpc/thp: Serialize pmd clear against a linux page table walk.

Serialize against find_linux_pte_or_hugepte(), which does a lock-less
lookup in page tables with local interrupts disabled. For huge pages it
casts pmd_t to pte_t. Since the format of pte_t is different from that of
pmd_t, we want to prevent a pmd from changing from pointing to a page
table to pointing to a huge page (and back) while interrupts are disabled.
We clear the pmd, possibly to replace it with a page table pointer, in
several code paths. So make sure we wait for any parallel
find_linux_pte_or_hugepte() to finish.
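The serialization argument can be modelled in userspace. The sketch below is a hypothetical simulation, not kernel code. A per-thread sequence counter (odd while a walker is inside its section) stands in for "local interrupts disabled". `sim_kick_all_cpus_sync()` stands in for kick_all_cpus_sync(): it waits until every simulated CPU that was inside a section has left it, instead of sending IPIs. All `sim_*` names are made up for illustration.

```c
/*
 * Hypothetical userspace model of the pattern in this patch (not kernel
 * code): lockless walkers run inside an "interrupts disabled" section,
 * and the pmd clearer waits for every CPU to leave such a section before
 * the old entry may be reused. All sim_* names are illustrative only.
 */
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>

#define SIM_NR_CPUS 4
#define SIM_ROUNDS  1000

static _Atomic uintptr_t sim_pmd;              /* 0 = cleared, r = huge page #r */
static atomic_int sim_reused[SIM_ROUNDS + 1];  /* set once page #r is reused */
static atomic_ulong sim_cpu_seq[SIM_NR_CPUS];  /* odd = "irqs disabled" */
static atomic_int sim_stop;
static atomic_int sim_violations;

/* Models a find_linux_pte_or_hugepte()-style lockless walk on one CPU. */
static void *sim_walker(void *arg)
{
	int cpu = (int)(intptr_t)arg;

	while (!atomic_load(&sim_stop)) {
		atomic_fetch_add(&sim_cpu_seq[cpu], 1);   /* "local_irq_disable()" */
		uintptr_t p = atomic_load(&sim_pmd);
		if (p != 0) {
			sched_yield();                     /* widen the race window */
			if (atomic_load(&sim_reused[p]))   /* entry reused under us? */
				atomic_fetch_add(&sim_violations, 1);
		}
		atomic_fetch_add(&sim_cpu_seq[cpu], 1);   /* "local_irq_enable()" */
		sched_yield();                             /* let the writer run */
	}
	return NULL;
}

/* Stand-in for kick_all_cpus_sync(): for each CPU that is inside a
 * section right now, wait until that section has ended. */
static void sim_kick_all_cpus_sync(void)
{
	for (int cpu = 0; cpu < SIM_NR_CPUS; cpu++) {
		unsigned long s = atomic_load(&sim_cpu_seq[cpu]);
		if (s & 1)
			while (atomic_load(&sim_cpu_seq[cpu]) == s)
				sched_yield();
	}
}

/* Runs the model once (static state is not reset) and returns how many
 * walks observed an entry that had already been reused. */
int sim_run(void)
{
	pthread_t tid[SIM_NR_CPUS];

	for (int cpu = 0; cpu < SIM_NR_CPUS; cpu++) {
		int rc = pthread_create(&tid[cpu], NULL, sim_walker,
					(void *)(intptr_t)cpu);
		assert(rc == 0);
		(void)rc;
	}
	for (uintptr_t r = 1; r <= SIM_ROUNDS; r++) {
		atomic_store(&sim_pmd, r);                      /* install page #r */
		sched_yield();
		uintptr_t old = atomic_exchange(&sim_pmd, 0);   /* clear the pmd */
		sim_kick_all_cpus_sync();                       /* wait for walkers */
		atomic_store(&sim_reused[old], 1);              /* now safe to reuse */
	}
	atomic_store(&sim_stop, 1);
	for (int cpu = 0; cpu < SIM_NR_CPUS; cpu++)
		pthread_join(tid[cpu], NULL);
	return atomic_load(&sim_violations);
}
```

The order in the writer loop is what matters. The old entry is marked reused only after the wait returns. If the `sim_reused` store moved before `sim_kick_all_cpus_sync()`, a walker that had already read the old value could observe a reused entry. The real kick_all_cpus_sync() gets the same guarantee differently. A CPU cannot take the IPI while its interrupts are disabled, so the call cannot return until every such walk has finished.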

Without this patch, a find_linux_pte_or_hugepte() running in parallel with
__split_huge_zero_page_pmd(), do_huge_pmd_wp_page_fallback() or
zap_huge_pmd() can run into the above issue. In
__split_huge_zero_page_pmd() and do_huge_pmd_wp_page_fallback() we clear
the hugepage pte before inserting a pmd entry that holds a regular page
table address. Such a clear needs to wait for the parallel
find_linux_pte_or_hugepte() to finish.

With zap_huge_pmd(), we can run into issues when a hugepage pte is
zapped due to MADV_DONTNEED while another CPU faults it in as small
pages.
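The zap_huge_pmd() path is reached from userspace through madvise(MADV_DONTNEED) on a THP-backed range. The sketch below only exercises that trigger. It does not reproduce the race. Several parts are assumptions: the 2 MiB huge page size, the alignment trick, and the helper name `thp_dontneed_demo()`. Whether a huge page is actually used depends on the kernel's THP configuration.

```c
/*
 * Hypothetical userspace sketch: populate an anonymous region that may be
 * backed by a transparent huge page, then drop it with MADV_DONTNEED.
 * Whether the kernel takes the zap_huge_pmd() path depends on THP being
 * enabled and a huge page being allocated; the program cannot tell.
 */
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

/* Assumed PMD size; it differs on some configurations (e.g. powerpc with
 * 64K base pages). */
#define HPAGE_GUESS (2UL << 20)

/* Returns 1 if the range reads back as zero after MADV_DONTNEED,
 * 0 if it does not, -1 if mmap or madvise(MADV_DONTNEED) failed. */
int thp_dontneed_demo(void)
{
	size_t len = 2 * HPAGE_GUESS;
	unsigned char *raw = mmap(NULL, len, PROT_READ | PROT_WRITE,
				  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (raw == MAP_FAILED)
		return -1;

	/* Align to the assumed huge page size so one whole pmd is covered. */
	uintptr_t a = ((uintptr_t)raw + HPAGE_GUESS - 1) & ~(HPAGE_GUESS - 1);
	unsigned char *buf = (unsigned char *)a;

#ifdef MADV_HUGEPAGE
	/* Best effort only: fails if THP is not available. */
	(void)madvise(buf, HPAGE_GUESS, MADV_HUGEPAGE);
#endif
	memset(buf, 0xab, HPAGE_GUESS);   /* fault it in, ideally as a THP */

	int ret = -1;
	if (madvise(buf, HPAGE_GUESS, MADV_DONTNEED) == 0) {
		/* Private anonymous memory is zero-filled on the next access. */
		ret = 1;
		for (size_t i = 0; i < HPAGE_GUESS; i++) {
			if (buf[i] != 0) {
				ret = 0;
				break;
			}
		}
	}
	munmap(raw, len);
	return ret;
}
```

Reading `buf` after the madvise() call faults the range back in. In the scenario the commit describes, that refault happens on another CPU while the hugepage pte is being zapped.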

Reported-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

Authored by Aneesh Kumar K.V and committed by Michael Ellerman.
13bd817b 030bbdbf

arch/powerpc/mm/pgtable_64.c | 11 +++++++++++
1 file changed, 11 insertions(+)
@@ -839,6 +839,17 @@
 	 * hash fault look at them.
 	 */
 	memset(pgtable, 0, PTE_FRAG_SIZE);
+	/*
+	 * Serialize against find_linux_pte_or_hugepte which does lock-less
+	 * lookup in page tables with local interrupts disabled. For huge pages
+	 * it casts pmd_t to pte_t. Since format of pte_t is different from
+	 * pmd_t we want to prevent transit from pmd pointing to page table
+	 * to pmd pointing to huge page (and back) while interrupts are disabled.
+	 * We clear pmd to possibly replace it with page table pointer in
+	 * different code paths. So make sure we wait for the parallel
+	 * find_linux_pte_or_hugepage to finish.
+	 */
+	kick_all_cpus_sync();
 	return old_pmd;
 }
 