Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

mm/memory: convert print_bad_pte() to print_bad_page_map()

print_bad_pte() looks like something that should actually be a WARN or
similar, but historically it has apparently proven useful for detecting
corruption of page tables even on production systems -- report the issue
and keep the system running to make it easier to figure out what is
going wrong (e.g., multiple such messages might shed some light).

As we want to unify vm_normal_page_*() handling for PTE/PMD/PUD, we'll
have to take care of print_bad_pte() as well.

Let's prepare for using print_bad_pte() also for non-PTEs by adjusting the
implementation and renaming the function to print_bad_page_map(). Provide
print_bad_pte() as a simple wrapper.

Document the implicit locking requirements for the page table re-walk.

To make the function a bit more readable, factor out the ratelimit check
into is_bad_page_map_ratelimited() and place the printing of page table
content into __print_bad_page_map_pgtable(). We'll now dump information
from each level in a single line, and just stop the table walk once we hit
something that is not a present page table.

The report will now look something like (dumping pgd to pmd values):

[ 77.943408] BUG: Bad page map in process XXX pte:80000001233f5867
[ 77.944077] addr:00007fd84bb1c000 vm_flags:08100071 anon_vma: ...
[ 77.945186] pgd:10a89f067 p4d:10a89f067 pud:10e5a2067 pmd:105327067

We are not using pgdp_get(), because that does not work properly on some
arm configs where pgd_t is an array. Note that, for simplicity, we dump
all levels even when some of them are folded.

[david@redhat.com: drop warning]
Link: https://lkml.kernel.org/r/923b279c-de33-44dd-a923-2959afad8626@redhat.com
Link: https://lkml.kernel.org/r/20250811112631.759341-9-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by David Hildenbrand, committed by Andrew Morton
ec63a440 b22cc9a9
+102 -20
include/linux/pgtable.h (+18)

···
 	PGTABLE_LEVEL_PGD,
 };
 
+static inline const char *pgtable_level_to_str(enum pgtable_level level)
+{
+	switch (level) {
+	case PGTABLE_LEVEL_PTE:
+		return "pte";
+	case PGTABLE_LEVEL_PMD:
+		return "pmd";
+	case PGTABLE_LEVEL_PUD:
+		return "pud";
+	case PGTABLE_LEVEL_P4D:
+		return "p4d";
+	case PGTABLE_LEVEL_PGD:
+		return "pgd";
+	default:
+		return "unknown";
+	}
+}
+
 #endif /* !__ASSEMBLY__ */
 
 #if !defined(MAX_POSSIBLE_PHYSMEM_BITS) && !defined(CONFIG_64BIT)
mm/memory.c (+84 -20)

···
 		add_mm_counter(mm, i, rss[i]);
 }
 
-/*
- * This function is called to print an error when a bad pte
- * is found. For example, we might have a PFN-mapped pte in
- * a region that doesn't allow it.
- *
- * The calling function must still handle the error.
- */
-static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
-			  pte_t pte, struct page *page)
+static bool is_bad_page_map_ratelimited(void)
 {
-	pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
-	p4d_t *p4d = p4d_offset(pgd, addr);
-	pud_t *pud = pud_offset(p4d, addr);
-	pmd_t *pmd = pmd_offset(pud, addr);
-	struct address_space *mapping;
-	pgoff_t index;
 	static unsigned long resume;
 	static unsigned long nr_shown;
 	static unsigned long nr_unshown;
···
 	if (nr_shown == 60) {
 		if (time_before(jiffies, resume)) {
 			nr_unshown++;
-			return;
+			return true;
 		}
 		if (nr_unshown) {
 			pr_alert("BUG: Bad page map: %lu messages suppressed\n",
···
 	}
 	if (nr_shown++ == 0)
 		resume = jiffies + 60 * HZ;
+	return false;
+}
+
+static void __print_bad_page_map_pgtable(struct mm_struct *mm, unsigned long addr)
+{
+	unsigned long long pgdv, p4dv, pudv, pmdv;
+	p4d_t p4d, *p4dp;
+	pud_t pud, *pudp;
+	pmd_t pmd, *pmdp;
+	pgd_t *pgdp;
+
+	/*
+	 * Although this looks like a fully lockless pgtable walk, it is not:
+	 * see locking requirements for print_bad_page_map().
+	 */
+	pgdp = pgd_offset(mm, addr);
+	pgdv = pgd_val(*pgdp);
+
+	if (!pgd_present(*pgdp) || pgd_leaf(*pgdp)) {
+		pr_alert("pgd:%08llx\n", pgdv);
+		return;
+	}
+
+	p4dp = p4d_offset(pgdp, addr);
+	p4d = p4dp_get(p4dp);
+	p4dv = p4d_val(p4d);
+
+	if (!p4d_present(p4d) || p4d_leaf(p4d)) {
+		pr_alert("pgd:%08llx p4d:%08llx\n", pgdv, p4dv);
+		return;
+	}
+
+	pudp = pud_offset(p4dp, addr);
+	pud = pudp_get(pudp);
+	pudv = pud_val(pud);
+
+	if (!pud_present(pud) || pud_leaf(pud)) {
+		pr_alert("pgd:%08llx p4d:%08llx pud:%08llx\n", pgdv, p4dv, pudv);
+		return;
+	}
+
+	pmdp = pmd_offset(pudp, addr);
+	pmd = pmdp_get(pmdp);
+	pmdv = pmd_val(pmd);
+
+	/*
+	 * Dumping the PTE would be nice, but it's tricky with CONFIG_HIGHPTE,
+	 * because the table should already be mapped by the caller and
+	 * doing another map would be bad. print_bad_page_map() should
+	 * already take care of printing the PTE.
+	 */
+	pr_alert("pgd:%08llx p4d:%08llx pud:%08llx pmd:%08llx\n", pgdv,
+		 p4dv, pudv, pmdv);
+}
+
+/*
+ * This function is called to print an error when a bad page table entry (e.g.,
+ * corrupted page table entry) is found. For example, we might have a
+ * PFN-mapped pte in a region that doesn't allow it.
+ *
+ * The calling function must still handle the error.
+ *
+ * This function must be called during a proper page table walk, as it will
+ * re-walk the page table to dump information: the caller MUST prevent page
+ * table teardown (by holding mmap, vma or rmap lock) and MUST hold the leaf
+ * page table lock.
+ */
+static void print_bad_page_map(struct vm_area_struct *vma,
+		unsigned long addr, unsigned long long entry, struct page *page,
+		enum pgtable_level level)
+{
+	struct address_space *mapping;
+	pgoff_t index;
+
+	if (is_bad_page_map_ratelimited())
+		return;
 
 	mapping = vma->vm_file ? vma->vm_file->f_mapping : NULL;
 	index = linear_page_index(vma, addr);
 
-	pr_alert("BUG: Bad page map in process %s pte:%08llx pmd:%08llx\n",
-		 current->comm,
-		 (long long)pte_val(pte), (long long)pmd_val(*pmd));
+	pr_alert("BUG: Bad page map in process %s %s:%08llx", current->comm,
+		 pgtable_level_to_str(level), entry);
+	__print_bad_page_map_pgtable(vma->vm_mm, addr);
 	if (page)
-		dump_page(page, "bad pte");
+		dump_page(page, "bad page map");
 	pr_alert("addr:%px vm_flags:%08lx anon_vma:%px mapping:%px index:%lx\n",
 		 (void *)addr, vma->vm_flags, vma->anon_vma, mapping, index);
 	pr_alert("file:%pD fault:%ps mmap:%ps mmap_prepare: %ps read_folio:%ps\n",
···
 	dump_stack();
 	add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);
 }
+#define print_bad_pte(vma, addr, pte, page) \
+	print_bad_page_map(vma, addr, pte_val(pte), page, PGTABLE_LEVEL_PTE)
 
 /*
  * vm_normal_page -- This function gets the "struct page" associated with a pte.