Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

mm: pass mm down to pagetable_{pte,pmd}_ctor

Patch series "Always call constructor for kernel page tables", v2.

There has been much confusion around exactly when page table
constructors/destructors (pagetable_*_[cd]tor) are supposed to be called.
They were initially introduced for user PTEs only (to support split page
table locks), then at the PMD level for the same purpose. Accounting was
added later on, starting at the PTE level and then moving to higher levels
(PMD, PUD). Finally, with my earlier series "Account page tables at all
levels" [1], the ctor/dtor is run for all levels, all the way to PGD.

I thought this was the end of the story, and it hopefully is for user
pgtables, but I was wrong as far as kernel pgtables are concerned. The
current situation there makes very little sense:

* At the PTE level, the ctor/dtor is not called (at least in the generic
implementation). Specific helpers are used for kernel pgtables at this
level (pte_{alloc,free}_kernel()) and those have never called the
ctor/dtor, most likely because they were initially irrelevant in the
kernel case.

* At all other levels, the ctor/dtor is normally called. This is
potentially wasteful at the PMD level (more on that later).

This series aims to ensure that the ctor/dtor is always called for kernel
pgtables, as it already is for user pgtables. Besides consistency, the
main motivation is to guarantee that ctor/dtor hooks are systematically
called; this makes it possible to insert hooks to protect page tables [2],
for instance. There is however an extra challenge: split locks are not
used for kernel pgtables, and it would therefore be wasteful to initialise
them (ptlock_init()).

It is worth clarifying exactly when split locks are used. They are
clearly used for user pgtables, but as illustrated in commit 61444cde9170
("ARM: 8591/1: mm: use fully constructed struct pages for EFI pgd
allocations"), they are also used for special page tables such as those of
efi_mm. The one case where split locks are definitely unused is pgtables
owned by init_mm; this is consistent with the behaviour of
apply_to_pte_range().

The approach chosen in this series is therefore to pass the mm associated
with the pgtables being constructed to pagetable_{pte,pmd}_ctor() (patch 1),
and skip ptlock_init() if mm == &init_mm (patches 3 and 7). This makes it
possible to call the PTE ctor/dtor from pte_{alloc,free}_kernel() without
unintended consequences (patch 3). As a result the accounting functions
are now called at all levels for kernel pgtables, and split locks are
never initialised for them.
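
To make the rule concrete, here is a minimal user-space sketch of what the PTE ctor might look like at the end of the series. It is an illustration only: the struct layouts, the boolean fields and the always-succeeding ptlock_init() are simplified stand-ins invented for this example, not the kernel's definitions, and the assert() is a model-only sanity check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel types; not the real definitions. */
struct mm_struct { const char *name; };
struct ptdesc {
	bool ptl_initialised;	/* models the split page table lock */
	bool accounted;		/* models the accounting done by the ctor */
};

struct mm_struct init_mm = { "init_mm" };	/* owns kernel page tables */
struct mm_struct efi_mm = { "efi_mm" };		/* a "special" mm */

/* Assumption: can fail in the kernel (dynamic allocation); never fails here. */
static inline bool ptlock_init(struct ptdesc *ptdesc)
{
	ptdesc->ptl_initialised = true;
	return true;
}

/* Sketch of the PTE ctor once the mm is passed down and acted upon. */
static inline bool pagetable_pte_ctor(struct mm_struct *mm,
				      struct ptdesc *ptdesc)
{
	assert(ptdesc != NULL);	/* model-only sanity check */

	/* init_mm is the one mm known never to use split locks. */
	if (mm != &init_mm && !ptlock_init(ptdesc))
		return false;

	ptdesc->accounted = true;	/* accounting happens for every mm */
	return true;
}
```

In this model, calling pagetable_pte_ctor(&init_mm, ...) accounts the table but leaves the lock untouched, while calling it with &efi_mm (or any user mm) initialises the lock as well.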

In configurations where ptlocks are dynamically allocated (32-bit,
PREEMPT_RT, etc.) and ARCH_ENABLE_SPLIT_PMD_PTLOCK is selected, this
series results in the removal of a kmem_cache allocation for every kernel
PMD. Additionally, for certain architectures that do not use
<asm-generic/pgalloc.h> such as s390, the same optimisation occurs at the
PTE level.
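
The saving can be illustrated with a small user-space model that counts lock allocations. This is a sketch under stated assumptions: malloc() stands in for the kmem_cache allocation, ptl_allocations and count_ptl_allocations() are hypothetical names introduced for the example, and the ctor is the simplified end-of-series form rather than kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct mm_struct { int unused; };
struct ptdesc { void *ptl; };	/* dynamically allocated lock */

struct mm_struct init_mm;
unsigned int ptl_allocations;	/* hypothetical counter of lock allocations */

/* malloc() stands in for the kmem_cache allocation mentioned above. */
static inline bool pmd_ptlock_init(struct ptdesc *ptdesc)
{
	ptdesc->ptl = malloc(sizeof(int));
	if (!ptdesc->ptl)
		return false;
	ptl_allocations++;
	return true;
}

static inline bool pagetable_pmd_ctor(struct mm_struct *mm,
				      struct ptdesc *ptdesc)
{
	assert(ptdesc != NULL);	/* model-only sanity check */

	if (mm != &init_mm && !pmd_ptlock_init(ptdesc))
		return false;
	return true;
}

/* Construct n PMD tables for mm; return how many lock allocations it cost. */
static inline unsigned int count_ptl_allocations(struct mm_struct *mm,
						 unsigned int n)
{
	unsigned int before = ptl_allocations;

	for (unsigned int i = 0; i < n; i++) {
		struct ptdesc ptdesc = { NULL };

		if (!pagetable_pmd_ctor(mm, &ptdesc))
			break;
		free(ptdesc.ptl);	/* free(NULL) is a no-op */
	}
	return ptl_allocations - before;
}
```

Constructing tables for &init_mm costs no lock allocations in this model, whereas constructing the same number for any other mm costs one allocation per table.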

===

Things get more complicated when it comes to special pgtable allocators
(patches 8-12). All architectures need such allocators to create initial
kernel pgtables; we are not concerned with those as the ctor cannot be
called so early in the boot sequence. However, those allocators may also
be used later in the boot sequence or during normal operations. There are
two main use-cases:

1. Mapping EFI memory: efi_mm (arm, arm64, riscv)
2. arch_add_memory(): init_mm

The ctor is already explicitly run (at the PTE/PMD level) in the first
case, as required for pgtables that are not associated with init_mm.
However the same allocators may also be used for the second use-case (or
others), and this is where it gets messy. Patch 1 calls the ctor with
NULL as mm in those situations, as the actual mm isn't available.
Practically this means that ptlocks will be unconditionally initialised.
This is fine on arm - create_mapping_late() is only used for the EFI
mapping. On arm64, __create_pgd_mapping() is also used by
arch_add_memory(); patches 8, 9 and 11 ensure that ctors are called at all
levels with the appropriate mm. The situation is similar on riscv, but
propagating the mm down to the ctor would require significant refactoring.
Since the ctors are already called unconditionally there, this series leaves
riscv no worse off - patch 10 adds comments to clarify the situation.
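
The difficulty with these allocators is that the mapping code typically reaches them through a callback that has no mm argument. One possible way of threading the mm through is sketched below as a user-space model. All names here (alloc_table(), alloc_table_init_mm(), alloc_table_special_mm(), create_mapping(), last_table) are hypothetical and chosen for illustration; they are not claimed to match what patches 8-11 actually do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct mm_struct { int unused; };
struct ptdesc { bool ptl_initialised; };

struct mm_struct init_mm;
struct ptdesc last_table;	/* most recently "allocated" table, for inspection */

/* Simplified end-of-series ctor: a NULL mm is not init_mm, so it gets a lock. */
static inline bool pagetable_pte_ctor(struct mm_struct *mm,
				      struct ptdesc *ptdesc)
{
	assert(ptdesc != NULL);	/* model-only sanity check */

	if (mm != &init_mm)
		ptdesc->ptl_initialised = true;
	return true;
}

/* Common helper: here the mm is an explicit parameter. */
static inline struct ptdesc *alloc_table(struct mm_struct *mm)
{
	last_table = (struct ptdesc){ false };
	if (!pagetable_pte_ctor(mm, &last_table))
		return NULL;
	return &last_table;
}

/* The mapping code only sees a callback without an mm argument ... */
typedef struct ptdesc *(*pgtable_alloc_fn)(void);

/* ... so each wrapper fixes the mm it allocates for. */
static inline struct ptdesc *alloc_table_init_mm(void)
{
	return alloc_table(&init_mm);
}

static inline struct ptdesc *alloc_table_special_mm(void)
{
	return alloc_table(NULL);	/* mm unknown: lock is initialised */
}

static inline struct ptdesc *create_mapping(pgtable_alloc_fn alloc)
{
	return alloc();
}
```

A caller acting on behalf of arch_add_memory() would pass alloc_table_init_mm and get a table without a lock; an EFI-style caller would pass alloc_table_special_mm and get one with the lock initialised, which mirrors the NULL-mm behaviour described above.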

From a cursory look at other architectures implementing arch_add_memory(),
s390 and x86 may also need a similar treatment to add constructor calls.
This is to be taken care of in a future version or as a follow-up.

===

The complications in those special pgtable allocators raise the question:
does it really make sense to treat efi_mm and init_mm differently in e.g.
apply_to_pte_range()? Maybe what we really need is a way to tell if an mm
corresponds to user memory or not, and never use split locks for non-user
mms. Feedback and suggestions welcome!


This patch (of 12):

In preparation for calling constructors for all kernel page tables while
eliding unnecessary ptlock initialisation, let's pass down the associated
mm to the PTE/PMD level ctors. (These are the two levels where ptlocks
are used.)

In most cases the mm is already around at the point of calling the ctor so
we simply pass it down. This is however not the case for special page
table allocators:

* arch/arm/mm/mmu.c
* arch/arm64/mm/mmu.c
* arch/riscv/mm/init.c

In those cases, the page tables being allocated are either for standard
kernel memory (init_mm) or special page directories, which may not be
associated with any mm. For now let's pass NULL as mm; this will be refined
where possible in future patches.

No functional change in this patch.

Link: https://lore.kernel.org/linux-mm/20250103184415.2744423-1-kevin.brodsky@arm.com/ [1]
Link: https://lore.kernel.org/linux-hardening/20250203101839.1223008-1-kevin.brodsky@arm.com/ [2]
Link: https://lkml.kernel.org/r/20250408095222.860601-1-kevin.brodsky@arm.com
Link: https://lkml.kernel.org/r/20250408095222.860601-2-kevin.brodsky@arm.com
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> [s390]
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: <x86@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by Kevin Brodsky and committed by Andrew Morton
d82d3bf4 24c76f37

+30 -28
+1 -1
arch/arm/mm/mmu.c
···
 	void *ptdesc = pagetable_alloc(GFP_PGTABLE_KERNEL & ~__GFP_HIGHMEM,
 				       get_order(sz));
 
-	if (!ptdesc || !pagetable_pte_ctor(ptdesc))
+	if (!ptdesc || !pagetable_pte_ctor(NULL, ptdesc))
 		BUG();
 	return ptdesc_to_virt(ptdesc);
 }
+2 -2
arch/arm64/mm/mmu.c
···
 	 * folded, and if so pagetable_pte_ctor() becomes nop.
 	 */
 	if (shift == PAGE_SHIFT)
-		BUG_ON(!pagetable_pte_ctor(ptdesc));
+		BUG_ON(!pagetable_pte_ctor(NULL, ptdesc));
 	else if (shift == PMD_SHIFT)
-		BUG_ON(!pagetable_pmd_ctor(ptdesc));
+		BUG_ON(!pagetable_pmd_ctor(NULL, ptdesc));
 
 	return pa;
 }
+1 -1
arch/loongarch/include/asm/pgalloc.h
···
 	if (!ptdesc)
 		return NULL;
 
-	if (!pagetable_pmd_ctor(ptdesc)) {
+	if (!pagetable_pmd_ctor(mm, ptdesc)) {
 		pagetable_free(ptdesc);
 		return NULL;
 	}
+1 -1
arch/m68k/include/asm/mcf_pgalloc.h
···
 
 	if (!ptdesc)
 		return NULL;
-	if (!pagetable_pte_ctor(ptdesc)) {
+	if (!pagetable_pte_ctor(mm, ptdesc)) {
 		pagetable_free(ptdesc);
 		return NULL;
 	}
+5 -5
arch/m68k/include/asm/motorola_pgalloc.h
···
 };
 
 extern void init_pointer_table(void *table, int type);
-extern void *get_pointer_table(int type);
+extern void *get_pointer_table(struct mm_struct *mm, int type);
 extern int free_pointer_table(void *table, int type);
 
 /*
···
 
 static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
 {
-	return get_pointer_table(TABLE_PTE);
+	return get_pointer_table(mm, TABLE_PTE);
 }
 
 static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
···
 
 static inline pgtable_t pte_alloc_one(struct mm_struct *mm)
 {
-	return get_pointer_table(TABLE_PTE);
+	return get_pointer_table(mm, TABLE_PTE);
 }
 
 static inline void pte_free(struct mm_struct *mm, pgtable_t pgtable)
···
 
 static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address)
 {
-	return get_pointer_table(TABLE_PMD);
+	return get_pointer_table(mm, TABLE_PMD);
 }
 
 static inline int pmd_free(struct mm_struct *mm, pmd_t *pmd)
···
 
 static inline pgd_t *pgd_alloc(struct mm_struct *mm)
 {
-	return get_pointer_table(TABLE_PGD);
+	return get_pointer_table(mm, TABLE_PGD);
 }
 
 
+3 -3
arch/m68k/mm/motorola.c
···
 	return;
 }
 
-void *get_pointer_table(int type)
+void *get_pointer_table(struct mm_struct *mm, int type)
 {
 	ptable_desc *dp = ptable_list[type].next;
 	unsigned int mask = list_empty(&ptable_list[type]) ? 0 : PD_MARKBITS(dp);
···
 			 * m68k doesn't have SPLIT_PTE_PTLOCKS for not having
 			 * SMP.
 			 */
-			pagetable_pte_ctor(virt_to_ptdesc(page));
+			pagetable_pte_ctor(mm, virt_to_ptdesc(page));
 			break;
 		case TABLE_PMD:
-			pagetable_pmd_ctor(virt_to_ptdesc(page));
+			pagetable_pmd_ctor(mm, virt_to_ptdesc(page));
 			break;
 		case TABLE_PGD:
 			pagetable_pgd_ctor(virt_to_ptdesc(page));
+1 -1
arch/mips/include/asm/pgalloc.h
···
 	if (!ptdesc)
 		return NULL;
 
-	if (!pagetable_pmd_ctor(ptdesc)) {
+	if (!pagetable_pmd_ctor(mm, ptdesc)) {
 		pagetable_free(ptdesc);
 		return NULL;
 	}
+1 -1
arch/parisc/include/asm/pgalloc.h
···
 	ptdesc = pagetable_alloc(gfp, PMD_TABLE_ORDER);
 	if (!ptdesc)
 		return NULL;
-	if (!pagetable_pmd_ctor(ptdesc)) {
+	if (!pagetable_pmd_ctor(mm, ptdesc)) {
 		pagetable_free(ptdesc);
 		return NULL;
 	}
+1 -1
arch/powerpc/mm/book3s64/pgtable.c
···
 	ptdesc = pagetable_alloc(gfp, 0);
 	if (!ptdesc)
 		return NULL;
-	if (!pagetable_pmd_ctor(ptdesc)) {
+	if (!pagetable_pmd_ctor(mm, ptdesc)) {
 		pagetable_free(ptdesc);
 		return NULL;
 	}
+1 -1
arch/powerpc/mm/pgtable-frag.c
···
 	ptdesc = pagetable_alloc(PGALLOC_GFP | __GFP_ACCOUNT, 0);
 	if (!ptdesc)
 		return NULL;
-	if (!pagetable_pte_ctor(ptdesc)) {
+	if (!pagetable_pte_ctor(mm, ptdesc)) {
 		pagetable_free(ptdesc);
 		return NULL;
 	}
+2 -2
arch/riscv/mm/init.c
···
 {
 	struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL & ~__GFP_HIGHMEM, 0);
 
-	BUG_ON(!ptdesc || !pagetable_pte_ctor(ptdesc));
+	BUG_ON(!ptdesc || !pagetable_pte_ctor(NULL, ptdesc));
 	return __pa((pte_t *)ptdesc_address(ptdesc));
 }
 
···
 {
 	struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL & ~__GFP_HIGHMEM, 0);
 
-	BUG_ON(!ptdesc || !pagetable_pmd_ctor(ptdesc));
+	BUG_ON(!ptdesc || !pagetable_pmd_ctor(NULL, ptdesc));
 	return __pa((pmd_t *)ptdesc_address(ptdesc));
 }
 
+1 -1
arch/s390/include/asm/pgalloc.h
···
 	if (!table)
 		return NULL;
 	crst_table_init(table, _SEGMENT_ENTRY_EMPTY);
-	if (!pagetable_pmd_ctor(virt_to_ptdesc(table))) {
+	if (!pagetable_pmd_ctor(mm, virt_to_ptdesc(table))) {
 		crst_table_free(mm, table);
 		return NULL;
 	}
+1 -1
arch/s390/mm/pgalloc.c
···
 	ptdesc = pagetable_alloc(GFP_KERNEL, 0);
 	if (!ptdesc)
 		return NULL;
-	if (!pagetable_pte_ctor(ptdesc)) {
+	if (!pagetable_pte_ctor(mm, ptdesc)) {
 		pagetable_free(ptdesc);
 		return NULL;
 	}
+1 -1
arch/sparc/mm/init_64.c
···
 
 	if (!ptdesc)
 		return NULL;
-	if (!pagetable_pte_ctor(ptdesc)) {
+	if (!pagetable_pte_ctor(mm, ptdesc)) {
 		pagetable_free(ptdesc);
 		return NULL;
 	}
+1 -1
arch/sparc/mm/srmmu.c
···
 	page = pfn_to_page(__nocache_pa((unsigned long)ptep) >> PAGE_SHIFT);
 	spin_lock(&mm->page_table_lock);
 	if (page_ref_inc_return(page) == 2 &&
-			!pagetable_pte_ctor(page_ptdesc(page))) {
+			!pagetable_pte_ctor(mm, page_ptdesc(page))) {
 		page_ref_dec(page);
 		ptep = NULL;
 	}
+1 -1
arch/x86/mm/pgtable.c
···
 
 		if (!ptdesc)
 			failed = true;
-		if (ptdesc && !pagetable_pmd_ctor(ptdesc)) {
+		if (ptdesc && !pagetable_pmd_ctor(mm, ptdesc)) {
 			pagetable_free(ptdesc);
 			ptdesc = NULL;
 			failed = true;
+2 -2
include/asm-generic/pgalloc.h
···
 	ptdesc = pagetable_alloc_noprof(gfp, 0);
 	if (!ptdesc)
 		return NULL;
-	if (!pagetable_pte_ctor(ptdesc)) {
+	if (!pagetable_pte_ctor(mm, ptdesc)) {
 		pagetable_free(ptdesc);
 		return NULL;
 	}
···
 	ptdesc = pagetable_alloc_noprof(gfp, 0);
 	if (!ptdesc)
 		return NULL;
-	if (!pagetable_pmd_ctor(ptdesc)) {
+	if (!pagetable_pmd_ctor(mm, ptdesc)) {
 		pagetable_free(ptdesc);
 		return NULL;
 	}
+4 -2
include/linux/mm.h
···
 	pagetable_free(ptdesc);
 }
 
-static inline bool pagetable_pte_ctor(struct ptdesc *ptdesc)
+static inline bool pagetable_pte_ctor(struct mm_struct *mm,
+				      struct ptdesc *ptdesc)
 {
 	if (!ptlock_init(ptdesc))
 		return false;
···
 	return ptl;
 }
 
-static inline bool pagetable_pmd_ctor(struct ptdesc *ptdesc)
+static inline bool pagetable_pmd_ctor(struct mm_struct *mm,
+				      struct ptdesc *ptdesc)
 {
 	if (!pmd_ptlock_init(ptdesc))
 		return false;