Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

mm/page_table_check: support userfault wr-protect entries

Allow page_table_check hooks to check userfaultfd wr-protect criteria
upon pgtable updates.  The rule is that no writable flag is ever allowed
to co-exist with the userfault wr-protect flag.

This should be better than commit c2da319c2e, where we used to only
sanitize such issues during a pgtable walk: when hitting such an issue we
didn't have a good chance to know where that writable bit came from [1], so
even though the pgtable walk exposed a kernel bug (which is still helpful
for triaging), it was not easy to track down and debug.

Now we switch to tracking the source instead.  It's also much easier with
the recent introduction of page table check.

There are some limitations with using the page table check here for the
userfaultfd wr-protect purpose:

- It is only enabled with explicit enablement of page table check configs
  and/or boot parameters, but that should be good enough to track at least
  syzbot issues, as syzbot enables PAGE_TABLE_CHECK[_ENFORCED] for
  x86 [1].  We used to rely on DEBUG_VM, but it's now off for most distros,
  and distros normally do not enable PAGE_TABLE_CHECK[_ENFORCED] either, so
  the coverage is similar.

- It only conditionally works with the ptep_modify_prot API: it is
  bypassed when e.g. XEN PV is enabled, but it still works for most other
  scenarios, which should be the common case, so it should be good enough.

- The hugetlb check is a bit hairy, as the page table check cannot tell a
  hugetlb pte from a normal pte by trapping set_pte_at(), because of the
  current design where hugetlb maps every layer to pte_t...  For example,
  the default set_huge_pte_at() can invoke set_pte_at() directly and lose
  the hugetlb context, treating it the same as a normal pte_t.  So far that
  is fine, because huge_pte_uffd_wp() is always equal to pte_uffd_wp()
  wherever it is supported (x86 only).  It will become a bigger problem if
  we ever define _PAGE_UFFD_WP differently at various pgtable levels,
  because a single per-arch huge_pte_uffd_wp() will stop making sense
  first; as of now we can leave this for later too.

This patch also removes commit c2da319c2e altogether, as we have something
better now.

[1] https://lore.kernel.org/all/000000000000dce0530615c89210@google.com/

Link: https://lkml.kernel.org/r/20240417212549.2766883-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by Peter Xu, committed by Andrew Morton (8430557f 3ccae1dc)

+39 -18 (3 files changed)

Documentation/mm/page_table_check.rst (+8 -1)

···
 accessible from the userspace by getting their page table entries (PTEs PMDs
 etc.) added into the table.

-In case of detected corruption, the kernel is crashed. There is a small
+In case of most detected corruption, the kernel is crashed. There is a small
 performance and memory overhead associated with the page table check. Therefore,
 it is disabled by default, but can be optionally enabled on systems where the
 extra hardening outweighs the performance costs. Also, because page table check
 is synchronous, it can help with debugging double map memory corruption issues,
 by crashing kernel at the time wrong mapping occurs instead of later which is
 often the case with memory corruptions bugs.
+
+It can also be used to do page table entry checks over various flags, dump
+warnings when illegal combinations of entry flags are detected. Currently,
+userfaultfd is the only user of such to sanity check wr-protect bit against
+any writable flags. Illegal flag combinations will not directly cause data
+corruption in this case immediately, but that will cause read-only data to
+be writable, leading to corrupt when the page content is later modified.

 Double mapping detection logic
 ==============================
arch/x86/include/asm/pgtable.h (+1 -17)

···
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
 static inline int pte_uffd_wp(pte_t pte)
 {
-	bool wp = pte_flags(pte) & _PAGE_UFFD_WP;
-
-#ifdef CONFIG_DEBUG_VM
-	/*
-	 * Having write bit for wr-protect-marked present ptes is fatal,
-	 * because it means the uffd-wp bit will be ignored and write will
-	 * just go through.
-	 *
-	 * Use any chance of pgtable walking to verify this (e.g., when
-	 * page swapped out or being migrated for all purposes). It means
-	 * something is already wrong. Tell the admin even before the
-	 * process crashes. We also nail it with wrong pgtable setup.
-	 */
-	WARN_ON_ONCE(wp && pte_write(pte));
-#endif
-
-	return wp;
+	return pte_flags(pte) & _PAGE_UFFD_WP;
 }

 static inline pte_t pte_mkuffd_wp(pte_t pte)
mm/page_table_check.c (+30)

···
 #include <linux/kstrtox.h>
 #include <linux/mm.h>
 #include <linux/page_table_check.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>

 #undef pr_fmt
 #define pr_fmt(fmt)	"page_table_check: " fmt
···
 }
 EXPORT_SYMBOL(__page_table_check_pud_clear);

+/* Whether the swap entry cached writable information */
+static inline bool swap_cached_writable(swp_entry_t entry)
+{
+	return is_writable_device_exclusive_entry(entry) ||
+	       is_writable_device_private_entry(entry) ||
+	       is_writable_migration_entry(entry);
+}
+
+static inline void page_table_check_pte_flags(pte_t pte)
+{
+	if (pte_present(pte) && pte_uffd_wp(pte))
+		WARN_ON_ONCE(pte_write(pte));
+	else if (is_swap_pte(pte) && pte_swp_uffd_wp(pte))
+		WARN_ON_ONCE(swap_cached_writable(pte_to_swp_entry(pte)));
+}
+
 void __page_table_check_ptes_set(struct mm_struct *mm, pte_t *ptep, pte_t pte,
 				 unsigned int nr)
 {
···
 	if (&init_mm == mm)
 		return;

+	page_table_check_pte_flags(pte);
+
 	for (i = 0; i < nr; i++)
 		__page_table_check_pte_clear(mm, ptep_get(ptep + i));
 	if (pte_user_accessible_page(pte))
···
 }
 EXPORT_SYMBOL(__page_table_check_ptes_set);

+static inline void page_table_check_pmd_flags(pmd_t pmd)
+{
+	if (pmd_present(pmd) && pmd_uffd_wp(pmd))
+		WARN_ON_ONCE(pmd_write(pmd));
+	else if (is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd))
+		WARN_ON_ONCE(swap_cached_writable(pmd_to_swp_entry(pmd)));
+}
+
 void __page_table_check_pmd_set(struct mm_struct *mm, pmd_t *pmdp, pmd_t pmd)
 {
 	if (&init_mm == mm)
 		return;
+
+	page_table_check_pmd_flags(pmd);

 	__page_table_check_pmd_clear(mm, *pmdp);
 	if (pmd_user_accessible_page(pmd)) {
··· 7 7 #include <linux/kstrtox.h> 8 8 #include <linux/mm.h> 9 9 #include <linux/page_table_check.h> 10 + #include <linux/swap.h> 11 + #include <linux/swapops.h> 10 12 11 13 #undef pr_fmt 12 14 #define pr_fmt(fmt) "page_table_check: " fmt ··· 184 182 } 185 183 EXPORT_SYMBOL(__page_table_check_pud_clear); 186 184 185 + /* Whether the swap entry cached writable information */ 186 + static inline bool swap_cached_writable(swp_entry_t entry) 187 + { 188 + return is_writable_device_exclusive_entry(entry) || 189 + is_writable_device_private_entry(entry) || 190 + is_writable_migration_entry(entry); 191 + } 192 + 193 + static inline void page_table_check_pte_flags(pte_t pte) 194 + { 195 + if (pte_present(pte) && pte_uffd_wp(pte)) 196 + WARN_ON_ONCE(pte_write(pte)); 197 + else if (is_swap_pte(pte) && pte_swp_uffd_wp(pte)) 198 + WARN_ON_ONCE(swap_cached_writable(pte_to_swp_entry(pte))); 199 + } 200 + 187 201 void __page_table_check_ptes_set(struct mm_struct *mm, pte_t *ptep, pte_t pte, 188 202 unsigned int nr) 189 203 { ··· 208 190 if (&init_mm == mm) 209 191 return; 210 192 193 + page_table_check_pte_flags(pte); 194 + 211 195 for (i = 0; i < nr; i++) 212 196 __page_table_check_pte_clear(mm, ptep_get(ptep + i)); 213 197 if (pte_user_accessible_page(pte)) ··· 217 197 } 218 198 EXPORT_SYMBOL(__page_table_check_ptes_set); 219 199 200 + static inline void page_table_check_pmd_flags(pmd_t pmd) 201 + { 202 + if (pmd_present(pmd) && pmd_uffd_wp(pmd)) 203 + WARN_ON_ONCE(pmd_write(pmd)); 204 + else if (is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd)) 205 + WARN_ON_ONCE(swap_cached_writable(pmd_to_swp_entry(pmd))); 206 + } 207 + 220 208 void __page_table_check_pmd_set(struct mm_struct *mm, pmd_t *pmdp, pmd_t pmd) 221 209 { 222 210 if (&init_mm == mm) 223 211 return; 212 + 213 + page_table_check_pmd_flags(pmd); 224 214 225 215 __page_table_check_pmd_clear(mm, *pmdp); 226 216 if (pmd_user_accessible_page(pmd)) {