Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

userfaultfd: UFFD_FEATURE_WP_ASYNC

Patch series "Implement IOCTL to get and optionally clear info about
PTEs", v33.

*Motivation*
The real motivation for adding the PAGEMAP_SCAN IOCTL is to emulate the
Windows GetWriteWatch() and ResetWriteWatch() syscalls [1]. GetWriteWatch()
retrieves the addresses of the pages that have been written to in a region
of virtual memory.

This syscall is used in Windows applications and games. It is currently
emulated in userspace in a rather slow manner. Our purpose is to enhance
the kernel so that it can be emulated efficiently. Currently, some
out-of-tree hack patches are used to emulate it efficiently in some
kernels. We intend to replace those with these patches, so that gaming on
Linux as a whole can benefit. This means the code would have a large
number of users.

CRIU use case [2] was mentioned by Andrei and Danylo:
> Use cases for migrating sparse VMAs are binaries sanitized with ASAN,
> MSAN or TSAN [3]. All of these sanitizers produce sparse mappings of
> shadow memory [4]. Being able to migrate such binaries allows to highly
> reduce the amount of work needed to identify and fix post-migration
> crashes, which happen constantly.

Andrei describes the following benefits of this code:
* It is more granular and allows us to track changed pages more
effectively. The current interface can clear dirty bits for the entire
process only. In addition, reading info about pages is a separate
operation. It means we must freeze the process to read information
about all its pages and reset dirty bits, and only then can we start
dumping pages. The information about pages becomes more and more outdated
while we are processing pages. The new interface solves both of these
downsides. First, it allows us to read pte bits and clear the
soft-dirty bit atomically. It means that CRIU will not need to freeze
processes to pre-dump their memory. Second, it clears soft-dirty bits
for a specified region of memory. It means CRIU will have up-to-date info
about pages at the moment of dumping them.
* The new interface should be much faster because basic page filtering
happens in the kernel. With the old interface, we have to read
pagemap for each page.

*Implementation Evolution (Short Summary)*
From the definition of GetWriteWatch(), we felt that the kernel's
soft-dirty feature could be used under the hood with some additions:
* reset the soft-dirty flag for only a specific region of memory instead
of clearing the flag for the entire process
* get and clear the soft-dirty flag for a specific region atomically

So we decided to use an ioctl on the pagemap file to read and/or reset the
soft-dirty flag. But when using the soft-dirty flag, we sometimes get extra
pages which were never written. They become soft-dirty because of VMA
merging and the VM_SOFTDIRTY flag. This breaks the definition of
GetWriteWatch(). We were able to bypass this shortcoming by ignoring
VM_SOFTDIRTY, until David reported that mprotect etc. messes up the
soft-dirty flag while ignoring VM_SOFTDIRTY [5]. This wasn't happening
until [6] was introduced. We discussed whether these patches could be
reverted, but could not reach any conclusion. So at this point, I made a
couple of attempts to solve the whole VM_SOFTDIRTY issue by correcting the
soft-dirty implementation:
* [7] Correct the bug that was fixed wrongly back in 2014. It had the
potential to cause regressions, so we left it behind.
* [8] Keep a list of the soft-dirty parts of a VMA across splits and
merges. The reply I got was not to increase the size of the VMA by 8 bytes.
At this point, we gave up on soft-dirty, considering it too delicate, and
userfaultfd [9] seemed like the only way forward. From there onward, we
have been basing soft-dirty emulation on the userfaultfd wp feature, where
the kernel resolves the faults itself when the WP_ASYNC feature is used. It
was straightforward to add the WP_ASYNC feature to userfaultfd. Now only
those pages which have really been written are reported as dirty or
written-to. (PS: another userfaultfd feature, WP_UNPOPULATED, is also
required, to avoid pre-faulting memory before write-protecting [9].)

All the different masks were added at the request of the CRIU devs to make
the interface more generic.

[1] https://learn.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-getwritewatch
[2] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com
[3] https://github.com/google/sanitizers
[4] https://github.com/google/sanitizers/wiki/AddressSanitizerAlgorithm#64-bit
[5] https://lore.kernel.org/all/bfcae708-db21-04b4-0bbe-712badd03071@redhat.com
[6] https://lore.kernel.org/all/20220725142048.30450-1-peterx@redhat.com/
[7] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.com
[8] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.com
[9] https://lore.kernel.org/all/20230306213925.617814-1-peterx@redhat.com
[10] https://lore.kernel.org/all/20230125144529.1630917-1-mdanylo@google.com


This patch (of 6):

Add a new userfaultfd-wp feature, UFFD_FEATURE_WP_ASYNC, that allows
userfaultfd wr-protect faults to be resolved by the kernel directly.

It can be used like a high-accuracy version of soft-dirty, without vma
modifications during tracking, and with ranged support by default rather
than for a whole mm when resetting the protections, thanks to the
existence of ioctl(UFFDIO_WRITEPROTECT).
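
From userspace, enabling this mode could look roughly like the sketch
below. It is an illustration only, not code from this series: the helper
name wp_async_track() is hypothetical, the fallback value for
UFFD_FEATURE_WP_ASYNC is copied from the uapi hunk of this patch, and the
helper simply returns -1 on kernels or configurations where any step
fails.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Value taken from the uapi hunk in this patch; older headers lack it. */
#ifndef UFFD_FEATURE_WP_ASYNC
#define UFFD_FEATURE_WP_ASYNC (1 << 15)
#endif

/*
 * Hypothetical helper: start async write tracking on [addr, addr + len).
 * Returns the userfaultfd descriptor on success, or -1 on any failure
 * (old kernel, missing privileges, unsupported configuration, ...).
 */
static int wp_async_track(void *addr, size_t len)
{
	struct uffdio_api api = {
		.api = UFFD_API,
		.features = UFFD_FEATURE_WP_ASYNC,
	};
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode = UFFDIO_REGISTER_MODE_WP,
	};
	struct uffdio_writeprotect wp = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode = UFFDIO_WRITEPROTECT_MODE_WP,
	};
	int fd = (int)syscall(SYS_userfaultfd, O_CLOEXEC);

	if (fd < 0)
		return -1;

	/* Handshake, register the range for WP, then write-protect it. */
	if (ioctl(fd, UFFDIO_API, &api) ||
	    ioctl(fd, UFFDIO_REGISTER, &reg) ||
	    ioctl(fd, UFFDIO_WRITEPROTECT, &wp)) {
		close(fd);
		return -1;
	}
	return fd;
}
```

Calling UFFDIO_WRITEPROTECT again on the same range would reset tracking
for that range only, which is the ranged behaviour described above.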

Several goals of such a dirty tracking interface:

1. All types of memory should be supported and trackable. This comes
naturally for soft-dirty, but is worth mentioning in the context of
userfaultfd, because it used to support only anon/shmem/hugetlb. The
problem is that for dirty tracking purposes these three types may not be
enough, and it's legal to track anything, e.g. any page cache writes
from mmap.

2. Protections can be applied to part of a memory range, without vma
split/merge fuss. The hope is that the tracking itself should not cause
any vma layout change. It also helps when a reset happens, because the
reset will not need the mmap write lock, which can block the tracee.

3. Accuracy needs to be maintained. This means we need pte markers to work
on any type of VMA.

One could object that the whole concept of async dirty tracking is not
really close to what userfaultfd fundamentally used to be: it's not "a
fault to be serviced by userspace" anymore. However, using userfaultfd-wp
here as a framework is convenient for us in at least the following ways:

1. The VM_UFFD_WP vma flag, which has a very good name to suit something
like this, so we don't need VM_YET_ANOTHER_SOFT_DIRTY. Just use a new
feature bit to distinguish it from the sync version of uffd-wp
registration.

2. The PTE markers logic can be leveraged across the whole kernel to
maintain the uffd-wp bit as long as an arch supports it. This also
applies to this case, where the uffd-wp bit will be a hint to dirty
information and will not get lost easily (e.g. when some page cache ptes
get zapped).

3. Reuse of the ioctl(UFFDIO_WRITEPROTECT) interface for either starting
or resetting tracking on a range of memory. There's no counterpart in the
old soft-dirty world, so if this were wanted in a new design we would
need a new interface.

We can somehow understand that commonality because uffd-wp was
fundamentally a similar idea of write-protecting pages just like
soft-dirty.
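
Because the meaning is reversed relative to soft-dirty (uffd-wp bit set
means clean, cleared means written), a reader of pagemap has to invert the
bit. The sketch below illustrates that; the helper names are hypothetical.
It relies on the documented pagemap layout (one 64-bit entry per virtual
page, bit 57 meaning "pte is uffd-wp write-protected") and is only
meaningful for pages that were explicitly write-protected beforehand.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 57 of a pagemap entry: pte is uffd-wp write-protected. */
#define PM_UFFD_WP (UINT64_C(1) << 57)

/* Hypothetical helper: byte offset of the pagemap entry for vaddr. */
static uint64_t pagemap_offset(uintptr_t vaddr, uint64_t page_size)
{
	return (uint64_t)(vaddr / page_size) * sizeof(uint64_t);
}

/*
 * Hypothetical helper: a tracked page counts as written once its
 * uffd-wp bit has been dropped by the kernel.
 */
static bool page_written_since_wp(uint64_t pagemap_entry)
{
	return (pagemap_entry & PM_UFFD_WP) == 0;
}
```

An entry would typically be fetched with pread() on /proc/self/pagemap at
pagemap_offset(addr, page_size) and then passed to
page_written_since_wp().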

This implementation makes WP_ASYNC imply WP_UNPOPULATED, because so far
WP_ASYNC does not seem usable without WP_UNPOPULATED. This also leaves us
room to change the implementation of WP_ASYNC in case it no longer needs
to depend on WP_UNPOPULATED in future kernels. It's also fine to imply it
because both features rely on the PTE_MARKER_UFFD_WP config option, so
they'll show up together (or both be missing) in a UFFDIO_API probe.
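
The implication and the probe behaviour can be modelled in a few lines of
standalone C. This is a userspace model for illustration, not the kernel
code: the MODEL_ names and both helper functions are hypothetical, and
only the bit positions are taken from the uapi hunk of this patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as defined in the uapi header touched by this patch. */
#define MODEL_WP_HUGETLBFS_SHMEM (UINT64_C(1) << 12)
#define MODEL_WP_UNPOPULATED     (UINT64_C(1) << 13)
#define MODEL_WP_ASYNC           (UINT64_C(1) << 15)

/* Requesting WP_ASYNC silently turns on WP_UNPOPULATED as well. */
static uint64_t model_enable_features(uint64_t requested)
{
	if (requested & MODEL_WP_ASYNC)
		requested |= MODEL_WP_UNPOPULATED;
	return requested;
}

/*
 * Features reported back by the probe: without pte marker support all
 * three marker-based features disappear together.
 */
static uint64_t model_reported_features(uint64_t all, bool pte_markers)
{
	if (!pte_markers)
		all &= ~(MODEL_WP_HUGETLBFS_SHMEM | MODEL_WP_UNPOPULATED |
			 MODEL_WP_ASYNC);
	return all;
}
```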

vma_can_userfault() now allows any VMA if the userfaultfd registration is
only about async uffd-wp. So we can track dirty for all kinds of memory
including generic file systems (like XFS, EXT4 or BTRFS).
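
The decision made by vma_can_userfault() after this change can be
summarised by the following userspace model. It mirrors the order of the
checks in the header hunk of this patch, but every name in it (the toy_
enum, the TOY_ flags, the function itself) is hypothetical and exists only
for illustration.

```c
#include <stdbool.h>

enum toy_mem { TOY_ANON, TOY_SHMEM, TOY_HUGETLB, TOY_FILE };

#define TOY_UFFD_MISSING 0x1u
#define TOY_UFFD_WP      0x2u
#define TOY_UFFD_MINOR   0x4u

/* Model of the registration check; pte_markers stands for the config. */
static bool toy_can_userfault(enum toy_mem type, unsigned int mode,
			      bool wp_async, bool pte_markers)
{
	/* Minor faults are only supported on hugetlb and shmem. */
	if ((mode & TOY_UFFD_MINOR) && type != TOY_HUGETLB && type != TOY_SHMEM)
		return false;

	/* Async wp as the only mode: any memory type is allowed. */
	if (wp_async && mode == TOY_UFFD_WP)
		return true;

	/* Without pte markers, uffd-wp only works on anonymous memory. */
	if (!pte_markers && (mode & TOY_UFFD_WP) && type != TOY_ANON)
		return false;

	/* By default, allow any of anon, shmem, hugetlb. */
	return type != TOY_FILE;
}
```

Note that combining WP with another mode on a regular file mapping is
still rejected, since the exception applies only when WP is the sole mode.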

One trick worth mentioning in do_wp_page() is that we need to manually
update vmf->orig_pte here because it can be used later with a pte_same()
check - this path always has FAULT_FLAG_ORIG_PTE_VALID set in the flags.
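
Why the snapshot must be refreshed can be seen in a toy model. The types
and helpers below are hypothetical stand-ins, not kernel code: a fault
context keeps a copy of the entry taken when the fault started, and a
later equality check against the live entry only passes if that copy was
updated after the uffd-wp bit was cleared.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_PTE_UFFD_WP (UINT64_C(1) << 0)

struct toy_fault {
	uint64_t *pte;     /* live page-table entry */
	uint64_t orig_pte; /* snapshot taken when the fault started */
};

static bool toy_pte_same(uint64_t a, uint64_t b)
{
	return a == b;
}

/* Clear the wp bit in place; optionally refresh the snapshot too. */
static void toy_resolve_wp_async(struct toy_fault *vmf, bool update_orig)
{
	uint64_t pte = *vmf->pte & ~TOY_PTE_UFFD_WP;

	*vmf->pte = pte;
	if (update_orig)
		vmf->orig_pte = pte;
}
```

With update_orig set to false the later comparison fails, which in this
model corresponds to the fault path concluding that the entry changed
underneath it.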

The major defect of this approach to dirty tracking is that we need to
populate the pgtables when tracking starts. Soft-dirty doesn't work like
that. This is unwanted when the range of memory to track is huge and
unpopulated (e.g., tracking updates on a 10G file with mmap() on top,
without having any page cache installed yet). One way to improve this is
to allow pte markers to exist at levels above the PTE (PMD and higher).
That would not change the interface if implemented, so we can leave it
for later.
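
To give a rough sense of scale for the 10G example, the arithmetic can be
written out as below. The numbers are assumptions rather than figures from
this series: 4 KiB pages and 512 entries per last-level table (as on
x86-64), with each table occupying one page and upper levels ignored. The
helper names are hypothetical.

```c
#include <stdint.h>

/* Number of last-level entries needed to cover the range. */
static uint64_t pte_count(uint64_t range_bytes, uint64_t page_size)
{
	return range_bytes / page_size;
}

/* Memory used by last-level tables, assuming one page per table. */
static uint64_t pte_table_bytes(uint64_t range_bytes, uint64_t page_size,
				uint64_t ptes_per_table)
{
	uint64_t ptes = pte_count(range_bytes, page_size);
	uint64_t tables = (ptes + ptes_per_table - 1) / ptes_per_table;

	return tables * page_size;
}
```

Under those assumptions a 10 GiB range needs 2,621,440 entries in 5,120
tables, about 20 MiB of page tables installed up front.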

Link: https://lkml.kernel.org/r/20230821141518.870589-1-usama.anjum@collabora.com
Link: https://lkml.kernel.org/r/20230821141518.870589-2-usama.anjum@collabora.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Co-developed-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Cc: Alex Sierra <alex.sierra@amd.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Miroslaw <emmir@google.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Nadav Amit <namit@vmware.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Paul Gofman <pgofman@codeweavers.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yun Zhou <yun.zhou@windriver.com>
Cc: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

authored by

Peter Xu and committed by
Andrew Morton
d61ea1cb 7bd5bc3c

+129 -22
+35
Documentation/admin-guide/mm/userfaultfd.rst
···
 support a mode flag (``UFFDIO_COPY_MODE_WP`` or ``UFFDIO_CONTINUE_MODE_WP``
 respectively) to configure the mapping this way.
 
+If the userfaultfd context has ``UFFD_FEATURE_WP_ASYNC`` feature bit set,
+any vma registered with write-protection will work in async mode rather
+than the default sync mode.
+
+In async mode, there will be no message generated when a write operation
+happens, meanwhile the write-protection will be resolved automatically by
+the kernel. It can be seen as a more accurate version of soft-dirty
+tracking and it can be different in a few ways:
+
+  - The dirty result will not be affected by vma changes (e.g. vma
+    merging) because the dirty is only tracked by the pte.
+
+  - It supports range operations by default, so one can enable tracking on
+    any range of memory as long as page aligned.
+
+  - Dirty information will not get lost if the pte was zapped due to
+    various reasons (e.g. during split of a shmem transparent huge page).
+
+  - Due to a reverted meaning of soft-dirty (page clean when uffd-wp bit
+    set; dirty when uffd-wp bit cleared), it has different semantics on
+    some of the memory operations. For example: ``MADV_DONTNEED`` on
+    anonymous (or ``MADV_REMOVE`` on a file mapping) will be treated as
+    dirtying of memory by dropping uffd-wp bit during the procedure.
+
+The user app can collect the "written/dirty" status by looking up the
+uffd-wp bit for the pages being interested in /proc/pagemap.
+
+The page will not be under track of uffd-wp async mode until the page is
+explicitly write-protected by ``ioctl(UFFDIO_WRITEPROTECT)`` with the mode
+flag ``UFFDIO_WRITEPROTECT_MODE_WP`` set. Trying to resolve a page fault
+that was tracked by async mode userfaultfd-wp is invalid.
+
+When userfaultfd-wp async mode is used alone, it can be applied to all
+kinds of memory.
+
 Memory Poisioning Emulation
 ---------------------------
 
+22 -4
fs/userfaultfd.c
···
 	return ctx->features & UFFD_FEATURE_INITIALIZED;
 }
 
+static bool userfaultfd_wp_async_ctx(struct userfaultfd_ctx *ctx)
+{
+	return ctx && (ctx->features & UFFD_FEATURE_WP_ASYNC);
+}
+
 /*
  * Whether WP_UNPOPULATED is enabled on the uffd context. It is only
  * meaningful when userfaultfd_wp()==true on the vma and when it's
···
 	bool basic_ioctls;
 	unsigned long start, end, vma_end;
 	struct vma_iterator vmi;
+	bool wp_async = userfaultfd_wp_async_ctx(ctx);
 	pgoff_t pgoff;
 
 	user_uffdio_register = (struct uffdio_register __user *) arg;
···
 
 		/* check not compatible vmas */
 		ret = -EINVAL;
-		if (!vma_can_userfault(cur, vm_flags))
+		if (!vma_can_userfault(cur, vm_flags, wp_async))
 			goto out_unlock;
 
 		/*
···
 	for_each_vma_range(vmi, vma, end) {
 		cond_resched();
 
-		BUG_ON(!vma_can_userfault(vma, vm_flags));
+		BUG_ON(!vma_can_userfault(vma, vm_flags, wp_async));
 		BUG_ON(vma->vm_userfaultfd_ctx.ctx &&
 		       vma->vm_userfaultfd_ctx.ctx != ctx);
 		WARN_ON(!(vma->vm_flags & VM_MAYWRITE));
···
 	unsigned long start, end, vma_end;
 	const void __user *buf = (void __user *)arg;
 	struct vma_iterator vmi;
+	bool wp_async = userfaultfd_wp_async_ctx(ctx);
 	pgoff_t pgoff;
 
 	ret = -EFAULT;
···
 		 * provides for more strict behavior to notice
 		 * unregistration errors.
 		 */
-		if (!vma_can_userfault(cur, cur->vm_flags))
+		if (!vma_can_userfault(cur, cur->vm_flags, wp_async))
 			goto out_unlock;
 
 		found = true;
···
 	for_each_vma_range(vmi, vma, end) {
 		cond_resched();
 
-		BUG_ON(!vma_can_userfault(vma, vma->vm_flags));
+		BUG_ON(!vma_can_userfault(vma, vma->vm_flags, wp_async));
 
 		/*
 		 * Nothing to do: this vma is already registered into this
···
 	return ret;
 }
 
+bool userfaultfd_wp_async(struct vm_area_struct *vma)
+{
+	return userfaultfd_wp_async_ctx(vma->vm_userfaultfd_ctx.ctx);
+}
+
 static inline unsigned int uffd_ctx_features(__u64 user_features)
 {
 	/*
···
 	ret = -EPERM;
 	if ((features & UFFD_FEATURE_EVENT_FORK) && !capable(CAP_SYS_PTRACE))
 		goto err_out;
+
+	/* WP_ASYNC relies on WP_UNPOPULATED, choose it unconditionally */
+	if (features & UFFD_FEATURE_WP_ASYNC)
+		features |= UFFD_FEATURE_WP_UNPOPULATED;
+
 	/* report all available features and ioctls to userland */
 	uffdio_api.features = UFFD_API_FEATURES;
 #ifndef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
···
 #ifndef CONFIG_PTE_MARKER_UFFD_WP
 	uffdio_api.features &= ~UFFD_FEATURE_WP_HUGETLBFS_SHMEM;
 	uffdio_api.features &= ~UFFD_FEATURE_WP_UNPOPULATED;
+	uffdio_api.features &= ~UFFD_FEATURE_WP_ASYNC;
 #endif
 	uffdio_api.ioctls = UFFD_API_IOCTLS;
 	ret = -EFAULT;
+20 -1
include/linux/userfaultfd_k.h
···
 }
 
 static inline bool vma_can_userfault(struct vm_area_struct *vma,
-				     unsigned long vm_flags)
+				     unsigned long vm_flags,
+				     bool wp_async)
 {
+	vm_flags &= __VM_UFFD_FLAGS;
+
 	if ((vm_flags & VM_UFFD_MINOR) &&
 	    (!is_vm_hugetlb_page(vma) && !vma_is_shmem(vma)))
 		return false;
+
+	/*
+	 * If wp async enabled, and WP is the only mode enabled, allow any
+	 * memory type.
+	 */
+	if (wp_async && (vm_flags == VM_UFFD_WP))
+		return true;
+
 #ifndef CONFIG_PTE_MARKER_UFFD_WP
 	/*
 	 * If user requested uffd-wp but not enabled pte markers for
···
 	if ((vm_flags & VM_UFFD_WP) && !vma_is_anonymous(vma))
 		return false;
 #endif
+
+	/* By default, allow any of anon|shmem|hugetlb */
 	return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
 	    vma_is_shmem(vma);
 }
···
 extern void userfaultfd_unmap_complete(struct mm_struct *mm,
 				       struct list_head *uf);
 extern bool userfaultfd_wp_unpopulated(struct vm_area_struct *vma);
+extern bool userfaultfd_wp_async(struct vm_area_struct *vma);
 
 #else /* CONFIG_USERFAULTFD */
 
···
 }
 
 static inline bool userfaultfd_wp_unpopulated(struct vm_area_struct *vma)
+{
+	return false;
+}
+
+static inline bool userfaultfd_wp_async(struct vm_area_struct *vma)
 {
 	return false;
 }
+8 -1
include/uapi/linux/userfaultfd.h
···
 	   UFFD_FEATURE_EXACT_ADDRESS | \
 	   UFFD_FEATURE_WP_HUGETLBFS_SHMEM | \
 	   UFFD_FEATURE_WP_UNPOPULATED | \
-	   UFFD_FEATURE_POISON)
+	   UFFD_FEATURE_POISON | \
+	   UFFD_FEATURE_WP_ASYNC)
 #define UFFD_API_IOCTLS \
 	((__u64)1 << _UFFDIO_REGISTER | \
 	 (__u64)1 << _UFFDIO_UNREGISTER | \
···
 	 * (i.e. empty ptes). This will be the default behavior for shmem
 	 * & hugetlbfs, so this flag only affects anonymous memory behavior
 	 * when userfault write-protection mode is registered.
+	 *
+	 * UFFD_FEATURE_WP_ASYNC indicates that userfaultfd write-protection
+	 * asynchronous mode is supported in which the write fault is
+	 * automatically resolved and write-protection is un-set.
+	 * It implies UFFD_FEATURE_WP_UNPOPULATED.
 	 */
 #define UFFD_FEATURE_PAGEFAULT_FLAG_WP (1<<0)
 #define UFFD_FEATURE_EVENT_FORK (1<<1)
···
 #define UFFD_FEATURE_WP_HUGETLBFS_SHMEM (1<<12)
 #define UFFD_FEATURE_WP_UNPOPULATED (1<<13)
 #define UFFD_FEATURE_POISON (1<<14)
+#define UFFD_FEATURE_WP_ASYNC (1<<15)
 	__u64 features;
 
 	__u64 ioctls;
+19 -13
mm/hugetlb.c
···
 	/* Handle userfault-wp first, before trying to lock more pages */
 	if (userfaultfd_wp(vma) && huge_pte_uffd_wp(huge_ptep_get(ptep)) &&
 	    (flags & FAULT_FLAG_WRITE) && !huge_pte_write(entry)) {
-		struct vm_fault vmf = {
-			.vma = vma,
-			.address = haddr,
-			.real_address = address,
-			.flags = flags,
-		};
+		if (!userfaultfd_wp_async(vma)) {
+			struct vm_fault vmf = {
+				.vma = vma,
+				.address = haddr,
+				.real_address = address,
+				.flags = flags,
+			};
 
-		spin_unlock(ptl);
-		if (pagecache_folio) {
-			folio_unlock(pagecache_folio);
-			folio_put(pagecache_folio);
+			spin_unlock(ptl);
+			if (pagecache_folio) {
+				folio_unlock(pagecache_folio);
+				folio_put(pagecache_folio);
+			}
+			hugetlb_vma_unlock_read(vma);
+			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+			return handle_userfault(&vmf, VM_UFFD_WP);
 		}
-		hugetlb_vma_unlock_read(vma);
-		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
-		return handle_userfault(&vmf, VM_UFFD_WP);
+
+		entry = huge_pte_clear_uffd_wp(entry);
+		set_huge_pte_at(mm, haddr, ptep, entry);
+		/* Fallthrough to CoW */
 	}
 
 	/*
+25 -3
mm/memory.c
···
+
 // SPDX-License-Identifier: GPL-2.0-only
 /*
  * linux/mm/memory.c
···
 	const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE;
 	struct vm_area_struct *vma = vmf->vma;
 	struct folio *folio = NULL;
+	pte_t pte;
 
 	if (likely(!unshare)) {
 		if (userfaultfd_pte_wp(vma, ptep_get(vmf->pte))) {
-			pte_unmap_unlock(vmf->pte, vmf->ptl);
-			return handle_userfault(vmf, VM_UFFD_WP);
+			if (!userfaultfd_wp_async(vma)) {
+				pte_unmap_unlock(vmf->pte, vmf->ptl);
+				return handle_userfault(vmf, VM_UFFD_WP);
+			}
+
+			/*
+			 * Nothing needed (cache flush, TLB invalidations,
+			 * etc.) because we're only removing the uffd-wp bit,
+			 * which is completely invisible to the user.
+			 */
+			pte = pte_clear_uffd_wp(ptep_get(vmf->pte));
+
+			set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
+			/*
+			 * Update this to be prepared for following up CoW
+			 * handling
+			 */
+			vmf->orig_pte = pte;
 		}
 
 		/*
···
 
 	if (vma_is_anonymous(vma)) {
 		if (likely(!unshare) &&
-		    userfaultfd_huge_pmd_wp(vma, vmf->orig_pmd))
+		    userfaultfd_huge_pmd_wp(vma, vmf->orig_pmd)) {
+			if (userfaultfd_wp_async(vmf->vma))
+				goto split;
 			return handle_userfault(vmf, VM_UFFD_WP);
+		}
 		return do_huge_pmd_wp_page(vmf);
 	}
 
···
 		}
 	}
 
+split:
 	/* COW or write-notify handled on pte level: split pmd. */
 	__split_huge_pmd(vma, vmf->pmd, vmf->address, false, NULL);
 