Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge branch 'akpm' (patches from Andrew)

Merge more updates from Andrew Morton:
"190 patches.

Subsystems affected by this patch series: mm (hugetlb, userfaultfd,
vmscan, kconfig, proc, z3fold, zbud, ras, mempolicy, memblock,
migration, thp, nommu, kconfig, madvise, memory-hotplug, zswap,
zsmalloc, zram, cleanups, kfence, and hmm), procfs, sysctl, misc,
core-kernel, lib, lz4, checkpatch, init, kprobes, nilfs2, hfs,
signals, exec, kcov, selftests, compress/decompress, and ipc"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (190 commits)
ipc/util.c: use binary search for max_idx
ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock
ipc: use kmalloc for msg_queue and shmid_kernel
ipc sem: use kvmalloc for sem_undo allocation
lib/decompressors: remove set but not used variabled 'level'
selftests/vm/pkeys: exercise x86 XSAVE init state
selftests/vm/pkeys: refill shadow register after implicit kernel write
selftests/vm/pkeys: handle negative sys_pkey_alloc() return code
selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random
kcov: add __no_sanitize_coverage to fix noinstr for all architectures
exec: remove checks in __register_bimfmt()
x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned
hfsplus: report create_date to kstat.btime
hfsplus: remove unnecessary oom message
nilfs2: remove redundant continue statement in a while-loop
kprobes: remove duplicated strong free_insn_page in x86 and s390
init: print out unknown kernel parameters
checkpatch: do not complain about positive return values starting with EPOLL
checkpatch: improve the indented label test
checkpatch: scripts/spdxcheck.py now requires python3
...

+6305 -3275
+21
Documentation/admin-guide/kernel-parameters.txt
··· 1594 1594 Documentation/admin-guide/mm/hugetlbpage.rst. 1595 1595 Format: size[KMG] 1596 1596 1597 + hugetlb_free_vmemmap= 1598 + [KNL] Requires CONFIG_HUGETLB_PAGE_FREE_VMEMMAP 1599 + enabled. 1600 + Allows heavy hugetlb users to free up some more 1601 + memory (6 * PAGE_SIZE for each 2MB hugetlb page). 1602 + Format: { on | off (default) } 1603 + 1604 + on: enable the feature 1605 + off: disable the feature 1606 + 1607 + Built with CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON=y, 1608 + the default is on. 1609 + 1610 + This is not compatible with memory_hotplug.memmap_on_memory. 1611 + If both parameters are enabled, hugetlb_free_vmemmap takes 1612 + precedence over memory_hotplug.memmap_on_memory. 1613 + 1597 1614 hung_task_panic= 1598 1615 [KNL] Should the hung task detector generate panics. 1599 1616 Format: 0 | 1 ··· 2876 2859 /sys/module/memory_hotplug/parameters/memmap_on_memory. 2877 2860 Note that even when enabled, there are a few cases where 2878 2861 the feature is not effective. 2862 + 2863 + This is not compatible with hugetlb_free_vmemmap. If 2864 + both parameters are enabled, hugetlb_free_vmemmap takes 2865 + precedence over memory_hotplug.memmap_on_memory. 2879 2866 2880 2867 memtest= [KNL,X86,ARM,PPC,RISCV] Enable memtest 2881 2868 Format: <integer>
+11
Documentation/admin-guide/mm/hugetlbpage.rst
··· 60 60 the pool above the value in ``/proc/sys/vm/nr_hugepages``. The 61 61 maximum number of surplus huge pages is controlled by 62 62 ``/proc/sys/vm/nr_overcommit_hugepages``. 63 + Note: When the feature of freeing unused vmemmap pages associated 64 + with each hugetlb page is enabled, the number of surplus huge pages 65 + may be temporarily larger than the maximum number of surplus huge 66 + pages when the system is under memory pressure. 63 67 Hugepagesize 64 68 is the default hugepage size (in Kb). 65 69 Hugetlb ··· 83 79 returned to the huge page pool when freed by a task. A user with root 84 80 privileges can dynamically allocate more or free some persistent huge pages 85 81 by increasing or decreasing the value of ``nr_hugepages``. 82 + 83 + Note: When the feature of freeing unused vmemmap pages associated with each 84 + hugetlb page is enabled, we can fail to free the huge pages requested by 85 + the user when the system is under memory pressure. Please try again later. 86 86 87 87 Pages that are used as huge pages are reserved inside the kernel and cannot 88 88 be used for other purposes. Huge pages cannot be swapped out under ··· 153 145 154 146 will all result in 256 2M huge pages being allocated. Valid default 155 147 huge page size is architecture dependent. 148 + hugetlb_free_vmemmap 149 + When CONFIG_HUGETLB_PAGE_FREE_VMEMMAP is set, this enables freeing 150 + unused vmemmap pages associated with each HugeTLB page. 156 151 157 152 When multiple huge page sizes are supported, ``/proc/sys/vm/nr_hugepages`` 158 153 indicates the current number of pre-allocated huge pages of the default size.
+13
Documentation/admin-guide/mm/memory-hotplug.rst
··· 357 357 Unfortunately, there is no information to show which memory block belongs 358 358 to ZONE_MOVABLE. This is TBD. 359 359 360 + Memory offlining can fail when dissolving a free huge page on ZONE_MOVABLE 361 + and the feature of freeing unused vmemmap pages associated with each hugetlb 362 + page is enabled. 363 + 364 + This can happen when we have plenty of ZONE_MOVABLE memory, but not enough 365 + kernel memory to allocate vmemmap pages. We may even be able to migrate 366 + huge page contents, but will not be able to dissolve the source huge page. 367 + This will prevent an offline operation and is unfortunate as memory offlining 368 + is expected to succeed on movable zones. Users that depend on memory hotplug 369 + to succeed for movable zones should carefully consider whether the memory 370 + savings gained from this feature are worth the risk of possibly not being 371 + able to offline memory in certain situations. 372 + 360 373 .. note:: 361 374 Techniques that rely on long-term pinnings of memory (especially, RDMA and 362 375 vfio) are fundamentally problematic with ZONE_MOVABLE and, therefore, memory
+2
Documentation/admin-guide/mm/pagemap.rst
··· 21 21 * Bit 55 pte is soft-dirty (see 22 22 :ref:`Documentation/admin-guide/mm/soft-dirty.rst <soft_dirty>`) 23 23 * Bit 56 page exclusively mapped (since 4.2) 24 + * Bit 57 pte is uffd-wp write-protected (since 5.13) (see 25 + :ref:`Documentation/admin-guide/mm/userfaultfd.rst <userfaultfd>`) 24 26 * Bits 57-60 zero 25 27 * Bit 61 page is file-page or shared-anon (since 3.5) 26 28 * Bit 62 page swapped
+2 -1
Documentation/admin-guide/mm/userfaultfd.rst
··· 77 77 78 78 - ``UFFD_FEATURE_MINOR_HUGETLBFS`` indicates that the kernel supports 79 79 ``UFFDIO_REGISTER_MODE_MINOR`` registration for hugetlbfs virtual memory 80 - areas. 80 + areas. ``UFFD_FEATURE_MINOR_SHMEM`` is the analogous feature indicating 81 + support for shmem virtual memory areas. 81 82 82 83 The userland application should set the feature flags it intends to use 83 84 when invoking the ``UFFDIO_API`` ioctl, to request that those features be
+2 -5
Documentation/core-api/kernel-api.rst
··· 24 24 .. kernel-doc:: lib/vsprintf.c 25 25 :export: 26 26 27 - .. kernel-doc:: include/linux/kernel.h 28 - :functions: kstrtol 29 - 30 - .. kernel-doc:: include/linux/kernel.h 31 - :functions: kstrtoul 27 + .. kernel-doc:: include/linux/kstrtox.h 28 + :functions: kstrtol kstrtoul 32 29 33 30 .. kernel-doc:: lib/kstrtox.c 34 31 :export:
+40 -8
Documentation/filesystems/proc.rst
··· 933 933 ~~~~~~~ 934 934 935 935 Provides information about distribution and utilization of memory. This 936 - varies by architecture and compile options. The following is from a 937 - 16GB PIII, which has highmem enabled. You may not have all of these fields. 936 + varies by architecture and compile options. Some of the counters reported 937 + here overlap. The memory reported by the non overlapping counters may not 938 + add up to the overall memory usage and the difference for some workloads 939 + can be substantial. In many cases there are other means to find out 940 + additional memory using subsystem specific interfaces, for instance 941 + /proc/net/sockstat for TCP memory allocations. 942 + 943 + The following is from a 16GB PIII, which has highmem enabled. 944 + You may not have all of these fields. 938 945 939 946 :: 940 947 ··· 1920 1913 3.8 /proc/<pid>/fdinfo/<fd> - Information about opened file 1921 1914 --------------------------------------------------------------- 1922 1915 This file provides information associated with an opened file. The regular 1923 - files have at least three fields -- 'pos', 'flags' and 'mnt_id'. The 'pos' 1924 - represents the current offset of the opened file in decimal form [see lseek(2) 1925 - for details], 'flags' denotes the octal O_xxx mask the file has been 1926 - created with [see open(2) for details] and 'mnt_id' represents mount ID of 1927 - the file system containing the opened file [see 3.5 /proc/<pid>/mountinfo 1928 - for details]. 1916 + files have at least four fields -- 'pos', 'flags', 'mnt_id' and 'ino'. 1917 + The 'pos' represents the current offset of the opened file in decimal 1918 + form [see lseek(2) for details], 'flags' denotes the octal O_xxx mask the 1919 + file has been created with [see open(2) for details] and 'mnt_id' represents 1920 + mount ID of the file system containing the opened file [see 3.5 1921 + /proc/<pid>/mountinfo for details]. 'ino' represents the inode number of 1922 + the file. 
1929 1923 1930 1924 A typical output is:: 1931 1925 1932 1926 pos: 0 1933 1927 flags: 0100002 1934 1928 mnt_id: 19 1929 + ino: 63107 1935 1930 1936 1931 All locks associated with a file descriptor are shown in its fdinfo too:: 1937 1932 ··· 1950 1941 pos: 0 1951 1942 flags: 04002 1952 1943 mnt_id: 9 1944 + ino: 63107 1953 1945 eventfd-count: 5a 1954 1946 1955 1947 where 'eventfd-count' is hex value of a counter. ··· 1963 1953 pos: 0 1964 1954 flags: 04002 1965 1955 mnt_id: 9 1956 + ino: 63107 1966 1957 sigmask: 0000000000000200 1967 1958 1968 1959 where 'sigmask' is hex value of the signal mask associated ··· 1977 1966 pos: 0 1978 1967 flags: 02 1979 1968 mnt_id: 9 1969 + ino: 63107 1980 1970 tfd: 5 events: 1d data: ffffffffffffffff pos:0 ino:61af sdev:7 1981 1971 1982 1972 where 'tfd' is a target file descriptor number in decimal form, ··· 1994 1982 1995 1983 pos: 0 1996 1984 flags: 02000000 1985 + mnt_id: 9 1986 + ino: 63107 1997 1987 inotify wd:3 ino:9e7e sdev:800013 mask:800afce ignored_mask:0 fhandle-bytes:8 fhandle-type:1 f_handle:7e9e0000640d1b6d 1998 1988 1999 1989 where 'wd' is a watch descriptor in decimal form, i.e. a target file ··· 2018 2004 pos: 0 2019 2005 flags: 02 2020 2006 mnt_id: 9 2007 + ino: 63107 2021 2008 fanotify flags:10 event-flags:0 2022 2009 fanotify mnt_id:12 mflags:40 mask:38 ignored_mask:40000003 2023 2010 fanotify ino:4f969 sdev:800013 mflags:0 mask:3b ignored_mask:40000000 fhandle-bytes:8 fhandle-type:1 f_handle:69f90400c275b5b4 ··· 2043 2028 pos: 0 2044 2029 flags: 02 2045 2030 mnt_id: 9 2031 + ino: 63107 2046 2032 clockid: 0 2047 2033 ticks: 0 2048 2034 settime flags: 01 ··· 2057 2041 'it_interval' is the interval for the timer. Note the timer might be set up 2058 2042 with TIMER_ABSTIME option which will be shown in 'settime flags', but 'it_value' 2059 2043 still exhibits timer's remaining time. 
2044 + 2045 + DMA Buffer files 2046 + ~~~~~~~~~~~~~~~~ 2047 + 2048 + :: 2049 + 2050 + pos: 0 2051 + flags: 04002 2052 + mnt_id: 9 2053 + ino: 63107 2054 + size: 32768 2055 + count: 2 2056 + exp_name: system-heap 2057 + 2058 + where 'size' is the size of the DMA buffer in bytes. 'count' is the file count of 2059 + the DMA buffer file. 'exp_name' is the name of the DMA buffer exporter. 2060 2060 2061 2061 3.9 /proc/<pid>/map_files - Information about memory mapped files 2062 2062 ---------------------------------------------------------------------
+18 -1
Documentation/vm/hmm.rst
··· 332 332 walks to fill in the ``args->src`` array with PFNs to be migrated. 333 333 The ``invalidate_range_start()`` callback is passed a 334 334 ``struct mmu_notifier_range`` with the ``event`` field set to 335 - ``MMU_NOTIFY_MIGRATE`` and the ``migrate_pgmap_owner`` field set to 335 + ``MMU_NOTIFY_MIGRATE`` and the ``owner`` field set to 336 336 the ``args->pgmap_owner`` field passed to migrate_vma_setup(). This 337 337 allows the device driver to skip the invalidation callback and only 338 338 invalidate device private MMU mappings that are actually migrating. ··· 404 404 7. ``mmap_read_unlock()`` 405 405 406 406 The lock can now be released. 407 + 408 + Exclusive access memory 409 + ======================= 410 + 411 + Some devices have features such as atomic PTE bits that can be used to implement 412 + atomic access to system memory. To support atomic operations to a shared virtual 413 + memory page such a device needs access to that page which is exclusive of any 414 + userspace access from the CPU. The ``make_device_exclusive_range()`` function 415 + can be used to make a memory range inaccessible from userspace. 416 + 417 + This replaces all mappings for pages in the given range with special swap 418 + entries. Any attempt to access the swap entry results in a fault which is 419 + resolved by replacing the entry with the original mapping. A driver gets 420 + notified that the mapping has been changed by MMU notifiers, after which point 421 + it will no longer have exclusive access to the page. Exclusive access is 422 + guaranteed to last until the driver drops the page lock and page reference, at 423 + which point any CPU faults on the page may proceed as described. 407 424 408 425 Memory cgroup (memcg) and rss accounting 409 426 ========================================
+13 -20
Documentation/vm/unevictable-lru.rst
··· 389 389 mlocked pages. Note, however, that at this point we haven't checked whether 390 390 the page is mapped by other VM_LOCKED VMAs. 391 391 392 - We can't call try_to_munlock(), the function that walks the reverse map to 392 + We can't call page_mlock(), the function that walks the reverse map to 393 393 check for other VM_LOCKED VMAs, without first isolating the page from the LRU. 394 - try_to_munlock() is a variant of try_to_unmap() and thus requires that the page 394 + page_mlock() is a variant of try_to_unmap() and thus requires that the page 395 395 not be on an LRU list [more on these below]. However, the call to 396 - isolate_lru_page() could fail, in which case we couldn't try_to_munlock(). So, 396 + isolate_lru_page() could fail, in which case we can't call page_mlock(). So, 397 397 we go ahead and clear PG_mlocked up front, as this might be the only chance we 398 - have. If we can successfully isolate the page, we go ahead and 399 - try_to_munlock(), which will restore the PG_mlocked flag and update the zone 398 + have. If we can successfully isolate the page, we go ahead and call 399 + page_mlock(), which will restore the PG_mlocked flag and update the zone 400 400 page statistics if it finds another VMA holding the page mlocked. If we fail 401 401 to isolate the page, we'll have left a potentially mlocked page on the LRU. 402 402 This is fine, because we'll catch it later if and when vmscan tries to reclaim ··· 545 545 holepunching, and truncation of file pages and their anonymous COWed pages. 546 546 547 547 548 - try_to_munlock() Reverse Map Scan 548 + page_mlock() Reverse Map Scan 549 549 --------------------------------- 550 - 551 - .. warning:: 552 - [!] TODO/FIXME: a better name might be page_mlocked() - analogous to the 553 - page_referenced() reverse map walker.
554 550 555 551 When munlock_vma_page() [see section :ref:`munlock()/munlockall() System Call 556 552 Handling <munlock_munlockall_handling>` above] tries to munlock a 557 553 page, it needs to determine whether or not the page is mapped by any 558 554 VM_LOCKED VMA without actually attempting to unmap all PTEs from the 559 555 page. For this purpose, the unevictable/mlock infrastructure 560 - introduced a variant of try_to_unmap() called try_to_munlock(). 556 + introduced a variant of try_to_unmap() called page_mlock(). 561 557 562 - try_to_munlock() calls the same functions as try_to_unmap() for anonymous and 563 - mapped file and KSM pages with a flag argument specifying unlock versus unmap 564 - processing. Again, these functions walk the respective reverse maps looking 565 - for VM_LOCKED VMAs. When such a VMA is found, as in the try_to_unmap() case, 566 - the functions mlock the page via mlock_vma_page() and return SWAP_MLOCK. This 567 - undoes the pre-clearing of the page's PG_mlocked done by munlock_vma_page. 558 + page_mlock() walks the respective reverse maps looking for VM_LOCKED VMAs. When 559 + such a VMA is found the page is mlocked via mlock_vma_page(). This undoes the 560 + pre-clearing of the page's PG_mlocked done by munlock_vma_page. 568 561 569 - Note that try_to_munlock()'s reverse map walk must visit every VMA in a page's 562 + Note that page_mlock()'s reverse map walk must visit every VMA in a page's 570 563 reverse map to determine that a page is NOT mapped into any VM_LOCKED VMA. 571 564 However, the scan can terminate when it encounters a VM_LOCKED VMA. 572 - Although try_to_munlock() might be called a great many times when munlocking a 565 + Although page_mlock() might be called a great many times when munlocking a 573 566 large region or tearing down a large address space that has been mlocked via 574 567 mlockall(), overall this is a fairly rare event. 
575 568 ··· 595 602 shrink_inactive_list() should only see SHM_LOCK'd pages that became SHM_LOCK'd 596 603 after shrink_active_list() had moved them to the inactive list, or pages mapped 597 604 into VM_LOCKED VMAs that munlock_vma_page() couldn't isolate from the LRU to 598 - recheck via try_to_munlock(). shrink_inactive_list() won't notice the latter, 605 + recheck via page_mlock(). shrink_inactive_list() won't notice the latter, 599 606 but will pass on to shrink_page_list(). 600 607 601 608 shrink_page_list() again culls obviously unevictable pages that it could
+9 -1
MAINTAINERS
··· 7704 7704 S: Maintained 7705 7705 F: drivers/input/touchscreen/resistive-adc-touch.c 7706 7706 7707 + GENERIC STRING LIBRARY 7708 + R: Andy Shevchenko <andy@kernel.org> 7709 + S: Maintained 7710 + F: lib/string.c 7711 + F: lib/string_helpers.c 7712 + F: lib/test_string.c 7713 + F: lib/test-string_helpers.c 7714 + 7707 7715 GENERIC UIO DRIVER FOR PCI DEVICES 7708 7716 M: "Michael S. Tsirkin" <mst@redhat.com> 7709 7717 L: kvm@vger.kernel.org ··· 11908 11900 F: include/linux/pagewalk.h 11909 11901 F: include/linux/vmalloc.h 11910 11902 F: mm/ 11903 + F: tools/testing/selftests/vm/ 11911 11904 11912 11905 MEMORY TECHNOLOGY DEVICES (MTD) 11913 11906 M: Miquel Raynal <miquel.raynal@bootlin.com> ··· 20316 20307 M: Dan Streetman <ddstreet@ieee.org> 20317 20308 L: linux-mm@kvack.org 20318 20309 S: Maintained 20319 - F: include/linux/zbud.h 20320 20310 F: mm/zbud.c 20321 20311 20322 20312 ZD1211RW WIRELESS DRIVER
+1 -4
arch/alpha/Kconfig
··· 40 40 select MMU_GATHER_NO_RANGE 41 41 select SET_FS 42 42 select SPARSEMEM_EXTREME if SPARSEMEM 43 + select ZONE_DMA 43 44 help 44 45 The Alpha is a 64-bit general-purpose processor designed and 45 46 marketed by the Digital Equipment Corporation of blessed memory, ··· 63 62 default n 64 63 65 64 config GENERIC_CALIBRATE_DELAY 66 - bool 67 - default y 68 - 69 - config ZONE_DMA 70 65 bool 71 66 default y 72 67
-1
arch/alpha/include/asm/pgalloc.h
··· 18 18 { 19 19 pmd_set(pmd, (pte_t *)(page_to_pa(pte) + PAGE_OFFSET)); 20 20 } 21 - #define pmd_pgtable(pmd) pmd_page(pmd) 22 21 23 22 static inline void 24 23 pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmd, pte_t *pte)
-1
arch/alpha/include/asm/pgtable.h
··· 46 46 #define PTRS_PER_PMD (1UL << (PAGE_SHIFT-3)) 47 47 #define PTRS_PER_PGD (1UL << (PAGE_SHIFT-3)) 48 48 #define USER_PTRS_PER_PGD (TASK_SIZE / PGDIR_SIZE) 49 - #define FIRST_USER_ADDRESS 0UL 50 49 51 50 /* Number of pointers that fit on a page: this will go away. */ 52 51 #define PTRS_PER_PAGE (1UL << (PAGE_SHIFT-3))
+3
arch/alpha/include/uapi/asm/mman.h
··· 71 71 #define MADV_COLD 20 /* deactivate these pages */ 72 72 #define MADV_PAGEOUT 21 /* reclaim these pages */ 73 73 74 + #define MADV_POPULATE_READ 22 /* populate (prefault) page tables readable */ 75 + #define MADV_POPULATE_WRITE 23 /* populate (prefault) page tables writable */ 76 + 74 77 /* compatibility flags */ 75 78 #define MAP_FILE 0 76 79
+1 -1
arch/alpha/kernel/setup.c
··· 28 28 #include <linux/init.h> 29 29 #include <linux/string.h> 30 30 #include <linux/ioport.h> 31 + #include <linux/panic_notifier.h> 31 32 #include <linux/platform_device.h> 32 33 #include <linux/memblock.h> 33 34 #include <linux/pci.h> ··· 47 46 #include <linux/log2.h> 48 47 #include <linux/export.h> 49 48 50 - extern struct atomic_notifier_head panic_notifier_list; 51 49 static int alpha_panic_event(struct notifier_block *, unsigned long, void *); 52 50 static struct notifier_block alpha_panic_block = { 53 51 alpha_panic_event,
-2
arch/arc/include/asm/pgalloc.h
··· 129 129 130 130 #define __pte_free_tlb(tlb, pte, addr) pte_free((tlb)->mm, pte) 131 131 132 - #define pmd_pgtable(pmd) ((pgtable_t) pmd_page_vaddr(pmd)) 133 - 134 132 #endif /* _ASM_ARC_PGALLOC_H */
+2 -6
arch/arc/include/asm/pgtable.h
··· 222 222 */ 223 223 #define USER_PTRS_PER_PGD (TASK_SIZE / PGDIR_SIZE) 224 224 225 - /* 226 - * No special requirements for lowest virtual address we permit any user space 227 - * mapping to be mapped at. 228 - */ 229 - #define FIRST_USER_ADDRESS 0UL 230 - 231 225 232 226 /**************************************************************** 233 227 * Bucket load of VM Helpers ··· 349 355 #define __swp_entry_to_pte(x) ((pte_t) { (x).val }) 350 356 351 357 #define kern_addr_valid(addr) (1) 358 + 359 + #define pmd_pgtable(pmd) ((pgtable_t) pmd_page_vaddr(pmd)) 352 360 353 361 /* 354 362 * remap a physical page `pfn' of size `size' with page protection `prot'
-3
arch/arm/Kconfig
··· 218 218 config ARCH_MAY_HAVE_PC_FDC 219 219 bool 220 220 221 - config ZONE_DMA 222 - bool 223 - 224 221 config ARCH_SUPPORTS_UPROBES 225 222 def_bool y 226 223
-1
arch/arm/include/asm/pgalloc.h
··· 143 143 144 144 __pmd_populate(pmdp, page_to_phys(ptep), prot); 145 145 } 146 - #define pmd_pgtable(pmd) pmd_page(pmd) 147 146 148 147 #endif /* CONFIG_MMU */ 149 148
+1 -12
arch/arm64/Kconfig
··· 42 42 select ARCH_HAS_SYSCALL_WRAPPER 43 43 select ARCH_HAS_TEARDOWN_DMA_OPS if IOMMU_SUPPORT 44 44 select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST 45 + select ARCH_HAS_ZONE_DMA_SET if EXPERT 45 46 select ARCH_HAVE_ELF_PROT 46 47 select ARCH_HAVE_NMI_SAFE_CMPXCHG 47 48 select ARCH_INLINE_READ_LOCK if !PREEMPTION ··· 156 155 select HAVE_ARCH_KGDB 157 156 select HAVE_ARCH_MMAP_RND_BITS 158 157 select HAVE_ARCH_MMAP_RND_COMPAT_BITS if COMPAT 159 - select HAVE_ARCH_PFN_VALID 160 158 select HAVE_ARCH_PREL32_RELOCATIONS 161 159 select HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET 162 160 select HAVE_ARCH_SECCOMP_FILTER ··· 307 307 308 308 config GENERIC_CALIBRATE_DELAY 309 309 def_bool y 310 - 311 - config ZONE_DMA 312 - bool "Support DMA zone" if EXPERT 313 - default y 314 - 315 - config ZONE_DMA32 316 - bool "Support DMA32 zone" if EXPERT 317 - default y 318 310 319 311 config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE 320 312 def_bool y ··· 1044 1052 config NEED_PER_CPU_EMBED_FIRST_CHUNK 1045 1053 def_bool y 1046 1054 depends on NUMA 1047 - 1048 - config HOLES_IN_ZONE 1049 - def_bool y 1050 1055 1051 1056 source "kernel/Kconfig.hz" 1052 1057
+1 -2
arch/arm64/include/asm/hugetlb.h
··· 23 23 } 24 24 #define arch_clear_hugepage_flags arch_clear_hugepage_flags 25 25 26 - extern pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma, 27 - struct page *page, int writable); 26 + pte_t arch_make_huge_pte(pte_t entry, unsigned int shift, vm_flags_t flags); 28 27 #define arch_make_huge_pte arch_make_huge_pte 29 28 #define __HAVE_ARCH_HUGE_SET_HUGE_PTE_AT 30 29 extern void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
+1 -1
arch/arm64/include/asm/memory.h
··· 351 351 352 352 #define virt_addr_valid(addr) ({ \ 353 353 __typeof__(addr) __addr = __tag_reset(addr); \ 354 - __is_lm_address(__addr) && pfn_valid(virt_to_pfn(__addr)); \ 354 + __is_lm_address(__addr) && pfn_is_map_memory(virt_to_pfn(__addr)); \ 355 355 }) 356 356 357 357 void dump_mem_limit(void);
+1 -1
arch/arm64/include/asm/page.h
··· 41 41 42 42 typedef struct page *pgtable_t; 43 43 44 - extern int pfn_valid(unsigned long); 44 + int pfn_is_map_memory(unsigned long pfn); 45 45 46 46 #include <asm/memory.h> 47 47
-1
arch/arm64/include/asm/pgalloc.h
··· 86 86 VM_BUG_ON(mm == &init_mm); 87 87 __pmd_populate(pmdp, page_to_phys(ptep), PMD_TYPE_TABLE | PMD_TABLE_PXN); 88 88 } 89 - #define pmd_pgtable(pmd) pmd_page(pmd) 90 89 91 90 #endif
-2
arch/arm64/include/asm/pgtable.h
··· 26 26 27 27 #define vmemmap ((struct page *)VMEMMAP_START - (memstart_addr >> PAGE_SHIFT)) 28 28 29 - #define FIRST_USER_ADDRESS 0UL 30 - 31 29 #ifndef __ASSEMBLY__ 32 30 33 31 #include <asm/cmpxchg.h>
+1
arch/arm64/kernel/setup.c
··· 23 23 #include <linux/interrupt.h> 24 24 #include <linux/smp.h> 25 25 #include <linux/fs.h> 26 + #include <linux/panic_notifier.h> 26 27 #include <linux/proc_fs.h> 27 28 #include <linux/memblock.h> 28 29 #include <linux/of_fdt.h>
+1 -1
arch/arm64/kvm/mmu.c
··· 85 85 86 86 static bool kvm_is_device_pfn(unsigned long pfn) 87 87 { 88 - return !pfn_valid(pfn); 88 + return !pfn_is_map_memory(pfn); 89 89 } 90 90 91 91 static void *stage2_memcache_zalloc_page(void *arg)
+2 -3
arch/arm64/mm/hugetlbpage.c
··· 339 339 return NULL; 340 340 } 341 341 342 - pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma, 343 - struct page *page, int writable) 342 + pte_t arch_make_huge_pte(pte_t entry, unsigned int shift, vm_flags_t flags) 344 343 { 345 - size_t pagesize = huge_page_size(hstate_vma(vma)); 344 + size_t pagesize = 1UL << shift; 346 345 347 346 if (pagesize == CONT_PTE_SIZE) { 348 347 entry = pte_mkcont(entry);
+3 -28
arch/arm64/mm/init.c
··· 219 219 free_area_init(max_zone_pfns); 220 220 } 221 221 222 - int pfn_valid(unsigned long pfn) 222 + int pfn_is_map_memory(unsigned long pfn) 223 223 { 224 224 phys_addr_t addr = PFN_PHYS(pfn); 225 - struct mem_section *ms; 226 225 227 - /* 228 - * Ensure the upper PAGE_SHIFT bits are clear in the 229 - * pfn. Else it might lead to false positives when 230 - * some of the upper bits are set, but the lower bits 231 - * match a valid pfn. 232 - */ 226 + /* avoid false positives for bogus PFNs, see comment in pfn_valid() */ 233 227 if (PHYS_PFN(addr) != pfn) 234 228 return 0; 235 229 236 - if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS) 237 - return 0; 238 - 239 - ms = __pfn_to_section(pfn); 240 - if (!valid_section(ms)) 241 - return 0; 242 - 243 - /* 244 - * ZONE_DEVICE memory does not have the memblock entries. 245 - * memblock_is_map_memory() check for ZONE_DEVICE based 246 - * addresses will always fail. Even the normal hotplugged 247 - * memory will never have MEMBLOCK_NOMAP flag set in their 248 - * memblock entries. Skip memblock search for all non early 249 - * memory sections covering all of hotplug memory including 250 - * both normal and ZONE_DEVICE based. 251 - */ 252 - if (!early_section(ms)) 253 - return pfn_section_valid(ms, pfn); 254 - 255 230 return memblock_is_map_memory(addr); 256 231 } 257 - EXPORT_SYMBOL(pfn_valid); 232 + EXPORT_SYMBOL(pfn_is_map_memory); 258 233 259 234 static phys_addr_t memory_limit = PHYS_ADDR_MAX; 260 235
+2 -2
arch/arm64/mm/ioremap.c
··· 43 43 /* 44 44 * Don't allow RAM to be mapped. 45 45 */ 46 - if (WARN_ON(pfn_valid(__phys_to_pfn(phys_addr)))) 46 + if (WARN_ON(pfn_is_map_memory(__phys_to_pfn(phys_addr)))) 47 47 return NULL; 48 48 49 49 area = get_vm_area_caller(size, VM_IOREMAP, caller); ··· 84 84 void __iomem *ioremap_cache(phys_addr_t phys_addr, size_t size) 85 85 { 86 86 /* For normal memory we already have a cacheable mapping. */ 87 - if (pfn_valid(__phys_to_pfn(phys_addr))) 87 + if (pfn_is_map_memory(__phys_to_pfn(phys_addr))) 88 88 return (void __iomem *)__phys_to_virt(phys_addr); 89 89 90 90 return __ioremap_caller(phys_addr, size, __pgprot(PROT_NORMAL),
+13 -9
arch/arm64/mm/mmu.c
··· 82 82 pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn, 83 83 unsigned long size, pgprot_t vma_prot) 84 84 { 85 - if (!pfn_valid(pfn)) 85 + if (!pfn_is_map_memory(pfn)) 86 86 return pgprot_noncached(vma_prot); 87 87 else if (file->f_flags & O_SYNC) 88 88 return pgprot_writecombine(vma_prot); ··· 1339 1339 return dt_virt; 1340 1340 } 1341 1341 1342 + #if CONFIG_PGTABLE_LEVELS > 3 1342 1343 int pud_set_huge(pud_t *pudp, phys_addr_t phys, pgprot_t prot) 1343 1344 { 1344 1345 pud_t new_pud = pfn_pud(__phys_to_pfn(phys), mk_pud_sect_prot(prot)); ··· 1354 1353 return 1; 1355 1354 } 1356 1355 1356 + int pud_clear_huge(pud_t *pudp) 1357 + { 1358 + if (!pud_sect(READ_ONCE(*pudp))) 1359 + return 0; 1360 + pud_clear(pudp); 1361 + return 1; 1362 + } 1363 + #endif 1364 + 1365 + #if CONFIG_PGTABLE_LEVELS > 2 1357 1366 int pmd_set_huge(pmd_t *pmdp, phys_addr_t phys, pgprot_t prot) 1358 1367 { 1359 1368 pmd_t new_pmd = pfn_pmd(__phys_to_pfn(phys), mk_pmd_sect_prot(prot)); ··· 1378 1367 return 1; 1379 1368 } 1380 1369 1381 - int pud_clear_huge(pud_t *pudp) 1382 - { 1383 - if (!pud_sect(READ_ONCE(*pudp))) 1384 - return 0; 1385 - pud_clear(pudp); 1386 - return 1; 1387 - } 1388 - 1389 1370 int pmd_clear_huge(pmd_t *pmdp) 1390 1371 { 1391 1372 if (!pmd_sect(READ_ONCE(*pmdp))) ··· 1385 1382 pmd_clear(pmdp); 1386 1383 return 1; 1387 1384 } 1385 + #endif 1388 1386 1389 1387 int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr) 1390 1388 {
-2
arch/csky/include/asm/pgalloc.h
··· 22 22 set_pmd(pmd, __pmd(__pa(page_address(pte)))); 23 23 } 24 24 25 - #define pmd_pgtable(pmd) pmd_page(pmd) 26 - 27 25 extern void pgd_init(unsigned long *p); 28 26 29 27 static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
-1
arch/csky/include/asm/pgtable.h
··· 14 14 #define PGDIR_MASK (~(PGDIR_SIZE-1)) 15 15 16 16 #define USER_PTRS_PER_PGD (PAGE_OFFSET/PGDIR_SIZE) 17 - #define FIRST_USER_ADDRESS 0UL 18 17 19 18 /* 20 19 * C-SKY is two-level paging structure:
-4
arch/hexagon/include/asm/pgtable.h
··· 155 155 156 156 extern pgd_t swapper_pg_dir[PTRS_PER_PGD]; /* located in head.S */ 157 157 158 - /* Seems to be zero even in architectures where the zero page is firewalled? */ 159 - #define FIRST_USER_ADDRESS 0UL 160 - 161 158 /* HUGETLB not working currently */ 162 159 #ifdef CONFIG_HUGETLB_PAGE 163 160 #define pte_mkhuge(pte) __pte((pte_val(pte) & ~0x3) | HVM_HUGEPAGE_SIZE) ··· 239 242 * pmd_page - converts a PMD entry to a page pointer 240 243 */ 241 244 #define pmd_page(pmd) (pfn_to_page(pmd_val(pmd) >> PAGE_SHIFT)) 242 - #define pmd_pgtable(pmd) pmd_page(pmd) 243 245 244 246 /** 245 247 * pte_none - check if pte is mapped
+1 -6
arch/ia64/Kconfig
··· 60 60 select NUMA if !FLATMEM 61 61 select PCI_MSI_ARCH_FALLBACKS if PCI_MSI 62 62 select SET_FS 63 + select ZONE_DMA32 63 64 default y 64 65 help 65 66 The Itanium Processor Family is Intel's 64-bit successor to ··· 72 71 bool 73 72 select ATA_NONSTANDARD if ATA 74 73 default y 75 - 76 - config ZONE_DMA32 77 - def_bool y 78 74 79 75 config MMU 80 76 bool ··· 305 307 This option specifies the maximum number of nodes in your SSI system. 306 308 MAX_NUMNODES will be 2^(This value). 307 309 If in doubt, use the default. 308 - 309 - config HOLES_IN_ZONE 310 - bool 311 310 312 311 config HAVE_ARCH_NODEDATA_EXTENSION 313 312 def_bool y
+1
arch/ia64/include/asm/pal.h
··· 99 99 100 100 #include <linux/types.h> 101 101 #include <asm/fpu.h> 102 + #include <asm/intrinsics.h> 102 103 103 104 /* 104 105 * Data types needed to pass information into PAL procedures and
-1
arch/ia64/include/asm/pgalloc.h
··· 52 52 { 53 53 pmd_val(*pmd_entry) = page_to_phys(pte); 54 54 } 55 - #define pmd_pgtable(pmd) pmd_page(pmd) 56 55 57 56 static inline void 58 57 pmd_populate_kernel(struct mm_struct *mm, pmd_t * pmd_entry, pte_t * pte)
-1
arch/ia64/include/asm/pgtable.h
··· 128 128 #define PTRS_PER_PGD_SHIFT PTRS_PER_PTD_SHIFT 129 129 #define PTRS_PER_PGD (1UL << PTRS_PER_PGD_SHIFT) 130 130 #define USER_PTRS_PER_PGD (5*PTRS_PER_PGD/8) /* regions 0-4 are user regions */ 131 - #define FIRST_USER_ADDRESS 0UL 132 131 133 132 /* 134 133 * All the normal masks have the "page accessed" bits on, as any time
+1 -4
arch/m68k/Kconfig
··· 34 34 select SET_FS 35 35 select UACCESS_MEMCPY if !MMU 36 36 select VIRT_TO_BUS 37 + select ZONE_DMA 37 38 38 39 config CPU_BIG_ENDIAN 39 40 def_bool y ··· 62 61 63 62 config NO_IOPORT_MAP 64 63 def_bool y 65 - 66 - config ZONE_DMA 67 - bool 68 - default y 69 64 70 65 config HZ 71 66 int
-2
arch/m68k/include/asm/mcf_pgalloc.h
··· 32 32 33 33 #define pmd_populate_kernel pmd_populate 34 34 35 - #define pmd_pgtable(pmd) pfn_to_virt(pmd_val(pmd) >> PAGE_SHIFT) 36 - 37 35 static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pgtable, 38 36 unsigned long address) 39 37 {
+2
arch/m68k/include/asm/mcf_pgtable.h
··· 150 150 151 151 #ifndef __ASSEMBLY__ 152 152 153 + #define pmd_pgtable(pmd) pfn_to_virt(pmd_val(pmd) >> PAGE_SHIFT) 154 + 153 155 /* 154 156 * Conversion functions: convert a page and protection to a page entry, 155 157 * and a page entry and page directory to the page they refer to.
-1
arch/m68k/include/asm/motorola_pgalloc.h
··· 88 88 { 89 89 pmd_set(pmd, page); 90 90 } 91 - #define pmd_pgtable(pmd) ((pgtable_t)pmd_page_vaddr(pmd)) 92 91 93 92 static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd) 94 93 {
+2
arch/m68k/include/asm/motorola_pgtable.h
··· 105 105 #define __S110 PAGE_SHARED_C 106 106 #define __S111 PAGE_SHARED_C 107 107 108 + #define pmd_pgtable(pmd) ((pgtable_t)pmd_page_vaddr(pmd)) 109 + 108 110 /* 109 111 * Conversion functions: convert a page and protection to a page entry, 110 112 * and a page entry and page directory to the page they refer to.
-1
arch/m68k/include/asm/pgtable_mm.h
··· 72 72 #define PTRS_PER_PGD 128 73 73 #endif 74 74 #define USER_PTRS_PER_PGD (TASK_SIZE/PGDIR_SIZE) 75 - #define FIRST_USER_ADDRESS 0UL 76 75 77 76 /* Virtual address region for use by kernel_map() */ 78 77 #ifdef CONFIG_SUN3
-1
arch/m68k/include/asm/sun3_pgalloc.h
··· 32 32 { 33 33 pmd_val(*pmd) = __pa((unsigned long)page_address(page)); 34 34 } 35 - #define pmd_pgtable(pmd) pmd_page(pmd) 36 35 37 36 /* 38 37 * allocating and freeing a pmd is trivial: the 1-entry pmd is
+1 -3
arch/microblaze/Kconfig
··· 43 43 select MMU_GATHER_NO_RANGE 44 44 select SPARSE_IRQ 45 45 select SET_FS 46 + select ZONE_DMA 46 47 47 48 # Endianness selection 48 49 choice ··· 60 59 bool "Little endian" 61 60 62 61 endchoice 63 - 64 - config ZONE_DMA 65 - def_bool y 66 62 67 63 config ARCH_HAS_ILOG2_U32 68 64 def_bool n
-2
arch/microblaze/include/asm/pgalloc.h
··· 28 28 29 29 #define pgd_alloc(mm) get_pgd() 30 30 31 - #define pmd_pgtable(pmd) pmd_page(pmd) 32 - 33 31 extern pte_t *pte_alloc_one_kernel(struct mm_struct *mm); 34 32 35 33 #define __pte_free_tlb(tlb, pte, addr) pte_free((tlb)->mm, (pte))
-2
arch/microblaze/include/asm/pgtable.h
··· 25 25 #include <asm/mmu.h> 26 26 #include <asm/page.h> 27 27 28 - #define FIRST_USER_ADDRESS 0UL 29 - 30 28 extern unsigned long va_to_phys(unsigned long address); 31 29 extern pte_t *va_to_pte(unsigned long address); 32 30
-7
arch/mips/Kconfig
··· 3274 3274 select CLKSRC_I8253 3275 3275 select CLKEVT_I8253 3276 3276 select MIPS_EXTERNAL_TIMER 3277 - 3278 - config ZONE_DMA 3279 - bool 3280 - 3281 - config ZONE_DMA32 3282 - bool 3283 - 3284 3277 endmenu 3285 3278 3286 3279 config TRAD_SIGNALS
-1
arch/mips/include/asm/pgalloc.h
··· 28 28 { 29 29 set_pmd(pmd, __pmd((unsigned long)page_address(pte))); 30 30 } 31 - #define pmd_pgtable(pmd) pmd_page(pmd) 32 31 33 32 /* 34 33 * Initialize a new pmd table with invalid pointers.
-1
arch/mips/include/asm/pgtable-32.h
··· 93 93 #endif 94 94 95 95 #define USER_PTRS_PER_PGD (0x80000000UL/PGDIR_SIZE) 96 - #define FIRST_USER_ADDRESS 0UL 97 96 98 97 #define VMALLOC_START MAP_BASE 99 98
-1
arch/mips/include/asm/pgtable-64.h
··· 137 137 #define PTRS_PER_PTE ((PAGE_SIZE << PTE_ORDER) / sizeof(pte_t)) 138 138 139 139 #define USER_PTRS_PER_PGD ((TASK_SIZE64 / PGDIR_SIZE)?(TASK_SIZE64 / PGDIR_SIZE):1) 140 - #define FIRST_USER_ADDRESS 0UL 141 140 142 141 /* 143 142 * TLB refill handlers also map the vmalloc area into xuseg. Avoid
+3
arch/mips/include/uapi/asm/mman.h
··· 98 98 #define MADV_COLD 20 /* deactivate these pages */ 99 99 #define MADV_PAGEOUT 21 /* reclaim these pages */ 100 100 101 + #define MADV_POPULATE_READ 22 /* populate (prefault) page tables readable */ 102 + #define MADV_POPULATE_WRITE 23 /* populate (prefault) page tables writable */ 103 + 101 104 /* compatibility flags */ 102 105 #define MAP_FILE 0 103 106
+1
arch/mips/kernel/relocate.c
··· 18 18 #include <linux/kernel.h> 19 19 #include <linux/libfdt.h> 20 20 #include <linux/of_fdt.h> 21 + #include <linux/panic_notifier.h> 21 22 #include <linux/sched/task.h> 22 23 #include <linux/start_kernel.h> 23 24 #include <linux/string.h>
+1
arch/mips/sgi-ip22/ip22-reset.c
··· 12 12 #include <linux/kernel.h> 13 13 #include <linux/sched/signal.h> 14 14 #include <linux/notifier.h> 15 + #include <linux/panic_notifier.h> 15 16 #include <linux/pm.h> 16 17 #include <linux/timer.h> 17 18
+1
arch/mips/sgi-ip32/ip32-reset.c
··· 12 12 #include <linux/init.h> 13 13 #include <linux/kernel.h> 14 14 #include <linux/module.h> 15 + #include <linux/panic_notifier.h> 15 16 #include <linux/sched.h> 16 17 #include <linux/sched/signal.h> 17 18 #include <linux/notifier.h>
-5
arch/nds32/include/asm/pgalloc.h
··· 12 12 #define __HAVE_ARCH_PTE_ALLOC_ONE 13 13 #include <asm-generic/pgalloc.h> /* for pte_{alloc,free}_one */ 14 14 15 - /* 16 - * Since we have only two-level page tables, these are trivial 17 - */ 18 - #define pmd_pgtable(pmd) pmd_page(pmd) 19 - 20 15 extern pgd_t *pgd_alloc(struct mm_struct *mm); 21 16 extern void pgd_free(struct mm_struct *mm, pgd_t * pgd); 22 17
-1
arch/nios2/include/asm/pgalloc.h
··· 25 25 { 26 26 set_pmd(pmd, __pmd((unsigned long)page_address(pte))); 27 27 } 28 - #define pmd_pgtable(pmd) pmd_page(pmd) 29 28 30 29 /* 31 30 * Initialize a new pmd table with invalid pointers.
-2
arch/nios2/include/asm/pgtable.h
··· 24 24 #include <asm/pgtable-bits.h> 25 25 #include <asm-generic/pgtable-nopmd.h> 26 26 27 - #define FIRST_USER_ADDRESS 0UL 28 - 29 27 #define VMALLOC_START CONFIG_NIOS2_KERNEL_MMU_REGION_BASE 30 28 #define VMALLOC_END (CONFIG_NIOS2_KERNEL_REGION_BASE - 1) 31 29
-2
arch/openrisc/include/asm/pgalloc.h
··· 72 72 tlb_remove_page((tlb), (pte)); \ 73 73 } while (0) 74 74 75 - #define pmd_pgtable(pmd) pmd_page(pmd) 76 - 77 75 #endif
-1
arch/openrisc/include/asm/pgtable.h
··· 73 73 */ 74 74 75 75 #define USER_PTRS_PER_PGD (TASK_SIZE/PGDIR_SIZE) 76 - #define FIRST_USER_ADDRESS 0UL 77 76 78 77 /* 79 78 * Kernels own virtual memory area.
-1
arch/parisc/include/asm/pgalloc.h
··· 69 69 70 70 #define pmd_populate(mm, pmd, pte_page) \ 71 71 pmd_populate_kernel(mm, pmd, page_address(pte_page)) 72 - #define pmd_pgtable(pmd) pmd_page(pmd) 73 72 74 73 #endif
-2
arch/parisc/include/asm/pgtable.h
··· 171 171 * pgd entries used up by user/kernel: 172 172 */ 173 173 174 - #define FIRST_USER_ADDRESS 0UL 175 - 176 174 /* NB: The tlb miss handlers make certain assumptions about the order */ 177 175 /* of the following bits, so be careful (One example, bits 25-31 */ 178 176 /* are moved together in one instruction). */
+3
arch/parisc/include/uapi/asm/mman.h
··· 52 52 #define MADV_COLD 20 /* deactivate these pages */ 53 53 #define MADV_PAGEOUT 21 /* reclaim these pages */ 54 54 55 + #define MADV_POPULATE_READ 22 /* populate (prefault) page tables readable */ 56 + #define MADV_POPULATE_WRITE 23 /* populate (prefault) page tables writable */ 57 + 55 58 #define MADV_MERGEABLE 65 /* KSM may merge identical pages */ 56 59 #define MADV_UNMERGEABLE 66 /* KSM may not merge identical pages */ 57 60
+1
arch/parisc/kernel/pdc_chassis.c
··· 20 20 #include <linux/init.h> 21 21 #include <linux/module.h> 22 22 #include <linux/kernel.h> 23 + #include <linux/panic_notifier.h> 23 24 #include <linux/reboot.h> 24 25 #include <linux/notifier.h> 25 26 #include <linux/cache.h>
+1 -5
arch/powerpc/Kconfig
··· 187 187 select GENERIC_VDSO_TIME_NS 188 188 select HAVE_ARCH_AUDITSYSCALL 189 189 select HAVE_ARCH_HUGE_VMALLOC if HAVE_ARCH_HUGE_VMAP 190 - select HAVE_ARCH_HUGE_VMAP if PPC_BOOK3S_64 && PPC_RADIX_MMU 190 + select HAVE_ARCH_HUGE_VMAP if PPC_RADIX_MMU || PPC_8xx 191 191 select HAVE_ARCH_JUMP_LABEL 192 192 select HAVE_ARCH_JUMP_LABEL_RELATIVE 193 193 select HAVE_ARCH_KASAN if PPC32 && PPC_PAGE_SHIFT <= 14 ··· 402 402 403 403 config PPC_DAWR 404 404 bool 405 - 406 - config ZONE_DMA 407 - bool 408 - default y if PPC_BOOK3E_64 409 405 410 406 config PGTABLE_LEVELS 411 407 int
-1
arch/powerpc/include/asm/book3s/pgtable.h
··· 8 8 #include <asm/book3s/32/pgtable.h> 9 9 #endif 10 10 11 - #define FIRST_USER_ADDRESS 0UL 12 11 #ifndef __ASSEMBLY__ 13 12 /* Insert a PTE, top-level function is out of line. It uses an inline 14 13 * low level function in the respective pgtable-* files
+2 -3
arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h
··· 66 66 } 67 67 68 68 #ifdef CONFIG_PPC_4K_PAGES 69 - static inline pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma, 70 - struct page *page, int writable) 69 + static inline pte_t arch_make_huge_pte(pte_t entry, unsigned int shift, vm_flags_t flags) 71 70 { 72 - size_t size = huge_page_size(hstate_vma(vma)); 71 + size_t size = 1UL << shift; 73 72 74 73 if (size == SZ_16K) 75 74 return __pte(pte_val(entry) & ~_PAGE_HUGE);
+43
arch/powerpc/include/asm/nohash/32/mmu-8xx.h
··· 178 178 #ifndef __ASSEMBLY__ 179 179 180 180 #include <linux/mmdebug.h> 181 + #include <linux/sizes.h> 181 182 182 183 void mmu_pin_tlb(unsigned long top, bool readonly); 183 184 ··· 225 224 return mmu_psize_defs[mmu_psize].shift; 226 225 BUG(); 227 226 } 227 + 228 + static inline bool arch_vmap_try_size(unsigned long addr, unsigned long end, u64 pfn, 229 + unsigned int max_page_shift, unsigned long size) 230 + { 231 + if (end - addr < size) 232 + return false; 233 + 234 + if ((1UL << max_page_shift) < size) 235 + return false; 236 + 237 + if (!IS_ALIGNED(addr, size)) 238 + return false; 239 + 240 + if (!IS_ALIGNED(PFN_PHYS(pfn), size)) 241 + return false; 242 + 243 + return true; 244 + } 245 + 246 + static inline unsigned long arch_vmap_pte_range_map_size(unsigned long addr, unsigned long end, 247 + u64 pfn, unsigned int max_page_shift) 248 + { 249 + if (arch_vmap_try_size(addr, end, pfn, max_page_shift, SZ_512K)) 250 + return SZ_512K; 251 + if (PAGE_SIZE == SZ_16K) 252 + return SZ_16K; 253 + if (arch_vmap_try_size(addr, end, pfn, max_page_shift, SZ_16K)) 254 + return SZ_16K; 255 + return PAGE_SIZE; 256 + } 257 + #define arch_vmap_pte_range_map_size arch_vmap_pte_range_map_size 258 + 259 + static inline int arch_vmap_pte_supported_shift(unsigned long size) 260 + { 261 + if (size >= SZ_512K) 262 + return 19; 263 + else if (size >= SZ_16K) 264 + return 14; 265 + else 266 + return PAGE_SHIFT; 267 + } 268 + #define arch_vmap_pte_supported_shift arch_vmap_pte_supported_shift 228 269 229 270 /* patch sites */ 230 271 extern s32 patch__itlbmiss_exit_1, patch__dtlbmiss_exit_1;
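The arch_vmap_pte_range_map_size() logic added above picks the largest 8xx page size (512K, then 16K) whose range, caller limit, and alignment constraints all hold, falling back to the base page size. A standalone sketch of the same selection — function names here are illustrative rather than kernel API, physical addresses are used directly instead of PFNs, and a 4K base page (CONFIG_PPC_4K_PAGES) is assumed, so the kernel's 16K-base-page early return is omitted:

```c
#include <stdbool.h>
#include <stdint.h>

#define SZ_16K		(16UL * 1024)
#define SZ_512K		(512UL * 1024)
#define PAGE_SIZE	4096UL		/* assumes 4K base pages */

/* The four checks from arch_vmap_try_size(): the remaining range must
 * cover 'size', the caller must permit pages that large, and both the
 * virtual and physical addresses must be 'size'-aligned. */
static bool vmap_try_size(unsigned long addr, unsigned long end,
			  uint64_t phys, unsigned int max_page_shift,
			  unsigned long size)
{
	if (end - addr < size)
		return false;
	if ((1UL << max_page_shift) < size)
		return false;
	if (addr & (size - 1))
		return false;
	if (phys & (size - 1))
		return false;
	return true;
}

/* Largest size usable for the next PTE mapping, mirroring
 * arch_vmap_pte_range_map_size(): try 512K, then 16K, else 4K. */
static unsigned long vmap_pte_range_map_size(unsigned long addr, unsigned long end,
					     uint64_t phys, unsigned int max_page_shift)
{
	if (vmap_try_size(addr, end, phys, max_page_shift, SZ_512K))
		return SZ_512K;
	if (vmap_try_size(addr, end, phys, max_page_shift, SZ_16K))
		return SZ_16K;
	return PAGE_SIZE;
}
```

With a 512K-aligned, 512K-long range and a max_page_shift of 19 this selects SZ_512K; shrinking the range (or capping max_page_shift at 14) degrades it to SZ_16K, and a misaligned start degrades it to PAGE_SIZE — matching the shift values 19 and 14 returned by arch_vmap_pte_supported_shift() in the hunk above.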
-1
arch/powerpc/include/asm/nohash/32/pgtable.h
··· 54 54 #define PGD_MASKED_BITS 0 55 55 56 56 #define USER_PTRS_PER_PGD (TASK_SIZE / PGDIR_SIZE) 57 - #define FIRST_USER_ADDRESS 0UL 58 57 59 58 #define pte_ERROR(e) \ 60 59 pr_err("%s:%d: bad pte %llx.\n", __FILE__, __LINE__, \
-2
arch/powerpc/include/asm/nohash/64/pgtable.h
··· 12 12 #include <asm/barrier.h> 13 13 #include <asm/asm-const.h> 14 14 15 - #define FIRST_USER_ADDRESS 0UL 16 - 17 15 /* 18 16 * Size of EA range mapped by our pagetables. 19 17 */
-5
arch/powerpc/include/asm/pgalloc.h
··· 70 70 #include <asm/nohash/pgalloc.h> 71 71 #endif 72 72 73 - static inline pgtable_t pmd_pgtable(pmd_t pmd) 74 - { 75 - return (pgtable_t)pmd_page_vaddr(pmd); 76 - } 77 - 78 73 #endif /* _ASM_POWERPC_PGALLOC_H */
+6
arch/powerpc/include/asm/pgtable.h
··· 152 152 } 153 153 #endif 154 154 155 + #define pmd_pgtable pmd_pgtable 156 + static inline pgtable_t pmd_pgtable(pmd_t pmd) 157 + { 158 + return (pgtable_t)pmd_page_vaddr(pmd); 159 + } 160 + 155 161 #ifdef CONFIG_PPC64 156 162 #define is_ioremap_addr is_ioremap_addr 157 163 static inline bool is_ioremap_addr(const void *x)
+1
arch/powerpc/kernel/setup-common.c
··· 9 9 #undef DEBUG 10 10 11 11 #include <linux/export.h> 12 + #include <linux/panic_notifier.h> 12 13 #include <linux/string.h> 13 14 #include <linux/sched.h> 14 15 #include <linux/init.h>
+1
arch/powerpc/platforms/Kconfig.cputype
··· 111 111 select PPC_FPU # Make it a choice ? 112 112 select PPC_SMP_MUXED_IPI 113 113 select PPC_DOORBELL 114 + select ZONE_DMA 114 115 115 116 endchoice 116 117
+1 -4
arch/riscv/Kconfig
··· 104 104 select SYSCTL_EXCEPTION_TRACE 105 105 select THREAD_INFO_IN_TASK 106 106 select UACCESS_MEMCPY if !MMU 107 + select ZONE_DMA32 if 64BIT 107 108 108 109 config ARCH_MMAP_RND_BITS_MIN 109 110 default 18 if 64BIT ··· 133 132 help 134 133 Select if you want MMU-based virtualised addressing space 135 134 support by paged memory management. If unsure, say 'Y'. 136 - 137 - config ZONE_DMA32 138 - bool 139 - default y if 64BIT 140 135 141 136 config VA_BITS 142 137 int
-2
arch/riscv/include/asm/pgalloc.h
··· 38 38 } 39 39 #endif /* __PAGETABLE_PMD_FOLDED */ 40 40 41 - #define pmd_pgtable(pmd) pmd_page(pmd) 42 - 43 41 static inline pgd_t *pgd_alloc(struct mm_struct *mm) 44 42 { 45 43 pgd_t *pgd;
-2
arch/riscv/include/asm/pgtable.h
··· 536 536 void paging_init(void); 537 537 void misc_mem_init(void); 538 538 539 - #define FIRST_USER_ADDRESS 0 540 - 541 539 /* 542 540 * ZERO_PAGE is a global shared page that is always zero, 543 541 * used for zero-mapped memory areas, etc.
+2 -4
arch/s390/Kconfig
··· 2 2 config MMU 3 3 def_bool y 4 4 5 - config ZONE_DMA 6 - def_bool y 7 - 8 5 config CPU_BIG_ENDIAN 9 6 def_bool y 10 7 ··· 59 62 select ARCH_BINFMT_ELF_STATE 60 63 select ARCH_ENABLE_MEMORY_HOTPLUG if SPARSEMEM 61 64 select ARCH_ENABLE_MEMORY_HOTREMOVE 62 - select ARCH_ENABLE_SPLIT_PMD_PTLOCK 65 + select ARCH_ENABLE_SPLIT_PMD_PTLOCK if PGTABLE_LEVELS > 2 63 66 select ARCH_HAS_DEBUG_VM_PGTABLE 64 67 select ARCH_HAS_DEBUG_WX 65 68 select ARCH_HAS_DEVMEM_IS_ALLOWED ··· 208 211 select THREAD_INFO_IN_TASK 209 212 select TTY 210 213 select VIRT_CPU_ACCOUNTING 214 + select ZONE_DMA 211 215 # Note: keep the above list sorted alphabetically 212 216 213 217 config SCHED_OMIT_FRAME_POINTER
-3
arch/s390/include/asm/pgalloc.h
··· 134 134 135 135 #define pmd_populate_kernel(mm, pmd, pte) pmd_populate(mm, pmd, pte) 136 136 137 - #define pmd_pgtable(pmd) \ 138 - ((pgtable_t)__va(pmd_val(pmd) & -sizeof(pte_t)*PTRS_PER_PTE)) 139 - 140 137 /* 141 138 * page table entry allocation/free routines. 142 139 */
+3 -2
arch/s390/include/asm/pgtable.h
··· 65 65 66 66 /* TODO: s390 cannot support io_remap_pfn_range... */ 67 67 68 - #define FIRST_USER_ADDRESS 0UL 69 - 70 68 #define pte_ERROR(e) \ 71 69 printk("%s:%d: bad pte %p.\n", __FILE__, __LINE__, (void *) pte_val(e)) 72 70 #define pmd_ERROR(e) \ ··· 1708 1710 /* s390 has a private copy of get unmapped area to deal with cache synonyms */ 1709 1711 #define HAVE_ARCH_UNMAPPED_AREA 1710 1712 #define HAVE_ARCH_UNMAPPED_AREA_TOPDOWN 1713 + 1714 + #define pmd_pgtable(pmd) \ 1715 + ((pgtable_t)__va(pmd_val(pmd) & -sizeof(pte_t)*PTRS_PER_PTE)) 1711 1716 1712 1717 #endif /* _S390_PAGE_H */
+1
arch/s390/kernel/ipl.c
··· 13 13 #include <linux/init.h> 14 14 #include <linux/device.h> 15 15 #include <linux/delay.h> 16 + #include <linux/panic_notifier.h> 16 17 #include <linux/reboot.h> 17 18 #include <linux/ctype.h> 18 19 #include <linux/fs.h>
-5
arch/s390/kernel/kprobes.c
··· 44 44 return page; 45 45 } 46 46 47 - void free_insn_page(void *page) 48 - { 49 - module_memfree(page); 50 - } 51 - 52 47 static void *alloc_s390_insn_page(void) 53 48 { 54 49 if (xchg(&insn_page_in_use, 1) == 1)
+1 -1
arch/s390/mm/pgtable.c
··· 691 691 if (!non_swap_entry(entry)) 692 692 dec_mm_counter(mm, MM_SWAPENTS); 693 693 else if (is_migration_entry(entry)) { 694 - struct page *page = migration_entry_to_page(entry); 694 + struct page *page = pfn_swap_entry_to_page(entry); 695 695 696 696 dec_mm_counter(mm, mm_counter(page)); 697 697 }
-1
arch/sh/include/asm/pgalloc.h
··· 30 30 { 31 31 set_pmd(pmd, __pmd((unsigned long)page_address(pte))); 32 32 } 33 - #define pmd_pgtable(pmd) pmd_page(pmd) 34 33 35 34 #define __pte_free_tlb(tlb,pte,addr) \ 36 35 do { \
-2
arch/sh/include/asm/pgtable.h
··· 59 59 /* Entries per level */ 60 60 #define PTRS_PER_PTE (PAGE_SIZE / (1 << PTE_MAGNITUDE)) 61 61 62 - #define FIRST_USER_ADDRESS 0UL 63 - 64 62 #define PHYS_ADDR_MASK29 0x1fffffff 65 63 #define PHYS_ADDR_MASK32 0xffffffff 66 64
+1 -4
arch/sparc/Kconfig
··· 59 59 select CLZ_TAB 60 60 select HAVE_UID16 61 61 select OLD_SIGACTION 62 + select ZONE_DMA 62 63 63 64 config SPARC64 64 65 def_bool 64BIT ··· 141 140 bool 142 141 default y if SPARC32 143 142 select KMAP_LOCAL 144 - 145 - config ZONE_DMA 146 - bool 147 - default y if SPARC32 148 143 149 144 config GENERIC_ISA_DMA 150 145 bool
-1
arch/sparc/include/asm/pgalloc_32.h
··· 51 51 #define __pmd_free_tlb(tlb, pmd, addr) pmd_free((tlb)->mm, pmd) 52 52 53 53 #define pmd_populate(mm, pmd, pte) pmd_set(pmd, pte) 54 - #define pmd_pgtable(pmd) (pgtable_t)__pmd_page(pmd) 55 54 56 55 void pmd_set(pmd_t *pmdp, pte_t *ptep); 57 56 #define pmd_populate_kernel pmd_populate
-1
arch/sparc/include/asm/pgalloc_64.h
··· 67 67 68 68 #define pmd_populate_kernel(MM, PMD, PTE) pmd_set(MM, PMD, PTE) 69 69 #define pmd_populate(MM, PMD, PTE) pmd_set(MM, PMD, PTE) 70 - #define pmd_pgtable(PMD) ((pte_t *)pmd_page_vaddr(PMD)) 71 70 72 71 void pgtable_free(void *table, bool is_page); 73 72
+2 -1
arch/sparc/include/asm/pgtable_32.h
··· 48 48 #define PTRS_PER_PMD 64 49 49 #define PTRS_PER_PGD 256 50 50 #define USER_PTRS_PER_PGD PAGE_OFFSET / PGDIR_SIZE 51 - #define FIRST_USER_ADDRESS 0UL 52 51 #define PTE_SIZE (PTRS_PER_PTE*4) 53 52 54 53 #define PAGE_NONE SRMMU_PAGE_NONE ··· 431 432 432 433 /* We provide our own get_unmapped_area to cope with VA holes for userland */ 433 434 #define HAVE_ARCH_UNMAPPED_AREA 435 + 436 + #define pmd_pgtable(pmd) ((pgtable_t)__pmd_page(pmd)) 434 437 435 438 #endif /* !(_SPARC_PGTABLE_H) */
+3 -5
arch/sparc/include/asm/pgtable_64.h
··· 95 95 #define PTRS_PER_PUD (1UL << PUD_BITS) 96 96 #define PTRS_PER_PGD (1UL << PGDIR_BITS) 97 97 98 - /* Kernel has a separate 44bit address space. */ 99 - #define FIRST_USER_ADDRESS 0UL 100 - 101 98 #define pmd_ERROR(e) \ 102 99 pr_err("%s:%d: bad pmd %p(%016lx) seen at (%pS)\n", \ 103 100 __FILE__, __LINE__, &(e), pmd_val(e), __builtin_return_address(0)) ··· 374 377 #define pgprot_noncached pgprot_noncached 375 378 376 379 #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE) 377 - extern pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma, 378 - struct page *page, int writable); 380 + pte_t arch_make_huge_pte(pte_t entry, unsigned int shift, vm_flags_t flags); 379 381 #define arch_make_huge_pte arch_make_huge_pte 380 382 static inline unsigned long __pte_default_huge_mask(void) 381 383 { ··· 1116 1120 extern unsigned long cmdline_memory_size; 1117 1121 1118 1122 asmlinkage void do_sparc64_fault(struct pt_regs *regs); 1123 + 1124 + #define pmd_pgtable(PMD) ((pte_t *)pmd_page_vaddr(PMD)) 1119 1125 1120 1126 #ifdef CONFIG_HUGETLB_PAGE 1121 1127
+1
arch/sparc/kernel/sstate.c
··· 6 6 7 7 #include <linux/kernel.h> 8 8 #include <linux/notifier.h> 9 + #include <linux/panic_notifier.h> 9 10 #include <linux/reboot.h> 10 11 #include <linux/init.h> 11 12
+2 -4
arch/sparc/mm/hugetlbpage.c
··· 177 177 return sun4u_hugepage_shift_to_tte(entry, shift); 178 178 } 179 179 180 - pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma, 181 - struct page *page, int writeable) 180 + pte_t arch_make_huge_pte(pte_t entry, unsigned int shift, vm_flags_t flags) 182 181 { 183 - unsigned int shift = huge_page_shift(hstate_vma(vma)); 184 182 pte_t pte; 185 183 186 184 pte = hugepage_shift_to_tte(entry, shift); ··· 186 188 #ifdef CONFIG_SPARC64 187 189 /* If this vma has ADI enabled on it, turn on TTE.mcd 188 190 */ 189 - if (vma->vm_flags & VM_SPARC_ADI) 191 + if (flags & VM_SPARC_ADI) 190 192 return pte_mkmcd(pte); 191 193 else 192 194 return pte_mknotmcd(pte);
+1
arch/sparc/mm/init_64.c
··· 27 27 #include <linux/percpu.h> 28 28 #include <linux/mmzone.h> 29 29 #include <linux/gfp.h> 30 + #include <linux/bootmem_info.h> 30 31 31 32 #include <asm/head.h> 32 33 #include <asm/page.h>
+1
arch/um/drivers/mconsole_kern.c
··· 12 12 #include <linux/mm.h> 13 13 #include <linux/module.h> 14 14 #include <linux/notifier.h> 15 + #include <linux/panic_notifier.h> 15 16 #include <linux/reboot.h> 16 17 #include <linux/sched/debug.h> 17 18 #include <linux/proc_fs.h>
-1
arch/um/include/asm/pgalloc.h
··· 19 19 set_pmd(pmd, __pmd(_PAGE_TABLE + \ 20 20 ((unsigned long long)page_to_pfn(pte) << \ 21 21 (unsigned long long) PAGE_SHIFT))) 22 - #define pmd_pgtable(pmd) pmd_page(pmd) 23 22 24 23 /* 25 24 * Allocate and free page tables.
-1
arch/um/include/asm/pgtable-2level.h
··· 23 23 #define PTRS_PER_PTE 1024 24 24 #define USER_PTRS_PER_PGD ((TASK_SIZE + (PGDIR_SIZE - 1)) / PGDIR_SIZE) 25 25 #define PTRS_PER_PGD 1024 26 - #define FIRST_USER_ADDRESS 0UL 27 26 28 27 #define pte_ERROR(e) \ 29 28 printk("%s:%d: bad pte %p(%08lx).\n", __FILE__, __LINE__, &(e), \
-1
arch/um/include/asm/pgtable-3level.h
··· 41 41 #endif 42 42 43 43 #define USER_PTRS_PER_PGD ((TASK_SIZE + (PGDIR_SIZE - 1)) / PGDIR_SIZE) 44 - #define FIRST_USER_ADDRESS 0UL 45 44 46 45 #define pte_ERROR(e) \ 47 46 printk("%s:%d: bad pte %p(%016lx).\n", __FILE__, __LINE__, &(e), \
+1
arch/um/kernel/um_arch.c
··· 7 7 #include <linux/init.h> 8 8 #include <linux/mm.h> 9 9 #include <linux/module.h> 10 + #include <linux/panic_notifier.h> 10 11 #include <linux/seq_file.h> 11 12 #include <linux/string.h> 12 13 #include <linux/utsname.h>
+3 -14
arch/x86/Kconfig
··· 33 33 select NEED_DMA_MAP_STATE 34 34 select SWIOTLB 35 35 select ARCH_HAS_ELFCORE_COMPAT 36 + select ZONE_DMA32 36 37 37 38 config FORCE_DYNAMIC_FTRACE 38 39 def_bool y ··· 64 63 select ARCH_ENABLE_HUGEPAGE_MIGRATION if X86_64 && HUGETLB_PAGE && MIGRATION 65 64 select ARCH_ENABLE_MEMORY_HOTPLUG if X86_64 || (X86_32 && HIGHMEM) 66 65 select ARCH_ENABLE_MEMORY_HOTREMOVE if MEMORY_HOTPLUG 67 - select ARCH_ENABLE_SPLIT_PMD_PTLOCK if X86_64 || X86_PAE 66 + select ARCH_ENABLE_SPLIT_PMD_PTLOCK if (PGTABLE_LEVELS > 2) && (X86_64 || X86_PAE) 68 67 select ARCH_ENABLE_THP_MIGRATION if X86_64 && TRANSPARENT_HUGEPAGE 69 68 select ARCH_HAS_ACPI_TABLE_UPGRADE if ACPI 70 69 select ARCH_HAS_CACHE_LINE_SIZE ··· 94 93 select ARCH_HAS_SYSCALL_WRAPPER 95 94 select ARCH_HAS_UBSAN_SANITIZE_ALL 96 95 select ARCH_HAS_DEBUG_WX 96 + select ARCH_HAS_ZONE_DMA_SET if EXPERT 97 97 select ARCH_HAVE_NMI_SAFE_CMPXCHG 98 98 select ARCH_MIGHT_HAVE_ACPI_PDC if ACPI 99 99 select ARCH_MIGHT_HAVE_PC_PARPORT ··· 346 344 config ARCH_WANT_GENERAL_HUGETLB 347 345 def_bool y 348 346 349 - config ZONE_DMA32 350 - def_bool y if X86_64 351 - 352 347 config AUDIT_ARCH 353 348 def_bool y if X86_64 354 349 ··· 392 393 the segment on 32-bit kernels. 393 394 394 395 menu "Processor type and features" 395 - 396 - config ZONE_DMA 397 - bool "DMA memory allocation support" if EXPERT 398 - default y 399 - help 400 - DMA memory allocation support allows devices with less than 32-bit 401 - addressing to allocate within the first 16MB of address space. 402 - Disable if no such devices will be used. 403 - 404 - If unsure, say Y. 405 396 406 397 config SMP 407 398 bool "Symmetric multi-processing support"
+1
arch/x86/include/asm/desc.h
··· 9 9 #include <asm/irq_vectors.h> 10 10 #include <asm/cpu_entry_area.h> 11 11 12 + #include <linux/debug_locks.h> 12 13 #include <linux/smp.h> 13 14 #include <linux/percpu.h> 14 15
-2
arch/x86/include/asm/pgalloc.h
··· 84 84 set_pmd(pmd, __pmd(((pteval_t)pfn << PAGE_SHIFT) | _PAGE_TABLE)); 85 85 } 86 86 87 - #define pmd_pgtable(pmd) pmd_page(pmd) 88 - 89 87 #if CONFIG_PGTABLE_LEVELS > 2 90 88 extern void ___pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd); 91 89
-2
arch/x86/include/asm/pgtable_types.h
··· 7 7 8 8 #include <asm/page_types.h> 9 9 10 - #define FIRST_USER_ADDRESS 0UL 11 - 12 10 #define _PAGE_BIT_PRESENT 0 /* is present */ 13 11 #define _PAGE_BIT_RW 1 /* writeable */ 14 12 #define _PAGE_BIT_USER 2 /* userspace addressable */
+1
arch/x86/kernel/cpu/mshyperv.c
··· 17 17 #include <linux/irq.h> 18 18 #include <linux/kexec.h> 19 19 #include <linux/i8253.h> 20 + #include <linux/panic_notifier.h> 20 21 #include <linux/random.h> 21 22 #include <asm/processor.h> 22 23 #include <asm/hypervisor.h>
-6
arch/x86/kernel/kprobes/core.c
··· 422 422 return page; 423 423 } 424 424 425 - /* Recover page to RW mode before releasing it */ 426 - void free_insn_page(void *page) 427 - { 428 - module_memfree(page); 429 - } 430 - 431 425 /* Kprobe x86 instruction emulation - only regs->ip or IF flag modifiers */ 432 426 433 427 static void kprobe_emulate_ifmodifiers(struct kprobe *p, struct pt_regs *regs)
+1
arch/x86/kernel/setup.c
··· 14 14 #include <linux/initrd.h> 15 15 #include <linux/iscsi_ibft.h> 16 16 #include <linux/memblock.h> 17 + #include <linux/panic_notifier.h> 17 18 #include <linux/pci.h> 18 19 #include <linux/root_dev.h> 19 20 #include <linux/hugetlb.h>
+3 -2
arch/x86/mm/init_64.c
··· 33 33 #include <linux/nmi.h> 34 34 #include <linux/gfp.h> 35 35 #include <linux/kcore.h> 36 + #include <linux/bootmem_info.h> 36 37 37 38 #include <asm/processor.h> 38 39 #include <asm/bios_ebda.h> ··· 1270 1269 1271 1270 static void __init register_page_bootmem_info(void) 1272 1271 { 1273 - #ifdef CONFIG_NUMA 1272 + #if defined(CONFIG_NUMA) || defined(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP) 1274 1273 int i; 1275 1274 1276 1275 for_each_online_node(i) ··· 1624 1623 return err; 1625 1624 } 1626 1625 1627 - #if defined(CONFIG_MEMORY_HOTPLUG_SPARSE) && defined(CONFIG_HAVE_BOOTMEM_INFO_NODE) 1626 + #ifdef CONFIG_HAVE_BOOTMEM_INFO_NODE 1628 1627 void register_page_bootmem_memmap(unsigned long section_nr, 1629 1628 struct page *start_page, unsigned long nr_pages) 1630 1629 {
+19 -15
arch/x86/mm/pgtable.c
··· 682 682 } 683 683 #endif 684 684 685 + #if CONFIG_PGTABLE_LEVELS > 3 685 686 /** 686 687 * pud_set_huge - setup kernel PUD mapping 687 688 * ··· 722 721 } 723 722 724 723 /** 724 + * pud_clear_huge - clear kernel PUD mapping when it is set 725 + * 726 + * Returns 1 on success and 0 on failure (no PUD map is found). 727 + */ 728 + int pud_clear_huge(pud_t *pud) 729 + { 730 + if (pud_large(*pud)) { 731 + pud_clear(pud); 732 + return 1; 733 + } 734 + 735 + return 0; 736 + } 737 + #endif 738 + 739 + #if CONFIG_PGTABLE_LEVELS > 2 740 + /** 725 741 * pmd_set_huge - setup kernel PMD mapping 726 742 * 727 743 * See text over pud_set_huge() above. ··· 769 751 } 770 752 771 753 /** 772 - * pud_clear_huge - clear kernel PUD mapping when it is set 773 - * 774 - * Returns 1 on success and 0 on failure (no PUD map is found). 775 - */ 776 - int pud_clear_huge(pud_t *pud) 777 - { 778 - if (pud_large(*pud)) { 779 - pud_clear(pud); 780 - return 1; 781 - } 782 - 783 - return 0; 784 - } 785 - 786 - /** 787 754 * pmd_clear_huge - clear kernel PMD mapping when it is set 788 755 * 789 756 * Returns 1 on success and 0 on failure (no PMD map is found). ··· 782 779 783 780 return 0; 784 781 } 782 + #endif 785 783 786 784 #ifdef CONFIG_X86_64 787 785 /**
+2
arch/x86/purgatory/purgatory.c
··· 9 9 */ 10 10 11 11 #include <linux/bug.h> 12 + #include <linux/kernel.h> 13 + #include <linux/types.h> 12 14 #include <crypto/sha2.h> 13 15 #include <asm/purgatory.h> 14 16
+1
arch/x86/xen/enlighten.c
··· 6 6 #include <linux/cpu.h> 7 7 #include <linux/kexec.h> 8 8 #include <linux/slab.h> 9 + #include <linux/panic_notifier.h> 9 10 10 11 #include <xen/xen.h> 11 12 #include <xen/features.h>
-2
arch/xtensa/include/asm/pgalloc.h
··· 25 25 (pmd_val(*(pmdp)) = ((unsigned long)ptep)) 26 26 #define pmd_populate(mm, pmdp, page) \ 27 27 (pmd_val(*(pmdp)) = ((unsigned long)page_to_virt(page))) 28 - #define pmd_pgtable(pmd) pmd_page(pmd) 29 28 30 29 static inline pgd_t* 31 30 pgd_alloc(struct mm_struct *mm) ··· 62 63 return page; 63 64 } 64 65 65 - #define pmd_pgtable(pmd) pmd_page(pmd) 66 66 #endif /* CONFIG_MMU */ 67 67 68 68 #endif /* _XTENSA_PGALLOC_H */
-1
arch/xtensa/include/asm/pgtable.h
··· 59 59 #define PTRS_PER_PGD 1024 60 60 #define PGD_ORDER 0 61 61 #define USER_PTRS_PER_PGD (TASK_SIZE/PGDIR_SIZE) 62 - #define FIRST_USER_ADDRESS 0UL 63 62 #define FIRST_USER_PGD_NR (FIRST_USER_ADDRESS >> PGDIR_SHIFT) 64 63 65 64 #ifdef CONFIG_MMU
+3
arch/xtensa/include/uapi/asm/mman.h
··· 106 106 #define MADV_COLD 20 /* deactivate these pages */ 107 107 #define MADV_PAGEOUT 21 /* reclaim these pages */ 108 108 109 + #define MADV_POPULATE_READ 22 /* populate (prefault) page tables readable */ 110 + #define MADV_POPULATE_WRITE 23 /* populate (prefault) page tables writable */ 111 + 109 112 /* compatibility flags */ 110 113 #define MAP_FILE 0 111 114
+1
arch/xtensa/platforms/iss/setup.c
··· 14 14 #include <linux/init.h> 15 15 #include <linux/kernel.h> 16 16 #include <linux/notifier.h> 17 + #include <linux/panic_notifier.h> 17 18 #include <linux/printk.h> 18 19 #include <linux/string.h> 19 20
+1 -1
drivers/block/zram/zram_drv.h
··· 113 113 * zram is claimed so open request will be failed 114 114 */ 115 115 bool claim; /* Protected by disk->open_mutex */ 116 - struct file *backing_dev; 117 116 #ifdef CONFIG_ZRAM_WRITEBACK 117 + struct file *backing_dev; 118 118 spinlock_t wb_limit_lock; 119 119 bool wb_limit_enable; 120 120 u64 bd_wb_limit;
+1
drivers/bus/brcmstb_gisb.c
··· 6 6 #include <linux/init.h> 7 7 #include <linux/types.h> 8 8 #include <linux/module.h> 9 + #include <linux/panic_notifier.h> 9 10 #include <linux/platform_device.h> 10 11 #include <linux/interrupt.h> 11 12 #include <linux/sysfs.h>
+1
drivers/char/ipmi/ipmi_msghandler.c
··· 16 16 17 17 #include <linux/module.h> 18 18 #include <linux/errno.h> 19 + #include <linux/panic_notifier.h> 19 20 #include <linux/poll.h> 20 21 #include <linux/sched.h> 21 22 #include <linux/seq_file.h>
+4
drivers/clk/analogbits/wrpll-cln28hpc.c
··· 23 23 24 24 #include <linux/bug.h> 25 25 #include <linux/err.h> 26 + #include <linux/limits.h> 26 27 #include <linux/log2.h> 27 28 #include <linux/math64.h> 29 + #include <linux/math.h> 30 + #include <linux/minmax.h> 31 + 28 32 #include <linux/clk/analogbits-wrpll-cln28hpc.h> 29 33 30 34 /* MIN_INPUT_FREQ: minimum input clock frequency, in Hz (Fref_min) */
+1
drivers/edac/altera_edac.c
··· 20 20 #include <linux/of_address.h> 21 21 #include <linux/of_irq.h> 22 22 #include <linux/of_platform.h> 23 + #include <linux/panic_notifier.h> 23 24 #include <linux/platform_device.h> 24 25 #include <linux/regmap.h> 25 26 #include <linux/types.h>
+1
drivers/firmware/google/gsmi.c
··· 19 19 #include <linux/dma-mapping.h> 20 20 #include <linux/fs.h> 21 21 #include <linux/slab.h> 22 + #include <linux/panic_notifier.h> 22 23 #include <linux/ioctl.h> 23 24 #include <linux/acpi.h> 24 25 #include <linux/io.h>
+1
drivers/gpu/drm/nouveau/include/nvif/if000c.h
··· 77 77 #define NVIF_VMM_PFNMAP_V0_APER 0x00000000000000f0ULL 78 78 #define NVIF_VMM_PFNMAP_V0_HOST 0x0000000000000000ULL 79 79 #define NVIF_VMM_PFNMAP_V0_VRAM 0x0000000000000010ULL 80 + #define NVIF_VMM_PFNMAP_V0_A 0x0000000000000004ULL 80 81 #define NVIF_VMM_PFNMAP_V0_W 0x0000000000000002ULL 81 82 #define NVIF_VMM_PFNMAP_V0_V 0x0000000000000001ULL 82 83 #define NVIF_VMM_PFNMAP_V0_NONE 0x0000000000000000ULL
+133 -23
drivers/gpu/drm/nouveau/nouveau_svm.c
··· 35 35 #include <linux/sched/mm.h> 36 36 #include <linux/sort.h> 37 37 #include <linux/hmm.h> 38 + #include <linux/rmap.h> 39 40 struct nouveau_svm { 40 41 struct nouveau_drm *drm; ··· 67 66 int fault_nr; 68 67 } buffer[1]; 69 68 }; 69 + 70 + #define FAULT_ACCESS_READ 0 71 + #define FAULT_ACCESS_WRITE 1 72 + #define FAULT_ACCESS_ATOMIC 2 73 + #define FAULT_ACCESS_PREFETCH 3 70 74 71 75 #define SVM_DBG(s,f,a...) NV_DEBUG((s)->drm, "svm: "f"\n", ##a) 72 76 #define SVM_ERR(s,f,a...) NV_WARN((s)->drm, "svm: "f"\n", ##a) ··· 271 265 * the invalidation is handled as part of the migration process. 272 266 */ 273 267 if (update->event == MMU_NOTIFY_MIGRATE && 274 - update->migrate_pgmap_owner == svmm->vmm->cli->drm->dev) 268 + update->owner == svmm->vmm->cli->drm->dev) 275 269 goto out; 276 270 277 271 if (limit > svmm->unmanaged.start && start < svmm->unmanaged.limit) { ··· 418 412 } 419 413 420 414 static int 415 + nouveau_svm_fault_priority(u8 fault) 416 + { 417 + switch (fault) { 418 + case FAULT_ACCESS_PREFETCH: 419 + return 0; 420 + case FAULT_ACCESS_READ: 421 + return 1; 422 + case FAULT_ACCESS_WRITE: 423 + return 2; 424 + case FAULT_ACCESS_ATOMIC: 425 + return 3; 426 + default: 427 + WARN_ON_ONCE(1); 428 + return -1; 429 + } 430 + } 431 + 432 + static int 421 433 nouveau_svm_fault_cmp(const void *a, const void *b) 422 434 { 423 435 const struct nouveau_svm_fault *fa = *(struct nouveau_svm_fault **)a; ··· 445 421 return ret; 446 422 if ((ret = (s64)fa->addr - fb->addr)) 447 423 return ret; 448 - /*XXX: atomic? */ 449 - return (fa->access == 0 || fa->access == 3) - 450 - (fb->access == 0 || fb->access == 3); 424 + return nouveau_svm_fault_priority(fa->access) - 425 + nouveau_svm_fault_priority(fb->access); 451 426 } 452 427 453 428 static void ··· 509 486 { 510 487 struct svm_notifier *sn = 511 488 container_of(mni, struct svm_notifier, notifier); 489 + 490 + if (range->event == MMU_NOTIFY_EXCLUSIVE && 491 + range->owner == sn->svmm->vmm->cli->drm->dev) 492 + return true; 512 493 513 494 /* 514 495 * serializes the update to mni->invalidate_seq done by caller and ··· 582 555 args->p.phys[0] |= NVIF_VMM_PFNMAP_V0_W; 583 556 } 584 557 558 + static int nouveau_atomic_range_fault(struct nouveau_svmm *svmm, 559 + struct nouveau_drm *drm, 560 + struct nouveau_pfnmap_args *args, u32 size, 561 + struct svm_notifier *notifier) 562 + { 563 + unsigned long timeout = 564 + jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT); 565 + struct mm_struct *mm = svmm->notifier.mm; 566 + struct page *page; 567 + unsigned long start = args->p.addr; 568 + unsigned long notifier_seq; 569 + int ret = 0; 570 + 571 + ret = mmu_interval_notifier_insert(&notifier->notifier, mm, 572 + args->p.addr, args->p.size, 573 + &nouveau_svm_mni_ops); 574 + if (ret) 575 + return ret; 576 + 577 + while (true) { 578 + if (time_after(jiffies, timeout)) { 579 + ret = -EBUSY; 580 + goto out; 581 + } 582 + 583 + notifier_seq = mmu_interval_read_begin(&notifier->notifier); 584 + mmap_read_lock(mm); 585 + ret = make_device_exclusive_range(mm, start, start + PAGE_SIZE, 586 + &page, drm->dev); 587 + mmap_read_unlock(mm); 588 + if (ret <= 0 || !page) { 589 + ret = -EINVAL; 590 + goto out; 591 + } 592 + 593 + mutex_lock(&svmm->mutex); 594 + if (!mmu_interval_read_retry(&notifier->notifier, 595 + notifier_seq)) 596 + break; 597 + mutex_unlock(&svmm->mutex); 598 + } 599 + 600 + /* Map the page on the GPU. */ 601 + args->p.page = 12; 602 + args->p.size = PAGE_SIZE; 603 + args->p.addr = start; 604 + args->p.phys[0] = page_to_phys(page) | 605 + NVIF_VMM_PFNMAP_V0_V | 606 + NVIF_VMM_PFNMAP_V0_W | 607 + NVIF_VMM_PFNMAP_V0_A | 608 + NVIF_VMM_PFNMAP_V0_HOST; 609 + 610 + svmm->vmm->vmm.object.client->super = true; 611 + ret = nvif_object_ioctl(&svmm->vmm->vmm.object, args, size, NULL); 612 + svmm->vmm->vmm.object.client->super = false; 613 + mutex_unlock(&svmm->mutex); 614 + 615 + unlock_page(page); 616 + put_page(page); 617 + 618 + out: 619 + mmu_interval_notifier_remove(&notifier->notifier); 620 + return ret; 621 + } 622 + 585 623 static int nouveau_range_fault(struct nouveau_svmm *svmm, 586 624 struct nouveau_drm *drm, 587 625 struct nouveau_pfnmap_args *args, u32 size, ··· 659 567 unsigned long hmm_pfns[1]; 660 568 struct hmm_range range = { 661 569 .notifier = &notifier->notifier, 662 - .start = notifier->notifier.interval_tree.start, 663 - .end = notifier->notifier.interval_tree.last + 1, 664 570 .default_flags = hmm_flags, 665 571 .hmm_pfns = hmm_pfns, 666 572 .dev_private_owner = drm->dev, 667 573 }; 668 - struct mm_struct *mm = notifier->notifier.mm; 574 + struct mm_struct *mm = svmm->notifier.mm; 669 575 int ret; 670 576 577 + ret = mmu_interval_notifier_insert(&notifier->notifier, mm, 578 + args->p.addr, args->p.size, 579 + &nouveau_svm_mni_ops); 580 + if (ret) 581 + return ret; 582 + 583 + range.start = notifier->notifier.interval_tree.start; 584 + range.end = notifier->notifier.interval_tree.last + 1; 585 + 671 586 while (true) { 672 - if (time_after(jiffies, timeout)) 673 - return -EBUSY; 587 + if (time_after(jiffies, timeout)) { 588 + ret = -EBUSY; 589 + goto out; 590 + } 674 591 675 592 range.notifier_seq = mmu_interval_read_begin(range.notifier); 676 593 mmap_read_lock(mm); ··· 688 587 if (ret) { 689 588 if (ret == -EBUSY) 690 589 continue; 691 - return ret; 590 + goto out; 692 591 } 693 592 694 593 mutex_lock(&svmm->mutex); ··· 706 605 ret =
nvif_object_ioctl(&svmm->vmm->vmm.object, args, size, NULL); 707 606 svmm->vmm->vmm.object.client->super = false; 708 607 mutex_unlock(&svmm->mutex); 608 + 609 + out: 610 + mmu_interval_notifier_remove(&notifier->notifier); 709 611 710 612 return ret; 711 613 } ··· 729 625 unsigned long hmm_flags; 730 626 u64 inst, start, limit; 731 627 int fi, fn; 732 - int replay = 0, ret; 628 + int replay = 0, atomic = 0, ret; 733 629 734 630 /* Parse available fault buffer entries into a cache, and update 735 631 * the GET pointer so HW can reuse the entries. ··· 810 706 /* 811 707 * Determine required permissions based on GPU fault 812 708 * access flags. 813 - * XXX: atomic? 814 709 */ 815 710 switch (buffer->fault[fi]->access) { 816 711 case 0: /* READ. */ 817 712 hmm_flags = HMM_PFN_REQ_FAULT; 713 + break; 714 + case 2: /* ATOMIC. */ 715 + atomic = true; 818 716 break; 819 717 case 3: /* PREFETCH. */ 820 718 hmm_flags = 0; ··· 833 727 } 834 728 835 729 notifier.svmm = svmm; 836 - ret = mmu_interval_notifier_insert(&notifier.notifier, mm, 837 - args.i.p.addr, args.i.p.size, 838 - &nouveau_svm_mni_ops); 839 - if (!ret) { 730 + if (atomic) 731 + ret = nouveau_atomic_range_fault(svmm, svm->drm, 732 + &args.i, sizeof(args), 733 + &notifier); 734 + else 840 735 ret = nouveau_range_fault(svmm, svm->drm, &args.i, 841 - sizeof(args), hmm_flags, &notifier); 842 - mmu_interval_notifier_remove(&notifier.notifier); 843 - } 736 + sizeof(args), hmm_flags, 737 + &notifier); 844 738 mmput(mm); 845 739 846 740 limit = args.i.p.addr + args.i.p.size; ··· 856 750 */ 857 751 if (buffer->fault[fn]->svmm != svmm || 858 752 buffer->fault[fn]->addr >= limit || 859 - (buffer->fault[fi]->access == 0 /* READ. */ && 753 + (buffer->fault[fi]->access == FAULT_ACCESS_READ && 860 754 !(args.phys[0] & NVIF_VMM_PFNMAP_V0_V)) || 861 - (buffer->fault[fi]->access != 0 /* READ. */ && 862 - buffer->fault[fi]->access != 3 /* PREFETCH. 
*/ && 863 - !(args.phys[0] & NVIF_VMM_PFNMAP_V0_W))) 755 + (buffer->fault[fi]->access != FAULT_ACCESS_READ && 756 + buffer->fault[fi]->access != FAULT_ACCESS_PREFETCH && 757 + !(args.phys[0] & NVIF_VMM_PFNMAP_V0_W)) || 758 + (buffer->fault[fi]->access != FAULT_ACCESS_READ && 759 + buffer->fault[fi]->access != FAULT_ACCESS_WRITE && 760 + buffer->fault[fi]->access != FAULT_ACCESS_PREFETCH && 761 + !(args.phys[0] & NVIF_VMM_PFNMAP_V0_A))) 864 762 break; 865 763 } 866 764
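The comparator rewrite in nouveau_svm.c above replaces the old open-coded read/prefetch special-casing with an explicit priority ranking, so that faults at the same address sort from least to most demanding access. A minimal userspace sketch of the same ordering idea (the enum values and struct here are illustrative stand-ins, not the driver's types):

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative access codes mirroring the patch's FAULT_ACCESS_* values. */
enum { ACC_READ = 0, ACC_WRITE = 1, ACC_ATOMIC = 2, ACC_PREFETCH = 3 };

/* Rank accesses so that, at the same address, the most demanding one
 * (atomic > write > read > prefetch) sorts last, as in the patch. */
static int fault_priority(int access)
{
    switch (access) {
    case ACC_PREFETCH: return 0;
    case ACC_READ:     return 1;
    case ACC_WRITE:    return 2;
    case ACC_ATOMIC:   return 3;
    default:           return -1;
    }
}

struct fault { unsigned long addr; int access; };

/* qsort comparator: primary key is the address, secondary key is the
 * access priority, matching nouveau_svm_fault_cmp()'s final tiebreak. */
static int fault_cmp(const void *a, const void *b)
{
    const struct fault *fa = a, *fb = b;

    if (fa->addr != fb->addr)
        return fa->addr < fb->addr ? -1 : 1;
    return fault_priority(fa->access) - fault_priority(fb->access);
}
```

Sorting with an explicit priority table, rather than boolean expressions over raw access codes, is what lets the patch add the new atomic class without touching the comparator's structure again.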
+1
drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h
··· 178 178 #define NVKM_VMM_PFN_APER 0x00000000000000f0ULL 179 179 #define NVKM_VMM_PFN_HOST 0x0000000000000000ULL 180 180 #define NVKM_VMM_PFN_VRAM 0x0000000000000010ULL 181 + #define NVKM_VMM_PFN_A 0x0000000000000004ULL 181 182 #define NVKM_VMM_PFN_W 0x0000000000000002ULL 182 183 #define NVKM_VMM_PFN_V 0x0000000000000001ULL 183 184 #define NVKM_VMM_PFN_NONE 0x0000000000000000ULL
+6
drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmgp100.c
··· 88 88 if (!(*map->pfn & NVKM_VMM_PFN_W)) 89 89 data |= BIT_ULL(6); /* RO. */ 90 90 91 + if (!(*map->pfn & NVKM_VMM_PFN_A)) 92 + data |= BIT_ULL(7); /* Atomic disable. */ 93 + 91 94 if (!(*map->pfn & NVKM_VMM_PFN_VRAM)) { 92 95 addr = *map->pfn >> NVKM_VMM_PFN_ADDR_SHIFT; 93 96 addr = dma_map_page(dev, pfn_to_page(addr), 0, ··· 324 321 325 322 if (!(*map->pfn & NVKM_VMM_PFN_W)) 326 323 data |= BIT_ULL(6); /* RO. */ 324 + 325 + if (!(*map->pfn & NVKM_VMM_PFN_A)) 326 + data |= BIT_ULL(7); /* Atomic disable. */ 327 327 328 328 if (!(*map->pfn & NVKM_VMM_PFN_VRAM)) { 329 329 addr = *map->pfn >> NVKM_VMM_PFN_ADDR_SHIFT;
+1
drivers/hv/vmbus_drv.c
··· 25 25 26 26 #include <linux/delay.h> 27 27 #include <linux/notifier.h> 28 + #include <linux/panic_notifier.h> 28 29 #include <linux/ptrace.h> 29 30 #include <linux/screen_info.h> 30 31 #include <linux/kdebug.h>
+1
drivers/hwtracing/coresight/coresight-cpu-debug.c
··· 17 17 #include <linux/kernel.h> 18 18 #include <linux/module.h> 19 19 #include <linux/moduleparam.h> 20 + #include <linux/panic_notifier.h> 20 21 #include <linux/pm_qos.h> 21 22 #include <linux/slab.h> 22 23 #include <linux/smp.h>
+1
drivers/leds/trigger/ledtrig-activity.c
··· 11 11 #include <linux/kernel_stat.h> 12 12 #include <linux/leds.h> 13 13 #include <linux/module.h> 14 + #include <linux/panic_notifier.h> 14 15 #include <linux/reboot.h> 15 16 #include <linux/sched.h> 16 17 #include <linux/slab.h>
+1
drivers/leds/trigger/ledtrig-heartbeat.c
··· 11 11 #include <linux/module.h> 12 12 #include <linux/kernel.h> 13 13 #include <linux/init.h> 14 + #include <linux/panic_notifier.h> 14 15 #include <linux/slab.h> 15 16 #include <linux/timer.h> 16 17 #include <linux/sched.h>
+1
drivers/leds/trigger/ledtrig-panic.c
··· 8 8 #include <linux/kernel.h> 9 9 #include <linux/init.h> 10 10 #include <linux/notifier.h> 11 + #include <linux/panic_notifier.h> 11 12 #include <linux/leds.h> 12 13 #include "../leds.h" 13 14
+1
drivers/misc/bcm-vk/bcm_vk_dev.c
··· 9 9 #include <linux/fs.h> 10 10 #include <linux/idr.h> 11 11 #include <linux/interrupt.h> 12 + #include <linux/panic_notifier.h> 12 13 #include <linux/kref.h> 13 14 #include <linux/module.h> 14 15 #include <linux/mutex.h>
+1
drivers/misc/ibmasm/heartbeat.c
··· 9 9 */ 10 10 11 11 #include <linux/notifier.h> 12 + #include <linux/panic_notifier.h> 12 13 #include "ibmasm.h" 13 14 #include "dot_command.h" 14 15 #include "lowlevel.h"
+1
drivers/misc/pvpanic/pvpanic.c
··· 13 13 #include <linux/mod_devicetable.h> 14 14 #include <linux/module.h> 15 15 #include <linux/platform_device.h> 16 + #include <linux/panic_notifier.h> 16 17 #include <linux/types.h> 17 18 #include <linux/cdev.h> 18 19 #include <linux/list.h>
+1
drivers/net/ipa/ipa_smp2p.c
··· 8 8 #include <linux/device.h> 9 9 #include <linux/interrupt.h> 10 10 #include <linux/notifier.h> 11 + #include <linux/panic_notifier.h> 11 12 #include <linux/soc/qcom/smem.h> 12 13 #include <linux/soc/qcom/smem_state.h> 13 14
+1
drivers/parisc/power.c
··· 38 38 #include <linux/init.h> 39 39 #include <linux/kernel.h> 40 40 #include <linux/notifier.h> 41 + #include <linux/panic_notifier.h> 41 42 #include <linux/reboot.h> 42 43 #include <linux/sched/signal.h> 43 44 #include <linux/kthread.h>
+1
drivers/power/reset/ltc2952-poweroff.c
··· 52 52 #include <linux/slab.h> 53 53 #include <linux/kmod.h> 54 54 #include <linux/module.h> 55 + #include <linux/panic_notifier.h> 55 56 #include <linux/mod_devicetable.h> 56 57 #include <linux/gpio/consumer.h> 57 58 #include <linux/reboot.h>
+1
drivers/remoteproc/remoteproc_core.c
··· 20 20 #include <linux/kernel.h> 21 21 #include <linux/module.h> 22 22 #include <linux/device.h> 23 + #include <linux/panic_notifier.h> 23 24 #include <linux/slab.h> 24 25 #include <linux/mutex.h> 25 26 #include <linux/dma-map-ops.h>
+1
drivers/s390/char/con3215.c
··· 19 19 #include <linux/console.h> 20 20 #include <linux/interrupt.h> 21 21 #include <linux/err.h> 22 + #include <linux/panic_notifier.h> 22 23 #include <linux/reboot.h> 23 24 #include <linux/serial.h> /* ASYNC_* flags */ 24 25 #include <linux/slab.h>
+1
drivers/s390/char/con3270.c
··· 13 13 #include <linux/init.h> 14 14 #include <linux/interrupt.h> 15 15 #include <linux/list.h> 16 + #include <linux/panic_notifier.h> 16 17 #include <linux/types.h> 17 18 #include <linux/slab.h> 18 19 #include <linux/err.h>
+1
drivers/s390/char/sclp.c
··· 11 11 #include <linux/kernel_stat.h> 12 12 #include <linux/module.h> 13 13 #include <linux/err.h> 14 + #include <linux/panic_notifier.h> 14 15 #include <linux/spinlock.h> 15 16 #include <linux/interrupt.h> 16 17 #include <linux/timer.h>
+1
drivers/s390/char/sclp_con.c
··· 10 10 #include <linux/kmod.h> 11 11 #include <linux/console.h> 12 12 #include <linux/init.h> 13 + #include <linux/panic_notifier.h> 13 14 #include <linux/timer.h> 14 15 #include <linux/jiffies.h> 15 16 #include <linux/termios.h>
+1
drivers/s390/char/sclp_vt220.c
··· 9 9 10 10 #include <linux/module.h> 11 11 #include <linux/spinlock.h> 12 + #include <linux/panic_notifier.h> 12 13 #include <linux/list.h> 13 14 #include <linux/wait.h> 14 15 #include <linux/timer.h>
+1
drivers/s390/char/zcore.c
··· 15 15 #include <linux/init.h> 16 16 #include <linux/slab.h> 17 17 #include <linux/debugfs.h> 18 + #include <linux/panic_notifier.h> 18 19 #include <linux/reboot.h> 19 20 20 21 #include <asm/asm-offsets.h>
+1
drivers/soc/bcm/brcmstb/pm/pm-arm.c
··· 28 28 #include <linux/notifier.h> 29 29 #include <linux/of.h> 30 30 #include <linux/of_address.h> 31 + #include <linux/panic_notifier.h> 31 32 #include <linux/platform_device.h> 32 33 #include <linux/pm.h> 33 34 #include <linux/printk.h>
+1
drivers/staging/olpc_dcon/olpc_dcon.c
··· 22 22 #include <linux/device.h> 23 23 #include <linux/uaccess.h> 24 24 #include <linux/ctype.h> 25 + #include <linux/panic_notifier.h> 25 26 #include <linux/reboot.h> 26 27 #include <linux/olpc-ec.h> 27 28 #include <asm/tsc.h>
+1
drivers/video/fbdev/hyperv_fb.c
··· 52 52 #include <linux/completion.h> 53 53 #include <linux/fb.h> 54 54 #include <linux/pci.h> 55 + #include <linux/panic_notifier.h> 55 56 #include <linux/efi.h> 56 57 #include <linux/console.h> 57 58
+2
drivers/virtio/virtio_mem.c
··· 1065 1065 static void virtio_mem_set_fake_offline(unsigned long pfn, 1066 1066 unsigned long nr_pages, bool onlined) 1067 1067 { 1068 + page_offline_begin(); 1068 1069 for (; nr_pages--; pfn++) { 1069 1070 struct page *page = pfn_to_page(pfn); 1070 1071 ··· 1076 1075 ClearPageReserved(page); 1077 1076 } 1078 1077 } 1078 + page_offline_end(); 1079 1079 } 1080 1080 1081 1081 /*
+15
fs/Kconfig
··· 240 240 config HUGETLB_PAGE 241 241 def_bool HUGETLBFS 242 242 243 + config HUGETLB_PAGE_FREE_VMEMMAP 244 + def_bool HUGETLB_PAGE 245 + depends on X86_64 246 + depends on SPARSEMEM_VMEMMAP 247 + 248 + config HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON 249 + bool "Default freeing vmemmap pages of HugeTLB to on" 250 + default n 251 + depends on HUGETLB_PAGE_FREE_VMEMMAP 252 + help 253 + When using HUGETLB_PAGE_FREE_VMEMMAP, the freeing unused vmemmap 254 + pages associated with each HugeTLB page is default off. Say Y here 255 + to enable freeing vmemmap pages of HugeTLB by default. It can then 256 + be disabled on the command line via hugetlb_free_vmemmap=off. 257 + 243 258 config MEMFD_CREATE 244 259 def_bool TMPFS || HUGETLBFS 245 260
-3
fs/exec.c
··· 84 84 85 85 void __register_binfmt(struct linux_binfmt * fmt, int insert) 86 86 { 87 - BUG_ON(!fmt); 88 - if (WARN_ON(!fmt->load_binary)) 89 - return; 90 87 write_lock(&binfmt_lock); 91 88 insert ? list_add(&fmt->lh, &formats) : 92 89 list_add_tail(&fmt->lh, &formats);
+5
fs/hfsplus/inode.c
··· 281 281 struct inode *inode = d_inode(path->dentry); 282 282 struct hfsplus_inode_info *hip = HFSPLUS_I(inode); 283 283 284 + if (request_mask & STATX_BTIME) { 285 + stat->result_mask |= STATX_BTIME; 286 + stat->btime = hfsp_mt2ut(hip->create_date); 287 + } 288 + 284 289 if (inode->i_flags & S_APPEND) 285 290 stat->attributes |= STATX_ATTR_APPEND; 286 291 if (inode->i_flags & S_IMMUTABLE)
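The hfsplus change above reports the on-disk create_date as kstat's btime via hfsp_mt2ut(). HFS+ timestamps count seconds from 1904-01-01 00:00:00 UTC rather than the Unix epoch, so the conversion amounts to subtracting the fixed 1904-to-1970 offset; this sketch shows that arithmetic (the helper name and macro here are illustrative, not the kernel's):

```c
#include <assert.h>
#include <stdint.h>

/* HFS+ stores dates as seconds since 1904-01-01 00:00:00 UTC; Unix time
 * starts at 1970-01-01.  The gap is 24107 days.  This mirrors what the
 * kernel's hfsp_mt2ut() helper does on the create_date field. */
#define HFSP_EPOCH_OFFSET 2082844800U   /* 24107 * 86400 */

static int64_t hfsp_mac_to_unix(uint32_t mac_secs)
{
    return (int64_t)mac_secs - HFSP_EPOCH_OFFSET;
}
```

Note the widening to a signed 64-bit value: HFS+ dates earlier than 1970 map to negative Unix times, which a 32-bit unsigned subtraction would silently wrap.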
-1
fs/hfsplus/xattr.c
··· 204 204 205 205 buf = kzalloc(node_size, GFP_NOFS); 206 206 if (!buf) { 207 - pr_err("failed to allocate memory for header node\n"); 208 207 err = -ENOMEM; 209 208 goto end_attr_file_creation; 210 209 }
+1 -1
fs/nfsd/nfs4state.c
··· 2351 2351 static void seq_quote_mem(struct seq_file *m, char *data, int len) 2352 2352 { 2353 2353 seq_printf(m, "\""); 2354 - seq_escape_mem_ascii(m, data, len); 2354 + seq_escape_mem(m, data, len, ESCAPE_HEX | ESCAPE_NAP | ESCAPE_APPEND, "\"\\"); 2355 2355 seq_printf(m, "\""); 2356 2356 } 2357 2357
-1
fs/nilfs2/btree.c
··· 738 738 if (ptr2 != ptr + cnt || ++cnt == maxblocks) 739 739 goto end; 740 740 index++; 741 - continue; 742 741 } 743 742 if (level == maxlevel) 744 743 break;
+11 -2
fs/open.c
··· 852 852 * XXX: Huge page cache doesn't support writing yet. Drop all page 853 853 * cache for this file before processing writes. 854 854 */ 855 - if ((f->f_mode & FMODE_WRITE) && filemap_nr_thps(inode->i_mapping)) 856 - truncate_pagecache(inode, 0); 855 + if (f->f_mode & FMODE_WRITE) { 856 + /* 857 + * Paired with smp_mb() in collapse_file() to ensure nr_thps 858 + * is up to date and the update to i_writecount by 859 + * get_write_access() is visible. Ensures subsequent insertion 860 + * of THPs into the page cache will fail. 861 + */ 862 + smp_mb(); 863 + if (filemap_nr_thps(inode->i_mapping)) 864 + truncate_pagecache(inode, 0); 865 + } 857 866 858 867 return 0; 859 868
+3 -3
fs/proc/base.c
··· 854 854 flags = FOLL_FORCE | (write ? FOLL_WRITE : 0); 855 855 856 856 while (count > 0) { 857 - int this_len = min_t(int, count, PAGE_SIZE); 857 + size_t this_len = min_t(size_t, count, PAGE_SIZE); 858 858 859 859 if (write && copy_from_user(page, buf, this_len)) { 860 860 copied = -EFAULT; ··· 3172 3172 DIR("task", S_IRUGO|S_IXUGO, proc_task_inode_operations, proc_task_operations), 3173 3173 DIR("fd", S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations), 3174 3174 DIR("map_files", S_IRUSR|S_IXUSR, proc_map_files_inode_operations, proc_map_files_operations), 3175 - DIR("fdinfo", S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations), 3175 + DIR("fdinfo", S_IRUGO|S_IXUGO, proc_fdinfo_inode_operations, proc_fdinfo_operations), 3176 3176 DIR("ns", S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations), 3177 3177 #ifdef CONFIG_NET 3178 3178 DIR("net", S_IRUGO|S_IXUGO, proc_net_inode_operations, proc_net_operations), ··· 3517 3517 */ 3518 3518 static const struct pid_entry tid_base_stuff[] = { 3519 3519 DIR("fd", S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations), 3520 - DIR("fdinfo", S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations), 3520 + DIR("fdinfo", S_IRUGO|S_IXUGO, proc_fdinfo_inode_operations, proc_fdinfo_operations), 3521 3521 DIR("ns", S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations), 3522 3522 #ifdef CONFIG_NET 3523 3523 DIR("net", S_IRUGO|S_IXUGO, proc_net_inode_operations, proc_net_operations),
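The fs/proc/base.c hunk above widens `this_len` from `int` to `size_t` in `min_t()`. The bug it avoids: casting a large `size_t` count down to `int` can produce a negative value that then "wins" the comparison against PAGE_SIZE. A hedged sketch of the failure mode, using a simplified stand-in for the kernel's min_t macro:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified min_t(): cast both operands to the given type, then take the
 * smaller -- the same shape as the kernel macro, minus its type checking. */
#define min_t(type, x, y) ((type)(x) < (type)(y) ? (type)(x) : (type)(y))

#define DEMO_PAGE_SIZE 4096UL   /* illustrative page size */

/* With type == int, a count at or above 2^31 truncates to a negative int,
 * which compares below DEMO_PAGE_SIZE and becomes the "length". */
static long chunk_len_buggy(size_t count)
{
    return (long)min_t(int, count, DEMO_PAGE_SIZE);
}

/* With type == size_t the result is correctly capped at the page size. */
static size_t chunk_len_fixed(size_t count)
{
    return min_t(size_t, count, DEMO_PAGE_SIZE);
}
```

In the proc code the negative length would then feed into copy_from_user()/access_remote_vm(); keeping the arithmetic in `size_t` sidesteps the truncation entirely.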
+17 -3
fs/proc/fd.c
··· 6 6 #include <linux/fdtable.h> 7 7 #include <linux/namei.h> 8 8 #include <linux/pid.h> 9 + #include <linux/ptrace.h> 9 10 #include <linux/security.h> 10 11 #include <linux/file.h> 11 12 #include <linux/seq_file.h> ··· 54 53 if (ret) 55 54 return ret; 56 55 57 - seq_printf(m, "pos:\t%lli\nflags:\t0%o\nmnt_id:\t%i\n", 56 + seq_printf(m, "pos:\t%lli\nflags:\t0%o\nmnt_id:\t%i\nino:\t%lu\n", 58 57 (long long)file->f_pos, f_flags, 59 - real_mount(file->f_path.mnt)->mnt_id); 58 + real_mount(file->f_path.mnt)->mnt_id, 59 + file_inode(file)->i_ino); 60 60 61 61 /* show_fd_locks() never deferences files so a stale value is safe */ 62 62 show_fd_locks(m, file, files); ··· 74 72 75 73 static int seq_fdinfo_open(struct inode *inode, struct file *file) 76 74 { 75 + bool allowed = false; 76 + struct task_struct *task = get_proc_task(inode); 77 + 78 + if (!task) 79 + return -ESRCH; 80 + 81 + allowed = ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS); 82 + put_task_struct(task); 83 + 84 + if (!allowed) 85 + return -EACCES; 86 + 77 87 return single_open(file, seq_show, inode); 78 88 } 79 89 ··· 322 308 struct proc_inode *ei; 323 309 struct inode *inode; 324 310 325 - inode = proc_pid_make_inode(dentry->d_sb, task, S_IFREG | S_IRUSR); 311 + inode = proc_pid_make_inode(dentry->d_sb, task, S_IFREG | S_IRUGO); 326 312 if (!inode) 327 313 return ERR_PTR(-ENOENT); 328 314
+54 -13
fs/proc/kcore.c
··· 313 313 { 314 314 char *buf = file->private_data; 315 315 size_t phdrs_offset, notes_offset, data_offset; 316 + size_t page_offline_frozen = 1; 316 317 size_t phdrs_len, notes_len; 317 318 struct kcore_list *m; 318 319 size_t tsz; ··· 323 322 int ret = 0; 324 323 325 324 down_read(&kclist_lock); 325 + /* 326 + * Don't race against drivers that set PageOffline() and expect no 327 + * further page access. 328 + */ 329 + page_offline_freeze(); 326 330 327 331 get_kcore_size(&nphdr, &phdrs_len, &notes_len, &data_offset); 328 332 phdrs_offset = sizeof(struct elfhdr); ··· 386 380 phdr->p_type = PT_LOAD; 387 381 phdr->p_flags = PF_R | PF_W | PF_X; 388 382 phdr->p_offset = kc_vaddr_to_offset(m->addr) + data_offset; 389 - if (m->type == KCORE_REMAP) 390 - phdr->p_vaddr = (size_t)m->vaddr; 391 - else 392 - phdr->p_vaddr = (size_t)m->addr; 393 - if (m->type == KCORE_RAM || m->type == KCORE_REMAP) 383 + phdr->p_vaddr = (size_t)m->addr; 384 + if (m->type == KCORE_RAM) 394 385 phdr->p_paddr = __pa(m->addr); 395 386 else if (m->type == KCORE_TEXT) 396 387 phdr->p_paddr = __pa_symbol(m->addr); ··· 471 468 472 469 m = NULL; 473 470 while (buflen) { 471 + struct page *page; 472 + unsigned long pfn; 473 + 474 474 /* 475 475 * If this is the first iteration or the address is not within 476 476 * the previous entry, search for a matching entry. 
··· 486 480 } 487 481 } 488 482 483 + if (page_offline_frozen++ % MAX_ORDER_NR_PAGES == 0) { 484 + page_offline_thaw(); 485 + cond_resched(); 486 + page_offline_freeze(); 487 + } 488 + 489 489 if (&m->list == &kclist_head) { 490 490 if (clear_user(buffer, tsz)) { 491 491 ret = -EFAULT; 492 492 goto out; 493 493 } 494 494 m = NULL; /* skip the list anchor */ 495 - } else if (!pfn_is_ram(__pa(start) >> PAGE_SHIFT)) { 496 - if (clear_user(buffer, tsz)) { 497 - ret = -EFAULT; 498 - goto out; 499 - } 500 - } else if (m->type == KCORE_VMALLOC) { 495 + goto skip; 496 + } 497 + 498 + switch (m->type) { 499 + case KCORE_VMALLOC: 501 500 vread(buf, (char *)start, tsz); 502 501 /* we have to zero-fill user buffer even if no read */ 503 502 if (copy_to_user(buffer, buf, tsz)) { 504 503 ret = -EFAULT; 505 504 goto out; 506 505 } 507 - } else if (m->type == KCORE_USER) { 506 + break; 507 + case KCORE_USER: 508 508 /* User page is handled prior to normal kernel page: */ 509 509 if (copy_to_user(buffer, (char *)start, tsz)) { 510 510 ret = -EFAULT; 511 511 goto out; 512 512 } 513 - } else { 513 + break; 514 + case KCORE_RAM: 515 + pfn = __pa(start) >> PAGE_SHIFT; 516 + page = pfn_to_online_page(pfn); 517 + 518 + /* 519 + * Don't read offline sections, logically offline pages 520 + * (e.g., inflated in a balloon), hwpoisoned pages, 521 + * and explicitly excluded physical ranges. 
522 + */ 523 + if (!page || PageOffline(page) || 524 + is_page_hwpoison(page) || !pfn_is_ram(pfn)) { 525 + if (clear_user(buffer, tsz)) { 526 + ret = -EFAULT; 527 + goto out; 528 + } 529 + break; 530 + } 531 + fallthrough; 532 + case KCORE_VMEMMAP: 533 + case KCORE_TEXT: 514 534 if (kern_addr_valid(start)) { 515 535 /* 516 536 * Using bounce buffer to bypass the ··· 560 528 goto out; 561 529 } 562 530 } 531 + break; 532 + default: 533 + pr_warn_once("Unhandled KCORE type: %d\n", m->type); 534 + if (clear_user(buffer, tsz)) { 535 + ret = -EFAULT; 536 + goto out; 537 + } 563 538 } 539 + skip: 564 540 buflen -= tsz; 565 541 *fpos += tsz; 566 542 buffer += tsz; ··· 577 537 } 578 538 579 539 out: 540 + page_offline_thaw(); 580 541 up_read(&kclist_lock); 581 542 if (ret) 582 543 return ret;
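The kcore rework above turns the old if/else chain into a switch keyed on the region type, with KCORE_RAM falling through to the ordinary copy path only when the page is safe to touch. A userspace sketch of that dispatch, where boolean flags stand in for the kernel's pfn_to_online_page()/PageOffline()/hwpoison/pfn_is_ram() checks:

```c
#include <assert.h>
#include <stdbool.h>

/* Region types and outcomes mirroring the patch's switch; the predicate
 * parameters are illustrative stand-ins for the kernel helpers. */
enum kcore_type { KC_VMALLOC, KC_USER, KC_RAM, KC_TEXT };
enum action { ZERO_FILL, COPY_OUT };

static enum action kcore_action(enum kcore_type type, bool page_online,
                                bool page_offline_flag, bool hwpoison)
{
    switch (type) {
    case KC_RAM:
        /* Offline, logically offline (e.g. ballooned), or poisoned
         * pages must not be read: hand userspace zeroes instead. */
        if (!page_online || page_offline_flag || hwpoison)
            return ZERO_FILL;
        /* fall through: a healthy RAM page is copied like kernel text */
    case KC_VMALLOC:
    case KC_USER:
    case KC_TEXT:
        return COPY_OUT;
    }
    return ZERO_FILL;
}
```

Structuring it as a switch with an explicit fallthrough is what lets the patch bolt the new online/offline/hwpoison screening onto KCORE_RAM without duplicating the bounce-buffer copy path.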
+18 -16
fs/proc/task_mmu.c
··· 514 514 } else { 515 515 mss->swap_pss += (u64)PAGE_SIZE << PSS_SHIFT; 516 516 } 517 - } else if (is_migration_entry(swpent)) 518 - page = migration_entry_to_page(swpent); 519 - else if (is_device_private_entry(swpent)) 520 - page = device_private_entry_to_page(swpent); 517 + } else if (is_pfn_swap_entry(swpent)) 518 + page = pfn_swap_entry_to_page(swpent); 521 519 } else if (unlikely(IS_ENABLED(CONFIG_SHMEM) && mss->check_shmem_swap 522 520 && pte_none(*pte))) { 523 521 page = xa_load(&vma->vm_file->f_mapping->i_pages, ··· 547 549 swp_entry_t entry = pmd_to_swp_entry(*pmd); 548 550 549 551 if (is_migration_entry(entry)) 550 - page = migration_entry_to_page(entry); 552 + page = pfn_swap_entry_to_page(entry); 551 553 } 552 554 if (IS_ERR_OR_NULL(page)) 553 555 return; ··· 692 694 } else if (is_swap_pte(*pte)) { 693 695 swp_entry_t swpent = pte_to_swp_entry(*pte); 694 696 695 - if (is_migration_entry(swpent)) 696 - page = migration_entry_to_page(swpent); 697 - else if (is_device_private_entry(swpent)) 698 - page = device_private_entry_to_page(swpent); 697 + if (is_pfn_swap_entry(swpent)) 698 + page = pfn_swap_entry_to_page(swpent); 699 699 } 700 700 if (page) { 701 701 int mapcount = page_mapcount(page); ··· 828 832 __show_smap(m, &mss, false); 829 833 830 834 seq_printf(m, "THPeligible: %d\n", 831 - transparent_hugepage_enabled(vma)); 835 + transparent_hugepage_active(vma)); 832 836 833 837 if (arch_pkeys_enabled()) 834 838 seq_printf(m, "ProtectionKey: %8u\n", vma_pkey(vma)); ··· 1298 1302 #define PM_PFRAME_MASK GENMASK_ULL(PM_PFRAME_BITS - 1, 0) 1299 1303 #define PM_SOFT_DIRTY BIT_ULL(55) 1300 1304 #define PM_MMAP_EXCLUSIVE BIT_ULL(56) 1305 + #define PM_UFFD_WP BIT_ULL(57) 1301 1306 #define PM_FILE BIT_ULL(61) 1302 1307 #define PM_SWAP BIT_ULL(62) 1303 1308 #define PM_PRESENT BIT_ULL(63) ··· 1372 1375 page = vm_normal_page(vma, addr, pte); 1373 1376 if (pte_soft_dirty(pte)) 1374 1377 flags |= PM_SOFT_DIRTY; 1378 + if (pte_uffd_wp(pte)) 1379 + flags |= 
PM_UFFD_WP; 1375 1380 } else if (is_swap_pte(pte)) { 1376 1381 swp_entry_t entry; 1377 1382 if (pte_swp_soft_dirty(pte)) 1378 1383 flags |= PM_SOFT_DIRTY; 1384 + if (pte_swp_uffd_wp(pte)) 1385 + flags |= PM_UFFD_WP; 1379 1386 entry = pte_to_swp_entry(pte); 1380 1387 if (pm->show_pfn) 1381 1388 frame = swp_type(entry) | 1382 1389 (swp_offset(entry) << MAX_SWAPFILES_SHIFT); 1383 1390 flags |= PM_SWAP; 1384 - if (is_migration_entry(entry)) 1385 - page = migration_entry_to_page(entry); 1386 - 1387 - if (is_device_private_entry(entry)) 1388 - page = device_private_entry_to_page(entry); 1391 + if (is_pfn_swap_entry(entry)) 1392 + page = pfn_swap_entry_to_page(entry); 1389 1393 } 1390 1394 1391 1395 if (page && !PageAnon(page)) ··· 1424 1426 flags |= PM_PRESENT; 1425 1427 if (pmd_soft_dirty(pmd)) 1426 1428 flags |= PM_SOFT_DIRTY; 1429 + if (pmd_uffd_wp(pmd)) 1430 + flags |= PM_UFFD_WP; 1427 1431 if (pm->show_pfn) 1428 1432 frame = pmd_pfn(pmd) + 1429 1433 ((addr & ~PMD_MASK) >> PAGE_SHIFT); ··· 1444 1444 flags |= PM_SWAP; 1445 1445 if (pmd_swp_soft_dirty(pmd)) 1446 1446 flags |= PM_SOFT_DIRTY; 1447 + if (pmd_swp_uffd_wp(pmd)) 1448 + flags |= PM_UFFD_WP; 1447 1449 VM_BUG_ON(!is_pmd_migration_entry(pmd)); 1448 - page = migration_entry_to_page(entry); 1450 + page = pfn_swap_entry_to_page(entry); 1449 1451 } 1450 1452 #endif 1451 1453
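The task_mmu.c hunk above adds PM_UFFD_WP at bit 57 of each `/proc/pid/pagemap` entry, alongside the existing soft-dirty (55) and exclusive-mapping (56) bits. A small sketch of how userspace would decode an entry with the new bit (mask names copied from the patch; the helper functions are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Bit layout of a /proc/pid/pagemap entry as extended by the patch:
 * bits 0-54 PFN (or swap info), 55 soft-dirty, 56 exclusively mapped,
 * 57 userfaultfd write-protected (new), 61 file/shared, 62 swapped,
 * 63 present. */
#define PM_PFRAME_BITS    55
#define PM_PFRAME_MASK    ((1ULL << PM_PFRAME_BITS) - 1)
#define PM_SOFT_DIRTY     (1ULL << 55)
#define PM_MMAP_EXCLUSIVE (1ULL << 56)
#define PM_UFFD_WP        (1ULL << 57)
#define PM_FILE           (1ULL << 61)
#define PM_SWAP           (1ULL << 62)
#define PM_PRESENT        (1ULL << 63)

static int entry_present(uint64_t e) { return !!(e & PM_PRESENT); }
static int entry_uffd_wp(uint64_t e) { return !!(e & PM_UFFD_WP); }
static uint64_t entry_pfn(uint64_t e) { return e & PM_PFRAME_MASK; }
```

As the diff shows, the kernel sets the bit in all three encodings a page can have -- present pte, swap pte, and (pmd-mapped) THP -- so a decoder only needs this one mask.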
+26 -17
fs/seq_file.c
··· 356 356 EXPORT_SYMBOL(seq_release); 357 357 358 358 /** 359 + * seq_escape_mem - print data into buffer, escaping some characters 360 + * @m: target buffer 361 + * @src: source buffer 362 + * @len: size of source buffer 363 + * @flags: flags to pass to string_escape_mem() 364 + * @esc: set of characters that need escaping 365 + * 366 + * Puts data into buffer, replacing each occurrence of character from 367 + * given class (defined by @flags and @esc) with printable escaped sequence. 368 + * 369 + * Use seq_has_overflowed() to check for errors. 370 + */ 371 + void seq_escape_mem(struct seq_file *m, const char *src, size_t len, 372 + unsigned int flags, const char *esc) 373 + { 374 + char *buf; 375 + size_t size = seq_get_buf(m, &buf); 376 + int ret; 377 + 378 + ret = string_escape_mem(src, len, buf, size, flags, esc); 379 + seq_commit(m, ret < size ? ret : -1); 380 + } 381 + EXPORT_SYMBOL(seq_escape_mem); 382 + 383 + /** 359 384 * seq_escape - print string into buffer, escaping some characters 360 385 * @m: target buffer 361 386 * @s: string ··· 392 367 */ 393 368 void seq_escape(struct seq_file *m, const char *s, const char *esc) 394 369 { 395 - char *buf; 396 - size_t size = seq_get_buf(m, &buf); 397 - int ret; 398 - 399 - ret = string_escape_str(s, buf, size, ESCAPE_OCTAL, esc); 400 - seq_commit(m, ret < size ? ret : -1); 370 + seq_escape_str(m, s, ESCAPE_OCTAL, esc); 401 371 } 402 372 EXPORT_SYMBOL(seq_escape); 403 - 404 - void seq_escape_mem_ascii(struct seq_file *m, const char *src, size_t isz) 405 - { 406 - char *buf; 407 - size_t size = seq_get_buf(m, &buf); 408 - int ret; 409 - 410 - ret = string_escape_mem_ascii(src, isz, buf, size); 411 - seq_commit(m, ret < size ? ret : -1); 412 - } 413 - EXPORT_SYMBOL(seq_escape_mem_ascii); 414 373 415 374 void seq_vprintf(struct seq_file *m, const char *f, va_list args) 416 375 {
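The seq_file.c hunk above generalizes seq_escape() into seq_escape_mem(): escape a byte range into whatever space the seq buffer has left, and commit -1 on overflow so the caller's buffer gets enlarged and the write retried. A hedged userspace sketch of that contract, using \xHH escaping in place of the kernel's string_escape_mem() (function name and escaping style here are illustrative):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Escape bytes found in 'esc' as \xHH into a fixed-size buffer.  Returns
 * the number of bytes written, or -1 when the escaped form does not fit --
 * mirroring how seq_escape_mem() signals overflow via seq_commit(m, -1). */
static int escape_mem_hex(const char *src, size_t len,
                          char *dst, size_t size, const char *esc)
{
    static const char hexd[] = "0123456789abcdef";
    size_t out = 0;

    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)src[i];

        if (c && strchr(esc, c)) {
            if (out + 4 > size)
                return -1;
            dst[out++] = '\\';
            dst[out++] = 'x';
            dst[out++] = hexd[c >> 4];
            dst[out++] = hexd[c & 0xf];
        } else {
            if (out + 1 > size)
                return -1;
            dst[out++] = c;
        }
    }
    return (int)out;
}
```

This is the shape the nfsd caller in this series relies on: seq_quote_mem() passes `"\"\\"` as the escape set so quotes and backslashes inside client identifiers come out as hex escapes.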
+11 -4
fs/userfaultfd.c
··· 1267 1267 } 1268 1268 1269 1269 if (vm_flags & VM_UFFD_MINOR) { 1270 - /* FIXME: Add minor fault interception for shmem. */ 1271 - if (!is_vm_hugetlb_page(vma)) 1270 + if (!(is_vm_hugetlb_page(vma) || vma_is_shmem(vma))) 1272 1271 return false; 1273 1272 } 1274 1273 ··· 1303 1304 vm_flags = 0; 1304 1305 if (uffdio_register.mode & UFFDIO_REGISTER_MODE_MISSING) 1305 1306 vm_flags |= VM_UFFD_MISSING; 1306 - if (uffdio_register.mode & UFFDIO_REGISTER_MODE_WP) 1307 + if (uffdio_register.mode & UFFDIO_REGISTER_MODE_WP) { 1308 + #ifndef CONFIG_HAVE_ARCH_USERFAULTFD_WP 1309 + goto out; 1310 + #endif 1307 1311 vm_flags |= VM_UFFD_WP; 1312 + } 1308 1313 if (uffdio_register.mode & UFFDIO_REGISTER_MODE_MINOR) { 1309 1314 #ifndef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR 1310 1315 goto out; ··· 1944 1941 /* report all available features and ioctls to userland */ 1945 1942 uffdio_api.features = UFFD_API_FEATURES; 1946 1943 #ifndef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR 1947 - uffdio_api.features &= ~UFFD_FEATURE_MINOR_HUGETLBFS; 1944 + uffdio_api.features &= 1945 + ~(UFFD_FEATURE_MINOR_HUGETLBFS | UFFD_FEATURE_MINOR_SHMEM); 1946 + #endif 1947 + #ifndef CONFIG_HAVE_ARCH_USERFAULTFD_WP 1948 + uffdio_api.features &= ~UFFD_FEATURE_PAGEFAULT_FLAG_WP; 1948 1949 #endif 1949 1950 uffdio_api.ioctls = UFFD_API_IOCTLS; 1950 1951 ret = -EFAULT;
+2 -1
include/asm-generic/bug.h
··· 18 18 #endif 19 19 20 20 #ifndef __ASSEMBLY__ 21 - #include <linux/kernel.h> 21 + #include <linux/panic.h> 22 + #include <linux/printk.h> 22 23 23 24 #ifdef CONFIG_BUG 24 25
+2 -1
include/linux/ascii85.h
··· 8 8 #ifndef _ASCII85_H_ 9 9 #define _ASCII85_H_ 10 10 11 - #include <linux/kernel.h> 11 + #include <linux/math.h> 12 + #include <linux/types.h> 12 13 13 14 #define ASCII85_BUFSZ 6 14 15
+66
include/linux/bootmem_info.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + #ifndef __LINUX_BOOTMEM_INFO_H 3 + #define __LINUX_BOOTMEM_INFO_H 4 + 5 + #include <linux/mm.h> 6 + 7 + /* 8 + * Types for free bootmem stored in page->lru.next. These have to be in 9 + * some random range in unsigned long space for debugging purposes. 10 + */ 11 + enum { 12 + MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE = 12, 13 + SECTION_INFO = MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE, 14 + MIX_SECTION_INFO, 15 + NODE_INFO, 16 + MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE = NODE_INFO, 17 + }; 18 + 19 + #ifdef CONFIG_HAVE_BOOTMEM_INFO_NODE 20 + void __init register_page_bootmem_info_node(struct pglist_data *pgdat); 21 + 22 + void get_page_bootmem(unsigned long info, struct page *page, 23 + unsigned long type); 24 + void put_page_bootmem(struct page *page); 25 + 26 + /* 27 + * Any memory allocated via the memblock allocator and not via the 28 + * buddy will be marked reserved already in the memmap. For those 29 + * pages, we can call this function to free it to buddy allocator. 30 + */ 31 + static inline void free_bootmem_page(struct page *page) 32 + { 33 + unsigned long magic = (unsigned long)page->freelist; 34 + 35 + /* 36 + * The reserve_bootmem_region sets the reserved flag on bootmem 37 + * pages. 38 + */ 39 + VM_BUG_ON_PAGE(page_ref_count(page) != 2, page); 40 + 41 + if (magic == SECTION_INFO || magic == MIX_SECTION_INFO) 42 + put_page_bootmem(page); 43 + else 44 + VM_BUG_ON_PAGE(1, page); 45 + } 46 + #else 47 + static inline void register_page_bootmem_info_node(struct pglist_data *pgdat) 48 + { 49 + } 50 + 51 + static inline void put_page_bootmem(struct page *page) 52 + { 53 + } 54 + 55 + static inline void get_page_bootmem(unsigned long info, struct page *page, 56 + unsigned long type) 57 + { 58 + } 59 + 60 + static inline void free_bootmem_page(struct page *page) 61 + { 62 + free_reserved_page(page); 63 + } 64 + #endif 65 + 66 + #endif /* __LINUX_BOOTMEM_INFO_H */
-2
include/linux/compat.h
··· 532 532 &__uss->ss_sp, label); \ 533 533 unsafe_put_user(t->sas_ss_flags, &__uss->ss_flags, label); \ 534 534 unsafe_put_user(t->sas_ss_size, &__uss->ss_size, label); \ 535 - if (t->sas_ss_flags & SS_AUTODISARM) \ 536 - sas_ss_reset(t); \ 537 535 } while (0); 538 536 539 537 /*
+17
include/linux/compiler-clang.h
··· 13 13 /* all clang versions usable with the kernel support KASAN ABI version 5 */ 14 14 #define KASAN_ABI_VERSION 5 15 15 16 + /* 17 + * Note: Checking __has_feature(*_sanitizer) is only true if the feature is 18 + * enabled. Therefore it is not required to additionally check defined(CONFIG_*) 19 + * to avoid adding redundant attributes in other configurations. 20 + */ 21 + 16 22 #if __has_feature(address_sanitizer) || __has_feature(hwaddress_sanitizer) 17 23 /* Emulate GCC's __SANITIZE_ADDRESS__ flag */ 18 24 #define __SANITIZE_ADDRESS__ ··· 49 43 __attribute__((no_sanitize("undefined"))) 50 44 #else 51 45 #define __no_sanitize_undefined 46 + #endif 47 + 48 + /* 49 + * Support for __has_feature(coverage_sanitizer) was added in Clang 13 together 50 + * with no_sanitize("coverage"). Prior versions of Clang support coverage 51 + * instrumentation, but cannot be queried for support by the preprocessor. 52 + */ 53 + #if __has_feature(coverage_sanitizer) 54 + #define __no_sanitize_coverage __attribute__((no_sanitize("coverage"))) 55 + #else 56 + #define __no_sanitize_coverage 52 57 #endif 53 58 54 59 /*
+6
include/linux/compiler-gcc.h
··· 122 122 #define __no_sanitize_undefined 123 123 #endif 124 124 125 + #if defined(CONFIG_KCOV) && __has_attribute(__no_sanitize_coverage__) 126 + #define __no_sanitize_coverage __attribute__((no_sanitize_coverage)) 127 + #else 128 + #define __no_sanitize_coverage 129 + #endif 130 + 125 131 #if GCC_VERSION >= 50100 126 132 #define COMPILER_HAS_GENERIC_BUILTIN_OVERFLOW 1 127 133 #endif
+1 -1
include/linux/compiler_types.h
··· 210 210 /* Section for code which can't be instrumented at all */ 211 211 #define noinstr \ 212 212 noinline notrace __attribute((__section__(".noinstr.text"))) \ 213 - __no_kcsan __no_sanitize_address __no_profile 213 + __no_kcsan __no_sanitize_address __no_profile __no_sanitize_coverage 214 214 215 215 #endif /* __KERNEL__ */ 216 216
+40 -30
include/linux/huge_mm.h
··· 10 10 vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf); 11 11 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, 12 12 pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr, 13 - struct vm_area_struct *vma); 14 - void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd); 13 + struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma); 14 + void huge_pmd_set_accessed(struct vm_fault *vmf); 15 15 int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm, 16 16 pud_t *dst_pud, pud_t *src_pud, unsigned long addr, 17 17 struct vm_area_struct *vma); ··· 24 24 } 25 25 #endif 26 26 27 - vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd); 27 + vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf); 28 28 struct page *follow_trans_huge_pmd(struct vm_area_struct *vma, 29 29 unsigned long addr, pmd_t *pmd, 30 30 unsigned int flags); ··· 115 115 116 116 extern unsigned long transparent_hugepage_flags; 117 117 118 + static inline bool transhuge_vma_suitable(struct vm_area_struct *vma, 119 + unsigned long haddr) 120 + { 121 + /* Don't have to check pgoff for anonymous vma */ 122 + if (!vma_is_anonymous(vma)) { 123 + if (!IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff, 124 + HPAGE_PMD_NR)) 125 + return false; 126 + } 127 + 128 + if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end) 129 + return false; 130 + return true; 131 + } 132 + 133 + static inline bool transhuge_vma_enabled(struct vm_area_struct *vma, 134 + unsigned long vm_flags) 135 + { 136 + /* Explicitly disabled through madvise. */ 137 + if ((vm_flags & VM_NOHUGEPAGE) || 138 + test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags)) 139 + return false; 140 + return true; 141 + } 142 + 118 143 /* 119 144 * to be used on vmas which are known to support THP. 
120 - * Use transparent_hugepage_enabled otherwise 145 + * Use transparent_hugepage_active otherwise 121 146 */ 122 147 static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma) 123 148 { ··· 153 128 if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_NEVER_DAX)) 154 129 return false; 155 130 156 - if (vma->vm_flags & VM_NOHUGEPAGE) 131 + if (!transhuge_vma_enabled(vma, vma->vm_flags)) 157 132 return false; 158 133 159 134 if (vma_is_temporary_stack(vma)) 160 - return false; 161 - 162 - if (test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags)) 163 135 return false; 164 136 165 137 if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_FLAG)) ··· 172 150 return false; 173 151 } 174 152 175 - bool transparent_hugepage_enabled(struct vm_area_struct *vma); 176 - 177 - #define HPAGE_CACHE_INDEX_MASK (HPAGE_PMD_NR - 1) 178 - 179 - static inline bool transhuge_vma_suitable(struct vm_area_struct *vma, 180 - unsigned long haddr) 181 - { 182 - /* Don't have to check pgoff for anonymous vma */ 183 - if (!vma_is_anonymous(vma)) { 184 - if (((vma->vm_start >> PAGE_SHIFT) & HPAGE_CACHE_INDEX_MASK) != 185 - (vma->vm_pgoff & HPAGE_CACHE_INDEX_MASK)) 186 - return false; 187 - } 188 - 189 - if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end) 190 - return false; 191 - return true; 192 - } 153 + bool transparent_hugepage_active(struct vm_area_struct *vma); 193 154 194 155 #define transparent_hugepage_use_zero_page() \ 195 156 (transparent_hugepage_flags & \ ··· 288 283 struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr, 289 284 pud_t *pud, int flags, struct dev_pagemap **pgmap); 290 285 291 - vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t orig_pmd); 286 + vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf); 292 287 293 288 extern struct page *huge_zero_page; 294 289 extern unsigned long huge_zero_pfn; ··· 359 354 return false; 360 355 } 361 356 362 - static inline bool 
transparent_hugepage_enabled(struct vm_area_struct *vma) 357 + static inline bool transparent_hugepage_active(struct vm_area_struct *vma) 363 358 { 364 359 return false; 365 360 } 366 361 367 362 static inline bool transhuge_vma_suitable(struct vm_area_struct *vma, 368 363 unsigned long haddr) 364 + { 365 + return false; 366 + } 367 + 368 + static inline bool transhuge_vma_enabled(struct vm_area_struct *vma, 369 + unsigned long vm_flags) 369 370 { 370 371 return false; 371 372 } ··· 441 430 return NULL; 442 431 } 443 432 444 - static inline vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, 445 - pmd_t orig_pmd) 433 + static inline vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf) 446 434 { 447 435 return 0; 448 436 }
+38 -4
include/linux/hugetlb.h
··· 29 29 #include <linux/shm.h> 30 30 #include <asm/tlbflush.h> 31 31 32 + /* 33 + * For HugeTLB page, there are more metadata to save in the struct page. But 34 + * the head struct page cannot meet our needs, so we have to abuse other tail 35 + * struct page to store the metadata. In order to avoid conflicts caused by 36 + * subsequent use of more tail struct pages, we gather these discrete indexes 37 + * of tail struct page here. 38 + */ 39 + enum { 40 + SUBPAGE_INDEX_SUBPOOL = 1, /* reuse page->private */ 41 + #ifdef CONFIG_CGROUP_HUGETLB 42 + SUBPAGE_INDEX_CGROUP, /* reuse page->private */ 43 + SUBPAGE_INDEX_CGROUP_RSVD, /* reuse page->private */ 44 + __MAX_CGROUP_SUBPAGE_INDEX = SUBPAGE_INDEX_CGROUP_RSVD, 45 + #endif 46 + __NR_USED_SUBPAGE, 47 + }; 48 + 32 49 struct hugepage_subpool { 33 50 spinlock_t lock; 34 51 long count; ··· 532 515 * modifications require hugetlb_lock. 533 516 * HPG_freed - Set when page is on the free lists. 534 517 * Synchronization: hugetlb_lock held for examination and modification. 518 + * HPG_vmemmap_optimized - Set when the vmemmap pages of the page are freed. 
535 519 */ 536 520 enum hugetlb_page_flags { 537 521 HPG_restore_reserve = 0, 538 522 HPG_migratable, 539 523 HPG_temporary, 540 524 HPG_freed, 525 + HPG_vmemmap_optimized, 541 526 __NR_HPAGEFLAGS, 542 527 }; 543 528 ··· 585 566 HPAGEFLAG(Migratable, migratable) 586 567 HPAGEFLAG(Temporary, temporary) 587 568 HPAGEFLAG(Freed, freed) 569 + HPAGEFLAG(VmemmapOptimized, vmemmap_optimized) 588 570 589 571 #ifdef CONFIG_HUGETLB_PAGE 590 572 ··· 608 588 unsigned int nr_huge_pages_node[MAX_NUMNODES]; 609 589 unsigned int free_huge_pages_node[MAX_NUMNODES]; 610 590 unsigned int surplus_huge_pages_node[MAX_NUMNODES]; 591 + #ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP 592 + unsigned int nr_free_vmemmap_pages; 593 + #endif 611 594 #ifdef CONFIG_CGROUP_HUGETLB 612 595 /* cgroup control files */ 613 596 struct cftype cgroup_files_dfl[7]; ··· 658 635 */ 659 636 static inline struct hugepage_subpool *hugetlb_page_subpool(struct page *hpage) 660 637 { 661 - return (struct hugepage_subpool *)(hpage+1)->private; 638 + return (void *)page_private(hpage + SUBPAGE_INDEX_SUBPOOL); 662 639 } 663 640 664 641 static inline void hugetlb_set_page_subpool(struct page *hpage, 665 642 struct hugepage_subpool *subpool) 666 643 { 667 - set_page_private(hpage+1, (unsigned long)subpool); 644 + set_page_private(hpage + SUBPAGE_INDEX_SUBPOOL, (unsigned long)subpool); 668 645 } 669 646 670 647 static inline struct hstate *hstate_file(struct file *f) ··· 741 718 #endif 742 719 743 720 #ifndef arch_make_huge_pte 744 - static inline pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma, 745 - struct page *page, int writable) 721 + static inline pte_t arch_make_huge_pte(pte_t entry, unsigned int shift, 722 + vm_flags_t flags) 746 723 { 747 724 return entry; 748 725 } ··· 898 875 #else /* CONFIG_HUGETLB_PAGE */ 899 876 struct hstate {}; 900 877 878 + static inline struct hugepage_subpool *hugetlb_page_subpool(struct page *hpage) 879 + { 880 + return NULL; 881 + } 882 + 901 883 static inline int 
isolate_or_dissolve_huge_page(struct page *page, 902 884 struct list_head *list) 903 885 { ··· 1055 1027 { 1056 1028 } 1057 1029 #endif /* CONFIG_HUGETLB_PAGE */ 1030 + 1031 + #ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP 1032 + extern bool hugetlb_free_vmemmap_enabled; 1033 + #else 1034 + #define hugetlb_free_vmemmap_enabled false 1035 + #endif 1058 1036 1059 1037 static inline spinlock_t *huge_pte_lock(struct hstate *h, 1060 1038 struct mm_struct *mm, pte_t *pte)
+11 -8
include/linux/hugetlb_cgroup.h
··· 21 21 struct resv_map; 22 22 struct file_region; 23 23 24 + #ifdef CONFIG_CGROUP_HUGETLB 24 25 /* 25 26 * Minimum page order trackable by hugetlb cgroup. 26 27 * At least 4 pages are necessary for all the tracking information. 27 - * The second tail page (hpage[2]) is the fault usage cgroup. 28 - * The third tail page (hpage[3]) is the reservation usage cgroup. 28 + * The second tail page (hpage[SUBPAGE_INDEX_CGROUP]) is the fault 29 + * usage cgroup. The third tail page (hpage[SUBPAGE_INDEX_CGROUP_RSVD]) 30 + * is the reservation usage cgroup. 29 31 */ 30 - #define HUGETLB_CGROUP_MIN_ORDER 2 32 + #define HUGETLB_CGROUP_MIN_ORDER order_base_2(__MAX_CGROUP_SUBPAGE_INDEX + 1) 31 33 32 - #ifdef CONFIG_CGROUP_HUGETLB 33 34 enum hugetlb_memory_event { 34 35 HUGETLB_MAX, 35 36 HUGETLB_NR_MEMORY_EVENTS, ··· 67 66 if (compound_order(page) < HUGETLB_CGROUP_MIN_ORDER) 68 67 return NULL; 69 68 if (rsvd) 70 - return (struct hugetlb_cgroup *)page[3].private; 69 + return (void *)page_private(page + SUBPAGE_INDEX_CGROUP_RSVD); 71 70 else 72 - return (struct hugetlb_cgroup *)page[2].private; 71 + return (void *)page_private(page + SUBPAGE_INDEX_CGROUP); 73 72 } 74 73 75 74 static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page) ··· 91 90 if (compound_order(page) < HUGETLB_CGROUP_MIN_ORDER) 92 91 return -1; 93 92 if (rsvd) 94 - page[3].private = (unsigned long)h_cg; 93 + set_page_private(page + SUBPAGE_INDEX_CGROUP_RSVD, 94 + (unsigned long)h_cg); 95 95 else 96 - page[2].private = (unsigned long)h_cg; 96 + set_page_private(page + SUBPAGE_INDEX_CGROUP, 97 + (unsigned long)h_cg); 97 98 return 0; 98 99 } 99 100
-3
include/linux/kcore.h
··· 11 11 KCORE_RAM, 12 12 KCORE_VMEMMAP, 13 13 KCORE_USER, 14 - KCORE_OTHER, 15 - KCORE_REMAP, 16 14 }; 17 15 18 16 struct kcore_list { 19 17 struct list_head list; 20 18 unsigned long addr; 21 - unsigned long vaddr; 22 19 size_t size; 23 20 int type; 24 21 };
+2 -225
include/linux/kernel.h
··· 10 10 #include <linux/types.h> 11 11 #include <linux/compiler.h> 12 12 #include <linux/bitops.h> 13 + #include <linux/kstrtox.h> 13 14 #include <linux/log2.h> 14 15 #include <linux/math.h> 15 16 #include <linux/minmax.h> 16 17 #include <linux/typecheck.h> 18 + #include <linux/panic.h> 17 19 #include <linux/printk.h> 18 20 #include <linux/build_bug.h> 19 21 #include <linux/static_call_types.h> ··· 86 84 #define lower_16_bits(n) ((u16)((n) & 0xffff)) 87 85 88 86 struct completion; 89 - struct pt_regs; 90 87 struct user; 91 88 92 89 #ifdef CONFIG_PREEMPT_VOLUNTARY ··· 190 189 static inline void might_fault(void) { } 191 190 #endif 192 191 193 - extern struct atomic_notifier_head panic_notifier_list; 194 - extern long (*panic_blink)(int state); 195 - __printf(1, 2) 196 - void panic(const char *fmt, ...) __noreturn __cold; 197 - void nmi_panic(struct pt_regs *regs, const char *msg); 198 - extern void oops_enter(void); 199 - extern void oops_exit(void); 200 - extern bool oops_may_print(void); 201 192 void do_exit(long error_code) __noreturn; 202 193 void complete_and_exit(struct completion *, long) __noreturn; 203 - 204 - /* Internal, do not use. */ 205 - int __must_check _kstrtoul(const char *s, unsigned int base, unsigned long *res); 206 - int __must_check _kstrtol(const char *s, unsigned int base, long *res); 207 - 208 - int __must_check kstrtoull(const char *s, unsigned int base, unsigned long long *res); 209 - int __must_check kstrtoll(const char *s, unsigned int base, long long *res); 210 - 211 - /** 212 - * kstrtoul - convert a string to an unsigned long 213 - * @s: The start of the string. The string must be null-terminated, and may also 214 - * include a single newline before its terminating null. The first character 215 - * may also be a plus sign, but not a minus sign. 216 - * @base: The number base to use. The maximum supported base is 16. 
If base is 217 - * given as 0, then the base of the string is automatically detected with the 218 - * conventional semantics - If it begins with 0x the number will be parsed as a 219 - * hexadecimal (case insensitive), if it otherwise begins with 0, it will be 220 - * parsed as an octal number. Otherwise it will be parsed as a decimal. 221 - * @res: Where to write the result of the conversion on success. 222 - * 223 - * Returns 0 on success, -ERANGE on overflow and -EINVAL on parsing error. 224 - * Preferred over simple_strtoul(). Return code must be checked. 225 - */ 226 - static inline int __must_check kstrtoul(const char *s, unsigned int base, unsigned long *res) 227 - { 228 - /* 229 - * We want to shortcut function call, but 230 - * __builtin_types_compatible_p(unsigned long, unsigned long long) = 0. 231 - */ 232 - if (sizeof(unsigned long) == sizeof(unsigned long long) && 233 - __alignof__(unsigned long) == __alignof__(unsigned long long)) 234 - return kstrtoull(s, base, (unsigned long long *)res); 235 - else 236 - return _kstrtoul(s, base, res); 237 - } 238 - 239 - /** 240 - * kstrtol - convert a string to a long 241 - * @s: The start of the string. The string must be null-terminated, and may also 242 - * include a single newline before its terminating null. The first character 243 - * may also be a plus sign or a minus sign. 244 - * @base: The number base to use. The maximum supported base is 16. If base is 245 - * given as 0, then the base of the string is automatically detected with the 246 - * conventional semantics - If it begins with 0x the number will be parsed as a 247 - * hexadecimal (case insensitive), if it otherwise begins with 0, it will be 248 - * parsed as an octal number. Otherwise it will be parsed as a decimal. 249 - * @res: Where to write the result of the conversion on success. 250 - * 251 - * Returns 0 on success, -ERANGE on overflow and -EINVAL on parsing error. 252 - * Preferred over simple_strtol(). Return code must be checked. 
253 - */ 254 - static inline int __must_check kstrtol(const char *s, unsigned int base, long *res) 255 - { 256 - /* 257 - * We want to shortcut function call, but 258 - * __builtin_types_compatible_p(long, long long) = 0. 259 - */ 260 - if (sizeof(long) == sizeof(long long) && 261 - __alignof__(long) == __alignof__(long long)) 262 - return kstrtoll(s, base, (long long *)res); 263 - else 264 - return _kstrtol(s, base, res); 265 - } 266 - 267 - int __must_check kstrtouint(const char *s, unsigned int base, unsigned int *res); 268 - int __must_check kstrtoint(const char *s, unsigned int base, int *res); 269 - 270 - static inline int __must_check kstrtou64(const char *s, unsigned int base, u64 *res) 271 - { 272 - return kstrtoull(s, base, res); 273 - } 274 - 275 - static inline int __must_check kstrtos64(const char *s, unsigned int base, s64 *res) 276 - { 277 - return kstrtoll(s, base, res); 278 - } 279 - 280 - static inline int __must_check kstrtou32(const char *s, unsigned int base, u32 *res) 281 - { 282 - return kstrtouint(s, base, res); 283 - } 284 - 285 - static inline int __must_check kstrtos32(const char *s, unsigned int base, s32 *res) 286 - { 287 - return kstrtoint(s, base, res); 288 - } 289 - 290 - int __must_check kstrtou16(const char *s, unsigned int base, u16 *res); 291 - int __must_check kstrtos16(const char *s, unsigned int base, s16 *res); 292 - int __must_check kstrtou8(const char *s, unsigned int base, u8 *res); 293 - int __must_check kstrtos8(const char *s, unsigned int base, s8 *res); 294 - int __must_check kstrtobool(const char *s, bool *res); 295 - 296 - int __must_check kstrtoull_from_user(const char __user *s, size_t count, unsigned int base, unsigned long long *res); 297 - int __must_check kstrtoll_from_user(const char __user *s, size_t count, unsigned int base, long long *res); 298 - int __must_check kstrtoul_from_user(const char __user *s, size_t count, unsigned int base, unsigned long *res); 299 - int __must_check kstrtol_from_user(const char 
__user *s, size_t count, unsigned int base, long *res); 300 - int __must_check kstrtouint_from_user(const char __user *s, size_t count, unsigned int base, unsigned int *res); 301 - int __must_check kstrtoint_from_user(const char __user *s, size_t count, unsigned int base, int *res); 302 - int __must_check kstrtou16_from_user(const char __user *s, size_t count, unsigned int base, u16 *res); 303 - int __must_check kstrtos16_from_user(const char __user *s, size_t count, unsigned int base, s16 *res); 304 - int __must_check kstrtou8_from_user(const char __user *s, size_t count, unsigned int base, u8 *res); 305 - int __must_check kstrtos8_from_user(const char __user *s, size_t count, unsigned int base, s8 *res); 306 - int __must_check kstrtobool_from_user(const char __user *s, size_t count, bool *res); 307 - 308 - static inline int __must_check kstrtou64_from_user(const char __user *s, size_t count, unsigned int base, u64 *res) 309 - { 310 - return kstrtoull_from_user(s, count, base, res); 311 - } 312 - 313 - static inline int __must_check kstrtos64_from_user(const char __user *s, size_t count, unsigned int base, s64 *res) 314 - { 315 - return kstrtoll_from_user(s, count, base, res); 316 - } 317 - 318 - static inline int __must_check kstrtou32_from_user(const char __user *s, size_t count, unsigned int base, u32 *res) 319 - { 320 - return kstrtouint_from_user(s, count, base, res); 321 - } 322 - 323 - static inline int __must_check kstrtos32_from_user(const char __user *s, size_t count, unsigned int base, s32 *res) 324 - { 325 - return kstrtoint_from_user(s, count, base, res); 326 - } 327 - 328 - /* 329 - * Use kstrto<foo> instead. 330 - * 331 - * NOTE: simple_strto<foo> does not check for the range overflow and, 332 - * depending on the input, may give interesting results. 333 - * 334 - * Use these functions if and only if you cannot use kstrto<foo>, because 335 - * the conversion ends on the first non-digit character, which may be far 336 - * beyond the supported range. 
It might be useful to parse the strings like 337 - * 10x50 or 12:21 without altering original string or temporary buffer in use. 338 - * Keep in mind above caveat. 339 - */ 340 - 341 - extern unsigned long simple_strtoul(const char *,char **,unsigned int); 342 - extern long simple_strtol(const char *,char **,unsigned int); 343 - extern unsigned long long simple_strtoull(const char *,char **,unsigned int); 344 - extern long long simple_strtoll(const char *,char **,unsigned int); 345 194 346 195 extern int num_to_str(char *buf, int size, 347 196 unsigned long long num, unsigned int width); ··· 235 384 extern int kernel_text_address(unsigned long addr); 236 385 extern int func_ptr_is_kernel_text(void *ptr); 237 386 238 - #ifdef CONFIG_SMP 239 - extern unsigned int sysctl_oops_all_cpu_backtrace; 240 - #else 241 - #define sysctl_oops_all_cpu_backtrace 0 242 - #endif /* CONFIG_SMP */ 243 - 244 387 extern void bust_spinlocks(int yes); 245 - extern int panic_timeout; 246 - extern unsigned long panic_print; 247 - extern int panic_on_oops; 248 - extern int panic_on_unrecovered_nmi; 249 - extern int panic_on_io_nmi; 250 - extern int panic_on_warn; 251 - extern unsigned long panic_on_taint; 252 - extern bool panic_on_taint_nousertaint; 253 - extern int sysctl_panic_on_rcu_stall; 254 - extern int sysctl_max_rcu_stall_to_panic; 255 - extern int sysctl_panic_on_stackoverflow; 256 388 257 - extern bool crash_kexec_post_notifiers; 258 - 259 - /* 260 - * panic_cpu is used for synchronizing panic() and crash_kexec() execution. It 261 - * holds a CPU number which is executing panic() currently. A value of 262 - * PANIC_CPU_INVALID means no CPU has entered panic() or crash_kexec(). 263 - */ 264 - extern atomic_t panic_cpu; 265 - #define PANIC_CPU_INVALID -1 266 - 267 - /* 268 - * Only to be used by arch init code. If the user over-wrote the default 269 - * CONFIG_PANIC_TIMEOUT, honor it. 
270 - */ 271 - static inline void set_arch_panic_timeout(int timeout, int arch_default_timeout) 272 - { 273 - if (panic_timeout == arch_default_timeout) 274 - panic_timeout = timeout; 275 - } 276 - extern const char *print_tainted(void); 277 - enum lockdep_ok { 278 - LOCKDEP_STILL_OK, 279 - LOCKDEP_NOW_UNRELIABLE 280 - }; 281 - extern void add_taint(unsigned flag, enum lockdep_ok); 282 - extern int test_taint(unsigned flag); 283 - extern unsigned long get_taint(void); 284 389 extern int root_mountflags; 285 390 286 391 extern bool early_boot_irqs_disabled; ··· 254 447 SYSTEM_RESTART, 255 448 SYSTEM_SUSPEND, 256 449 } system_state; 257 - 258 - /* This cannot be an enum because some may be used in assembly source. */ 259 - #define TAINT_PROPRIETARY_MODULE 0 260 - #define TAINT_FORCED_MODULE 1 261 - #define TAINT_CPU_OUT_OF_SPEC 2 262 - #define TAINT_FORCED_RMMOD 3 263 - #define TAINT_MACHINE_CHECK 4 264 - #define TAINT_BAD_PAGE 5 265 - #define TAINT_USER 6 266 - #define TAINT_DIE 7 267 - #define TAINT_OVERRIDDEN_ACPI_TABLE 8 268 - #define TAINT_WARN 9 269 - #define TAINT_CRAP 10 270 - #define TAINT_FIRMWARE_WORKAROUND 11 271 - #define TAINT_OOT_MODULE 12 272 - #define TAINT_UNSIGNED_MODULE 13 273 - #define TAINT_SOFTLOCKUP 14 274 - #define TAINT_LIVEPATCH 15 275 - #define TAINT_AUX 16 276 - #define TAINT_RANDSTRUCT 17 277 - #define TAINT_FLAGS_COUNT 18 278 - #define TAINT_FLAGS_MAX ((1UL << TAINT_FLAGS_COUNT) - 1) 279 - 280 - struct taint_flag { 281 - char c_true; /* character printed when tainted */ 282 - char c_false; /* character printed when not tainted */ 283 - bool module; /* also show as a per-module taint flag */ 284 - }; 285 - 286 - extern const struct taint_flag taint_flags[TAINT_FLAGS_COUNT]; 287 450 288 451 extern const char hex_asc[]; 289 452 #define hex_asc_lo(x) hex_asc[((x) & 0x0f)]
-1
include/linux/kprobes.h
··· 399 399 void dump_kprobe(struct kprobe *kp); 400 400 401 401 void *alloc_insn_page(void); 402 - void free_insn_page(void *page); 403 402 404 403 int kprobe_get_kallsym(unsigned int symnum, unsigned long *value, char *type, 405 404 char *sym);
+155
include/linux/kstrtox.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + #ifndef _LINUX_KSTRTOX_H 3 + #define _LINUX_KSTRTOX_H 4 + 5 + #include <linux/compiler.h> 6 + #include <linux/types.h> 7 + 8 + /* Internal, do not use. */ 9 + int __must_check _kstrtoul(const char *s, unsigned int base, unsigned long *res); 10 + int __must_check _kstrtol(const char *s, unsigned int base, long *res); 11 + 12 + int __must_check kstrtoull(const char *s, unsigned int base, unsigned long long *res); 13 + int __must_check kstrtoll(const char *s, unsigned int base, long long *res); 14 + 15 + /** 16 + * kstrtoul - convert a string to an unsigned long 17 + * @s: The start of the string. The string must be null-terminated, and may also 18 + * include a single newline before its terminating null. The first character 19 + * may also be a plus sign, but not a minus sign. 20 + * @base: The number base to use. The maximum supported base is 16. If base is 21 + * given as 0, then the base of the string is automatically detected with the 22 + * conventional semantics - If it begins with 0x the number will be parsed as a 23 + * hexadecimal (case insensitive), if it otherwise begins with 0, it will be 24 + * parsed as an octal number. Otherwise it will be parsed as a decimal. 25 + * @res: Where to write the result of the conversion on success. 26 + * 27 + * Returns 0 on success, -ERANGE on overflow and -EINVAL on parsing error. 28 + * Preferred over simple_strtoul(). Return code must be checked. 29 + */ 30 + static inline int __must_check kstrtoul(const char *s, unsigned int base, unsigned long *res) 31 + { 32 + /* 33 + * We want to shortcut function call, but 34 + * __builtin_types_compatible_p(unsigned long, unsigned long long) = 0. 
35 + */ 36 + if (sizeof(unsigned long) == sizeof(unsigned long long) && 37 + __alignof__(unsigned long) == __alignof__(unsigned long long)) 38 + return kstrtoull(s, base, (unsigned long long *)res); 39 + else 40 + return _kstrtoul(s, base, res); 41 + } 42 + 43 + /** 44 + * kstrtol - convert a string to a long 45 + * @s: The start of the string. The string must be null-terminated, and may also 46 + * include a single newline before its terminating null. The first character 47 + * may also be a plus sign or a minus sign. 48 + * @base: The number base to use. The maximum supported base is 16. If base is 49 + * given as 0, then the base of the string is automatically detected with the 50 + * conventional semantics - If it begins with 0x the number will be parsed as a 51 + * hexadecimal (case insensitive), if it otherwise begins with 0, it will be 52 + * parsed as an octal number. Otherwise it will be parsed as a decimal. 53 + * @res: Where to write the result of the conversion on success. 54 + * 55 + * Returns 0 on success, -ERANGE on overflow and -EINVAL on parsing error. 56 + * Preferred over simple_strtol(). Return code must be checked. 57 + */ 58 + static inline int __must_check kstrtol(const char *s, unsigned int base, long *res) 59 + { 60 + /* 61 + * We want to shortcut function call, but 62 + * __builtin_types_compatible_p(long, long long) = 0. 
63 + */ 64 + if (sizeof(long) == sizeof(long long) && 65 + __alignof__(long) == __alignof__(long long)) 66 + return kstrtoll(s, base, (long long *)res); 67 + else 68 + return _kstrtol(s, base, res); 69 + } 70 + 71 + int __must_check kstrtouint(const char *s, unsigned int base, unsigned int *res); 72 + int __must_check kstrtoint(const char *s, unsigned int base, int *res); 73 + 74 + static inline int __must_check kstrtou64(const char *s, unsigned int base, u64 *res) 75 + { 76 + return kstrtoull(s, base, res); 77 + } 78 + 79 + static inline int __must_check kstrtos64(const char *s, unsigned int base, s64 *res) 80 + { 81 + return kstrtoll(s, base, res); 82 + } 83 + 84 + static inline int __must_check kstrtou32(const char *s, unsigned int base, u32 *res) 85 + { 86 + return kstrtouint(s, base, res); 87 + } 88 + 89 + static inline int __must_check kstrtos32(const char *s, unsigned int base, s32 *res) 90 + { 91 + return kstrtoint(s, base, res); 92 + } 93 + 94 + int __must_check kstrtou16(const char *s, unsigned int base, u16 *res); 95 + int __must_check kstrtos16(const char *s, unsigned int base, s16 *res); 96 + int __must_check kstrtou8(const char *s, unsigned int base, u8 *res); 97 + int __must_check kstrtos8(const char *s, unsigned int base, s8 *res); 98 + int __must_check kstrtobool(const char *s, bool *res); 99 + 100 + int __must_check kstrtoull_from_user(const char __user *s, size_t count, unsigned int base, unsigned long long *res); 101 + int __must_check kstrtoll_from_user(const char __user *s, size_t count, unsigned int base, long long *res); 102 + int __must_check kstrtoul_from_user(const char __user *s, size_t count, unsigned int base, unsigned long *res); 103 + int __must_check kstrtol_from_user(const char __user *s, size_t count, unsigned int base, long *res); 104 + int __must_check kstrtouint_from_user(const char __user *s, size_t count, unsigned int base, unsigned int *res); 105 + int __must_check kstrtoint_from_user(const char __user *s, size_t count, 
unsigned int base, int *res); 106 + int __must_check kstrtou16_from_user(const char __user *s, size_t count, unsigned int base, u16 *res); 107 + int __must_check kstrtos16_from_user(const char __user *s, size_t count, unsigned int base, s16 *res); 108 + int __must_check kstrtou8_from_user(const char __user *s, size_t count, unsigned int base, u8 *res); 109 + int __must_check kstrtos8_from_user(const char __user *s, size_t count, unsigned int base, s8 *res); 110 + int __must_check kstrtobool_from_user(const char __user *s, size_t count, bool *res); 111 + 112 + static inline int __must_check kstrtou64_from_user(const char __user *s, size_t count, unsigned int base, u64 *res) 113 + { 114 + return kstrtoull_from_user(s, count, base, res); 115 + } 116 + 117 + static inline int __must_check kstrtos64_from_user(const char __user *s, size_t count, unsigned int base, s64 *res) 118 + { 119 + return kstrtoll_from_user(s, count, base, res); 120 + } 121 + 122 + static inline int __must_check kstrtou32_from_user(const char __user *s, size_t count, unsigned int base, u32 *res) 123 + { 124 + return kstrtouint_from_user(s, count, base, res); 125 + } 126 + 127 + static inline int __must_check kstrtos32_from_user(const char __user *s, size_t count, unsigned int base, s32 *res) 128 + { 129 + return kstrtoint_from_user(s, count, base, res); 130 + } 131 + 132 + /* 133 + * Use kstrto<foo> instead. 134 + * 135 + * NOTE: simple_strto<foo> does not check for the range overflow and, 136 + * depending on the input, may give interesting results. 137 + * 138 + * Use these functions if and only if you cannot use kstrto<foo>, because 139 + * the conversion ends on the first non-digit character, which may be far 140 + * beyond the supported range. It might be useful to parse the strings like 141 + * 10x50 or 12:21 without altering original string or temporary buffer in use. 142 + * Keep in mind above caveat. 
143 + */ 144 + 145 + extern unsigned long simple_strtoul(const char *,char **,unsigned int); 146 + extern long simple_strtol(const char *,char **,unsigned int); 147 + extern unsigned long long simple_strtoull(const char *,char **,unsigned int); 148 + extern long long simple_strtoll(const char *,char **,unsigned int); 149 + 150 + static inline int strtobool(const char *s, bool *res) 151 + { 152 + return kstrtobool(s, res); 153 + } 154 + 155 + #endif /* _LINUX_KSTRTOX_H */
+3 -1
include/linux/memblock.h
··· 30 30 * @MEMBLOCK_NONE: no special request 31 31 * @MEMBLOCK_HOTPLUG: hotpluggable region 32 32 * @MEMBLOCK_MIRROR: mirrored region 33 - * @MEMBLOCK_NOMAP: don't add to kernel direct mapping 33 + * @MEMBLOCK_NOMAP: don't add to kernel direct mapping and treat as 34 + * reserved in the memory map; refer to memblock_mark_nomap() description 35 + * for further details 34 36 */ 35 37 enum memblock_flags { 36 38 MEMBLOCK_NONE = 0x0, /* No special request */
-27
include/linux/memory_hotplug.h
··· 18 18 #ifdef CONFIG_MEMORY_HOTPLUG 19 19 struct page *pfn_to_online_page(unsigned long pfn); 20 20 21 - /* 22 - * Types for free bootmem stored in page->lru.next. These have to be in 23 - * some random range in unsigned long space for debugging purposes. 24 - */ 25 - enum { 26 - MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE = 12, 27 - SECTION_INFO = MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE, 28 - MIX_SECTION_INFO, 29 - NODE_INFO, 30 - MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE = NODE_INFO, 31 - }; 32 - 33 21 /* Types for control the zone type of onlined and offlined memory */ 34 22 enum { 35 23 /* Offline the memory. */ ··· 210 222 #endif /* CONFIG_NUMA */ 211 223 #endif /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */ 212 224 213 - #ifdef CONFIG_HAVE_BOOTMEM_INFO_NODE 214 - extern void __init register_page_bootmem_info_node(struct pglist_data *pgdat); 215 - #else 216 - static inline void register_page_bootmem_info_node(struct pglist_data *pgdat) 217 - { 218 - } 219 - #endif 220 - extern void put_page_bootmem(struct page *page); 221 - extern void get_page_bootmem(unsigned long ingo, struct page *page, 222 - unsigned long type); 223 - 224 225 void get_online_mems(void); 225 226 void put_online_mems(void); 226 227 ··· 236 259 static inline void zone_span_writelock(struct zone *zone) {} 237 260 static inline void zone_span_writeunlock(struct zone *zone) {} 238 261 static inline void zone_seqlock_init(struct zone *zone) {} 239 - 240 - static inline void register_page_bootmem_info_node(struct pglist_data *pgdat) 241 - { 242 - } 243 262 244 263 static inline int try_online_node(int nid) 245 264 {
+3 -6
include/linux/mempolicy.h
··· 46 46 atomic_t refcnt; 47 47 unsigned short mode; /* See MPOL_* above */ 48 48 unsigned short flags; /* See set_mempolicy() MPOL_F_* above */ 49 - union { 50 - short preferred_node; /* preferred */ 51 - nodemask_t nodes; /* interleave/bind */ 52 - /* undefined for default */ 53 - } v; 49 + nodemask_t nodes; /* interleave/bind/prefer */ 50 + 54 51 union { 55 52 nodemask_t cpuset_mems_allowed; /* relative to these nodes */ 56 53 nodemask_t user_nodemask; /* nodemask passed by user */ ··· 147 150 unsigned long addr, gfp_t gfp_flags, 148 151 struct mempolicy **mpol, nodemask_t **nodemask); 149 152 extern bool init_nodemask_of_mempolicy(nodemask_t *mask); 150 - extern bool mempolicy_nodemask_intersects(struct task_struct *tsk, 153 + extern bool mempolicy_in_oom_domain(struct task_struct *tsk, 151 154 const nodemask_t *mask); 152 155 extern nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy); 153 156
+1 -1
include/linux/memremap.h
··· 26 26 }; 27 27 28 28 /* 29 - * Specialize ZONE_DEVICE memory into multiple types each having differents 29 + * Specialize ZONE_DEVICE memory into multiple types, each with a different 30 30 * usage. 31 31 * 32 32 * MEMORY_DEVICE_PRIVATE:
+4 -23
include/linux/migrate.h
··· 51 51 struct page *newpage, struct page *page); 52 52 extern int migrate_page_move_mapping(struct address_space *mapping, 53 53 struct page *newpage, struct page *page, int extra_count); 54 + extern void copy_huge_page(struct page *dst, struct page *src); 54 55 #else 55 56 56 57 static inline void putback_movable_pages(struct list_head *l) {} ··· 78 77 return -ENOSYS; 79 78 } 80 79 80 + static inline void copy_huge_page(struct page *dst, struct page *src) 81 + { 82 + } 81 83 #endif /* CONFIG_MIGRATION */ 82 84 83 85 #ifdef CONFIG_COMPACTION ··· 99 95 #endif 100 96 101 97 #ifdef CONFIG_NUMA_BALANCING 102 - extern bool pmd_trans_migrating(pmd_t pmd); 103 98 extern int migrate_misplaced_page(struct page *page, 104 99 struct vm_area_struct *vma, int node); 105 100 #else 106 - static inline bool pmd_trans_migrating(pmd_t pmd) 107 - { 108 - return false; 109 - } 110 101 static inline int migrate_misplaced_page(struct page *page, 111 102 struct vm_area_struct *vma, int node) 112 103 { 113 104 return -EAGAIN; /* can't migrate now */ 114 105 } 115 106 #endif /* CONFIG_NUMA_BALANCING */ 116 - 117 - #if defined(CONFIG_NUMA_BALANCING) && defined(CONFIG_TRANSPARENT_HUGEPAGE) 118 - extern int migrate_misplaced_transhuge_page(struct mm_struct *mm, 119 - struct vm_area_struct *vma, 120 - pmd_t *pmd, pmd_t entry, 121 - unsigned long address, 122 - struct page *page, int node); 123 - #else 124 - static inline int migrate_misplaced_transhuge_page(struct mm_struct *mm, 125 - struct vm_area_struct *vma, 126 - pmd_t *pmd, pmd_t entry, 127 - unsigned long address, 128 - struct page *page, int node) 129 - { 130 - return -EAGAIN; 131 - } 132 - #endif /* CONFIG_NUMA_BALANCING && CONFIG_TRANSPARENT_HUGEPAGE*/ 133 - 134 107 135 108 #ifdef CONFIG_MIGRATION 136 109
+12 -2
include/linux/mm.h
··· 145 145 /* This function must be updated when the size of struct page grows above 80 146 146 * or reduces below 56. The idea that compiler optimizes out switch() 147 147 * statement, and only leaves move/store instructions. Also the compiler can 148 - * combine write statments if they are both assignments and can be reordered, 148 + * combine write statements if they are both assignments and can be reordered, 149 149 * this can result in several of the writes here being dropped. 150 150 */ 151 151 #define mm_zero_struct_page(pp) __mm_zero_struct_page(pp) ··· 540 540 pud_t *pud; /* Pointer to pud entry matching 541 541 * the 'address' 542 542 */ 543 - pte_t orig_pte; /* Value of PTE at the time of fault */ 543 + union { 544 + pte_t orig_pte; /* Value of PTE at the time of fault */ 545 + pmd_t orig_pmd; /* Value of PMD at the time of fault, 546 + * used by PMD fault only. 547 + */ 548 + }; 544 549 545 550 struct page *cow_page; /* Page handler may use for COW fault */ 546 551 struct page *page; /* ->fault handlers should return a ··· 3071 3066 { 3072 3067 } 3073 3068 #endif 3069 + 3070 + int vmemmap_remap_free(unsigned long start, unsigned long end, 3071 + unsigned long reuse); 3072 + int vmemmap_remap_alloc(unsigned long start, unsigned long end, 3073 + unsigned long reuse, gfp_t gfp_mask); 3074 3074 3075 3075 void *sparse_buffer_alloc(unsigned long size); 3076 3076 struct page * __populate_section_memmap(unsigned long pfn,
+1 -1
include/linux/mm_types.h
··· 404 404 unsigned long mmap_base; /* base of mmap area */ 405 405 unsigned long mmap_legacy_base; /* base of mmap area in bottom-up allocations */ 406 406 #ifdef CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES 407 - /* Base adresses for compatible mmap() */ 407 + /* Base addresses for compatible mmap() */ 408 408 unsigned long mmap_compat_base; 409 409 unsigned long mmap_compat_legacy_base; 410 410 #endif
+16 -10
include/linux/mmu_notifier.h
··· 41 41 * 42 42 * @MMU_NOTIFY_MIGRATE: used during migrate_vma_collect() invalidate to signal 43 43 * a device driver to possibly ignore the invalidation if the 44 - * migrate_pgmap_owner field matches the driver's device private pgmap owner. 44 + * owner field matches the driver's device private pgmap owner. 45 + * 46 + * @MMU_NOTIFY_EXCLUSIVE: to signal a device driver that the device will no 47 + * longer have exclusive access to the page. When sent during creation of an 48 + * exclusive range the owner will be initialised to the value provided by the 49 + * caller of make_device_exclusive_range(), otherwise the owner will be NULL. 45 50 */ 46 51 enum mmu_notifier_event { 47 52 MMU_NOTIFY_UNMAP = 0, ··· 56 51 MMU_NOTIFY_SOFT_DIRTY, 57 52 MMU_NOTIFY_RELEASE, 58 53 MMU_NOTIFY_MIGRATE, 54 + MMU_NOTIFY_EXCLUSIVE, 59 55 }; 60 56 61 57 #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0) ··· 275 269 unsigned long end; 276 270 unsigned flags; 277 271 enum mmu_notifier_event event; 278 - void *migrate_pgmap_owner; 272 + void *owner; 279 273 }; 280 274 281 275 static inline int mm_has_notifiers(struct mm_struct *mm) ··· 527 521 range->flags = flags; 528 522 } 529 523 530 - static inline void mmu_notifier_range_init_migrate( 531 - struct mmu_notifier_range *range, unsigned int flags, 524 + static inline void mmu_notifier_range_init_owner( 525 + struct mmu_notifier_range *range, 526 + enum mmu_notifier_event event, unsigned int flags, 532 527 struct vm_area_struct *vma, struct mm_struct *mm, 533 - unsigned long start, unsigned long end, void *pgmap) 528 + unsigned long start, unsigned long end, void *owner) 534 529 { 535 - mmu_notifier_range_init(range, MMU_NOTIFY_MIGRATE, flags, vma, mm, 536 - start, end); 537 - range->migrate_pgmap_owner = pgmap; 530 + mmu_notifier_range_init(range, event, flags, vma, mm, start, end); 531 + range->owner = owner; 538 532 } 539 533 540 534 #define ptep_clear_flush_young_notify(__vma, __address, __ptep) \ ··· 661 655 662 656 #define 
mmu_notifier_range_init(range,event,flags,vma,mm,start,end) \ 663 657 _mmu_notifier_range_init(range, start, end) 664 - #define mmu_notifier_range_init_migrate(range, flags, vma, mm, start, end, \ 665 - pgmap) \ 658 + #define mmu_notifier_range_init_owner(range, event, flags, vma, mm, start, \ 659 + end, owner) \ 666 660 _mmu_notifier_range_init(range, start, end) 667 661 668 662 static inline bool
+25 -2
include/linux/mmzone.h
··· 114 114 struct pglist_data; 115 115 116 116 /* 117 - * Add a wild amount of padding here to ensure datas fall into separate 117 + * Add a wild amount of padding here to ensure data fall into separate 118 118 * cachelines. There are very few zone structures in the machine, so space 119 119 * consumption is not a concern here. 120 120 */ ··· 1064 1064 #ifndef CONFIG_NUMA 1065 1065 1066 1066 extern struct pglist_data contig_page_data; 1067 - #define NODE_DATA(nid) (&contig_page_data) 1067 + static inline struct pglist_data *NODE_DATA(int nid) 1068 + { 1069 + return &contig_page_data; 1070 + } 1068 1071 #define NODE_MEM_MAP(nid) mem_map 1069 1072 1070 1073 #else /* CONFIG_NUMA */ ··· 1448 1445 #endif 1449 1446 1450 1447 #ifndef CONFIG_HAVE_ARCH_PFN_VALID 1448 + /** 1449 + * pfn_valid - check if there is a valid memory map entry for a PFN 1450 + * @pfn: the page frame number to check 1451 + * 1452 + * Check if there is a valid memory map entry aka struct page for the @pfn. 1453 + * Note, that availability of the memory map entry does not imply that 1454 + * there is actual usable memory at that @pfn. The struct page may 1455 + * represent a hole or an unusable page frame. 1456 + * 1457 + * Return: 1 for PFNs that have memory map entries and 0 otherwise 1458 + */ 1451 1459 static inline int pfn_valid(unsigned long pfn) 1452 1460 { 1453 1461 struct mem_section *ms; 1462 + 1463 + /* 1464 + * Ensure the upper PAGE_SHIFT bits are clear in the 1465 + * pfn. Else it might lead to false positives when 1466 + * some of the upper bits are set, but the lower bits 1467 + * match a valid pfn. 1468 + */ 1469 + if (PHYS_PFN(PFN_PHYS(pfn)) != pfn) 1470 + return 0; 1454 1471 1455 1472 if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS) 1456 1473 return 0;
+2 -2
include/linux/mpi.h
··· 200 200 unsigned int nbits; /* Number of bits. */ 201 201 202 202 /* Domain parameters. Note that they may not all be set and if set 203 - * the MPIs may be flaged as constant. 203 + * the MPIs may be flagged as constant. 204 204 */ 205 205 MPI p; /* Prime specifying the field GF(p). */ 206 206 MPI a; /* First coefficient of the Weierstrass equation. */ ··· 267 267 /** 268 268 * mpi_get_size() - returns max size required to store the number 269 269 * 270 - * @a: A multi precision integer for which we want to allocate a bufer 270 + * @a: A multi precision integer for which we want to allocate a buffer 271 271 * 272 272 * Return: size required to store the number 273 273 */
+22
include/linux/page-flags.h
··· 704 704 #endif 705 705 706 706 /* 707 + * Check if a page is currently marked HWPoisoned. Note that this check is 708 + * best effort only and inherently racy: there is no way to synchronize with 709 + * failing hardware. 710 + */ 711 + static inline bool is_page_hwpoison(struct page *page) 712 + { 713 + if (PageHWPoison(page)) 714 + return true; 715 + return PageHuge(page) && PageHWPoison(compound_head(page)); 716 + } 717 + 718 + /* 707 719 * For pages that are never mapped to userspace (and aren't PageSlab), 708 720 * page_type may be used. Because it is initialised to -1, we invert the 709 721 * sense of the bit, so __SetPageFoo *clears* the bit used for PageFoo, and ··· 778 766 * relies on this feature is aware that re-onlining the memory block will 779 767 * require to re-set the pages PageOffline() and not giving them to the 780 768 * buddy via online_page_callback_t. 769 + * 770 + * There are drivers that mark a page PageOffline() and expect there won't be 771 + * any further access to page content. PFN walkers that read content of random 772 + * pages should check PageOffline() and synchronize with such drivers using 773 + * page_offline_freeze()/page_offline_thaw(). 781 774 */ 782 775 PAGE_TYPE_OPS(Offline, offline) 776 + 777 + extern void page_offline_freeze(void); 778 + extern void page_offline_thaw(void); 779 + extern void page_offline_begin(void); 780 + extern void page_offline_end(void); 783 781 784 782 /* 785 783 * Marks pages in use as page tables.
+98
include/linux/panic.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + #ifndef _LINUX_PANIC_H 3 + #define _LINUX_PANIC_H 4 + 5 + #include <linux/compiler_attributes.h> 6 + #include <linux/types.h> 7 + 8 + struct pt_regs; 9 + 10 + extern long (*panic_blink)(int state); 11 + __printf(1, 2) 12 + void panic(const char *fmt, ...) __noreturn __cold; 13 + void nmi_panic(struct pt_regs *regs, const char *msg); 14 + extern void oops_enter(void); 15 + extern void oops_exit(void); 16 + extern bool oops_may_print(void); 17 + 18 + #ifdef CONFIG_SMP 19 + extern unsigned int sysctl_oops_all_cpu_backtrace; 20 + #else 21 + #define sysctl_oops_all_cpu_backtrace 0 22 + #endif /* CONFIG_SMP */ 23 + 24 + extern int panic_timeout; 25 + extern unsigned long panic_print; 26 + extern int panic_on_oops; 27 + extern int panic_on_unrecovered_nmi; 28 + extern int panic_on_io_nmi; 29 + extern int panic_on_warn; 30 + 31 + extern unsigned long panic_on_taint; 32 + extern bool panic_on_taint_nousertaint; 33 + 34 + extern int sysctl_panic_on_rcu_stall; 35 + extern int sysctl_max_rcu_stall_to_panic; 36 + extern int sysctl_panic_on_stackoverflow; 37 + 38 + extern bool crash_kexec_post_notifiers; 39 + 40 + /* 41 + * panic_cpu is used for synchronizing panic() and crash_kexec() execution. It 42 + * holds a CPU number which is executing panic() currently. A value of 43 + * PANIC_CPU_INVALID means no CPU has entered panic() or crash_kexec(). 44 + */ 45 + extern atomic_t panic_cpu; 46 + #define PANIC_CPU_INVALID -1 47 + 48 + /* 49 + * Only to be used by arch init code. If the user over-wrote the default 50 + * CONFIG_PANIC_TIMEOUT, honor it. 51 + */ 52 + static inline void set_arch_panic_timeout(int timeout, int arch_default_timeout) 53 + { 54 + if (panic_timeout == arch_default_timeout) 55 + panic_timeout = timeout; 56 + } 57 + 58 + /* This cannot be an enum because some may be used in assembly source. 
*/ 59 + #define TAINT_PROPRIETARY_MODULE 0 60 + #define TAINT_FORCED_MODULE 1 61 + #define TAINT_CPU_OUT_OF_SPEC 2 62 + #define TAINT_FORCED_RMMOD 3 63 + #define TAINT_MACHINE_CHECK 4 64 + #define TAINT_BAD_PAGE 5 65 + #define TAINT_USER 6 66 + #define TAINT_DIE 7 67 + #define TAINT_OVERRIDDEN_ACPI_TABLE 8 68 + #define TAINT_WARN 9 69 + #define TAINT_CRAP 10 70 + #define TAINT_FIRMWARE_WORKAROUND 11 71 + #define TAINT_OOT_MODULE 12 72 + #define TAINT_UNSIGNED_MODULE 13 73 + #define TAINT_SOFTLOCKUP 14 74 + #define TAINT_LIVEPATCH 15 75 + #define TAINT_AUX 16 76 + #define TAINT_RANDSTRUCT 17 77 + #define TAINT_FLAGS_COUNT 18 78 + #define TAINT_FLAGS_MAX ((1UL << TAINT_FLAGS_COUNT) - 1) 79 + 80 + struct taint_flag { 81 + char c_true; /* character printed when tainted */ 82 + char c_false; /* character printed when not tainted */ 83 + bool module; /* also show as a per-module taint flag */ 84 + }; 85 + 86 + extern const struct taint_flag taint_flags[TAINT_FLAGS_COUNT]; 87 + 88 + enum lockdep_ok { 89 + LOCKDEP_STILL_OK, 90 + LOCKDEP_NOW_UNRELIABLE, 91 + }; 92 + 93 + extern const char *print_tainted(void); 94 + extern void add_taint(unsigned flag, enum lockdep_ok); 95 + extern int test_taint(unsigned flag); 96 + extern unsigned long get_taint(void); 97 + 98 + #endif /* _LINUX_PANIC_H */
+12
include/linux/panic_notifier.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + #ifndef _LINUX_PANIC_NOTIFIERS_H 3 + #define _LINUX_PANIC_NOTIFIERS_H 4 + 5 + #include <linux/notifier.h> 6 + #include <linux/types.h> 7 + 8 + extern struct atomic_notifier_head panic_notifier_list; 9 + 10 + extern bool crash_kexec_post_notifiers; 11 + 12 + #endif /* _LINUX_PANIC_NOTIFIERS_H */
+43 -1
include/linux/pgtable.h
··· 29 29 #endif 30 30 31 31 /* 32 + * This defines the first usable user address. Platforms 33 + * can override its value with custom FIRST_USER_ADDRESS 34 + * defined in their respective <asm/pgtable.h>. 35 + */ 36 + #ifndef FIRST_USER_ADDRESS 37 + #define FIRST_USER_ADDRESS 0UL 38 + #endif 39 + 40 + /* 41 + * This defines the generic helper for accessing PMD page 42 + * table page. Although platforms can still override this 43 + * via their respective <asm/pgtable.h>. 44 + */ 45 + #ifndef pmd_pgtable 46 + #define pmd_pgtable(pmd) pmd_page(pmd) 47 + #endif 48 + 49 + /* 32 50 * A page table page can be thought of an array like this: pXd_t[PTRS_PER_PxD] 33 51 * 34 52 * The pXx_index() functions return the index of the entry in the page ··· 1397 1379 } 1398 1380 #endif /* !__PAGETABLE_P4D_FOLDED */ 1399 1381 1382 + #ifndef __PAGETABLE_PUD_FOLDED 1400 1383 int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot); 1401 - int pmd_set_huge(pmd_t *pmd, phys_addr_t addr, pgprot_t prot); 1402 1384 int pud_clear_huge(pud_t *pud); 1385 + #else 1386 + static inline int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot) 1387 + { 1388 + return 0; 1389 + } 1390 + static inline int pud_clear_huge(pud_t *pud) 1391 + { 1392 + return 0; 1393 + } 1394 + #endif /* !__PAGETABLE_PUD_FOLDED */ 1395 + 1396 + #ifndef __PAGETABLE_PMD_FOLDED 1397 + int pmd_set_huge(pmd_t *pmd, phys_addr_t addr, pgprot_t prot); 1403 1398 int pmd_clear_huge(pmd_t *pmd); 1399 + #else 1400 + static inline int pmd_set_huge(pmd_t *pmd, phys_addr_t addr, pgprot_t prot) 1401 + { 1402 + return 0; 1403 + } 1404 + static inline int pmd_clear_huge(pmd_t *pmd) 1405 + { 1406 + return 0; 1407 + } 1408 + #endif /* !__PAGETABLE_PMD_FOLDED */ 1409 + 1404 1410 int p4d_free_pud_page(p4d_t *p4d, unsigned long addr); 1405 1411 int pud_free_pmd_page(pud_t *pud, unsigned long addr); 1406 1412 int pmd_free_pte_page(pmd_t *pmd, unsigned long addr);
+7 -6
include/linux/rmap.h
··· 86 86 }; 87 87 88 88 enum ttu_flags { 89 - TTU_MIGRATION = 0x1, /* migration mode */ 90 - TTU_MUNLOCK = 0x2, /* munlock mode */ 91 - 92 89 TTU_SPLIT_HUGE_PMD = 0x4, /* split huge PMD if any */ 93 90 TTU_IGNORE_MLOCK = 0x8, /* ignore mlock */ 94 91 TTU_SYNC = 0x10, /* avoid racy checks with PVMW_SYNC */ ··· 95 98 * do a final flush if necessary */ 96 99 TTU_RMAP_LOCKED = 0x80, /* do not grab rmap lock: 97 100 * caller holds it */ 98 - TTU_SPLIT_FREEZE = 0x100, /* freeze pte under splitting thp */ 99 101 }; 100 102 101 103 #ifdef CONFIG_MMU ··· 191 195 int page_referenced(struct page *, int is_locked, 192 196 struct mem_cgroup *memcg, unsigned long *vm_flags); 193 197 194 - bool try_to_unmap(struct page *, enum ttu_flags flags); 198 + void try_to_migrate(struct page *page, enum ttu_flags flags); 199 + void try_to_unmap(struct page *, enum ttu_flags flags); 200 + 201 + int make_device_exclusive_range(struct mm_struct *mm, unsigned long start, 202 + unsigned long end, struct page **pages, 203 + void *arg); 195 204 196 205 /* Avoid racy checks */ 197 206 #define PVMW_SYNC (1 << 0) ··· 241 240 * called in munlock()/munmap() path to check for other vmas holding 242 241 * the page mlocked. 243 242 */ 244 - void try_to_munlock(struct page *); 243 + void page_mlock(struct page *page); 245 244 246 245 void remove_migration_ptes(struct page *old, struct page *new, bool locked); 247 246
+9 -1
include/linux/seq_file.h
··· 126 126 void seq_put_hex_ll(struct seq_file *m, const char *delimiter, 127 127 unsigned long long v, unsigned int width); 128 128 129 + void seq_escape_mem(struct seq_file *m, const char *src, size_t len, 130 + unsigned int flags, const char *esc); 131 + 132 + static inline void seq_escape_str(struct seq_file *m, const char *src, 133 + unsigned int flags, const char *esc) 134 + { 135 + seq_escape_mem(m, src, strlen(src), flags, esc); 136 + } 137 + 129 138 void seq_escape(struct seq_file *m, const char *s, const char *esc); 130 - void seq_escape_mem_ascii(struct seq_file *m, const char *src, size_t isz); 131 139 132 140 void seq_hex_dump(struct seq_file *m, const char *prefix_str, int prefix_type, 133 141 int rowsize, int groupsize, const void *buf, size_t len,
+8 -11
include/linux/shmem_fs.h
··· 122 122 extern bool shmem_charge(struct inode *inode, long pages); 123 123 extern void shmem_uncharge(struct inode *inode, long pages); 124 124 125 + #ifdef CONFIG_USERFAULTFD 125 126 #ifdef CONFIG_SHMEM 126 - extern int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd, 127 + extern int shmem_mfill_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd, 127 128 struct vm_area_struct *dst_vma, 128 129 unsigned long dst_addr, 129 130 unsigned long src_addr, 131 + bool zeropage, 130 132 struct page **pagep); 131 - extern int shmem_mfill_zeropage_pte(struct mm_struct *dst_mm, 132 - pmd_t *dst_pmd, 133 - struct vm_area_struct *dst_vma, 134 - unsigned long dst_addr); 135 - #else 136 - #define shmem_mcopy_atomic_pte(dst_mm, dst_pte, dst_vma, dst_addr, \ 137 - src_addr, pagep) ({ BUG(); 0; }) 138 - #define shmem_mfill_zeropage_pte(dst_mm, dst_pmd, dst_vma, \ 139 - dst_addr) ({ BUG(); 0; }) 140 - #endif 133 + #else /* !CONFIG_SHMEM */ 134 + #define shmem_mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr, \ 135 + src_addr, zeropage, pagep) ({ BUG(); 0; }) 136 + #endif /* CONFIG_SHMEM */ 137 + #endif /* CONFIG_USERFAULTFD */ 141 138 142 139 #endif
-2
include/linux/signal.h
··· 462 462 unsafe_put_user((void __user *)t->sas_ss_sp, &__uss->ss_sp, label); \ 463 463 unsafe_put_user(t->sas_ss_flags, &__uss->ss_flags, label); \ 464 464 unsafe_put_user(t->sas_ss_size, &__uss->ss_size, label); \ 465 - if (t->sas_ss_flags & SS_AUTODISARM) \ 466 - sas_ss_reset(t); \ 467 465 } while (0); 468 466 469 467 #ifdef CONFIG_PROC_FS
-7
include/linux/string.h
··· 2 2 #ifndef _LINUX_STRING_H_ 3 3 #define _LINUX_STRING_H_ 4 4 5 - 6 5 #include <linux/compiler.h> /* for inline */ 7 6 #include <linux/types.h> /* for size_t */ 8 7 #include <linux/stddef.h> /* for NULL */ ··· 183 184 extern void argv_free(char **argv); 184 185 185 186 extern bool sysfs_streq(const char *s1, const char *s2); 186 - extern int kstrtobool(const char *s, bool *res); 187 - static inline int strtobool(const char *s, bool *res) 188 - { 189 - return kstrtobool(s, res); 190 - } 191 - 192 187 int match_string(const char * const *array, size_t n, const char *string); 193 188 int __sysfs_match_string(const char * const *array, size_t n, const char *s); 194 189
+18 -13
include/linux/string_helpers.h
··· 2 2 #ifndef _LINUX_STRING_HELPERS_H_ 3 3 #define _LINUX_STRING_HELPERS_H_ 4 4 5 + #include <linux/bits.h> 5 6 #include <linux/ctype.h> 6 7 #include <linux/types.h> 7 8 ··· 19 18 void string_get_size(u64 size, u64 blk_size, enum string_size_units units, 20 19 char *buf, int len); 21 20 22 - #define UNESCAPE_SPACE 0x01 23 - #define UNESCAPE_OCTAL 0x02 24 - #define UNESCAPE_HEX 0x04 25 - #define UNESCAPE_SPECIAL 0x08 21 + #define UNESCAPE_SPACE BIT(0) 22 + #define UNESCAPE_OCTAL BIT(1) 23 + #define UNESCAPE_HEX BIT(2) 24 + #define UNESCAPE_SPECIAL BIT(3) 26 25 #define UNESCAPE_ANY \ 27 26 (UNESCAPE_SPACE | UNESCAPE_OCTAL | UNESCAPE_HEX | UNESCAPE_SPECIAL) 27 + 28 + #define UNESCAPE_ALL_MASK GENMASK(3, 0) 28 29 29 30 int string_unescape(char *src, char *dst, size_t size, unsigned int flags); 30 31 ··· 45 42 return string_unescape_any(buf, buf, 0); 46 43 } 47 44 48 - #define ESCAPE_SPACE 0x01 49 - #define ESCAPE_SPECIAL 0x02 50 - #define ESCAPE_NULL 0x04 51 - #define ESCAPE_OCTAL 0x08 45 + #define ESCAPE_SPACE BIT(0) 46 + #define ESCAPE_SPECIAL BIT(1) 47 + #define ESCAPE_NULL BIT(2) 48 + #define ESCAPE_OCTAL BIT(3) 52 49 #define ESCAPE_ANY \ 53 50 (ESCAPE_SPACE | ESCAPE_OCTAL | ESCAPE_SPECIAL | ESCAPE_NULL) 54 - #define ESCAPE_NP 0x10 51 + #define ESCAPE_NP BIT(4) 55 52 #define ESCAPE_ANY_NP (ESCAPE_ANY | ESCAPE_NP) 56 - #define ESCAPE_HEX 0x20 53 + #define ESCAPE_HEX BIT(5) 54 + #define ESCAPE_NA BIT(6) 55 + #define ESCAPE_NAP BIT(7) 56 + #define ESCAPE_APPEND BIT(8) 57 + 58 + #define ESCAPE_ALL_MASK GENMASK(8, 0) 57 59 58 60 int string_escape_mem(const char *src, size_t isz, char *dst, size_t osz, 59 61 unsigned int flags, const char *only); 60 - 61 - int string_escape_mem_ascii(const char *src, size_t isz, char *dst, 62 - size_t osz); 63 62 64 63 static inline int string_escape_mem_any_np(const char *src, size_t isz, 65 64 char *dst, size_t osz, const char *only)
+1
include/linux/sunrpc/cache.h
··· 14 14 #include <linux/kref.h> 15 15 #include <linux/slab.h> 16 16 #include <linux/atomic.h> 17 + #include <linux/kstrtox.h> 17 18 #include <linux/proc_fs.h> 18 19 19 20 /*
+14 -5
include/linux/swap.h
··· 62 62 * migrate part of a process memory to device memory. 63 63 * 64 64 * When a page is migrated from CPU to device, we set the CPU page table entry 65 - * to a special SWP_DEVICE_* entry. 65 + * to a special SWP_DEVICE_{READ|WRITE} entry. 66 + * 67 + * When a page is mapped by the device for exclusive access we set the CPU page 68 + * table entries to special SWP_DEVICE_EXCLUSIVE_* entries. 66 69 */ 67 70 #ifdef CONFIG_DEVICE_PRIVATE 68 - #define SWP_DEVICE_NUM 2 71 + #define SWP_DEVICE_NUM 4 69 72 #define SWP_DEVICE_WRITE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM) 70 73 #define SWP_DEVICE_READ (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+1) 74 + #define SWP_DEVICE_EXCLUSIVE_WRITE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+2) 75 + #define SWP_DEVICE_EXCLUSIVE_READ (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+3) 71 76 #else 72 77 #define SWP_DEVICE_NUM 0 73 78 #endif ··· 542 537 { 543 538 } 544 539 545 - #define swap_address_space(entry) (NULL) 540 + static inline struct address_space *swap_address_space(swp_entry_t entry) 541 + { 542 + return NULL; 543 + } 544 + 546 545 #define get_nr_swap_pages() 0L 547 546 #define total_swap_pages 0L 548 547 #define total_swapcache_pages() 0UL ··· 569 560 { 570 561 } 571 562 572 - #define free_swap_and_cache(e) ({(is_migration_entry(e) || is_device_private_entry(e));}) 573 - #define swapcache_prepare(e) ({(is_migration_entry(e) || is_device_private_entry(e));}) 563 + /* used to sanity check ptes in zap_pte_range when CONFIG_SWAP=0 */ 564 + #define free_swap_and_cache(e) is_pfn_swap_entry(e) 574 565 575 566 static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask) 576 567 {
+82 -57
include/linux/swapops.h
··· 107 107 } 108 108 109 109 #if IS_ENABLED(CONFIG_DEVICE_PRIVATE) 110 - static inline swp_entry_t make_device_private_entry(struct page *page, bool write) 110 + static inline swp_entry_t make_readable_device_private_entry(pgoff_t offset) 111 111 { 112 - return swp_entry(write ? SWP_DEVICE_WRITE : SWP_DEVICE_READ, 113 - page_to_pfn(page)); 112 + return swp_entry(SWP_DEVICE_READ, offset); 113 + } 114 + 115 + static inline swp_entry_t make_writable_device_private_entry(pgoff_t offset) 116 + { 117 + return swp_entry(SWP_DEVICE_WRITE, offset); 114 118 } 115 119 116 120 static inline bool is_device_private_entry(swp_entry_t entry) ··· 123 119 return type == SWP_DEVICE_READ || type == SWP_DEVICE_WRITE; 124 120 } 125 121 126 - static inline void make_device_private_entry_read(swp_entry_t *entry) 127 - { 128 - *entry = swp_entry(SWP_DEVICE_READ, swp_offset(*entry)); 129 - } 130 - 131 - static inline bool is_write_device_private_entry(swp_entry_t entry) 122 + static inline bool is_writable_device_private_entry(swp_entry_t entry) 132 123 { 133 124 return unlikely(swp_type(entry) == SWP_DEVICE_WRITE); 134 125 } 135 126 136 - static inline unsigned long device_private_entry_to_pfn(swp_entry_t entry) 127 + static inline swp_entry_t make_readable_device_exclusive_entry(pgoff_t offset) 137 128 { 138 - return swp_offset(entry); 129 + return swp_entry(SWP_DEVICE_EXCLUSIVE_READ, offset); 139 130 } 140 131 141 - static inline struct page *device_private_entry_to_page(swp_entry_t entry) 132 + static inline swp_entry_t make_writable_device_exclusive_entry(pgoff_t offset) 142 133 { 143 - return pfn_to_page(swp_offset(entry)); 134 + return swp_entry(SWP_DEVICE_EXCLUSIVE_WRITE, offset); 135 + } 136 + 137 + static inline bool is_device_exclusive_entry(swp_entry_t entry) 138 + { 139 + return swp_type(entry) == SWP_DEVICE_EXCLUSIVE_READ || 140 + swp_type(entry) == SWP_DEVICE_EXCLUSIVE_WRITE; 141 + } 142 + 143 + static inline bool is_writable_device_exclusive_entry(swp_entry_t entry) 144 + { 
145 + return unlikely(swp_type(entry) == SWP_DEVICE_EXCLUSIVE_WRITE); 144 146 } 145 147 #else /* CONFIG_DEVICE_PRIVATE */ 146 - static inline swp_entry_t make_device_private_entry(struct page *page, bool write) 148 + static inline swp_entry_t make_readable_device_private_entry(pgoff_t offset) 147 149 { 148 150 return swp_entry(0, 0); 149 151 } 150 152 151 - static inline void make_device_private_entry_read(swp_entry_t *entry) 153 + static inline swp_entry_t make_writable_device_private_entry(pgoff_t offset) 152 154 { 155 + return swp_entry(0, 0); 153 156 } 154 157 155 158 static inline bool is_device_private_entry(swp_entry_t entry) ··· 164 153 return false; 165 154 } 166 155 167 - static inline bool is_write_device_private_entry(swp_entry_t entry) 156 + static inline bool is_writable_device_private_entry(swp_entry_t entry) 168 157 { 169 158 return false; 170 159 } 171 160 172 - static inline unsigned long device_private_entry_to_pfn(swp_entry_t entry) 161 + static inline swp_entry_t make_readable_device_exclusive_entry(pgoff_t offset) 173 162 { 174 - return 0; 163 + return swp_entry(0, 0); 175 164 } 176 165 177 - static inline struct page *device_private_entry_to_page(swp_entry_t entry) 166 + static inline swp_entry_t make_writable_device_exclusive_entry(pgoff_t offset) 178 167 { 179 - return NULL; 168 + return swp_entry(0, 0); 169 + } 170 + 171 + static inline bool is_device_exclusive_entry(swp_entry_t entry) 172 + { 173 + return false; 174 + } 175 + 176 + static inline bool is_writable_device_exclusive_entry(swp_entry_t entry) 177 + { 178 + return false; 180 179 } 181 180 #endif /* CONFIG_DEVICE_PRIVATE */ 182 181 183 182 #ifdef CONFIG_MIGRATION 184 - static inline swp_entry_t make_migration_entry(struct page *page, int write) 185 - { 186 - BUG_ON(!PageLocked(compound_head(page))); 187 - 188 - return swp_entry(write ? 
SWP_MIGRATION_WRITE : SWP_MIGRATION_READ, 189 - page_to_pfn(page)); 190 - } 191 - 192 183 static inline int is_migration_entry(swp_entry_t entry) 193 184 { 194 185 return unlikely(swp_type(entry) == SWP_MIGRATION_READ || 195 186 swp_type(entry) == SWP_MIGRATION_WRITE); 196 187 } 197 188 198 - static inline int is_write_migration_entry(swp_entry_t entry) 189 + static inline int is_writable_migration_entry(swp_entry_t entry) 199 190 { 200 191 return unlikely(swp_type(entry) == SWP_MIGRATION_WRITE); 201 192 } 202 193 203 - static inline unsigned long migration_entry_to_pfn(swp_entry_t entry) 194 + static inline swp_entry_t make_readable_migration_entry(pgoff_t offset) 204 195 { 205 - return swp_offset(entry); 196 + return swp_entry(SWP_MIGRATION_READ, offset); 206 197 } 207 198 208 - static inline struct page *migration_entry_to_page(swp_entry_t entry) 199 + static inline swp_entry_t make_writable_migration_entry(pgoff_t offset) 209 200 { 210 - struct page *p = pfn_to_page(swp_offset(entry)); 211 - /* 212 - * Any use of migration entries may only occur while the 213 - * corresponding page is locked 214 - */ 215 - BUG_ON(!PageLocked(compound_head(p))); 216 - return p; 217 - } 218 - 219 - static inline void make_migration_entry_read(swp_entry_t *entry) 220 - { 221 - *entry = swp_entry(SWP_MIGRATION_READ, swp_offset(*entry)); 201 + return swp_entry(SWP_MIGRATION_WRITE, offset); 222 202 } 223 203 224 204 extern void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep, ··· 219 217 extern void migration_entry_wait_huge(struct vm_area_struct *vma, 220 218 struct mm_struct *mm, pte_t *pte); 221 219 #else 220 + static inline swp_entry_t make_readable_migration_entry(pgoff_t offset) 221 + { 222 + return swp_entry(0, 0); 223 + } 222 224 223 - #define make_migration_entry(page, write) swp_entry(0, 0) 225 + static inline swp_entry_t make_writable_migration_entry(pgoff_t offset) 226 + { 227 + return swp_entry(0, 0); 228 + } 229 + 224 230 static inline int 
is_migration_entry(swp_entry_t swp) 225 231 { 226 232 return 0; 227 233 } 228 234 229 - static inline unsigned long migration_entry_to_pfn(swp_entry_t entry) 230 - { 231 - return 0; 232 - } 233 - 234 - static inline struct page *migration_entry_to_page(swp_entry_t entry) 235 - { 236 - return NULL; 237 - } 238 - 239 - static inline void make_migration_entry_read(swp_entry_t *entryp) { } 240 235 static inline void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep, 241 236 spinlock_t *ptl) { } 242 237 static inline void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd, 243 238 unsigned long address) { } 244 239 static inline void migration_entry_wait_huge(struct vm_area_struct *vma, 245 240 struct mm_struct *mm, pte_t *pte) { } 246 - static inline int is_write_migration_entry(swp_entry_t entry) 241 + static inline int is_writable_migration_entry(swp_entry_t entry) 247 242 { 248 243 return 0; 249 244 } 250 245 251 246 #endif 247 + 248 + static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry) 249 + { 250 + struct page *p = pfn_to_page(swp_offset(entry)); 251 + 252 + /* 253 + * Any use of migration entries may only occur while the 254 + * corresponding page is locked 255 + */ 256 + BUG_ON(is_migration_entry(entry) && !PageLocked(p)); 257 + 258 + return p; 259 + } 260 + 261 + /* 262 + * A pfn swap entry is a special type of swap entry that always has a pfn stored 263 + * in the swap offset. They are used to represent unaddressable device memory 264 + * and to restrict access to a page undergoing migration. 
265 + */ 266 + static inline bool is_pfn_swap_entry(swp_entry_t entry) 267 + { 268 + return is_migration_entry(entry) || is_device_private_entry(entry) || 269 + is_device_exclusive_entry(entry); 270 + } 252 271 253 272 struct page_vma_mapped_walk; 254 273 ··· 288 265 289 266 if (pmd_swp_soft_dirty(pmd)) 290 267 pmd = pmd_swp_clear_soft_dirty(pmd); 268 + if (pmd_swp_uffd_wp(pmd)) 269 + pmd = pmd_swp_clear_uffd_wp(pmd); 291 270 arch_entry = __pmd_to_swp_entry(pmd); 292 271 return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry)); 293 272 }
+1
include/linux/thread_info.h
··· 9 9 #define _LINUX_THREAD_INFO_H 10 10 11 11 #include <linux/types.h> 12 + #include <linux/limits.h> 12 13 #include <linux/bug.h> 13 14 #include <linux/restart_block.h> 14 15 #include <linux/errno.h>
+5
include/linux/userfaultfd_k.h
··· 53 53 MCOPY_ATOMIC_CONTINUE, 54 54 }; 55 55 56 + extern int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd, 57 + struct vm_area_struct *dst_vma, 58 + unsigned long dst_addr, struct page *page, 59 + bool newly_allocated, bool wp_copy); 60 + 56 61 extern ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start, 57 62 unsigned long src_start, unsigned long len, 58 63 bool *mmap_changing, __u64 mode);
+15
include/linux/vmalloc.h
··· 104 104 } 105 105 #endif 106 106 107 + #ifndef arch_vmap_pte_range_map_size 108 + static inline unsigned long arch_vmap_pte_range_map_size(unsigned long addr, unsigned long end, 109 + u64 pfn, unsigned int max_page_shift) 110 + { 111 + return PAGE_SIZE; 112 + } 113 + #endif 114 + 115 + #ifndef arch_vmap_pte_supported_shift 116 + static inline int arch_vmap_pte_supported_shift(unsigned long size) 117 + { 118 + return PAGE_SHIFT; 119 + } 120 + #endif 121 + 107 122 /* 108 123 * Highlevel APIs for driver use 109 124 */
-23
include/linux/zbud.h
··· 1 - /* SPDX-License-Identifier: GPL-2.0 */ 2 - #ifndef _ZBUD_H_ 3 - #define _ZBUD_H_ 4 - 5 - #include <linux/types.h> 6 - 7 - struct zbud_pool; 8 - 9 - struct zbud_ops { 10 - int (*evict)(struct zbud_pool *pool, unsigned long handle); 11 - }; 12 - 13 - struct zbud_pool *zbud_create_pool(gfp_t gfp, const struct zbud_ops *ops); 14 - void zbud_destroy_pool(struct zbud_pool *pool); 15 - int zbud_alloc(struct zbud_pool *pool, size_t size, gfp_t gfp, 16 - unsigned long *handle); 17 - void zbud_free(struct zbud_pool *pool, unsigned long handle); 18 - int zbud_reclaim_page(struct zbud_pool *pool, unsigned int retries); 19 - void *zbud_map(struct zbud_pool *pool, unsigned long handle); 20 - void zbud_unmap(struct zbud_pool *pool, unsigned long handle); 21 - u64 zbud_get_pool_size(struct zbud_pool *pool); 22 - 23 - #endif /* _ZBUD_H_ */
-41
include/trace/events/vmscan.h
··· 423 423 show_reclaim_flags(__entry->reclaim_flags)) 424 424 ); 425 425 426 - TRACE_EVENT(mm_vmscan_inactive_list_is_low, 427 - 428 - TP_PROTO(int nid, int reclaim_idx, 429 - unsigned long total_inactive, unsigned long inactive, 430 - unsigned long total_active, unsigned long active, 431 - unsigned long ratio, int file), 432 - 433 - TP_ARGS(nid, reclaim_idx, total_inactive, inactive, total_active, active, ratio, file), 434 - 435 - TP_STRUCT__entry( 436 - __field(int, nid) 437 - __field(int, reclaim_idx) 438 - __field(unsigned long, total_inactive) 439 - __field(unsigned long, inactive) 440 - __field(unsigned long, total_active) 441 - __field(unsigned long, active) 442 - __field(unsigned long, ratio) 443 - __field(int, reclaim_flags) 444 - ), 445 - 446 - TP_fast_assign( 447 - __entry->nid = nid; 448 - __entry->reclaim_idx = reclaim_idx; 449 - __entry->total_inactive = total_inactive; 450 - __entry->inactive = inactive; 451 - __entry->total_active = total_active; 452 - __entry->active = active; 453 - __entry->ratio = ratio; 454 - __entry->reclaim_flags = trace_reclaim_flags(file) & 455 - RECLAIM_WB_LRU; 456 - ), 457 - 458 - TP_printk("nid=%d reclaim_idx=%d total_inactive=%ld inactive=%ld total_active=%ld active=%ld ratio=%ld flags=%s", 459 - __entry->nid, 460 - __entry->reclaim_idx, 461 - __entry->total_inactive, __entry->inactive, 462 - __entry->total_active, __entry->active, 463 - __entry->ratio, 464 - show_reclaim_flags(__entry->reclaim_flags)) 465 - ); 466 - 467 426 TRACE_EVENT(mm_vmscan_node_reclaim_begin, 468 427 469 428 TP_PROTO(int nid, int order, gfp_t gfp_flags),
+3
include/uapi/asm-generic/mman-common.h
··· 72 72 #define MADV_COLD 20 /* deactivate these pages */ 73 73 #define MADV_PAGEOUT 21 /* reclaim these pages */ 74 74 75 + #define MADV_POPULATE_READ 22 /* populate (prefault) page tables readable */ 76 + #define MADV_POPULATE_WRITE 23 /* populate (prefault) page tables writable */ 77 + 75 78 /* compatibility flags */ 76 79 #define MAP_FILE 0 77 80
-1
include/uapi/linux/mempolicy.h
··· 60 60 * are never OR'ed into the mode in mempolicy API arguments. 61 61 */ 62 62 #define MPOL_F_SHARED (1 << 0) /* identify shared policies */ 63 - #define MPOL_F_LOCAL (1 << 1) /* preferred local allocation */ 64 63 #define MPOL_F_MOF (1 << 3) /* this policy wants migrate on fault */ 65 64 #define MPOL_F_MORON (1 << 4) /* Migrate On protnone Reference On Node */ 66 65
+6 -1
include/uapi/linux/userfaultfd.h
··· 31 31 UFFD_FEATURE_MISSING_SHMEM | \ 32 32 UFFD_FEATURE_SIGBUS | \ 33 33 UFFD_FEATURE_THREAD_ID | \ 34 - UFFD_FEATURE_MINOR_HUGETLBFS) 34 + UFFD_FEATURE_MINOR_HUGETLBFS | \ 35 + UFFD_FEATURE_MINOR_SHMEM) 35 36 #define UFFD_API_IOCTLS \ 36 37 ((__u64)1 << _UFFDIO_REGISTER | \ 37 38 (__u64)1 << _UFFDIO_UNREGISTER | \ ··· 186 185 * UFFD_FEATURE_MINOR_HUGETLBFS indicates that minor faults 187 186 * can be intercepted (via REGISTER_MODE_MINOR) for 188 187 * hugetlbfs-backed pages. 188 + * 189 + * UFFD_FEATURE_MINOR_SHMEM indicates the same support as 190 + * UFFD_FEATURE_MINOR_HUGETLBFS, but for shmem-backed pages instead. 189 191 */ 190 192 #define UFFD_FEATURE_PAGEFAULT_FLAG_WP (1<<0) 191 193 #define UFFD_FEATURE_EVENT_FORK (1<<1) ··· 200 196 #define UFFD_FEATURE_SIGBUS (1<<7) 201 197 #define UFFD_FEATURE_THREAD_ID (1<<8) 202 198 #define UFFD_FEATURE_MINOR_HUGETLBFS (1<<9) 199 + #define UFFD_FEATURE_MINOR_SHMEM (1<<10) 203 200 __u64 features; 204 201 205 202 __u64 ioctls;
+42
init/main.c
··· 873 873 rest_init(); 874 874 } 875 875 876 + static void __init print_unknown_bootoptions(void) 877 + { 878 + char *unknown_options; 879 + char *end; 880 + const char *const *p; 881 + size_t len; 882 + 883 + if (panic_later || (!argv_init[1] && !envp_init[2])) 884 + return; 885 + 886 + /* 887 + * Determine how many options we have to print out, plus a space 888 + * before each 889 + */ 890 + len = 1; /* null terminator */ 891 + for (p = &argv_init[1]; *p; p++) { 892 + len++; 893 + len += strlen(*p); 894 + } 895 + for (p = &envp_init[2]; *p; p++) { 896 + len++; 897 + len += strlen(*p); 898 + } 899 + 900 + unknown_options = memblock_alloc(len, SMP_CACHE_BYTES); 901 + if (!unknown_options) { 902 + pr_err("%s: Failed to allocate %zu bytes\n", 903 + __func__, len); 904 + return; 905 + } 906 + end = unknown_options; 907 + 908 + for (p = &argv_init[1]; *p; p++) 909 + end += sprintf(end, " %s", *p); 910 + for (p = &envp_init[2]; *p; p++) 911 + end += sprintf(end, " %s", *p); 912 + 913 + pr_notice("Unknown command line parameters:%s\n", unknown_options); 914 + memblock_free(__pa(unknown_options), len); 915 + } 916 + 876 917 asmlinkage __visible void __init __no_sanitize_address start_kernel(void) 877 918 { 878 919 char *command_line; ··· 955 914 static_command_line, __start___param, 956 915 __stop___param - __start___param, 957 916 -1, -1, NULL, &unknown_bootoption); 917 + print_unknown_bootoptions(); 958 918 if (!IS_ERR_OR_NULL(after_dashes)) 959 919 parse_args("Setting init args", after_dashes, NULL, 0, -1, -1, 960 920 NULL, set_init_arg);
+3 -3
ipc/msg.c
··· 130 130 struct msg_queue *msq = container_of(p, struct msg_queue, q_perm); 131 131 132 132 security_msg_queue_free(&msq->q_perm); 133 - kvfree(msq); 133 + kfree(msq); 134 134 } 135 135 136 136 /** ··· 147 147 key_t key = params->key; 148 148 int msgflg = params->flg; 149 149 150 - msq = kvmalloc(sizeof(*msq), GFP_KERNEL); 150 + msq = kmalloc(sizeof(*msq), GFP_KERNEL); 151 151 if (unlikely(!msq)) 152 152 return -ENOMEM; 153 153 ··· 157 157 msq->q_perm.security = NULL; 158 158 retval = security_msg_queue_alloc(&msq->q_perm); 159 159 if (retval) { 160 - kvfree(msq); 160 + kfree(msq); 161 161 return retval; 162 162 } 163 163
+15 -10
ipc/sem.c
··· 217 217 * this smp_load_acquire(), this is guaranteed because the smp_load_acquire() 218 218 * is inside a spin_lock() and after a write from 0 to non-zero a 219 219 * spin_lock()+spin_unlock() is done. 220 + * To prevent the compiler/cpu temporarily writing 0 to use_global_lock, 221 + * READ_ONCE()/WRITE_ONCE() is used. 220 222 * 221 223 * 2) queue.status: (SEM_BARRIER_2) 222 224 * Initialization is done while holding sem_lock(), so no further barrier is ··· 344 342 * Nothing to do, just reset the 345 343 * counter until we return to simple mode. 346 344 */ 347 - sma->use_global_lock = USE_GLOBAL_LOCK_HYSTERESIS; 345 + WRITE_ONCE(sma->use_global_lock, USE_GLOBAL_LOCK_HYSTERESIS); 348 346 return; 349 347 } 350 - sma->use_global_lock = USE_GLOBAL_LOCK_HYSTERESIS; 348 + WRITE_ONCE(sma->use_global_lock, USE_GLOBAL_LOCK_HYSTERESIS); 351 349 352 350 for (i = 0; i < sma->sem_nsems; i++) { 353 351 sem = &sma->sems[i]; ··· 373 371 /* See SEM_BARRIER_1 for purpose/pairing */ 374 372 smp_store_release(&sma->use_global_lock, 0); 375 373 } else { 376 - sma->use_global_lock--; 374 + WRITE_ONCE(sma->use_global_lock, 375 + sma->use_global_lock-1); 377 376 } 378 377 } 379 378 ··· 415 412 * Initial check for use_global_lock. Just an optimization, 416 413 * no locking, no memory barrier. 417 414 */ 418 - if (!sma->use_global_lock) { 415 + if (!READ_ONCE(sma->use_global_lock)) { 419 416 /* 420 417 * It appears that no complex operation is around. 421 418 * Acquire the per-semaphore lock. ··· 1157 1154 un->semid = -1; 1158 1155 list_del_rcu(&un->list_proc); 1159 1156 spin_unlock(&un->ulp->lock); 1160 - kfree_rcu(un, rcu); 1157 + kvfree_rcu(un, rcu); 1161 1158 } 1162 1159 1163 1160 /* Wake up all pending processes and let them fail with EIDRM. 
*/ ··· 1940 1937 rcu_read_unlock(); 1941 1938 1942 1939 /* step 2: allocate new undo structure */ 1943 - new = kzalloc(sizeof(struct sem_undo) + sizeof(short)*nsems, GFP_KERNEL); 1940 + new = kvzalloc(sizeof(struct sem_undo) + sizeof(short)*nsems, 1941 + GFP_KERNEL); 1944 1942 if (!new) { 1945 1943 ipc_rcu_putref(&sma->sem_perm, sem_rcu_free); 1946 1944 return ERR_PTR(-ENOMEM); ··· 1953 1949 if (!ipc_valid_object(&sma->sem_perm)) { 1954 1950 sem_unlock(sma, -1); 1955 1951 rcu_read_unlock(); 1956 - kfree(new); 1952 + kvfree(new); 1957 1953 un = ERR_PTR(-EIDRM); 1958 1954 goto out; 1959 1955 } ··· 1964 1960 */ 1965 1961 un = lookup_undo(ulp, semid); 1966 1962 if (un) { 1967 - kfree(new); 1963 + kvfree(new); 1968 1964 goto success; 1969 1965 } 1970 1966 /* step 5: initialize & link new undo structure */ ··· 2424 2420 rcu_read_unlock(); 2425 2421 wake_up_q(&wake_q); 2426 2422 2427 - kfree_rcu(un, rcu); 2423 + kvfree_rcu(un, rcu); 2428 2424 } 2429 2425 kfree(ulp); 2430 2426 } ··· 2439 2435 2440 2436 /* 2441 2437 * The proc interface isn't aware of sem_lock(), it calls 2442 - * ipc_lock_object() directly (in sysvipc_find_ipc). 2438 + * ipc_lock_object(), i.e. spin_lock(&sma->sem_perm.lock). 2439 + * (in sysvipc_find_ipc) 2443 2440 * In order to stay compatible with sem_lock(), we must 2444 2441 * enter / leave complex_mode. 2445 2442 */
+3 -3
ipc/shm.c
··· 222 222 struct shmid_kernel *shp = container_of(ptr, struct shmid_kernel, 223 223 shm_perm); 224 224 security_shm_free(&shp->shm_perm); 225 - kvfree(shp); 225 + kfree(shp); 226 226 } 227 227 228 228 static inline void shm_rmid(struct ipc_namespace *ns, struct shmid_kernel *s) ··· 619 619 ns->shm_tot + numpages > ns->shm_ctlall) 620 620 return -ENOSPC; 621 621 622 - shp = kvmalloc(sizeof(*shp), GFP_KERNEL); 622 + shp = kmalloc(sizeof(*shp), GFP_KERNEL); 623 623 if (unlikely(!shp)) 624 624 return -ENOMEM; 625 625 ··· 630 630 shp->shm_perm.security = NULL; 631 631 error = security_shm_alloc(&shp->shm_perm); 632 632 if (error) { 633 - kvfree(shp); 633 + kfree(shp); 634 634 return error; 635 635 } 636 636
+39 -5
ipc/util.c
··· 64 64 #include <linux/memory.h> 65 65 #include <linux/ipc_namespace.h> 66 66 #include <linux/rhashtable.h> 67 + #include <linux/log2.h> 67 68 68 69 #include <asm/unistd.h> 69 70 ··· 452 451 } 453 452 454 453 /** 454 + * ipc_search_maxidx - search for the highest assigned index 455 + * @ids: ipc identifier set 456 + * @limit: known upper limit for highest assigned index 457 + * 458 + * The function determines the highest assigned index in @ids. It is intended 459 + * to be called when ids->max_idx needs to be updated. 460 + * Updating ids->max_idx is necessary when the current highest index ipc 461 + * object is deleted. 462 + * If no ipc object is allocated, then -1 is returned. 463 + * 464 + * ipc_ids.rwsem needs to be held by the caller. 465 + */ 466 + static int ipc_search_maxidx(struct ipc_ids *ids, int limit) 467 + { 468 + int tmpidx; 469 + int i; 470 + int retval; 471 + 472 + i = ilog2(limit+1); 473 + 474 + retval = 0; 475 + for (; i >= 0; i--) { 476 + tmpidx = retval | (1<<i); 477 + /* 478 + * "0" is a possible index value, thus search using 479 + * e.g. 15,7,3,1,0 instead of 16,8,4,2,1. 480 + */ 481 + tmpidx = tmpidx-1; 482 + if (idr_get_next(&ids->ipcs_idr, &tmpidx)) 483 + retval |= (1<<i); 484 + } 485 + return retval - 1; 486 + } 487 + 488 + /** 455 489 * ipc_rmid - remove an ipc identifier 456 490 * @ids: ipc identifier set 457 491 * @ipcp: ipc perm structure containing the identifier to remove ··· 504 468 ipcp->deleted = true; 505 469 506 470 if (unlikely(idx == ids->max_idx)) { 507 - do { 508 - idx--; 509 - if (idx == -1) 510 - break; 511 - } while (!idr_find(&ids->ipcs_idr, idx)); 471 + idx = ids->max_idx-1; 472 + if (idx >= 0) 473 + idx = ipc_search_maxidx(ids, idx); 512 474 ids->max_idx = idx; 513 475 } 514 476 }
+3
ipc/util.h
··· 145 145 * ipc_get_maxidx - get the highest assigned index 146 146 * @ids: ipc identifier set 147 147 * 148 + * The function returns the highest assigned index for @ids. The function 149 + * doesn't scan the idr tree, it uses a cached value. 150 + * 148 151 * Called with ipc_ids.rwsem held for reading. 149 152 */ 150 153 static inline int ipc_get_maxidx(struct ipc_ids *ids)
+1
kernel/hung_task.c
··· 15 15 #include <linux/kthread.h> 16 16 #include <linux/lockdep.h> 17 17 #include <linux/export.h> 18 + #include <linux/panic_notifier.h> 18 19 #include <linux/sysctl.h> 19 20 #include <linux/suspend.h> 20 21 #include <linux/utsname.h>
+1
kernel/kexec_core.c
··· 26 26 #include <linux/suspend.h> 27 27 #include <linux/device.h> 28 28 #include <linux/freezer.h> 29 + #include <linux/panic_notifier.h> 29 30 #include <linux/pm.h> 30 31 #include <linux/cpu.h> 31 32 #include <linux/uaccess.h>
+1 -1
kernel/kprobes.c
··· 106 106 return module_alloc(PAGE_SIZE); 107 107 } 108 108 109 - void __weak free_insn_page(void *page) 109 + static void free_insn_page(void *page) 110 110 { 111 111 module_memfree(page); 112 112 }
+1
kernel/panic.c
··· 23 23 #include <linux/reboot.h> 24 24 #include <linux/delay.h> 25 25 #include <linux/kexec.h> 26 + #include <linux/panic_notifier.h> 26 27 #include <linux/sched.h> 27 28 #include <linux/sysrq.h> 28 29 #include <linux/init.h>
+2
kernel/rcu/tree.c
··· 32 32 #include <linux/export.h> 33 33 #include <linux/completion.h> 34 34 #include <linux/moduleparam.h> 35 + #include <linux/panic.h> 36 + #include <linux/panic_notifier.h> 35 37 #include <linux/percpu.h> 36 38 #include <linux/notifier.h> 37 39 #include <linux/cpu.h>
+4 -10
kernel/signal.c
··· 2830 2830 if (!(ksig->ka.sa.sa_flags & SA_NODEFER)) 2831 2831 sigaddset(&blocked, ksig->sig); 2832 2832 set_current_blocked(&blocked); 2833 + if (current->sas_ss_flags & SS_AUTODISARM) 2834 + sas_ss_reset(current); 2833 2835 tracehook_signal_handler(stepping); 2834 2836 } 2835 2837 ··· 4150 4148 int err = __put_user((void __user *)t->sas_ss_sp, &uss->ss_sp) | 4151 4149 __put_user(t->sas_ss_flags, &uss->ss_flags) | 4152 4150 __put_user(t->sas_ss_size, &uss->ss_size); 4153 - if (err) 4154 - return err; 4155 - if (t->sas_ss_flags & SS_AUTODISARM) 4156 - sas_ss_reset(t); 4157 - return 0; 4151 + return err; 4158 4152 } 4159 4153 4160 4154 #ifdef CONFIG_COMPAT ··· 4205 4207 &uss->ss_sp) | 4206 4208 __put_user(t->sas_ss_flags, &uss->ss_flags) | 4207 4209 __put_user(t->sas_ss_size, &uss->ss_size); 4208 - if (err) 4209 - return err; 4210 - if (t->sas_ss_flags & SS_AUTODISARM) 4211 - sas_ss_reset(t); 4212 - return 0; 4210 + return err; 4213 4211 } 4214 4212 #endif 4215 4213
+2 -2
kernel/sysctl.c
··· 27 27 #include <linux/sysctl.h> 28 28 #include <linux/bitmap.h> 29 29 #include <linux/signal.h> 30 + #include <linux/panic.h> 30 31 #include <linux/printk.h> 31 32 #include <linux/proc_fs.h> 32 33 #include <linux/security.h> ··· 1496 1495 void *buffer, size_t *lenp, loff_t *ppos) 1497 1496 { 1498 1497 int err = 0; 1499 - bool first = 1; 1500 1498 size_t left = *lenp; 1501 1499 unsigned long bitmap_len = table->maxlen; 1502 1500 unsigned long *bitmap = *(unsigned long **) table->data; ··· 1580 1580 } 1581 1581 1582 1582 bitmap_set(tmp_bitmap, val_a, val_b - val_a + 1); 1583 - first = 0; 1584 1583 proc_skip_char(&p, &left, '\n'); 1585 1584 } 1586 1585 left += skipped; 1587 1586 } else { 1588 1587 unsigned long bit_a, bit_b = 0; 1588 + bool first = 1; 1589 1589 1590 1590 while (left) { 1591 1591 bit_a = find_next_bit(bitmap, bitmap_len, bit_b);
+1
kernel/trace/trace.c
··· 39 39 #include <linux/slab.h> 40 40 #include <linux/ctype.h> 41 41 #include <linux/init.h> 42 + #include <linux/panic_notifier.h> 42 43 #include <linux/poll.h> 43 44 #include <linux/nmi.h> 44 45 #include <linux/fs.h>
+12
lib/Kconfig.debug
··· 2446 2446 2447 2447 If unsure, say N. 2448 2448 2449 + config RATIONAL_KUNIT_TEST 2450 + tristate "KUnit test for rational.c" if !KUNIT_ALL_TESTS 2451 + depends on KUNIT 2452 + select RATIONAL 2453 + default KUNIT_ALL_TESTS 2454 + help 2455 + This builds the rational math unit test. 2456 + For more information on KUnit and unit tests in general please refer 2457 + to the KUnit documentation in Documentation/dev-tools/kunit/. 2458 + 2459 + If unsure, say N. 2460 + 2449 2461 config TEST_UDELAY 2450 2462 tristate "udelay test driver" 2451 2463 help
+3 -3
lib/decompress_bunzip2.c
··· 80 80 81 81 /* This is what we know about each Huffman coding group */ 82 82 struct group_data { 83 - /* We have an extra slot at the end of limit[] for a sentinal value. */ 83 + /* We have an extra slot at the end of limit[] for a sentinel value. */ 84 84 int limit[MAX_HUFCODE_BITS+1]; 85 85 int base[MAX_HUFCODE_BITS]; 86 86 int permute[MAX_SYMBOLS]; ··· 337 337 pp <<= 1; 338 338 base[i+1] = pp-(t += temp[i]); 339 339 } 340 - limit[maxLen+1] = INT_MAX; /* Sentinal value for 340 + limit[maxLen+1] = INT_MAX; /* Sentinel value for 341 341 * reading next sym. */ 342 342 limit[maxLen] = pp+temp[maxLen]-1; 343 343 base[minLen] = 0; ··· 385 385 bd->inbufBits = 386 386 (bd->inbufBits << 8)|bd->inbuf[bd->inbufPos++]; 387 387 bd->inbufBitCount += 8; 388 - }; 388 + } 389 389 bd->inbufBitCount -= hufGroup->maxLen; 390 390 j = (bd->inbufBits >> bd->inbufBitCount)& 391 391 ((1 << hufGroup->maxLen)-1);
+8
lib/decompress_unlz4.c
··· 112 112 error("data corrupted"); 113 113 goto exit_2; 114 114 } 115 + } else if (size < 4) { 116 + /* empty or end-of-file */ 117 + goto exit_3; 115 118 } 116 119 117 120 chunksize = get_unaligned_le32(inp); ··· 128 125 continue; 129 126 } 130 127 128 + if (!fill && chunksize == 0) { 129 + /* empty or end-of-file */ 130 + goto exit_3; 131 + } 131 132 132 133 if (posp) 133 134 *posp += 4; ··· 191 184 } 192 185 } 193 186 187 + exit_3: 194 188 ret = 0; 195 189 exit_2: 196 190 if (!input)
+1 -2
lib/decompress_unlzo.c
··· 43 43 int l; 44 44 u8 *parse = input; 45 45 u8 *end = input + in_len; 46 - u8 level = 0; 47 46 u16 version; 48 47 49 48 /* ··· 64 65 version = get_unaligned_be16(parse); 65 66 parse += 7; 66 67 if (version >= 0x0940) 67 - level = *parse++; 68 + parse++; 68 69 if (get_unaligned_be32(parse) & HEADER_HAS_FILTER) 69 70 parse += 8; /* flags + filter info */ 70 71 else
+1 -1
lib/decompress_unxz.c
··· 23 23 * uncompressible. Thus, we must look for worst-case expansion when the 24 24 * compressor is encoding uncompressible data. 25 25 * 26 - * The structure of the .xz file in case of a compresed kernel is as follows. 26 + * The structure of the .xz file in case of a compressed kernel is as follows. 27 27 * Sizes (as bytes) of the fields are in parenthesis. 28 28 * 29 29 * Stream Header (12)
+2 -2
lib/decompress_unzstd.c
··· 16 16 * uncompressible. Thus, we must look for worst-case expansion when the 17 17 * compressor is encoding uncompressible data. 18 18 * 19 - * The structure of the .zst file in case of a compresed kernel is as follows. 19 + * The structure of the .zst file in case of a compressed kernel is as follows. 20 20 * Maximum sizes (as bytes) of the fields are in parenthesis. 21 21 * 22 22 * Frame Header: (18) ··· 56 56 /* 57 57 * Preboot environments #include "path/to/decompress_unzstd.c". 58 58 * All of the source files we depend on must be #included. 59 - * zstd's only source dependeny is xxhash, which has no source 59 + * zstd's only source dependency is xxhash, which has no source 60 60 * dependencies. 61 61 * 62 62 * When UNZSTD_PREBOOT is defined we declare __decompress(), which is
+3 -2
lib/kstrtox.c
··· 14 14 */ 15 15 #include <linux/ctype.h> 16 16 #include <linux/errno.h> 17 - #include <linux/kernel.h> 18 - #include <linux/math64.h> 19 17 #include <linux/export.h> 18 + #include <linux/kstrtox.h> 19 + #include <linux/math64.h> 20 20 #include <linux/types.h> 21 21 #include <linux/uaccess.h> 22 + 22 23 #include "kstrtox.h" 23 24 24 25 const char *_parse_integer_fixup_radix(const char *s, unsigned int *base)
+1 -1
lib/lz4/lz4_decompress.c
··· 481 481 482 482 /* ===== Instantiate a few more decoding cases, used more than once. ===== */ 483 483 484 - int LZ4_decompress_safe_withPrefix64k(const char *source, char *dest, 484 + static int LZ4_decompress_safe_withPrefix64k(const char *source, char *dest, 485 485 int compressedSize, int maxOutputSize) 486 486 { 487 487 return LZ4_decompress_generic(source, dest,
+1
lib/math/Makefile
··· 6 6 obj-$(CONFIG_RATIONAL) += rational.o 7 7 8 8 obj-$(CONFIG_TEST_DIV64) += test_div64.o 9 + obj-$(CONFIG_RATIONAL_KUNIT_TEST) += rational-test.o
+56
lib/math/rational-test.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + 3 + #include <kunit/test.h> 4 + 5 + #include <linux/rational.h> 6 + 7 + struct rational_test_param { 8 + unsigned long num, den; 9 + unsigned long max_num, max_den; 10 + unsigned long exp_num, exp_den; 11 + 12 + const char *name; 13 + }; 14 + 15 + static const struct rational_test_param test_parameters[] = { 16 + { 1230, 10, 100, 20, 100, 1, "Exceeds bounds, semi-convergent term > 1/2 last term" }, 17 + { 34567,100, 120, 20, 120, 1, "Exceeds bounds, semi-convergent term < 1/2 last term" }, 18 + { 1, 30, 100, 10, 0, 1, "Closest to zero" }, 19 + { 1, 19, 100, 10, 1, 10, "Closest to smallest non-zero" }, 20 + { 27,32, 16, 16, 11, 13, "Use convergent" }, 21 + { 1155, 7735, 255, 255, 33, 221, "Exact answer" }, 22 + { 87, 32, 70, 32, 68, 25, "Semiconvergent, numerator limit" }, 23 + { 14533, 4626, 15000, 2400, 7433, 2366, "Semiconvergent, denominator limit" }, 24 + }; 25 + 26 + static void get_desc(const struct rational_test_param *param, char *desc) 27 + { 28 + strscpy(desc, param->name, KUNIT_PARAM_DESC_SIZE); 29 + } 30 + 31 + /* Creates function rational_gen_params */ 32 + KUNIT_ARRAY_PARAM(rational, test_parameters, get_desc); 33 + 34 + static void rational_test(struct kunit *test) 35 + { 36 + const struct rational_test_param *param = (const struct rational_test_param *)test->param_value; 37 + unsigned long n = 0, d = 0; 38 + 39 + rational_best_approximation(param->num, param->den, param->max_num, param->max_den, &n, &d); 40 + KUNIT_EXPECT_EQ(test, n, param->exp_num); 41 + KUNIT_EXPECT_EQ(test, d, param->exp_den); 42 + } 43 + 44 + static struct kunit_case rational_test_cases[] = { 45 + KUNIT_CASE_PARAM(rational_test, rational_gen_params), 46 + {} 47 + }; 48 + 49 + static struct kunit_suite rational_test_suite = { 50 + .name = "rational", 51 + .test_cases = rational_test_cases, 52 + }; 53 + 54 + kunit_test_suites(&rational_test_suite); 55 + 56 + MODULE_LICENSE("GPL v2");
+11 -5
lib/math/rational.c
··· 12 12 #include <linux/compiler.h> 13 13 #include <linux/export.h> 14 14 #include <linux/minmax.h> 15 + #include <linux/limits.h> 15 16 16 17 /* 17 18 * calculate best rational approximation for a given fraction ··· 79 78 * found below as 't'. 80 79 */ 81 80 if ((n2 > max_numerator) || (d2 > max_denominator)) { 82 - unsigned long t = min((max_numerator - n0) / n1, 83 - (max_denominator - d0) / d1); 81 + unsigned long t = ULONG_MAX; 84 82 85 - /* This tests if the semi-convergent is closer 86 - * than the previous convergent. 83 + if (d1) 84 + t = (max_denominator - d0) / d1; 85 + if (n1) 86 + t = min(t, (max_numerator - n0) / n1); 87 + 88 + /* This tests if the semi-convergent is closer than the previous 89 + * convergent. If d1 is zero there is no previous convergent as this 90 + * is the 1st iteration, so always choose the semi-convergent. 87 91 */ 88 - if (2u * t > a || (2u * t == a && d0 * dp > d1 * d)) { 92 + if (!d1 || 2u * t > a || (2u * t == a && d0 * dp > d1 * d)) { 89 93 n1 = n0 + t * n1; 90 94 d1 = d0 + t * d1; 91 95 }
+2 -2
lib/mpi/longlong.h
··· 48 48 49 49 /* Define auxiliary asm macros. 50 50 * 51 - * 1) umul_ppmm(high_prod, low_prod, multipler, multiplicand) multiplies two 52 - * UWtype integers MULTIPLER and MULTIPLICAND, and generates a two UWtype 51 + * 1) umul_ppmm(high_prod, low_prod, multiplier, multiplicand) multiplies two 52 + * UWtype integers MULTIPLIER and MULTIPLICAND, and generates a two UWtype 53 53 * word product in HIGH_PROD and LOW_PROD. 54 54 * 55 55 * 2) __umulsidi3(a,b) multiplies two UWtype integers A and B, and returns a
+3 -3
lib/mpi/mpicoder.c
··· 234 234 } 235 235 236 236 /** 237 - * mpi_read_buffer() - read MPI to a bufer provided by user (msb first) 237 + * mpi_read_buffer() - read MPI to a buffer provided by user (msb first) 238 238 * 239 239 * @a: a multi precision integer 240 - * @buf: bufer to which the output will be written to. Needs to be at 241 - * leaset mpi_get_size(a) long. 240 + * @buf: buffer to which the output will be written to. Needs to be at 241 + * least mpi_get_size(a) long. 242 242 * @buf_len: size of the buf. 243 243 * @nbytes: receives the actual length of the data written on success and 244 244 * the data to-be-written on -EOVERFLOW in case buf_len was too
+1 -1
lib/mpi/mpiutil.c
··· 80 80 /**************** 81 81 * Note: It was a bad idea to use the number of limbs to allocate 82 82 * because on a alpha the limbs are large but we normally need 83 - * integers of n bits - So we should chnage this to bits (or bytes). 83 + * integers of n bits - So we should change this to bits (or bytes). 84 84 * 85 85 * But mpi_alloc is used in a lot of places :-) 86 86 */
+1
lib/parser.c
··· 6 6 #include <linux/ctype.h> 7 7 #include <linux/types.h> 8 8 #include <linux/export.h> 9 + #include <linux/kstrtox.h> 9 10 #include <linux/parser.h> 10 11 #include <linux/slab.h> 11 12 #include <linux/string.h>
+1 -1
lib/string.c
··· 977 977 unsigned char *p = addr; 978 978 979 979 while (size) { 980 - if (*p == c) 980 + if (*p == (unsigned char)c) 981 981 return (void *)p; 982 982 p++; 983 983 size--;
+63 -43
lib/string_helpers.c
··· 452 452 * The process of escaping byte buffer includes several parts. They are applied 453 453 * in the following sequence. 454 454 * 455 - * 1. The character is matched to the printable class, if asked, and in 456 - * case of match it passes through to the output. 457 - * 2. The character is not matched to the one from @only string and thus 455 + * 1. The character is not matched to the one from @only string and thus 458 456 * must go as-is to the output. 459 - * 3. The character is checked if it falls into the class given by @flags. 457 + * 2. The character is matched to the printable and ASCII classes, if asked, 458 + * and in case of match it passes through to the output. 459 + * 3. The character is matched to the printable or ASCII class, if asked, 460 + * and in case of match it passes through to the output. 461 + * 4. The character is checked if it falls into the class given by @flags. 460 462 * %ESCAPE_OCTAL and %ESCAPE_HEX are going last since they cover any 461 463 * character. Note that they actually can't go together, otherwise 462 464 * %ESCAPE_HEX will be ignored. 463 465 * 464 466 * Caller must provide valid source and destination pointers. Be aware that 465 467 * destination buffer will not be NULL-terminated, thus caller have to append 466 - * it if needs. The supported flags are:: 468 + * it if needs. 
The supported flags are:: 467 469 * 468 470 * %ESCAPE_SPACE: (special white space, not space itself) 469 471 * '\f' - form feed ··· 484 482 * %ESCAPE_ANY: 485 483 * all previous together 486 484 * %ESCAPE_NP: 487 - * escape only non-printable characters (checked by isprint) 485 + * escape only non-printable characters, checked by isprint() 488 486 * %ESCAPE_ANY_NP: 489 487 * all previous together 490 488 * %ESCAPE_HEX: 491 489 * '\xHH' - byte with hexadecimal value HH (2 digits) 490 + * %ESCAPE_NA: 491 + * escape only non-ascii characters, checked by isascii() 492 + * %ESCAPE_NAP: 493 + * escape only non-printable or non-ascii characters 494 + * %ESCAPE_APPEND: 495 + * append characters from @only to be escaped by the given classes 496 + * 497 + * %ESCAPE_APPEND would help to pass additional characters to the escaped, when 498 + * one of %ESCAPE_NP, %ESCAPE_NA, or %ESCAPE_NAP is provided. 499 + * 500 + * One notable caveat, the %ESCAPE_NAP, %ESCAPE_NP and %ESCAPE_NA have the 501 + * higher priority than the rest of the flags (%ESCAPE_NAP is the highest). 502 + * It doesn't make much sense to use either of them without %ESCAPE_OCTAL 503 + * or %ESCAPE_HEX, because they cover most of the other character classes. 504 + * %ESCAPE_NAP can utilize %ESCAPE_SPACE or %ESCAPE_SPECIAL in addition to 505 + * the above. 
492 506 * 493 507 * Return: 494 508 * The total size of the escaped output that would be generated for ··· 518 500 char *p = dst; 519 501 char *end = p + osz; 520 502 bool is_dict = only && *only; 503 + bool is_append = flags & ESCAPE_APPEND; 521 504 522 505 while (isz--) { 523 506 unsigned char c = *src++; 507 + bool in_dict = is_dict && strchr(only, c); 524 508 525 509 /* 526 510 * Apply rules in the following sequence: 527 - * - the character is printable, when @flags has 528 - * %ESCAPE_NP bit set 529 511 * - the @only string is supplied and does not contain a 530 512 * character under question 513 + * - the character is printable and ASCII, when @flags has 514 + * %ESCAPE_NAP bit set 515 + * - the character is printable, when @flags has 516 + * %ESCAPE_NP bit set 517 + * - the character is ASCII, when @flags has 518 + * %ESCAPE_NA bit set 531 519 * - the character doesn't fall into a class of symbols 532 520 * defined by given @flags 533 521 * In these cases we just pass through a character to the 534 522 * output buffer. 523 + * 524 + * When %ESCAPE_APPEND is passed, the characters from @only 525 + * have been excluded from the %ESCAPE_NAP, %ESCAPE_NP, and 526 + * %ESCAPE_NA cases. 
535 527 */ 536 - if ((flags & ESCAPE_NP && isprint(c)) || 537 - (is_dict && !strchr(only, c))) { 538 - /* do nothing */ 539 - } else { 540 - if (flags & ESCAPE_SPACE && escape_space(c, &p, end)) 541 - continue; 528 + if (!(is_append || in_dict) && is_dict && 529 + escape_passthrough(c, &p, end)) 530 + continue; 542 531 543 - if (flags & ESCAPE_SPECIAL && escape_special(c, &p, end)) 544 - continue; 532 + if (!(is_append && in_dict) && isascii(c) && isprint(c) && 533 + flags & ESCAPE_NAP && escape_passthrough(c, &p, end)) 534 + continue; 545 535 546 - if (flags & ESCAPE_NULL && escape_null(c, &p, end)) 547 - continue; 536 + if (!(is_append && in_dict) && isprint(c) && 537 + flags & ESCAPE_NP && escape_passthrough(c, &p, end)) 538 + continue; 548 539 549 - /* ESCAPE_OCTAL and ESCAPE_HEX always go last */ 550 - if (flags & ESCAPE_OCTAL && escape_octal(c, &p, end)) 551 - continue; 540 + if (!(is_append && in_dict) && isascii(c) && 541 + flags & ESCAPE_NA && escape_passthrough(c, &p, end)) 542 + continue; 552 543 553 - if (flags & ESCAPE_HEX && escape_hex(c, &p, end)) 554 - continue; 555 - } 544 + if (flags & ESCAPE_SPACE && escape_space(c, &p, end)) 545 + continue; 546 + 547 + if (flags & ESCAPE_SPECIAL && escape_special(c, &p, end)) 548 + continue; 549 + 550 + if (flags & ESCAPE_NULL && escape_null(c, &p, end)) 551 + continue; 552 + 553 + /* ESCAPE_OCTAL and ESCAPE_HEX always go last */ 554 + if (flags & ESCAPE_OCTAL && escape_octal(c, &p, end)) 555 + continue; 556 + 557 + if (flags & ESCAPE_HEX && escape_hex(c, &p, end)) 558 + continue; 556 559 557 560 escape_passthrough(c, &p, end); 558 561 } ··· 581 542 return p - dst; 582 543 } 583 544 EXPORT_SYMBOL(string_escape_mem); 584 - 585 - int string_escape_mem_ascii(const char *src, size_t isz, char *dst, 586 - size_t osz) 587 - { 588 - char *p = dst; 589 - char *end = p + osz; 590 - 591 - while (isz--) { 592 - unsigned char c = *src++; 593 - 594 - if (!isprint(c) || !isascii(c) || c == '"' || c == '\\') 595 - 
escape_hex(c, &p, end); 596 - else 597 - escape_passthrough(c, &p, end); 598 - } 599 - 600 - return p - dst; 601 - } 602 - EXPORT_SYMBOL(string_escape_mem_ascii); 603 545 604 546 /* 605 547 * Return an allocated string that has been escaped of special characters
+145 -20
lib/test-string_helpers.c
··· 19 19 if (q_real == q_test && !memcmp(out_test, out_real, q_test)) 20 20 return true; 21 21 22 - pr_warn("Test '%s' failed: flags = %u\n", name, flags); 22 + pr_warn("Test '%s' failed: flags = %#x\n", name, flags); 23 23 24 24 print_hex_dump(KERN_WARNING, "Input: ", DUMP_PREFIX_NONE, 16, 1, 25 25 in, p, true); ··· 136 136 .flags = ESCAPE_SPACE | ESCAPE_HEX, 137 137 },{ 138 138 /* terminator */ 139 - }}, 139 + }} 140 140 },{ 141 141 .in = "\\h\\\"\a\e\\", 142 142 .s1 = {{ ··· 150 150 .flags = ESCAPE_SPECIAL | ESCAPE_HEX, 151 151 },{ 152 152 /* terminator */ 153 - }}, 153 + }} 154 154 },{ 155 155 .in = "\eb \\C\007\"\x90\r]", 156 156 .s1 = {{ ··· 201 201 .flags = ESCAPE_NP | ESCAPE_HEX, 202 202 },{ 203 203 /* terminator */ 204 - }}, 204 + }} 205 + },{ 206 + .in = "\007 \eb\"\x90\xCF\r", 207 + .s1 = {{ 208 + .out = "\007 \eb\"\\220\\317\r", 209 + .flags = ESCAPE_OCTAL | ESCAPE_NA, 210 + },{ 211 + .out = "\007 \eb\"\\x90\\xcf\r", 212 + .flags = ESCAPE_HEX | ESCAPE_NA, 213 + },{ 214 + .out = "\007 \eb\"\x90\xCF\r", 215 + .flags = ESCAPE_NA, 216 + },{ 217 + /* terminator */ 218 + }} 205 219 },{ 206 220 /* terminator */ 207 221 }}; 208 222 209 - #define TEST_STRING_2_DICT_1 "b\\ \t\r" 223 + #define TEST_STRING_2_DICT_1 "b\\ \t\r\xCF" 210 224 static const struct test_string_2 escape1[] __initconst = {{ 211 225 .in = "\f\\ \n\r\t\v", 212 226 .s1 = {{ ··· 230 216 .out = "\f\\x5c\\x20\n\\x0d\\x09\v", 231 217 .flags = ESCAPE_HEX, 232 218 },{ 233 - /* terminator */ 234 - }}, 235 - },{ 236 - .in = "\\h\\\"\a\e\\", 237 - .s1 = {{ 238 - .out = "\\134h\\134\"\a\e\\134", 239 - .flags = ESCAPE_OCTAL, 219 + .out = "\f\\134\\040\n\\015\\011\v", 220 + .flags = ESCAPE_ANY | ESCAPE_APPEND, 221 + },{ 222 + .out = "\\014\\134\\040\\012\\015\\011\\013", 223 + .flags = ESCAPE_OCTAL | ESCAPE_APPEND | ESCAPE_NAP, 224 + },{ 225 + .out = "\\x0c\\x5c\\x20\\x0a\\x0d\\x09\\x0b", 226 + .flags = ESCAPE_HEX | ESCAPE_APPEND | ESCAPE_NAP, 227 + },{ 228 + .out = "\f\\134\\040\n\\015\\011\v", 229 + 
.flags = ESCAPE_OCTAL | ESCAPE_APPEND | ESCAPE_NA, 230 + },{ 231 + .out = "\f\\x5c\\x20\n\\x0d\\x09\v", 232 + .flags = ESCAPE_HEX | ESCAPE_APPEND | ESCAPE_NA, 240 233 },{ 241 234 /* terminator */ 242 - }}, 235 + }} 236 + },{ 237 + .in = "\\h\\\"\a\xCF\e\\", 238 + .s1 = {{ 239 + .out = "\\134h\\134\"\a\\317\e\\134", 240 + .flags = ESCAPE_OCTAL, 241 + },{ 242 + .out = "\\134h\\134\"\a\\317\e\\134", 243 + .flags = ESCAPE_ANY | ESCAPE_APPEND, 244 + },{ 245 + .out = "\\134h\\134\"\\007\\317\\033\\134", 246 + .flags = ESCAPE_OCTAL | ESCAPE_APPEND | ESCAPE_NAP, 247 + },{ 248 + .out = "\\134h\\134\"\a\\317\e\\134", 249 + .flags = ESCAPE_OCTAL | ESCAPE_APPEND | ESCAPE_NA, 250 + },{ 251 + /* terminator */ 252 + }} 243 253 },{ 244 254 .in = "\eb \\C\007\"\x90\r]", 245 255 .s1 = {{ ··· 271 233 .flags = ESCAPE_OCTAL, 272 234 },{ 273 235 /* terminator */ 274 - }}, 236 + }} 237 + },{ 238 + .in = "\007 \eb\"\x90\xCF\r", 239 + .s1 = {{ 240 + .out = "\007 \eb\"\x90\xCF\r", 241 + .flags = ESCAPE_NA, 242 + },{ 243 + .out = "\007 \eb\"\x90\xCF\r", 244 + .flags = ESCAPE_SPACE | ESCAPE_NA, 245 + },{ 246 + .out = "\007 \eb\"\x90\xCF\r", 247 + .flags = ESCAPE_SPECIAL | ESCAPE_NA, 248 + },{ 249 + .out = "\007 \eb\"\x90\xCF\r", 250 + .flags = ESCAPE_SPACE | ESCAPE_SPECIAL | ESCAPE_NA, 251 + },{ 252 + .out = "\007 \eb\"\x90\\317\r", 253 + .flags = ESCAPE_OCTAL | ESCAPE_NA, 254 + },{ 255 + .out = "\007 \eb\"\x90\\317\r", 256 + .flags = ESCAPE_SPACE | ESCAPE_OCTAL | ESCAPE_NA, 257 + },{ 258 + .out = "\007 \eb\"\x90\\317\r", 259 + .flags = ESCAPE_SPECIAL | ESCAPE_OCTAL | ESCAPE_NA, 260 + },{ 261 + .out = "\007 \eb\"\x90\\317\r", 262 + .flags = ESCAPE_ANY | ESCAPE_NA, 263 + },{ 264 + .out = "\007 \eb\"\x90\\xcf\r", 265 + .flags = ESCAPE_HEX | ESCAPE_NA, 266 + },{ 267 + .out = "\007 \eb\"\x90\\xcf\r", 268 + .flags = ESCAPE_SPACE | ESCAPE_HEX | ESCAPE_NA, 269 + },{ 270 + .out = "\007 \eb\"\x90\\xcf\r", 271 + .flags = ESCAPE_SPECIAL | ESCAPE_HEX | ESCAPE_NA, 272 + },{ 273 + .out = "\007 
\eb\"\x90\\xcf\r", 274 + .flags = ESCAPE_SPACE | ESCAPE_SPECIAL | ESCAPE_HEX | ESCAPE_NA, 275 + },{ 276 + /* terminator */ 277 + }} 278 + },{ 279 + .in = "\007 \eb\"\x90\xCF\r", 280 + .s1 = {{ 281 + .out = "\007 \eb\"\x90\xCF\r", 282 + .flags = ESCAPE_NAP, 283 + },{ 284 + .out = "\007 \eb\"\x90\xCF\\r", 285 + .flags = ESCAPE_SPACE | ESCAPE_NAP, 286 + },{ 287 + .out = "\007 \eb\"\x90\xCF\r", 288 + .flags = ESCAPE_SPECIAL | ESCAPE_NAP, 289 + },{ 290 + .out = "\007 \eb\"\x90\xCF\\r", 291 + .flags = ESCAPE_SPACE | ESCAPE_SPECIAL | ESCAPE_NAP, 292 + },{ 293 + .out = "\007 \eb\"\x90\\317\\015", 294 + .flags = ESCAPE_OCTAL | ESCAPE_NAP, 295 + },{ 296 + .out = "\007 \eb\"\x90\\317\\r", 297 + .flags = ESCAPE_SPACE | ESCAPE_OCTAL | ESCAPE_NAP, 298 + },{ 299 + .out = "\007 \eb\"\x90\\317\\015", 300 + .flags = ESCAPE_SPECIAL | ESCAPE_OCTAL | ESCAPE_NAP, 301 + },{ 302 + .out = "\007 \eb\"\x90\\317\r", 303 + .flags = ESCAPE_ANY | ESCAPE_NAP, 304 + },{ 305 + .out = "\007 \eb\"\x90\\xcf\\x0d", 306 + .flags = ESCAPE_HEX | ESCAPE_NAP, 307 + },{ 308 + .out = "\007 \eb\"\x90\\xcf\\r", 309 + .flags = ESCAPE_SPACE | ESCAPE_HEX | ESCAPE_NAP, 310 + },{ 311 + .out = "\007 \eb\"\x90\\xcf\\x0d", 312 + .flags = ESCAPE_SPECIAL | ESCAPE_HEX | ESCAPE_NAP, 313 + },{ 314 + .out = "\007 \eb\"\x90\\xcf\\r", 315 + .flags = ESCAPE_SPACE | ESCAPE_SPECIAL | ESCAPE_HEX | ESCAPE_NAP, 316 + },{ 317 + /* terminator */ 318 + }} 275 319 },{ 276 320 /* terminator */ 277 321 }}; ··· 410 290 411 291 q_real = string_escape_mem(in, p, NULL, 0, flags, esc); 412 292 if (q_real != q_test) 413 - pr_warn("Test '%s' failed: flags = %u, osz = 0, expected %d, got %d\n", 293 + pr_warn("Test '%s' failed: flags = %#x, osz = 0, expected %d, got %d\n", 414 294 name, flags, q_test, q_real); 415 295 } 416 296 ··· 435 315 /* NULL injection */ 436 316 if (flags & ESCAPE_NULL) { 437 317 in[p++] = '\0'; 438 - out_test[q_test++] = '\\'; 439 - out_test[q_test++] = '0'; 318 + /* '\0' passes isascii() test */ 319 + if (flags & ESCAPE_NA 
&& !(flags & ESCAPE_APPEND && esc)) { 320 + out_test[q_test++] = '\0'; 321 + } else { 322 + out_test[q_test++] = '\\'; 323 + out_test[q_test++] = '0'; 324 + } 440 325 } 441 326 442 327 /* Don't try strings that have no output */ ··· 584 459 unsigned int i; 585 460 586 461 pr_info("Running tests...\n"); 587 - for (i = 0; i < UNESCAPE_ANY + 1; i++) 462 + for (i = 0; i < UNESCAPE_ALL_MASK + 1; i++) 588 463 test_string_unescape("unescape", i, false); 589 464 test_string_unescape("unescape inplace", 590 465 get_random_int() % (UNESCAPE_ANY + 1), true); 591 466 592 467 /* Without dictionary */ 593 - for (i = 0; i < (ESCAPE_ANY_NP | ESCAPE_HEX) + 1; i++) 468 + for (i = 0; i < ESCAPE_ALL_MASK + 1; i++) 594 469 test_string_escape("escape 0", escape0, i, TEST_STRING_2_DICT_0); 595 470 596 471 /* With dictionary */ 597 - for (i = 0; i < (ESCAPE_ANY_NP | ESCAPE_HEX) + 1; i++) 472 + for (i = 0; i < ESCAPE_ALL_MASK + 1; i++) 598 473 test_string_escape("escape 1", escape1, i, TEST_STRING_2_DICT_1); 599 474 600 475 /* Test string_get_size() */
+126 -1
lib/test_hmm.c
··· 25 25 #include <linux/swapops.h> 26 26 #include <linux/sched/mm.h> 27 27 #include <linux/platform_device.h> 28 + #include <linux/rmap.h> 28 29 29 30 #include "test_hmm_uapi.h" 30 31 ··· 47 46 unsigned long cpages; 48 47 }; 49 48 49 + #define DPT_XA_TAG_ATOMIC 1UL 50 50 #define DPT_XA_TAG_WRITE 3UL 51 51 52 52 /* ··· 220 218 * the invalidation is handled as part of the migration process. 221 219 */ 222 220 if (range->event == MMU_NOTIFY_MIGRATE && 223 - range->migrate_pgmap_owner == dmirror->mdevice) 221 + range->owner == dmirror->mdevice) 224 222 return true; 225 223 226 224 if (mmu_notifier_range_blockable(range)) ··· 621 619 } 622 620 } 623 621 622 + static int dmirror_check_atomic(struct dmirror *dmirror, unsigned long start, 623 + unsigned long end) 624 + { 625 + unsigned long pfn; 626 + 627 + for (pfn = start >> PAGE_SHIFT; pfn < (end >> PAGE_SHIFT); pfn++) { 628 + void *entry; 629 + struct page *page; 630 + 631 + entry = xa_load(&dmirror->pt, pfn); 632 + page = xa_untag_pointer(entry); 633 + if (xa_pointer_tag(entry) == DPT_XA_TAG_ATOMIC) 634 + return -EPERM; 635 + } 636 + 637 + return 0; 638 + } 639 + 640 + static int dmirror_atomic_map(unsigned long start, unsigned long end, 641 + struct page **pages, struct dmirror *dmirror) 642 + { 643 + unsigned long pfn, mapped = 0; 644 + int i; 645 + 646 + /* Map the migrated pages into the device's page tables. 
*/ 647 + mutex_lock(&dmirror->mutex); 648 + 649 + for (i = 0, pfn = start >> PAGE_SHIFT; pfn < (end >> PAGE_SHIFT); pfn++, i++) { 650 + void *entry; 651 + 652 + if (!pages[i]) 653 + continue; 654 + 655 + entry = pages[i]; 656 + entry = xa_tag_pointer(entry, DPT_XA_TAG_ATOMIC); 657 + entry = xa_store(&dmirror->pt, pfn, entry, GFP_ATOMIC); 658 + if (xa_is_err(entry)) { 659 + mutex_unlock(&dmirror->mutex); 660 + return xa_err(entry); 661 + } 662 + 663 + mapped++; 664 + } 665 + 666 + mutex_unlock(&dmirror->mutex); 667 + return mapped; 668 + } 669 + 624 670 static int dmirror_migrate_finalize_and_map(struct migrate_vma *args, 625 671 struct dmirror *dmirror) 626 672 { ··· 709 659 710 660 mutex_unlock(&dmirror->mutex); 711 661 return 0; 662 + } 663 + 664 + static int dmirror_exclusive(struct dmirror *dmirror, 665 + struct hmm_dmirror_cmd *cmd) 666 + { 667 + unsigned long start, end, addr; 668 + unsigned long size = cmd->npages << PAGE_SHIFT; 669 + struct mm_struct *mm = dmirror->notifier.mm; 670 + struct page *pages[64]; 671 + struct dmirror_bounce bounce; 672 + unsigned long next; 673 + int ret; 674 + 675 + start = cmd->addr; 676 + end = start + size; 677 + if (end < start) 678 + return -EINVAL; 679 + 680 + /* Since the mm is for the mirrored process, get a reference first. 
*/ 681 + if (!mmget_not_zero(mm)) 682 + return -EINVAL; 683 + 684 + mmap_read_lock(mm); 685 + for (addr = start; addr < end; addr = next) { 686 + unsigned long mapped; 687 + int i; 688 + 689 + if (end < addr + (ARRAY_SIZE(pages) << PAGE_SHIFT)) 690 + next = end; 691 + else 692 + next = addr + (ARRAY_SIZE(pages) << PAGE_SHIFT); 693 + 694 + ret = make_device_exclusive_range(mm, addr, next, pages, NULL); 695 + mapped = dmirror_atomic_map(addr, next, pages, dmirror); 696 + for (i = 0; i < ret; i++) { 697 + if (pages[i]) { 698 + unlock_page(pages[i]); 699 + put_page(pages[i]); 700 + } 701 + } 702 + 703 + if (addr + (mapped << PAGE_SHIFT) < next) { 704 + mmap_read_unlock(mm); 705 + mmput(mm); 706 + return -EBUSY; 707 + } 708 + } 709 + mmap_read_unlock(mm); 710 + mmput(mm); 711 + 712 + /* Return the migrated data for verification. */ 713 + ret = dmirror_bounce_init(&bounce, start, size); 714 + if (ret) 715 + return ret; 716 + mutex_lock(&dmirror->mutex); 717 + ret = dmirror_do_read(dmirror, start, end, &bounce); 718 + mutex_unlock(&dmirror->mutex); 719 + if (ret == 0) { 720 + if (copy_to_user(u64_to_user_ptr(cmd->ptr), bounce.ptr, 721 + bounce.size)) 722 + ret = -EFAULT; 723 + } 724 + 725 + cmd->cpages = bounce.cpages; 726 + dmirror_bounce_fini(&bounce); 727 + return ret; 712 728 } 713 729 714 730 static int dmirror_migrate(struct dmirror *dmirror, ··· 1062 946 1063 947 case HMM_DMIRROR_MIGRATE: 1064 948 ret = dmirror_migrate(dmirror, &cmd); 949 + break; 950 + 951 + case HMM_DMIRROR_EXCLUSIVE: 952 + ret = dmirror_exclusive(dmirror, &cmd); 953 + break; 954 + 955 + case HMM_DMIRROR_CHECK_EXCLUSIVE: 956 + ret = dmirror_check_atomic(dmirror, cmd.addr, 957 + cmd.addr + (cmd.npages << PAGE_SHIFT)); 1065 958 break; 1066 959 1067 960 case HMM_DMIRROR_SNAPSHOT:
+2
lib/test_hmm_uapi.h
··· 33 33 #define HMM_DMIRROR_WRITE _IOWR('H', 0x01, struct hmm_dmirror_cmd) 34 34 #define HMM_DMIRROR_MIGRATE _IOWR('H', 0x02, struct hmm_dmirror_cmd) 35 35 #define HMM_DMIRROR_SNAPSHOT _IOWR('H', 0x03, struct hmm_dmirror_cmd) 36 + #define HMM_DMIRROR_EXCLUSIVE _IOWR('H', 0x04, struct hmm_dmirror_cmd) 37 + #define HMM_DMIRROR_CHECK_EXCLUSIVE _IOWR('H', 0x05, struct hmm_dmirror_cmd) 36 38 37 39 /* 38 40 * Values returned in hmm_dmirror_cmd.ptr for HMM_DMIRROR_SNAPSHOT.
+5
lib/test_string.c
··· 179 179 return 0; 180 180 } 181 181 182 + static __exit void string_selftest_remove(void) 183 + { 184 + } 185 + 182 186 static __init int string_selftest_init(void) 183 187 { 184 188 int test, subtest; ··· 220 216 } 221 217 222 218 module_init(string_selftest_init); 219 + module_exit(string_selftest_remove); 223 220 MODULE_LICENSE("GPL v2");
+1
lib/vsprintf.c
··· 86 86 * 87 87 * This function has caveats. Please use kstrtoull instead. 88 88 */ 89 + noinline 89 90 unsigned long long simple_strtoull(const char *cp, char **endp, unsigned int base) 90 91 { 91 92 return simple_strntoull(cp, INT_MAX, endp, base);
+1 -1
lib/xz/xz_dec_bcj.c
··· 422 422 423 423 /* 424 424 * Flush pending already filtered data to the output buffer. Return 425 - * immediatelly if we couldn't flush everything, or if the next 425 + * immediately if we couldn't flush everything, or if the next 426 426 * filter in the chain had already returned XZ_STREAM_END. 427 427 */ 428 428 if (s->temp.filtered > 0) {
+4 -4
lib/xz/xz_dec_lzma2.c
··· 147 147 148 148 /* 149 149 * LZMA properties or related bit masks (number of literal 150 - * context bits, a mask dervied from the number of literal 151 - * position bits, and a mask dervied from the number 150 + * context bits, a mask derived from the number of literal 151 + * position bits, and a mask derived from the number 152 152 * position bits) 153 153 */ 154 154 uint32_t lc; ··· 484 484 } 485 485 486 486 /* 487 - * Decode one bit. In some versions, this function has been splitted in three 487 + * Decode one bit. In some versions, this function has been split in three 488 488 * functions so that the compiler is supposed to be able to more easily avoid 489 489 * an extra branch. In this particular version of the LZMA decoder, this 490 490 * doesn't seem to be a good idea (tested with GCC 3.3.6, 3.4.6, and 4.3.3 ··· 761 761 } 762 762 763 763 /* 764 - * Reset the LZMA decoder and range decoder state. Dictionary is nore reset 764 + * Reset the LZMA decoder and range decoder state. Dictionary is not reset 765 765 * here, because LZMA state may be reset without resetting the dictionary. 766 766 */ 767 767 static void lzma_reset(struct xz_dec_lzma2 *s)
+1 -1
lib/zlib_inflate/inffast.c
··· 15 15 unsigned char b[2]; 16 16 }; 17 17 18 - /* Endian independed version */ 18 + /* Endian independent version */ 19 19 static inline unsigned short 20 20 get_unaligned16(const unsigned short *p) 21 21 {
+1 -1
lib/zstd/huf.h
··· 134 134 HUF_repeat_none, /**< Cannot use the previous table */ 135 135 HUF_repeat_check, /**< Can use the previous table but it must be checked. Note : The previous table must have been constructed by HUF_compress{1, 136 136 4}X_repeat */ 137 - HUF_repeat_valid /**< Can use the previous table and it is asumed to be valid */ 137 + HUF_repeat_valid /**< Can use the previous table and it is assumed to be valid */ 138 138 } HUF_repeat; 139 139 /** HUF_compress4X_repeat() : 140 140 * Same as HUF_compress4X_wksp(), but considers using hufTable if *repeat != HUF_repeat_none.
+16
mm/Kconfig
··· 96 96 depends on MMU 97 97 bool 98 98 99 + config HOLES_IN_ZONE 100 + bool 101 + 99 102 # Don't discard allocated memory used to track "memory" and "reserved" memblocks 100 103 # after early boot, so it can still be used to test for validity of memory. 101 104 # Also, memblocks are updated with memory hot(un)plug. ··· 674 671 675 672 config ZBUD 676 673 tristate "Low (Up to 2x) density storage for compressed pages" 674 + depends on ZPOOL 677 675 help 678 676 A special purpose allocator for storing compressed pages. 679 677 It is designed to store up to two compressed pages per physical ··· 760 756 761 757 config ARCH_HAS_PTE_DEVMAP 762 758 bool 759 + 760 + config ARCH_HAS_ZONE_DMA_SET 761 + bool 762 + 763 + config ZONE_DMA 764 + bool "Support DMA zone" if ARCH_HAS_ZONE_DMA_SET 765 + default y if ARM64 || X86 766 + 767 + config ZONE_DMA32 768 + bool "Support DMA32 zone" if ARCH_HAS_ZONE_DMA_SET 769 + depends on !X86_32 770 + default y if ARM64 763 771 764 772 config ZONE_DEVICE 765 773 bool "Device memory (pmem, HMM, etc...) hotplug support"
+2
mm/Makefile
··· 75 75 obj-$(CONFIG_ZSWAP) += zswap.o 76 76 obj-$(CONFIG_HAS_DMA) += dmapool.o 77 77 obj-$(CONFIG_HUGETLBFS) += hugetlb.o 78 + obj-$(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP) += hugetlb_vmemmap.o 78 79 obj-$(CONFIG_NUMA) += mempolicy.o 79 80 obj-$(CONFIG_SPARSEMEM) += sparse.o 80 81 obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o ··· 126 125 obj-$(CONFIG_PTDUMP_CORE) += ptdump.o 127 126 obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o 128 127 obj-$(CONFIG_IO_MAPPING) += io-mapping.o 128 + obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o
+127
mm/bootmem_info.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* 3 + * Bootmem core functions. 4 + * 5 + * Copyright (c) 2020, Bytedance. 6 + * 7 + * Author: Muchun Song <songmuchun@bytedance.com> 8 + * 9 + */ 10 + #include <linux/mm.h> 11 + #include <linux/compiler.h> 12 + #include <linux/memblock.h> 13 + #include <linux/bootmem_info.h> 14 + #include <linux/memory_hotplug.h> 15 + 16 + void get_page_bootmem(unsigned long info, struct page *page, unsigned long type) 17 + { 18 + page->freelist = (void *)type; 19 + SetPagePrivate(page); 20 + set_page_private(page, info); 21 + page_ref_inc(page); 22 + } 23 + 24 + void put_page_bootmem(struct page *page) 25 + { 26 + unsigned long type; 27 + 28 + type = (unsigned long) page->freelist; 29 + BUG_ON(type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE || 30 + type > MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE); 31 + 32 + if (page_ref_dec_return(page) == 1) { 33 + page->freelist = NULL; 34 + ClearPagePrivate(page); 35 + set_page_private(page, 0); 36 + INIT_LIST_HEAD(&page->lru); 37 + free_reserved_page(page); 38 + } 39 + } 40 + 41 + #ifndef CONFIG_SPARSEMEM_VMEMMAP 42 + static void register_page_bootmem_info_section(unsigned long start_pfn) 43 + { 44 + unsigned long mapsize, section_nr, i; 45 + struct mem_section *ms; 46 + struct page *page, *memmap; 47 + struct mem_section_usage *usage; 48 + 49 + section_nr = pfn_to_section_nr(start_pfn); 50 + ms = __nr_to_section(section_nr); 51 + 52 + /* Get section's memmap address */ 53 + memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr); 54 + 55 + /* 56 + * Get page for the memmap's phys address 57 + * XXX: need more consideration for sparse_vmemmap... 
58 + */ 59 + page = virt_to_page(memmap); 60 + mapsize = sizeof(struct page) * PAGES_PER_SECTION; 61 + mapsize = PAGE_ALIGN(mapsize) >> PAGE_SHIFT; 62 + 63 + /* remember memmap's page */ 64 + for (i = 0; i < mapsize; i++, page++) 65 + get_page_bootmem(section_nr, page, SECTION_INFO); 66 + 67 + usage = ms->usage; 68 + page = virt_to_page(usage); 69 + 70 + mapsize = PAGE_ALIGN(mem_section_usage_size()) >> PAGE_SHIFT; 71 + 72 + for (i = 0; i < mapsize; i++, page++) 73 + get_page_bootmem(section_nr, page, MIX_SECTION_INFO); 74 + 75 + } 76 + #else /* CONFIG_SPARSEMEM_VMEMMAP */ 77 + static void register_page_bootmem_info_section(unsigned long start_pfn) 78 + { 79 + unsigned long mapsize, section_nr, i; 80 + struct mem_section *ms; 81 + struct page *page, *memmap; 82 + struct mem_section_usage *usage; 83 + 84 + section_nr = pfn_to_section_nr(start_pfn); 85 + ms = __nr_to_section(section_nr); 86 + 87 + memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr); 88 + 89 + register_page_bootmem_memmap(section_nr, memmap, PAGES_PER_SECTION); 90 + 91 + usage = ms->usage; 92 + page = virt_to_page(usage); 93 + 94 + mapsize = PAGE_ALIGN(mem_section_usage_size()) >> PAGE_SHIFT; 95 + 96 + for (i = 0; i < mapsize; i++, page++) 97 + get_page_bootmem(section_nr, page, MIX_SECTION_INFO); 98 + } 99 + #endif /* !CONFIG_SPARSEMEM_VMEMMAP */ 100 + 101 + void __init register_page_bootmem_info_node(struct pglist_data *pgdat) 102 + { 103 + unsigned long i, pfn, end_pfn, nr_pages; 104 + int node = pgdat->node_id; 105 + struct page *page; 106 + 107 + nr_pages = PAGE_ALIGN(sizeof(struct pglist_data)) >> PAGE_SHIFT; 108 + page = virt_to_page(pgdat); 109 + 110 + for (i = 0; i < nr_pages; i++, page++) 111 + get_page_bootmem(node, page, NODE_INFO); 112 + 113 + pfn = pgdat->node_start_pfn; 114 + end_pfn = pgdat_end_pfn(pgdat); 115 + 116 + /* register section info */ 117 + for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) { 118 + /* 119 + * Some platforms can assign the same pfn to multiple nodes 
- on 120 + * node0 as well as nodeN. To avoid registering a pfn against 121 + * multiple nodes we check that this pfn does not already 122 + * reside in some other nodes. 123 + */ 124 + if (pfn_valid(pfn) && (early_pfn_to_nid(pfn) == node)) 125 + register_page_bootmem_info_section(pfn); 126 + } 127 + }
+9 -11
mm/compaction.c
··· 1297 1297 1298 1298 if (!list_is_last(freelist, &freepage->lru)) { 1299 1299 list_cut_before(&sublist, freelist, &freepage->lru); 1300 - if (!list_empty(&sublist)) 1301 - list_splice_tail(&sublist, freelist); 1300 + list_splice_tail(&sublist, freelist); 1302 1301 } 1303 1302 } 1304 1303 ··· 1314 1315 1315 1316 if (!list_is_first(freelist, &freepage->lru)) { 1316 1317 list_cut_position(&sublist, freelist, &freepage->lru); 1317 - if (!list_empty(&sublist)) 1318 - list_splice_tail(&sublist, freelist); 1318 + list_splice_tail(&sublist, freelist); 1319 1319 } 1320 1320 } 1321 1321 ··· 1378 1380 static unsigned long 1379 1381 fast_isolate_freepages(struct compact_control *cc) 1380 1382 { 1381 - unsigned int limit = min(1U, freelist_scan_limit(cc) >> 1); 1383 + unsigned int limit = max(1U, freelist_scan_limit(cc) >> 1); 1382 1384 unsigned int nr_scanned = 0; 1383 1385 unsigned long low_pfn, min_pfn, highest = 0; 1384 1386 unsigned long nr_isolated = 0; ··· 1490 1492 spin_unlock_irqrestore(&cc->zone->lock, flags); 1491 1493 1492 1494 /* 1493 - * Smaller scan on next order so the total scan ig related 1495 + * Smaller scan on next order so the total scan is related 1494 1496 * to freelist_scan_limit. 
1495 1497 */ 1496 1498 if (order_scanned >= limit) 1497 - limit = min(1U, limit >> 1); 1499 + limit = max(1U, limit >> 1); 1498 1500 } 1499 1501 1500 1502 if (!page) { ··· 2720 2722 } 2721 2723 2722 2724 #if defined(CONFIG_SYSFS) && defined(CONFIG_NUMA) 2723 - static ssize_t sysfs_compact_node(struct device *dev, 2724 - struct device_attribute *attr, 2725 - const char *buf, size_t count) 2725 + static ssize_t compact_store(struct device *dev, 2726 + struct device_attribute *attr, 2727 + const char *buf, size_t count) 2726 2728 { 2727 2729 int nid = dev->id; 2728 2730 ··· 2735 2737 2736 2738 return count; 2737 2739 } 2738 - static DEVICE_ATTR(compact, 0200, NULL, sysfs_compact_node); 2740 + static DEVICE_ATTR_WO(compact); 2739 2741 2740 2742 int compaction_register_node(struct node *node) 2741 2743 {
+59 -72
mm/debug_vm_pgtable.c
··· 91 91 unsigned long pfn, unsigned long vaddr, 92 92 pgprot_t prot) 93 93 { 94 - pte_t pte = pfn_pte(pfn, prot); 94 + pte_t pte; 95 95 96 96 /* 97 97 * Architectures optimize set_pte_at by avoiding TLB flush. ··· 248 248 WARN_ON(!pmd_leaf(pmd)); 249 249 } 250 250 251 - #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP 252 - static void __init pmd_huge_tests(pmd_t *pmdp, unsigned long pfn, pgprot_t prot) 253 - { 254 - pmd_t pmd; 255 - 256 - if (!arch_vmap_pmd_supported(prot)) 257 - return; 258 - 259 - pr_debug("Validating PMD huge\n"); 260 - /* 261 - * X86 defined pmd_set_huge() verifies that the given 262 - * PMD is not a populated non-leaf entry. 263 - */ 264 - WRITE_ONCE(*pmdp, __pmd(0)); 265 - WARN_ON(!pmd_set_huge(pmdp, __pfn_to_phys(pfn), prot)); 266 - WARN_ON(!pmd_clear_huge(pmdp)); 267 - pmd = READ_ONCE(*pmdp); 268 - WARN_ON(!pmd_none(pmd)); 269 - } 270 - #else /* CONFIG_HAVE_ARCH_HUGE_VMAP */ 271 - static void __init pmd_huge_tests(pmd_t *pmdp, unsigned long pfn, pgprot_t prot) { } 272 - #endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */ 273 - 274 251 static void __init pmd_savedwrite_tests(unsigned long pfn, pgprot_t prot) 275 252 { 276 253 pmd_t pmd; ··· 372 395 pud = pud_mkhuge(pud); 373 396 WARN_ON(!pud_leaf(pud)); 374 397 } 398 + #else /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ 399 + static void __init pud_basic_tests(struct mm_struct *mm, unsigned long pfn, int idx) { } 400 + static void __init pud_advanced_tests(struct mm_struct *mm, 401 + struct vm_area_struct *vma, pud_t *pudp, 402 + unsigned long pfn, unsigned long vaddr, 403 + pgprot_t prot) 404 + { 405 + } 406 + static void __init pud_leaf_tests(unsigned long pfn, pgprot_t prot) { } 407 + #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ 408 + #else /* !CONFIG_TRANSPARENT_HUGEPAGE */ 409 + static void __init pmd_basic_tests(unsigned long pfn, int idx) { } 410 + static void __init pud_basic_tests(struct mm_struct *mm, unsigned long pfn, int idx) { } 411 + static void __init pmd_advanced_tests(struct mm_struct 
*mm, 412 + struct vm_area_struct *vma, pmd_t *pmdp, 413 + unsigned long pfn, unsigned long vaddr, 414 + pgprot_t prot, pgtable_t pgtable) 415 + { 416 + } 417 + static void __init pud_advanced_tests(struct mm_struct *mm, 418 + struct vm_area_struct *vma, pud_t *pudp, 419 + unsigned long pfn, unsigned long vaddr, 420 + pgprot_t prot) 421 + { 422 + } 423 + static void __init pmd_leaf_tests(unsigned long pfn, pgprot_t prot) { } 424 + static void __init pud_leaf_tests(unsigned long pfn, pgprot_t prot) { } 425 + static void __init pmd_savedwrite_tests(unsigned long pfn, pgprot_t prot) { } 426 + #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ 375 427 376 428 #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP 429 + static void __init pmd_huge_tests(pmd_t *pmdp, unsigned long pfn, pgprot_t prot) 430 + { 431 + pmd_t pmd; 432 + 433 + if (!arch_vmap_pmd_supported(prot)) 434 + return; 435 + 436 + pr_debug("Validating PMD huge\n"); 437 + /* 438 + * X86 defined pmd_set_huge() verifies that the given 439 + * PMD is not a populated non-leaf entry. 
440 + */ 441 + WRITE_ONCE(*pmdp, __pmd(0)); 442 + WARN_ON(!pmd_set_huge(pmdp, __pfn_to_phys(pfn), prot)); 443 + WARN_ON(!pmd_clear_huge(pmdp)); 444 + pmd = READ_ONCE(*pmdp); 445 + WARN_ON(!pmd_none(pmd)); 446 + } 447 + 377 448 static void __init pud_huge_tests(pud_t *pudp, unsigned long pfn, pgprot_t prot) 378 449 { 379 450 pud_t pud; ··· 441 416 WARN_ON(!pud_none(pud)); 442 417 } 443 418 #else /* !CONFIG_HAVE_ARCH_HUGE_VMAP */ 419 + static void __init pmd_huge_tests(pmd_t *pmdp, unsigned long pfn, pgprot_t prot) { } 444 420 static void __init pud_huge_tests(pud_t *pudp, unsigned long pfn, pgprot_t prot) { } 445 - #endif /* !CONFIG_HAVE_ARCH_HUGE_VMAP */ 446 - 447 - #else /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ 448 - static void __init pud_basic_tests(struct mm_struct *mm, unsigned long pfn, int idx) { } 449 - static void __init pud_advanced_tests(struct mm_struct *mm, 450 - struct vm_area_struct *vma, pud_t *pudp, 451 - unsigned long pfn, unsigned long vaddr, 452 - pgprot_t prot) 453 - { 454 - } 455 - static void __init pud_leaf_tests(unsigned long pfn, pgprot_t prot) { } 456 - static void __init pud_huge_tests(pud_t *pudp, unsigned long pfn, pgprot_t prot) 457 - { 458 - } 459 - #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ 460 - #else /* !CONFIG_TRANSPARENT_HUGEPAGE */ 461 - static void __init pmd_basic_tests(unsigned long pfn, int idx) { } 462 - static void __init pud_basic_tests(struct mm_struct *mm, unsigned long pfn, int idx) { } 463 - static void __init pmd_advanced_tests(struct mm_struct *mm, 464 - struct vm_area_struct *vma, pmd_t *pmdp, 465 - unsigned long pfn, unsigned long vaddr, 466 - pgprot_t prot, pgtable_t pgtable) 467 - { 468 - } 469 - static void __init pud_advanced_tests(struct mm_struct *mm, 470 - struct vm_area_struct *vma, pud_t *pudp, 471 - unsigned long pfn, unsigned long vaddr, 472 - pgprot_t prot) 473 - { 474 - } 475 - static void __init pmd_leaf_tests(unsigned long pfn, pgprot_t prot) { } 476 - static void __init 
pud_leaf_tests(unsigned long pfn, pgprot_t prot) { } 477 - static void __init pmd_huge_tests(pmd_t *pmdp, unsigned long pfn, pgprot_t prot) 478 - { 479 - } 480 - static void __init pud_huge_tests(pud_t *pudp, unsigned long pfn, pgprot_t prot) 481 - { 482 - } 483 - static void __init pmd_savedwrite_tests(unsigned long pfn, pgprot_t prot) { } 484 - #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ 421 + #endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */ 485 422 486 423 static void __init p4d_basic_tests(unsigned long pfn, pgprot_t prot) 487 424 { ··· 778 791 WARN_ON(!pmd_swp_soft_dirty(pmd_swp_mksoft_dirty(pmd))); 779 792 WARN_ON(pmd_swp_soft_dirty(pmd_swp_clear_soft_dirty(pmd))); 780 793 } 781 - #else /* !CONFIG_ARCH_HAS_PTE_DEVMAP */ 794 + #else /* !CONFIG_TRANSPARENT_HUGEPAGE */ 782 795 static void __init pmd_soft_dirty_tests(unsigned long pfn, pgprot_t prot) { } 783 796 static void __init pmd_swap_soft_dirty_tests(unsigned long pfn, pgprot_t prot) 784 797 { 785 798 } 786 - #endif /* CONFIG_ARCH_HAS_PTE_DEVMAP */ 799 + #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ 787 800 788 801 static void __init pte_swap_tests(unsigned long pfn, pgprot_t prot) 789 802 { ··· 843 856 * locked, otherwise it stumbles upon a BUG_ON(). 
844 857 */ 845 858 __SetPageLocked(page); 846 - swp = make_migration_entry(page, 1); 859 + swp = make_writable_migration_entry(page_to_pfn(page)); 847 860 WARN_ON(!is_migration_entry(swp)); 848 - WARN_ON(!is_write_migration_entry(swp)); 861 + WARN_ON(!is_writable_migration_entry(swp)); 849 862 850 - make_migration_entry_read(&swp); 863 + swp = make_readable_migration_entry(swp_offset(swp)); 851 864 WARN_ON(!is_migration_entry(swp)); 852 - WARN_ON(is_write_migration_entry(swp)); 865 + WARN_ON(is_writable_migration_entry(swp)); 853 866 854 - swp = make_migration_entry(page, 0); 867 + swp = make_readable_migration_entry(page_to_pfn(page)); 855 868 WARN_ON(!is_migration_entry(swp)); 856 - WARN_ON(is_write_migration_entry(swp)); 869 + WARN_ON(is_writable_migration_entry(swp)); 857 870 __ClearPageLocked(page); 858 871 __free_page(page); 859 872 }
+58
mm/gup.c
··· 1501 1501 } 1502 1502 1503 1503 /* 1504 + * faultin_vma_page_range() - populate (prefault) page tables inside the 1505 + * given VMA range readable/writable 1506 + * 1507 + * This takes care of mlocking the pages, too, if VM_LOCKED is set. 1508 + * 1509 + * @vma: target vma 1510 + * @start: start address 1511 + * @end: end address 1512 + * @write: whether to prefault readable or writable 1513 + * @locked: whether the mmap_lock is still held 1514 + * 1515 + * Returns either number of processed pages in the vma, or a negative error 1516 + * code on error (see __get_user_pages()). 1517 + * 1518 + * vma->vm_mm->mmap_lock must be held. The range must be page-aligned and 1519 + * covered by the VMA. 1520 + * 1521 + * If @locked is NULL, it may be held for read or write and will be unperturbed. 1522 + * 1523 + * If @locked is non-NULL, it must be held for read only and may be released. If 1524 + * it's released, *@locked will be set to 0. 1525 + */ 1526 + long faultin_vma_page_range(struct vm_area_struct *vma, unsigned long start, 1527 + unsigned long end, bool write, int *locked) 1528 + { 1529 + struct mm_struct *mm = vma->vm_mm; 1530 + unsigned long nr_pages = (end - start) / PAGE_SIZE; 1531 + int gup_flags; 1532 + 1533 + VM_BUG_ON(!PAGE_ALIGNED(start)); 1534 + VM_BUG_ON(!PAGE_ALIGNED(end)); 1535 + VM_BUG_ON_VMA(start < vma->vm_start, vma); 1536 + VM_BUG_ON_VMA(end > vma->vm_end, vma); 1537 + mmap_assert_locked(mm); 1538 + 1539 + /* 1540 + * FOLL_TOUCH: Mark page accessed and thereby young; will also mark 1541 + * the page dirty with FOLL_WRITE -- which doesn't make a 1542 + * difference with !FOLL_FORCE, because the page is writable 1543 + * in the page table. 1544 + * FOLL_HWPOISON: Return -EHWPOISON instead of -EFAULT when we hit 1545 + * a poisoned page. 1546 + * FOLL_POPULATE: Always populate memory with VM_LOCKONFAULT. 1547 + * !FOLL_FORCE: Require proper access permissions.
1548 + */ 1549 + gup_flags = FOLL_TOUCH | FOLL_POPULATE | FOLL_MLOCK | FOLL_HWPOISON; 1550 + if (write) 1551 + gup_flags |= FOLL_WRITE; 1552 + 1553 + /* 1554 + * See check_vma_flags(): Will return -EFAULT on incompatible mappings 1555 + * or with insufficient permissions. 1556 + */ 1557 + return __get_user_pages(mm, start, nr_pages, gup_flags, 1558 + NULL, NULL, locked); 1559 + } 1560 + 1561 + /* 1504 1562 * __mm_populate - populate and/or mlock pages within a range of address space. 1505 1563 * 1506 1564 * This is used to implement mlock() and the MAP_POPULATE / MAP_LOCKED mmap
+8 -4
mm/hmm.c
··· 26 26 #include <linux/mmu_notifier.h> 27 27 #include <linux/memory_hotplug.h> 28 28 29 + #include "internal.h" 30 + 29 31 struct hmm_vma_walk { 30 32 struct hmm_range *range; 31 33 unsigned long last; ··· 216 214 swp_entry_t entry) 217 215 { 218 216 return is_device_private_entry(entry) && 219 - device_private_entry_to_page(entry)->pgmap->owner == 217 + pfn_swap_entry_to_page(entry)->pgmap->owner == 220 218 range->dev_private_owner; 221 219 } 222 220 ··· 257 255 */ 258 256 if (hmm_is_device_private_entry(range, entry)) { 259 257 cpu_flags = HMM_PFN_VALID; 260 - if (is_write_device_private_entry(entry)) 258 + if (is_writable_device_private_entry(entry)) 261 259 cpu_flags |= HMM_PFN_WRITE; 262 - *hmm_pfn = device_private_entry_to_pfn(entry) | 263 - cpu_flags; 260 + *hmm_pfn = swp_offset(entry) | cpu_flags; 264 261 return 0; 265 262 } 266 263 ··· 271 270 } 272 271 273 272 if (!non_swap_entry(entry)) 273 + goto fault; 274 + 275 + if (is_device_exclusive_entry(entry)) 274 276 goto fault; 275 277 276 278 if (is_migration_entry(entry)) {
+119 -146
mm/huge_memory.c
··· 64 64 struct page *huge_zero_page __read_mostly; 65 65 unsigned long huge_zero_pfn __read_mostly = ~0UL; 66 66 67 - bool transparent_hugepage_enabled(struct vm_area_struct *vma) 67 + static inline bool file_thp_enabled(struct vm_area_struct *vma) 68 + { 69 + return transhuge_vma_enabled(vma, vma->vm_flags) && vma->vm_file && 70 + !inode_is_open_for_write(vma->vm_file->f_inode) && 71 + (vma->vm_flags & VM_EXEC); 72 + } 73 + 74 + bool transparent_hugepage_active(struct vm_area_struct *vma) 68 75 { 69 76 /* The addr is used to check if the vma size fits */ 70 77 unsigned long addr = (vma->vm_end & HPAGE_PMD_MASK) - HPAGE_PMD_SIZE; ··· 82 75 return __transparent_hugepage_enabled(vma); 83 76 if (vma_is_shmem(vma)) 84 77 return shmem_huge_enabled(vma); 78 + if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS)) 79 + return file_thp_enabled(vma); 85 80 86 81 return false; 87 82 } ··· 1026 1017 1027 1018 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, 1028 1019 pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr, 1029 - struct vm_area_struct *vma) 1020 + struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) 1030 1021 { 1031 1022 spinlock_t *dst_ptl, *src_ptl; 1032 1023 struct page *src_page; ··· 1035 1026 int ret = -ENOMEM; 1036 1027 1037 1028 /* Skip if can be re-fill on fault */ 1038 - if (!vma_is_anonymous(vma)) 1029 + if (!vma_is_anonymous(dst_vma)) 1039 1030 return 0; 1040 1031 1041 1032 pgtable = pte_alloc_one(dst_mm); ··· 1049 1040 ret = -EAGAIN; 1050 1041 pmd = *src_pmd; 1051 1042 1052 - /* 1053 - * Make sure the _PAGE_UFFD_WP bit is cleared if the new VMA 1054 - * does not have the VM_UFFD_WP, which means that the uffd 1055 - * fork event is not enabled. 
1056 - */ 1057 - if (!(vma->vm_flags & VM_UFFD_WP)) 1058 - pmd = pmd_clear_uffd_wp(pmd); 1059 - 1060 1043 #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION 1061 1044 if (unlikely(is_swap_pmd(pmd))) { 1062 1045 swp_entry_t entry = pmd_to_swp_entry(pmd); 1063 1046 1064 1047 VM_BUG_ON(!is_pmd_migration_entry(pmd)); 1065 - if (is_write_migration_entry(entry)) { 1066 - make_migration_entry_read(&entry); 1048 + if (is_writable_migration_entry(entry)) { 1049 + entry = make_readable_migration_entry( 1050 + swp_offset(entry)); 1067 1051 pmd = swp_entry_to_pmd(entry); 1068 1052 if (pmd_swp_soft_dirty(*src_pmd)) 1069 1053 pmd = pmd_swp_mksoft_dirty(pmd); 1054 + if (pmd_swp_uffd_wp(*src_pmd)) 1055 + pmd = pmd_swp_mkuffd_wp(pmd); 1070 1056 set_pmd_at(src_mm, addr, src_pmd, pmd); 1071 1057 } 1072 1058 add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR); 1073 1059 mm_inc_nr_ptes(dst_mm); 1074 1060 pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable); 1061 + if (!userfaultfd_wp(dst_vma)) 1062 + pmd = pmd_swp_clear_uffd_wp(pmd); 1075 1063 set_pmd_at(dst_mm, addr, dst_pmd, pmd); 1076 1064 ret = 0; 1077 1065 goto out_unlock; ··· 1085 1079 * a page table. 1086 1080 */ 1087 1081 if (is_huge_zero_pmd(pmd)) { 1088 - struct page *zero_page; 1089 1082 /* 1090 1083 * get_huge_zero_page() will never allocate a new page here, 1091 1084 * since we already have a zero page to copy. It just takes a 1092 1085 * reference. 1093 1086 */ 1094 - zero_page = mm_get_huge_zero_page(dst_mm); 1095 - set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd, 1096 - zero_page); 1097 - ret = 0; 1098 - goto out_unlock; 1087 + mm_get_huge_zero_page(dst_mm); 1088 + goto out_zero_page; 1099 1089 } 1100 1090 1101 1091 src_page = pmd_page(pmd); ··· 1104 1102 * best effort that the pinned pages won't be replaced by another 1105 1103 * random page during the coming copy-on-write. 
1106 1104 */ 1107 - if (unlikely(page_needs_cow_for_dma(vma, src_page))) { 1105 + if (unlikely(page_needs_cow_for_dma(src_vma, src_page))) { 1108 1106 pte_free(dst_mm, pgtable); 1109 1107 spin_unlock(src_ptl); 1110 1108 spin_unlock(dst_ptl); 1111 - __split_huge_pmd(vma, src_pmd, addr, false, NULL); 1109 + __split_huge_pmd(src_vma, src_pmd, addr, false, NULL); 1112 1110 return -EAGAIN; 1113 1111 } 1114 1112 1115 1113 get_page(src_page); 1116 1114 page_dup_rmap(src_page, true); 1117 1115 add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR); 1116 + out_zero_page: 1118 1117 mm_inc_nr_ptes(dst_mm); 1119 1118 pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable); 1120 - 1121 1119 pmdp_set_wrprotect(src_mm, addr, src_pmd); 1120 + if (!userfaultfd_wp(dst_vma)) 1121 + pmd = pmd_clear_uffd_wp(pmd); 1122 1122 pmd = pmd_mkold(pmd_wrprotect(pmd)); 1123 1123 set_pmd_at(dst_mm, addr, dst_pmd, pmd); 1124 1124 ··· 1258 1254 } 1259 1255 #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ 1260 1256 1261 - void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd) 1257 + void huge_pmd_set_accessed(struct vm_fault *vmf) 1262 1258 { 1263 1259 pmd_t entry; 1264 1260 unsigned long haddr; 1265 1261 bool write = vmf->flags & FAULT_FLAG_WRITE; 1262 + pmd_t orig_pmd = vmf->orig_pmd; 1266 1263 1267 1264 vmf->ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd); 1268 1265 if (unlikely(!pmd_same(*vmf->pmd, orig_pmd))) ··· 1280 1275 spin_unlock(vmf->ptl); 1281 1276 } 1282 1277 1283 - vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd) 1278 + vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) 1284 1279 { 1285 1280 struct vm_area_struct *vma = vmf->vma; 1286 1281 struct page *page; 1287 1282 unsigned long haddr = vmf->address & HPAGE_PMD_MASK; 1283 + pmd_t orig_pmd = vmf->orig_pmd; 1288 1284 1289 1285 vmf->ptl = pmd_lockptr(vma->vm_mm, vmf->pmd); 1290 1286 VM_BUG_ON_VMA(!vma->anon_vma, vma); ··· 1421 1415 } 1422 1416 1423 1417 /* NUMA hinting page fault entry point for trans huge pmds 
*/ 1424 - vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd) 1418 + vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf) 1425 1419 { 1426 1420 struct vm_area_struct *vma = vmf->vma; 1427 - struct anon_vma *anon_vma = NULL; 1421 + pmd_t oldpmd = vmf->orig_pmd; 1422 + pmd_t pmd; 1428 1423 struct page *page; 1429 1424 unsigned long haddr = vmf->address & HPAGE_PMD_MASK; 1430 - int page_nid = NUMA_NO_NODE, this_nid = numa_node_id(); 1425 + int page_nid = NUMA_NO_NODE; 1431 1426 int target_nid, last_cpupid = -1; 1432 - bool page_locked; 1433 1427 bool migrated = false; 1434 - bool was_writable; 1428 + bool was_writable = pmd_savedwrite(oldpmd); 1435 1429 int flags = 0; 1436 1430 1437 1431 vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd); 1438 - if (unlikely(!pmd_same(pmd, *vmf->pmd))) 1439 - goto out_unlock; 1440 - 1441 - /* 1442 - * If there are potential migrations, wait for completion and retry 1443 - * without disrupting NUMA hinting information. Do not relock and 1444 - * check_same as the page may no longer be mapped. 
1445 - */ 1446 - if (unlikely(pmd_trans_migrating(*vmf->pmd))) { 1447 - page = pmd_page(*vmf->pmd); 1448 - if (!get_page_unless_zero(page)) 1449 - goto out_unlock; 1432 + if (unlikely(!pmd_same(oldpmd, *vmf->pmd))) { 1450 1433 spin_unlock(vmf->ptl); 1451 - put_and_wait_on_page_locked(page, TASK_UNINTERRUPTIBLE); 1452 1434 goto out; 1453 - } 1454 - 1455 - page = pmd_page(pmd); 1456 - BUG_ON(is_huge_zero_page(page)); 1457 - page_nid = page_to_nid(page); 1458 - last_cpupid = page_cpupid_last(page); 1459 - count_vm_numa_event(NUMA_HINT_FAULTS); 1460 - if (page_nid == this_nid) { 1461 - count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL); 1462 - flags |= TNF_FAULT_LOCAL; 1463 - } 1464 - 1465 - /* See similar comment in do_numa_page for explanation */ 1466 - if (!pmd_savedwrite(pmd)) 1467 - flags |= TNF_NO_GROUP; 1468 - 1469 - /* 1470 - * Acquire the page lock to serialise THP migrations but avoid dropping 1471 - * page_table_lock if at all possible 1472 - */ 1473 - page_locked = trylock_page(page); 1474 - target_nid = mpol_misplaced(page, vma, haddr); 1475 - /* Migration could have started since the pmd_trans_migrating check */ 1476 - if (!page_locked) { 1477 - page_nid = NUMA_NO_NODE; 1478 - if (!get_page_unless_zero(page)) 1479 - goto out_unlock; 1480 - spin_unlock(vmf->ptl); 1481 - put_and_wait_on_page_locked(page, TASK_UNINTERRUPTIBLE); 1482 - goto out; 1483 - } else if (target_nid == NUMA_NO_NODE) { 1484 - /* There are no parallel migrations and page is in the right 1485 - * node. Clear the numa hinting info in this pmd. 1486 - */ 1487 - goto clear_pmdnuma; 1488 - } 1489 - 1490 - /* 1491 - * Page is misplaced. Page lock serialises migrations. 
Acquire anon_vma 1492 - * to serialises splits 1493 - */ 1494 - get_page(page); 1495 - spin_unlock(vmf->ptl); 1496 - anon_vma = page_lock_anon_vma_read(page); 1497 - 1498 - /* Confirm the PMD did not change while page_table_lock was released */ 1499 - spin_lock(vmf->ptl); 1500 - if (unlikely(!pmd_same(pmd, *vmf->pmd))) { 1501 - unlock_page(page); 1502 - put_page(page); 1503 - page_nid = NUMA_NO_NODE; 1504 - goto out_unlock; 1505 - } 1506 - 1507 - /* Bail if we fail to protect against THP splits for any reason */ 1508 - if (unlikely(!anon_vma)) { 1509 - put_page(page); 1510 - page_nid = NUMA_NO_NODE; 1511 - goto clear_pmdnuma; 1512 1435 } 1513 1436 1514 1437 /* ··· 1466 1531 haddr + HPAGE_PMD_SIZE); 1467 1532 } 1468 1533 1469 - /* 1470 - * Migrate the THP to the requested node, returns with page unlocked 1471 - * and access rights restored. 1472 - */ 1534 + pmd = pmd_modify(oldpmd, vma->vm_page_prot); 1535 + page = vm_normal_page_pmd(vma, haddr, pmd); 1536 + if (!page) 1537 + goto out_map; 1538 + 1539 + /* See similar comment in do_numa_page for explanation */ 1540 + if (!was_writable) 1541 + flags |= TNF_NO_GROUP; 1542 + 1543 + page_nid = page_to_nid(page); 1544 + last_cpupid = page_cpupid_last(page); 1545 + target_nid = numa_migrate_prep(page, vma, haddr, page_nid, 1546 + &flags); 1547 + 1548 + if (target_nid == NUMA_NO_NODE) { 1549 + put_page(page); 1550 + goto out_map; 1551 + } 1552 + 1473 1553 spin_unlock(vmf->ptl); 1474 1554 1475 - migrated = migrate_misplaced_transhuge_page(vma->vm_mm, vma, 1476 - vmf->pmd, pmd, vmf->address, page, target_nid); 1555 + migrated = migrate_misplaced_page(page, vma, target_nid); 1477 1556 if (migrated) { 1478 1557 flags |= TNF_MIGRATED; 1479 1558 page_nid = target_nid; 1480 - } else 1559 + } else { 1481 1560 flags |= TNF_MIGRATE_FAIL; 1482 - 1483 - goto out; 1484 - clear_pmdnuma: 1485 - BUG_ON(!PageLocked(page)); 1486 - was_writable = pmd_savedwrite(pmd); 1487 - pmd = pmd_modify(pmd, vma->vm_page_prot); 1488 - pmd = 
pmd_mkyoung(pmd); 1489 - if (was_writable) 1490 - pmd = pmd_mkwrite(pmd); 1491 - set_pmd_at(vma->vm_mm, haddr, vmf->pmd, pmd); 1492 - update_mmu_cache_pmd(vma, vmf->address, vmf->pmd); 1493 - unlock_page(page); 1494 - out_unlock: 1495 - spin_unlock(vmf->ptl); 1561 + vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd); 1562 + if (unlikely(!pmd_same(oldpmd, *vmf->pmd))) { 1563 + spin_unlock(vmf->ptl); 1564 + goto out; 1565 + } 1566 + goto out_map; 1567 + } 1496 1568 1497 1569 out: 1498 - if (anon_vma) 1499 - page_unlock_anon_vma_read(anon_vma); 1500 - 1501 1570 if (page_nid != NUMA_NO_NODE) 1502 1571 task_numa_fault(last_cpupid, page_nid, HPAGE_PMD_NR, 1503 1572 flags); 1504 1573 1505 1574 return 0; 1575 + 1576 + out_map: 1577 + /* Restore the PMD */ 1578 + pmd = pmd_modify(oldpmd, vma->vm_page_prot); 1579 + pmd = pmd_mkyoung(pmd); 1580 + if (was_writable) 1581 + pmd = pmd_mkwrite(pmd); 1582 + set_pmd_at(vma->vm_mm, haddr, vmf->pmd, pmd); 1583 + update_mmu_cache_pmd(vma, vmf->address, vmf->pmd); 1584 + spin_unlock(vmf->ptl); 1585 + goto out; 1506 1586 } 1507 1587 1508 1588 /* ··· 1554 1604 * If other processes are mapping this page, we couldn't discard 1555 1605 * the page unless they all do MADV_FREE so let's skip the page. 
1556 1606 */ 1557 - if (page_mapcount(page) != 1) 1607 + if (total_mapcount(page) != 1) 1558 1608 goto out; 1559 1609 1560 1610 if (!trylock_page(page)) ··· 1627 1677 if (arch_needs_pgtable_deposit()) 1628 1678 zap_deposited_table(tlb->mm, pmd); 1629 1679 spin_unlock(ptl); 1630 - if (is_huge_zero_pmd(orig_pmd)) 1631 - tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE); 1632 1680 } else if (is_huge_zero_pmd(orig_pmd)) { 1633 1681 zap_deposited_table(tlb->mm, pmd); 1634 1682 spin_unlock(ptl); 1635 - tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE); 1636 1683 } else { 1637 1684 struct page *page = NULL; 1638 1685 int flush_needed = 1; ··· 1644 1697 1645 1698 VM_BUG_ON(!is_pmd_migration_entry(orig_pmd)); 1646 1699 entry = pmd_to_swp_entry(orig_pmd); 1647 - page = migration_entry_to_page(entry); 1700 + page = pfn_swap_entry_to_page(entry); 1648 1701 flush_needed = 0; 1649 1702 } else 1650 1703 WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!"); ··· 1743 1796 * Returns 1744 1797 * - 0 if PMD could not be locked 1745 1798 * - 1 if PMD was locked but protections unchanged and TLB flush unnecessary 1799 + * or if prot_numa but THP migration is not supported 1746 1800 * - HPAGE_PMD_NR if protections changed and TLB flush necessary 1747 1801 */ 1748 1802 int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, ··· 1758 1810 bool uffd_wp = cp_flags & MM_CP_UFFD_WP; 1759 1811 bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE; 1760 1812 1813 + if (prot_numa && !thp_migration_supported()) 1814 + return 1; 1815 + 1761 1816 ptl = __pmd_trans_huge_lock(pmd, vma); 1762 1817 if (!ptl) 1763 1818 return 0; ··· 1773 1822 swp_entry_t entry = pmd_to_swp_entry(*pmd); 1774 1823 1775 1824 VM_BUG_ON(!is_pmd_migration_entry(*pmd)); 1776 - if (is_write_migration_entry(entry)) { 1825 + if (is_writable_migration_entry(entry)) { 1777 1826 pmd_t newpmd; 1778 1827 /* 1779 1828 * A protection check is difficult so 1780 1829 * just be safe and disable 
write 1781 1830 */ 1782 - make_migration_entry_read(&entry); 1831 + entry = make_readable_migration_entry( 1832 + swp_offset(entry)); 1783 1833 newpmd = swp_entry_to_pmd(entry); 1784 1834 if (pmd_swp_soft_dirty(*pmd)) 1785 1835 newpmd = pmd_swp_mksoft_dirty(newpmd); 1836 + if (pmd_swp_uffd_wp(*pmd)) 1837 + newpmd = pmd_swp_mkuffd_wp(newpmd); 1786 1838 set_pmd_at(mm, addr, pmd, newpmd); 1787 1839 } 1788 1840 goto unlock; ··· 2014 2060 swp_entry_t entry; 2015 2061 2016 2062 entry = pmd_to_swp_entry(old_pmd); 2017 - page = migration_entry_to_page(entry); 2063 + page = pfn_swap_entry_to_page(entry); 2018 2064 } else { 2019 2065 page = pmd_page(old_pmd); 2020 2066 if (!PageDirty(page) && pmd_dirty(old_pmd)) ··· 2068 2114 swp_entry_t entry; 2069 2115 2070 2116 entry = pmd_to_swp_entry(old_pmd); 2071 - page = migration_entry_to_page(entry); 2072 - write = is_write_migration_entry(entry); 2117 + page = pfn_swap_entry_to_page(entry); 2118 + write = is_writable_migration_entry(entry); 2073 2119 young = false; 2074 2120 soft_dirty = pmd_swp_soft_dirty(old_pmd); 2075 2121 uffd_wp = pmd_swp_uffd_wp(old_pmd); ··· 2101 2147 */ 2102 2148 if (freeze || pmd_migration) { 2103 2149 swp_entry_t swp_entry; 2104 - swp_entry = make_migration_entry(page + i, write); 2150 + if (write) 2151 + swp_entry = make_writable_migration_entry( 2152 + page_to_pfn(page + i)); 2153 + else 2154 + swp_entry = make_readable_migration_entry( 2155 + page_to_pfn(page + i)); 2105 2156 entry = swp_entry_to_pte(swp_entry); 2106 2157 if (soft_dirty) 2107 2158 entry = pte_swp_mksoft_dirty(entry); ··· 2309 2350 2310 2351 static void unmap_page(struct page *page) 2311 2352 { 2312 - enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK | TTU_SYNC | 2313 - TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD; 2353 + enum ttu_flags ttu_flags = TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD | 2354 + TTU_SYNC; 2314 2355 2315 2356 VM_BUG_ON_PAGE(!PageHead(page), page); 2316 2357 2358 + /* 2359 + * Anon pages need migration entries to preserve them, but file 
2360 + * pages can simply be left unmapped, then faulted back on demand. 2361 + * If that is ever changed (perhaps for mlock), update remap_page(). 2362 + */ 2317 2363 if (PageAnon(page)) 2318 - ttu_flags |= TTU_SPLIT_FREEZE; 2319 - 2320 - try_to_unmap(page, ttu_flags); 2364 + try_to_migrate(page, ttu_flags); 2365 + else 2366 + try_to_unmap(page, ttu_flags | TTU_IGNORE_MLOCK); 2321 2367 2322 2368 VM_WARN_ON_ONCE_PAGE(page_mapped(page), page); 2323 2369 } ··· 2330 2366 static void remap_page(struct page *page, unsigned int nr) 2331 2367 { 2332 2368 int i; 2369 + 2370 + /* If TTU_SPLIT_FREEZE is ever extended to file, remove this check */ 2371 + if (!PageAnon(page)) 2372 + return; 2333 2373 if (PageTransHuge(page)) { 2334 2374 remove_migration_ptes(page, page, true); 2335 2375 } else { ··· 2838 2870 spin_lock_irqsave(&ds_queue->split_queue_lock, flags); 2839 2871 /* Take pin on all head pages to avoid freeing them under us */ 2840 2872 list_for_each_safe(pos, next, &ds_queue->split_queue) { 2841 - page = list_entry((void *)pos, struct page, mapping); 2873 + page = list_entry((void *)pos, struct page, deferred_list); 2842 2874 page = compound_head(page); 2843 2875 if (get_page_unless_zero(page)) { 2844 2876 list_move(page_deferred_list(page), &list); ··· 2853 2885 spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags); 2854 2886 2855 2887 list_for_each_safe(pos, next, &list) { 2856 - page = list_entry((void *)pos, struct page, mapping); 2888 + page = list_entry((void *)pos, struct page, deferred_list); 2857 2889 if (!trylock_page(page)) 2858 2890 goto next; 2859 2891 /* split_huge_page() removes page from list on success */ ··· 3112 3144 3113 3145 tok = strsep(&buf, ","); 3114 3146 if (tok) { 3115 - strncpy(file_path, tok, MAX_INPUT_BUF_SZ); 3147 + strcpy(file_path, tok); 3116 3148 } else { 3117 3149 ret = -EINVAL; 3118 3150 goto out; ··· 3182 3214 pmdval = pmdp_invalidate(vma, address, pvmw->pmd); 3183 3215 if (pmd_dirty(pmdval)) 3184 3216 set_page_dirty(page); 
3185 - entry = make_migration_entry(page, pmd_write(pmdval)); 3217 + if (pmd_write(pmdval)) 3218 + entry = make_writable_migration_entry(page_to_pfn(page)); 3219 + else 3220 + entry = make_readable_migration_entry(page_to_pfn(page)); 3186 3221 pmdswp = swp_entry_to_pmd(entry); 3187 3222 if (pmd_soft_dirty(pmdval)) 3188 3223 pmdswp = pmd_swp_mksoft_dirty(pmdswp); ··· 3211 3240 pmde = pmd_mkold(mk_huge_pmd(new, vma->vm_page_prot)); 3212 3241 if (pmd_swp_soft_dirty(*pvmw->pmd)) 3213 3242 pmde = pmd_mksoft_dirty(pmde); 3214 - if (is_write_migration_entry(entry)) 3243 + if (is_writable_migration_entry(entry)) 3215 3244 pmde = maybe_pmd_mkwrite(pmde, vma); 3245 + if (pmd_swp_uffd_wp(*pvmw->pmd)) 3246 + pmde = pmd_wrprotect(pmd_mkuffd_wp(pmde)); 3216 3247 3217 3248 flush_cache_range(vma, mmun_start, mmun_start + HPAGE_PMD_SIZE); 3218 3249 if (PageAnon(new))
+289 -72
mm/hugetlb.c
··· 30 30 #include <linux/numa.h> 31 31 #include <linux/llist.h> 32 32 #include <linux/cma.h> 33 + #include <linux/migrate.h> 33 34 34 35 #include <asm/page.h> 35 36 #include <asm/pgalloc.h> ··· 42 41 #include <linux/node.h> 43 42 #include <linux/page_owner.h> 44 43 #include "internal.h" 44 + #include "hugetlb_vmemmap.h" 45 45 46 46 int hugetlb_max_hstate __read_mostly; 47 47 unsigned int default_hstate_idx; ··· 1320 1318 return alloc_contig_pages(nr_pages, gfp_mask, nid, nodemask); 1321 1319 } 1322 1320 1323 - static void prep_new_huge_page(struct hstate *h, struct page *page, int nid); 1324 - static void prep_compound_gigantic_page(struct page *page, unsigned int order); 1325 1321 #else /* !CONFIG_CONTIG_ALLOC */ 1326 1322 static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask, 1327 1323 int nid, nodemask_t *nodemask) ··· 1375 1375 h->nr_huge_pages_node[nid]--; 1376 1376 } 1377 1377 1378 - static void update_and_free_page(struct hstate *h, struct page *page) 1378 + static void add_hugetlb_page(struct hstate *h, struct page *page, 1379 + bool adjust_surplus) 1380 + { 1381 + int zeroed; 1382 + int nid = page_to_nid(page); 1383 + 1384 + VM_BUG_ON_PAGE(!HPageVmemmapOptimized(page), page); 1385 + 1386 + lockdep_assert_held(&hugetlb_lock); 1387 + 1388 + INIT_LIST_HEAD(&page->lru); 1389 + h->nr_huge_pages++; 1390 + h->nr_huge_pages_node[nid]++; 1391 + 1392 + if (adjust_surplus) { 1393 + h->surplus_huge_pages++; 1394 + h->surplus_huge_pages_node[nid]++; 1395 + } 1396 + 1397 + set_compound_page_dtor(page, HUGETLB_PAGE_DTOR); 1398 + set_page_private(page, 0); 1399 + SetHPageVmemmapOptimized(page); 1400 + 1401 + /* 1402 + * This page is now managed by the hugetlb allocator and has 1403 + * no users -- drop the last reference. 
1404 + */ 1405 + zeroed = put_page_testzero(page); 1406 + VM_BUG_ON_PAGE(!zeroed, page); 1407 + arch_clear_hugepage_flags(page); 1408 + enqueue_huge_page(h, page); 1409 + } 1410 + 1411 + static void __update_and_free_page(struct hstate *h, struct page *page) 1379 1412 { 1380 1413 int i; 1381 1414 struct page *subpage = page; 1382 1415 1383 1416 if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported()) 1384 1417 return; 1418 + 1419 + if (alloc_huge_page_vmemmap(h, page)) { 1420 + spin_lock_irq(&hugetlb_lock); 1421 + /* 1422 + * If we cannot allocate vmemmap pages, just refuse to free the 1423 + * page and put the page back on the hugetlb free list and treat 1424 + * it as a surplus page. 1425 + */ 1426 + add_hugetlb_page(h, page, true); 1427 + spin_unlock_irq(&hugetlb_lock); 1428 + return; 1429 + } 1385 1430 1386 1431 for (i = 0; i < pages_per_huge_page(h); 1387 1432 i++, subpage = mem_map_next(subpage, page, i)) { ··· 1443 1398 } 1444 1399 } 1445 1400 1401 + /* 1402 + * As update_and_free_page() can be called under any context, we cannot 1403 + * use GFP_KERNEL to allocate vmemmap pages. However, we can defer the 1404 + * actual freeing in a workqueue to avoid using GFP_ATOMIC to allocate 1405 + * the vmemmap pages. 1406 + * 1407 + * free_hpage_workfn() locklessly retrieves the linked list of pages to be 1408 + * freed and frees them one-by-one. As the page->mapping pointer is going 1409 + * to be cleared in free_hpage_workfn() anyway, it is reused as the llist_node 1410 + * structure of a lockless linked list of huge pages to be freed.
1411 + */ 1412 + static LLIST_HEAD(hpage_freelist); 1413 + 1414 + static void free_hpage_workfn(struct work_struct *work) 1415 + { 1416 + struct llist_node *node; 1417 + 1418 + node = llist_del_all(&hpage_freelist); 1419 + 1420 + while (node) { 1421 + struct page *page; 1422 + struct hstate *h; 1423 + 1424 + page = container_of((struct address_space **)node, 1425 + struct page, mapping); 1426 + node = node->next; 1427 + page->mapping = NULL; 1428 + /* 1429 + * The VM_BUG_ON_PAGE(!PageHuge(page), page) in page_hstate() 1430 + * is going to trigger because a previous call to 1431 + * remove_hugetlb_page() will set_compound_page_dtor(page, 1432 + * NULL_COMPOUND_DTOR), so do not use page_hstate() directly. 1433 + */ 1434 + h = size_to_hstate(page_size(page)); 1435 + 1436 + __update_and_free_page(h, page); 1437 + 1438 + cond_resched(); 1439 + } 1440 + } 1441 + static DECLARE_WORK(free_hpage_work, free_hpage_workfn); 1442 + 1443 + static inline void flush_free_hpage_work(struct hstate *h) 1444 + { 1445 + if (free_vmemmap_pages_per_hpage(h)) 1446 + flush_work(&free_hpage_work); 1447 + } 1448 + 1449 + static void update_and_free_page(struct hstate *h, struct page *page, 1450 + bool atomic) 1451 + { 1452 + if (!HPageVmemmapOptimized(page) || !atomic) { 1453 + __update_and_free_page(h, page); 1454 + return; 1455 + } 1456 + 1457 + /* 1458 + * Defer freeing to avoid using GFP_ATOMIC to allocate vmemmap pages. 1459 + * 1460 + * Only call schedule_work() if hpage_freelist is previously 1461 + * empty. Otherwise, schedule_work() had been called but the workfn 1462 + * hasn't retrieved the list yet. 
1463 + */ 1464 + if (llist_add((struct llist_node *)&page->mapping, &hpage_freelist)) 1465 + schedule_work(&free_hpage_work); 1466 + } 1467 + 1446 1468 static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list) 1447 1469 { 1448 1470 struct page *page, *t_page; 1449 1471 1450 1472 list_for_each_entry_safe(page, t_page, list, lru) { 1451 - update_and_free_page(h, page); 1473 + update_and_free_page(h, page, false); 1452 1474 cond_resched(); 1453 1475 } 1454 1476 } ··· 1582 1470 if (HPageTemporary(page)) { 1583 1471 remove_hugetlb_page(h, page, false); 1584 1472 spin_unlock_irqrestore(&hugetlb_lock, flags); 1585 - update_and_free_page(h, page); 1473 + update_and_free_page(h, page, true); 1586 1474 } else if (h->surplus_huge_pages_node[nid]) { 1587 1475 /* remove the page from active list */ 1588 1476 remove_hugetlb_page(h, page, true); 1589 1477 spin_unlock_irqrestore(&hugetlb_lock, flags); 1590 - update_and_free_page(h, page); 1478 + update_and_free_page(h, page, true); 1591 1479 } else { 1592 1480 arch_clear_hugepage_flags(page); 1593 1481 enqueue_huge_page(h, page); ··· 1605 1493 h->nr_huge_pages_node[nid]++; 1606 1494 } 1607 1495 1608 - static void __prep_new_huge_page(struct page *page) 1496 + static void __prep_new_huge_page(struct hstate *h, struct page *page) 1609 1497 { 1498 + free_huge_page_vmemmap(h, page); 1610 1499 INIT_LIST_HEAD(&page->lru); 1611 1500 set_compound_page_dtor(page, HUGETLB_PAGE_DTOR); 1612 1501 hugetlb_set_page_subpool(page, NULL); ··· 1617 1504 1618 1505 static void prep_new_huge_page(struct hstate *h, struct page *page, int nid) 1619 1506 { 1620 - __prep_new_huge_page(page); 1507 + __prep_new_huge_page(h, page); 1621 1508 spin_lock_irq(&hugetlb_lock); 1622 1509 __prep_account_new_huge_page(h, nid); 1623 1510 spin_unlock_irq(&hugetlb_lock); 1624 1511 } 1625 1512 1626 - static void prep_compound_gigantic_page(struct page *page, unsigned int order) 1513 + static bool prep_compound_gigantic_page(struct page *page, 
unsigned int order) 1627 1514 { 1628 - int i; 1515 + int i, j; 1629 1516 int nr_pages = 1 << order; 1630 1517 struct page *p = page + 1; 1631 1518 ··· 1647 1534 * after get_user_pages(). 1648 1535 */ 1649 1536 __ClearPageReserved(p); 1537 + /* 1538 + * Subtle and very unlikely 1539 + * 1540 + * Gigantic 'page allocators' such as memblock or cma will 1541 + * return a set of pages with each page ref counted. We need 1542 + * to turn this set of pages into a compound page with tail 1543 + * page ref counts set to zero. Code such as speculative page 1544 + * cache adding could take a ref on a 'to be' tail page. 1545 + * We need to respect any increased ref count, and only set 1546 + * the ref count to zero if count is currently 1. If count 1547 + * is not 1, we call synchronize_rcu in the hope that an RCU 1548 + * grace period will cause the ref count to drop and then retry. 1549 + * If count is still inflated on retry we return an error and 1550 + * must discard the pages. 1551 + */ 1552 + if (!page_ref_freeze(p, 1)) { 1553 + pr_info("HugeTLB unexpected inflated ref count on freshly allocated page\n"); 1554 + synchronize_rcu(); 1555 + if (!page_ref_freeze(p, 1)) 1556 + goto out_error; 1557 + } 1650 1558 set_page_count(p, 0); 1651 1559 set_compound_head(p, page); 1652 1560 } 1653 1561 atomic_set(compound_mapcount_ptr(page), -1); 1654 1562 atomic_set(compound_pincount_ptr(page), 0); 1563 + return true; 1564 + 1565 + out_error: 1566 + /* undo tail page modifications made above */ 1567 + p = page + 1; 1568 + for (j = 1; j < i; j++, p = mem_map_next(p, page, j)) { 1569 + clear_compound_head(p); 1570 + set_page_refcounted(p); 1571 + } 1572 + /* need to clear PG_reserved on remaining tail pages */ 1573 + for (; j < nr_pages; j++, p = mem_map_next(p, page, j)) 1574 + __ClearPageReserved(p); 1575 + set_compound_order(page, 0); 1576 + page[1].compound_nr = 0; 1577 + __ClearPageHead(page); 1578 + return false; 1655 1579 } 1656 1580 1657 1581 /* ··· 1808 1658 nodemask_t 
*node_alloc_noretry) 1809 1659 { 1810 1660 struct page *page; 1661 + bool retry = false; 1811 1662 1663 + retry: 1812 1664 if (hstate_is_gigantic(h)) 1813 1665 page = alloc_gigantic_page(h, gfp_mask, nid, nmask); 1814 1666 else ··· 1819 1667 if (!page) 1820 1668 return NULL; 1821 1669 1822 - if (hstate_is_gigantic(h)) 1823 - prep_compound_gigantic_page(page, huge_page_order(h)); 1670 + if (hstate_is_gigantic(h)) { 1671 + if (!prep_compound_gigantic_page(page, huge_page_order(h))) { 1672 + /* 1673 + * Rare failure to convert pages to compound page. 1674 + * Free pages and try again - ONCE! 1675 + */ 1676 + free_gigantic_page(page, huge_page_order(h)); 1677 + if (!retry) { 1678 + retry = true; 1679 + goto retry; 1680 + } 1681 + pr_warn("HugeTLB page can not be used due to unexpected inflated ref count\n"); 1682 + return NULL; 1683 + } 1684 + } 1824 1685 prep_new_huge_page(h, page, page_to_nid(page)); 1825 1686 1826 1687 return page; ··· 1902 1737 * nothing for in-use hugepages and non-hugepages. 1903 1738 * This function returns values like below: 1904 1739 * 1905 - * -EBUSY: failed to dissolved free hugepages or the hugepage is in-use 1906 - * (allocated or reserved.) 1907 - * 0: successfully dissolved free hugepages or the page is not a 1908 - * hugepage (considered as already dissolved) 1740 + * -ENOMEM: failed to allocate vmemmap pages to free the freed hugepages 1741 + * when the system is under memory pressure and the feature of 1742 + * freeing unused vmemmap pages associated with each hugetlb page 1743 + * is enabled. 1744 + * -EBUSY: failed to dissolve free hugepages or the hugepage is in-use 1745 + * (allocated or reserved.) 
1746 + * 0: successfully dissolved free hugepages or the page is not a 1747 + * hugepage (considered as already dissolved) 1909 1748 */ 1910 1749 int dissolve_free_huge_page(struct page *page) 1911 1750 { ··· 1951 1782 goto retry; 1952 1783 } 1953 1784 1954 - /* 1955 - * Move PageHWPoison flag from head page to the raw error page, 1956 - * which makes any subpages rather than the error page reusable. 1957 - */ 1958 - if (PageHWPoison(head) && page != head) { 1959 - SetPageHWPoison(page); 1960 - ClearPageHWPoison(head); 1961 - } 1962 1785 remove_hugetlb_page(h, head, false); 1963 1786 h->max_huge_pages--; 1964 1787 spin_unlock_irq(&hugetlb_lock); 1965 - update_and_free_page(h, head); 1966 - return 0; 1788 + 1789 + /* 1790 + * Normally update_and_free_page will allocate required vmemmap 1791 + * before freeing the page. update_and_free_page will fail to 1792 + * free the page if it cannot allocate required vmemmap. We 1793 + * need to adjust max_huge_pages if the page is not freed. 1794 + * Attempt to allocate vmemmap here so that we can take 1795 + * appropriate action on failure. 1796 + */ 1797 + rc = alloc_huge_page_vmemmap(h, head); 1798 + if (!rc) { 1799 + /* 1800 + * Move PageHWPoison flag from head page to the raw 1801 + * error page, which makes any subpages rather than 1802 + * the error page reusable. 1803 + */ 1804 + if (PageHWPoison(head) && page != head) { 1805 + SetPageHWPoison(page); 1806 + ClearPageHWPoison(head); 1807 + } 1808 + update_and_free_page(h, head, false); 1809 + } else { 1810 + spin_lock_irq(&hugetlb_lock); 1811 + add_hugetlb_page(h, head, false); 1812 + h->max_huge_pages++; 1813 + spin_unlock_irq(&hugetlb_lock); 1814 + } 1815 + 1816 + return rc; 1967 1817 } 1968 1818 out: 1969 1819 spin_unlock_irq(&hugetlb_lock); ··· 2539 2351 2540 2352 /* 2541 2353 * Before dissolving the page, we need to allocate a new one for the 2542 - * pool to remain stable. 
Using alloc_buddy_huge_page() allows us to 2543 - * not having to deal with prep_new_huge_page() and avoids dealing of any 2544 - * counters. This simplifies and let us do the whole thing under the 2545 - * lock. 2354 + * pool to remain stable. Here, we allocate the page and 'prep' it 2355 + * by doing everything but actually updating counters and adding to 2356 + * the pool. This simplifies things and lets us do most of the processing 2357 + * under the lock. 2546 2358 */ 2547 2359 new_page = alloc_buddy_huge_page(h, gfp_mask, nid, NULL, NULL); 2548 2360 if (!new_page) 2549 2361 return -ENOMEM; 2362 + __prep_new_huge_page(h, new_page); 2550 2363 2551 2364 retry: 2552 2365 spin_lock_irq(&hugetlb_lock); ··· 2586 2397 remove_hugetlb_page(h, old_page, false); 2587 2398 2588 2399 /* 2589 - * new_page needs to be initialized with the standard hugetlb 2590 - * state. This is normally done by prep_new_huge_page() but 2591 - * that takes hugetlb_lock which is already held so we need to 2592 - * open code it here. 2593 2400 * Reference count trick is needed because allocator gives us 2594 2401 * referenced page but the pool requires pages with 0 refcount. 2595 2402 */ 2596 - __prep_new_huge_page(new_page); 2597 2403 __prep_account_new_huge_page(h, nid); 2598 2404 page_ref_dec(new_page); 2599 2405 enqueue_huge_page(h, new_page); ··· 2597 2413 * Pages have been replaced, we can safely free the old one. 
2598 2414 */ 2599 2415 spin_unlock_irq(&hugetlb_lock); 2600 - update_and_free_page(h, old_page); 2416 + update_and_free_page(h, old_page, false); 2601 2417 } 2602 2418 2603 2419 return ret; 2604 2420 2605 2421 free_new: 2606 2422 spin_unlock_irq(&hugetlb_lock); 2607 - __free_pages(new_page, huge_page_order(h)); 2423 + update_and_free_page(h, new_page, false); 2608 2424 2609 2425 return ret; 2610 2426 } ··· 2809 2625 return 1; 2810 2626 } 2811 2627 2812 - static void __init prep_compound_huge_page(struct page *page, 2813 - unsigned int order) 2814 - { 2815 - if (unlikely(order > (MAX_ORDER - 1))) 2816 - prep_compound_gigantic_page(page, order); 2817 - else 2818 - prep_compound_page(page, order); 2819 - } 2820 - 2821 - /* Put bootmem huge pages into the standard lists after mem_map is up */ 2628 + /* 2629 + * Put bootmem huge pages into the standard lists after mem_map is up. 2630 + * Note: This only applies to gigantic (order > MAX_ORDER) pages. 2631 + */ 2822 2632 static void __init gather_bootmem_prealloc(void) 2823 2633 { 2824 2634 struct huge_bootmem_page *m; ··· 2821 2643 struct page *page = virt_to_page(m); 2822 2644 struct hstate *h = m->hstate; 2823 2645 2646 + VM_BUG_ON(!hstate_is_gigantic(h)); 2824 2647 WARN_ON(page_count(page) != 1); 2825 - prep_compound_huge_page(page, huge_page_order(h)); 2826 - WARN_ON(PageReserved(page)); 2827 - prep_new_huge_page(h, page, page_to_nid(page)); 2828 - put_page(page); /* free it into the hugepage allocator */ 2648 + if (prep_compound_gigantic_page(page, huge_page_order(h))) { 2649 + WARN_ON(PageReserved(page)); 2650 + prep_new_huge_page(h, page, page_to_nid(page)); 2651 + put_page(page); /* add to the hugepage allocator */ 2652 + } else { 2653 + free_gigantic_page(page, huge_page_order(h)); 2654 + pr_warn("HugeTLB page can not be used due to unexpected inflated ref count\n"); 2655 + } 2829 2656 2830 2657 /* 2831 - * If we had gigantic hugepages allocated at boot time, we need 2832 - * to restore the 'stolen' pages to 
totalram_pages in order to 2833 - * fix confusing memory reports from free(1) and another 2834 - * side-effects, like CommitLimit going negative. 2658 + * We need to restore the 'stolen' pages to totalram_pages 2659 + * in order to fix confusing memory reports from free(1) and 2660 + * other side-effects, like CommitLimit going negative. 2835 2661 */ 2836 - if (hstate_is_gigantic(h)) 2837 - adjust_managed_page_count(page, pages_per_huge_page(h)); 2662 + adjust_managed_page_count(page, pages_per_huge_page(h)); 2838 2663 cond_resched(); 2839 2664 } 2840 2665 } ··· 3015 2834 * pages in hstate via the proc/sysfs interfaces. 3016 2835 */ 3017 2836 mutex_lock(&h->resize_lock); 2837 + flush_free_hpage_work(h); 3018 2838 spin_lock_irq(&hugetlb_lock); 3019 2839 3020 2840 /* ··· 3125 2943 /* free the pages after dropping lock */ 3126 2944 spin_unlock_irq(&hugetlb_lock); 3127 2945 update_and_free_pages_bulk(h, &page_list); 2946 + flush_free_hpage_work(h); 3128 2947 spin_lock_irq(&hugetlb_lock); 3129 2948 3130 2949 while (count < persistent_huge_pages(h)) { ··· 3633 3450 h->next_nid_to_free = first_memory_node; 3634 3451 snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB", 3635 3452 huge_page_size(h)/1024); 3453 + hugetlb_vmemmap_init(h); 3636 3454 3637 3455 parsed_hstate = h; 3638 3456 } ··· 4108 3924 int writable) 4109 3925 { 4110 3926 pte_t entry; 3927 + unsigned int shift = huge_page_shift(hstate_vma(vma)); 4111 3928 4112 3929 if (writable) { 4113 3930 entry = huge_pte_mkwrite(huge_pte_mkdirty(mk_huge_pte(page, ··· 4119 3934 } 4120 3935 entry = pte_mkyoung(entry); 4121 3936 entry = pte_mkhuge(entry); 4122 - entry = arch_make_huge_pte(entry, vma, page, writable); 3937 + entry = arch_make_huge_pte(entry, shift, vma->vm_flags); 4123 3938 4124 3939 return entry; 4125 3940 } ··· 4242 4057 is_hugetlb_entry_hwpoisoned(entry))) { 4243 4058 swp_entry_t swp_entry = pte_to_swp_entry(entry); 4244 4059 4245 - if (is_write_migration_entry(swp_entry) && cow) { 4060 + if 
(is_writable_migration_entry(swp_entry) && cow) { 4246 4061 /* 4247 4062 * COW mappings require pages in both 4248 4063 * parent and child to be set to read. 4249 4064 */ 4250 - make_migration_entry_read(&swp_entry); 4065 + swp_entry = make_readable_migration_entry( 4066 + swp_offset(swp_entry)); 4251 4067 entry = swp_entry_to_pte(swp_entry); 4252 4068 set_huge_swap_pte_at(src, addr, src_pte, 4253 4069 entry, sz); ··· 5125 4939 struct page **pagep) 5126 4940 { 5127 4941 bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE); 5128 - struct address_space *mapping; 5129 - pgoff_t idx; 4942 + struct hstate *h = hstate_vma(dst_vma); 4943 + struct address_space *mapping = dst_vma->vm_file->f_mapping; 4944 + pgoff_t idx = vma_hugecache_offset(h, dst_vma, dst_addr); 5130 4945 unsigned long size; 5131 4946 int vm_shared = dst_vma->vm_flags & VM_SHARED; 5132 - struct hstate *h = hstate_vma(dst_vma); 5133 4947 pte_t _dst_pte; 5134 4948 spinlock_t *ptl; 5135 - int ret; 4949 + int ret = -ENOMEM; 5136 4950 struct page *page; 5137 4951 int writable; 5138 - 5139 - mapping = dst_vma->vm_file->f_mapping; 5140 - idx = vma_hugecache_offset(h, dst_vma, dst_addr); 5141 4952 5142 4953 if (is_continue) { 5143 4954 ret = -EFAULT; ··· 5164 4981 /* fallback to copy_from_user outside mmap_lock */ 5165 4982 if (unlikely(ret)) { 5166 4983 ret = -ENOENT; 4984 + /* Free the allocated page which may have 4985 + * consumed a reservation. 4986 + */ 4987 + restore_reserve_on_error(h, dst_vma, dst_addr, page); 4988 + put_page(page); 4989 + 4990 + /* Allocate a temporary page to hold the copied 4991 + * contents. 4992 + */ 4993 + page = alloc_huge_page_vma(h, dst_vma, dst_addr); 4994 + if (!page) { 4995 + ret = -ENOMEM; 4996 + goto out; 4997 + } 5167 4998 *pagep = page; 5168 - /* don't free the page */ 4999 + /* Set the outparam pagep and return to the caller to 5000 + * copy the contents outside the lock. Don't free the 5001 + * page. 
5002 + */ 5169 5003 goto out; 5170 5004 } 5171 5005 } else { 5172 - page = *pagep; 5006 + if (vm_shared && 5007 + hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) { 5008 + put_page(*pagep); 5009 + ret = -EEXIST; 5010 + *pagep = NULL; 5011 + goto out; 5012 + } 5013 + 5014 + page = alloc_huge_page(dst_vma, dst_addr, 0); 5015 + if (IS_ERR(page)) { 5016 + ret = -ENOMEM; 5017 + *pagep = NULL; 5018 + goto out; 5019 + } 5020 + copy_huge_page(page, *pagep); 5021 + put_page(*pagep); 5173 5022 *pagep = NULL; 5174 5023 } 5175 5024 ··· 5533 5318 if (unlikely(is_hugetlb_entry_migration(pte))) { 5534 5319 swp_entry_t entry = pte_to_swp_entry(pte); 5535 5320 5536 - if (is_write_migration_entry(entry)) { 5321 + if (is_writable_migration_entry(entry)) { 5537 5322 pte_t newpte; 5538 5323 5539 - make_migration_entry_read(&entry); 5324 + entry = make_readable_migration_entry( 5325 + swp_offset(entry)); 5540 5326 newpte = swp_entry_to_pte(entry); 5541 5327 set_huge_swap_pte_at(mm, address, ptep, 5542 5328 newpte, huge_page_size(h)); ··· 5548 5332 } 5549 5333 if (!huge_pte_none(pte)) { 5550 5334 pte_t old_pte; 5335 + unsigned int shift = huge_page_shift(hstate_vma(vma)); 5551 5336 5552 5337 old_pte = huge_ptep_modify_prot_start(vma, address, ptep); 5553 5338 pte = pte_mkhuge(huge_pte_modify(old_pte, newprot)); 5554 - pte = arch_make_huge_pte(pte, vma, NULL, 0); 5339 + pte = arch_make_huge_pte(pte, shift, vma->vm_flags); 5555 5340 huge_ptep_modify_prot_commit(vma, address, ptep, old_pte, pte); 5556 5341 pages++; 5557 5342 }
+298
mm/hugetlb_vmemmap.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* 3 + * Free some vmemmap pages of HugeTLB 4 + * 5 + * Copyright (c) 2020, Bytedance. All rights reserved. 6 + * 7 + * Author: Muchun Song <songmuchun@bytedance.com> 8 + * 9 + * The struct page structures (page structs) are used to describe a physical 10 + * page frame. By default, there is a one-to-one mapping from a page frame to 11 + * its corresponding page struct. 12 + * 13 + * HugeTLB pages consist of multiple base page size pages and are supported by 14 + * many architectures. See hugetlbpage.rst in the Documentation directory for 15 + * more details. On the x86-64 architecture, HugeTLB pages of size 2MB and 1GB 16 + * are currently supported. Since the base page size on x86 is 4KB, a 2MB 17 + * HugeTLB page consists of 512 base pages and a 1GB HugeTLB page consists of 18 + * 4096 base pages. For each base page, there is a corresponding page struct. 19 + * 20 + * Within the HugeTLB subsystem, only the first 4 page structs are used to 21 + * contain unique information about a HugeTLB page. __NR_USED_SUBPAGE provides 22 + * this upper limit. The only 'useful' information in the remaining page structs 23 + * is the compound_head field, and this field is the same for all tail pages. 24 + * 25 + * By removing redundant page structs for HugeTLB pages, memory can be returned 26 + * to the buddy allocator for other uses. 27 + * 28 + * Different architectures support different HugeTLB page sizes. For example, 29 + * the following table lists the HugeTLB page sizes supported by the x86 and 30 + * arm64 architectures. Because arm64 supports 4k, 16k, and 64k base pages as 31 + * well as contiguous entries, it supports many different HugeTLB 32 + * page sizes. 
33 + * 34 + * +--------------+-----------+-----------------------------------------------+ 35 + * | Architecture | Page Size | HugeTLB Page Size | 36 + * +--------------+-----------+-----------+-----------+-----------+-----------+ 37 + * | x86-64 | 4KB | 2MB | 1GB | | | 38 + * +--------------+-----------+-----------+-----------+-----------+-----------+ 39 + * | | 4KB | 64KB | 2MB | 32MB | 1GB | 40 + * | +-----------+-----------+-----------+-----------+-----------+ 41 + * | arm64 | 16KB | 2MB | 32MB | 1GB | | 42 + * | +-----------+-----------+-----------+-----------+-----------+ 43 + * | | 64KB | 2MB | 512MB | 16GB | | 44 + * +--------------+-----------+-----------+-----------+-----------+-----------+ 45 + * 46 + * When the system boots up, every HugeTLB page is described by more than one 47 + * page of struct page structs, whose size is (unit: pages): 48 + * 49 + * struct_size = HugeTLB_Size / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE 50 + * 51 + * Where HugeTLB_Size is the size of the HugeTLB page. We know that the size 52 + * of the HugeTLB page is always n times PAGE_SIZE. So we can get the following 53 + * relationship: 54 + * 55 + * HugeTLB_Size = n * PAGE_SIZE 56 + * 57 + * Then, 58 + * 59 + * struct_size = n * PAGE_SIZE / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE 60 + * = n * sizeof(struct page) / PAGE_SIZE 61 + * 62 + * We can use huge mapping at the pud/pmd level for the HugeTLB page. 63 + * 64 + * For the HugeTLB page of the pmd level mapping, then 65 + * 66 + * struct_size = n * sizeof(struct page) / PAGE_SIZE 67 + * = PAGE_SIZE / sizeof(pte_t) * sizeof(struct page) / PAGE_SIZE 68 + * = sizeof(struct page) / sizeof(pte_t) 69 + * = 64 / 8 70 + * = 8 (pages) 71 + * 72 + * where n is the number of pte entries that one page can contain, i.e. the 73 + * value of n is (PAGE_SIZE / sizeof(pte_t)). 74 + * 75 + * This optimization only supports 64-bit systems, so the value of sizeof(pte_t) 76 + * is 8. This optimization is also applicable only when the size of struct page 77 + * is a power of two. In most cases, the size of struct page is 64 bytes (e.g. 78 + * x86-64 and arm64). So if we use pmd level mapping for a HugeTLB page, its 79 + * struct page structs occupy 8 page frames, whose size depends on the size of 80 + * the base page. 81 + * 82 + * For the HugeTLB page of the pud level mapping, then 83 + * 84 + * struct_size = PAGE_SIZE / sizeof(pmd_t) * struct_size(pmd) 85 + * = PAGE_SIZE / 8 * 8 (pages) 86 + * = PAGE_SIZE (pages) 87 + * 88 + * Where struct_size(pmd) is the size of the struct page structs of a 89 + * HugeTLB page of the pmd level mapping. 90 + * 91 + * E.g.: the struct page structs of a 2MB HugeTLB page on x86_64 occupy 8 page 92 + * frames, while those of a 1GB HugeTLB page occupy 4096. 93 + * 94 + * Next, we take the pmd level mapping of the HugeTLB page as an example to 95 + * show the internal implementation of this optimization. There are 8 pages of 96 + * struct page structs associated with a HugeTLB page which is pmd mapped. 97 + * 98 + * Here is how things look before optimization. 
99 + * 100 + * HugeTLB struct pages(8 pages) page frame(8 pages) 101 + * +-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+ 102 + * | | | 0 | -------------> | 0 | 103 + * | | +-----------+ +-----------+ 104 + * | | | 1 | -------------> | 1 | 105 + * | | +-----------+ +-----------+ 106 + * | | | 2 | -------------> | 2 | 107 + * | | +-----------+ +-----------+ 108 + * | | | 3 | -------------> | 3 | 109 + * | | +-----------+ +-----------+ 110 + * | | | 4 | -------------> | 4 | 111 + * | PMD | +-----------+ +-----------+ 112 + * | level | | 5 | -------------> | 5 | 113 + * | mapping | +-----------+ +-----------+ 114 + * | | | 6 | -------------> | 6 | 115 + * | | +-----------+ +-----------+ 116 + * | | | 7 | -------------> | 7 | 117 + * | | +-----------+ +-----------+ 118 + * | | 119 + * | | 120 + * | | 121 + * +-----------+ 122 + * 123 + * The value of page->compound_head is the same for all tail pages. The first 124 + * page of page structs (page 0) associated with the HugeTLB page contains the 4 125 + * page structs necessary to describe the HugeTLB. The only use of the remaining 126 + * pages of page structs (page 1 to page 7) is to point to page->compound_head. 127 + * Therefore, we can remap pages 2 to 7 to page 1. Only 2 pages of page structs 128 + * will be used for each HugeTLB page. This will allow us to free the remaining 129 + * 6 pages to the buddy allocator. 130 + * 131 + * Here is how things look after remapping. 
132 + * 133 + * HugeTLB struct pages(8 pages) page frame(8 pages) 134 + * +-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+ 135 + * | | | 0 | -------------> | 0 | 136 + * | | +-----------+ +-----------+ 137 + * | | | 1 | -------------> | 1 | 138 + * | | +-----------+ +-----------+ 139 + * | | | 2 | ----------------^ ^ ^ ^ ^ ^ 140 + * | | +-----------+ | | | | | 141 + * | | | 3 | ------------------+ | | | | 142 + * | | +-----------+ | | | | 143 + * | | | 4 | --------------------+ | | | 144 + * | PMD | +-----------+ | | | 145 + * | level | | 5 | ----------------------+ | | 146 + * | mapping | +-----------+ | | 147 + * | | | 6 | ------------------------+ | 148 + * | | +-----------+ | 149 + * | | | 7 | --------------------------+ 150 + * | | +-----------+ 151 + * | | 152 + * | | 153 + * | | 154 + * +-----------+ 155 + * 156 + * When a HugeTLB page is freed to the buddy system, we should allocate 6 pages 157 + * for vmemmap pages and restore the previous mapping relationship. 158 + * 159 + * The HugeTLB page of the pud level mapping is similar to the former: we can 160 + * also use this approach to free (PAGE_SIZE - 2) vmemmap pages. 161 + * 162 + * Apart from the HugeTLB page of the pmd/pud level mapping, some architectures 163 + * (e.g. aarch64) provide a contiguous bit in the translation table entries 164 + * that hints to the MMU to indicate that it is one of a contiguous set of 165 + * entries that can be cached in a single TLB entry. 166 + * 167 + * The contiguous bit is used to increase the mapping size at the pmd and pte 168 + * (last) level. So this type of HugeTLB page can be optimized only when the 169 + * size of its struct page structs is greater than 2 pages. 170 + */ 171 + #define pr_fmt(fmt) "HugeTLB: " fmt 172 + 173 + #include "hugetlb_vmemmap.h" 174 + 175 + /* 176 + * There are a lot of struct page structures associated with each HugeTLB page. 177 + * For tail pages, the value of compound_head is the same. 
So we can reuse the first 178 + * page of tail page structures. We map the virtual addresses of the remaining 179 + * pages of tail page structures to the first tail page struct, and then free 180 + * these page frames. Therefore, we need to reserve two pages as vmemmap areas. 181 + */ 182 + #define RESERVE_VMEMMAP_NR 2U 183 + #define RESERVE_VMEMMAP_SIZE (RESERVE_VMEMMAP_NR << PAGE_SHIFT) 184 + 185 + bool hugetlb_free_vmemmap_enabled = IS_ENABLED(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON); 186 + 187 + static int __init early_hugetlb_free_vmemmap_param(char *buf) 188 + { 189 + /* We cannot optimize if a "struct page" crosses page boundaries. */ 190 + if ((!is_power_of_2(sizeof(struct page)))) { 191 + pr_warn("cannot free vmemmap pages because \"struct page\" crosses page boundaries\n"); 192 + return 0; 193 + } 194 + 195 + if (!buf) 196 + return -EINVAL; 197 + 198 + if (!strcmp(buf, "on")) 199 + hugetlb_free_vmemmap_enabled = true; 200 + else if (!strcmp(buf, "off")) 201 + hugetlb_free_vmemmap_enabled = false; 202 + else 203 + return -EINVAL; 204 + 205 + return 0; 206 + } 207 + early_param("hugetlb_free_vmemmap", early_hugetlb_free_vmemmap_param); 208 + 209 + static inline unsigned long free_vmemmap_pages_size_per_hpage(struct hstate *h) 210 + { 211 + return (unsigned long)free_vmemmap_pages_per_hpage(h) << PAGE_SHIFT; 212 + } 213 + 214 + /* 215 + * Previously discarded vmemmap pages will be allocated and remapped when 216 + * this function returns zero. 
217 + */ 218 + int alloc_huge_page_vmemmap(struct hstate *h, struct page *head) 219 + { 220 + int ret; 221 + unsigned long vmemmap_addr = (unsigned long)head; 222 + unsigned long vmemmap_end, vmemmap_reuse; 223 + 224 + if (!HPageVmemmapOptimized(head)) 225 + return 0; 226 + 227 + vmemmap_addr += RESERVE_VMEMMAP_SIZE; 228 + vmemmap_end = vmemmap_addr + free_vmemmap_pages_size_per_hpage(h); 229 + vmemmap_reuse = vmemmap_addr - PAGE_SIZE; 230 + /* 231 + * The pages which the vmemmap virtual address range [@vmemmap_addr, 232 + * @vmemmap_end) are mapped to are freed to the buddy allocator, and 233 + * the range is mapped to the page which @vmemmap_reuse is mapped to. 234 + * When a HugeTLB page is freed to the buddy allocator, previously 235 + * discarded vmemmap pages must be allocated and remapped. 236 + */ 237 + ret = vmemmap_remap_alloc(vmemmap_addr, vmemmap_end, vmemmap_reuse, 238 + GFP_KERNEL | __GFP_NORETRY | __GFP_THISNODE); 239 + 240 + if (!ret) 241 + ClearHPageVmemmapOptimized(head); 242 + 243 + return ret; 244 + } 245 + 246 + void free_huge_page_vmemmap(struct hstate *h, struct page *head) 247 + { 248 + unsigned long vmemmap_addr = (unsigned long)head; 249 + unsigned long vmemmap_end, vmemmap_reuse; 250 + 251 + if (!free_vmemmap_pages_per_hpage(h)) 252 + return; 253 + 254 + vmemmap_addr += RESERVE_VMEMMAP_SIZE; 255 + vmemmap_end = vmemmap_addr + free_vmemmap_pages_size_per_hpage(h); 256 + vmemmap_reuse = vmemmap_addr - PAGE_SIZE; 257 + 258 + /* 259 + * Remap the vmemmap virtual address range [@vmemmap_addr, @vmemmap_end) 260 + * to the page which @vmemmap_reuse is mapped to, then free the pages 261 + * which the range [@vmemmap_addr, @vmemmap_end) is mapped to. 
262 + */ 263 + if (!vmemmap_remap_free(vmemmap_addr, vmemmap_end, vmemmap_reuse)) 264 + SetHPageVmemmapOptimized(head); 265 + } 266 + 267 + void __init hugetlb_vmemmap_init(struct hstate *h) 268 + { 269 + unsigned int nr_pages = pages_per_huge_page(h); 270 + unsigned int vmemmap_pages; 271 + 272 + /* 273 + * There are only (RESERVE_VMEMMAP_SIZE / sizeof(struct page)) struct 274 + * page structs that can be used when CONFIG_HUGETLB_PAGE_FREE_VMEMMAP is 275 + * enabled, so add a BUILD_BUG_ON to catch invalid usage of the tail struct page. 276 + */ 277 + BUILD_BUG_ON(__NR_USED_SUBPAGE >= 278 + RESERVE_VMEMMAP_SIZE / sizeof(struct page)); 279 + 280 + if (!hugetlb_free_vmemmap_enabled) 281 + return; 282 + 283 + vmemmap_pages = (nr_pages * sizeof(struct page)) >> PAGE_SHIFT; 284 + /* 285 + * The head page and the first tail page are not to be freed to the buddy 286 + * allocator; the other pages will map to the first tail page, so they 287 + * can be freed. 288 + * 289 + * Could RESERVE_VMEMMAP_NR be greater than @vmemmap_pages? It is true 290 + * on some architectures (e.g. aarch64). See Documentation/arm64/ 291 + * hugetlbpage.rst for more details. 292 + */ 293 + if (likely(vmemmap_pages > RESERVE_VMEMMAP_NR)) 294 + h->nr_free_vmemmap_pages = vmemmap_pages - RESERVE_VMEMMAP_NR; 295 + 296 + pr_info("can free %d vmemmap pages for %s\n", h->nr_free_vmemmap_pages, 297 + h->name); 298 + }
+45
mm/hugetlb_vmemmap.h
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* 3 + * Free some vmemmap pages of HugeTLB 4 + * 5 + * Copyright (c) 2020, Bytedance. All rights reserved. 6 + * 7 + * Author: Muchun Song <songmuchun@bytedance.com> 8 + */ 9 + #ifndef _LINUX_HUGETLB_VMEMMAP_H 10 + #define _LINUX_HUGETLB_VMEMMAP_H 11 + #include <linux/hugetlb.h> 12 + 13 + #ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP 14 + int alloc_huge_page_vmemmap(struct hstate *h, struct page *head); 15 + void free_huge_page_vmemmap(struct hstate *h, struct page *head); 16 + void hugetlb_vmemmap_init(struct hstate *h); 17 + 18 + /* 19 + * How many vmemmap pages associated with a HugeTLB page that can be freed 20 + * to the buddy allocator. 21 + */ 22 + static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h) 23 + { 24 + return h->nr_free_vmemmap_pages; 25 + } 26 + #else 27 + static inline int alloc_huge_page_vmemmap(struct hstate *h, struct page *head) 28 + { 29 + return 0; 30 + } 31 + 32 + static inline void free_huge_page_vmemmap(struct hstate *h, struct page *head) 33 + { 34 + } 35 + 36 + static inline void hugetlb_vmemmap_init(struct hstate *h) 37 + { 38 + } 39 + 40 + static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h) 41 + { 42 + return 0; 43 + } 44 + #endif /* CONFIG_HUGETLB_PAGE_FREE_VMEMMAP */ 45 + #endif /* _LINUX_HUGETLB_VMEMMAP_H */
+8 -21
mm/internal.h
··· 274 274 int 275 275 isolate_migratepages_range(struct compact_control *cc, 276 276 unsigned long low_pfn, unsigned long end_pfn); 277 + #endif 277 278 int find_suitable_fallback(struct free_area *area, unsigned int order, 278 279 int migratetype, bool only_stealable, bool *can_steal); 279 - 280 - #endif 281 280 282 281 /* 283 282 * This function returns the order of a free page in the buddy system. In ··· 343 344 344 345 #ifdef CONFIG_MMU 345 346 extern long populate_vma_page_range(struct vm_area_struct *vma, 346 - unsigned long start, unsigned long end, int *nonblocking); 347 + unsigned long start, unsigned long end, int *locked); 348 + extern long faultin_vma_page_range(struct vm_area_struct *vma, 349 + unsigned long start, unsigned long end, 350 + bool write, int *locked); 347 351 extern void munlock_vma_pages_range(struct vm_area_struct *vma, 348 352 unsigned long start, unsigned long end); 349 353 static inline void munlock_vma_pages_all(struct vm_area_struct *vma) ··· 370 368 * is revert to lazy LRU behaviour -- semantics are not broken. 371 369 */ 372 370 extern void clear_page_mlock(struct page *page); 373 - 374 - /* 375 - * mlock_migrate_page - called only from migrate_misplaced_transhuge_page() 376 - * (because that does not go through the full procedure of migration ptes): 377 - * to migrate the Mlocked page flag; update statistics. 
378 - */ 379 - static inline void mlock_migrate_page(struct page *newpage, struct page *page) 380 - { 381 - if (TestClearPageMlocked(page)) { 382 - int nr_pages = thp_nr_pages(page); 383 - 384 - /* Holding pmd lock, no change in irq context: __mod is safe */ 385 - __mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages); 386 - SetPageMlocked(newpage); 387 - __mod_zone_page_state(page_zone(newpage), NR_MLOCK, nr_pages); 388 - } 389 - } 390 371 391 372 extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma); 392 373 ··· 446 461 #else /* !CONFIG_MMU */ 447 462 static inline void clear_page_mlock(struct page *page) { } 448 463 static inline void mlock_vma_page(struct page *page) { } 449 - static inline void mlock_migrate_page(struct page *new, struct page *old) { } 450 464 static inline void vunmap_range_noflush(unsigned long start, unsigned long end) 451 465 { 452 466 } ··· 655 671 #endif 656 672 657 673 void vunmap_range_noflush(unsigned long start, unsigned long end); 674 + 675 + int numa_migrate_prep(struct page *page, struct vm_area_struct *vma, 676 + unsigned long addr, int page_nid, int *flags); 658 677 659 678 #endif /* __MM_INTERNAL_H */
+2 -2
mm/kfence/core.c
··· 636 636 /* Disable static key and reset timer. */ 637 637 static_branch_disable(&kfence_allocation_key); 638 638 #endif 639 - queue_delayed_work(system_power_efficient_wq, &kfence_timer, 639 + queue_delayed_work(system_unbound_wq, &kfence_timer, 640 640 msecs_to_jiffies(kfence_sample_interval)); 641 641 } 642 642 static DECLARE_DELAYED_WORK(kfence_timer, toggle_allocation_gate); ··· 666 666 } 667 667 668 668 WRITE_ONCE(kfence_enabled, true); 669 - queue_delayed_work(system_power_efficient_wq, &kfence_timer, 0); 669 + queue_delayed_work(system_unbound_wq, &kfence_timer, 0); 670 670 pr_info("initialized - using %lu bytes for %d objects at 0x%p-0x%p\n", KFENCE_POOL_SIZE, 671 671 CONFIG_KFENCE_NUM_OBJECTS, (void *)__kfence_pool, 672 672 (void *)(__kfence_pool + KFENCE_POOL_SIZE));
+16 -4
mm/khugepaged.c
··· 442 442 static bool hugepage_vma_check(struct vm_area_struct *vma, 443 443 unsigned long vm_flags) 444 444 { 445 - /* Explicitly disabled through madvise. */ 446 - if ((vm_flags & VM_NOHUGEPAGE) || 447 - test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags)) 445 + if (!transhuge_vma_enabled(vma, vm_flags)) 448 446 return false; 449 447 450 448 /* Enabled via shmem mount options or sysfs settings. */ ··· 457 459 458 460 /* Read-only file mappings need to be aligned for THP to work. */ 459 461 if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && vma->vm_file && 460 - (vm_flags & VM_DENYWRITE)) { 462 + !inode_is_open_for_write(vma->vm_file->f_inode) && 463 + (vm_flags & VM_EXEC)) { 461 464 return IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff, 462 465 HPAGE_PMD_NR); 463 466 } ··· 1863 1864 else { 1864 1865 __mod_lruvec_page_state(new_page, NR_FILE_THPS, nr); 1865 1866 filemap_nr_thps_inc(mapping); 1867 + /* 1868 + * Paired with smp_mb() in do_dentry_open() to ensure 1869 + * i_writecount is up to date and the update to nr_thps is 1870 + * visible. Ensures the page cache will be truncated if the 1871 + * file is opened writable. 1872 + */ 1873 + smp_mb(); 1874 + if (inode_is_open_for_write(mapping->host)) { 1875 + result = SCAN_FAIL; 1876 + __mod_lruvec_page_state(new_page, NR_FILE_THPS, -nr); 1877 + filemap_nr_thps_dec(mapping); 1878 + goto xa_locked; 1879 + } 1866 1880 } 1867 1881 1868 1882 if (nr_none) {
+66
mm/madvise.c
··· 53 53 case MADV_COLD: 54 54 case MADV_PAGEOUT: 55 55 case MADV_FREE: 56 + case MADV_POPULATE_READ: 57 + case MADV_POPULATE_WRITE: 56 58 return 0; 57 59 default: 58 60 /* be safe, default to 1. list exceptions explicitly */ ··· 824 822 return -EINVAL; 825 823 } 826 824 825 + static long madvise_populate(struct vm_area_struct *vma, 826 + struct vm_area_struct **prev, 827 + unsigned long start, unsigned long end, 828 + int behavior) 829 + { 830 + const bool write = behavior == MADV_POPULATE_WRITE; 831 + struct mm_struct *mm = vma->vm_mm; 832 + unsigned long tmp_end; 833 + int locked = 1; 834 + long pages; 835 + 836 + *prev = vma; 837 + 838 + while (start < end) { 839 + /* 840 + * We might have temporarily dropped the lock. For example, 841 + * our VMA might have been split. 842 + */ 843 + if (!vma || start >= vma->vm_end) { 844 + vma = find_vma(mm, start); 845 + if (!vma || start < vma->vm_start) 846 + return -ENOMEM; 847 + } 848 + 849 + tmp_end = min_t(unsigned long, end, vma->vm_end); 850 + /* Populate (prefault) page tables readable/writable. */ 851 + pages = faultin_vma_page_range(vma, start, tmp_end, write, 852 + &locked); 853 + if (!locked) { 854 + mmap_read_lock(mm); 855 + locked = 1; 856 + *prev = NULL; 857 + vma = NULL; 858 + } 859 + if (pages < 0) { 860 + switch (pages) { 861 + case -EINTR: 862 + return -EINTR; 863 + case -EFAULT: /* Incompatible mappings / permissions. */ 864 + return -EINVAL; 865 + case -EHWPOISON: 866 + return -EHWPOISON; 867 + default: 868 + pr_warn_once("%s: unhandled return value: %ld\n", 869 + __func__, pages); 870 + fallthrough; 871 + case -ENOMEM: 872 + return -ENOMEM; 873 + } 874 + } 875 + start += pages * PAGE_SIZE; 876 + } 877 + return 0; 878 + } 879 + 827 880 /* 828 881 * Application wants to free up the pages and associated backing store. 829 882 * This is effectively punching a hole into the middle of a file. 
··· 992 935 case MADV_FREE: 993 936 case MADV_DONTNEED: 994 937 return madvise_dontneed_free(vma, prev, start, end, behavior); 938 + case MADV_POPULATE_READ: 939 + case MADV_POPULATE_WRITE: 940 + return madvise_populate(vma, prev, start, end, behavior); 995 941 default: 996 942 return madvise_behavior(vma, prev, start, end, behavior); 997 943 } ··· 1015 955 case MADV_FREE: 1016 956 case MADV_COLD: 1017 957 case MADV_PAGEOUT: 958 + case MADV_POPULATE_READ: 959 + case MADV_POPULATE_WRITE: 1018 960 #ifdef CONFIG_KSM 1019 961 case MADV_MERGEABLE: 1020 962 case MADV_UNMERGEABLE: ··· 1104 1042 * easily if memory pressure happens. 1105 1043 * MADV_PAGEOUT - the application is not expected to use this memory soon, 1106 1044 * page out the pages in this range immediately. 1045 + * MADV_POPULATE_READ - populate (prefault) page tables readable by 1046 + * triggering read faults if required 1047 + * MADV_POPULATE_WRITE - populate (prefault) page tables writable by 1048 + * triggering write faults if required 1107 1049 * 1108 1050 * return values: 1109 1051 * zero - success
+1 -1
mm/mapping_dirty_helpers.c
··· 317 317 * pfn_mkwrite(). And then after a TLB flush following the write-protection 318 318 * pick up all dirty bits. 319 319 * 320 - * Note: This function currently skips transhuge page-table entries, since 320 + * This function currently skips transhuge page-table entries, since 321 321 * it's intended for dirty-tracking on the PTE level. It will warn on 322 322 * encountering transhuge dirty entries, though, and can easily be extended 323 323 * to handle them as well.
+26 -2
mm/memblock.c
··· 906 906 * @base: the base phys addr of the region 907 907 * @size: the size of the region 908 908 * 909 + * The memory regions marked with %MEMBLOCK_NOMAP will not be added to the 910 + * direct mapping of the physical memory. These regions will still be 911 + * covered by the memory map. The struct page representing NOMAP memory 912 + * frames in the memory map will be PageReserved() 913 + * 909 914 * Return: 0 on success, -errno on failure. 910 915 */ 911 916 int __init_memblock memblock_mark_nomap(phys_addr_t base, phys_addr_t size) ··· 2007 2002 return end_pfn - start_pfn; 2008 2003 } 2009 2004 2005 + static void __init memmap_init_reserved_pages(void) 2006 + { 2007 + struct memblock_region *region; 2008 + phys_addr_t start, end; 2009 + u64 i; 2010 + 2011 + /* initialize struct pages for the reserved regions */ 2012 + for_each_reserved_mem_range(i, &start, &end) 2013 + reserve_bootmem_region(start, end); 2014 + 2015 + /* and also treat struct pages for the NOMAP regions as PageReserved */ 2016 + for_each_mem_region(region) { 2017 + if (memblock_is_nomap(region)) { 2018 + start = region->base; 2019 + end = start + region->size; 2020 + reserve_bootmem_region(start, end); 2021 + } 2022 + } 2023 + } 2024 + 2010 2025 static unsigned long __init free_low_memory_core_early(void) 2011 2026 { 2012 2027 unsigned long count = 0; ··· 2035 2010 2036 2011 memblock_clear_hotplug(0, -1); 2037 2012 2038 - for_each_reserved_mem_range(i, &start, &end) 2039 - reserve_bootmem_region(start, end); 2013 + memmap_init_reserved_pages(); 2040 2014 2041 2015 /* 2042 2016 * We need to use NUMA_NO_NODE instead of NODE_DATA(0)->node_id
+2 -2
mm/memcontrol.c
··· 5537 5537 * as special swap entry in the CPU page table. 5538 5538 */ 5539 5539 if (is_device_private_entry(ent)) { 5540 - page = device_private_entry_to_page(ent); 5540 + page = pfn_swap_entry_to_page(ent); 5541 5541 /* 5542 5542 * MEMORY_DEVICE_PRIVATE means ZONE_DEVICE page and which have 5543 5543 * a refcount of 1 when free (unlike normal page) ··· 6644 6644 } 6645 6645 6646 6646 /** 6647 - * mem_cgroup_protected - check if memory consumption is in the normal range 6647 + * mem_cgroup_calculate_protection - check if memory consumption is in the normal range 6648 6648 * @root: the top ancestor of the sub-tree being checked 6649 6649 * @memcg: the memory cgroup to check 6650 6650 *
+25 -13
mm/memory-failure.c
··· 66 66 67 67 atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0); 68 68 69 + static bool __page_handle_poison(struct page *page) 70 + { 71 + bool ret; 72 + 73 + zone_pcp_disable(page_zone(page)); 74 + ret = dissolve_free_huge_page(page); 75 + if (!ret) 76 + ret = take_page_off_buddy(page); 77 + zone_pcp_enable(page_zone(page)); 78 + 79 + return ret; 80 + } 81 + 69 82 static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, bool release) 70 83 { 71 84 if (hugepage_or_freepage) { ··· 86 73 * Doing this check for free pages is also fine since dissolve_free_huge_page 87 74 * returns 0 for non-hugetlb pages as well. 88 75 */ 89 - if (dissolve_free_huge_page(page) || !take_page_off_buddy(page)) 76 + if (!__page_handle_poison(page)) 90 77 /* 91 78 * We could fail to take off the target page from buddy 92 79 * for example due to racy page allocation, but that's ··· 998 985 */ 999 986 if (PageAnon(hpage)) 1000 987 put_page(hpage); 1001 - if (!dissolve_free_huge_page(p) && take_page_off_buddy(p)) { 988 + if (__page_handle_poison(p)) { 1002 989 page_ref_inc(p); 1003 990 res = MF_RECOVERED; 1004 991 } ··· 1266 1253 static bool hwpoison_user_mappings(struct page *p, unsigned long pfn, 1267 1254 int flags, struct page **hpagep) 1268 1255 { 1269 - enum ttu_flags ttu = TTU_IGNORE_MLOCK; 1256 + enum ttu_flags ttu = TTU_IGNORE_MLOCK | TTU_SYNC; 1270 1257 struct address_space *mapping; 1271 1258 LIST_HEAD(tokill); 1272 - bool unmap_success = true; 1259 + bool unmap_success; 1273 1260 int kill = 1, forcekill; 1274 1261 struct page *hpage = *hpagep; 1275 1262 bool mlocked = PageMlocked(hpage); ··· 1332 1319 collect_procs(hpage, &tokill, flags & MF_ACTION_REQUIRED); 1333 1320 1334 1321 if (!PageHuge(hpage)) { 1335 - unmap_success = try_to_unmap(hpage, ttu); 1322 + try_to_unmap(hpage, ttu); 1336 1323 } else { 1337 1324 if (!PageAnon(hpage)) { 1338 1325 /* ··· 1340 1327 * could potentially call huge_pmd_unshare. 
Because of 1341 1328 * this, take semaphore in write mode here and set 1342 1329 * TTU_RMAP_LOCKED to indicate we have taken the lock 1343 - * at this higer level. 1330 + * at this higher level. 1344 1331 */ 1345 1332 mapping = hugetlb_page_mapping_lock_write(hpage); 1346 1333 if (mapping) { 1347 - unmap_success = try_to_unmap(hpage, 1348 - ttu|TTU_RMAP_LOCKED); 1334 + try_to_unmap(hpage, ttu|TTU_RMAP_LOCKED); 1349 1335 i_mmap_unlock_write(mapping); 1350 - } else { 1336 + } else 1351 1337 pr_info("Memory failure: %#lx: could not lock mapping for mapped huge page\n", pfn); 1352 - unmap_success = false; 1353 - } 1354 1338 } else { 1355 - unmap_success = try_to_unmap(hpage, ttu); 1339 + try_to_unmap(hpage, ttu); 1356 1340 } 1357 1341 } 1342 + 1343 + unmap_success = !page_mapped(hpage); 1358 1344 if (!unmap_success) 1359 1345 pr_err("Memory failure: %#lx: failed to unmap page (mapcount=%d)\n", 1360 1346 pfn, page_mapcount(hpage)); ··· 1458 1446 } 1459 1447 unlock_page(head); 1460 1448 res = MF_FAILED; 1461 - if (!dissolve_free_huge_page(p) && take_page_off_buddy(p)) { 1449 + if (__page_handle_poison(p)) { 1462 1450 page_ref_inc(p); 1463 1451 res = MF_RECOVERED; 1464 1452 }
+182 -53
mm/memory.c
··· 699 699 } 700 700 #endif 701 701 702 + static void restore_exclusive_pte(struct vm_area_struct *vma, 703 + struct page *page, unsigned long address, 704 + pte_t *ptep) 705 + { 706 + pte_t pte; 707 + swp_entry_t entry; 708 + 709 + pte = pte_mkold(mk_pte(page, READ_ONCE(vma->vm_page_prot))); 710 + if (pte_swp_soft_dirty(*ptep)) 711 + pte = pte_mksoft_dirty(pte); 712 + 713 + entry = pte_to_swp_entry(*ptep); 714 + if (pte_swp_uffd_wp(*ptep)) 715 + pte = pte_mkuffd_wp(pte); 716 + else if (is_writable_device_exclusive_entry(entry)) 717 + pte = maybe_mkwrite(pte_mkdirty(pte), vma); 718 + 719 + set_pte_at(vma->vm_mm, address, ptep, pte); 720 + 721 + /* 722 + * No need to take a page reference as one was already 723 + * created when the swap entry was made. 724 + */ 725 + if (PageAnon(page)) 726 + page_add_anon_rmap(page, vma, address, false); 727 + else 728 + /* 729 + * Currently device exclusive access only supports anonymous 730 + * memory so the entry shouldn't point to a filebacked page. 731 + */ 732 + WARN_ON_ONCE(!PageAnon(page)); 733 + 734 + if (vma->vm_flags & VM_LOCKED) 735 + mlock_vma_page(page); 736 + 737 + /* 738 + * No need to invalidate - it was non-present before. However 739 + * secondary CPUs may have mappings that need invalidating. 740 + */ 741 + update_mmu_cache(vma, address, ptep); 742 + } 743 + 744 + /* 745 + * Tries to restore an exclusive pte if the page lock can be acquired without 746 + * sleeping. 747 + */ 748 + static int 749 + try_restore_exclusive_pte(pte_t *src_pte, struct vm_area_struct *vma, 750 + unsigned long addr) 751 + { 752 + swp_entry_t entry = pte_to_swp_entry(*src_pte); 753 + struct page *page = pfn_swap_entry_to_page(entry); 754 + 755 + if (trylock_page(page)) { 756 + restore_exclusive_pte(vma, page, addr, src_pte); 757 + unlock_page(page); 758 + return 0; 759 + } 760 + 761 + return -EBUSY; 762 + } 763 + 702 764 /* 703 765 * copy one vm_area from one task to the other. 
Assumes the page tables 704 766 * already present in the new task to be cleared in the whole range ··· 769 707 770 708 static unsigned long 771 709 copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, 772 - pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma, 773 - unsigned long addr, int *rss) 710 + pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *dst_vma, 711 + struct vm_area_struct *src_vma, unsigned long addr, int *rss) 774 712 { 775 - unsigned long vm_flags = vma->vm_flags; 713 + unsigned long vm_flags = dst_vma->vm_flags; 776 714 pte_t pte = *src_pte; 777 715 struct page *page; 778 716 swp_entry_t entry = pte_to_swp_entry(pte); 779 717 780 718 if (likely(!non_swap_entry(entry))) { 781 719 if (swap_duplicate(entry) < 0) 782 - return entry.val; 720 + return -EIO; 783 721 784 722 /* make sure dst_mm is on swapoff's mmlist. */ 785 723 if (unlikely(list_empty(&dst_mm->mmlist))) { ··· 791 729 } 792 730 rss[MM_SWAPENTS]++; 793 731 } else if (is_migration_entry(entry)) { 794 - page = migration_entry_to_page(entry); 732 + page = pfn_swap_entry_to_page(entry); 795 733 796 734 rss[mm_counter(page)]++; 797 735 798 - if (is_write_migration_entry(entry) && 736 + if (is_writable_migration_entry(entry) && 799 737 is_cow_mapping(vm_flags)) { 800 738 /* 801 739 * COW mappings require pages in both 802 740 * parent and child to be set to read. 
803 741 */ 804 - make_migration_entry_read(&entry); 742 + entry = make_readable_migration_entry( 743 + swp_offset(entry)); 805 744 pte = swp_entry_to_pte(entry); 806 745 if (pte_swp_soft_dirty(*src_pte)) 807 746 pte = pte_swp_mksoft_dirty(pte); ··· 811 748 set_pte_at(src_mm, addr, src_pte, pte); 812 749 } 813 750 } else if (is_device_private_entry(entry)) { 814 - page = device_private_entry_to_page(entry); 751 + page = pfn_swap_entry_to_page(entry); 815 752 816 753 /* 817 754 * Update rss count even for unaddressable pages, as ··· 833 770 * when a device driver is involved (you cannot easily 834 771 * save and restore device driver state). 835 772 */ 836 - if (is_write_device_private_entry(entry) && 773 + if (is_writable_device_private_entry(entry) && 837 774 is_cow_mapping(vm_flags)) { 838 - make_device_private_entry_read(&entry); 775 + entry = make_readable_device_private_entry( 776 + swp_offset(entry)); 839 777 pte = swp_entry_to_pte(entry); 840 778 if (pte_swp_uffd_wp(*src_pte)) 841 779 pte = pte_swp_mkuffd_wp(pte); 842 780 set_pte_at(src_mm, addr, src_pte, pte); 843 781 } 782 + } else if (is_device_exclusive_entry(entry)) { 783 + /* 784 + * Make device exclusive entries present by restoring the 785 + * original entry then copying as for a present pte. Device 786 + * exclusive entries currently only support private writable 787 + * (ie. COW) mappings. 
788 + */ 789 + VM_BUG_ON(!is_cow_mapping(src_vma->vm_flags)); 790 + if (try_restore_exclusive_pte(src_pte, src_vma, addr)) 791 + return -EBUSY; 792 + return -ENOENT; 844 793 } 794 + if (!userfaultfd_wp(dst_vma)) 795 + pte = pte_swp_clear_uffd_wp(pte); 845 796 set_pte_at(dst_mm, addr, dst_pte, pte); 846 797 return 0; 847 798 } ··· 921 844 /* All done, just insert the new page copy in the child */ 922 845 pte = mk_pte(new_page, dst_vma->vm_page_prot); 923 846 pte = maybe_mkwrite(pte_mkdirty(pte), dst_vma); 847 + if (userfaultfd_pte_wp(dst_vma, *src_pte)) 848 + /* Uffd-wp needs to be delivered to dest pte as well */ 849 + pte = pte_wrprotect(pte_mkuffd_wp(pte)); 924 850 set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte); 925 851 return 0; 926 852 } ··· 973 893 pte = pte_mkclean(pte); 974 894 pte = pte_mkold(pte); 975 895 976 - /* 977 - * Make sure the _PAGE_UFFD_WP bit is cleared if the new VMA 978 - * does not have the VM_UFFD_WP, which means that the uffd 979 - * fork event is not enabled. 980 - */ 981 - if (!(vm_flags & VM_UFFD_WP)) 896 + if (!userfaultfd_wp(dst_vma)) 982 897 pte = pte_clear_uffd_wp(pte); 983 898 984 899 set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte); ··· 1046 971 continue; 1047 972 } 1048 973 if (unlikely(!pte_present(*src_pte))) { 1049 - entry.val = copy_nonpresent_pte(dst_mm, src_mm, 1050 - dst_pte, src_pte, 1051 - src_vma, addr, rss); 1052 - if (entry.val) 974 + ret = copy_nonpresent_pte(dst_mm, src_mm, 975 + dst_pte, src_pte, 976 + dst_vma, src_vma, 977 + addr, rss); 978 + if (ret == -EIO) { 979 + entry = pte_to_swp_entry(*src_pte); 1053 980 break; 1054 - progress += 8; 1055 - continue; 981 + } else if (ret == -EBUSY) { 982 + break; 983 + } else if (!ret) { 984 + progress += 8; 985 + continue; 986 + } 987 + 988 + /* 989 + * Device exclusive entry restored, continue by copying 990 + * the now present pte. 
991 + */ 992 + WARN_ON_ONCE(ret != -ENOENT); 1056 993 } 1057 994 /* copy_present_pte() will clear `*prealloc' if consumed */ 1058 995 ret = copy_present_pte(dst_vma, src_vma, dst_pte, src_pte, ··· 1095 1008 pte_unmap_unlock(orig_dst_pte, dst_ptl); 1096 1009 cond_resched(); 1097 1010 1098 - if (entry.val) { 1011 + if (ret == -EIO) { 1012 + VM_WARN_ON_ONCE(!entry.val); 1099 1013 if (add_swap_count_continuation(entry, GFP_KERNEL) < 0) { 1100 1014 ret = -ENOMEM; 1101 1015 goto out; 1102 1016 } 1103 1017 entry.val = 0; 1104 - } else if (ret) { 1105 - WARN_ON_ONCE(ret != -EAGAIN); 1018 + } else if (ret == -EBUSY) { 1019 + goto out; 1020 + } else if (ret == -EAGAIN) { 1106 1021 prealloc = page_copy_prealloc(src_mm, src_vma, addr); 1107 1022 if (!prealloc) 1108 1023 return -ENOMEM; 1109 - /* We've captured and resolved the error. Reset, try again. */ 1110 - ret = 0; 1024 + } else if (ret) { 1025 + VM_WARN_ON_ONCE(1); 1111 1026 } 1027 + 1028 + /* We've captured and resolved the error. Reset, try again. 
*/ 1029 + ret = 0; 1030 + 1112 1031 if (addr != end) 1113 1032 goto again; 1114 1033 out: ··· 1143 1050 || pmd_devmap(*src_pmd)) { 1144 1051 int err; 1145 1052 VM_BUG_ON_VMA(next-addr != HPAGE_PMD_SIZE, src_vma); 1146 - err = copy_huge_pmd(dst_mm, src_mm, 1147 - dst_pmd, src_pmd, addr, src_vma); 1053 + err = copy_huge_pmd(dst_mm, src_mm, dst_pmd, src_pmd, 1054 + addr, dst_vma, src_vma); 1148 1055 if (err == -ENOMEM) 1149 1056 return -ENOMEM; 1150 1057 if (!err) ··· 1371 1278 } 1372 1279 1373 1280 entry = pte_to_swp_entry(ptent); 1374 - if (is_device_private_entry(entry)) { 1375 - struct page *page = device_private_entry_to_page(entry); 1281 + if (is_device_private_entry(entry) || 1282 + is_device_exclusive_entry(entry)) { 1283 + struct page *page = pfn_swap_entry_to_page(entry); 1376 1284 1377 1285 if (unlikely(details && details->check_mapping)) { 1378 1286 /* ··· 1388 1294 1389 1295 pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); 1390 1296 rss[mm_counter(page)]--; 1391 - page_remove_rmap(page, false); 1297 + 1298 + if (is_device_private_entry(entry)) 1299 + page_remove_rmap(page, false); 1300 + 1392 1301 put_page(page); 1393 1302 continue; 1394 1303 } ··· 1405 1308 else if (is_migration_entry(entry)) { 1406 1309 struct page *page; 1407 1310 1408 - page = migration_entry_to_page(entry); 1311 + page = pfn_swap_entry_to_page(entry); 1409 1312 rss[mm_counter(page)]--; 1410 1313 } 1411 1314 if (unlikely(!free_swap_and_cache(entry))) ··· 3440 3343 EXPORT_SYMBOL(unmap_mapping_range); 3441 3344 3442 3345 /* 3346 + * Restore a potential device exclusive pte to a working pte entry 3347 + */ 3348 + static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf) 3349 + { 3350 + struct page *page = vmf->page; 3351 + struct vm_area_struct *vma = vmf->vma; 3352 + struct mmu_notifier_range range; 3353 + 3354 + if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) 3355 + return VM_FAULT_RETRY; 3356 + mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0, 
vma, 3357 + vma->vm_mm, vmf->address & PAGE_MASK, 3358 + (vmf->address & PAGE_MASK) + PAGE_SIZE, NULL); 3359 + mmu_notifier_invalidate_range_start(&range); 3360 + 3361 + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, 3362 + &vmf->ptl); 3363 + if (likely(pte_same(*vmf->pte, vmf->orig_pte))) 3364 + restore_exclusive_pte(vma, page, vmf->address, vmf->pte); 3365 + 3366 + pte_unmap_unlock(vmf->pte, vmf->ptl); 3367 + unlock_page(page); 3368 + 3369 + mmu_notifier_invalidate_range_end(&range); 3370 + return 0; 3371 + } 3372 + 3373 + /* 3443 3374 * We enter with non-exclusive mmap_lock (to exclude vma changes, 3444 3375 * but allow concurrent faults), and pte mapped but not yet locked. 3445 3376 * We return with pte unmapped and unlocked. ··· 3495 3370 if (is_migration_entry(entry)) { 3496 3371 migration_entry_wait(vma->vm_mm, vmf->pmd, 3497 3372 vmf->address); 3373 + } else if (is_device_exclusive_entry(entry)) { 3374 + vmf->page = pfn_swap_entry_to_page(entry); 3375 + ret = remove_device_exclusive_entry(vmf); 3498 3376 } else if (is_device_private_entry(entry)) { 3499 - vmf->page = device_private_entry_to_page(entry); 3377 + vmf->page = pfn_swap_entry_to_page(entry); 3500 3378 ret = vmf->page->pgmap->ops->migrate_to_ram(vmf); 3501 3379 } else if (is_hwpoison_entry(entry)) { 3502 3380 ret = VM_FAULT_HWPOISON; ··· 4153 4025 * something). 
4154 4026 */ 4155 4027 if (vma->vm_ops->map_pages && fault_around_bytes >> PAGE_SHIFT > 1) { 4156 - ret = do_fault_around(vmf); 4157 - if (ret) 4158 - return ret; 4028 + if (likely(!userfaultfd_minor(vmf->vma))) { 4029 + ret = do_fault_around(vmf); 4030 + if (ret) 4031 + return ret; 4032 + } 4159 4033 } 4160 4034 4161 4035 ret = __do_fault(vmf); ··· 4302 4172 return ret; 4303 4173 } 4304 4174 4305 - static int numa_migrate_prep(struct page *page, struct vm_area_struct *vma, 4306 - unsigned long addr, int page_nid, 4307 - int *flags) 4175 + int numa_migrate_prep(struct page *page, struct vm_area_struct *vma, 4176 + unsigned long addr, int page_nid, int *flags) 4308 4177 { 4309 4178 get_page(page); 4310 4179 ··· 4424 4295 } 4425 4296 4426 4297 /* `inline' is required to avoid gcc 4.1.2 build error */ 4427 - static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf, pmd_t orig_pmd) 4298 + static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf) 4428 4299 { 4429 4300 if (vma_is_anonymous(vmf->vma)) { 4430 - if (userfaultfd_huge_pmd_wp(vmf->vma, orig_pmd)) 4301 + if (userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd)) 4431 4302 return handle_userfault(vmf, VM_UFFD_WP); 4432 - return do_huge_pmd_wp_page(vmf, orig_pmd); 4303 + return do_huge_pmd_wp_page(vmf); 4433 4304 } 4434 4305 if (vmf->vma->vm_ops->huge_fault) { 4435 4306 vm_fault_t ret = vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PMD); ··· 4656 4527 if (!(ret & VM_FAULT_FALLBACK)) 4657 4528 return ret; 4658 4529 } else { 4659 - pmd_t orig_pmd = *vmf.pmd; 4530 + vmf.orig_pmd = *vmf.pmd; 4660 4531 4661 4532 barrier(); 4662 - if (unlikely(is_swap_pmd(orig_pmd))) { 4533 + if (unlikely(is_swap_pmd(vmf.orig_pmd))) { 4663 4534 VM_BUG_ON(thp_migration_supported() && 4664 - !is_pmd_migration_entry(orig_pmd)); 4665 - if (is_pmd_migration_entry(orig_pmd)) 4535 + !is_pmd_migration_entry(vmf.orig_pmd)); 4536 + if (is_pmd_migration_entry(vmf.orig_pmd)) 4666 4537 pmd_migration_entry_wait(mm, vmf.pmd); 4667 4538 return 0; 4668 4539 } 
4669 - if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) { 4670 - if (pmd_protnone(orig_pmd) && vma_is_accessible(vma)) 4671 - return do_huge_pmd_numa_page(&vmf, orig_pmd); 4540 + if (pmd_trans_huge(vmf.orig_pmd) || pmd_devmap(vmf.orig_pmd)) { 4541 + if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma)) 4542 + return do_huge_pmd_numa_page(&vmf); 4672 4543 4673 - if (dirty && !pmd_write(orig_pmd)) { 4674 - ret = wp_huge_pmd(&vmf, orig_pmd); 4544 + if (dirty && !pmd_write(vmf.orig_pmd)) { 4545 + ret = wp_huge_pmd(&vmf); 4675 4546 if (!(ret & VM_FAULT_FALLBACK)) 4676 4547 return ret; 4677 4548 } else { 4678 - huge_pmd_set_accessed(&vmf, orig_pmd); 4549 + huge_pmd_set_accessed(&vmf); 4679 4550 return 0; 4680 4551 } 4681 4552 }
+18 -141
mm/memory_hotplug.c
··· 154 154 } 155 155 156 156 #ifdef CONFIG_MEMORY_HOTPLUG_SPARSE 157 - void get_page_bootmem(unsigned long info, struct page *page, 158 - unsigned long type) 159 - { 160 - page->freelist = (void *)type; 161 - SetPagePrivate(page); 162 - set_page_private(page, info); 163 - page_ref_inc(page); 164 - } 165 - 166 - void put_page_bootmem(struct page *page) 167 - { 168 - unsigned long type; 169 - 170 - type = (unsigned long) page->freelist; 171 - BUG_ON(type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE || 172 - type > MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE); 173 - 174 - if (page_ref_dec_return(page) == 1) { 175 - page->freelist = NULL; 176 - ClearPagePrivate(page); 177 - set_page_private(page, 0); 178 - INIT_LIST_HEAD(&page->lru); 179 - free_reserved_page(page); 180 - } 181 - } 182 - 183 - #ifdef CONFIG_HAVE_BOOTMEM_INFO_NODE 184 - #ifndef CONFIG_SPARSEMEM_VMEMMAP 185 - static void register_page_bootmem_info_section(unsigned long start_pfn) 186 - { 187 - unsigned long mapsize, section_nr, i; 188 - struct mem_section *ms; 189 - struct page *page, *memmap; 190 - struct mem_section_usage *usage; 191 - 192 - section_nr = pfn_to_section_nr(start_pfn); 193 - ms = __nr_to_section(section_nr); 194 - 195 - /* Get section's memmap address */ 196 - memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr); 197 - 198 - /* 199 - * Get page for the memmap's phys address 200 - * XXX: need more consideration for sparse_vmemmap... 
201 - */ 202 - page = virt_to_page(memmap); 203 - mapsize = sizeof(struct page) * PAGES_PER_SECTION; 204 - mapsize = PAGE_ALIGN(mapsize) >> PAGE_SHIFT; 205 - 206 - /* remember memmap's page */ 207 - for (i = 0; i < mapsize; i++, page++) 208 - get_page_bootmem(section_nr, page, SECTION_INFO); 209 - 210 - usage = ms->usage; 211 - page = virt_to_page(usage); 212 - 213 - mapsize = PAGE_ALIGN(mem_section_usage_size()) >> PAGE_SHIFT; 214 - 215 - for (i = 0; i < mapsize; i++, page++) 216 - get_page_bootmem(section_nr, page, MIX_SECTION_INFO); 217 - 218 - } 219 - #else /* CONFIG_SPARSEMEM_VMEMMAP */ 220 - static void register_page_bootmem_info_section(unsigned long start_pfn) 221 - { 222 - unsigned long mapsize, section_nr, i; 223 - struct mem_section *ms; 224 - struct page *page, *memmap; 225 - struct mem_section_usage *usage; 226 - 227 - section_nr = pfn_to_section_nr(start_pfn); 228 - ms = __nr_to_section(section_nr); 229 - 230 - memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr); 231 - 232 - register_page_bootmem_memmap(section_nr, memmap, PAGES_PER_SECTION); 233 - 234 - usage = ms->usage; 235 - page = virt_to_page(usage); 236 - 237 - mapsize = PAGE_ALIGN(mem_section_usage_size()) >> PAGE_SHIFT; 238 - 239 - for (i = 0; i < mapsize; i++, page++) 240 - get_page_bootmem(section_nr, page, MIX_SECTION_INFO); 241 - } 242 - #endif /* !CONFIG_SPARSEMEM_VMEMMAP */ 243 - 244 - void __init register_page_bootmem_info_node(struct pglist_data *pgdat) 245 - { 246 - unsigned long i, pfn, end_pfn, nr_pages; 247 - int node = pgdat->node_id; 248 - struct page *page; 249 - 250 - nr_pages = PAGE_ALIGN(sizeof(struct pglist_data)) >> PAGE_SHIFT; 251 - page = virt_to_page(pgdat); 252 - 253 - for (i = 0; i < nr_pages; i++, page++) 254 - get_page_bootmem(node, page, NODE_INFO); 255 - 256 - pfn = pgdat->node_start_pfn; 257 - end_pfn = pgdat_end_pfn(pgdat); 258 - 259 - /* register section info */ 260 - for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) { 261 - /* 262 - * Some platforms 
can assign the same pfn to multiple nodes - on 263 - * node0 as well as nodeN. To avoid registering a pfn against 264 - * multiple nodes we check that this pfn does not already 265 - * reside in some other nodes. 266 - */ 267 - if (pfn_valid(pfn) && (early_pfn_to_nid(pfn) == node)) 268 - register_page_bootmem_info_section(pfn); 269 - } 270 - } 271 - #endif /* CONFIG_HAVE_BOOTMEM_INFO_NODE */ 272 - 273 157 static int check_pfn_span(unsigned long pfn, unsigned long nr_pages, 274 158 const char *reason) 275 159 { ··· 329 445 unsigned long pfn; 330 446 int nid = zone_to_nid(zone); 331 447 332 - zone_span_writelock(zone); 333 448 if (zone->zone_start_pfn == start_pfn) { 334 449 /* 335 450 * If the section is smallest section in the zone, it need ··· 361 478 zone->spanned_pages = 0; 362 479 } 363 480 } 364 - zone_span_writeunlock(zone); 365 481 } 366 482 367 483 static void update_pgdat_span(struct pglist_data *pgdat) ··· 397 515 { 398 516 const unsigned long end_pfn = start_pfn + nr_pages; 399 517 struct pglist_data *pgdat = zone->zone_pgdat; 400 - unsigned long pfn, cur_nr_pages, flags; 518 + unsigned long pfn, cur_nr_pages; 401 519 402 520 /* Poison struct pages because they are now uninitialized again. */ 403 521 for (pfn = start_pfn; pfn < end_pfn; pfn += cur_nr_pages) { ··· 422 540 423 541 clear_zone_contiguous(zone); 424 542 425 - pgdat_resize_lock(zone->zone_pgdat, &flags); 426 543 shrink_zone_span(zone, start_pfn, start_pfn + nr_pages); 427 544 update_pgdat_span(pgdat); 428 - pgdat_resize_unlock(zone->zone_pgdat, &flags); 429 545 430 546 set_zone_contiguous(zone); 431 547 } ··· 630 750 { 631 751 struct pglist_data *pgdat = zone->zone_pgdat; 632 752 int nid = pgdat->node_id; 633 - unsigned long flags; 634 753 635 754 clear_zone_contiguous(zone); 636 755 637 - /* TODO Huh pgdat is irqsave while zone is not. 
It used to be like that before */ 638 - pgdat_resize_lock(pgdat, &flags); 639 - zone_span_writelock(zone); 640 756 if (zone_is_empty(zone)) 641 757 init_currently_empty_zone(zone, start_pfn, nr_pages); 642 758 resize_zone_range(zone, start_pfn, nr_pages); 643 - zone_span_writeunlock(zone); 644 759 resize_pgdat_range(pgdat, start_pfn, nr_pages); 645 - pgdat_resize_unlock(pgdat, &flags); 646 760 647 761 /* 648 762 * Subsection population requires care in pfn_to_online_page(). ··· 726 852 */ 727 853 void adjust_present_page_count(struct zone *zone, long nr_pages) 728 854 { 729 - unsigned long flags; 730 - 731 855 zone->present_pages += nr_pages; 732 - pgdat_resize_lock(zone->zone_pgdat, &flags); 733 856 zone->zone_pgdat->node_present_pages += nr_pages; 734 - pgdat_resize_unlock(zone->zone_pgdat, &flags); 735 857 } 736 858 737 859 int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages, ··· 783 913 784 914 /* 785 915 * {on,off}lining is constrained to full memory sections (or more 786 - * precisly to memory blocks from the user space POV). 916 + * precisely to memory blocks from the user space POV). 787 917 * memmap_on_memory is an exception because it reserves initial part 788 918 * of the physical memory space for vmemmaps. That space is pageblock 789 919 * aligned. ··· 942 1072 } 943 1073 944 1074 945 - /** 946 - * try_online_node - online a node if offlined 1075 + /* 1076 + * __try_online_node - online a node if offlined 947 1077 * @nid: the node ID 948 1078 * @set_node_online: Whether we want to online the node 949 1079 * called by cpu_up() to online a node without onlined memory. ··· 1042 1172 * populate a single PMD. 
1043 1173 */ 1044 1174 return memmap_on_memory && 1175 + !hugetlb_free_vmemmap_enabled && 1045 1176 IS_ENABLED(CONFIG_MHP_MEMMAP_ON_MEMORY) && 1046 1177 size == memory_block_size_bytes() && 1047 1178 IS_ALIGNED(vmemmap_size, PMD_SIZE) && ··· 1392 1521 struct page *page, *head; 1393 1522 int ret = 0; 1394 1523 LIST_HEAD(source); 1524 + static DEFINE_RATELIMIT_STATE(migrate_rs, DEFAULT_RATELIMIT_INTERVAL, 1525 + DEFAULT_RATELIMIT_BURST); 1395 1526 1396 1527 for (pfn = start_pfn; pfn < end_pfn; pfn++) { 1397 1528 if (!pfn_valid(pfn)) ··· 1440 1567 page_is_file_lru(page)); 1441 1568 1442 1569 } else { 1443 - pr_warn("failed to isolate pfn %lx\n", pfn); 1444 - dump_page(page, "isolation failed"); 1570 + if (__ratelimit(&migrate_rs)) { 1571 + pr_warn("failed to isolate pfn %lx\n", pfn); 1572 + dump_page(page, "isolation failed"); 1573 + } 1445 1574 } 1446 1575 put_page(page); 1447 1576 } ··· 1472 1597 (unsigned long)&mtc, MIGRATE_SYNC, MR_MEMORY_HOTPLUG); 1473 1598 if (ret) { 1474 1599 list_for_each_entry(page, &source, lru) { 1475 - pr_warn("migrating pfn %lx failed ret:%d ", 1476 - page_to_pfn(page), ret); 1477 - dump_page(page, "migration failure"); 1600 + if (__ratelimit(&migrate_rs)) { 1601 + pr_warn("migrating pfn %lx failed ret:%d\n", 1602 + page_to_pfn(page), ret); 1603 + dump_page(page, "migration failure"); 1604 + } 1478 1605 } 1479 1606 putback_movable_pages(&source); 1480 1607 } ··· 1580 1703 1581 1704 /* 1582 1705 * {on,off}lining is constrained to full memory sections (or more 1583 - * precisly to memory blocks from the user space POV). 1706 + * precisely to memory blocks from the user space POV). 1584 1707 * memmap_on_memory is an exception because it reserves initial part 1585 1708 * of the physical memory space for vmemmaps. That space is pageblock 1586 1709 * aligned. 
··· 1908 2031 } 1909 2032 1910 2033 /** 1911 - * remove_memory 2034 + * __remove_memory - Remove memory if every memory block is offline 1912 2035 * @nid: the node ID 1913 2036 * @start: physical address of the region to remove 1914 2037 * @size: size of the region to remove
+136 -171
mm/mempolicy.c
··· 121 121 */ 122 122 static struct mempolicy default_policy = { 123 123 .refcnt = ATOMIC_INIT(1), /* never free it */ 124 - .mode = MPOL_PREFERRED, 125 - .flags = MPOL_F_LOCAL, 124 + .mode = MPOL_LOCAL, 126 125 }; 127 126 128 127 static struct mempolicy preferred_node_policy[MAX_NUMNODES]; ··· 193 194 { 194 195 if (nodes_empty(*nodes)) 195 196 return -EINVAL; 196 - pol->v.nodes = *nodes; 197 + pol->nodes = *nodes; 197 198 return 0; 198 199 } 199 200 200 201 static int mpol_new_preferred(struct mempolicy *pol, const nodemask_t *nodes) 201 202 { 202 - if (!nodes) 203 - pol->flags |= MPOL_F_LOCAL; /* local allocation */ 204 - else if (nodes_empty(*nodes)) 205 - return -EINVAL; /* no allowed nodes */ 206 - else 207 - pol->v.preferred_node = first_node(*nodes); 203 + if (nodes_empty(*nodes)) 204 + return -EINVAL; 205 + 206 + nodes_clear(pol->nodes); 207 + node_set(first_node(*nodes), pol->nodes); 208 208 return 0; 209 209 } 210 210 ··· 211 213 { 212 214 if (nodes_empty(*nodes)) 213 215 return -EINVAL; 214 - pol->v.nodes = *nodes; 216 + pol->nodes = *nodes; 215 217 return 0; 216 218 } 217 219 218 220 /* 219 221 * mpol_set_nodemask is called after mpol_new() to set up the nodemask, if 220 222 * any, for the new policy. mpol_new() has already validated the nodes 221 - * parameter with respect to the policy mode and flags. But, we need to 222 - * handle an empty nodemask with MPOL_PREFERRED here. 223 + * parameter with respect to the policy mode and flags. 223 224 * 224 225 * Must be called holding task's alloc_lock to protect task's mems_allowed 225 226 * and mempolicy. May also be called holding the mmap_lock for write. ··· 228 231 { 229 232 int ret; 230 233 231 - /* if mode is MPOL_DEFAULT, pol is NULL. This is right. */ 232 - if (pol == NULL) 234 + /* 235 + * Default (pol==NULL) resp. local memory policies are not a 236 + * subject of any remapping. They also do not need any special 237 + * constructor. 
238 + */ 239 + if (!pol || pol->mode == MPOL_LOCAL) 233 240 return 0; 241 + 234 242 /* Check N_MEMORY */ 235 243 nodes_and(nsc->mask1, 236 244 cpuset_current_mems_allowed, node_states[N_MEMORY]); 237 245 238 246 VM_BUG_ON(!nodes); 239 - if (pol->mode == MPOL_PREFERRED && nodes_empty(*nodes)) 240 - nodes = NULL; /* explicit local allocation */ 241 - else { 242 - if (pol->flags & MPOL_F_RELATIVE_NODES) 243 - mpol_relative_nodemask(&nsc->mask2, nodes, &nsc->mask1); 244 - else 245 - nodes_and(nsc->mask2, *nodes, nsc->mask1); 246 247 247 - if (mpol_store_user_nodemask(pol)) 248 - pol->w.user_nodemask = *nodes; 249 - else 250 - pol->w.cpuset_mems_allowed = 251 - cpuset_current_mems_allowed; 252 - } 253 - 254 - if (nodes) 255 - ret = mpol_ops[pol->mode].create(pol, &nsc->mask2); 248 + if (pol->flags & MPOL_F_RELATIVE_NODES) 249 + mpol_relative_nodemask(&nsc->mask2, nodes, &nsc->mask1); 256 250 else 257 - ret = mpol_ops[pol->mode].create(pol, NULL); 251 + nodes_and(nsc->mask2, *nodes, nsc->mask1); 252 + 253 + if (mpol_store_user_nodemask(pol)) 254 + pol->w.user_nodemask = *nodes; 255 + else 256 + pol->w.cpuset_mems_allowed = cpuset_current_mems_allowed; 257 + 258 + ret = mpol_ops[pol->mode].create(pol, &nsc->mask2); 258 259 return ret; 259 260 } 260 261 ··· 285 290 if (((flags & MPOL_F_STATIC_NODES) || 286 291 (flags & MPOL_F_RELATIVE_NODES))) 287 292 return ERR_PTR(-EINVAL); 293 + 294 + mode = MPOL_LOCAL; 288 295 } 289 296 } else if (mode == MPOL_LOCAL) { 290 297 if (!nodes_empty(*nodes) || 291 298 (flags & MPOL_F_STATIC_NODES) || 292 299 (flags & MPOL_F_RELATIVE_NODES)) 293 300 return ERR_PTR(-EINVAL); 294 - mode = MPOL_PREFERRED; 295 301 } else if (nodes_empty(*nodes)) 296 302 return ERR_PTR(-EINVAL); 297 303 policy = kmem_cache_alloc(policy_cache, GFP_KERNEL); ··· 326 330 else if (pol->flags & MPOL_F_RELATIVE_NODES) 327 331 mpol_relative_nodemask(&tmp, &pol->w.user_nodemask, nodes); 328 332 else { 329 - nodes_remap(tmp, pol->v.nodes, pol->w.cpuset_mems_allowed, 333 + 
nodes_remap(tmp, pol->nodes, pol->w.cpuset_mems_allowed, 330 334 *nodes); 331 335 pol->w.cpuset_mems_allowed = *nodes; 332 336 } ··· 334 338 if (nodes_empty(tmp)) 335 339 tmp = *nodes; 336 340 337 - pol->v.nodes = tmp; 341 + pol->nodes = tmp; 338 342 } 339 343 340 344 static void mpol_rebind_preferred(struct mempolicy *pol, 341 345 const nodemask_t *nodes) 342 346 { 343 - nodemask_t tmp; 344 - 345 - if (pol->flags & MPOL_F_STATIC_NODES) { 346 - int node = first_node(pol->w.user_nodemask); 347 - 348 - if (node_isset(node, *nodes)) { 349 - pol->v.preferred_node = node; 350 - pol->flags &= ~MPOL_F_LOCAL; 351 - } else 352 - pol->flags |= MPOL_F_LOCAL; 353 - } else if (pol->flags & MPOL_F_RELATIVE_NODES) { 354 - mpol_relative_nodemask(&tmp, &pol->w.user_nodemask, nodes); 355 - pol->v.preferred_node = first_node(tmp); 356 - } else if (!(pol->flags & MPOL_F_LOCAL)) { 357 - pol->v.preferred_node = node_remap(pol->v.preferred_node, 358 - pol->w.cpuset_mems_allowed, 359 - *nodes); 360 - pol->w.cpuset_mems_allowed = *nodes; 361 - } 347 + pol->w.cpuset_mems_allowed = *nodes; 362 348 } 363 349 364 350 /* ··· 354 376 { 355 377 if (!pol) 356 378 return; 357 - if (!mpol_store_user_nodemask(pol) && !(pol->flags & MPOL_F_LOCAL) && 379 + if (!mpol_store_user_nodemask(pol) && 358 380 nodes_equal(pol->w.cpuset_mems_allowed, *newmask)) 359 381 return; 360 382 ··· 405 427 .create = mpol_new_bind, 406 428 .rebind = mpol_rebind_nodemask, 407 429 }, 430 + [MPOL_LOCAL] = { 431 + .rebind = mpol_rebind_default, 432 + }, 408 433 }; 409 434 410 435 static int migrate_page_add(struct page *page, struct list_head *pagelist, ··· 439 458 440 459 /* 441 460 * queue_pages_pmd() has four possible return values: 442 - * 0 - pages are placed on the right node or queued successfully. 461 + * 0 - pages are placed on the right node or queued successfully, or 462 + * special page is met, i.e. huge zero page. 443 463 * 1 - there is unmovable page, and MPOL_MF_MOVE* & MPOL_MF_STRICT were 444 464 * specified. 
445 465 * 2 - THP was split. ··· 464 482 page = pmd_page(*pmd); 465 483 if (is_huge_zero_page(page)) { 466 484 spin_unlock(ptl); 467 - __split_huge_pmd(walk->vma, pmd, addr, false, NULL); 468 - ret = 2; 485 + walk->action = ACTION_CONTINUE; 469 486 goto out; 470 487 } 471 488 if (!queue_pages_required(page, qp)) ··· 491 510 * and move them to the pagelist if they do. 492 511 * 493 512 * queue_pages_pte_range() has three possible return values: 494 - * 0 - pages are placed on the right node or queued successfully. 513 + * 0 - pages are placed on the right node or queued successfully, or 514 + * special page is met, i.e. zero page. 495 515 * 1 - there is unmovable page, and MPOL_MF_MOVE* & MPOL_MF_STRICT were 496 516 * specified. 497 517 * -EIO - only MPOL_MF_STRICT was specified and an existing page was already ··· 899 917 switch (p->mode) { 900 918 case MPOL_BIND: 901 919 case MPOL_INTERLEAVE: 902 - *nodes = p->v.nodes; 903 - break; 904 920 case MPOL_PREFERRED: 905 - if (!(p->flags & MPOL_F_LOCAL)) 906 - node_set(p->v.preferred_node, *nodes); 907 - /* else return empty node mask for local allocation */ 921 + *nodes = p->nodes; 922 + break; 923 + case MPOL_LOCAL: 924 + /* return empty node mask for local allocation */ 908 925 break; 909 926 default: 910 927 BUG(); ··· 988 1007 *policy = err; 989 1008 } else if (pol == current->mempolicy && 990 1009 pol->mode == MPOL_INTERLEAVE) { 991 - *policy = next_node_in(current->il_prev, pol->v.nodes); 1010 + *policy = next_node_in(current->il_prev, pol->nodes); 992 1011 } else { 993 1012 err = -EINVAL; 994 1013 goto out; ··· 1441 1460 return copy_to_user(mask, nodes_addr(*nodes), copy) ? 
-EFAULT : 0; 1442 1461 } 1443 1462 1463 + /* Basic parameter sanity check used by both mbind() and set_mempolicy() */ 1464 + static inline int sanitize_mpol_flags(int *mode, unsigned short *flags) 1465 + { 1466 + *flags = *mode & MPOL_MODE_FLAGS; 1467 + *mode &= ~MPOL_MODE_FLAGS; 1468 + if ((unsigned int)(*mode) >= MPOL_MAX) 1469 + return -EINVAL; 1470 + if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES)) 1471 + return -EINVAL; 1472 + 1473 + return 0; 1474 + } 1475 + 1444 1476 static long kernel_mbind(unsigned long start, unsigned long len, 1445 1477 unsigned long mode, const unsigned long __user *nmask, 1446 1478 unsigned long maxnode, unsigned int flags) 1447 1479 { 1448 - nodemask_t nodes; 1449 - int err; 1450 1480 unsigned short mode_flags; 1481 + nodemask_t nodes; 1482 + int lmode = mode; 1483 + int err; 1451 1484 1452 1485 start = untagged_addr(start); 1453 - mode_flags = mode & MPOL_MODE_FLAGS; 1454 - mode &= ~MPOL_MODE_FLAGS; 1455 - if (mode >= MPOL_MAX) 1456 - return -EINVAL; 1457 - if ((mode_flags & MPOL_F_STATIC_NODES) && 1458 - (mode_flags & MPOL_F_RELATIVE_NODES)) 1459 - return -EINVAL; 1486 + err = sanitize_mpol_flags(&lmode, &mode_flags); 1487 + if (err) 1488 + return err; 1489 + 1460 1490 err = get_nodes(&nodes, nmask, maxnode); 1461 1491 if (err) 1462 1492 return err; 1463 - return do_mbind(start, len, mode, mode_flags, &nodes, flags); 1493 + 1494 + return do_mbind(start, len, lmode, mode_flags, &nodes, flags); 1464 1495 } 1465 1496 1466 1497 SYSCALL_DEFINE6(mbind, unsigned long, start, unsigned long, len, ··· 1486 1493 static long kernel_set_mempolicy(int mode, const unsigned long __user *nmask, 1487 1494 unsigned long maxnode) 1488 1495 { 1489 - int err; 1496 + unsigned short mode_flags; 1490 1497 nodemask_t nodes; 1491 - unsigned short flags; 1498 + int lmode = mode; 1499 + int err; 1492 1500 1493 - flags = mode & MPOL_MODE_FLAGS; 1494 - mode &= ~MPOL_MODE_FLAGS; 1495 - if ((unsigned int)mode >= MPOL_MAX) 1496 - return 
-EINVAL; 1497 - if ((flags & MPOL_F_STATIC_NODES) && (flags & MPOL_F_RELATIVE_NODES)) 1498 - return -EINVAL; 1501 + err = sanitize_mpol_flags(&lmode, &mode_flags); 1502 + if (err) 1503 + return err; 1504 + 1499 1505 err = get_nodes(&nodes, nmask, maxnode); 1500 1506 if (err) 1501 1507 return err; 1502 - return do_set_mempolicy(mode, flags, &nodes); 1508 + 1509 + return do_set_mempolicy(lmode, mode_flags, &nodes); 1503 1510 } 1504 1511 1505 1512 SYSCALL_DEFINE3(set_mempolicy, int, mode, const unsigned long __user *, nmask, ··· 1856 1863 BUG_ON(dynamic_policy_zone == ZONE_MOVABLE); 1857 1864 1858 1865 /* 1859 - * if policy->v.nodes has movable memory only, 1866 + * if policy->nodes has movable memory only, 1860 1867 * we apply policy when gfp_zone(gfp) = ZONE_MOVABLE only. 1861 1868 * 1862 - * policy->v.nodes is intersect with node_states[N_MEMORY]. 1869 + * policy->nodes is intersect with node_states[N_MEMORY]. 1863 1870 * so if the following test fails, it implies 1864 - * policy->v.nodes has movable memory only. 1871 + * policy->nodes has movable memory only. 
1865 1872 */ 1866 - if (!nodes_intersects(policy->v.nodes, node_states[N_HIGH_MEMORY])) 1873 + if (!nodes_intersects(policy->nodes, node_states[N_HIGH_MEMORY])) 1867 1874 dynamic_policy_zone = ZONE_MOVABLE; 1868 1875 1869 1876 return zone >= dynamic_policy_zone; ··· 1878 1885 /* Lower zones don't get a nodemask applied for MPOL_BIND */ 1879 1886 if (unlikely(policy->mode == MPOL_BIND) && 1880 1887 apply_policy_zone(policy, gfp_zone(gfp)) && 1881 - cpuset_nodemask_valid_mems_allowed(&policy->v.nodes)) 1882 - return &policy->v.nodes; 1888 + cpuset_nodemask_valid_mems_allowed(&policy->nodes)) 1889 + return &policy->nodes; 1883 1890 1884 1891 return NULL; 1885 1892 } ··· 1887 1894 /* Return the node id preferred by the given mempolicy, or the given id */ 1888 1895 static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd) 1889 1896 { 1890 - if (policy->mode == MPOL_PREFERRED && !(policy->flags & MPOL_F_LOCAL)) 1891 - nd = policy->v.preferred_node; 1892 - else { 1897 + if (policy->mode == MPOL_PREFERRED) { 1898 + nd = first_node(policy->nodes); 1899 + } else { 1893 1900 /* 1894 1901 * __GFP_THISNODE shouldn't even be used with the bind policy 1895 1902 * because we might easily break the expectation to stay on the ··· 1907 1914 unsigned next; 1908 1915 struct task_struct *me = current; 1909 1916 1910 - next = next_node_in(me->il_prev, policy->v.nodes); 1917 + next = next_node_in(me->il_prev, policy->nodes); 1911 1918 if (next < MAX_NUMNODES) 1912 1919 me->il_prev = next; 1913 1920 return next; ··· 1926 1933 return node; 1927 1934 1928 1935 policy = current->mempolicy; 1929 - if (!policy || policy->flags & MPOL_F_LOCAL) 1936 + if (!policy) 1930 1937 return node; 1931 1938 1932 1939 switch (policy->mode) { 1933 1940 case MPOL_PREFERRED: 1934 - /* 1935 - * handled MPOL_F_LOCAL above 1936 - */ 1937 - return policy->v.preferred_node; 1941 + return first_node(policy->nodes); 1938 1942 1939 1943 case MPOL_INTERLEAVE: 1940 1944 return interleave_nodes(policy); ··· 1947 
1957 enum zone_type highest_zoneidx = gfp_zone(GFP_KERNEL); 1948 1958 zonelist = &NODE_DATA(node)->node_zonelists[ZONELIST_FALLBACK]; 1949 1959 z = first_zones_zonelist(zonelist, highest_zoneidx, 1950 - &policy->v.nodes); 1960 + &policy->nodes); 1951 1961 return z->zone ? zone_to_nid(z->zone) : node; 1952 1962 } 1963 + case MPOL_LOCAL: 1964 + return node; 1953 1965 1954 1966 default: 1955 1967 BUG(); ··· 1960 1968 1961 1969 /* 1962 1970 * Do static interleaving for a VMA with known offset @n. Returns the n'th 1963 - * node in pol->v.nodes (starting from n=0), wrapping around if n exceeds the 1971 + * node in pol->nodes (starting from n=0), wrapping around if n exceeds the 1964 1972 * number of present nodes. 1965 1973 */ 1966 1974 static unsigned offset_il_node(struct mempolicy *pol, unsigned long n) 1967 1975 { 1968 - unsigned nnodes = nodes_weight(pol->v.nodes); 1976 + unsigned nnodes = nodes_weight(pol->nodes); 1969 1977 unsigned target; 1970 1978 int i; 1971 1979 int nid; ··· 1973 1981 if (!nnodes) 1974 1982 return numa_node_id(); 1975 1983 target = (unsigned int)n % nnodes; 1976 - nid = first_node(pol->v.nodes); 1984 + nid = first_node(pol->nodes); 1977 1985 for (i = 0; i < target; i++) 1978 - nid = next_node(nid, pol->v.nodes); 1986 + nid = next_node(nid, pol->nodes); 1979 1987 return nid; 1980 1988 } 1981 1989 ··· 2031 2039 } else { 2032 2040 nid = policy_node(gfp_flags, *mpol, numa_node_id()); 2033 2041 if ((*mpol)->mode == MPOL_BIND) 2034 - *nodemask = &(*mpol)->v.nodes; 2042 + *nodemask = &(*mpol)->nodes; 2035 2043 } 2036 2044 return nid; 2037 2045 } ··· 2055 2063 bool init_nodemask_of_mempolicy(nodemask_t *mask) 2056 2064 { 2057 2065 struct mempolicy *mempolicy; 2058 - int nid; 2059 2066 2060 2067 if (!(mask && current->mempolicy)) 2061 2068 return false; ··· 2063 2072 mempolicy = current->mempolicy; 2064 2073 switch (mempolicy->mode) { 2065 2074 case MPOL_PREFERRED: 2066 - if (mempolicy->flags & MPOL_F_LOCAL) 2067 - nid = numa_node_id(); 2068 - else 
2069 - nid = mempolicy->v.preferred_node; 2070 - init_nodemask_of_node(mask, nid); 2071 - break; 2072 - 2073 2075 case MPOL_BIND: 2074 2076 case MPOL_INTERLEAVE: 2075 - *mask = mempolicy->v.nodes; 2077 + *mask = mempolicy->nodes; 2078 + break; 2079 + 2080 + case MPOL_LOCAL: 2081 + init_nodemask_of_node(mask, numa_node_id()); 2076 2082 break; 2077 2083 2078 2084 default: ··· 2082 2094 #endif 2083 2095 2084 2096 /* 2085 - * mempolicy_nodemask_intersects 2097 + * mempolicy_in_oom_domain 2086 2098 * 2087 - * If tsk's mempolicy is "default" [NULL], return 'true' to indicate default 2088 - * policy. Otherwise, check for intersection between mask and the policy 2089 - * nodemask for 'bind' or 'interleave' policy. For 'preferred' or 'local' 2090 - * policy, always return true since it may allocate elsewhere on fallback. 2099 + * If tsk's mempolicy is "bind", check for intersection between mask and 2100 + * the policy nodemask. Otherwise, return true for all other policies 2101 + * including "interleave", as a tsk with "interleave" policy may have 2102 + * memory allocated from all nodes in system. 2091 2103 * 2092 2104 * Takes task_lock(tsk) to prevent freeing of its mempolicy. 2093 2105 */ 2094 - bool mempolicy_nodemask_intersects(struct task_struct *tsk, 2106 + bool mempolicy_in_oom_domain(struct task_struct *tsk, 2095 2107 const nodemask_t *mask) 2096 2108 { 2097 2109 struct mempolicy *mempolicy; ··· 2099 2111 2100 2112 if (!mask) 2101 2113 return ret; 2114 + 2102 2115 task_lock(tsk); 2103 2116 mempolicy = tsk->mempolicy; 2104 - if (!mempolicy) 2105 - goto out; 2106 - 2107 - switch (mempolicy->mode) { 2108 - case MPOL_PREFERRED: 2109 - /* 2110 - * MPOL_PREFERRED and MPOL_F_LOCAL are only preferred nodes to 2111 - * allocate from, they may fallback to other nodes when oom. 2112 - * Thus, it's possible for tsk to have allocated memory from 2113 - * nodes in mask. 
2114 - */ 2115 - break; 2116 - case MPOL_BIND: 2117 - case MPOL_INTERLEAVE: 2118 - ret = nodes_intersects(mempolicy->v.nodes, *mask); 2119 - break; 2120 - default: 2121 - BUG(); 2122 - } 2123 - out: 2117 + if (mempolicy && mempolicy->mode == MPOL_BIND) 2118 + ret = nodes_intersects(mempolicy->nodes, *mask); 2124 2119 task_unlock(tsk); 2120 + 2125 2121 return ret; 2126 2122 } 2127 2123 ··· 2176 2204 * If the policy is interleave, or does not allow the current 2177 2205 * node in its nodemask, we allocate the standard way. 2178 2206 */ 2179 - if (pol->mode == MPOL_PREFERRED && !(pol->flags & MPOL_F_LOCAL)) 2180 - hpage_node = pol->v.preferred_node; 2207 + if (pol->mode == MPOL_PREFERRED) 2208 + hpage_node = first_node(pol->nodes); 2181 2209 2182 2210 nmask = policy_nodemask(gfp, pol); 2183 2211 if (!nmask || node_isset(hpage_node, *nmask)) { ··· 2310 2338 switch (a->mode) { 2311 2339 case MPOL_BIND: 2312 2340 case MPOL_INTERLEAVE: 2313 - return !!nodes_equal(a->v.nodes, b->v.nodes); 2314 2341 case MPOL_PREFERRED: 2315 - /* a's ->flags is the same as b's */ 2316 - if (a->flags & MPOL_F_LOCAL) 2317 - return true; 2318 - return a->v.preferred_node == b->v.preferred_node; 2342 + return !!nodes_equal(a->nodes, b->nodes); 2343 + case MPOL_LOCAL: 2344 + return true; 2319 2345 default: 2320 2346 BUG(); 2321 2347 return false; ··· 2451 2481 break; 2452 2482 2453 2483 case MPOL_PREFERRED: 2454 - if (pol->flags & MPOL_F_LOCAL) 2455 - polnid = numa_node_id(); 2456 - else 2457 - polnid = pol->v.preferred_node; 2484 + polnid = first_node(pol->nodes); 2485 + break; 2486 + 2487 + case MPOL_LOCAL: 2488 + polnid = numa_node_id(); 2458 2489 break; 2459 2490 2460 2491 case MPOL_BIND: 2461 2492 /* Optimize placement among multiple nodes via NUMA balancing */ 2462 2493 if (pol->flags & MPOL_F_MORON) { 2463 - if (node_isset(thisnid, pol->v.nodes)) 2494 + if (node_isset(thisnid, pol->nodes)) 2464 2495 break; 2465 2496 goto out; 2466 2497 } ··· 2472 2501 * else select nearest allowed node, 
if any. 2473 2502 * If no allowed nodes, use current [!misplaced]. 2474 2503 */ 2475 - if (node_isset(curnid, pol->v.nodes)) 2504 + if (node_isset(curnid, pol->nodes)) 2476 2505 goto out; 2477 2506 z = first_zones_zonelist( 2478 2507 node_zonelist(numa_node_id(), GFP_HIGHUSER), 2479 2508 gfp_zone(GFP_HIGHUSER), 2480 - &pol->v.nodes); 2509 + &pol->nodes); 2481 2510 polnid = zone_to_nid(z->zone); 2482 2511 break; 2483 2512 ··· 2680 2709 vma->vm_pgoff, 2681 2710 sz, npol ? npol->mode : -1, 2682 2711 npol ? npol->flags : -1, 2683 - npol ? nodes_addr(npol->v.nodes)[0] : NUMA_NO_NODE); 2712 + npol ? nodes_addr(npol->nodes)[0] : NUMA_NO_NODE); 2684 2713 2685 2714 if (npol) { 2686 2715 new = sp_alloc(vma->vm_pgoff, vma->vm_pgoff + sz, npol); ··· 2778 2807 .refcnt = ATOMIC_INIT(1), 2779 2808 .mode = MPOL_PREFERRED, 2780 2809 .flags = MPOL_F_MOF | MPOL_F_MORON, 2781 - .v = { .preferred_node = nid, }, 2810 + .nodes = nodemask_of_node(nid), 2782 2811 }; 2783 2812 } 2784 2813 ··· 2822 2851 * Parse and format mempolicy from/to strings 2823 2852 */ 2824 2853 2825 - /* 2826 - * "local" is implemented internally by MPOL_PREFERRED with MPOL_F_LOCAL flag. 2827 - */ 2828 2854 static const char * const policy_modes[] = 2829 2855 { 2830 2856 [MPOL_DEFAULT] = "default", ··· 2899 2931 */ 2900 2932 if (nodelist) 2901 2933 goto out; 2902 - mode = MPOL_PREFERRED; 2903 2934 break; 2904 2935 case MPOL_DEFAULT: 2905 2936 /* ··· 2937 2970 * Save nodes for mpol_to_str() to show the tmpfs mount options 2938 2971 * for /proc/mounts, /proc/pid/mounts and /proc/pid/mountinfo. 
2939 2972 */ 2940 - if (mode != MPOL_PREFERRED) 2941 - new->v.nodes = nodes; 2942 - else if (nodelist) 2943 - new->v.preferred_node = first_node(nodes); 2944 - else 2945 - new->flags |= MPOL_F_LOCAL; 2973 + if (mode != MPOL_PREFERRED) { 2974 + new->nodes = nodes; 2975 + } else if (nodelist) { 2976 + nodes_clear(new->nodes); 2977 + node_set(first_node(nodes), new->nodes); 2978 + } else { 2979 + new->mode = MPOL_LOCAL; 2980 + } 2946 2981 2947 2982 /* 2948 2983 * Save nodes for contextualization: this will be used to "clone" ··· 2990 3021 2991 3022 switch (mode) { 2992 3023 case MPOL_DEFAULT: 3024 + case MPOL_LOCAL: 2993 3025 break; 2994 3026 case MPOL_PREFERRED: 2995 - if (flags & MPOL_F_LOCAL) 2996 - mode = MPOL_LOCAL; 2997 - else 2998 - node_set(pol->v.preferred_node, nodes); 2999 - break; 3000 3027 case MPOL_BIND: 3001 3028 case MPOL_INTERLEAVE: 3002 - nodes = pol->v.nodes; 3029 + nodes = pol->nodes; 3003 3030 break; 3004 3031 default: 3005 3032 WARN_ON_ONCE(1);
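The mempolicy.c hunks above replace the scalar `v.preferred_node` (and the `MPOL_F_LOCAL` flag hack) with a single-bit `pol->nodes` mask, recovered via `first_node()`. A toy userspace model of the new `mpol_new_preferred()` behavior — not kernel code, with a 64-node `uint64_t` standing in for `nodemask_t` — sketches the semantics:

```c
#include <assert.h>
#include <stdint.h>

/* Toy model (illustrative, not the kernel's types): after this series,
 * MPOL_PREFERRED keeps exactly one bit set in pol->nodes, and lookups
 * use first_node() instead of the removed v.preferred_node scalar. */
typedef uint64_t nodemask_t;            /* supports up to 64 nodes here */

static int first_node(nodemask_t m)
{
	return m ? __builtin_ctzll(m) : 64; /* 64 == "no node" sentinel */
}

static void node_set(int n, nodemask_t *m)
{
	*m |= 1ull << n;
}

/* mirrors mpol_new_preferred(): reject an empty mask, keep only the
 * first requested node */
static int new_preferred(nodemask_t *pol_nodes, nodemask_t requested)
{
	if (!requested)
		return -22;                 /* -EINVAL */
	*pol_nodes = 0;                 /* nodes_clear() */
	node_set(first_node(requested), pol_nodes);
	return 0;
}
```

With a requested mask of {1, 3}, only node 1 survives in `pol_nodes`, matching the `nodes_clear()` + `node_set(first_node(*nodes))` pair in the hunk above.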
+85 -183
mm/migrate.c
··· 210 210 * Recheck VMA as permissions can change since migration started 211 211 */ 212 212 entry = pte_to_swp_entry(*pvmw.pte); 213 - if (is_write_migration_entry(entry)) 213 + if (is_writable_migration_entry(entry)) 214 214 pte = maybe_mkwrite(pte, vma); 215 215 else if (pte_swp_uffd_wp(*pvmw.pte)) 216 216 pte = pte_mkuffd_wp(pte); 217 217 218 218 if (unlikely(is_device_private_page(new))) { 219 - entry = make_device_private_entry(new, pte_write(pte)); 219 + if (pte_write(pte)) 220 + entry = make_writable_device_private_entry( 221 + page_to_pfn(new)); 222 + else 223 + entry = make_readable_device_private_entry( 224 + page_to_pfn(new)); 220 225 pte = swp_entry_to_pte(entry); 221 226 if (pte_swp_soft_dirty(*pvmw.pte)) 222 227 pte = pte_swp_mksoft_dirty(pte); ··· 231 226 232 227 #ifdef CONFIG_HUGETLB_PAGE 233 228 if (PageHuge(new)) { 229 + unsigned int shift = huge_page_shift(hstate_vma(vma)); 230 + 234 231 pte = pte_mkhuge(pte); 235 - pte = arch_make_huge_pte(pte, vma, new, 0); 232 + pte = arch_make_huge_pte(pte, shift, vma->vm_flags); 236 233 set_huge_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte); 237 234 if (PageAnon(new)) 238 235 hugepage_add_anon_rmap(new, vma, pvmw.address); ··· 301 294 if (!is_migration_entry(entry)) 302 295 goto out; 303 296 304 - page = migration_entry_to_page(entry); 297 + page = pfn_swap_entry_to_page(entry); 305 298 page = compound_head(page); 306 299 307 300 /* ··· 342 335 ptl = pmd_lock(mm, pmd); 343 336 if (!is_pmd_migration_entry(*pmd)) 344 337 goto unlock; 345 - page = migration_entry_to_page(pmd_to_swp_entry(*pmd)); 338 + page = pfn_swap_entry_to_page(pmd_to_swp_entry(*pmd)); 346 339 if (!get_page_unless_zero(page)) 347 340 goto unlock; 348 341 spin_unlock(ptl); ··· 558 551 } 559 552 } 560 553 561 - static void copy_huge_page(struct page *dst, struct page *src) 554 + void copy_huge_page(struct page *dst, struct page *src) 562 555 { 563 556 int i; 564 557 int nr_pages; ··· 633 626 if (PageSwapCache(page)) 634 627 
ClearPageSwapCache(page); 635 628 ClearPagePrivate(page); 636 - set_page_private(page, 0); 629 + 630 + /* page->private contains hugetlb specific flags */ 631 + if (!PageHuge(page)) 632 + set_page_private(page, 0); 637 633 638 634 /* 639 635 * If any waiters have accumulated on the new page then ··· 1109 1099 /* Establish migration ptes */ 1110 1100 VM_BUG_ON_PAGE(PageAnon(page) && !PageKsm(page) && !anon_vma, 1111 1101 page); 1112 - try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK); 1102 + try_to_migrate(page, 0); 1113 1103 page_was_mapped = 1; 1114 1104 } 1115 1105 ··· 1298 1288 * page_mapping() set, hugetlbfs specific move page routine will not 1299 1289 * be called and we could leak usage counts for subpools. 1300 1290 */ 1301 - if (page_private(hpage) && !page_mapping(hpage)) { 1291 + if (hugetlb_page_subpool(hpage) && !page_mapping(hpage)) { 1302 1292 rc = -EBUSY; 1303 1293 goto out_unlock; 1304 1294 } ··· 1311 1301 1312 1302 if (page_mapped(hpage)) { 1313 1303 bool mapping_locked = false; 1314 - enum ttu_flags ttu = TTU_MIGRATION|TTU_IGNORE_MLOCK; 1304 + enum ttu_flags ttu = 0; 1315 1305 1316 1306 if (!PageAnon(hpage)) { 1317 1307 /* ··· 1328 1318 ttu |= TTU_RMAP_LOCKED; 1329 1319 } 1330 1320 1331 - try_to_unmap(hpage, ttu); 1321 + try_to_migrate(hpage, ttu); 1332 1322 page_was_mapped = 1; 1333 1323 1334 1324 if (mapping_locked) ··· 1428 1418 int swapwrite = current->flags & PF_SWAPWRITE; 1429 1419 int rc, nr_subpages; 1430 1420 LIST_HEAD(ret_pages); 1421 + bool nosplit = (reason == MR_NUMA_MISPLACED); 1431 1422 1432 1423 trace_mm_migrate_pages_start(mode, reason); 1433 1424 ··· 1500 1489 /* 1501 1490 * When memory is low, don't bother to try to migrate 1502 1491 * other pages, just exit. 1492 + * THP NUMA faulting doesn't split THP to retry. 
1503 1493 */ 1504 - if (is_thp) { 1494 + if (is_thp && !nosplit) { 1505 1495 if (!try_split_thp(page, &page2, from)) { 1506 1496 nr_thp_split++; 1507 1497 goto retry; ··· 2055 2043 return newpage; 2056 2044 } 2057 2045 2046 + static struct page *alloc_misplaced_dst_page_thp(struct page *page, 2047 + unsigned long data) 2048 + { 2049 + int nid = (int) data; 2050 + struct page *newpage; 2051 + 2052 + newpage = alloc_pages_node(nid, (GFP_TRANSHUGE_LIGHT | __GFP_THISNODE), 2053 + HPAGE_PMD_ORDER); 2054 + if (!newpage) 2055 + goto out; 2056 + 2057 + prep_transhuge_page(newpage); 2058 + 2059 + out: 2060 + return newpage; 2061 + } 2062 + 2058 2063 static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page) 2059 2064 { 2060 2065 int page_lru; 2061 2066 2062 2067 VM_BUG_ON_PAGE(compound_order(page) && !PageTransHuge(page), page); 2068 + 2069 + /* Do not migrate THP mapped by multiple processes */ 2070 + if (PageTransHuge(page) && total_mapcount(page) > 1) 2071 + return 0; 2063 2072 2064 2073 /* Avoid migrating to a node that is nearly full */ 2065 2074 if (!migrate_balanced_pgdat(pgdat, compound_nr(page))) ··· 2088 2055 2089 2056 if (isolate_lru_page(page)) 2090 2057 return 0; 2091 - 2092 - /* 2093 - * migrate_misplaced_transhuge_page() skips page migration's usual 2094 - * check on page_count(), so we must do it here, now that the page 2095 - * has been isolated: a GUP pin, or any other pin, prevents migration. 2096 - * The expected page count is 3: 1 for page's mapcount and 1 for the 2097 - * caller's pin and 1 for the reference taken by isolate_lru_page(). 
2098 - */ 2099 - if (PageTransHuge(page) && page_count(page) != 3) { 2100 - putback_lru_page(page); 2101 - return 0; 2102 - } 2103 2058 2104 2059 page_lru = page_is_file_lru(page); 2105 2060 mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON + page_lru, ··· 2102 2081 return 1; 2103 2082 } 2104 2083 2105 - bool pmd_trans_migrating(pmd_t pmd) 2106 - { 2107 - struct page *page = pmd_page(pmd); 2108 - return PageLocked(page); 2109 - } 2110 - 2111 2084 /* 2112 2085 * Attempt to migrate a misplaced page to the specified destination 2113 2086 * node. Caller is expected to have an elevated reference count on ··· 2114 2099 int isolated; 2115 2100 int nr_remaining; 2116 2101 LIST_HEAD(migratepages); 2102 + new_page_t *new; 2103 + bool compound; 2104 + unsigned int nr_pages = thp_nr_pages(page); 2105 + 2106 + /* 2107 + * PTE mapped THP or HugeTLB page can't reach here so the page could 2108 + * be either base page or THP. And it must be head page if it is 2109 + * THP. 2110 + */ 2111 + compound = PageTransHuge(page); 2112 + 2113 + if (compound) 2114 + new = alloc_misplaced_dst_page_thp; 2115 + else 2116 + new = alloc_misplaced_dst_page; 2117 2117 2118 2118 /* 2119 2119 * Don't migrate file pages that are mapped in multiple processes ··· 2150 2120 goto out; 2151 2121 2152 2122 list_add(&page->lru, &migratepages); 2153 - nr_remaining = migrate_pages(&migratepages, alloc_misplaced_dst_page, 2154 - NULL, node, MIGRATE_ASYNC, 2155 - MR_NUMA_MISPLACED); 2123 + nr_remaining = migrate_pages(&migratepages, *new, NULL, node, 2124 + MIGRATE_ASYNC, MR_NUMA_MISPLACED); 2156 2125 if (nr_remaining) { 2157 2126 if (!list_empty(&migratepages)) { 2158 2127 list_del(&page->lru); 2159 - dec_node_page_state(page, NR_ISOLATED_ANON + 2160 - page_is_file_lru(page)); 2128 + mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON + 2129 + page_is_file_lru(page), -nr_pages); 2161 2130 putback_lru_page(page); 2162 2131 } 2163 2132 isolated = 0; 2164 2133 } else 2165 - 
count_vm_numa_event(NUMA_PAGE_MIGRATE); 2134 + count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_pages); 2166 2135 BUG_ON(!list_empty(&migratepages)); 2167 2136 return isolated; 2168 2137 ··· 2170 2141 return 0; 2171 2142 } 2172 2143 #endif /* CONFIG_NUMA_BALANCING */ 2173 - 2174 - #if defined(CONFIG_NUMA_BALANCING) && defined(CONFIG_TRANSPARENT_HUGEPAGE) 2175 - /* 2176 - * Migrates a THP to a given target node. page must be locked and is unlocked 2177 - * before returning. 2178 - */ 2179 - int migrate_misplaced_transhuge_page(struct mm_struct *mm, 2180 - struct vm_area_struct *vma, 2181 - pmd_t *pmd, pmd_t entry, 2182 - unsigned long address, 2183 - struct page *page, int node) 2184 - { 2185 - spinlock_t *ptl; 2186 - pg_data_t *pgdat = NODE_DATA(node); 2187 - int isolated = 0; 2188 - struct page *new_page = NULL; 2189 - int page_lru = page_is_file_lru(page); 2190 - unsigned long start = address & HPAGE_PMD_MASK; 2191 - 2192 - new_page = alloc_pages_node(node, 2193 - (GFP_TRANSHUGE_LIGHT | __GFP_THISNODE), 2194 - HPAGE_PMD_ORDER); 2195 - if (!new_page) 2196 - goto out_fail; 2197 - prep_transhuge_page(new_page); 2198 - 2199 - isolated = numamigrate_isolate_page(pgdat, page); 2200 - if (!isolated) { 2201 - put_page(new_page); 2202 - goto out_fail; 2203 - } 2204 - 2205 - /* Prepare a page as a migration target */ 2206 - __SetPageLocked(new_page); 2207 - if (PageSwapBacked(page)) 2208 - __SetPageSwapBacked(new_page); 2209 - 2210 - /* anon mapping, we can simply copy page->mapping to the new page: */ 2211 - new_page->mapping = page->mapping; 2212 - new_page->index = page->index; 2213 - /* flush the cache before copying using the kernel virtual address */ 2214 - flush_cache_range(vma, start, start + HPAGE_PMD_SIZE); 2215 - migrate_page_copy(new_page, page); 2216 - WARN_ON(PageLRU(new_page)); 2217 - 2218 - /* Recheck the target PMD */ 2219 - ptl = pmd_lock(mm, pmd); 2220 - if (unlikely(!pmd_same(*pmd, entry) || !page_ref_freeze(page, 2))) { 2221 - spin_unlock(ptl); 2222 - 2223 - 
/* Reverse changes made by migrate_page_copy() */ 2224 - if (TestClearPageActive(new_page)) 2225 - SetPageActive(page); 2226 - if (TestClearPageUnevictable(new_page)) 2227 - SetPageUnevictable(page); 2228 - 2229 - unlock_page(new_page); 2230 - put_page(new_page); /* Free it */ 2231 - 2232 - /* Retake the callers reference and putback on LRU */ 2233 - get_page(page); 2234 - putback_lru_page(page); 2235 - mod_node_page_state(page_pgdat(page), 2236 - NR_ISOLATED_ANON + page_lru, -HPAGE_PMD_NR); 2237 - 2238 - goto out_unlock; 2239 - } 2240 - 2241 - entry = mk_huge_pmd(new_page, vma->vm_page_prot); 2242 - entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); 2243 - 2244 - /* 2245 - * Overwrite the old entry under pagetable lock and establish 2246 - * the new PTE. Any parallel GUP will either observe the old 2247 - * page blocking on the page lock, block on the page table 2248 - * lock or observe the new page. The SetPageUptodate on the 2249 - * new page and page_add_new_anon_rmap guarantee the copy is 2250 - * visible before the pagetable update. 2251 - */ 2252 - page_add_anon_rmap(new_page, vma, start, true); 2253 - /* 2254 - * At this point the pmd is numa/protnone (i.e. non present) and the TLB 2255 - * has already been flushed globally. So no TLB can be currently 2256 - * caching this non present pmd mapping. There's no need to clear the 2257 - * pmd before doing set_pmd_at(), nor to flush the TLB after 2258 - * set_pmd_at(). Clearing the pmd here would introduce a race 2259 - * condition against MADV_DONTNEED, because MADV_DONTNEED only holds the 2260 - * mmap_lock for reading. If the pmd is set to NULL at any given time, 2261 - * MADV_DONTNEED won't wait on the pmd lock and it'll skip clearing this 2262 - * pmd. 
2263 - */ 2264 - set_pmd_at(mm, start, pmd, entry); 2265 - update_mmu_cache_pmd(vma, address, &entry); 2266 - 2267 - page_ref_unfreeze(page, 2); 2268 - mlock_migrate_page(new_page, page); 2269 - page_remove_rmap(page, true); 2270 - set_page_owner_migrate_reason(new_page, MR_NUMA_MISPLACED); 2271 - 2272 - spin_unlock(ptl); 2273 - 2274 - /* Take an "isolate" reference and put new page on the LRU. */ 2275 - get_page(new_page); 2276 - putback_lru_page(new_page); 2277 - 2278 - unlock_page(new_page); 2279 - unlock_page(page); 2280 - put_page(page); /* Drop the rmap reference */ 2281 - put_page(page); /* Drop the LRU isolation reference */ 2282 - 2283 - count_vm_events(PGMIGRATE_SUCCESS, HPAGE_PMD_NR); 2284 - count_vm_numa_events(NUMA_PAGE_MIGRATE, HPAGE_PMD_NR); 2285 - 2286 - mod_node_page_state(page_pgdat(page), 2287 - NR_ISOLATED_ANON + page_lru, 2288 - -HPAGE_PMD_NR); 2289 - return isolated; 2290 - 2291 - out_fail: 2292 - count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR); 2293 - ptl = pmd_lock(mm, pmd); 2294 - if (pmd_same(*pmd, entry)) { 2295 - entry = pmd_modify(entry, vma->vm_page_prot); 2296 - set_pmd_at(mm, start, pmd, entry); 2297 - update_mmu_cache_pmd(vma, address, &entry); 2298 - } 2299 - spin_unlock(ptl); 2300 - 2301 - out_unlock: 2302 - unlock_page(page); 2303 - put_page(page); 2304 - return 0; 2305 - } 2306 - #endif /* CONFIG_NUMA_BALANCING */ 2307 - 2308 2144 #endif /* CONFIG_NUMA */ 2309 2145 2310 2146 #ifdef CONFIG_DEVICE_PRIVATE ··· 2294 2400 if (!is_device_private_entry(entry)) 2295 2401 goto next; 2296 2402 2297 - page = device_private_entry_to_page(entry); 2403 + page = pfn_swap_entry_to_page(entry); 2298 2404 if (!(migrate->flags & 2299 2405 MIGRATE_VMA_SELECT_DEVICE_PRIVATE) || 2300 2406 page->pgmap->owner != migrate->pgmap_owner) ··· 2302 2408 2303 2409 mpfn = migrate_pfn(page_to_pfn(page)) | 2304 2410 MIGRATE_PFN_MIGRATE; 2305 - if (is_write_device_private_entry(entry)) 2411 + if (is_writable_device_private_entry(entry)) 2306 2412 mpfn |= 
MIGRATE_PFN_WRITE; 2307 2413 } else { 2308 2414 if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM)) ··· 2348 2454 ptep_get_and_clear(mm, addr, ptep); 2349 2455 2350 2456 /* Setup special migration page table entry */ 2351 - entry = make_migration_entry(page, mpfn & 2352 - MIGRATE_PFN_WRITE); 2457 + if (mpfn & MIGRATE_PFN_WRITE) 2458 + entry = make_writable_migration_entry( 2459 + page_to_pfn(page)); 2460 + else 2461 + entry = make_readable_migration_entry( 2462 + page_to_pfn(page)); 2353 2463 swp_pte = swp_entry_to_pte(entry); 2354 2464 if (pte_present(pte)) { 2355 2465 if (pte_soft_dirty(pte)) ··· 2416 2518 * that the registered device driver can skip invalidating device 2417 2519 * private page mappings that won't be migrated. 2418 2520 */ 2419 - mmu_notifier_range_init_migrate(&range, 0, migrate->vma, 2420 - migrate->vma->vm_mm, migrate->start, migrate->end, 2521 + mmu_notifier_range_init_owner(&range, MMU_NOTIFY_MIGRATE, 0, 2522 + migrate->vma, migrate->vma->vm_mm, migrate->start, migrate->end, 2421 2523 migrate->pgmap_owner); 2422 2524 mmu_notifier_invalidate_range_start(&range); 2423 2525 ··· 2602 2704 */ 2603 2705 static void migrate_vma_unmap(struct migrate_vma *migrate) 2604 2706 { 2605 - int flags = TTU_MIGRATION | TTU_IGNORE_MLOCK; 2606 2707 const unsigned long npages = migrate->npages; 2607 2708 const unsigned long start = migrate->start; 2608 2709 unsigned long addr, i, restore = 0; ··· 2613 2716 continue; 2614 2717 2615 2718 if (page_mapped(page)) { 2616 - try_to_unmap(page, flags); 2719 + try_to_migrate(page, 0); 2617 2720 if (page_mapped(page)) 2618 2721 goto restore; 2619 2722 } ··· 2825 2928 if (is_device_private_page(page)) { 2826 2929 swp_entry_t swp_entry; 2827 2930 2828 - swp_entry = make_device_private_entry(page, vma->vm_flags & VM_WRITE); 2931 + if (vma->vm_flags & VM_WRITE) 2932 + swp_entry = make_writable_device_private_entry( 2933 + page_to_pfn(page)); 2934 + else 2935 + swp_entry = make_readable_device_private_entry( 2936 + 
page_to_pfn(page)); 2829 2937 entry = swp_entry_to_pte(swp_entry); 2830 2938 } else { 2831 2939 /* ··· 2927 3025 if (!notified) { 2928 3026 notified = true; 2929 3027 2930 - mmu_notifier_range_init_migrate(&range, 0, 2931 - migrate->vma, migrate->vma->vm_mm, 2932 - addr, migrate->end, 3028 + mmu_notifier_range_init_owner(&range, 3029 + MMU_NOTIFY_MIGRATE, 0, migrate->vma, 3030 + migrate->vma->vm_mm, addr, migrate->end, 2933 3031 migrate->pgmap_owner); 2934 3032 mmu_notifier_invalidate_range_start(&range); 2935 3033 }
+6 -6
mm/mlock.c
··· 108 108 /* 109 109 * Finish munlock after successful page isolation 110 110 * 111 - * Page must be locked. This is a wrapper for try_to_munlock() 111 + * Page must be locked. This is a wrapper for page_mlock() 112 112 * and putback_lru_page() with munlock accounting. 113 113 */ 114 114 static void __munlock_isolated_page(struct page *page) ··· 118 118 * and we don't need to check all the other vmas. 119 119 */ 120 120 if (page_mapcount(page) > 1) 121 - try_to_munlock(page); 121 + page_mlock(page); 122 122 123 123 /* Did try_to_unlock() succeed or punt? */ 124 124 if (!PageMlocked(page)) ··· 158 158 * munlock()ed or munmap()ed, we want to check whether other vmas hold the 159 159 * page locked so that we can leave it on the unevictable lru list and not 160 160 * bother vmscan with it. However, to walk the page's rmap list in 161 - * try_to_munlock() we must isolate the page from the LRU. If some other 161 + * page_mlock() we must isolate the page from the LRU. If some other 162 162 * task has removed the page from the LRU, we won't be able to do that. 163 163 * So we clear the PageMlocked as we might not get another chance. If we 164 164 * can't isolate the page, we leave it for putback_lru_page() and vmscan ··· 168 168 { 169 169 int nr_pages; 170 170 171 - /* For try_to_munlock() and to serialize with page migration */ 171 + /* For page_mlock() and to serialize with page migration */ 172 172 BUG_ON(!PageLocked(page)); 173 173 VM_BUG_ON_PAGE(PageTail(page), page); 174 174 ··· 205 205 * 206 206 * The fast path is available only for evictable pages with single mapping. 207 207 * Then we can bypass the per-cpu pvec and get better performance. 208 - * when mapcount > 1 we need try_to_munlock() which can fail. 208 + * when mapcount > 1 we need page_mlock() which can fail. 209 209 * when !page_evictable(), we need the full redo logic of putback_lru_page to 210 210 * avoid leaving evictable page in unevictable list. 
211 211 * ··· 414 414 * 415 415 * We don't save and restore VM_LOCKED here because pages are 416 416 * still on lru. In unmap path, pages might be scanned by reclaim 417 - * and re-mlocked by try_to_{munlock|unmap} before we unmap and 417 + * and re-mlocked by page_mlock/try_to_unmap before we unmap and 418 418 * free them. This will result in freeing mlocked pages. 419 419 */ 420 420 void munlock_vma_pages_range(struct vm_area_struct *vma,
+32 -27
mm/mmap_lock.c
··· 153 153 rcu_read_unlock(); 154 154 } 155 155 156 + #define TRACE_MMAP_LOCK_EVENT(type, mm, ...) \ 157 + do { \ 158 + const char *memcg_path; \ 159 + preempt_disable(); \ 160 + memcg_path = get_mm_memcg_path(mm); \ 161 + trace_mmap_lock_##type(mm, \ 162 + memcg_path != NULL ? memcg_path : "", \ 163 + ##__VA_ARGS__); \ 164 + if (likely(memcg_path != NULL)) \ 165 + put_memcg_path_buf(); \ 166 + preempt_enable(); \ 167 + } while (0) 168 + 169 + #else /* !CONFIG_MEMCG */ 170 + 171 + int trace_mmap_lock_reg(void) 172 + { 173 + return 0; 174 + } 175 + 176 + void trace_mmap_lock_unreg(void) 177 + { 178 + } 179 + 180 + #define TRACE_MMAP_LOCK_EVENT(type, mm, ...) \ 181 + trace_mmap_lock_##type(mm, "", ##__VA_ARGS__) 182 + 183 + #endif /* CONFIG_MEMCG */ 184 + 185 + #ifdef CONFIG_TRACING 186 + #ifdef CONFIG_MEMCG 156 187 /* 157 188 * Write the given mm_struct's memcg path to a percpu buffer, and return a 158 189 * pointer to it. If the path cannot be determined, or no buffer was available ··· 218 187 return buf; 219 188 } 220 189 221 - #define TRACE_MMAP_LOCK_EVENT(type, mm, ...) \ 222 - do { \ 223 - const char *memcg_path; \ 224 - local_lock(&memcg_paths.lock); \ 225 - memcg_path = get_mm_memcg_path(mm); \ 226 - trace_mmap_lock_##type(mm, \ 227 - memcg_path != NULL ? memcg_path : "", \ 228 - ##__VA_ARGS__); \ 229 - if (likely(memcg_path != NULL)) \ 230 - put_memcg_path_buf(); \ 231 - local_unlock(&memcg_paths.lock); \ 232 - } while (0) 233 - 234 - #else /* !CONFIG_MEMCG */ 235 - 236 - int trace_mmap_lock_reg(void) 237 - { 238 - return 0; 239 - } 240 - 241 - void trace_mmap_lock_unreg(void) 242 - { 243 - } 244 - 245 - #define TRACE_MMAP_LOCK_EVENT(type, mm, ...) \ 246 - trace_mmap_lock_##type(mm, "", ##__VA_ARGS__) 247 - 248 190 #endif /* CONFIG_MEMCG */ 249 191 250 192 /* ··· 243 239 TRACE_MMAP_LOCK_EVENT(released, mm, write); 244 240 } 245 241 EXPORT_SYMBOL(__mmap_lock_do_trace_released); 242 + #endif /* CONFIG_TRACING */
+14 -4
mm/mprotect.c
··· 143 143 swp_entry_t entry = pte_to_swp_entry(oldpte); 144 144 pte_t newpte; 145 145 146 - if (is_write_migration_entry(entry)) { 146 + if (is_writable_migration_entry(entry)) { 147 147 /* 148 148 * A protection check is difficult so 149 149 * just be safe and disable write 150 150 */ 151 - make_migration_entry_read(&entry); 151 + entry = make_readable_migration_entry( 152 + swp_offset(entry)); 152 153 newpte = swp_entry_to_pte(entry); 153 154 if (pte_swp_soft_dirty(oldpte)) 154 155 newpte = pte_swp_mksoft_dirty(newpte); 155 156 if (pte_swp_uffd_wp(oldpte)) 156 157 newpte = pte_swp_mkuffd_wp(newpte); 157 - } else if (is_write_device_private_entry(entry)) { 158 + } else if (is_writable_device_private_entry(entry)) { 158 159 /* 159 160 * We do not preserve soft-dirtiness. See 160 161 * copy_one_pte() for explanation. 161 162 */ 162 - make_device_private_entry_read(&entry); 163 + entry = make_readable_device_private_entry( 164 + swp_offset(entry)); 163 165 newpte = swp_entry_to_pte(entry); 166 + if (pte_swp_uffd_wp(oldpte)) 167 + newpte = pte_swp_mkuffd_wp(newpte); 168 + } else if (is_writable_device_exclusive_entry(entry)) { 169 + entry = make_readable_device_exclusive_entry( 170 + swp_offset(entry)); 171 + newpte = swp_entry_to_pte(entry); 172 + if (pte_swp_soft_dirty(oldpte)) 173 + newpte = pte_swp_mksoft_dirty(newpte); 164 174 if (pte_swp_uffd_wp(oldpte)) 165 175 newpte = pte_swp_mkuffd_wp(newpte); 166 176 } else {
+2 -3
mm/nommu.c
··· 223 223 */ 224 224 void *vmalloc(unsigned long size) 225 225 { 226 - return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM); 226 + return __vmalloc(size, GFP_KERNEL); 227 227 } 228 228 EXPORT_SYMBOL(vmalloc); 229 229 ··· 241 241 */ 242 242 void *vzalloc(unsigned long size) 243 243 { 244 - return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO); 244 + return __vmalloc(size, GFP_KERNEL | __GFP_ZERO); 245 245 } 246 246 EXPORT_SYMBOL(vzalloc); 247 247 ··· 1501 1501 delete_vma(mm, vma); 1502 1502 return 0; 1503 1503 } 1504 - EXPORT_SYMBOL(do_munmap); 1505 1504 1506 1505 int vm_munmap(unsigned long addr, size_t len) 1507 1506 {
+1 -1
mm/oom_kill.c
··· 104 104 * mempolicy intersects current, otherwise it may be 105 105 * needlessly killed. 106 106 */ 107 - ret = mempolicy_nodemask_intersects(tsk, mask); 107 + ret = mempolicy_in_oom_domain(tsk, mask); 108 108 } else { 109 109 /* 110 110 * This is not a mempolicy constrained oom, so only
+2 -3
mm/page_alloc.c
··· 749 749 __SetPageHead(page); 750 750 for (i = 1; i < nr_pages; i++) { 751 751 struct page *p = page + i; 752 - set_page_count(p, 0); 753 752 p->mapping = TAIL_MAPPING; 754 753 set_compound_head(p, page); 755 754 } ··· 3192 3193 int cpu; 3193 3194 3194 3195 /* 3195 - * Allocate in the BSS so we wont require allocation in 3196 + * Allocate in the BSS so we won't require allocation in 3196 3197 * direct reclaim path for CONFIG_CPUMASK_OFFSTACK=y 3197 3198 */ 3198 3199 static cpumask_t cpus_with_pcps; ··· 3831 3832 3832 3833 #endif /* CONFIG_FAIL_PAGE_ALLOC */ 3833 3834 3834 - noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order) 3835 + static noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order) 3835 3836 { 3836 3837 return __should_fail_alloc_page(gfp_mask, order); 3837 3838 }
+9 -6
mm/page_vma_mapped.c
··· 41 41 42 42 /* Handle un-addressable ZONE_DEVICE memory */ 43 43 entry = pte_to_swp_entry(*pvmw->pte); 44 - if (!is_device_private_entry(entry)) 44 + if (!is_device_private_entry(entry) && 45 + !is_device_exclusive_entry(entry)) 45 46 return false; 46 47 } else if (!pte_present(*pvmw->pte)) 47 48 return false; ··· 94 93 return false; 95 94 entry = pte_to_swp_entry(*pvmw->pte); 96 95 97 - if (!is_migration_entry(entry)) 96 + if (!is_migration_entry(entry) && 97 + !is_device_exclusive_entry(entry)) 98 98 return false; 99 99 100 - pfn = migration_entry_to_pfn(entry); 100 + pfn = swp_offset(entry); 101 101 } else if (is_swap_pte(*pvmw->pte)) { 102 102 swp_entry_t entry; 103 103 104 104 /* Handle un-addressable ZONE_DEVICE memory */ 105 105 entry = pte_to_swp_entry(*pvmw->pte); 106 - if (!is_device_private_entry(entry)) 106 + if (!is_device_private_entry(entry) && 107 + !is_device_exclusive_entry(entry)) 107 108 return false; 108 109 109 - pfn = device_private_entry_to_pfn(entry); 110 + pfn = swp_offset(entry); 110 111 } else { 111 112 if (!pte_present(*pvmw->pte)) 112 113 return false; ··· 236 233 return not_found(pvmw); 237 234 entry = pmd_to_swp_entry(pmde); 238 235 if (!is_migration_entry(entry) || 239 - migration_entry_to_page(entry) != page) 236 + pfn_swap_entry_to_page(entry) != page) 240 237 return not_found(pvmw); 241 238 return true; 242 239 }
+512 -116
mm/rmap.c
··· 1405 1405 /* 1406 1406 * When racing against e.g. zap_pte_range() on another cpu, 1407 1407 * in between its ptep_get_and_clear_full() and page_remove_rmap(), 1408 - * try_to_unmap() may return false when it is about to become true, 1408 + * try_to_unmap() may return before page_mapped() has become false, 1409 1409 * if page table locking is skipped: use TTU_SYNC to wait for that. 1410 1410 */ 1411 1411 if (flags & TTU_SYNC) 1412 1412 pvmw.flags = PVMW_SYNC; 1413 1413 1414 - /* munlock has nothing to gain from examining un-locked vmas */ 1415 - if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED)) 1416 - return true; 1417 - 1418 - if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION) && 1419 - is_zone_device_page(page) && !is_device_private_page(page)) 1420 - return true; 1421 - 1422 - if (flags & TTU_SPLIT_HUGE_PMD) { 1423 - split_huge_pmd_address(vma, address, 1424 - flags & TTU_SPLIT_FREEZE, page); 1425 - } 1414 + if (flags & TTU_SPLIT_HUGE_PMD) 1415 + split_huge_pmd_address(vma, address, false, page); 1426 1416 1427 1417 /* 1428 1418 * For THP, we have to assume the worse case ie pmd for invalidation. ··· 1437 1447 mmu_notifier_invalidate_range_start(&range); 1438 1448 1439 1449 while (page_vma_mapped_walk(&pvmw)) { 1440 - #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION 1441 - /* PMD-mapped THP migration entry */ 1442 - if (!pvmw.pte && (flags & TTU_MIGRATION)) { 1443 - VM_BUG_ON_PAGE(PageHuge(page) || !PageTransCompound(page), page); 1444 - 1445 - set_pmd_migration_entry(&pvmw, page); 1446 - continue; 1447 - } 1448 - #endif 1449 - 1450 1450 /* 1451 1451 * If the page is mlock()d, we cannot swap it out. 1452 1452 * If it's recently referenced (perhaps page_referenced ··· 1456 1476 page_vma_mapped_walk_done(&pvmw); 1457 1477 break; 1458 1478 } 1459 - if (flags & TTU_MUNLOCK) 1460 - continue; 1461 1479 } 1462 1480 1463 1481 /* Unexpected PMD-mapped THP? 
*/ ··· 1496 1518 page_vma_mapped_walk_done(&pvmw); 1497 1519 break; 1498 1520 } 1499 - } 1500 - 1501 - if (IS_ENABLED(CONFIG_MIGRATION) && 1502 - (flags & TTU_MIGRATION) && 1503 - is_zone_device_page(page)) { 1504 - swp_entry_t entry; 1505 - pte_t swp_pte; 1506 - 1507 - pteval = ptep_get_and_clear(mm, pvmw.address, pvmw.pte); 1508 - 1509 - /* 1510 - * Store the pfn of the page in a special migration 1511 - * pte. do_swap_page() will wait until the migration 1512 - * pte is removed and then restart fault handling. 1513 - */ 1514 - entry = make_migration_entry(page, 0); 1515 - swp_pte = swp_entry_to_pte(entry); 1516 - 1517 - /* 1518 - * pteval maps a zone device page and is therefore 1519 - * a swap pte. 1520 - */ 1521 - if (pte_swp_soft_dirty(pteval)) 1522 - swp_pte = pte_swp_mksoft_dirty(swp_pte); 1523 - if (pte_swp_uffd_wp(pteval)) 1524 - swp_pte = pte_swp_mkuffd_wp(swp_pte); 1525 - set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte); 1526 - /* 1527 - * No need to invalidate here it will synchronize on 1528 - * against the special swap migration pte. 1529 - * 1530 - * The assignment to subpage above was computed from a 1531 - * swap PTE which results in an invalid pointer. 1532 - * Since only PAGE_SIZE pages can currently be 1533 - * migrated, just set it to page. This will need to be 1534 - * changed when hugepage migrations to device private 1535 - * memory are supported. 1536 - */ 1537 - subpage = page; 1538 - goto discard; 1539 1521 } 1540 1522 1541 1523 /* Nuke the page table entry. 
*/ ··· 1550 1612 /* We have to invalidate as we cleared the pte */ 1551 1613 mmu_notifier_invalidate_range(mm, address, 1552 1614 address + PAGE_SIZE); 1553 - } else if (IS_ENABLED(CONFIG_MIGRATION) && 1554 - (flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) { 1555 - swp_entry_t entry; 1556 - pte_t swp_pte; 1557 - 1558 - if (arch_unmap_one(mm, vma, address, pteval) < 0) { 1559 - set_pte_at(mm, address, pvmw.pte, pteval); 1560 - ret = false; 1561 - page_vma_mapped_walk_done(&pvmw); 1562 - break; 1563 - } 1564 - 1565 - /* 1566 - * Store the pfn of the page in a special migration 1567 - * pte. do_swap_page() will wait until the migration 1568 - * pte is removed and then restart fault handling. 1569 - */ 1570 - entry = make_migration_entry(subpage, 1571 - pte_write(pteval)); 1572 - swp_pte = swp_entry_to_pte(entry); 1573 - if (pte_soft_dirty(pteval)) 1574 - swp_pte = pte_swp_mksoft_dirty(swp_pte); 1575 - if (pte_uffd_wp(pteval)) 1576 - swp_pte = pte_swp_mkuffd_wp(swp_pte); 1577 - set_pte_at(mm, address, pvmw.pte, swp_pte); 1578 - /* 1579 - * No need to invalidate here it will synchronize on 1580 - * against the special swap migration pte. 1581 - */ 1582 1615 } else if (PageAnon(page)) { 1583 1616 swp_entry_t entry = { .val = page_private(subpage) }; 1584 1617 pte_t swp_pte; ··· 1665 1756 * Tries to remove all the page table entries which are mapping this 1666 1757 * page, used in the pageout path. Caller must hold the page lock. 1667 1758 * 1668 - * If unmap is successful, return true. Otherwise, false. 1759 + * It is the caller's responsibility to check if the page is still 1760 + * mapped when needed (use TTU_SYNC to prevent accounting races). 
1669 1761 */ 1670 - bool try_to_unmap(struct page *page, enum ttu_flags flags) 1762 + void try_to_unmap(struct page *page, enum ttu_flags flags) 1671 1763 { 1672 1764 struct rmap_walk_control rwc = { 1673 1765 .rmap_one = try_to_unmap_one, ··· 1676 1766 .done = page_not_mapped, 1677 1767 .anon_lock = page_lock_anon_vma_read, 1678 1768 }; 1769 + 1770 + if (flags & TTU_RMAP_LOCKED) 1771 + rmap_walk_locked(page, &rwc); 1772 + else 1773 + rmap_walk(page, &rwc); 1774 + } 1775 + 1776 + /* 1777 + * @arg: enum ttu_flags will be passed to this argument. 1778 + * 1779 + * If TTU_SPLIT_HUGE_PMD is specified any PMD mappings will be split into PTEs 1780 + * containing migration entries. This and TTU_RMAP_LOCKED are the only supported 1781 + * flags. 1782 + */ 1783 + static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, 1784 + unsigned long address, void *arg) 1785 + { 1786 + struct mm_struct *mm = vma->vm_mm; 1787 + struct page_vma_mapped_walk pvmw = { 1788 + .page = page, 1789 + .vma = vma, 1790 + .address = address, 1791 + }; 1792 + pte_t pteval; 1793 + struct page *subpage; 1794 + bool ret = true; 1795 + struct mmu_notifier_range range; 1796 + enum ttu_flags flags = (enum ttu_flags)(long)arg; 1797 + 1798 + if (is_zone_device_page(page) && !is_device_private_page(page)) 1799 + return true; 1800 + 1801 + /* 1802 + * When racing against e.g. zap_pte_range() on another cpu, 1803 + * in between its ptep_get_and_clear_full() and page_remove_rmap(), 1804 + * try_to_migrate() may return before page_mapped() has become false, 1805 + * if page table locking is skipped: use TTU_SYNC to wait for that. 1806 + */ 1807 + if (flags & TTU_SYNC) 1808 + pvmw.flags = PVMW_SYNC; 1809 + 1810 + /* 1811 + * unmap_page() in mm/huge_memory.c is the only user of migration with 1812 + * TTU_SPLIT_HUGE_PMD and it wants to freeze. 
1813 + */ 1814 + if (flags & TTU_SPLIT_HUGE_PMD) 1815 + split_huge_pmd_address(vma, address, true, page); 1816 + 1817 + /* 1818 + * For THP, we have to assume the worse case ie pmd for invalidation. 1819 + * For hugetlb, it could be much worse if we need to do pud 1820 + * invalidation in the case of pmd sharing. 1821 + * 1822 + * Note that the page can not be free in this function as call of 1823 + * try_to_unmap() must hold a reference on the page. 1824 + */ 1825 + range.end = PageKsm(page) ? 1826 + address + PAGE_SIZE : vma_address_end(page, vma); 1827 + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm, 1828 + address, range.end); 1829 + if (PageHuge(page)) { 1830 + /* 1831 + * If sharing is possible, start and end will be adjusted 1832 + * accordingly. 1833 + */ 1834 + adjust_range_if_pmd_sharing_possible(vma, &range.start, 1835 + &range.end); 1836 + } 1837 + mmu_notifier_invalidate_range_start(&range); 1838 + 1839 + while (page_vma_mapped_walk(&pvmw)) { 1840 + #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION 1841 + /* PMD-mapped THP migration entry */ 1842 + if (!pvmw.pte) { 1843 + VM_BUG_ON_PAGE(PageHuge(page) || 1844 + !PageTransCompound(page), page); 1845 + 1846 + set_pmd_migration_entry(&pvmw, page); 1847 + continue; 1848 + } 1849 + #endif 1850 + 1851 + /* Unexpected PMD-mapped THP? */ 1852 + VM_BUG_ON_PAGE(!pvmw.pte, page); 1853 + 1854 + subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte); 1855 + address = pvmw.address; 1856 + 1857 + if (PageHuge(page) && !PageAnon(page)) { 1858 + /* 1859 + * To call huge_pmd_unshare, i_mmap_rwsem must be 1860 + * held in write mode. Caller needs to explicitly 1861 + * do this outside rmap routines. 1862 + */ 1863 + VM_BUG_ON(!(flags & TTU_RMAP_LOCKED)); 1864 + if (huge_pmd_unshare(mm, vma, &address, pvmw.pte)) { 1865 + /* 1866 + * huge_pmd_unshare unmapped an entire PMD 1867 + * page. There is no way of knowing exactly 1868 + * which PMDs may be cached for this mm, so 1869 + * we must flush them all. 
start/end were 1870 + * already adjusted above to cover this range. 1871 + */ 1872 + flush_cache_range(vma, range.start, range.end); 1873 + flush_tlb_range(vma, range.start, range.end); 1874 + mmu_notifier_invalidate_range(mm, range.start, 1875 + range.end); 1876 + 1877 + /* 1878 + * The ref count of the PMD page was dropped 1879 + * which is part of the way map counting 1880 + * is done for shared PMDs. Return 'true' 1881 + * here. When there is no other sharing, 1882 + * huge_pmd_unshare returns false and we will 1883 + * unmap the actual page and drop map count 1884 + * to zero. 1885 + */ 1886 + page_vma_mapped_walk_done(&pvmw); 1887 + break; 1888 + } 1889 + } 1890 + 1891 + /* Nuke the page table entry. */ 1892 + flush_cache_page(vma, address, pte_pfn(*pvmw.pte)); 1893 + pteval = ptep_clear_flush(vma, address, pvmw.pte); 1894 + 1895 + /* Move the dirty bit to the page. Now the pte is gone. */ 1896 + if (pte_dirty(pteval)) 1897 + set_page_dirty(page); 1898 + 1899 + /* Update high watermark before we lower rss */ 1900 + update_hiwater_rss(mm); 1901 + 1902 + if (is_zone_device_page(page)) { 1903 + swp_entry_t entry; 1904 + pte_t swp_pte; 1905 + 1906 + /* 1907 + * Store the pfn of the page in a special migration 1908 + * pte. do_swap_page() will wait until the migration 1909 + * pte is removed and then restart fault handling. 1910 + */ 1911 + entry = make_readable_migration_entry( 1912 + page_to_pfn(page)); 1913 + swp_pte = swp_entry_to_pte(entry); 1914 + 1915 + /* 1916 + * pteval maps a zone device page and is therefore 1917 + * a swap pte. 1918 + */ 1919 + if (pte_swp_soft_dirty(pteval)) 1920 + swp_pte = pte_swp_mksoft_dirty(swp_pte); 1921 + if (pte_swp_uffd_wp(pteval)) 1922 + swp_pte = pte_swp_mkuffd_wp(swp_pte); 1923 + set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte); 1924 + /* 1925 + * No need to invalidate here it will synchronize on 1926 + * against the special swap migration pte. 
1927 + * 1928 + * The assignment to subpage above was computed from a 1929 + * swap PTE which results in an invalid pointer. 1930 + * Since only PAGE_SIZE pages can currently be 1931 + * migrated, just set it to page. This will need to be 1932 + * changed when hugepage migrations to device private 1933 + * memory are supported. 1934 + */ 1935 + subpage = page; 1936 + } else if (PageHWPoison(page)) { 1937 + pteval = swp_entry_to_pte(make_hwpoison_entry(subpage)); 1938 + if (PageHuge(page)) { 1939 + hugetlb_count_sub(compound_nr(page), mm); 1940 + set_huge_swap_pte_at(mm, address, 1941 + pvmw.pte, pteval, 1942 + vma_mmu_pagesize(vma)); 1943 + } else { 1944 + dec_mm_counter(mm, mm_counter(page)); 1945 + set_pte_at(mm, address, pvmw.pte, pteval); 1946 + } 1947 + 1948 + } else if (pte_unused(pteval) && !userfaultfd_armed(vma)) { 1949 + /* 1950 + * The guest indicated that the page content is of no 1951 + * interest anymore. Simply discard the pte, vmscan 1952 + * will take care of the rest. 1953 + * A future reference will then fault in a new zero 1954 + * page. When userfaultfd is active, we must not drop 1955 + * this page though, as its main user (postcopy 1956 + * migration) will not expect userfaults on already 1957 + * copied pages. 1958 + */ 1959 + dec_mm_counter(mm, mm_counter(page)); 1960 + /* We have to invalidate as we cleared the pte */ 1961 + mmu_notifier_invalidate_range(mm, address, 1962 + address + PAGE_SIZE); 1963 + } else { 1964 + swp_entry_t entry; 1965 + pte_t swp_pte; 1966 + 1967 + if (arch_unmap_one(mm, vma, address, pteval) < 0) { 1968 + set_pte_at(mm, address, pvmw.pte, pteval); 1969 + ret = false; 1970 + page_vma_mapped_walk_done(&pvmw); 1971 + break; 1972 + } 1973 + 1974 + /* 1975 + * Store the pfn of the page in a special migration 1976 + * pte. do_swap_page() will wait until the migration 1977 + * pte is removed and then restart fault handling. 
1978 + */ 1979 + if (pte_write(pteval)) 1980 + entry = make_writable_migration_entry( 1981 + page_to_pfn(subpage)); 1982 + else 1983 + entry = make_readable_migration_entry( 1984 + page_to_pfn(subpage)); 1985 + 1986 + swp_pte = swp_entry_to_pte(entry); 1987 + if (pte_soft_dirty(pteval)) 1988 + swp_pte = pte_swp_mksoft_dirty(swp_pte); 1989 + if (pte_uffd_wp(pteval)) 1990 + swp_pte = pte_swp_mkuffd_wp(swp_pte); 1991 + set_pte_at(mm, address, pvmw.pte, swp_pte); 1992 + /* 1993 + * No need to invalidate here it will synchronize on 1994 + * against the special swap migration pte. 1995 + */ 1996 + } 1997 + 1998 + /* 1999 + * No need to call mmu_notifier_invalidate_range() it has be 2000 + * done above for all cases requiring it to happen under page 2001 + * table lock before mmu_notifier_invalidate_range_end() 2002 + * 2003 + * See Documentation/vm/mmu_notifier.rst 2004 + */ 2005 + page_remove_rmap(subpage, PageHuge(page)); 2006 + put_page(page); 2007 + } 2008 + 2009 + mmu_notifier_invalidate_range_end(&range); 2010 + 2011 + return ret; 2012 + } 2013 + 2014 + /** 2015 + * try_to_migrate - try to replace all page table mappings with swap entries 2016 + * @page: the page to replace page table entries for 2017 + * @flags: action and flags 2018 + * 2019 + * Tries to remove all the page table entries which are mapping this page and 2020 + * replace them with special swap entries. Caller must hold the page lock. 2021 + * 2022 + * If is successful, return true. Otherwise, false. 2023 + */ 2024 + void try_to_migrate(struct page *page, enum ttu_flags flags) 2025 + { 2026 + struct rmap_walk_control rwc = { 2027 + .rmap_one = try_to_migrate_one, 2028 + .arg = (void *)flags, 2029 + .done = page_not_mapped, 2030 + .anon_lock = page_lock_anon_vma_read, 2031 + }; 2032 + 2033 + /* 2034 + * Migration always ignores mlock and only supports TTU_RMAP_LOCKED and 2035 + * TTU_SPLIT_HUGE_PMD and TTU_SYNC flags. 
2036 + */ 2037 + if (WARN_ON_ONCE(flags & ~(TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD | 2038 + TTU_SYNC))) 2039 + return; 1679 2040 1680 2041 /* 1681 2042 * During exec, a temporary VMA is setup and later moved. ··· 1956 1775 * locking requirements of exec(), migration skips 1957 1776 * temporary VMAs until after exec() completes. 1958 1777 */ 1959 - if ((flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE)) 1960 - && !PageKsm(page) && PageAnon(page)) 1778 + if (!PageKsm(page) && PageAnon(page)) 1961 1779 rwc.invalid_vma = invalid_migration_vma; 1962 1780 1963 1781 if (flags & TTU_RMAP_LOCKED) 1964 1782 rmap_walk_locked(page, &rwc); 1965 1783 else 1966 1784 rmap_walk(page, &rwc); 1785 + } 1967 1786 1968 - /* 1969 - * When racing against e.g. zap_pte_range() on another cpu, 1970 - * in between its ptep_get_and_clear_full() and page_remove_rmap(), 1971 - * try_to_unmap() may return false when it is about to become true, 1972 - * if page table locking is skipped: use TTU_SYNC to wait for that. 1973 - */ 1974 - return !page_mapcount(page); 1787 + /* 1788 + * Walks the vma's mapping a page and mlocks the page if any locked vma's are 1789 + * found. Once one is found the page is locked and the scan can be terminated. 1790 + */ 1791 + static bool page_mlock_one(struct page *page, struct vm_area_struct *vma, 1792 + unsigned long address, void *unused) 1793 + { 1794 + struct page_vma_mapped_walk pvmw = { 1795 + .page = page, 1796 + .vma = vma, 1797 + .address = address, 1798 + }; 1799 + 1800 + /* An un-locked vma doesn't have any pages to lock, continue the scan */ 1801 + if (!(vma->vm_flags & VM_LOCKED)) 1802 + return true; 1803 + 1804 + while (page_vma_mapped_walk(&pvmw)) { 1805 + /* 1806 + * Need to recheck under the ptl to serialise with 1807 + * __munlock_pagevec_fill() after VM_LOCKED is cleared in 1808 + * munlock_vma_pages_range(). 
1809 + */ 1810 + if (vma->vm_flags & VM_LOCKED) { 1811 + /* PTE-mapped THP are never mlocked */ 1812 + if (!PageTransCompound(page)) 1813 + mlock_vma_page(page); 1814 + page_vma_mapped_walk_done(&pvmw); 1815 + } 1816 + 1817 + /* 1818 + * no need to continue scanning other vma's if the page has 1819 + * been locked. 1820 + */ 1821 + return false; 1822 + } 1823 + 1824 + return true; 1975 1825 } 1976 1826 1977 1827 /** 1978 - * try_to_munlock - try to munlock a page 1979 - * @page: the page to be munlocked 1828 + * page_mlock - try to mlock a page 1829 + * @page: the page to be mlocked 1980 1830 * 1981 - * Called from munlock code. Checks all of the VMAs mapping the page 1982 - * to make sure nobody else has this page mlocked. The page will be 1983 - * returned with PG_mlocked cleared if no other vmas have it mlocked. 1831 + * Called from munlock code. Checks all of the VMAs mapping the page and mlocks 1832 + * the page if any are found. The page will be returned with PG_mlocked cleared 1833 + * if it is not mapped by any locked vmas. 
1984 1834 */ 1985 - 1986 - void try_to_munlock(struct page *page) 1835 + void page_mlock(struct page *page) 1987 1836 { 1988 1837 struct rmap_walk_control rwc = { 1989 - .rmap_one = try_to_unmap_one, 1990 - .arg = (void *)TTU_MUNLOCK, 1838 + .rmap_one = page_mlock_one, 1991 1839 .done = page_not_mapped, 1992 1840 .anon_lock = page_lock_anon_vma_read, 1993 1841 ··· 2027 1817 2028 1818 rmap_walk(page, &rwc); 2029 1819 } 1820 + 1821 + #ifdef CONFIG_DEVICE_PRIVATE 1822 + struct make_exclusive_args { 1823 + struct mm_struct *mm; 1824 + unsigned long address; 1825 + void *owner; 1826 + bool valid; 1827 + }; 1828 + 1829 + static bool page_make_device_exclusive_one(struct page *page, 1830 + struct vm_area_struct *vma, unsigned long address, void *priv) 1831 + { 1832 + struct mm_struct *mm = vma->vm_mm; 1833 + struct page_vma_mapped_walk pvmw = { 1834 + .page = page, 1835 + .vma = vma, 1836 + .address = address, 1837 + }; 1838 + struct make_exclusive_args *args = priv; 1839 + pte_t pteval; 1840 + struct page *subpage; 1841 + bool ret = true; 1842 + struct mmu_notifier_range range; 1843 + swp_entry_t entry; 1844 + pte_t swp_pte; 1845 + 1846 + mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0, vma, 1847 + vma->vm_mm, address, min(vma->vm_end, 1848 + address + page_size(page)), args->owner); 1849 + mmu_notifier_invalidate_range_start(&range); 1850 + 1851 + while (page_vma_mapped_walk(&pvmw)) { 1852 + /* Unexpected PMD-mapped THP? */ 1853 + VM_BUG_ON_PAGE(!pvmw.pte, page); 1854 + 1855 + if (!pte_present(*pvmw.pte)) { 1856 + ret = false; 1857 + page_vma_mapped_walk_done(&pvmw); 1858 + break; 1859 + } 1860 + 1861 + subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte); 1862 + address = pvmw.address; 1863 + 1864 + /* Nuke the page table entry. */ 1865 + flush_cache_page(vma, address, pte_pfn(*pvmw.pte)); 1866 + pteval = ptep_clear_flush(vma, address, pvmw.pte); 1867 + 1868 + /* Move the dirty bit to the page. Now the pte is gone. 
*/ 1869 + if (pte_dirty(pteval)) 1870 + set_page_dirty(page); 1871 + 1872 + /* 1873 + * Check that our target page is still mapped at the expected 1874 + * address. 1875 + */ 1876 + if (args->mm == mm && args->address == address && 1877 + pte_write(pteval)) 1878 + args->valid = true; 1879 + 1880 + /* 1881 + * Store the pfn of the page in a special migration 1882 + * pte. do_swap_page() will wait until the migration 1883 + * pte is removed and then restart fault handling. 1884 + */ 1885 + if (pte_write(pteval)) 1886 + entry = make_writable_device_exclusive_entry( 1887 + page_to_pfn(subpage)); 1888 + else 1889 + entry = make_readable_device_exclusive_entry( 1890 + page_to_pfn(subpage)); 1891 + swp_pte = swp_entry_to_pte(entry); 1892 + if (pte_soft_dirty(pteval)) 1893 + swp_pte = pte_swp_mksoft_dirty(swp_pte); 1894 + if (pte_uffd_wp(pteval)) 1895 + swp_pte = pte_swp_mkuffd_wp(swp_pte); 1896 + 1897 + set_pte_at(mm, address, pvmw.pte, swp_pte); 1898 + 1899 + /* 1900 + * There is a reference on the page for the swap entry which has 1901 + * been removed, so shouldn't take another. 1902 + */ 1903 + page_remove_rmap(subpage, false); 1904 + } 1905 + 1906 + mmu_notifier_invalidate_range_end(&range); 1907 + 1908 + return ret; 1909 + } 1910 + 1911 + /** 1912 + * page_make_device_exclusive - mark the page exclusively owned by a device 1913 + * @page: the page to replace page table entries for 1914 + * @mm: the mm_struct where the page is expected to be mapped 1915 + * @address: address where the page is expected to be mapped 1916 + * @owner: passed to MMU_NOTIFY_EXCLUSIVE range notifier callbacks 1917 + * 1918 + * Tries to remove all the page table entries which are mapping this page and 1919 + * replace them with special device exclusive swap entries to grant a device 1920 + * exclusive access to the page. Caller must hold the page lock. 1921 + * 1922 + * Returns false if the page is still mapped, or if it could not be unmapped 1923 + * from the expected address. 
Otherwise returns true (success). 1924 + */ 1925 + static bool page_make_device_exclusive(struct page *page, struct mm_struct *mm, 1926 + unsigned long address, void *owner) 1927 + { 1928 + struct make_exclusive_args args = { 1929 + .mm = mm, 1930 + .address = address, 1931 + .owner = owner, 1932 + .valid = false, 1933 + }; 1934 + struct rmap_walk_control rwc = { 1935 + .rmap_one = page_make_device_exclusive_one, 1936 + .done = page_not_mapped, 1937 + .anon_lock = page_lock_anon_vma_read, 1938 + .arg = &args, 1939 + }; 1940 + 1941 + /* 1942 + * Restrict to anonymous pages for now to avoid potential writeback 1943 + * issues. Also tail pages shouldn't be passed to rmap_walk so skip 1944 + * those. 1945 + */ 1946 + if (!PageAnon(page) || PageTail(page)) 1947 + return false; 1948 + 1949 + rmap_walk(page, &rwc); 1950 + 1951 + return args.valid && !page_mapcount(page); 1952 + } 1953 + 1954 + /** 1955 + * make_device_exclusive_range() - Mark a range for exclusive use by a device 1956 + * @mm: mm_struct of assoicated target process 1957 + * @start: start of the region to mark for exclusive device access 1958 + * @end: end address of region 1959 + * @pages: returns the pages which were successfully marked for exclusive access 1960 + * @owner: passed to MMU_NOTIFY_EXCLUSIVE range notifier to allow filtering 1961 + * 1962 + * Returns: number of pages found in the range by GUP. A page is marked for 1963 + * exclusive access only if the page pointer is non-NULL. 1964 + * 1965 + * This function finds ptes mapping page(s) to the given address range, locks 1966 + * them and replaces mappings with special swap entries preventing userspace CPU 1967 + * access. On fault these entries are replaced with the original mapping after 1968 + * calling MMU notifiers. 1969 + * 1970 + * A driver using this to program access from a device must use a mmu notifier 1971 + * critical section to hold a device specific lock during programming. 
Once 1972 + * programming is complete it should drop the page lock and reference after 1973 + * which point CPU access to the page will revoke the exclusive access. 1974 + */ 1975 + int make_device_exclusive_range(struct mm_struct *mm, unsigned long start, 1976 + unsigned long end, struct page **pages, 1977 + void *owner) 1978 + { 1979 + long npages = (end - start) >> PAGE_SHIFT; 1980 + long i; 1981 + 1982 + npages = get_user_pages_remote(mm, start, npages, 1983 + FOLL_GET | FOLL_WRITE | FOLL_SPLIT_PMD, 1984 + pages, NULL, NULL); 1985 + if (npages < 0) 1986 + return npages; 1987 + 1988 + for (i = 0; i < npages; i++, start += PAGE_SIZE) { 1989 + if (!trylock_page(pages[i])) { 1990 + put_page(pages[i]); 1991 + pages[i] = NULL; 1992 + continue; 1993 + } 1994 + 1995 + if (!page_make_device_exclusive(pages[i], mm, start, owner)) { 1996 + unlock_page(pages[i]); 1997 + put_page(pages[i]); 1998 + pages[i] = NULL; 1999 + } 2000 + } 2001 + 2002 + return npages; 2003 + } 2004 + EXPORT_SYMBOL_GPL(make_device_exclusive_range); 2005 + #endif 2030 2006 2031 2007 void __put_anon_vma(struct anon_vma *anon_vma) 2032 2008 { ··· 2254 1858 * Find all the mappings of a page using the mapping pointer and the vma chains 2255 1859 * contained in the anon_vma struct it points to. 2256 1860 * 2257 - * When called from try_to_munlock(), the mmap_lock of the mm containing the vma 1861 + * When called from page_mlock(), the mmap_lock of the mm containing the vma 2258 1862 * where the page was found will be held for write. So, we won't recheck 2259 1863 * vm_flags for that VMA. That should be OK, because that vma shouldn't be 2260 1864 * LOCKED. ··· 2307 1911 * Find all the mappings of a page using the mapping pointer and the vma chains 2308 1912 * contained in the address_space struct it points to. 
2309 1913 * 2310 - * When called from try_to_munlock(), the mmap_lock of the mm containing the vma 1914 + * When called from page_mlock(), the mmap_lock of the mm containing the vma 2311 1915 * where the page was found will be held for write. So, we won't recheck 2312 1916 * vm_flags for that VMA. That should be OK, because that vma shouldn't be 2313 1917 * LOCKED.
+39 -84
mm/shmem.c
··· 1797 1797 * vm. If we swap it in we mark it dirty since we also free the swap 1798 1798 * entry since a page cannot live in both the swap and page cache. 1799 1799 * 1800 - * vmf and fault_type are only supplied by shmem_fault: 1800 + * vma, vmf, and fault_type are only supplied by shmem_fault: 1801 1801 * otherwise they are NULL. 1802 1802 */ 1803 1803 static int shmem_getpage_gfp(struct inode *inode, pgoff_t index, ··· 1832 1832 1833 1833 page = pagecache_get_page(mapping, index, 1834 1834 FGP_ENTRY | FGP_HEAD | FGP_LOCK, 0); 1835 + 1836 + if (page && vma && userfaultfd_minor(vma)) { 1837 + if (!xa_is_value(page)) { 1838 + unlock_page(page); 1839 + put_page(page); 1840 + } 1841 + *fault_type = handle_userfault(vmf, VM_UFFD_MINOR); 1842 + return 0; 1843 + } 1844 + 1835 1845 if (xa_is_value(page)) { 1836 1846 error = shmem_swapin_page(inode, index, &page, 1837 1847 sgp, gfp, vma, fault_type); ··· 2362 2352 return inode; 2363 2353 } 2364 2354 2365 - static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm, 2366 - pmd_t *dst_pmd, 2367 - struct vm_area_struct *dst_vma, 2368 - unsigned long dst_addr, 2369 - unsigned long src_addr, 2370 - bool zeropage, 2371 - struct page **pagep) 2355 + #ifdef CONFIG_USERFAULTFD 2356 + int shmem_mfill_atomic_pte(struct mm_struct *dst_mm, 2357 + pmd_t *dst_pmd, 2358 + struct vm_area_struct *dst_vma, 2359 + unsigned long dst_addr, 2360 + unsigned long src_addr, 2361 + bool zeropage, 2362 + struct page **pagep) 2372 2363 { 2373 2364 struct inode *inode = file_inode(dst_vma->vm_file); 2374 2365 struct shmem_inode_info *info = SHMEM_I(inode); 2375 2366 struct address_space *mapping = inode->i_mapping; 2376 2367 gfp_t gfp = mapping_gfp_mask(mapping); 2377 2368 pgoff_t pgoff = linear_page_index(dst_vma, dst_addr); 2378 - spinlock_t *ptl; 2379 2369 void *page_kaddr; 2380 2370 struct page *page; 2381 - pte_t _dst_pte, *dst_pte; 2382 2371 int ret; 2383 - pgoff_t offset, max_off; 2372 + pgoff_t max_off; 2384 2373 2385 - ret = -ENOMEM; 2386 
2374 if (!shmem_inode_acct_block(inode, 1)) { 2387 2375 /* 2388 2376 * We may have got a page, returned -ENOENT triggering a retry, ··· 2391 2383 put_page(*pagep); 2392 2384 *pagep = NULL; 2393 2385 } 2394 - goto out; 2386 + return -ENOMEM; 2395 2387 } 2396 2388 2397 2389 if (!*pagep) { 2390 + ret = -ENOMEM; 2398 2391 page = shmem_alloc_page(gfp, info, pgoff); 2399 2392 if (!page) 2400 2393 goto out_unacct_blocks; 2401 2394 2402 - if (!zeropage) { /* mcopy_atomic */ 2395 + if (!zeropage) { /* COPY */ 2403 2396 page_kaddr = kmap_atomic(page); 2404 2397 ret = copy_from_user(page_kaddr, 2405 2398 (const void __user *)src_addr, ··· 2410 2401 /* fallback to copy_from_user outside mmap_lock */ 2411 2402 if (unlikely(ret)) { 2412 2403 *pagep = page; 2413 - shmem_inode_unacct_blocks(inode, 1); 2404 + ret = -ENOENT; 2414 2405 /* don't free the page */ 2415 - return -ENOENT; 2406 + goto out_unacct_blocks; 2416 2407 } 2417 - } else { /* mfill_zeropage_atomic */ 2408 + } else { /* ZEROPAGE */ 2418 2409 clear_highpage(page); 2419 2410 } 2420 2411 } else { ··· 2422 2413 *pagep = NULL; 2423 2414 } 2424 2415 2425 - VM_BUG_ON(PageLocked(page) || PageSwapBacked(page)); 2416 + VM_BUG_ON(PageLocked(page)); 2417 + VM_BUG_ON(PageSwapBacked(page)); 2426 2418 __SetPageLocked(page); 2427 2419 __SetPageSwapBacked(page); 2428 2420 __SetPageUptodate(page); 2429 2421 2430 2422 ret = -EFAULT; 2431 - offset = linear_page_index(dst_vma, dst_addr); 2432 2423 max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE); 2433 - if (unlikely(offset >= max_off)) 2424 + if (unlikely(pgoff >= max_off)) 2434 2425 goto out_release; 2435 2426 2436 2427 ret = shmem_add_to_page_cache(page, mapping, pgoff, NULL, ··· 2438 2429 if (ret) 2439 2430 goto out_release; 2440 2431 2441 - _dst_pte = mk_pte(page, dst_vma->vm_page_prot); 2442 - if (dst_vma->vm_flags & VM_WRITE) 2443 - _dst_pte = pte_mkwrite(pte_mkdirty(_dst_pte)); 2444 - else { 2445 - /* 2446 - * We don't set the pte dirty if the vma has no 2447 - * VM_WRITE 
permission, so mark the page dirty or it 2448 - * could be freed from under us. We could do it 2449 - * unconditionally before unlock_page(), but doing it 2450 - * only if VM_WRITE is not set is faster. 2451 - */ 2452 - set_page_dirty(page); 2453 - } 2454 - 2455 - dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl); 2456 - 2457 - ret = -EFAULT; 2458 - max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE); 2459 - if (unlikely(offset >= max_off)) 2460 - goto out_release_unlock; 2461 - 2462 - ret = -EEXIST; 2463 - if (!pte_none(*dst_pte)) 2464 - goto out_release_unlock; 2465 - 2466 - lru_cache_add(page); 2432 + ret = mfill_atomic_install_pte(dst_mm, dst_pmd, dst_vma, dst_addr, 2433 + page, true, false); 2434 + if (ret) 2435 + goto out_delete_from_cache; 2467 2436 2468 2437 spin_lock_irq(&info->lock); 2469 2438 info->alloced++; ··· 2449 2462 shmem_recalc_inode(inode); 2450 2463 spin_unlock_irq(&info->lock); 2451 2464 2452 - inc_mm_counter(dst_mm, mm_counter_file(page)); 2453 - page_add_file_rmap(page, false); 2454 - set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte); 2455 - 2456 - /* No need to invalidate - it was non-present before */ 2457 - update_mmu_cache(dst_vma, dst_addr, dst_pte); 2458 - pte_unmap_unlock(dst_pte, ptl); 2465 + SetPageDirty(page); 2459 2466 unlock_page(page); 2460 - ret = 0; 2461 - out: 2462 - return ret; 2463 - out_release_unlock: 2464 - pte_unmap_unlock(dst_pte, ptl); 2465 - ClearPageDirty(page); 2467 + return 0; 2468 + out_delete_from_cache: 2466 2469 delete_from_page_cache(page); 2467 2470 out_release: 2468 2471 unlock_page(page); 2469 2472 put_page(page); 2470 2473 out_unacct_blocks: 2471 2474 shmem_inode_unacct_blocks(inode, 1); 2472 - goto out; 2475 + return ret; 2473 2476 } 2474 - 2475 - int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, 2476 - pmd_t *dst_pmd, 2477 - struct vm_area_struct *dst_vma, 2478 - unsigned long dst_addr, 2479 - unsigned long src_addr, 2480 - struct page **pagep) 2481 - { 2482 - return 
shmem_mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, 2483 - dst_addr, src_addr, false, pagep); 2484 - } 2485 - 2486 - int shmem_mfill_zeropage_pte(struct mm_struct *dst_mm, 2487 - pmd_t *dst_pmd, 2488 - struct vm_area_struct *dst_vma, 2489 - unsigned long dst_addr) 2490 - { 2491 - struct page *page = NULL; 2492 - 2493 - return shmem_mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, 2494 - dst_addr, 0, true, &page); 2495 - } 2477 + #endif /* CONFIG_USERFAULTFD */ 2496 2478 2497 2479 #ifdef CONFIG_TMPFS 2498 2480 static const struct inode_operations shmem_symlink_inode_operations; ··· 3996 4040 loff_t i_size; 3997 4041 pgoff_t off; 3998 4042 3999 - if ((vma->vm_flags & VM_NOHUGEPAGE) || 4000 - test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags)) 4043 + if (!transhuge_vma_enabled(vma, vma->vm_flags)) 4001 4044 return false; 4002 4045 if (shmem_huge == SHMEM_HUGE_FORCE) 4003 4046 return true;
+354
mm/sparse-vmemmap.c
··· 27 27 #include <linux/spinlock.h> 28 28 #include <linux/vmalloc.h> 29 29 #include <linux/sched.h> 30 + #include <linux/pgtable.h> 31 + #include <linux/bootmem_info.h> 32 + 30 33 #include <asm/dma.h> 31 34 #include <asm/pgalloc.h> 35 + #include <asm/tlbflush.h> 36 + 37 + /** 38 + * struct vmemmap_remap_walk - walk vmemmap page table 39 + * 40 + * @remap_pte: called for each lowest-level entry (PTE). 41 + * @nr_walked: the number of walked pte. 42 + * @reuse_page: the page which is reused for the tail vmemmap pages. 43 + * @reuse_addr: the virtual address of the @reuse_page page. 44 + * @vmemmap_pages: the list head of the vmemmap pages that can be freed 45 + * or is mapped from. 46 + */ 47 + struct vmemmap_remap_walk { 48 + void (*remap_pte)(pte_t *pte, unsigned long addr, 49 + struct vmemmap_remap_walk *walk); 50 + unsigned long nr_walked; 51 + struct page *reuse_page; 52 + unsigned long reuse_addr; 53 + struct list_head *vmemmap_pages; 54 + }; 55 + 56 + static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start, 57 + struct vmemmap_remap_walk *walk) 58 + { 59 + pmd_t __pmd; 60 + int i; 61 + unsigned long addr = start; 62 + struct page *page = pmd_page(*pmd); 63 + pte_t *pgtable = pte_alloc_one_kernel(&init_mm); 64 + 65 + if (!pgtable) 66 + return -ENOMEM; 67 + 68 + pmd_populate_kernel(&init_mm, &__pmd, pgtable); 69 + 70 + for (i = 0; i < PMD_SIZE / PAGE_SIZE; i++, addr += PAGE_SIZE) { 71 + pte_t entry, *pte; 72 + pgprot_t pgprot = PAGE_KERNEL; 73 + 74 + entry = mk_pte(page + i, pgprot); 75 + pte = pte_offset_kernel(&__pmd, addr); 76 + set_pte_at(&init_mm, addr, pte, entry); 77 + } 78 + 79 + /* Make pte visible before pmd. See comment in __pte_alloc(). 
*/ 80 + smp_wmb(); 81 + pmd_populate_kernel(&init_mm, pmd, pgtable); 82 + 83 + flush_tlb_kernel_range(start, start + PMD_SIZE); 84 + 85 + return 0; 86 + } 87 + 88 + static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr, 89 + unsigned long end, 90 + struct vmemmap_remap_walk *walk) 91 + { 92 + pte_t *pte = pte_offset_kernel(pmd, addr); 93 + 94 + /* 95 + * The reuse_page is found 'first' in table walk before we start 96 + * remapping (which is calling @walk->remap_pte). 97 + */ 98 + if (!walk->reuse_page) { 99 + walk->reuse_page = pte_page(*pte); 100 + /* 101 + * Because the reuse address is part of the range that we are 102 + * walking, skip the reuse address range. 103 + */ 104 + addr += PAGE_SIZE; 105 + pte++; 106 + walk->nr_walked++; 107 + } 108 + 109 + for (; addr != end; addr += PAGE_SIZE, pte++) { 110 + walk->remap_pte(pte, addr, walk); 111 + walk->nr_walked++; 112 + } 113 + } 114 + 115 + static int vmemmap_pmd_range(pud_t *pud, unsigned long addr, 116 + unsigned long end, 117 + struct vmemmap_remap_walk *walk) 118 + { 119 + pmd_t *pmd; 120 + unsigned long next; 121 + 122 + pmd = pmd_offset(pud, addr); 123 + do { 124 + if (pmd_leaf(*pmd)) { 125 + int ret; 126 + 127 + ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK, walk); 128 + if (ret) 129 + return ret; 130 + } 131 + next = pmd_addr_end(addr, end); 132 + vmemmap_pte_range(pmd, addr, next, walk); 133 + } while (pmd++, addr = next, addr != end); 134 + 135 + return 0; 136 + } 137 + 138 + static int vmemmap_pud_range(p4d_t *p4d, unsigned long addr, 139 + unsigned long end, 140 + struct vmemmap_remap_walk *walk) 141 + { 142 + pud_t *pud; 143 + unsigned long next; 144 + 145 + pud = pud_offset(p4d, addr); 146 + do { 147 + int ret; 148 + 149 + next = pud_addr_end(addr, end); 150 + ret = vmemmap_pmd_range(pud, addr, next, walk); 151 + if (ret) 152 + return ret; 153 + } while (pud++, addr = next, addr != end); 154 + 155 + return 0; 156 + } 157 + 158 + static int vmemmap_p4d_range(pgd_t *pgd, unsigned long 
addr, 159 + unsigned long end, 160 + struct vmemmap_remap_walk *walk) 161 + { 162 + p4d_t *p4d; 163 + unsigned long next; 164 + 165 + p4d = p4d_offset(pgd, addr); 166 + do { 167 + int ret; 168 + 169 + next = p4d_addr_end(addr, end); 170 + ret = vmemmap_pud_range(p4d, addr, next, walk); 171 + if (ret) 172 + return ret; 173 + } while (p4d++, addr = next, addr != end); 174 + 175 + return 0; 176 + } 177 + 178 + static int vmemmap_remap_range(unsigned long start, unsigned long end, 179 + struct vmemmap_remap_walk *walk) 180 + { 181 + unsigned long addr = start; 182 + unsigned long next; 183 + pgd_t *pgd; 184 + 185 + VM_BUG_ON(!IS_ALIGNED(start, PAGE_SIZE)); 186 + VM_BUG_ON(!IS_ALIGNED(end, PAGE_SIZE)); 187 + 188 + pgd = pgd_offset_k(addr); 189 + do { 190 + int ret; 191 + 192 + next = pgd_addr_end(addr, end); 193 + ret = vmemmap_p4d_range(pgd, addr, next, walk); 194 + if (ret) 195 + return ret; 196 + } while (pgd++, addr = next, addr != end); 197 + 198 + /* 199 + * We only change the mapping of the vmemmap virtual address range 200 + * [@start + PAGE_SIZE, end), so we only need to flush the TLB which 201 + * belongs to the range. 202 + */ 203 + flush_tlb_kernel_range(start + PAGE_SIZE, end); 204 + 205 + return 0; 206 + } 207 + 208 + /* 209 + * Free a vmemmap page. A vmemmap page can be allocated from the memblock 210 + * allocator or buddy allocator. If the PG_reserved flag is set, it means 211 + * that it allocated from the memblock allocator, just free it via the 212 + * free_bootmem_page(). Otherwise, use __free_page(). 
213 + */ 214 + static inline void free_vmemmap_page(struct page *page) 215 + { 216 + if (PageReserved(page)) 217 + free_bootmem_page(page); 218 + else 219 + __free_page(page); 220 + } 221 + 222 + /* Free a list of the vmemmap pages */ 223 + static void free_vmemmap_page_list(struct list_head *list) 224 + { 225 + struct page *page, *next; 226 + 227 + list_for_each_entry_safe(page, next, list, lru) { 228 + list_del(&page->lru); 229 + free_vmemmap_page(page); 230 + } 231 + } 232 + 233 + static void vmemmap_remap_pte(pte_t *pte, unsigned long addr, 234 + struct vmemmap_remap_walk *walk) 235 + { 236 + /* 237 + * Remap the tail pages as read-only to catch illegal write operation 238 + * to the tail pages. 239 + */ 240 + pgprot_t pgprot = PAGE_KERNEL_RO; 241 + pte_t entry = mk_pte(walk->reuse_page, pgprot); 242 + struct page *page = pte_page(*pte); 243 + 244 + list_add_tail(&page->lru, walk->vmemmap_pages); 245 + set_pte_at(&init_mm, addr, pte, entry); 246 + } 247 + 248 + static void vmemmap_restore_pte(pte_t *pte, unsigned long addr, 249 + struct vmemmap_remap_walk *walk) 250 + { 251 + pgprot_t pgprot = PAGE_KERNEL; 252 + struct page *page; 253 + void *to; 254 + 255 + BUG_ON(pte_page(*pte) != walk->reuse_page); 256 + 257 + page = list_first_entry(walk->vmemmap_pages, struct page, lru); 258 + list_del(&page->lru); 259 + to = page_to_virt(page); 260 + copy_page(to, (void *)walk->reuse_addr); 261 + 262 + set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot)); 263 + } 264 + 265 + /** 266 + * vmemmap_remap_free - remap the vmemmap virtual address range [@start, @end) 267 + * to the page which @reuse is mapped to, then free vmemmap 268 + * which the range are mapped to. 269 + * @start: start address of the vmemmap virtual address range that we want 270 + * to remap. 271 + * @end: end address of the vmemmap virtual address range that we want to 272 + * remap. 273 + * @reuse: reuse address. 274 + * 275 + * Return: %0 on success, negative error code otherwise. 
276 + */ 277 + int vmemmap_remap_free(unsigned long start, unsigned long end, 278 + unsigned long reuse) 279 + { 280 + int ret; 281 + LIST_HEAD(vmemmap_pages); 282 + struct vmemmap_remap_walk walk = { 283 + .remap_pte = vmemmap_remap_pte, 284 + .reuse_addr = reuse, 285 + .vmemmap_pages = &vmemmap_pages, 286 + }; 287 + 288 + /* 289 + * In order to make remapping routine most efficient for the huge pages, 290 + * the routine of vmemmap page table walking has the following rules 291 + * (see more details from the vmemmap_pte_range()): 292 + * 293 + * - The range [@start, @end) and the range [@reuse, @reuse + PAGE_SIZE) 294 + * should be continuous. 295 + * - The @reuse address is part of the range [@reuse, @end) that we are 296 + * walking which is passed to vmemmap_remap_range(). 297 + * - The @reuse address is the first in the complete range. 298 + * 299 + * So we need to make sure that @start and @reuse meet the above rules. 300 + */ 301 + BUG_ON(start - reuse != PAGE_SIZE); 302 + 303 + mmap_write_lock(&init_mm); 304 + ret = vmemmap_remap_range(reuse, end, &walk); 305 + mmap_write_downgrade(&init_mm); 306 + 307 + if (ret && walk.nr_walked) { 308 + end = reuse + walk.nr_walked * PAGE_SIZE; 309 + /* 310 + * vmemmap_pages contains pages from the previous 311 + * vmemmap_remap_range call which failed. These 312 + * are pages which were removed from the vmemmap. 313 + * They will be restored in the following call. 
314 + */ 315 + walk = (struct vmemmap_remap_walk) { 316 + .remap_pte = vmemmap_restore_pte, 317 + .reuse_addr = reuse, 318 + .vmemmap_pages = &vmemmap_pages, 319 + }; 320 + 321 + vmemmap_remap_range(reuse, end, &walk); 322 + } 323 + mmap_read_unlock(&init_mm); 324 + 325 + free_vmemmap_page_list(&vmemmap_pages); 326 + 327 + return ret; 328 + } 329 + 330 + static int alloc_vmemmap_page_list(unsigned long start, unsigned long end, 331 + gfp_t gfp_mask, struct list_head *list) 332 + { 333 + unsigned long nr_pages = (end - start) >> PAGE_SHIFT; 334 + int nid = page_to_nid((struct page *)start); 335 + struct page *page, *next; 336 + 337 + while (nr_pages--) { 338 + page = alloc_pages_node(nid, gfp_mask, 0); 339 + if (!page) 340 + goto out; 341 + list_add_tail(&page->lru, list); 342 + } 343 + 344 + return 0; 345 + out: 346 + list_for_each_entry_safe(page, next, list, lru) 347 + __free_pages(page, 0); 348 + return -ENOMEM; 349 + } 350 + 351 + /** 352 + * vmemmap_remap_alloc - remap the vmemmap virtual address range [@start, end) 353 + * to the page which is from the @vmemmap_pages 354 + * respectively. 355 + * @start: start address of the vmemmap virtual address range that we want 356 + * to remap. 357 + * @end: end address of the vmemmap virtual address range that we want to 358 + * remap. 359 + * @reuse: reuse address. 360 + * @gfp_mask: GFP flag for allocating vmemmap pages. 361 + * 362 + * Return: %0 on success, negative error code otherwise. 363 + */ 364 + int vmemmap_remap_alloc(unsigned long start, unsigned long end, 365 + unsigned long reuse, gfp_t gfp_mask) 366 + { 367 + LIST_HEAD(vmemmap_pages); 368 + struct vmemmap_remap_walk walk = { 369 + .remap_pte = vmemmap_restore_pte, 370 + .reuse_addr = reuse, 371 + .vmemmap_pages = &vmemmap_pages, 372 + }; 373 + 374 + /* See the comment in the vmemmap_remap_free(). 
*/ 375 + BUG_ON(start - reuse != PAGE_SIZE); 376 + 377 + if (alloc_vmemmap_page_list(start, end, gfp_mask, &vmemmap_pages)) 378 + return -ENOMEM; 379 + 380 + mmap_read_lock(&init_mm); 381 + vmemmap_remap_range(reuse, end, &walk); 382 + mmap_read_unlock(&init_mm); 383 + 384 + return 0; 385 + } 32 386 33 387 /* 34 388 * Allocate a block of memory to be used to back the virtual memory map
+1
mm/sparse.c
··· 13 13 #include <linux/vmalloc.h> 14 14 #include <linux/swap.h> 15 15 #include <linux/swapops.h> 16 + #include <linux/bootmem_info.h> 16 17 17 18 #include "internal.h" 18 19 #include <asm/dma.h>
+1 -1
mm/swap.c
··· 554 554 } else { 555 555 /* 556 556 * The page's writeback ends up during pagevec 557 - * We moves tha page into tail of inactive. 557 + * We move that page into tail of inactive. 558 558 */ 559 559 add_page_to_lru_list_tail(page, lruvec); 560 560 __count_vm_events(PGROTATED, nr_pages);
+1 -1
mm/swapfile.c
··· 2967 2967 return 0; 2968 2968 } 2969 2969 2970 - /* swap partition endianess hack... */ 2970 + /* swap partition endianness hack... */ 2971 2971 if (swab32(swap_header->info.version) == 1) { 2972 2972 swab32s(&swap_header->info.version); 2973 2973 swab32s(&swap_header->info.last_page);
+125 -100
mm/userfaultfd.c
··· 48 48 return dst_vma; 49 49 } 50 50 51 + /* 52 + * Install PTEs, to map dst_addr (within dst_vma) to page. 53 + * 54 + * This function handles both MCOPY_ATOMIC_NORMAL and _CONTINUE for both shmem 55 + * and anon, and for both shared and private VMAs. 56 + */ 57 + int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd, 58 + struct vm_area_struct *dst_vma, 59 + unsigned long dst_addr, struct page *page, 60 + bool newly_allocated, bool wp_copy) 61 + { 62 + int ret; 63 + pte_t _dst_pte, *dst_pte; 64 + bool writable = dst_vma->vm_flags & VM_WRITE; 65 + bool vm_shared = dst_vma->vm_flags & VM_SHARED; 66 + bool page_in_cache = page->mapping; 67 + spinlock_t *ptl; 68 + struct inode *inode; 69 + pgoff_t offset, max_off; 70 + 71 + _dst_pte = mk_pte(page, dst_vma->vm_page_prot); 72 + if (page_in_cache && !vm_shared) 73 + writable = false; 74 + if (writable || !page_in_cache) 75 + _dst_pte = pte_mkdirty(_dst_pte); 76 + if (writable) { 77 + if (wp_copy) 78 + _dst_pte = pte_mkuffd_wp(_dst_pte); 79 + else 80 + _dst_pte = pte_mkwrite(_dst_pte); 81 + } 82 + 83 + dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl); 84 + 85 + if (vma_is_shmem(dst_vma)) { 86 + /* serialize against truncate with the page table lock */ 87 + inode = dst_vma->vm_file->f_inode; 88 + offset = linear_page_index(dst_vma, dst_addr); 89 + max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE); 90 + ret = -EFAULT; 91 + if (unlikely(offset >= max_off)) 92 + goto out_unlock; 93 + } 94 + 95 + ret = -EEXIST; 96 + if (!pte_none(*dst_pte)) 97 + goto out_unlock; 98 + 99 + if (page_in_cache) 100 + page_add_file_rmap(page, false); 101 + else 102 + page_add_new_anon_rmap(page, dst_vma, dst_addr, false); 103 + 104 + /* 105 + * Must happen after rmap, as mm_counter() checks mapping (via 106 + * PageAnon()), which is set by __page_set_anon_rmap(). 
107 + */ 108 + inc_mm_counter(dst_mm, mm_counter(page)); 109 + 110 + if (newly_allocated) 111 + lru_cache_add_inactive_or_unevictable(page, dst_vma); 112 + 113 + set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte); 114 + 115 + /* No need to invalidate - it was non-present before */ 116 + update_mmu_cache(dst_vma, dst_addr, dst_pte); 117 + ret = 0; 118 + out_unlock: 119 + pte_unmap_unlock(dst_pte, ptl); 120 + return ret; 121 + } 122 + 51 123 static int mcopy_atomic_pte(struct mm_struct *dst_mm, 52 124 pmd_t *dst_pmd, 53 125 struct vm_area_struct *dst_vma, ··· 128 56 struct page **pagep, 129 57 bool wp_copy) 130 58 { 131 - pte_t _dst_pte, *dst_pte; 132 - spinlock_t *ptl; 133 59 void *page_kaddr; 134 60 int ret; 135 61 struct page *page; 136 - pgoff_t offset, max_off; 137 - struct inode *inode; 138 62 139 63 if (!*pagep) { 140 64 ret = -ENOMEM; ··· 167 99 if (mem_cgroup_charge(page, dst_mm, GFP_KERNEL)) 168 100 goto out_release; 169 101 170 - _dst_pte = pte_mkdirty(mk_pte(page, dst_vma->vm_page_prot)); 171 - if (dst_vma->vm_flags & VM_WRITE) { 172 - if (wp_copy) 173 - _dst_pte = pte_mkuffd_wp(_dst_pte); 174 - else 175 - _dst_pte = pte_mkwrite(_dst_pte); 176 - } 177 - 178 - dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl); 179 - if (dst_vma->vm_file) { 180 - /* the shmem MAP_PRIVATE case requires checking the i_size */ 181 - inode = dst_vma->vm_file->f_inode; 182 - offset = linear_page_index(dst_vma, dst_addr); 183 - max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE); 184 - ret = -EFAULT; 185 - if (unlikely(offset >= max_off)) 186 - goto out_release_uncharge_unlock; 187 - } 188 - ret = -EEXIST; 189 - if (!pte_none(*dst_pte)) 190 - goto out_release_uncharge_unlock; 191 - 192 - inc_mm_counter(dst_mm, MM_ANONPAGES); 193 - page_add_new_anon_rmap(page, dst_vma, dst_addr, false); 194 - lru_cache_add_inactive_or_unevictable(page, dst_vma); 195 - 196 - set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte); 197 - 198 - /* No need to invalidate - it was non-present before 
*/ 199 - update_mmu_cache(dst_vma, dst_addr, dst_pte); 200 - 201 - pte_unmap_unlock(dst_pte, ptl); 202 - ret = 0; 102 + ret = mfill_atomic_install_pte(dst_mm, dst_pmd, dst_vma, dst_addr, 103 + page, true, wp_copy); 104 + if (ret) 105 + goto out_release; 203 106 out: 204 107 return ret; 205 - out_release_uncharge_unlock: 206 - pte_unmap_unlock(dst_pte, ptl); 207 108 out_release: 208 109 put_page(page); 209 110 goto out; ··· 213 176 return ret; 214 177 } 215 178 179 + /* Handles UFFDIO_CONTINUE for all shmem VMAs (shared or private). */ 180 + static int mcontinue_atomic_pte(struct mm_struct *dst_mm, 181 + pmd_t *dst_pmd, 182 + struct vm_area_struct *dst_vma, 183 + unsigned long dst_addr, 184 + bool wp_copy) 185 + { 186 + struct inode *inode = file_inode(dst_vma->vm_file); 187 + pgoff_t pgoff = linear_page_index(dst_vma, dst_addr); 188 + struct page *page; 189 + int ret; 190 + 191 + ret = shmem_getpage(inode, pgoff, &page, SGP_READ); 192 + if (ret) 193 + goto out; 194 + if (!page) { 195 + ret = -EFAULT; 196 + goto out; 197 + } 198 + 199 + ret = mfill_atomic_install_pte(dst_mm, dst_pmd, dst_vma, dst_addr, 200 + page, false, wp_copy); 201 + if (ret) 202 + goto out_release; 203 + 204 + unlock_page(page); 205 + ret = 0; 206 + out: 207 + return ret; 208 + out_release: 209 + unlock_page(page); 210 + put_page(page); 211 + goto out; 212 + } 213 + 216 214 static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address) 217 215 { 218 216 pgd_t *pgd; ··· 281 209 unsigned long len, 282 210 enum mcopy_atomic_mode mode) 283 211 { 284 - int vm_alloc_shared = dst_vma->vm_flags & VM_SHARED; 285 212 int vm_shared = dst_vma->vm_flags & VM_SHARED; 286 213 ssize_t err; 287 214 pte_t *dst_pte; ··· 379 308 380 309 mutex_unlock(&hugetlb_fault_mutex_table[hash]); 381 310 i_mmap_unlock_read(mapping); 382 - vm_alloc_shared = vm_shared; 383 311 384 312 cond_resched(); 385 313 ··· 416 346 out_unlock: 417 347 mmap_read_unlock(dst_mm); 418 348 out: 419 - if (page) { 420 - /* 421 - * We 
encountered an error and are about to free a newly 422 - * allocated huge page. 423 - * 424 - * Reservation handling is very subtle, and is different for 425 - * private and shared mappings. See the routine 426 - * restore_reserve_on_error for details. Unfortunately, we 427 - * can not call restore_reserve_on_error now as it would 428 - * require holding mmap_lock. 429 - * 430 - * If a reservation for the page existed in the reservation 431 - * map of a private mapping, the map was modified to indicate 432 - * the reservation was consumed when the page was allocated. 433 - * We clear the HPageRestoreReserve flag now so that the global 434 - * reserve count will not be incremented in free_huge_page. 435 - * The reservation map will still indicate the reservation 436 - * was consumed and possibly prevent later page allocation. 437 - * This is better than leaking a global reservation. If no 438 - * reservation existed, it is still safe to clear 439 - * HPageRestoreReserve as no adjustments to reservation counts 440 - * were made during allocation. 441 - * 442 - * The reservation map for shared mappings indicates which 443 - * pages have reservations. When a huge page is allocated 444 - * for an address with a reservation, no change is made to 445 - * the reserve map. In this case HPageRestoreReserve will be 446 - * set to indicate that the global reservation count should be 447 - * incremented when the page is freed. This is the desired 448 - * behavior. However, when a huge page is allocated for an 449 - * address without a reservation a reservation entry is added 450 - * to the reservation map, and HPageRestoreReserve will not be 451 - * set. When the page is freed, the global reserve count will 452 - * NOT be incremented and it will appear as though we have 453 - * leaked reserved page. In this case, set HPageRestoreReserve 454 - * so that the global reserve count will be incremented to 455 - * match the reservation map entry which was created. 
456 - * 457 - * Note that vm_alloc_shared is based on the flags of the vma 458 - * for which the page was originally allocated. dst_vma could 459 - * be different or NULL on error. 460 - */ 461 - if (vm_alloc_shared) 462 - SetHPageRestoreReserve(page); 463 - else 464 - ClearHPageRestoreReserve(page); 349 + if (page) 465 350 put_page(page); 466 - } 467 351 BUG_ON(copied < 0); 468 352 BUG_ON(err > 0); 469 353 BUG_ON(!copied && !err); ··· 439 415 unsigned long dst_addr, 440 416 unsigned long src_addr, 441 417 struct page **page, 442 - bool zeropage, 418 + enum mcopy_atomic_mode mode, 443 419 bool wp_copy) 444 420 { 445 421 ssize_t err; 422 + 423 + if (mode == MCOPY_ATOMIC_CONTINUE) { 424 + return mcontinue_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr, 425 + wp_copy); 426 + } 446 427 447 428 /* 448 429 * The normal page fault path for a shmem will invoke the ··· 460 431 * and not in the radix tree. 461 432 */ 462 433 if (!(dst_vma->vm_flags & VM_SHARED)) { 463 - if (!zeropage) 434 + if (mode == MCOPY_ATOMIC_NORMAL) 464 435 err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma, 465 436 dst_addr, src_addr, page, 466 437 wp_copy); ··· 469 440 dst_vma, dst_addr); 470 441 } else { 471 442 VM_WARN_ON_ONCE(wp_copy); 472 - if (!zeropage) 473 - err = shmem_mcopy_atomic_pte(dst_mm, dst_pmd, 474 - dst_vma, dst_addr, 475 - src_addr, page); 476 - else 477 - err = shmem_mfill_zeropage_pte(dst_mm, dst_pmd, 478 - dst_vma, dst_addr); 443 + err = shmem_mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, 444 + dst_addr, src_addr, 445 + mode != MCOPY_ATOMIC_NORMAL, 446 + page); 479 447 } 480 448 481 449 return err; ··· 493 467 long copied; 494 468 struct page *page; 495 469 bool wp_copy; 496 - bool zeropage = (mcopy_mode == MCOPY_ATOMIC_ZEROPAGE); 497 470 498 471 /* 499 472 * Sanitize the command parameters: ··· 555 530 556 531 if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma)) 557 532 goto out_unlock; 558 - if (mcopy_mode == MCOPY_ATOMIC_CONTINUE) 533 + if (!vma_is_shmem(dst_vma) && mcopy_mode 
== MCOPY_ATOMIC_CONTINUE) 559 534 goto out_unlock; 560 535 561 536 /* ··· 603 578 BUG_ON(pmd_trans_huge(*dst_pmd)); 604 579 605 580 err = mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr, 606 - src_addr, &page, zeropage, wp_copy); 581 + src_addr, &page, mcopy_mode, wp_copy); 607 582 cond_resched(); 608 583 609 584 if (unlikely(err == -ENOENT)) {
+40
mm/util.c
··· 1010 1010 } 1011 1011 EXPORT_SYMBOL_GPL(mem_dump_obj); 1012 1012 #endif 1013 + 1014 + /* 1015 + * A driver might set a page logically offline -- PageOffline() -- and 1016 + * turn the page inaccessible in the hypervisor; after that, access to page 1017 + * content can be fatal. 1018 + * 1019 + * Some special PFN walkers -- i.e., /proc/kcore -- read content of random 1020 + * pages after checking PageOffline(); however, these PFN walkers can race 1021 + * with drivers that set PageOffline(). 1022 + * 1023 + * page_offline_freeze()/page_offline_thaw() allows for a subsystem to 1024 + * synchronize with such drivers, achieving that a page cannot be set 1025 + * PageOffline() while frozen. 1026 + * 1027 + * page_offline_begin()/page_offline_end() is used by drivers that care about 1028 + * such races when setting a page PageOffline(). 1029 + */ 1030 + static DECLARE_RWSEM(page_offline_rwsem); 1031 + 1032 + void page_offline_freeze(void) 1033 + { 1034 + down_read(&page_offline_rwsem); 1035 + } 1036 + 1037 + void page_offline_thaw(void) 1038 + { 1039 + up_read(&page_offline_rwsem); 1040 + } 1041 + 1042 + void page_offline_begin(void) 1043 + { 1044 + down_write(&page_offline_rwsem); 1045 + } 1046 + EXPORT_SYMBOL(page_offline_begin); 1047 + 1048 + void page_offline_end(void) 1049 + { 1050 + up_write(&page_offline_rwsem); 1051 + } 1052 + EXPORT_SYMBOL(page_offline_end);
+28 -9
mm/vmalloc.c
··· 25 25 #include <linux/notifier.h> 26 26 #include <linux/rbtree.h> 27 27 #include <linux/xarray.h> 28 + #include <linux/io.h> 28 29 #include <linux/rcupdate.h> 29 30 #include <linux/pfn.h> 30 31 #include <linux/kmemleak.h> ··· 37 36 #include <linux/overflow.h> 38 37 #include <linux/pgtable.h> 39 38 #include <linux/uaccess.h> 39 + #include <linux/hugetlb.h> 40 40 #include <asm/tlbflush.h> 41 41 #include <asm/shmparam.h> 42 42 ··· 85 83 /*** Page table manipulation functions ***/ 86 84 static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, 87 85 phys_addr_t phys_addr, pgprot_t prot, 88 - pgtbl_mod_mask *mask) 86 + unsigned int max_page_shift, pgtbl_mod_mask *mask) 89 87 { 90 88 pte_t *pte; 91 89 u64 pfn; 90 + unsigned long size = PAGE_SIZE; 92 91 93 92 pfn = phys_addr >> PAGE_SHIFT; 94 93 pte = pte_alloc_kernel_track(pmd, addr, mask); ··· 97 94 return -ENOMEM; 98 95 do { 99 96 BUG_ON(!pte_none(*pte)); 97 + 98 + #ifdef CONFIG_HUGETLB_PAGE 99 + size = arch_vmap_pte_range_map_size(addr, end, pfn, max_page_shift); 100 + if (size != PAGE_SIZE) { 101 + pte_t entry = pfn_pte(pfn, prot); 102 + 103 + entry = pte_mkhuge(entry); 104 + entry = arch_make_huge_pte(entry, ilog2(size), 0); 105 + set_huge_pte_at(&init_mm, addr, pte, entry); 106 + pfn += PFN_DOWN(size); 107 + continue; 108 + } 109 + #endif 100 110 set_pte_at(&init_mm, addr, pte, pfn_pte(pfn, prot)); 101 111 pfn++; 102 - } while (pte++, addr += PAGE_SIZE, addr != end); 112 + } while (pte += PFN_DOWN(size), addr += size, addr != end); 103 113 *mask |= PGTBL_PTE_MODIFIED; 104 114 return 0; 105 115 } ··· 161 145 continue; 162 146 } 163 147 164 - if (vmap_pte_range(pmd, addr, next, phys_addr, prot, mask)) 148 + if (vmap_pte_range(pmd, addr, next, phys_addr, prot, max_page_shift, mask)) 165 149 return -ENOMEM; 166 150 } while (pmd++, phys_addr += (next - addr), addr = next, addr != end); 167 151 return 0; ··· 1608 1592 /* for per-CPU blocks */ 1609 1593 static void 
purge_fragmented_blocks_allcpus(void); 1610 1594 1595 + #ifdef CONFIG_X86_64 1611 1596 /* 1612 1597 * called before a call to iounmap() if the caller wants vm_area_struct's 1613 1598 * immediately freed. ··· 1617 1600 { 1618 1601 atomic_long_set(&vmap_lazy_nr, lazy_max_pages()+1); 1619 1602 } 1603 + #endif /* CONFIG_X86_64 */ 1620 1604 1621 1605 /* 1622 1606 * Purges all lazily-freed vmap areas. ··· 2930 2912 return NULL; 2931 2913 } 2932 2914 2933 - if (vmap_allow_huge && !(vm_flags & VM_NO_HUGE_VMAP) && 2934 - arch_vmap_pmd_supported(prot)) { 2915 + if (vmap_allow_huge && !(vm_flags & VM_NO_HUGE_VMAP)) { 2935 2916 unsigned long size_per_node; 2936 2917 2937 2918 /* ··· 2943 2926 size_per_node = size; 2944 2927 if (node == NUMA_NO_NODE) 2945 2928 size_per_node /= num_online_nodes(); 2946 - if (size_per_node >= PMD_SIZE) { 2929 + if (arch_vmap_pmd_supported(prot) && size_per_node >= PMD_SIZE) 2947 2930 shift = PMD_SHIFT; 2948 - align = max(real_align, 1UL << shift); 2949 - size = ALIGN(real_size, 1UL << shift); 2950 - } 2931 + else 2932 + shift = arch_vmap_pte_supported_shift(size_per_node); 2933 + 2934 + align = max(real_align, 1UL << shift); 2935 + size = ALIGN(real_size, 1UL << shift); 2951 2936 } 2952 2937 2953 2938 again:
+18 -2
mm/vmscan.c
··· 1499 1499 if (unlikely(PageTransHuge(page))) 1500 1500 flags |= TTU_SPLIT_HUGE_PMD; 1501 1501 1502 - if (!try_to_unmap(page, flags)) { 1502 + try_to_unmap(page, flags); 1503 + if (page_mapped(page)) { 1503 1504 stat->nr_unmap_fail += nr_pages; 1504 1505 if (!was_swapbacked && PageSwapBacked(page)) 1505 1506 stat->nr_lazyfree_fail += nr_pages; ··· 1702 1701 unsigned int nr_reclaimed; 1703 1702 struct page *page, *next; 1704 1703 LIST_HEAD(clean_pages); 1704 + unsigned int noreclaim_flag; 1705 1705 1706 1706 list_for_each_entry_safe(page, next, page_list, lru) { 1707 1707 if (!PageHuge(page) && page_is_file_lru(page) && ··· 1713 1711 } 1714 1712 } 1715 1713 1714 + /* 1715 + * We should be safe here since we are only dealing with file pages and 1716 + * we are not kswapd and therefore cannot write dirty file pages. But 1717 + * call memalloc_noreclaim_save() anyway, just in case these conditions 1718 + * change in the future. 1719 + */ 1720 + noreclaim_flag = memalloc_noreclaim_save(); 1716 1721 nr_reclaimed = shrink_page_list(&clean_pages, zone->zone_pgdat, &sc, 1717 1722 &stat, true); 1723 + memalloc_noreclaim_restore(noreclaim_flag); 1724 + 1718 1725 list_splice(&clean_pages, page_list); 1719 1726 mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE, 1720 1727 -(long)nr_reclaimed); ··· 1821 1810 1822 1811 } 1823 1812 1824 - /** 1813 + /* 1825 1814 * Isolating page from the lruvec to fill in @dst list by nr_to_scan times. 1826 1815 * 1827 1816 * lruvec->lru_lock is heavily contended. 
Some of the functions that ··· 2317 2306 LIST_HEAD(node_page_list); 2318 2307 struct reclaim_stat dummy_stat; 2319 2308 struct page *page; 2309 + unsigned int noreclaim_flag; 2320 2310 struct scan_control sc = { 2321 2311 .gfp_mask = GFP_KERNEL, 2322 2312 .priority = DEF_PRIORITY, ··· 2325 2313 .may_unmap = 1, 2326 2314 .may_swap = 1, 2327 2315 }; 2316 + 2317 + noreclaim_flag = memalloc_noreclaim_save(); 2328 2318 2329 2319 while (!list_empty(page_list)) { 2330 2320 page = lru_to_page(page_list); ··· 2363 2349 putback_lru_page(page); 2364 2350 } 2365 2351 } 2352 + 2353 + memalloc_noreclaim_restore(noreclaim_flag); 2366 2354 2367 2355 return nr_reclaimed; 2368 2356 }
+6 -4
mm/workingset.c
··· 168 168 * refault distance will immediately activate the refaulting page. 169 169 */ 170 170 171 + #define WORKINGSET_SHIFT 1 171 172 #define EVICTION_SHIFT ((BITS_PER_LONG - BITS_PER_XA_VALUE) + \ 172 - 1 + NODES_SHIFT + MEM_CGROUP_ID_SHIFT) 173 + WORKINGSET_SHIFT + NODES_SHIFT + \ 174 + MEM_CGROUP_ID_SHIFT) 173 175 #define EVICTION_MASK (~0UL >> EVICTION_SHIFT) 174 176 175 177 /* ··· 191 189 eviction &= EVICTION_MASK; 192 190 eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid; 193 191 eviction = (eviction << NODES_SHIFT) | pgdat->node_id; 194 - eviction = (eviction << 1) | workingset; 192 + eviction = (eviction << WORKINGSET_SHIFT) | workingset; 195 193 196 194 return xa_mk_value(eviction); 197 195 } ··· 203 201 int memcgid, nid; 204 202 bool workingset; 205 203 206 - workingset = entry & 1; 207 - entry >>= 1; 204 + workingset = entry & ((1UL << WORKINGSET_SHIFT) - 1); 205 + entry >>= WORKINGSET_SHIFT; 208 206 nid = entry & ((1UL << NODES_SHIFT) - 1); 209 207 entry >>= NODES_SHIFT; 210 208 memcgid = entry & ((1UL << MEM_CGROUP_ID_SHIFT) - 1);
+16 -23
mm/z3fold.c
··· 62 62 #define ZHDR_SIZE_ALIGNED round_up(sizeof(struct z3fold_header), CHUNK_SIZE) 63 63 #define ZHDR_CHUNKS (ZHDR_SIZE_ALIGNED >> CHUNK_SHIFT) 64 64 #define TOTAL_CHUNKS (PAGE_SIZE >> CHUNK_SHIFT) 65 - #define NCHUNKS ((PAGE_SIZE - ZHDR_SIZE_ALIGNED) >> CHUNK_SHIFT) 65 + #define NCHUNKS (TOTAL_CHUNKS - ZHDR_CHUNKS) 66 66 67 67 #define BUDDY_MASK (0x3) 68 68 #define BUDDY_SHIFT 2 ··· 144 144 * @c_handle: cache for z3fold_buddy_slots allocation 145 145 * @ops: pointer to a structure of user defined operations specified at 146 146 * pool creation time. 147 + * @zpool: zpool driver 148 + * @zpool_ops: zpool operations structure with an evict callback 147 149 * @compact_wq: workqueue for page layout background optimization 148 150 * @release_wq: workqueue for safe page release 149 151 * @work: work_struct for safe page release ··· 255 253 spin_unlock(&zhdr->page_lock); 256 254 } 257 255 258 - 259 - static inline struct z3fold_header *__get_z3fold_header(unsigned long handle, 260 - bool lock) 256 + /* return locked z3fold page if it's not headless */ 257 + static inline struct z3fold_header *get_z3fold_header(unsigned long handle) 261 258 { 262 259 struct z3fold_buddy_slots *slots; 263 260 struct z3fold_header *zhdr; ··· 270 269 read_lock(&slots->lock); 271 270 addr = *(unsigned long *)handle; 272 271 zhdr = (struct z3fold_header *)(addr & PAGE_MASK); 273 - if (lock) 274 - locked = z3fold_page_trylock(zhdr); 272 + locked = z3fold_page_trylock(zhdr); 275 273 read_unlock(&slots->lock); 276 274 if (locked) 277 275 break; 278 276 cpu_relax(); 279 - } while (lock); 277 + } while (true); 280 278 } else { 281 279 zhdr = (struct z3fold_header *)(handle & PAGE_MASK); 282 280 } 283 281 284 282 return zhdr; 285 - } 286 - 287 - /* Returns the z3fold page where a given handle is stored */ 288 - static inline struct z3fold_header *handle_to_z3fold_header(unsigned long h) 289 - { 290 - return __get_z3fold_header(h, false); 291 - } 292 - 293 - /* return locked z3fold page if it's 
not headless */ 294 - static inline struct z3fold_header *get_z3fold_header(unsigned long h) 295 - { 296 - return __get_z3fold_header(h, true); 297 283 } 298 284 299 285 static inline void put_z3fold_header(struct z3fold_header *zhdr) ··· 986 998 goto out_c; 987 999 spin_lock_init(&pool->lock); 988 1000 spin_lock_init(&pool->stale_lock); 989 - pool->unbuddied = __alloc_percpu(sizeof(struct list_head)*NCHUNKS, 2); 1001 + pool->unbuddied = __alloc_percpu(sizeof(struct list_head) * NCHUNKS, 1002 + __alignof__(struct list_head)); 990 1003 if (!pool->unbuddied) 991 1004 goto out_pool; 992 1005 for_each_possible_cpu(cpu) { ··· 1048 1059 destroy_workqueue(pool->compact_wq); 1049 1060 destroy_workqueue(pool->release_wq); 1050 1061 z3fold_unregister_migration(pool); 1062 + free_percpu(pool->unbuddied); 1051 1063 kfree(pool); 1052 1064 } 1053 1065 ··· 1372 1382 if (zhdr->foreign_handles || 1373 1383 test_and_set_bit(PAGE_CLAIMED, &page->private)) { 1374 1384 if (kref_put(&zhdr->refcount, 1375 - release_z3fold_page)) 1385 + release_z3fold_page_locked)) 1376 1386 atomic64_dec(&pool->pages_nr); 1377 1387 else 1378 1388 z3fold_page_unlock(zhdr); ··· 1793 1803 { 1794 1804 int ret; 1795 1805 1796 - /* Make sure the z3fold header is not larger than the page size */ 1797 - BUILD_BUG_ON(ZHDR_SIZE_ALIGNED > PAGE_SIZE); 1806 + /* 1807 + * Make sure the z3fold header is not larger than the page size and 1808 + * there has remaining spaces for its buddy. 1809 + */ 1810 + BUILD_BUG_ON(ZHDR_SIZE_ALIGNED > PAGE_SIZE - CHUNK_SIZE); 1798 1811 ret = z3fold_mount(); 1799 1812 if (ret) 1800 1813 return ret;
+405 -402
mm/zbud.c
··· 51 51 #include <linux/preempt.h> 52 52 #include <linux/slab.h> 53 53 #include <linux/spinlock.h> 54 - #include <linux/zbud.h> 55 54 #include <linux/zpool.h> 56 55 57 56 /***************** ··· 72 73 #define ZHDR_SIZE_ALIGNED CHUNK_SIZE 73 74 #define NCHUNKS ((PAGE_SIZE - ZHDR_SIZE_ALIGNED) >> CHUNK_SHIFT) 74 75 76 + struct zbud_pool; 77 + 78 + struct zbud_ops { 79 + int (*evict)(struct zbud_pool *pool, unsigned long handle); 80 + }; 81 + 75 82 /** 76 83 * struct zbud_pool - stores metadata for each zbud pool 77 84 * @lock: protects all pool fields and first|last_chunk fields of any ··· 92 87 * @pages_nr: number of zbud pages in the pool. 93 88 * @ops: pointer to a structure of user defined operations specified at 94 89 * pool creation time. 90 + * @zpool: zpool driver 91 + * @zpool_ops: zpool operations structure with an evict callback 95 92 * 96 93 * This structure is allocated at pool creation time and maintains metadata 97 94 * pertaining to a particular zbud pool. 98 95 */ 99 96 struct zbud_pool { 100 97 spinlock_t lock; 101 - struct list_head unbuddied[NCHUNKS]; 102 - struct list_head buddied; 98 + union { 99 + /* 100 + * Reuse unbuddied[0] as buddied on the ground that 101 + * unbuddied[0] is unused. 
102 + */ 103 + struct list_head buddied; 104 + struct list_head unbuddied[NCHUNKS]; 105 + }; 103 106 struct list_head lru; 104 107 u64 pages_nr; 105 108 const struct zbud_ops *ops; 106 - #ifdef CONFIG_ZPOOL 107 109 struct zpool *zpool; 108 110 const struct zpool_ops *zpool_ops; 109 - #endif 110 111 }; 111 112 112 113 /* ··· 132 121 }; 133 122 134 123 /***************** 124 + * Helpers 125 + *****************/ 126 + /* Just to make the code easier to read */ 127 + enum buddy { 128 + FIRST, 129 + LAST 130 + }; 131 + 132 + /* Converts an allocation size in bytes to size in zbud chunks */ 133 + static int size_to_chunks(size_t size) 134 + { 135 + return (size + CHUNK_SIZE - 1) >> CHUNK_SHIFT; 136 + } 137 + 138 + #define for_each_unbuddied_list(_iter, _begin) \ 139 + for ((_iter) = (_begin); (_iter) < NCHUNKS; (_iter)++) 140 + 141 + /* Initializes the zbud header of a newly allocated zbud page */ 142 + static struct zbud_header *init_zbud_page(struct page *page) 143 + { 144 + struct zbud_header *zhdr = page_address(page); 145 + zhdr->first_chunks = 0; 146 + zhdr->last_chunks = 0; 147 + INIT_LIST_HEAD(&zhdr->buddy); 148 + INIT_LIST_HEAD(&zhdr->lru); 149 + zhdr->under_reclaim = false; 150 + return zhdr; 151 + } 152 + 153 + /* Resets the struct page fields and frees the page */ 154 + static void free_zbud_page(struct zbud_header *zhdr) 155 + { 156 + __free_page(virt_to_page(zhdr)); 157 + } 158 + 159 + /* 160 + * Encodes the handle of a particular buddy within a zbud page 161 + * Pool lock should be held as this function accesses first|last_chunks 162 + */ 163 + static unsigned long encode_handle(struct zbud_header *zhdr, enum buddy bud) 164 + { 165 + unsigned long handle; 166 + 167 + /* 168 + * For now, the encoded handle is actually just the pointer to the data 169 + * but this might not always be the case. A little information hiding. 170 + * Add CHUNK_SIZE to the handle if it is the first allocation to jump 171 + * over the zbud header in the first chunk. 
172 + */ 173 + handle = (unsigned long)zhdr; 174 + if (bud == FIRST) 175 + /* skip over zbud header */ 176 + handle += ZHDR_SIZE_ALIGNED; 177 + else /* bud == LAST */ 178 + handle += PAGE_SIZE - (zhdr->last_chunks << CHUNK_SHIFT); 179 + return handle; 180 + } 181 + 182 + /* Returns the zbud page where a given handle is stored */ 183 + static struct zbud_header *handle_to_zbud_header(unsigned long handle) 184 + { 185 + return (struct zbud_header *)(handle & PAGE_MASK); 186 + } 187 + 188 + /* Returns the number of free chunks in a zbud page */ 189 + static int num_free_chunks(struct zbud_header *zhdr) 190 + { 191 + /* 192 + * Rather than branch for different situations, just use the fact that 193 + * free buddies have a length of zero to simplify everything. 194 + */ 195 + return NCHUNKS - zhdr->first_chunks - zhdr->last_chunks; 196 + } 197 + 198 + /***************** 199 + * API Functions 200 + *****************/ 201 + /** 202 + * zbud_create_pool() - create a new zbud pool 203 + * @gfp: gfp flags when allocating the zbud pool structure 204 + * @ops: user-defined operations for the zbud pool 205 + * 206 + * Return: pointer to the new zbud pool or NULL if the metadata allocation 207 + * failed. 208 + */ 209 + static struct zbud_pool *zbud_create_pool(gfp_t gfp, const struct zbud_ops *ops) 210 + { 211 + struct zbud_pool *pool; 212 + int i; 213 + 214 + pool = kzalloc(sizeof(struct zbud_pool), gfp); 215 + if (!pool) 216 + return NULL; 217 + spin_lock_init(&pool->lock); 218 + for_each_unbuddied_list(i, 0) 219 + INIT_LIST_HEAD(&pool->unbuddied[i]); 220 + INIT_LIST_HEAD(&pool->buddied); 221 + INIT_LIST_HEAD(&pool->lru); 222 + pool->pages_nr = 0; 223 + pool->ops = ops; 224 + return pool; 225 + } 226 + 227 + /** 228 + * zbud_destroy_pool() - destroys an existing zbud pool 229 + * @pool: the zbud pool to be destroyed 230 + * 231 + * The pool should be emptied before this function is called. 
232 + */ 233 + static void zbud_destroy_pool(struct zbud_pool *pool) 234 + { 235 + kfree(pool); 236 + } 237 + 238 + /** 239 + * zbud_alloc() - allocates a region of a given size 240 + * @pool: zbud pool from which to allocate 241 + * @size: size in bytes of the desired allocation 242 + * @gfp: gfp flags used if the pool needs to grow 243 + * @handle: handle of the new allocation 244 + * 245 + * This function will attempt to find a free region in the pool large enough to 246 + * satisfy the allocation request. A search of the unbuddied lists is 247 + * performed first. If no suitable free region is found, then a new page is 248 + * allocated and added to the pool to satisfy the request. 249 + * 250 + * gfp should not set __GFP_HIGHMEM as highmem pages cannot be used 251 + * as zbud pool pages. 252 + * 253 + * Return: 0 if success and handle is set, otherwise -EINVAL if the size or 254 + * gfp arguments are invalid or -ENOMEM if the pool was unable to allocate 255 + * a new page. 256 + */ 257 + static int zbud_alloc(struct zbud_pool *pool, size_t size, gfp_t gfp, 258 + unsigned long *handle) 259 + { 260 + int chunks, i, freechunks; 261 + struct zbud_header *zhdr = NULL; 262 + enum buddy bud; 263 + struct page *page; 264 + 265 + if (!size || (gfp & __GFP_HIGHMEM)) 266 + return -EINVAL; 267 + if (size > PAGE_SIZE - ZHDR_SIZE_ALIGNED - CHUNK_SIZE) 268 + return -ENOSPC; 269 + chunks = size_to_chunks(size); 270 + spin_lock(&pool->lock); 271 + 272 + /* First, try to find an unbuddied zbud page. 
*/ 273 + for_each_unbuddied_list(i, chunks) { 274 + if (!list_empty(&pool->unbuddied[i])) { 275 + zhdr = list_first_entry(&pool->unbuddied[i], 276 + struct zbud_header, buddy); 277 + list_del(&zhdr->buddy); 278 + if (zhdr->first_chunks == 0) 279 + bud = FIRST; 280 + else 281 + bud = LAST; 282 + goto found; 283 + } 284 + } 285 + 286 + /* Couldn't find unbuddied zbud page, create new one */ 287 + spin_unlock(&pool->lock); 288 + page = alloc_page(gfp); 289 + if (!page) 290 + return -ENOMEM; 291 + spin_lock(&pool->lock); 292 + pool->pages_nr++; 293 + zhdr = init_zbud_page(page); 294 + bud = FIRST; 295 + 296 + found: 297 + if (bud == FIRST) 298 + zhdr->first_chunks = chunks; 299 + else 300 + zhdr->last_chunks = chunks; 301 + 302 + if (zhdr->first_chunks == 0 || zhdr->last_chunks == 0) { 303 + /* Add to unbuddied list */ 304 + freechunks = num_free_chunks(zhdr); 305 + list_add(&zhdr->buddy, &pool->unbuddied[freechunks]); 306 + } else { 307 + /* Add to buddied list */ 308 + list_add(&zhdr->buddy, &pool->buddied); 309 + } 310 + 311 + /* Add/move zbud page to beginning of LRU */ 312 + if (!list_empty(&zhdr->lru)) 313 + list_del(&zhdr->lru); 314 + list_add(&zhdr->lru, &pool->lru); 315 + 316 + *handle = encode_handle(zhdr, bud); 317 + spin_unlock(&pool->lock); 318 + 319 + return 0; 320 + } 321 + 322 + /** 323 + * zbud_free() - frees the allocation associated with the given handle 324 + * @pool: pool in which the allocation resided 325 + * @handle: handle associated with the allocation returned by zbud_alloc() 326 + * 327 + * In the case that the zbud page in which the allocation resides is under 328 + * reclaim, as indicated by the PG_reclaim flag being set, this function 329 + * only sets the first|last_chunks to 0. The page is actually freed 330 + * once both buddies are evicted (see zbud_reclaim_page() below). 
331 + */ 332 + static void zbud_free(struct zbud_pool *pool, unsigned long handle) 333 + { 334 + struct zbud_header *zhdr; 335 + int freechunks; 336 + 337 + spin_lock(&pool->lock); 338 + zhdr = handle_to_zbud_header(handle); 339 + 340 + /* If first buddy, handle will be page aligned */ 341 + if ((handle - ZHDR_SIZE_ALIGNED) & ~PAGE_MASK) 342 + zhdr->last_chunks = 0; 343 + else 344 + zhdr->first_chunks = 0; 345 + 346 + if (zhdr->under_reclaim) { 347 + /* zbud page is under reclaim, reclaim will free */ 348 + spin_unlock(&pool->lock); 349 + return; 350 + } 351 + 352 + /* Remove from existing buddy list */ 353 + list_del(&zhdr->buddy); 354 + 355 + if (zhdr->first_chunks == 0 && zhdr->last_chunks == 0) { 356 + /* zbud page is empty, free */ 357 + list_del(&zhdr->lru); 358 + free_zbud_page(zhdr); 359 + pool->pages_nr--; 360 + } else { 361 + /* Add to unbuddied list */ 362 + freechunks = num_free_chunks(zhdr); 363 + list_add(&zhdr->buddy, &pool->unbuddied[freechunks]); 364 + } 365 + 366 + spin_unlock(&pool->lock); 367 + } 368 + 369 + /** 370 + * zbud_reclaim_page() - evicts allocations from a pool page and frees it 371 + * @pool: pool from which a page will attempt to be evicted 372 + * @retries: number of pages on the LRU list for which eviction will 373 + * be attempted before failing 374 + * 375 + * zbud reclaim is different from normal system reclaim in that the reclaim is 376 + * done from the bottom, up. This is because only the bottom layer, zbud, has 377 + * information on how the allocations are organized within each zbud page. This 378 + * has the potential to create interesting locking situations between zbud and 379 + * the user, however. 380 + * 381 + * To avoid these, this is how zbud_reclaim_page() should be called: 382 + * 383 + * The user detects a page should be reclaimed and calls zbud_reclaim_page(). 
384 + * zbud_reclaim_page() will remove a zbud page from the pool LRU list and call 385 + * the user-defined eviction handler with the pool and handle as arguments. 386 + * 387 + * If the handle can not be evicted, the eviction handler should return 388 + * non-zero. zbud_reclaim_page() will add the zbud page back to the 389 + * appropriate list and try the next zbud page on the LRU up to 390 + * a user defined number of retries. 391 + * 392 + * If the handle is successfully evicted, the eviction handler should 393 + * return 0 _and_ should have called zbud_free() on the handle. zbud_free() 394 + * contains logic to delay freeing the page if the page is under reclaim, 395 + * as indicated by the setting of the PG_reclaim flag on the underlying page. 396 + * 397 + * If all buddies in the zbud page are successfully evicted, then the 398 + * zbud page can be freed. 399 + * 400 + * Returns: 0 if page is successfully freed, otherwise -EINVAL if there are 401 + * no pages to evict or an eviction handler is not registered, -EAGAIN if 402 + * the retry limit was hit. 
403 + */ 404 + static int zbud_reclaim_page(struct zbud_pool *pool, unsigned int retries) 405 + { 406 + int i, ret, freechunks; 407 + struct zbud_header *zhdr; 408 + unsigned long first_handle = 0, last_handle = 0; 409 + 410 + spin_lock(&pool->lock); 411 + if (!pool->ops || !pool->ops->evict || list_empty(&pool->lru) || 412 + retries == 0) { 413 + spin_unlock(&pool->lock); 414 + return -EINVAL; 415 + } 416 + for (i = 0; i < retries; i++) { 417 + zhdr = list_last_entry(&pool->lru, struct zbud_header, lru); 418 + list_del(&zhdr->lru); 419 + list_del(&zhdr->buddy); 420 + /* Protect zbud page against free */ 421 + zhdr->under_reclaim = true; 422 + /* 423 + * We need to encode the handles before unlocking, since we can 424 + * race with free that will set (first|last)_chunks to 0 425 + */ 426 + first_handle = 0; 427 + last_handle = 0; 428 + if (zhdr->first_chunks) 429 + first_handle = encode_handle(zhdr, FIRST); 430 + if (zhdr->last_chunks) 431 + last_handle = encode_handle(zhdr, LAST); 432 + spin_unlock(&pool->lock); 433 + 434 + /* Issue the eviction callback(s) */ 435 + if (first_handle) { 436 + ret = pool->ops->evict(pool, first_handle); 437 + if (ret) 438 + goto next; 439 + } 440 + if (last_handle) { 441 + ret = pool->ops->evict(pool, last_handle); 442 + if (ret) 443 + goto next; 444 + } 445 + next: 446 + spin_lock(&pool->lock); 447 + zhdr->under_reclaim = false; 448 + if (zhdr->first_chunks == 0 && zhdr->last_chunks == 0) { 449 + /* 450 + * Both buddies are now free, free the zbud page and 451 + * return success.
452 + */ 453 + free_zbud_page(zhdr); 454 + pool->pages_nr--; 455 + spin_unlock(&pool->lock); 456 + return 0; 457 + } else if (zhdr->first_chunks == 0 || 458 + zhdr->last_chunks == 0) { 459 + /* add to unbuddied list */ 460 + freechunks = num_free_chunks(zhdr); 461 + list_add(&zhdr->buddy, &pool->unbuddied[freechunks]); 462 + } else { 463 + /* add to buddied list */ 464 + list_add(&zhdr->buddy, &pool->buddied); 465 + } 466 + 467 + /* add to beginning of LRU */ 468 + list_add(&zhdr->lru, &pool->lru); 469 + } 470 + spin_unlock(&pool->lock); 471 + return -EAGAIN; 472 + } 473 + 474 + /** 475 + * zbud_map() - maps the allocation associated with the given handle 476 + * @pool: pool in which the allocation resides 477 + * @handle: handle associated with the allocation to be mapped 478 + * 479 + * While trivial for zbud, the mapping functions for other allocators 480 + * implementing this allocation API could have more complex information encoded 481 + * in the handle and could create temporary mappings to make the data 482 + * accessible to the user. 483 + * 484 + * Returns: a pointer to the mapped allocation 485 + */ 486 + static void *zbud_map(struct zbud_pool *pool, unsigned long handle) 487 + { 488 + return (void *)(handle); 489 + } 490 + 491 + /** 492 + * zbud_unmap() - unmaps the allocation associated with the given handle 493 + * @pool: pool in which the allocation resides 494 + * @handle: handle associated with the allocation to be unmapped 495 + */ 496 + static void zbud_unmap(struct zbud_pool *pool, unsigned long handle) 497 + { 498 + } 499 + 500 + /** 501 + * zbud_get_pool_size() - gets the zbud pool size in pages 502 + * @pool: pool whose size is being queried 503 + * 504 + * Returns: size in pages of the given pool. The pool lock need not be 505 + * taken to access pages_nr.
506 + */ 507 + static u64 zbud_get_pool_size(struct zbud_pool *pool) 508 + { 509 + return pool->pages_nr; 510 + } 511 + 512 + /***************** 135 513 * zpool 136 514 ****************/ 137 - 138 - #ifdef CONFIG_ZPOOL 139 515 140 516 static int zbud_zpool_evict(struct zbud_pool *pool, unsigned long handle) 141 517 { ··· 614 216 }; 615 217 616 218 MODULE_ALIAS("zpool-zbud"); 617 - #endif /* CONFIG_ZPOOL */ 618 - 619 - /***************** 620 - * Helpers 621 - *****************/ 622 - /* Just to make the code easier to read */ 623 - enum buddy { 624 - FIRST, 625 - LAST 626 - }; 627 - 628 - /* Converts an allocation size in bytes to size in zbud chunks */ 629 - static int size_to_chunks(size_t size) 630 - { 631 - return (size + CHUNK_SIZE - 1) >> CHUNK_SHIFT; 632 - } 633 - 634 - #define for_each_unbuddied_list(_iter, _begin) \ 635 - for ((_iter) = (_begin); (_iter) < NCHUNKS; (_iter)++) 636 - 637 - /* Initializes the zbud header of a newly allocated zbud page */ 638 - static struct zbud_header *init_zbud_page(struct page *page) 639 - { 640 - struct zbud_header *zhdr = page_address(page); 641 - zhdr->first_chunks = 0; 642 - zhdr->last_chunks = 0; 643 - INIT_LIST_HEAD(&zhdr->buddy); 644 - INIT_LIST_HEAD(&zhdr->lru); 645 - zhdr->under_reclaim = false; 646 - return zhdr; 647 - } 648 - 649 - /* Resets the struct page fields and frees the page */ 650 - static void free_zbud_page(struct zbud_header *zhdr) 651 - { 652 - __free_page(virt_to_page(zhdr)); 653 - } 654 - 655 - /* 656 - * Encodes the handle of a particular buddy within a zbud page 657 - * Pool lock should be held as this function accesses first|last_chunks 658 - */ 659 - static unsigned long encode_handle(struct zbud_header *zhdr, enum buddy bud) 660 - { 661 - unsigned long handle; 662 - 663 - /* 664 - * For now, the encoded handle is actually just the pointer to the data 665 - * but this might not always be the case. A little information hiding. 
666 - * Add CHUNK_SIZE to the handle if it is the first allocation to jump 667 - * over the zbud header in the first chunk. 668 - */ 669 - handle = (unsigned long)zhdr; 670 - if (bud == FIRST) 671 - /* skip over zbud header */ 672 - handle += ZHDR_SIZE_ALIGNED; 673 - else /* bud == LAST */ 674 - handle += PAGE_SIZE - (zhdr->last_chunks << CHUNK_SHIFT); 675 - return handle; 676 - } 677 - 678 - /* Returns the zbud page where a given handle is stored */ 679 - static struct zbud_header *handle_to_zbud_header(unsigned long handle) 680 - { 681 - return (struct zbud_header *)(handle & PAGE_MASK); 682 - } 683 - 684 - /* Returns the number of free chunks in a zbud page */ 685 - static int num_free_chunks(struct zbud_header *zhdr) 686 - { 687 - /* 688 - * Rather than branch for different situations, just use the fact that 689 - * free buddies have a length of zero to simplify everything. 690 - */ 691 - return NCHUNKS - zhdr->first_chunks - zhdr->last_chunks; 692 - } 693 - 694 - /***************** 695 - * API Functions 696 - *****************/ 697 - /** 698 - * zbud_create_pool() - create a new zbud pool 699 - * @gfp: gfp flags when allocating the zbud pool structure 700 - * @ops: user-defined operations for the zbud pool 701 - * 702 - * Return: pointer to the new zbud pool or NULL if the metadata allocation 703 - * failed. 
704 - */ 705 - struct zbud_pool *zbud_create_pool(gfp_t gfp, const struct zbud_ops *ops) 706 - { 707 - struct zbud_pool *pool; 708 - int i; 709 - 710 - pool = kzalloc(sizeof(struct zbud_pool), gfp); 711 - if (!pool) 712 - return NULL; 713 - spin_lock_init(&pool->lock); 714 - for_each_unbuddied_list(i, 0) 715 - INIT_LIST_HEAD(&pool->unbuddied[i]); 716 - INIT_LIST_HEAD(&pool->buddied); 717 - INIT_LIST_HEAD(&pool->lru); 718 - pool->pages_nr = 0; 719 - pool->ops = ops; 720 - return pool; 721 - } 722 - 723 - /** 724 - * zbud_destroy_pool() - destroys an existing zbud pool 725 - * @pool: the zbud pool to be destroyed 726 - * 727 - * The pool should be emptied before this function is called. 728 - */ 729 - void zbud_destroy_pool(struct zbud_pool *pool) 730 - { 731 - kfree(pool); 732 - } 733 - 734 - /** 735 - * zbud_alloc() - allocates a region of a given size 736 - * @pool: zbud pool from which to allocate 737 - * @size: size in bytes of the desired allocation 738 - * @gfp: gfp flags used if the pool needs to grow 739 - * @handle: handle of the new allocation 740 - * 741 - * This function will attempt to find a free region in the pool large enough to 742 - * satisfy the allocation request. A search of the unbuddied lists is 743 - * performed first. If no suitable free region is found, then a new page is 744 - * allocated and added to the pool to satisfy the request. 745 - * 746 - * gfp should not set __GFP_HIGHMEM as highmem pages cannot be used 747 - * as zbud pool pages. 748 - * 749 - * Return: 0 if success and handle is set, otherwise -EINVAL if the size or 750 - * gfp arguments are invalid or -ENOMEM if the pool was unable to allocate 751 - * a new page. 
752 - */ 753 - int zbud_alloc(struct zbud_pool *pool, size_t size, gfp_t gfp, 754 - unsigned long *handle) 755 - { 756 - int chunks, i, freechunks; 757 - struct zbud_header *zhdr = NULL; 758 - enum buddy bud; 759 - struct page *page; 760 - 761 - if (!size || (gfp & __GFP_HIGHMEM)) 762 - return -EINVAL; 763 - if (size > PAGE_SIZE - ZHDR_SIZE_ALIGNED - CHUNK_SIZE) 764 - return -ENOSPC; 765 - chunks = size_to_chunks(size); 766 - spin_lock(&pool->lock); 767 - 768 - /* First, try to find an unbuddied zbud page. */ 769 - for_each_unbuddied_list(i, chunks) { 770 - if (!list_empty(&pool->unbuddied[i])) { 771 - zhdr = list_first_entry(&pool->unbuddied[i], 772 - struct zbud_header, buddy); 773 - list_del(&zhdr->buddy); 774 - if (zhdr->first_chunks == 0) 775 - bud = FIRST; 776 - else 777 - bud = LAST; 778 - goto found; 779 - } 780 - } 781 - 782 - /* Couldn't find unbuddied zbud page, create new one */ 783 - spin_unlock(&pool->lock); 784 - page = alloc_page(gfp); 785 - if (!page) 786 - return -ENOMEM; 787 - spin_lock(&pool->lock); 788 - pool->pages_nr++; 789 - zhdr = init_zbud_page(page); 790 - bud = FIRST; 791 - 792 - found: 793 - if (bud == FIRST) 794 - zhdr->first_chunks = chunks; 795 - else 796 - zhdr->last_chunks = chunks; 797 - 798 - if (zhdr->first_chunks == 0 || zhdr->last_chunks == 0) { 799 - /* Add to unbuddied list */ 800 - freechunks = num_free_chunks(zhdr); 801 - list_add(&zhdr->buddy, &pool->unbuddied[freechunks]); 802 - } else { 803 - /* Add to buddied list */ 804 - list_add(&zhdr->buddy, &pool->buddied); 805 - } 806 - 807 - /* Add/move zbud page to beginning of LRU */ 808 - if (!list_empty(&zhdr->lru)) 809 - list_del(&zhdr->lru); 810 - list_add(&zhdr->lru, &pool->lru); 811 - 812 - *handle = encode_handle(zhdr, bud); 813 - spin_unlock(&pool->lock); 814 - 815 - return 0; 816 - } 817 - 818 - /** 819 - * zbud_free() - frees the allocation associated with the given handle 820 - * @pool: pool in which the allocation resided 821 - * @handle: handle associated with the 
allocation returned by zbud_alloc() 822 - * 823 - * In the case that the zbud page in which the allocation resides is under 824 - * reclaim, as indicated by the PG_reclaim flag being set, this function 825 - * only sets the first|last_chunks to 0. The page is actually freed 826 - * once both buddies are evicted (see zbud_reclaim_page() below). 827 - */ 828 - void zbud_free(struct zbud_pool *pool, unsigned long handle) 829 - { 830 - struct zbud_header *zhdr; 831 - int freechunks; 832 - 833 - spin_lock(&pool->lock); 834 - zhdr = handle_to_zbud_header(handle); 835 - 836 - /* If first buddy, handle will be page aligned */ 837 - if ((handle - ZHDR_SIZE_ALIGNED) & ~PAGE_MASK) 838 - zhdr->last_chunks = 0; 839 - else 840 - zhdr->first_chunks = 0; 841 - 842 - if (zhdr->under_reclaim) { 843 - /* zbud page is under reclaim, reclaim will free */ 844 - spin_unlock(&pool->lock); 845 - return; 846 - } 847 - 848 - /* Remove from existing buddy list */ 849 - list_del(&zhdr->buddy); 850 - 851 - if (zhdr->first_chunks == 0 && zhdr->last_chunks == 0) { 852 - /* zbud page is empty, free */ 853 - list_del(&zhdr->lru); 854 - free_zbud_page(zhdr); 855 - pool->pages_nr--; 856 - } else { 857 - /* Add to unbuddied list */ 858 - freechunks = num_free_chunks(zhdr); 859 - list_add(&zhdr->buddy, &pool->unbuddied[freechunks]); 860 - } 861 - 862 - spin_unlock(&pool->lock); 863 - } 864 - 865 - /** 866 - * zbud_reclaim_page() - evicts allocations from a pool page and frees it 867 - * @pool: pool from which a page will attempt to be evicted 868 - * @retries: number of pages on the LRU list for which eviction will 869 - * be attempted before failing 870 - * 871 - * zbud reclaim is different from normal system reclaim in that the reclaim is 872 - * done from the bottom, up. This is because only the bottom layer, zbud, has 873 - * information on how the allocations are organized within each zbud page. 
This 874 - * has the potential to create interesting locking situations between zbud and 875 - * the user, however. 876 - * 877 - * To avoid these, this is how zbud_reclaim_page() should be called: 878 - * 879 - * The user detects a page should be reclaimed and calls zbud_reclaim_page(). 880 - * zbud_reclaim_page() will remove a zbud page from the pool LRU list and call 881 - * the user-defined eviction handler with the pool and handle as arguments. 882 - * 883 - * If the handle can not be evicted, the eviction handler should return 884 - * non-zero. zbud_reclaim_page() will add the zbud page back to the 885 - * appropriate list and try the next zbud page on the LRU up to 886 - * a user defined number of retries. 887 - * 888 - * If the handle is successfully evicted, the eviction handler should 889 - * return 0 _and_ should have called zbud_free() on the handle. zbud_free() 890 - * contains logic to delay freeing the page if the page is under reclaim, 891 - * as indicated by the setting of the PG_reclaim flag on the underlying page. 892 - * 893 - * If all buddies in the zbud page are successfully evicted, then the 894 - * zbud page can be freed. 895 - * 896 - * Returns: 0 if page is successfully freed, otherwise -EINVAL if there are 897 - * no pages to evict or an eviction handler is not registered, -EAGAIN if 898 - * the retry limit was hit. 
899 - */ 900 - int zbud_reclaim_page(struct zbud_pool *pool, unsigned int retries) 901 - { 902 - int i, ret, freechunks; 903 - struct zbud_header *zhdr; 904 - unsigned long first_handle = 0, last_handle = 0; 905 - 906 - spin_lock(&pool->lock); 907 - if (!pool->ops || !pool->ops->evict || list_empty(&pool->lru) || 908 - retries == 0) { 909 - spin_unlock(&pool->lock); 910 - return -EINVAL; 911 - } 912 - for (i = 0; i < retries; i++) { 913 - zhdr = list_last_entry(&pool->lru, struct zbud_header, lru); 914 - list_del(&zhdr->lru); 915 - list_del(&zhdr->buddy); 916 - /* Protect zbud page against free */ 917 - zhdr->under_reclaim = true; 918 - /* 919 - * We need encode the handles before unlocking, since we can 920 - * race with free that will set (first|last)_chunks to 0 921 - */ 922 - first_handle = 0; 923 - last_handle = 0; 924 - if (zhdr->first_chunks) 925 - first_handle = encode_handle(zhdr, FIRST); 926 - if (zhdr->last_chunks) 927 - last_handle = encode_handle(zhdr, LAST); 928 - spin_unlock(&pool->lock); 929 - 930 - /* Issue the eviction callback(s) */ 931 - if (first_handle) { 932 - ret = pool->ops->evict(pool, first_handle); 933 - if (ret) 934 - goto next; 935 - } 936 - if (last_handle) { 937 - ret = pool->ops->evict(pool, last_handle); 938 - if (ret) 939 - goto next; 940 - } 941 - next: 942 - spin_lock(&pool->lock); 943 - zhdr->under_reclaim = false; 944 - if (zhdr->first_chunks == 0 && zhdr->last_chunks == 0) { 945 - /* 946 - * Both buddies are now free, free the zbud page and 947 - * return success. 
948 - */ 949 - free_zbud_page(zhdr); 950 - pool->pages_nr--; 951 - spin_unlock(&pool->lock); 952 - return 0; 953 - } else if (zhdr->first_chunks == 0 || 954 - zhdr->last_chunks == 0) { 955 - /* add to unbuddied list */ 956 - freechunks = num_free_chunks(zhdr); 957 - list_add(&zhdr->buddy, &pool->unbuddied[freechunks]); 958 - } else { 959 - /* add to buddied list */ 960 - list_add(&zhdr->buddy, &pool->buddied); 961 - } 962 - 963 - /* add to beginning of LRU */ 964 - list_add(&zhdr->lru, &pool->lru); 965 - } 966 - spin_unlock(&pool->lock); 967 - return -EAGAIN; 968 - } 969 - 970 - /** 971 - * zbud_map() - maps the allocation associated with the given handle 972 - * @pool: pool in which the allocation resides 973 - * @handle: handle associated with the allocation to be mapped 974 - * 975 - * While trivial for zbud, the mapping functions for others allocators 976 - * implementing this allocation API could have more complex information encoded 977 - * in the handle and could create temporary mappings to make the data 978 - * accessible to the user. 979 - * 980 - * Returns: a pointer to the mapped allocation 981 - */ 982 - void *zbud_map(struct zbud_pool *pool, unsigned long handle) 983 - { 984 - return (void *)(handle); 985 - } 986 - 987 - /** 988 - * zbud_unmap() - maps the allocation associated with the given handle 989 - * @pool: pool in which the allocation resides 990 - * @handle: handle associated with the allocation to be unmapped 991 - */ 992 - void zbud_unmap(struct zbud_pool *pool, unsigned long handle) 993 - { 994 - } 995 - 996 - /** 997 - * zbud_get_pool_size() - gets the zbud pool size in pages 998 - * @pool: pool whose size is being queried 999 - * 1000 - * Returns: size in pages of the given pool. The pool lock need not be 1001 - * taken to access pages_nr. 
1002 - */ 1003 - u64 zbud_get_pool_size(struct zbud_pool *pool) 1004 - { 1005 - return pool->pages_nr; 1006 - } 1007 219 1008 220 static int __init init_zbud(void) 1009 221 { ··· 621 613 BUILD_BUG_ON(sizeof(struct zbud_header) > ZHDR_SIZE_ALIGNED); 622 614 pr_info("loaded\n"); 623 615 624 - #ifdef CONFIG_ZPOOL 625 616 zpool_register_driver(&zbud_zpool_driver); 626 - #endif 627 617 628 618 return 0; 629 619 } 630 620 631 621 static void __exit exit_zbud(void) 632 622 { 633 - #ifdef CONFIG_ZPOOL 634 623 zpool_unregister_driver(&zbud_zpool_driver); 635 - #endif 636 - 637 624 pr_info("unloaded\n"); 638 625 } 639 626
+1 -2
mm/zsmalloc.c
··· 1471 1471 unsigned int f_objidx; 1472 1472 void *vaddr; 1473 1473 1474 - obj &= ~OBJ_ALLOCATED_TAG; 1475 1474 obj_to_location(obj, &f_page, &f_objidx); 1476 1475 f_offset = (class->size * f_objidx) & ~PAGE_MASK; 1477 1476 zspage = get_zspage(f_page); ··· 2162 2163 VM_BUG_ON(fullness != ZS_EMPTY); 2163 2164 class = pool->size_class[class_idx]; 2164 2165 spin_lock(&class->lock); 2165 - __free_zspage(pool, pool->size_class[class_idx], zspage); 2166 + __free_zspage(pool, class, zspage); 2166 2167 spin_unlock(&class->lock); 2167 2168 } 2168 2169 };
+8 -18
mm/zswap.c
··· 967 967 spin_unlock(&tree->lock); 968 968 BUG_ON(offset != entry->offset); 969 969 970 + src = (u8 *)zhdr + sizeof(struct zswap_header); 971 + if (!zpool_can_sleep_mapped(pool)) { 972 + memcpy(tmp, src, entry->length); 973 + src = tmp; 974 + zpool_unmap_handle(pool, handle); 975 + } 976 + 970 977 /* try to allocate swap cache page */ 971 978 switch (zswap_get_swap_cache_page(swpentry, &page)) { 972 979 case ZSWAP_SWAPCACHE_FAIL: /* no memory or invalidate happened */ ··· 989 982 case ZSWAP_SWAPCACHE_NEW: /* page is locked */ 990 983 /* decompress */ 991 984 acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx); 992 - 993 985 dlen = PAGE_SIZE; 994 - src = (u8 *)zhdr + sizeof(struct zswap_header); 995 - 996 - if (!zpool_can_sleep_mapped(pool)) { 997 - 998 - memcpy(tmp, src, entry->length); 999 - src = tmp; 1000 - 1001 - zpool_unmap_handle(pool, handle); 1002 - } 1003 986 1004 987 mutex_lock(acomp_ctx->mutex); 1005 988 sg_init_one(&input, src, entry->length); ··· 1200 1203 zswap_reject_alloc_fail++; 1201 1204 goto put_dstmem; 1202 1205 } 1203 - buf = zpool_map_handle(entry->pool->zpool, handle, ZPOOL_MM_RW); 1206 + buf = zpool_map_handle(entry->pool->zpool, handle, ZPOOL_MM_WO); 1204 1207 memcpy(buf, &zhdr, hlen); 1205 1208 memcpy(buf + hlen, dst, dlen); 1206 1209 zpool_unmap_handle(entry->pool->zpool, handle); ··· 1424 1427 1425 1428 return 0; 1426 1429 } 1427 - 1428 - static void __exit zswap_debugfs_exit(void) 1429 - { 1430 - debugfs_remove_recursive(zswap_debugfs_root); 1431 - } 1432 1430 #else 1433 1431 static int __init zswap_debugfs_init(void) 1434 1432 { 1435 1433 return 0; 1436 1434 } 1437 - 1438 - static void __exit zswap_debugfs_exit(void) { } 1439 1435 #endif 1440 1436 1441 1437 /*********************************
+10 -6
scripts/checkpatch.pl
··· 1084 1084 sub is_SPDX_License_valid { 1085 1085 my ($license) = @_; 1086 1086 1087 - return 1 if (!$tree || which("python") eq "" || !(-e "$root/scripts/spdxcheck.py") || !(-e "$gitroot")); 1087 + return 1 if (!$tree || which("python3") eq "" || !(-x "$root/scripts/spdxcheck.py") || !(-e "$gitroot")); 1088 1088 1089 1089 my $root_path = abs_path($root); 1090 - my $status = `cd "$root_path"; echo "$license" | python scripts/spdxcheck.py -`; 1090 + my $status = `cd "$root_path"; echo "$license" | scripts/spdxcheck.py -`; 1091 1091 return 0 if ($status ne ""); 1092 1092 return 1; 1093 1093 } ··· 5361 5361 } 5362 5362 } 5363 5363 5364 - #goto labels aren't indented, allow a single space however 5365 - if ($line=~/^.\s+[A-Za-z\d_]+:(?![0-9]+)/ and 5366 - !($line=~/^. [A-Za-z\d_]+:/) and !($line=~/^.\s+default:/)) { 5364 + # check that goto labels aren't indented (allow a single space indentation) 5365 + # and ignore bitfield definitions like foo:1 5366 + # Strictly, labels can have whitespace after the identifier and before the : 5367 + # but this is not allowed here as many ?: uses would appear to be labels 5368 + if ($sline =~ /^.\s+[A-Za-z_][A-Za-z\d_]*:(?!\s*\d+)/ && 5369 + $sline !~ /^. [A-Za-z\d_][A-Za-z\d_]*:/ && 5370 + $sline !~ /^.\s+default:/) { 5367 5371 if (WARN("INDENTED_LABEL", 5368 5372 "labels should not be indented\n" . $herecurr) && 5369 5373 $fix) { ··· 5462 5458 # Return of what appears to be an errno should normally be negative 5463 5459 if ($sline =~ /\breturn(?:\s*\(+\s*|\s+)(E[A-Z]+)(?:\s*\)+\s*|\s*)[;:,]/) { 5464 5460 my $name = $1; 5465 - if ($name ne 'EOF' && $name ne 'ERROR') { 5461 + if ($name ne 'EOF' && $name ne 'ERROR' && $name !~ /^EPOLL/) { 5466 5462 WARN("USE_NEGATIVE_ERRNO", 5467 5463 "return of an errno should typically be negative (ie: return -$1)\n" . $herecurr); 5468 5464 }
+3
tools/testing/selftests/vm/.gitignore
··· 12 12 on-fault-limit 13 13 transhuge-stress 14 14 protection_keys 15 + protection_keys_32 16 + protection_keys_64 17 + madv_populate 15 18 userfaultfd 16 19 mlock-intersect-test 17 20 mlock-random-test
+3 -2
tools/testing/selftests/vm/Makefile
··· 31 31 TEST_GEN_FILES += hugepage-mmap 32 32 TEST_GEN_FILES += hugepage-shm 33 33 TEST_GEN_FILES += khugepaged 34 + TEST_GEN_FILES += madv_populate 34 35 TEST_GEN_FILES += map_fixed_noreplace 35 36 TEST_GEN_FILES += map_hugetlb 36 37 TEST_GEN_FILES += map_populate ··· 101 100 endef 102 101 103 102 ifeq ($(CAN_BUILD_I386),1) 104 - $(BINARIES_32): CFLAGS += -m32 103 + $(BINARIES_32): CFLAGS += -m32 -mxsave 105 104 $(BINARIES_32): LDLIBS += -lrt -ldl -lm 106 105 $(BINARIES_32): $(OUTPUT)/%_32: %.c 107 106 $(CC) $(CFLAGS) $(EXTRA_CFLAGS) $(notdir $^) $(LDLIBS) -o $@ ··· 109 108 endif 110 109 111 110 ifeq ($(CAN_BUILD_X86_64),1) 112 - $(BINARIES_64): CFLAGS += -m64 111 + $(BINARIES_64): CFLAGS += -m64 -mxsave 113 112 $(BINARIES_64): LDLIBS += -lrt -ldl 114 113 $(BINARIES_64): $(OUTPUT)/%_64: %.c 115 114 $(CC) $(CFLAGS) $(EXTRA_CFLAGS) $(notdir $^) $(LDLIBS) -o $@
+158
tools/testing/selftests/vm/hmm-tests.c
··· 1485 1485 hmm_buffer_free(buffer); 1486 1486 } 1487 1487 1488 + /* 1489 + * Basic check of exclusive faulting. 1490 + */ 1491 + TEST_F(hmm, exclusive) 1492 + { 1493 + struct hmm_buffer *buffer; 1494 + unsigned long npages; 1495 + unsigned long size; 1496 + unsigned long i; 1497 + int *ptr; 1498 + int ret; 1499 + 1500 + npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift; 1501 + ASSERT_NE(npages, 0); 1502 + size = npages << self->page_shift; 1503 + 1504 + buffer = malloc(sizeof(*buffer)); 1505 + ASSERT_NE(buffer, NULL); 1506 + 1507 + buffer->fd = -1; 1508 + buffer->size = size; 1509 + buffer->mirror = malloc(size); 1510 + ASSERT_NE(buffer->mirror, NULL); 1511 + 1512 + buffer->ptr = mmap(NULL, size, 1513 + PROT_READ | PROT_WRITE, 1514 + MAP_PRIVATE | MAP_ANONYMOUS, 1515 + buffer->fd, 0); 1516 + ASSERT_NE(buffer->ptr, MAP_FAILED); 1517 + 1518 + /* Initialize buffer in system memory. */ 1519 + for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i) 1520 + ptr[i] = i; 1521 + 1522 + /* Map memory exclusively for device access. */ 1523 + ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_EXCLUSIVE, buffer, npages); 1524 + ASSERT_EQ(ret, 0); 1525 + ASSERT_EQ(buffer->cpages, npages); 1526 + 1527 + /* Check what the device read. */ 1528 + for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i) 1529 + ASSERT_EQ(ptr[i], i); 1530 + 1531 + /* Fault pages back to system memory and check them. 
*/ 1532 + for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i) 1533 + ASSERT_EQ(ptr[i]++, i); 1534 + 1535 + for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i) 1536 + ASSERT_EQ(ptr[i], i+1); 1537 + 1538 + /* Check atomic access revoked */ 1539 + ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_CHECK_EXCLUSIVE, buffer, npages); 1540 + ASSERT_EQ(ret, 0); 1541 + 1542 + hmm_buffer_free(buffer); 1543 + } 1544 + 1545 + TEST_F(hmm, exclusive_mprotect) 1546 + { 1547 + struct hmm_buffer *buffer; 1548 + unsigned long npages; 1549 + unsigned long size; 1550 + unsigned long i; 1551 + int *ptr; 1552 + int ret; 1553 + 1554 + npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift; 1555 + ASSERT_NE(npages, 0); 1556 + size = npages << self->page_shift; 1557 + 1558 + buffer = malloc(sizeof(*buffer)); 1559 + ASSERT_NE(buffer, NULL); 1560 + 1561 + buffer->fd = -1; 1562 + buffer->size = size; 1563 + buffer->mirror = malloc(size); 1564 + ASSERT_NE(buffer->mirror, NULL); 1565 + 1566 + buffer->ptr = mmap(NULL, size, 1567 + PROT_READ | PROT_WRITE, 1568 + MAP_PRIVATE | MAP_ANONYMOUS, 1569 + buffer->fd, 0); 1570 + ASSERT_NE(buffer->ptr, MAP_FAILED); 1571 + 1572 + /* Initialize buffer in system memory. */ 1573 + for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i) 1574 + ptr[i] = i; 1575 + 1576 + /* Map memory exclusively for device access. */ 1577 + ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_EXCLUSIVE, buffer, npages); 1578 + ASSERT_EQ(ret, 0); 1579 + ASSERT_EQ(buffer->cpages, npages); 1580 + 1581 + /* Check what the device read. */ 1582 + for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i) 1583 + ASSERT_EQ(ptr[i], i); 1584 + 1585 + ret = mprotect(buffer->ptr, size, PROT_READ); 1586 + ASSERT_EQ(ret, 0); 1587 + 1588 + /* Simulate a device writing system memory. 
*/ 1589 + ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_WRITE, buffer, npages); 1590 + ASSERT_EQ(ret, -EPERM); 1591 + 1592 + hmm_buffer_free(buffer); 1593 + } 1594 + 1595 + /* 1596 + * Check copy-on-write works. 1597 + */ 1598 + TEST_F(hmm, exclusive_cow) 1599 + { 1600 + struct hmm_buffer *buffer; 1601 + unsigned long npages; 1602 + unsigned long size; 1603 + unsigned long i; 1604 + int *ptr; 1605 + int ret; 1606 + 1607 + npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift; 1608 + ASSERT_NE(npages, 0); 1609 + size = npages << self->page_shift; 1610 + 1611 + buffer = malloc(sizeof(*buffer)); 1612 + ASSERT_NE(buffer, NULL); 1613 + 1614 + buffer->fd = -1; 1615 + buffer->size = size; 1616 + buffer->mirror = malloc(size); 1617 + ASSERT_NE(buffer->mirror, NULL); 1618 + 1619 + buffer->ptr = mmap(NULL, size, 1620 + PROT_READ | PROT_WRITE, 1621 + MAP_PRIVATE | MAP_ANONYMOUS, 1622 + buffer->fd, 0); 1623 + ASSERT_NE(buffer->ptr, MAP_FAILED); 1624 + 1625 + /* Initialize buffer in system memory. */ 1626 + for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i) 1627 + ptr[i] = i; 1628 + 1629 + /* Map memory exclusively for device access. */ 1630 + ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_EXCLUSIVE, buffer, npages); 1631 + ASSERT_EQ(ret, 0); 1632 + ASSERT_EQ(buffer->cpages, npages); 1633 + 1634 + fork(); 1635 + 1636 + /* Fault pages back to system memory and check them. */ 1637 + for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i) 1638 + ASSERT_EQ(ptr[i]++, i); 1639 + 1640 + for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i) 1641 + ASSERT_EQ(ptr[i], i+1); 1642 + 1643 + hmm_buffer_free(buffer); 1644 + } 1645 + 1488 1646 TEST_HARNESS_MAIN
-4
tools/testing/selftests/vm/khugepaged.c
··· 86 86 enum thp_enabled thp_enabled; 87 87 enum thp_defrag thp_defrag; 88 88 enum shmem_enabled shmem_enabled; 89 - bool debug_cow; 90 89 bool use_zero_page; 91 90 struct khugepaged_settings khugepaged; 92 91 }; ··· 94 95 .thp_enabled = THP_MADVISE, 95 96 .thp_defrag = THP_DEFRAG_ALWAYS, 96 97 .shmem_enabled = SHMEM_NEVER, 97 - .debug_cow = 0, 98 98 .use_zero_page = 0, 99 99 .khugepaged = { 100 100 .defrag = 1, ··· 266 268 write_string("defrag", thp_defrag_strings[settings->thp_defrag]); 267 269 write_string("shmem_enabled", 268 270 shmem_enabled_strings[settings->shmem_enabled]); 269 - write_num("debug_cow", settings->debug_cow); 270 271 write_num("use_zero_page", settings->use_zero_page); 271 272 272 273 write_num("khugepaged/defrag", khugepaged->defrag); ··· 301 304 .thp_defrag = read_string("defrag", thp_defrag_strings), 302 305 .shmem_enabled = 303 306 read_string("shmem_enabled", shmem_enabled_strings), 304 - .debug_cow = read_num("debug_cow"), 305 307 .use_zero_page = read_num("use_zero_page"), 306 308 }; 307 309 saved_settings.khugepaged = (struct khugepaged_settings) {
+342
tools/testing/selftests/vm/madv_populate.c
··· 1 + // SPDX-License-Identifier: GPL-2.0-only 2 + /* 3 + * MADV_POPULATE_READ and MADV_POPULATE_WRITE tests 4 + * 5 + * Copyright 2021, Red Hat, Inc. 6 + * 7 + * Author(s): David Hildenbrand <david@redhat.com> 8 + */ 9 + #define _GNU_SOURCE 10 + #include <stdlib.h> 11 + #include <string.h> 12 + #include <stdbool.h> 13 + #include <stdint.h> 14 + #include <unistd.h> 15 + #include <errno.h> 16 + #include <fcntl.h> 17 + #include <sys/mman.h> 18 + 19 + #include "../kselftest.h" 20 + 21 + #if defined(MADV_POPULATE_READ) && defined(MADV_POPULATE_WRITE) 22 + 23 + /* 24 + * For now, we're using 2 MiB of private anonymous memory for all tests. 25 + */ 26 + #define SIZE (2 * 1024 * 1024) 27 + 28 + static size_t pagesize; 29 + 30 + static uint64_t pagemap_get_entry(int fd, char *start) 31 + { 32 + const unsigned long pfn = (unsigned long)start / pagesize; 33 + uint64_t entry; 34 + int ret; 35 + 36 + ret = pread(fd, &entry, sizeof(entry), pfn * sizeof(entry)); 37 + if (ret != sizeof(entry)) 38 + ksft_exit_fail_msg("reading pagemap failed\n"); 39 + return entry; 40 + } 41 + 42 + static bool pagemap_is_populated(int fd, char *start) 43 + { 44 + uint64_t entry = pagemap_get_entry(fd, start); 45 + 46 + /* Present or swapped. 
*/ 47 + return entry & 0xc000000000000000ull; 48 + } 49 + 50 + static bool pagemap_is_softdirty(int fd, char *start) 51 + { 52 + uint64_t entry = pagemap_get_entry(fd, start); 53 + 54 + return entry & 0x0080000000000000ull; 55 + } 56 + 57 + static void sense_support(void) 58 + { 59 + char *addr; 60 + int ret; 61 + 62 + addr = mmap(0, pagesize, PROT_READ | PROT_WRITE, 63 + MAP_ANONYMOUS | MAP_PRIVATE, 0, 0); 64 + if (!addr) 65 + ksft_exit_fail_msg("mmap failed\n"); 66 + 67 + ret = madvise(addr, pagesize, MADV_POPULATE_READ); 68 + if (ret) 69 + ksft_exit_skip("MADV_POPULATE_READ is not available\n"); 70 + 71 + ret = madvise(addr, pagesize, MADV_POPULATE_WRITE); 72 + if (ret) 73 + ksft_exit_skip("MADV_POPULATE_WRITE is not available\n"); 74 + 75 + munmap(addr, pagesize); 76 + } 77 + 78 + static void test_prot_read(void) 79 + { 80 + char *addr; 81 + int ret; 82 + 83 + ksft_print_msg("[RUN] %s\n", __func__); 84 + 85 + addr = mmap(0, SIZE, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, 0, 0); 86 + if (addr == MAP_FAILED) 87 + ksft_exit_fail_msg("mmap failed\n"); 88 + 89 + ret = madvise(addr, SIZE, MADV_POPULATE_READ); 90 + ksft_test_result(!ret, "MADV_POPULATE_READ with PROT_READ\n"); 91 + 92 + ret = madvise(addr, SIZE, MADV_POPULATE_WRITE); 93 + ksft_test_result(ret == -1 && errno == EINVAL, 94 + "MADV_POPULATE_WRITE with PROT_READ\n"); 95 + 96 + munmap(addr, SIZE); 97 + } 98 + 99 + static void test_prot_write(void) 100 + { 101 + char *addr; 102 + int ret; 103 + 104 + ksft_print_msg("[RUN] %s\n", __func__); 105 + 106 + addr = mmap(0, SIZE, PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, 0, 0); 107 + if (addr == MAP_FAILED) 108 + ksft_exit_fail_msg("mmap failed\n"); 109 + 110 + ret = madvise(addr, SIZE, MADV_POPULATE_READ); 111 + ksft_test_result(ret == -1 && errno == EINVAL, 112 + "MADV_POPULATE_READ with PROT_WRITE\n"); 113 + 114 + ret = madvise(addr, SIZE, MADV_POPULATE_WRITE); 115 + ksft_test_result(!ret, "MADV_POPULATE_WRITE with PROT_WRITE\n"); 116 + 117 + munmap(addr, SIZE); 
118 + } 119 + 120 + static void test_holes(void) 121 + { 122 + char *addr; 123 + int ret; 124 + 125 + ksft_print_msg("[RUN] %s\n", __func__); 126 + 127 + addr = mmap(0, SIZE, PROT_READ | PROT_WRITE, 128 + MAP_ANONYMOUS | MAP_PRIVATE, 0, 0); 129 + if (addr == MAP_FAILED) 130 + ksft_exit_fail_msg("mmap failed\n"); 131 + ret = munmap(addr + pagesize, pagesize); 132 + if (ret) 133 + ksft_exit_fail_msg("munmap failed\n"); 134 + 135 + /* Hole in the middle */ 136 + ret = madvise(addr, SIZE, MADV_POPULATE_READ); 137 + ksft_test_result(ret == -1 && errno == ENOMEM, 138 + "MADV_POPULATE_READ with holes in the middle\n"); 139 + ret = madvise(addr, SIZE, MADV_POPULATE_WRITE); 140 + ksft_test_result(ret == -1 && errno == ENOMEM, 141 + "MADV_POPULATE_WRITE with holes in the middle\n"); 142 + 143 + /* Hole at end */ 144 + ret = madvise(addr, 2 * pagesize, MADV_POPULATE_READ); 145 + ksft_test_result(ret == -1 && errno == ENOMEM, 146 + "MADV_POPULATE_READ with holes at the end\n"); 147 + ret = madvise(addr, 2 * pagesize, MADV_POPULATE_WRITE); 148 + ksft_test_result(ret == -1 && errno == ENOMEM, 149 + "MADV_POPULATE_WRITE with holes at the end\n"); 150 + 151 + /* Hole at beginning */ 152 + ret = madvise(addr + pagesize, pagesize, MADV_POPULATE_READ); 153 + ksft_test_result(ret == -1 && errno == ENOMEM, 154 + "MADV_POPULATE_READ with holes at the beginning\n"); 155 + ret = madvise(addr + pagesize, pagesize, MADV_POPULATE_WRITE); 156 + ksft_test_result(ret == -1 && errno == ENOMEM, 157 + "MADV_POPULATE_WRITE with holes at the beginning\n"); 158 + 159 + munmap(addr, SIZE); 160 + } 161 + 162 + static bool range_is_populated(char *start, ssize_t size) 163 + { 164 + int fd = open("/proc/self/pagemap", O_RDONLY); 165 + bool ret = true; 166 + 167 + if (fd < 0) 168 + ksft_exit_fail_msg("opening pagemap failed\n"); 169 + for (; size > 0 && ret; size -= pagesize, start += pagesize) 170 + if (!pagemap_is_populated(fd, start)) 171 + ret = false; 172 + close(fd); 173 + return ret; 174 + } 175 + 
176 + static bool range_is_not_populated(char *start, ssize_t size) 177 + { 178 + int fd = open("/proc/self/pagemap", O_RDONLY); 179 + bool ret = true; 180 + 181 + if (fd < 0) 182 + ksft_exit_fail_msg("opening pagemap failed\n"); 183 + for (; size > 0 && ret; size -= pagesize, start += pagesize) 184 + if (pagemap_is_populated(fd, start)) 185 + ret = false; 186 + close(fd); 187 + return ret; 188 + } 189 + 190 + static void test_populate_read(void) 191 + { 192 + char *addr; 193 + int ret; 194 + 195 + ksft_print_msg("[RUN] %s\n", __func__); 196 + 197 + addr = mmap(0, SIZE, PROT_READ | PROT_WRITE, 198 + MAP_ANONYMOUS | MAP_PRIVATE, 0, 0); 199 + if (addr == MAP_FAILED) 200 + ksft_exit_fail_msg("mmap failed\n"); 201 + ksft_test_result(range_is_not_populated(addr, SIZE), 202 + "range initially not populated\n"); 203 + 204 + ret = madvise(addr, SIZE, MADV_POPULATE_READ); 205 + ksft_test_result(!ret, "MADV_POPULATE_READ\n"); 206 + ksft_test_result(range_is_populated(addr, SIZE), 207 + "range is populated\n"); 208 + 209 + munmap(addr, SIZE); 210 + } 211 + 212 + static void test_populate_write(void) 213 + { 214 + char *addr; 215 + int ret; 216 + 217 + ksft_print_msg("[RUN] %s\n", __func__); 218 + 219 + addr = mmap(0, SIZE, PROT_READ | PROT_WRITE, 220 + MAP_ANONYMOUS | MAP_PRIVATE, 0, 0); 221 + if (addr == MAP_FAILED) 222 + ksft_exit_fail_msg("mmap failed\n"); 223 + ksft_test_result(range_is_not_populated(addr, SIZE), 224 + "range initially not populated\n"); 225 + 226 + ret = madvise(addr, SIZE, MADV_POPULATE_WRITE); 227 + ksft_test_result(!ret, "MADV_POPULATE_WRITE\n"); 228 + ksft_test_result(range_is_populated(addr, SIZE), 229 + "range is populated\n"); 230 + 231 + munmap(addr, SIZE); 232 + } 233 + 234 + static bool range_is_softdirty(char *start, ssize_t size) 235 + { 236 + int fd = open("/proc/self/pagemap", O_RDONLY); 237 + bool ret = true; 238 + 239 + if (fd < 0) 240 + ksft_exit_fail_msg("opening pagemap failed\n"); 241 + for (; size > 0 && ret; size -= pagesize, start 
+= pagesize) 242 + if (!pagemap_is_softdirty(fd, start)) 243 + ret = false; 244 + close(fd); 245 + return ret; 246 + } 247 + 248 + static bool range_is_not_softdirty(char *start, ssize_t size) 249 + { 250 + int fd = open("/proc/self/pagemap", O_RDONLY); 251 + bool ret = true; 252 + 253 + if (fd < 0) 254 + ksft_exit_fail_msg("opening pagemap failed\n"); 255 + for (; size > 0 && ret; size -= pagesize, start += pagesize) 256 + if (pagemap_is_softdirty(fd, start)) 257 + ret = false; 258 + close(fd); 259 + return ret; 260 + } 261 + 262 + static void clear_softdirty(void) 263 + { 264 + int fd = open("/proc/self/clear_refs", O_WRONLY); 265 + const char *ctrl = "4"; 266 + int ret; 267 + 268 + if (fd < 0) 269 + ksft_exit_fail_msg("opening clear_refs failed\n"); 270 + ret = write(fd, ctrl, strlen(ctrl)); 271 + if (ret != strlen(ctrl)) 272 + ksft_exit_fail_msg("writing clear_refs failed\n"); 273 + close(fd); 274 + } 275 + 276 + static void test_softdirty(void) 277 + { 278 + char *addr; 279 + int ret; 280 + 281 + ksft_print_msg("[RUN] %s\n", __func__); 282 + 283 + addr = mmap(0, SIZE, PROT_READ | PROT_WRITE, 284 + MAP_ANONYMOUS | MAP_PRIVATE, 0, 0); 285 + if (addr == MAP_FAILED) 286 + ksft_exit_fail_msg("mmap failed\n"); 287 + 288 + /* Clear any softdirty bits. */ 289 + clear_softdirty(); 290 + ksft_test_result(range_is_not_softdirty(addr, SIZE), 291 + "range is not softdirty\n"); 292 + 293 + /* Populating READ should set softdirty. */ 294 + ret = madvise(addr, SIZE, MADV_POPULATE_READ); 295 + ksft_test_result(!ret, "MADV_POPULATE_READ\n"); 296 + ksft_test_result(range_is_not_softdirty(addr, SIZE), 297 + "range is not softdirty\n"); 298 + 299 + /* Populating WRITE should set softdirty. 
*/ 300 + ret = madvise(addr, SIZE, MADV_POPULATE_WRITE); 301 + ksft_test_result(!ret, "MADV_POPULATE_WRITE\n"); 302 + ksft_test_result(range_is_softdirty(addr, SIZE), 303 + "range is softdirty\n"); 304 + 305 + munmap(addr, SIZE); 306 + } 307 + 308 + int main(int argc, char **argv) 309 + { 310 + int err; 311 + 312 + pagesize = getpagesize(); 313 + 314 + ksft_print_header(); 315 + ksft_set_plan(21); 316 + 317 + sense_support(); 318 + test_prot_read(); 319 + test_prot_write(); 320 + test_holes(); 321 + test_populate_read(); 322 + test_populate_write(); 323 + test_softdirty(); 324 + 325 + err = ksft_get_fail_cnt(); 326 + if (err) 327 + ksft_exit_fail_msg("%d out of %d tests failed\n", 328 + err, ksft_test_num()); 329 + return ksft_exit_pass(); 330 + } 331 + 332 + #else /* defined(MADV_POPULATE_READ) && defined(MADV_POPULATE_WRITE) */ 333 + 334 + #warning "missing MADV_POPULATE_READ or MADV_POPULATE_WRITE definition" 335 + 336 + int main(int argc, char **argv) 337 + { 338 + ksft_print_header(); 339 + ksft_exit_skip("MADV_POPULATE_READ or MADV_POPULATE_WRITE not defined\n"); 340 + } 341 + 342 + #endif /* defined(MADV_POPULATE_READ) && defined(MADV_POPULATE_WRITE) */
+1
tools/testing/selftests/vm/pkey-x86.h
··· 126 126 127 127 #define XSTATE_PKEY_BIT (9) 128 128 #define XSTATE_PKEY 0x200 129 + #define XSTATE_BV_OFFSET 512 129 130 130 131 int pkey_reg_xstate_offset(void) 131 132 {
+83 -2
tools/testing/selftests/vm/protection_keys.c
··· 510 510 " shadow: 0x%016llx\n", 511 511 __func__, __LINE__, ret, __read_pkey_reg(), 512 512 shadow_pkey_reg); 513 - if (ret) { 513 + if (ret > 0) { 514 514 /* clear both the bits: */ 515 515 shadow_pkey_reg = set_pkey_bits(shadow_pkey_reg, ret, 516 516 ~PKEY_MASK); ··· 561 561 int nr_alloced = 0; 562 562 int random_index; 563 563 memset(alloced_pkeys, 0, sizeof(alloced_pkeys)); 564 - srand((unsigned int)time(NULL)); 565 564 566 565 /* allocate every possible key and make a note of which ones we got */ 567 566 max_nr_pkey_allocs = NR_PKEYS; ··· 1277 1278 } 1278 1279 } 1279 1280 1281 + void arch_force_pkey_reg_init(void) 1282 + { 1283 + #if defined(__i386__) || defined(__x86_64__) /* arch */ 1284 + u64 *buf; 1285 + 1286 + /* 1287 + * All keys should be allocated and set to allow reads and 1288 + * writes, so the register should be all 0. If not, just 1289 + * skip the test. 1290 + */ 1291 + if (read_pkey_reg()) 1292 + return; 1293 + 1294 + /* 1295 + * Just allocate an absurd amount of memory rather than 1296 + * doing the XSAVE size enumeration dance. 1297 + */ 1298 + buf = mmap(NULL, 1*MB, PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0); 1299 + 1300 + /* These __builtins require compiling with -mxsave */ 1301 + 1302 + /* XSAVE to build a valid buffer: */ 1303 + __builtin_ia32_xsave(buf, XSTATE_PKEY); 1304 + /* Clear XSTATE_BV[PKRU]: */ 1305 + buf[XSTATE_BV_OFFSET/sizeof(u64)] &= ~XSTATE_PKEY; 1306 + /* XRSTOR will likely get PKRU back to the init state: */ 1307 + __builtin_ia32_xrstor(buf, XSTATE_PKEY); 1308 + 1309 + munmap(buf, 1*MB); 1310 + #endif 1311 + } 1312 + 1313 + 1314 + /* 1315 + * This is mostly useless on ppc for now. But it will not 1316 + * hurt anything and should give some better coverage as 1317 + * a long-running test that continually checks the pkey 1318 + * register. 
1319 + */ 1320 + void test_pkey_init_state(int *ptr, u16 pkey) 1321 + { 1322 + int err; 1323 + int allocated_pkeys[NR_PKEYS] = {0}; 1324 + int nr_allocated_pkeys = 0; 1325 + int i; 1326 + 1327 + for (i = 0; i < NR_PKEYS; i++) { 1328 + int new_pkey = alloc_pkey(); 1329 + 1330 + if (new_pkey < 0) 1331 + continue; 1332 + allocated_pkeys[nr_allocated_pkeys++] = new_pkey; 1333 + } 1334 + 1335 + dprintf3("%s()::%d\n", __func__, __LINE__); 1336 + 1337 + arch_force_pkey_reg_init(); 1338 + 1339 + /* 1340 + * Loop for a bit, hoping to exercise the kernel 1341 + * context switch code. 1342 + */ 1343 + for (i = 0; i < 1000000; i++) 1344 + read_pkey_reg(); 1345 + 1346 + for (i = 0; i < nr_allocated_pkeys; i++) { 1347 + err = sys_pkey_free(allocated_pkeys[i]); 1348 + pkey_assert(!err); 1349 + read_pkey_reg(); /* for shadow checking */ 1350 + } 1351 + } 1352 + 1280 1353 /* 1281 1354 * pkey 0 is special. It is allocated by default, so you do not 1282 1355 * have to call pkey_alloc() to use it first. Make sure that it ··· 1520 1449 ret = mprotect(p1, PAGE_SIZE, PROT_EXEC); 1521 1450 pkey_assert(!ret); 1522 1451 1452 + /* 1453 + * Reset the shadow, assuming that the above mprotect() 1454 + * correctly changed PKRU, but to an unknown value since 1455 + * the actual allocated pkey is unknown. 1456 + */ 1457 + shadow_pkey_reg = __read_pkey_reg(); 1458 + 1523 1459 dprintf2("pkey_reg: %016llx\n", read_pkey_reg()); 1524 1460 1525 1461 /* Make sure this is an *instruction* fault */ ··· 1580 1502 test_implicit_mprotect_exec_only_memory, 1581 1503 test_mprotect_with_pkey_0, 1582 1504 test_ptrace_of_child, 1505 + test_pkey_init_state, 1583 1506 test_pkey_syscalls_on_non_allocated_pkey, 1584 1507 test_pkey_syscalls_bad_args, 1585 1508 test_pkey_alloc_exhaust, ··· 1630 1551 { 1631 1552 int nr_iterations = 22; 1632 1553 int pkeys_supported = is_pkeys_supported(); 1554 + 1555 + srand((unsigned int)time(NULL)); 1633 1556 1634 1557 setup_handlers(); 1635 1558
+16
tools/testing/selftests/vm/run_vmtests.sh
··· 346 346 exitcode=1 347 347 fi 348 348 349 + echo "--------------------------------------------------------" 350 + echo "running MADV_POPULATE_READ and MADV_POPULATE_WRITE tests" 351 + echo "--------------------------------------------------------" 352 + ./madv_populate 353 + ret_val=$? 354 + 355 + if [ $ret_val -eq 0 ]; then 356 + echo "[PASS]" 357 + elif [ $ret_val -eq $ksft_skip ]; then 358 + echo "[SKIP]" 359 + exitcode=$ksft_skip 360 + else 361 + echo "[FAIL]" 362 + exitcode=1 363 + fi 364 + 349 365 exit $exitcode
+521 -537
tools/testing/selftests/vm/userfaultfd.c
··· 85 85 static bool test_uffdio_minor = false; 86 86 87 87 static bool map_shared; 88 + static int shm_fd; 88 89 static int huge_fd; 89 90 static char *huge_fd_off0; 90 91 static unsigned long long *count_verify; 91 - static int uffd, uffd_flags, finished, *pipefd; 92 + static int uffd = -1; 93 + static int uffd_flags, finished, *pipefd; 92 94 static char *area_src, *area_src_alias, *area_dst, *area_dst_alias; 93 95 static char *zeropage; 94 96 pthread_attr_t attr; ··· 142 140 exit(1); 143 141 } 144 142 145 - #define uffd_error(code, fmt, ...) \ 146 - do { \ 147 - fprintf(stderr, fmt, ##__VA_ARGS__); \ 148 - fprintf(stderr, ": %" PRId64 "\n", (int64_t)(code)); \ 149 - exit(1); \ 143 + #define _err(fmt, ...) \ 144 + do { \ 145 + int ret = errno; \ 146 + fprintf(stderr, "ERROR: " fmt, ##__VA_ARGS__); \ 147 + fprintf(stderr, " (errno=%d, line=%d)\n", \ 148 + ret, __LINE__); \ 149 + } while (0) 150 + 151 + #define err(fmt, ...) \ 152 + do { \ 153 + _err(fmt, ##__VA_ARGS__); \ 154 + exit(1); \ 150 155 } while (0) 151 156 152 157 static void uffd_stats_reset(struct uffd_stats *uffd_stats, ··· 180 171 minor_total += stats[i].minor_faults; 181 172 } 182 173 183 - printf("userfaults: %llu missing (", miss_total); 184 - for (i = 0; i < n_cpus; i++) 185 - printf("%lu+", stats[i].missing_faults); 186 - printf("\b), %llu wp (", wp_total); 187 - for (i = 0; i < n_cpus; i++) 188 - printf("%lu+", stats[i].wp_faults); 189 - printf("\b), %llu minor (", minor_total); 190 - for (i = 0; i < n_cpus; i++) 191 - printf("%lu+", stats[i].minor_faults); 192 - printf("\b)\n"); 174 + printf("userfaults: "); 175 + if (miss_total) { 176 + printf("%llu missing (", miss_total); 177 + for (i = 0; i < n_cpus; i++) 178 + printf("%lu+", stats[i].missing_faults); 179 + printf("\b) "); 180 + } 181 + if (wp_total) { 182 + printf("%llu wp (", wp_total); 183 + for (i = 0; i < n_cpus; i++) 184 + printf("%lu+", stats[i].wp_faults); 185 + printf("\b) "); 186 + } 187 + if (minor_total) { 188 + printf("%llu 
minor (", minor_total); 189 + for (i = 0; i < n_cpus; i++) 190 + printf("%lu+", stats[i].minor_faults); 191 + printf("\b)"); 192 + } 193 + printf("\n"); 193 194 } 194 195 195 - static int anon_release_pages(char *rel_area) 196 + static void anon_release_pages(char *rel_area) 196 197 { 197 - int ret = 0; 198 - 199 - if (madvise(rel_area, nr_pages * page_size, MADV_DONTNEED)) { 200 - perror("madvise"); 201 - ret = 1; 202 - } 203 - 204 - return ret; 198 + if (madvise(rel_area, nr_pages * page_size, MADV_DONTNEED)) 199 + err("madvise(MADV_DONTNEED) failed"); 205 200 } 206 201 207 202 static void anon_allocate_area(void **alloc_area) 208 203 { 209 - if (posix_memalign(alloc_area, page_size, nr_pages * page_size)) { 210 - fprintf(stderr, "out of memory\n"); 211 - *alloc_area = NULL; 212 - } 204 + if (posix_memalign(alloc_area, page_size, nr_pages * page_size)) 205 + err("posix_memalign() failed"); 213 206 } 214 207 215 208 static void noop_alias_mapping(__u64 *start, size_t len, unsigned long offset) 216 209 { 217 210 } 218 211 219 - /* HugeTLB memory */ 220 - static int hugetlb_release_pages(char *rel_area) 212 + static void hugetlb_release_pages(char *rel_area) 221 213 { 222 - int ret = 0; 223 - 224 214 if (fallocate(huge_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 225 - rel_area == huge_fd_off0 ? 0 : 226 - nr_pages * page_size, 227 - nr_pages * page_size)) { 228 - perror("fallocate"); 229 - ret = 1; 230 - } 231 - 232 - return ret; 215 + rel_area == huge_fd_off0 ? 0 : nr_pages * page_size, 216 + nr_pages * page_size)) 217 + err("fallocate() failed"); 233 218 } 234 219 235 220 static void hugetlb_allocate_area(void **alloc_area) ··· 236 233 MAP_HUGETLB, 237 234 huge_fd, *alloc_area == area_src ? 
0 : 238 235 nr_pages * page_size); 239 - if (*alloc_area == MAP_FAILED) { 240 - perror("mmap of hugetlbfs file failed"); 241 - goto fail; 242 - } 236 + if (*alloc_area == MAP_FAILED) 237 + err("mmap of hugetlbfs file failed"); 243 238 244 239 if (map_shared) { 245 240 area_alias = mmap(NULL, nr_pages * page_size, PROT_READ | PROT_WRITE, 246 241 MAP_SHARED | MAP_HUGETLB, 247 242 huge_fd, *alloc_area == area_src ? 0 : 248 243 nr_pages * page_size); 249 - if (area_alias == MAP_FAILED) { 250 - perror("mmap of hugetlb file alias failed"); 251 - goto fail_munmap; 252 - } 244 + if (area_alias == MAP_FAILED) 245 + err("mmap of hugetlb file alias failed"); 253 246 } 254 247 255 248 if (*alloc_area == area_src) { ··· 256 257 } 257 258 if (area_alias) 258 259 *alloc_area_alias = area_alias; 259 - 260 - return; 261 - 262 - fail_munmap: 263 - if (munmap(*alloc_area, nr_pages * page_size) < 0) { 264 - perror("hugetlb munmap"); 265 - exit(1); 266 - } 267 - fail: 268 - *alloc_area = NULL; 269 260 } 270 261 271 262 static void hugetlb_alias_mapping(__u64 *start, size_t len, unsigned long offset) ··· 271 282 *start = (unsigned long) area_dst_alias + offset; 272 283 } 273 284 274 - /* Shared memory */ 275 - static int shmem_release_pages(char *rel_area) 285 + static void shmem_release_pages(char *rel_area) 276 286 { 277 - int ret = 0; 278 - 279 - if (madvise(rel_area, nr_pages * page_size, MADV_REMOVE)) { 280 - perror("madvise"); 281 - ret = 1; 282 - } 283 - 284 - return ret; 287 + if (madvise(rel_area, nr_pages * page_size, MADV_REMOVE)) 288 + err("madvise(MADV_REMOVE) failed"); 285 289 } 286 290 287 291 static void shmem_allocate_area(void **alloc_area) 288 292 { 293 + void *area_alias = NULL; 294 + bool is_src = alloc_area == (void **)&area_src; 295 + unsigned long offset = is_src ? 
0 : nr_pages * page_size; 296 + 289 297 *alloc_area = mmap(NULL, nr_pages * page_size, PROT_READ | PROT_WRITE, 290 - MAP_ANONYMOUS | MAP_SHARED, -1, 0); 291 - if (*alloc_area == MAP_FAILED) { 292 - fprintf(stderr, "shared memory mmap failed\n"); 293 - *alloc_area = NULL; 294 - } 298 + MAP_SHARED, shm_fd, offset); 299 + if (*alloc_area == MAP_FAILED) 300 + err("mmap of memfd failed"); 301 + 302 + area_alias = mmap(NULL, nr_pages * page_size, PROT_READ | PROT_WRITE, 303 + MAP_SHARED, shm_fd, offset); 304 + if (area_alias == MAP_FAILED) 305 + err("mmap of memfd alias failed"); 306 + 307 + if (is_src) 308 + area_src_alias = area_alias; 309 + else 310 + area_dst_alias = area_alias; 311 + } 312 + 313 + static void shmem_alias_mapping(__u64 *start, size_t len, unsigned long offset) 314 + { 315 + *start = (unsigned long)area_dst_alias + offset; 295 316 } 296 317 297 318 struct uffd_test_ops { 298 319 unsigned long expected_ioctls; 299 320 void (*allocate_area)(void **alloc_area); 300 - int (*release_pages)(char *rel_area); 321 + void (*release_pages)(char *rel_area); 301 322 void (*alias_mapping)(__u64 *start, size_t len, unsigned long offset); 302 323 }; 303 324 ··· 331 332 .expected_ioctls = SHMEM_EXPECTED_IOCTLS, 332 333 .allocate_area = shmem_allocate_area, 333 334 .release_pages = shmem_release_pages, 334 - .alias_mapping = noop_alias_mapping, 335 + .alias_mapping = shmem_alias_mapping, 335 336 }; 336 337 337 338 static struct uffd_test_ops hugetlb_uffd_test_ops = { ··· 342 343 }; 343 344 344 345 static struct uffd_test_ops *uffd_test_ops; 346 + 347 + static void userfaultfd_open(uint64_t *features) 348 + { 349 + struct uffdio_api uffdio_api; 350 + 351 + uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY); 352 + if (uffd < 0) 353 + err("userfaultfd syscall not available in this kernel"); 354 + uffd_flags = fcntl(uffd, F_GETFD, NULL); 355 + 356 + uffdio_api.api = UFFD_API; 357 + uffdio_api.features = *features; 358 + if (ioctl(uffd, 
UFFDIO_API, &uffdio_api)) 359 + err("UFFDIO_API failed.\nPlease make sure to " 360 + "run with either root or ptrace capability."); 361 + if (uffdio_api.api != UFFD_API) 362 + err("UFFDIO_API error: %" PRIu64, (uint64_t)uffdio_api.api); 363 + 364 + *features = uffdio_api.features; 365 + } 366 + 367 + static inline void munmap_area(void **area) 368 + { 369 + if (*area) 370 + if (munmap(*area, nr_pages * page_size)) 371 + err("munmap"); 372 + 373 + *area = NULL; 374 + } 375 + 376 + static void uffd_test_ctx_clear(void) 377 + { 378 + size_t i; 379 + 380 + if (pipefd) { 381 + for (i = 0; i < nr_cpus * 2; ++i) { 382 + if (close(pipefd[i])) 383 + err("close pipefd"); 384 + } 385 + free(pipefd); 386 + pipefd = NULL; 387 + } 388 + 389 + if (count_verify) { 390 + free(count_verify); 391 + count_verify = NULL; 392 + } 393 + 394 + if (uffd != -1) { 395 + if (close(uffd)) 396 + err("close uffd"); 397 + uffd = -1; 398 + } 399 + 400 + huge_fd_off0 = NULL; 401 + munmap_area((void **)&area_src); 402 + munmap_area((void **)&area_src_alias); 403 + munmap_area((void **)&area_dst); 404 + munmap_area((void **)&area_dst_alias); 405 + } 406 + 407 + static void uffd_test_ctx_init_ext(uint64_t *features) 408 + { 409 + unsigned long nr, cpu; 410 + 411 + uffd_test_ctx_clear(); 412 + 413 + uffd_test_ops->allocate_area((void **)&area_src); 414 + uffd_test_ops->allocate_area((void **)&area_dst); 415 + 416 + uffd_test_ops->release_pages(area_src); 417 + uffd_test_ops->release_pages(area_dst); 418 + 419 + userfaultfd_open(features); 420 + 421 + count_verify = malloc(nr_pages * sizeof(unsigned long long)); 422 + if (!count_verify) 423 + err("count_verify"); 424 + 425 + for (nr = 0; nr < nr_pages; nr++) { 426 + *area_mutex(area_src, nr) = 427 + (pthread_mutex_t)PTHREAD_MUTEX_INITIALIZER; 428 + count_verify[nr] = *area_count(area_src, nr) = 1; 429 + /* 430 + * In the transition between 255 to 256, powerpc will 431 + * read out of order in my_bcmp and see both bytes as 432 + * zero, so leave a 
placeholder below always non-zero 433 + * after the count, to avoid my_bcmp to trigger false 434 + * positives. 435 + */ 436 + *(area_count(area_src, nr) + 1) = 1; 437 + } 438 + 439 + pipefd = malloc(sizeof(int) * nr_cpus * 2); 440 + if (!pipefd) 441 + err("pipefd"); 442 + for (cpu = 0; cpu < nr_cpus; cpu++) 443 + if (pipe2(&pipefd[cpu * 2], O_CLOEXEC | O_NONBLOCK)) 444 + err("pipe"); 445 + } 446 + 447 + static inline void uffd_test_ctx_init(uint64_t features) 448 + { 449 + uffd_test_ctx_init_ext(&features); 450 + } 345 451 346 452 static int my_bcmp(char *str1, char *str2, size_t n) 347 453 { ··· 467 363 /* Undo write-protect, do wakeup after that */ 468 364 prms.mode = wp ? UFFDIO_WRITEPROTECT_MODE_WP : 0; 469 365 470 - if (ioctl(ufd, UFFDIO_WRITEPROTECT, &prms)) { 471 - fprintf(stderr, "clear WP failed for address 0x%" PRIx64 "\n", 472 - (uint64_t)start); 473 - exit(1); 474 - } 366 + if (ioctl(ufd, UFFDIO_WRITEPROTECT, &prms)) 367 + err("clear WP failed: address=0x%"PRIx64, (uint64_t)start); 475 368 } 476 369 477 370 static void continue_range(int ufd, __u64 start, __u64 len) 478 371 { 479 372 struct uffdio_continue req; 373 + int ret; 480 374 481 375 req.range.start = start; 482 376 req.range.len = len; 483 377 req.mode = 0; 484 378 485 - if (ioctl(ufd, UFFDIO_CONTINUE, &req)) { 486 - fprintf(stderr, 487 - "UFFDIO_CONTINUE failed for address 0x%" PRIx64 "\n", 488 - (uint64_t)start); 489 - exit(1); 490 - } 379 + if (ioctl(ufd, UFFDIO_CONTINUE, &req)) 380 + err("UFFDIO_CONTINUE failed for address 0x%" PRIx64, 381 + (uint64_t)start); 382 + 383 + /* 384 + * Error handling within the kernel for continue is subtly different 385 + * from copy or zeropage, so it may be a source of bugs. Trigger an 386 + * error (-EEXIST) on purpose, to verify doing so doesn't cause a BUG. 
387 + */ 388 + req.mapped = 0; 389 + ret = ioctl(ufd, UFFDIO_CONTINUE, &req); 390 + if (ret >= 0 || req.mapped != -EEXIST) 391 + err("failed to exercise UFFDIO_CONTINUE error handling, ret=%d, mapped=%" PRId64, 392 + ret, (int64_t) req.mapped); 491 393 } 492 394 493 395 static void *locking_thread(void *arg) ··· 505 395 unsigned long long count; 506 396 char randstate[64]; 507 397 unsigned int seed; 508 - time_t start; 509 398 510 399 if (bounces & BOUNCE_RANDOM) { 511 400 seed = (unsigned int) time(NULL) - bounces; ··· 512 403 seed += cpu; 513 404 bzero(&rand, sizeof(rand)); 514 405 bzero(&randstate, sizeof(randstate)); 515 - if (initstate_r(seed, randstate, sizeof(randstate), &rand)) { 516 - fprintf(stderr, "srandom_r error\n"); 517 - exit(1); 518 - } 406 + if (initstate_r(seed, randstate, sizeof(randstate), &rand)) 407 + err("initstate_r failed"); 519 408 } else { 520 409 page_nr = -bounces; 521 410 if (!(bounces & BOUNCE_RACINGFAULTS)) ··· 522 415 523 416 while (!finished) { 524 417 if (bounces & BOUNCE_RANDOM) { 525 - if (random_r(&rand, &rand_nr)) { 526 - fprintf(stderr, "random_r 1 error\n"); 527 - exit(1); 528 - } 418 + if (random_r(&rand, &rand_nr)) 419 + err("random_r failed"); 529 420 page_nr = rand_nr; 530 421 if (sizeof(page_nr) > sizeof(rand_nr)) { 531 - if (random_r(&rand, &rand_nr)) { 532 - fprintf(stderr, "random_r 2 error\n"); 533 - exit(1); 534 - } 422 + if (random_r(&rand, &rand_nr)) 423 + err("random_r failed"); 535 424 page_nr |= (((unsigned long) rand_nr) << 16) << 536 425 16; 537 426 } 538 427 } else 539 428 page_nr += 1; 540 429 page_nr %= nr_pages; 541 - 542 - start = time(NULL); 543 - if (bounces & BOUNCE_VERIFY) { 544 - count = *area_count(area_dst, page_nr); 545 - if (!count) { 546 - fprintf(stderr, 547 - "page_nr %lu wrong count %Lu %Lu\n", 548 - page_nr, count, 549 - count_verify[page_nr]); 550 - exit(1); 551 - } 552 - 553 - 554 - /* 555 - * We can't use bcmp (or memcmp) because that 556 - * returns 0 erroneously if the memory is 557 
- * changing under it (even if the end of the 558 - * page is never changing and always 559 - * different). 560 - */ 561 - #if 1 562 - if (!my_bcmp(area_dst + page_nr * page_size, zeropage, 563 - page_size)) { 564 - fprintf(stderr, 565 - "my_bcmp page_nr %lu wrong count %Lu %Lu\n", 566 - page_nr, count, count_verify[page_nr]); 567 - exit(1); 568 - } 569 - #else 570 - unsigned long loops; 571 - 572 - loops = 0; 573 - /* uncomment the below line to test with mutex */ 574 - /* pthread_mutex_lock(area_mutex(area_dst, page_nr)); */ 575 - while (!bcmp(area_dst + page_nr * page_size, zeropage, 576 - page_size)) { 577 - loops += 1; 578 - if (loops > 10) 579 - break; 580 - } 581 - /* uncomment below line to test with mutex */ 582 - /* pthread_mutex_unlock(area_mutex(area_dst, page_nr)); */ 583 - if (loops) { 584 - fprintf(stderr, 585 - "page_nr %lu all zero thread %lu %p %lu\n", 586 - page_nr, cpu, area_dst + page_nr * page_size, 587 - loops); 588 - if (loops > 10) 589 - exit(1); 590 - } 591 - #endif 592 - } 593 - 594 430 pthread_mutex_lock(area_mutex(area_dst, page_nr)); 595 431 count = *area_count(area_dst, page_nr); 596 - if (count != count_verify[page_nr]) { 597 - fprintf(stderr, 598 - "page_nr %lu memory corruption %Lu %Lu\n", 599 - page_nr, count, 600 - count_verify[page_nr]); exit(1); 601 - } 432 + if (count != count_verify[page_nr]) 433 + err("page_nr %lu memory corruption %llu %llu", 434 + page_nr, count, count_verify[page_nr]); 602 435 count++; 603 436 *area_count(area_dst, page_nr) = count_verify[page_nr] = count; 604 437 pthread_mutex_unlock(area_mutex(area_dst, page_nr)); 605 - 606 - if (time(NULL) - start > 1) 607 - fprintf(stderr, 608 - "userfault too slow %ld " 609 - "possible false positive with overcommit\n", 610 - time(NULL) - start); 611 438 } 612 439 613 440 return NULL; ··· 555 514 offset); 556 515 if (ioctl(ufd, UFFDIO_COPY, uffdio_copy)) { 557 516 /* real retval in ufdio_copy.copy */ 558 - if (uffdio_copy->copy != -EEXIST) { 559 - 
uffd_error(uffdio_copy->copy, 560 - "UFFDIO_COPY retry error"); 561 - } 562 - } else 563 - uffd_error(uffdio_copy->copy, "UFFDIO_COPY retry unexpected"); 517 + if (uffdio_copy->copy != -EEXIST) 518 + err("UFFDIO_COPY retry error: %"PRId64, 519 + (int64_t)uffdio_copy->copy); 520 + } else { 521 + err("UFFDIO_COPY retry unexpected: %"PRId64, 522 + (int64_t)uffdio_copy->copy); 523 + } 564 524 } 565 525 566 526 static int __copy_page(int ufd, unsigned long offset, bool retry) 567 527 { 568 528 struct uffdio_copy uffdio_copy; 569 529 570 - if (offset >= nr_pages * page_size) { 571 - fprintf(stderr, "unexpected offset %lu\n", offset); 572 - exit(1); 573 - } 530 + if (offset >= nr_pages * page_size) 531 + err("unexpected offset %lu\n", offset); 574 532 uffdio_copy.dst = (unsigned long) area_dst + offset; 575 533 uffdio_copy.src = (unsigned long) area_src + offset; 576 534 uffdio_copy.len = page_size; ··· 581 541 if (ioctl(ufd, UFFDIO_COPY, &uffdio_copy)) { 582 542 /* real retval in ufdio_copy.copy */ 583 543 if (uffdio_copy.copy != -EEXIST) 584 - uffd_error(uffdio_copy.copy, "UFFDIO_COPY error"); 544 + err("UFFDIO_COPY error: %"PRId64, 545 + (int64_t)uffdio_copy.copy); 585 546 } else if (uffdio_copy.copy != page_size) { 586 - uffd_error(uffdio_copy.copy, "UFFDIO_COPY unexpected copy"); 547 + err("UFFDIO_COPY error: %"PRId64, (int64_t)uffdio_copy.copy); 587 548 } else { 588 549 if (test_uffdio_copy_eexist && retry) { 589 550 test_uffdio_copy_eexist = false; ··· 613 572 if (ret < 0) { 614 573 if (errno == EAGAIN) 615 574 return 1; 616 - perror("blocking read error"); 575 + err("blocking read error"); 617 576 } else { 618 - fprintf(stderr, "short read\n"); 577 + err("short read"); 619 578 } 620 - exit(1); 621 579 } 622 580 623 581 return 0; ··· 627 587 { 628 588 unsigned long offset; 629 589 630 - if (msg->event != UFFD_EVENT_PAGEFAULT) { 631 - fprintf(stderr, "unexpected msg event %u\n", msg->event); 632 - exit(1); 633 - } 590 + if (msg->event != UFFD_EVENT_PAGEFAULT) 591 + 
err("unexpected msg event %u", msg->event); 634 592 635 593 if (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP) { 636 594 /* Write protect page faults */ ··· 659 621 stats->minor_faults++; 660 622 } else { 661 623 /* Missing page faults */ 662 - if (bounces & BOUNCE_VERIFY && 663 - msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WRITE) { 664 - fprintf(stderr, "unexpected write fault\n"); 665 - exit(1); 666 - } 624 + if (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WRITE) 625 + err("unexpected write fault"); 667 626 668 627 offset = (char *)(unsigned long)msg->arg.pagefault.address - area_dst; 669 628 offset &= ~(page_size-1); ··· 687 652 688 653 for (;;) { 689 654 ret = poll(pollfd, 2, -1); 690 - if (!ret) { 691 - fprintf(stderr, "poll error %d\n", ret); 692 - exit(1); 693 - } 694 - if (ret < 0) { 695 - perror("poll"); 696 - exit(1); 697 - } 655 + if (ret <= 0) 656 + err("poll error: %d", ret); 698 657 if (pollfd[1].revents & POLLIN) { 699 - if (read(pollfd[1].fd, &tmp_chr, 1) != 1) { 700 - fprintf(stderr, "read pipefd error\n"); 701 - exit(1); 702 - } 658 + if (read(pollfd[1].fd, &tmp_chr, 1) != 1) 659 + err("read pipefd error"); 703 660 break; 704 661 } 705 - if (!(pollfd[0].revents & POLLIN)) { 706 - fprintf(stderr, "pollfd[0].revents %d\n", 707 - pollfd[0].revents); 708 - exit(1); 709 - } 662 + if (!(pollfd[0].revents & POLLIN)) 663 + err("pollfd[0].revents %d", pollfd[0].revents); 710 664 if (uffd_read_msg(uffd, &msg)) 711 665 continue; 712 666 switch (msg.event) { 713 667 default: 714 - fprintf(stderr, "unexpected msg event %u\n", 715 - msg.event); exit(1); 668 + err("unexpected msg event %u\n", msg.event); 716 669 break; 717 670 case UFFD_EVENT_PAGEFAULT: 718 671 uffd_handle_page_fault(&msg, stats); ··· 714 691 uffd_reg.range.start = msg.arg.remove.start; 715 692 uffd_reg.range.len = msg.arg.remove.end - 716 693 msg.arg.remove.start; 717 - if (ioctl(uffd, UFFDIO_UNREGISTER, &uffd_reg.range)) { 718 - fprintf(stderr, "remove failure\n"); 719 - exit(1); 720 
- } 694 + if (ioctl(uffd, UFFDIO_UNREGISTER, &uffd_reg.range)) 695 + err("remove failure"); 721 696 break; 722 697 case UFFD_EVENT_REMAP: 723 698 area_dst = (char *)(unsigned long)msg.arg.remap.to; ··· 818 797 * UFFDIO_COPY without writing zero pages into area_dst 819 798 * because the background threads already completed). 820 799 */ 821 - if (uffd_test_ops->release_pages(area_src)) 822 - return 1; 823 - 800 + uffd_test_ops->release_pages(area_src); 824 801 825 802 finished = 1; 826 803 for (cpu = 0; cpu < nr_cpus; cpu++) ··· 828 809 for (cpu = 0; cpu < nr_cpus; cpu++) { 829 810 char c; 830 811 if (bounces & BOUNCE_POLL) { 831 - if (write(pipefd[cpu*2+1], &c, 1) != 1) { 832 - fprintf(stderr, "pipefd write error\n"); 833 - return 1; 834 - } 812 + if (write(pipefd[cpu*2+1], &c, 1) != 1) 813 + err("pipefd write error"); 835 814 if (pthread_join(uffd_threads[cpu], 836 815 (void *)&uffd_stats[cpu])) 837 816 return 1; ··· 842 825 } 843 826 844 827 return 0; 845 - } 846 - 847 - static int userfaultfd_open_ext(uint64_t *features) 848 - { 849 - struct uffdio_api uffdio_api; 850 - 851 - uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK); 852 - if (uffd < 0) { 853 - fprintf(stderr, 854 - "userfaultfd syscall not available in this kernel\n"); 855 - return 1; 856 - } 857 - uffd_flags = fcntl(uffd, F_GETFD, NULL); 858 - 859 - uffdio_api.api = UFFD_API; 860 - uffdio_api.features = *features; 861 - if (ioctl(uffd, UFFDIO_API, &uffdio_api)) { 862 - fprintf(stderr, "UFFDIO_API failed.\nPlease make sure to " 863 - "run with either root or ptrace capability.\n"); 864 - return 1; 865 - } 866 - if (uffdio_api.api != UFFD_API) { 867 - fprintf(stderr, "UFFDIO_API error: %" PRIu64 "\n", 868 - (uint64_t)uffdio_api.api); 869 - return 1; 870 - } 871 - 872 - *features = uffdio_api.features; 873 - return 0; 874 - } 875 - 876 - static int userfaultfd_open(uint64_t features) 877 - { 878 - return userfaultfd_open_ext(&features); 879 828 } 880 829 881 830 sigjmp_buf jbuf, *sigbuf; ··· 895 
912 memset(&act, 0, sizeof(act)); 896 913 act.sa_sigaction = sighndl; 897 914 act.sa_flags = SA_SIGINFO; 898 - if (sigaction(SIGBUS, &act, 0)) { 899 - perror("sigaction"); 900 - return 1; 901 - } 915 + if (sigaction(SIGBUS, &act, 0)) 916 + err("sigaction"); 902 917 lastnr = (unsigned long)-1; 903 918 } 904 919 ··· 906 925 907 926 if (signal_test) { 908 927 if (sigsetjmp(*sigbuf, 1) != 0) { 909 - if (steps == 1 && nr == lastnr) { 910 - fprintf(stderr, "Signal repeated\n"); 911 - return 1; 912 - } 928 + if (steps == 1 && nr == lastnr) 929 + err("Signal repeated"); 913 930 914 931 lastnr = nr; 915 932 if (signal_test == 1) { ··· 932 953 } 933 954 934 955 count = *area_count(area_dst, nr); 935 - if (count != count_verify[nr]) { 936 - fprintf(stderr, 937 - "nr %lu memory corruption %Lu %Lu\n", 938 - nr, count, 939 - count_verify[nr]); 940 - } 956 + if (count != count_verify[nr]) 957 + err("nr %lu memory corruption %llu %llu\n", 958 + nr, count, count_verify[nr]); 941 959 /* 942 960 * Trigger write protection if there is by writing 943 961 * the same value back. 
··· 950 974 951 975 area_dst = mremap(area_dst, nr_pages * page_size, nr_pages * page_size, 952 976 MREMAP_MAYMOVE | MREMAP_FIXED, area_src); 953 - if (area_dst == MAP_FAILED) { 954 - perror("mremap"); 955 - exit(1); 956 - } 977 + if (area_dst == MAP_FAILED) 978 + err("mremap"); 979 + /* Reset area_src since we just clobbered it */ 980 + area_src = NULL; 957 981 958 982 for (; nr < nr_pages; nr++) { 959 983 count = *area_count(area_dst, nr); 960 984 if (count != count_verify[nr]) { 961 - fprintf(stderr, 962 - "nr %lu memory corruption %Lu %Lu\n", 963 - nr, count, 964 - count_verify[nr]); exit(1); 985 + err("nr %lu memory corruption %llu %llu\n", 986 + nr, count, count_verify[nr]); 965 987 } 966 988 /* 967 989 * Trigger write protection if there is by writing ··· 968 994 *area_count(area_dst, nr) = count; 969 995 } 970 996 971 - if (uffd_test_ops->release_pages(area_dst)) 972 - return 1; 997 + uffd_test_ops->release_pages(area_dst); 973 998 974 - for (nr = 0; nr < nr_pages; nr++) { 975 - if (my_bcmp(area_dst + nr * page_size, zeropage, page_size)) { 976 - fprintf(stderr, "nr %lu is not zero\n", nr); 977 - exit(1); 978 - } 979 - } 999 + for (nr = 0; nr < nr_pages; nr++) 1000 + if (my_bcmp(area_dst + nr * page_size, zeropage, page_size)) 1001 + err("nr %lu is not zero", nr); 980 1002 981 1003 return 0; 982 1004 } ··· 985 1015 uffdio_zeropage->range.len, 986 1016 offset); 987 1017 if (ioctl(ufd, UFFDIO_ZEROPAGE, uffdio_zeropage)) { 988 - if (uffdio_zeropage->zeropage != -EEXIST) { 989 - uffd_error(uffdio_zeropage->zeropage, 990 - "UFFDIO_ZEROPAGE retry error"); 991 - } 1018 + if (uffdio_zeropage->zeropage != -EEXIST) 1019 + err("UFFDIO_ZEROPAGE error: %"PRId64, 1020 + (int64_t)uffdio_zeropage->zeropage); 992 1021 } else { 993 - uffd_error(uffdio_zeropage->zeropage, 994 - "UFFDIO_ZEROPAGE retry unexpected"); 1022 + err("UFFDIO_ZEROPAGE error: %"PRId64, 1023 + (int64_t)uffdio_zeropage->zeropage); 995 1024 } 996 1025 } 997 1026 ··· 1003 1034 1004 1035 has_zeropage = 
uffd_test_ops->expected_ioctls & (1 << _UFFDIO_ZEROPAGE); 1005 1036 1006 - if (offset >= nr_pages * page_size) { 1007 - fprintf(stderr, "unexpected offset %lu\n", offset); 1008 - exit(1); 1009 - } 1037 + if (offset >= nr_pages * page_size) 1038 + err("unexpected offset %lu", offset); 1010 1039 uffdio_zeropage.range.start = (unsigned long) area_dst + offset; 1011 1040 uffdio_zeropage.range.len = page_size; 1012 1041 uffdio_zeropage.mode = 0; ··· 1012 1045 res = uffdio_zeropage.zeropage; 1013 1046 if (ret) { 1014 1047 /* real retval in ufdio_zeropage.zeropage */ 1015 - if (has_zeropage) { 1016 - uffd_error(res, "UFFDIO_ZEROPAGE %s", 1017 - res == -EEXIST ? "-EEXIST" : "error"); 1018 - } else if (res != -EINVAL) 1019 - uffd_error(res, "UFFDIO_ZEROPAGE not -EINVAL"); 1048 + if (has_zeropage) 1049 + err("UFFDIO_ZEROPAGE error: %"PRId64, (int64_t)res); 1050 + else if (res != -EINVAL) 1051 + err("UFFDIO_ZEROPAGE not -EINVAL"); 1020 1052 } else if (has_zeropage) { 1021 1053 if (res != page_size) { 1022 - uffd_error(res, "UFFDIO_ZEROPAGE unexpected"); 1054 + err("UFFDIO_ZEROPAGE unexpected size"); 1023 1055 } else { 1024 1056 if (test_uffdio_zeropage_eexist && retry) { 1025 1057 test_uffdio_zeropage_eexist = false; ··· 1028 1062 return 1; 1029 1063 } 1030 1064 } else 1031 - uffd_error(res, "UFFDIO_ZEROPAGE succeeded"); 1065 + err("UFFDIO_ZEROPAGE succeeded"); 1032 1066 1033 1067 return 0; 1034 1068 } ··· 1047 1081 printf("testing UFFDIO_ZEROPAGE: "); 1048 1082 fflush(stdout); 1049 1083 1050 - if (uffd_test_ops->release_pages(area_dst)) 1051 - return 1; 1084 + uffd_test_ctx_init(0); 1052 1085 1053 - if (userfaultfd_open(0)) 1054 - return 1; 1055 1086 uffdio_register.range.start = (unsigned long) area_dst; 1056 1087 uffdio_register.range.len = nr_pages * page_size; 1057 1088 uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING; 1058 1089 if (test_uffdio_wp) 1059 1090 uffdio_register.mode |= UFFDIO_REGISTER_MODE_WP; 1060 - if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) { 
1061 - fprintf(stderr, "register failure\n"); 1062 - exit(1); 1063 - } 1091 + if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) 1092 + err("register failure"); 1064 1093 1065 1094 expected_ioctls = uffd_test_ops->expected_ioctls; 1066 - if ((uffdio_register.ioctls & expected_ioctls) != 1067 - expected_ioctls) { 1068 - fprintf(stderr, 1069 - "unexpected missing ioctl for anon memory\n"); 1070 - exit(1); 1071 - } 1095 + if ((uffdio_register.ioctls & expected_ioctls) != expected_ioctls) 1096 + err("unexpected missing ioctl for anon memory"); 1072 1097 1073 - if (uffdio_zeropage(uffd, 0)) { 1074 - if (my_bcmp(area_dst, zeropage, page_size)) { 1075 - fprintf(stderr, "zeropage is not zero\n"); 1076 - exit(1); 1077 - } 1078 - } 1098 + if (uffdio_zeropage(uffd, 0)) 1099 + if (my_bcmp(area_dst, zeropage, page_size)) 1100 + err("zeropage is not zero"); 1079 1101 1080 - close(uffd); 1081 1102 printf("done.\n"); 1082 1103 return 0; 1083 1104 } ··· 1082 1129 printf("testing events (fork, remap, remove): "); 1083 1130 fflush(stdout); 1084 1131 1085 - if (uffd_test_ops->release_pages(area_dst)) 1086 - return 1; 1087 - 1088 1132 features = UFFD_FEATURE_EVENT_FORK | UFFD_FEATURE_EVENT_REMAP | 1089 1133 UFFD_FEATURE_EVENT_REMOVE; 1090 - if (userfaultfd_open(features)) 1091 - return 1; 1134 + uffd_test_ctx_init(features); 1135 + 1092 1136 fcntl(uffd, F_SETFL, uffd_flags | O_NONBLOCK); 1093 1137 1094 1138 uffdio_register.range.start = (unsigned long) area_dst; ··· 1093 1143 uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING; 1094 1144 if (test_uffdio_wp) 1095 1145 uffdio_register.mode |= UFFDIO_REGISTER_MODE_WP; 1096 - if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) { 1097 - fprintf(stderr, "register failure\n"); 1098 - exit(1); 1099 - } 1146 + if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) 1147 + err("register failure"); 1100 1148 1101 1149 expected_ioctls = uffd_test_ops->expected_ioctls; 1102 - if ((uffdio_register.ioctls & expected_ioctls) != expected_ioctls) { 1103 - 
fprintf(stderr, "unexpected missing ioctl for anon memory\n"); 1104 - exit(1); 1105 - } 1150 + if ((uffdio_register.ioctls & expected_ioctls) != expected_ioctls) 1151 + err("unexpected missing ioctl for anon memory"); 1106 1152 1107 - if (pthread_create(&uffd_mon, &attr, uffd_poll_thread, &stats)) { 1108 - perror("uffd_poll_thread create"); 1109 - exit(1); 1110 - } 1153 + if (pthread_create(&uffd_mon, &attr, uffd_poll_thread, &stats)) 1154 + err("uffd_poll_thread create"); 1111 1155 1112 1156 pid = fork(); 1113 - if (pid < 0) { 1114 - perror("fork"); 1115 - exit(1); 1116 - } 1157 + if (pid < 0) 1158 + err("fork"); 1117 1159 1118 1160 if (!pid) 1119 1161 exit(faulting_process(0)); 1120 1162 1121 1163 waitpid(pid, &err, 0); 1122 - if (err) { 1123 - fprintf(stderr, "faulting process failed\n"); 1124 - exit(1); 1125 - } 1126 - 1127 - if (write(pipefd[1], &c, sizeof(c)) != sizeof(c)) { 1128 - perror("pipe write"); 1129 - exit(1); 1130 - } 1164 + if (err) 1165 + err("faulting process failed"); 1166 + if (write(pipefd[1], &c, sizeof(c)) != sizeof(c)) 1167 + err("pipe write"); 1131 1168 if (pthread_join(uffd_mon, NULL)) 1132 1169 return 1; 1133 - 1134 - close(uffd); 1135 1170 1136 1171 uffd_stats_report(&stats, 1); 1137 1172 ··· 1137 1202 printf("testing signal delivery: "); 1138 1203 fflush(stdout); 1139 1204 1140 - if (uffd_test_ops->release_pages(area_dst)) 1141 - return 1; 1142 - 1143 1205 features = UFFD_FEATURE_EVENT_FORK|UFFD_FEATURE_SIGBUS; 1144 - if (userfaultfd_open(features)) 1145 - return 1; 1206 + uffd_test_ctx_init(features); 1207 + 1146 1208 fcntl(uffd, F_SETFL, uffd_flags | O_NONBLOCK); 1147 1209 1148 1210 uffdio_register.range.start = (unsigned long) area_dst; ··· 1147 1215 uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING; 1148 1216 if (test_uffdio_wp) 1149 1217 uffdio_register.mode |= UFFDIO_REGISTER_MODE_WP; 1150 - if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) { 1151 - fprintf(stderr, "register failure\n"); 1152 - exit(1); 1153 - } 1218 + if 
(ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) 1219 + err("register failure"); 1154 1220 1155 1221 expected_ioctls = uffd_test_ops->expected_ioctls; 1156 - if ((uffdio_register.ioctls & expected_ioctls) != expected_ioctls) { 1157 - fprintf(stderr, "unexpected missing ioctl for anon memory\n"); 1158 - exit(1); 1159 - } 1222 + if ((uffdio_register.ioctls & expected_ioctls) != expected_ioctls) 1223 + err("unexpected missing ioctl for anon memory"); 1160 1224 1161 - if (faulting_process(1)) { 1162 - fprintf(stderr, "faulting process failed\n"); 1163 - exit(1); 1164 - } 1225 + if (faulting_process(1)) 1226 + err("faulting process failed"); 1165 1227 1166 - if (uffd_test_ops->release_pages(area_dst)) 1167 - return 1; 1228 + uffd_test_ops->release_pages(area_dst); 1168 1229 1169 - if (pthread_create(&uffd_mon, &attr, uffd_poll_thread, &stats)) { 1170 - perror("uffd_poll_thread create"); 1171 - exit(1); 1172 - } 1230 + if (pthread_create(&uffd_mon, &attr, uffd_poll_thread, &stats)) 1231 + err("uffd_poll_thread create"); 1173 1232 1174 1233 pid = fork(); 1175 - if (pid < 0) { 1176 - perror("fork"); 1177 - exit(1); 1178 - } 1234 + if (pid < 0) 1235 + err("fork"); 1179 1236 1180 1237 if (!pid) 1181 1238 exit(faulting_process(2)); 1182 1239 1183 1240 waitpid(pid, &err, 0); 1184 - if (err) { 1185 - fprintf(stderr, "faulting process failed\n"); 1186 - exit(1); 1187 - } 1188 - 1189 - if (write(pipefd[1], &c, sizeof(c)) != sizeof(c)) { 1190 - perror("pipe write"); 1191 - exit(1); 1192 - } 1241 + if (err) 1242 + err("faulting process failed"); 1243 + if (write(pipefd[1], &c, sizeof(c)) != sizeof(c)) 1244 + err("pipe write"); 1193 1245 if (pthread_join(uffd_mon, (void **)&userfaults)) 1194 1246 return 1; 1195 1247 1196 1248 printf("done.\n"); 1197 1249 if (userfaults) 1198 - fprintf(stderr, "Signal test failed, userfaults: %ld\n", 1199 - userfaults); 1200 - close(uffd); 1250 + err("Signal test failed, userfaults: %ld", userfaults); 1251 + 1201 1252 return userfaults != 0; 1202 
1253 } 1203 1254 ··· 1194 1279 void *expected_page; 1195 1280 char c; 1196 1281 struct uffd_stats stats = { 0 }; 1197 - uint64_t features = UFFD_FEATURE_MINOR_HUGETLBFS; 1282 + uint64_t req_features, features_out; 1198 1283 1199 1284 if (!test_uffdio_minor) 1200 1285 return 0; ··· 1202 1287 printf("testing minor faults: "); 1203 1288 fflush(stdout); 1204 1289 1205 - if (uffd_test_ops->release_pages(area_dst)) 1290 + if (test_type == TEST_HUGETLB) 1291 + req_features = UFFD_FEATURE_MINOR_HUGETLBFS; 1292 + else if (test_type == TEST_SHMEM) 1293 + req_features = UFFD_FEATURE_MINOR_SHMEM; 1294 + else 1206 1295 return 1; 1207 1296 1208 - if (userfaultfd_open_ext(&features)) 1209 - return 1; 1210 - /* If kernel reports the feature isn't supported, skip the test. */ 1211 - if (!(features & UFFD_FEATURE_MINOR_HUGETLBFS)) { 1297 + features_out = req_features; 1298 + uffd_test_ctx_init_ext(&features_out); 1299 + /* If kernel reports required features aren't supported, skip test. */ 1300 + if ((features_out & req_features) != req_features) { 1212 1301 printf("skipping test due to lack of feature support\n"); 1213 1302 fflush(stdout); 1214 1303 return 0; ··· 1221 1302 uffdio_register.range.start = (unsigned long)area_dst_alias; 1222 1303 uffdio_register.range.len = nr_pages * page_size; 1223 1304 uffdio_register.mode = UFFDIO_REGISTER_MODE_MINOR; 1224 - if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) { 1225 - fprintf(stderr, "register failure\n"); 1226 - exit(1); 1227 - } 1305 + if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) 1306 + err("register failure"); 1228 1307 1229 1308 expected_ioctls = uffd_test_ops->expected_ioctls; 1230 1309 expected_ioctls |= 1 << _UFFDIO_CONTINUE; 1231 - if ((uffdio_register.ioctls & expected_ioctls) != expected_ioctls) { 1232 - fprintf(stderr, "unexpected missing ioctl(s)\n"); 1233 - exit(1); 1234 - } 1310 + if ((uffdio_register.ioctls & expected_ioctls) != expected_ioctls) 1311 + err("unexpected missing ioctl(s)"); 1235 1312 1236 1313 /* 
1237 1314 * After registering with UFFD, populate the non-UFFD-registered side of ··· 1238 1323 page_size); 1239 1324 } 1240 1325 1241 - if (pthread_create(&uffd_mon, &attr, uffd_poll_thread, &stats)) { 1242 - perror("uffd_poll_thread create"); 1243 - exit(1); 1244 - } 1326 + if (pthread_create(&uffd_mon, &attr, uffd_poll_thread, &stats)) 1327 + err("uffd_poll_thread create"); 1245 1328 1246 1329 /* 1247 1330 * Read each of the pages back using the UFFD-registered mapping. We ··· 1248 1335 * page's contents, and then issuing a CONTINUE ioctl. 1249 1336 */ 1250 1337 1251 - if (posix_memalign(&expected_page, page_size, page_size)) { 1252 - fprintf(stderr, "out of memory\n"); 1253 - return 1; 1254 - } 1338 + if (posix_memalign(&expected_page, page_size, page_size)) 1339 + err("out of memory"); 1255 1340 1256 1341 for (p = 0; p < nr_pages; ++p) { 1257 1342 expected_byte = ~((uint8_t)(p % ((uint8_t)-1))); 1258 1343 memset(expected_page, expected_byte, page_size); 1259 1344 if (my_bcmp(expected_page, area_dst_alias + (p * page_size), 1260 - page_size)) { 1261 - fprintf(stderr, 1262 - "unexpected page contents after minor fault\n"); 1263 - exit(1); 1264 - } 1345 + page_size)) 1346 + err("unexpected page contents after minor fault"); 1265 1347 } 1266 1348 1267 - if (write(pipefd[1], &c, sizeof(c)) != sizeof(c)) { 1268 - perror("pipe write"); 1269 - exit(1); 1270 - } 1349 + if (write(pipefd[1], &c, sizeof(c)) != sizeof(c)) 1350 + err("pipe write"); 1271 1351 if (pthread_join(uffd_mon, NULL)) 1272 1352 return 1; 1273 - 1274 - close(uffd); 1275 1353 1276 1354 uffd_stats_report(&stats, 1); 1277 1355 1278 1356 return stats.missing_faults != 0 || stats.minor_faults != nr_pages; 1357 + } 1358 + 1359 + #define BIT_ULL(nr) (1ULL << (nr)) 1360 + #define PM_SOFT_DIRTY BIT_ULL(55) 1361 + #define PM_MMAP_EXCLUSIVE BIT_ULL(56) 1362 + #define PM_UFFD_WP BIT_ULL(57) 1363 + #define PM_FILE BIT_ULL(61) 1364 + #define PM_SWAP BIT_ULL(62) 1365 + #define PM_PRESENT BIT_ULL(63) 1366 + 1367 + 
static int pagemap_open(void)
1368 + {
1369 + 	int fd = open("/proc/self/pagemap", O_RDONLY);
1370 +
1371 + 	if (fd < 0)
1372 + 		err("open pagemap");
1373 +
1374 + 	return fd;
1375 + }
1376 +
1377 + static uint64_t pagemap_read_vaddr(int fd, void *vaddr)
1378 + {
1379 + 	uint64_t value;
1380 + 	int ret;
1381 +
1382 + 	ret = pread(fd, &value, sizeof(uint64_t),
1383 + 		    ((uint64_t)vaddr >> 12) * sizeof(uint64_t));
1384 + 	if (ret != sizeof(uint64_t))
1385 + 		err("pread() on pagemap failed");
1386 +
1387 + 	return value;
1388 + }
1389 +
1390 + /* This macro lets __LINE__ work in err() */
1391 + #define pagemap_check_wp(value, wp) do { \
1392 + 	if (!!(value & PM_UFFD_WP) != wp) \
1393 + 		err("pagemap uffd-wp bit error: 0x%"PRIx64, value); \
1394 + } while (0)
1395 +
1396 + static int pagemap_test_fork(bool present)
1397 + {
1398 + 	pid_t child = fork();
1399 + 	uint64_t value;
1400 + 	int fd, result;
1401 +
1402 + 	if (!child) {
1403 + 		/* Open the pagemap fd of the child itself */
1404 + 		fd = pagemap_open();
1405 + 		value = pagemap_read_vaddr(fd, area_dst);
1406 + 		/*
1407 + 		 * After fork() the uffd-wp bit should be gone as long as we're
1408 + 		 * without UFFD_FEATURE_EVENT_FORK
1409 + 		 */
1410 + 		pagemap_check_wp(value, false);
1411 + 		/* Succeed */
1412 + 		exit(0);
1413 + 	}
1414 + 	waitpid(child, &result, 0);
1415 + 	return result;
1416 + }
1417 +
1418 + static void userfaultfd_pagemap_test(unsigned int test_pgsize)
1419 + {
1420 + 	struct uffdio_register uffdio_register;
1421 + 	int pagemap_fd;
1422 + 	uint64_t value;
1423 +
1424 + 	/* Pagemap tests uffd-wp only */
1425 + 	if (!test_uffdio_wp)
1426 + 		return;
1427 +
1428 + 	/* Not enough memory to test this page size */
1429 + 	if (test_pgsize > nr_pages * page_size)
1430 + 		return;
1431 +
1432 + 	printf("testing uffd-wp with pagemap (pgsize=%u): ", test_pgsize);
1433 + 	/* Flush so it doesn't flush twice in parent/child later */
1434 + 	fflush(stdout);
1435 +
1436 + 	uffd_test_ctx_init(0);
1437 +
1438 + 	if (test_pgsize > page_size) {
1439 + 		/* This is a thp 
test */
1440 + 		if (madvise(area_dst, nr_pages * page_size, MADV_HUGEPAGE))
1441 + 			err("madvise(MADV_HUGEPAGE) failed");
1442 + 	} else if (test_pgsize == page_size) {
1443 + 		/* This is a normal page test; force no thp */
1444 + 		if (madvise(area_dst, nr_pages * page_size, MADV_NOHUGEPAGE))
1445 + 			err("madvise(MADV_NOHUGEPAGE) failed");
1446 + 	}
1447 +
1448 + 	uffdio_register.range.start = (unsigned long) area_dst;
1449 + 	uffdio_register.range.len = nr_pages * page_size;
1450 + 	uffdio_register.mode = UFFDIO_REGISTER_MODE_WP;
1451 + 	if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register))
1452 + 		err("register failed");
1453 +
1454 + 	pagemap_fd = pagemap_open();
1455 +
1456 + 	/* Touch the page */
1457 + 	*area_dst = 1;
1458 + 	wp_range(uffd, (uint64_t)area_dst, test_pgsize, true);
1459 + 	value = pagemap_read_vaddr(pagemap_fd, area_dst);
1460 + 	pagemap_check_wp(value, true);
1461 + 	/* Make sure the uffd-wp bit is dropped on fork */
1462 + 	if (pagemap_test_fork(true))
1463 + 		err("Detected stale uffd-wp bit in child");
1464 +
1465 + 	/* Exclusive required or PAGEOUT won't work */
1466 + 	if (!(value & PM_MMAP_EXCLUSIVE))
1467 + 		err("multiple mapping detected: 0x%"PRIx64, value);
1468 +
1469 + 	if (madvise(area_dst, test_pgsize, MADV_PAGEOUT))
1470 + 		err("madvise(MADV_PAGEOUT) failed");
1471 +
1472 + 	/* Uffd-wp should persist even when swapped out */
1473 + 	value = pagemap_read_vaddr(pagemap_fd, area_dst);
1474 + 	pagemap_check_wp(value, true);
1475 + 	/* Make sure the uffd-wp bit is dropped on fork */
1476 + 	if (pagemap_test_fork(false))
1477 + 		err("Detected stale uffd-wp bit in child");
1478 +
1479 + 	/* Unprotect; this tests swap pte modifications */
1480 + 	wp_range(uffd, (uint64_t)area_dst, page_size, false);
1481 + 	value = pagemap_read_vaddr(pagemap_fd, area_dst);
1482 + 	pagemap_check_wp(value, false);
1483 +
1484 + 	/* Fault the page back in from swap */
1485 + 	*area_dst = 2;
1486 + 	value = pagemap_read_vaddr(pagemap_fd, area_dst);
1487 + 	pagemap_check_wp(value, false);
1488 +
1489 + 	close(pagemap_fd); 
1490 + printf("done\n"); 1279 1491 } 1280 1492 1281 1493 static int userfaultfd_stress(void) ··· 1409 1371 char *tmp_area; 1410 1372 unsigned long nr; 1411 1373 struct uffdio_register uffdio_register; 1412 - unsigned long cpu; 1413 - int err; 1414 1374 struct uffd_stats uffd_stats[nr_cpus]; 1415 1375 1416 - uffd_test_ops->allocate_area((void **)&area_src); 1417 - if (!area_src) 1418 - return 1; 1419 - uffd_test_ops->allocate_area((void **)&area_dst); 1420 - if (!area_dst) 1421 - return 1; 1376 + uffd_test_ctx_init(0); 1422 1377 1423 - if (userfaultfd_open(0)) 1424 - return 1; 1425 - 1426 - count_verify = malloc(nr_pages * sizeof(unsigned long long)); 1427 - if (!count_verify) { 1428 - perror("count_verify"); 1429 - return 1; 1430 - } 1431 - 1432 - for (nr = 0; nr < nr_pages; nr++) { 1433 - *area_mutex(area_src, nr) = (pthread_mutex_t) 1434 - PTHREAD_MUTEX_INITIALIZER; 1435 - count_verify[nr] = *area_count(area_src, nr) = 1; 1436 - /* 1437 - * In the transition between 255 to 256, powerpc will 1438 - * read out of order in my_bcmp and see both bytes as 1439 - * zero, so leave a placeholder below always non-zero 1440 - * after the count, to avoid my_bcmp to trigger false 1441 - * positives. 
1442 - */ 1443 - *(area_count(area_src, nr) + 1) = 1; 1444 - } 1445 - 1446 - pipefd = malloc(sizeof(int) * nr_cpus * 2); 1447 - if (!pipefd) { 1448 - perror("pipefd"); 1449 - return 1; 1450 - } 1451 - for (cpu = 0; cpu < nr_cpus; cpu++) { 1452 - if (pipe2(&pipefd[cpu*2], O_CLOEXEC | O_NONBLOCK)) { 1453 - perror("pipe"); 1454 - return 1; 1455 - } 1456 - } 1457 - 1458 - if (posix_memalign(&area, page_size, page_size)) { 1459 - fprintf(stderr, "out of memory\n"); 1460 - return 1; 1461 - } 1378 + if (posix_memalign(&area, page_size, page_size)) 1379 + err("out of memory"); 1462 1380 zeropage = area; 1463 1381 bzero(zeropage, page_size); 1464 1382 ··· 1423 1429 pthread_attr_init(&attr); 1424 1430 pthread_attr_setstacksize(&attr, 16*1024*1024); 1425 1431 1426 - err = 0; 1427 1432 while (bounces--) { 1428 1433 unsigned long expected_ioctls; 1429 1434 ··· 1451 1458 uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING; 1452 1459 if (test_uffdio_wp) 1453 1460 uffdio_register.mode |= UFFDIO_REGISTER_MODE_WP; 1454 - if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) { 1455 - fprintf(stderr, "register failure\n"); 1456 - return 1; 1457 - } 1461 + if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) 1462 + err("register failure"); 1458 1463 expected_ioctls = uffd_test_ops->expected_ioctls; 1459 1464 if ((uffdio_register.ioctls & expected_ioctls) != 1460 - expected_ioctls) { 1461 - fprintf(stderr, 1462 - "unexpected missing ioctl for anon memory\n"); 1463 - return 1; 1464 - } 1465 + expected_ioctls) 1466 + err("unexpected missing ioctl for anon memory"); 1465 1467 1466 1468 if (area_dst_alias) { 1467 1469 uffdio_register.range.start = (unsigned long) 1468 1470 area_dst_alias; 1469 - if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) { 1470 - fprintf(stderr, "register failure alias\n"); 1471 - return 1; 1472 - } 1471 + if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) 1472 + err("register failure alias"); 1473 1473 } 1474 1474 1475 1475 /* ··· 1489 1503 * MADV_DONTNEED only 
after the UFFDIO_REGISTER, so it's 1490 1504 * required to MADV_DONTNEED here. 1491 1505 */ 1492 - if (uffd_test_ops->release_pages(area_dst)) 1493 - return 1; 1506 + uffd_test_ops->release_pages(area_dst); 1494 1507 1495 1508 uffd_stats_reset(uffd_stats, nr_cpus); 1496 1509 ··· 1503 1518 nr_pages * page_size, false); 1504 1519 1505 1520 /* unregister */ 1506 - if (ioctl(uffd, UFFDIO_UNREGISTER, &uffdio_register.range)) { 1507 - fprintf(stderr, "unregister failure\n"); 1508 - return 1; 1509 - } 1521 + if (ioctl(uffd, UFFDIO_UNREGISTER, &uffdio_register.range)) 1522 + err("unregister failure"); 1510 1523 if (area_dst_alias) { 1511 1524 uffdio_register.range.start = (unsigned long) area_dst; 1512 1525 if (ioctl(uffd, UFFDIO_UNREGISTER, 1513 - &uffdio_register.range)) { 1514 - fprintf(stderr, "unregister failure alias\n"); 1515 - return 1; 1516 - } 1526 + &uffdio_register.range)) 1527 + err("unregister failure alias"); 1517 1528 } 1518 1529 1519 1530 /* verification */ 1520 - if (bounces & BOUNCE_VERIFY) { 1521 - for (nr = 0; nr < nr_pages; nr++) { 1522 - if (*area_count(area_dst, nr) != count_verify[nr]) { 1523 - fprintf(stderr, 1524 - "error area_count %Lu %Lu %lu\n", 1525 - *area_count(area_src, nr), 1526 - count_verify[nr], 1527 - nr); 1528 - err = 1; 1529 - bounces = 0; 1530 - } 1531 - } 1532 - } 1531 + if (bounces & BOUNCE_VERIFY) 1532 + for (nr = 0; nr < nr_pages; nr++) 1533 + if (*area_count(area_dst, nr) != count_verify[nr]) 1534 + err("error area_count %llu %llu %lu\n", 1535 + *area_count(area_src, nr), 1536 + count_verify[nr], nr); 1533 1537 1534 1538 /* prepare next bounce */ 1535 1539 tmp_area = area_src; ··· 1532 1558 uffd_stats_report(uffd_stats, nr_cpus); 1533 1559 } 1534 1560 1535 - if (err) 1536 - return err; 1561 + if (test_type == TEST_ANON) { 1562 + /* 1563 + * shmem/hugetlb won't be able to run since they have different 1564 + * behavior on fork() (file-backed memory normally drops ptes 1565 + * directly when fork), meanwhile the pagemap test 
will verify 1566 + * pgtable entry of fork()ed child. 1567 + */ 1568 + userfaultfd_pagemap_test(page_size); 1569 + /* 1570 + * Hard-code for x86_64 for now for 2M THP, as x86_64 is 1571 + * currently the only one that supports uffd-wp 1572 + */ 1573 + userfaultfd_pagemap_test(page_size * 512); 1574 + } 1537 1575 1538 - close(uffd); 1539 1576 return userfaultfd_zeropage_test() || userfaultfd_sig_test() 1540 1577 || userfaultfd_events_test() || userfaultfd_minor_test(); 1541 1578 } ··· 1595 1610 map_shared = true; 1596 1611 test_type = TEST_SHMEM; 1597 1612 uffd_test_ops = &shmem_uffd_test_ops; 1613 + test_uffdio_minor = true; 1598 1614 } else { 1599 - fprintf(stderr, "Unknown test type: %s\n", type); exit(1); 1615 + err("Unknown test type: %s", type); 1600 1616 } 1601 1617 1602 1618 if (test_type == TEST_HUGETLB) ··· 1605 1619 else 1606 1620 page_size = sysconf(_SC_PAGE_SIZE); 1607 1621 1608 - if (!page_size) { 1609 - fprintf(stderr, "Unable to determine page size\n"); 1610 - exit(2); 1611 - } 1622 + if (!page_size) 1623 + err("Unable to determine page size"); 1612 1624 if ((unsigned long) area_count(NULL, 0) + sizeof(unsigned long long) * 2 1613 - > page_size) { 1614 - fprintf(stderr, "Impossible to run this test\n"); 1615 - exit(2); 1616 - } 1625 + > page_size) 1626 + err("Impossible to run this test"); 1617 1627 } 1618 1628 1619 1629 static void sigalrm(int sig) ··· 1626 1644 if (argc < 4) 1627 1645 usage(); 1628 1646 1629 - if (signal(SIGALRM, sigalrm) == SIG_ERR) { 1630 - fprintf(stderr, "failed to arm SIGALRM"); 1631 - exit(1); 1632 - } 1647 + if (signal(SIGALRM, sigalrm) == SIG_ERR) 1648 + err("failed to arm SIGALRM"); 1633 1649 alarm(ALARM_INTERVAL_SECS); 1634 1650 1635 1651 set_test_type(argv[1]); ··· 1636 1656 nr_pages_per_cpu = atol(argv[2]) * 1024*1024 / page_size / 1637 1657 nr_cpus; 1638 1658 if (!nr_pages_per_cpu) { 1639 - fprintf(stderr, "invalid MiB\n"); 1659 + _err("invalid MiB"); 1640 1660 usage(); 1641 1661 } 1642 1662 1643 1663 bounces = 
atoi(argv[3]); 1644 1664 if (bounces <= 0) { 1645 - fprintf(stderr, "invalid bounces\n"); 1665 + _err("invalid bounces"); 1646 1666 usage(); 1647 1667 } 1648 1668 nr_pages = nr_pages_per_cpu * nr_cpus; ··· 1651 1671 if (argc < 5) 1652 1672 usage(); 1653 1673 huge_fd = open(argv[4], O_CREAT | O_RDWR, 0755); 1654 - if (huge_fd < 0) { 1655 - fprintf(stderr, "Open of %s failed", argv[3]); 1656 - perror("open"); 1657 - exit(1); 1658 - } 1659 - if (ftruncate(huge_fd, 0)) { 1660 - fprintf(stderr, "ftruncate %s to size 0 failed", argv[3]); 1661 - perror("ftruncate"); 1662 - exit(1); 1663 - } 1674 + if (huge_fd < 0) 1675 + err("Open of %s failed", argv[4]); 1676 + if (ftruncate(huge_fd, 0)) 1677 + err("ftruncate %s to size 0 failed", argv[4]); 1678 + } else if (test_type == TEST_SHMEM) { 1679 + shm_fd = memfd_create(argv[0], 0); 1680 + if (shm_fd < 0) 1681 + err("memfd_create"); 1682 + if (ftruncate(shm_fd, nr_pages * page_size * 2)) 1683 + err("ftruncate"); 1684 + if (fallocate(shm_fd, 1685 + FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, 1686 + nr_pages * page_size * 2)) 1687 + err("fallocate"); 1664 1688 } 1665 1689 printf("nr_pages: %lu, nr_pages_per_cpu: %lu\n", 1666 1690 nr_pages, nr_pages_per_cpu);